mirror of
https://github.com/hwchase17/langchain
synced 2024-11-10 01:10:59 +00:00
84dc2dd059
- **Description:** Add a new format, `CHUNKS`, to `langchain_community.document_loaders.youtube.YoutubeLoader` which creates multiple `Document` objects from YouTube video transcripts (captions), each of a fixed duration. The metadata of each chunk `Document` includes the start time of each one and a URL to that time in the video on the YouTube website. I had implemented this for UMich (@umich-its-ai) in a local module, but it makes sense to contribute this to LangChain community for all to benefit and to simplify maintenance. - **Issue:** N/A - **Dependencies:** N/A - **Twitter:** lsloan_umich - **Mastodon:** [lsloan@mastodon.social](https://mastodon.social/@lsloan) With regards to **tests and documentation**, most existing features of the `YoutubeLoader` class are not tested. Only the `YoutubeLoader.extract_video_id()` static method had a test. However, while I was waiting for this PR to be reviewed and merged, I had time to add a test for the chunking feature I've proposed in this PR. I have added an example of using chunking to the `docs/docs/integrations/document_loaders/youtube_transcript.ipynb` notebook. --------- Co-authored-by: Bagatur <baskaryan@gmail.com> |
||
---|---|---|
.. | ||
blob_loaders | ||
loaders | ||
parsers | ||
sample_documents | ||
test_docs | ||
__init__.py | ||
test_airbyte.py | ||
test_arcgis_loader.py | ||
test_assemblyai.py | ||
test_bibtex.py | ||
test_bshtml.py | ||
test_confluence.py | ||
test_couchbase.py | ||
test_csv_loader.py | ||
test_cube_semantic.py | ||
test_detect_encoding.py | ||
test_directory_loader.py | ||
test_directory.py | ||
test_evernote_loader.py | ||
test_generic_loader.py | ||
test_git.py | ||
test_github.py | ||
test_hugging_face_model.py | ||
test_hugging_face.py | ||
test_imports.py | ||
test_json_loader.py | ||
test_lakefs.py | ||
test_mediawikidump.py | ||
test_mhtml.py | ||
test_mongodb.py | ||
test_notebook.py | ||
test_obsidian.py | ||
test_onenote.py | ||
test_oracleadb.py | ||
test_pebblo.py | ||
test_psychic.py | ||
test_readthedoc.py | ||
test_recursive_url_loader.py | ||
test_rspace_loader.py | ||
test_rss.py | ||
test_trello.py | ||
test_web_base.py | ||
test_youtube.py |