langchain/libs/community/tests/unit_tests/document_loaders
Mr. Lance E Sloan «UMich» 84dc2dd059
community[patch]: Load YouTube transcripts (captions) as fixed-duration chunks with start times (#21710)
- **Description:** Add a new format, `CHUNKS`, to
`langchain_community.document_loaders.youtube.YoutubeLoader` which
creates multiple `Document` objects from YouTube video transcripts
(captions), each of a fixed duration. The metadata of each chunk
`Document` includes its start time and a URL to that time in
the video on the YouTube website.
  
I had implemented this for UMich (@umich-its-ai) in a local module, but
it makes sense to contribute it to the LangChain community for all to
benefit and to simplify maintenance.

- **Issue:** N/A
- **Dependencies:** N/A
- **Twitter:** lsloan_umich
- **Mastodon:**
[lsloan@mastodon.social](https://mastodon.social/@lsloan)

With regard to **tests and documentation**, most existing features of
the `YoutubeLoader` class are not tested. Only the
`YoutubeLoader.extract_video_id()` static method had a test. However,
while I was waiting for this PR to be reviewed and merged, I had time to
add a test for the chunking feature I've proposed in this PR.

I have added an example of using chunking to the
`docs/docs/integrations/document_loaders/youtube_transcript.ipynb`
notebook.
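
To illustrate the idea behind the `CHUNKS` format described above, here is a minimal, self-contained sketch of fixed-duration chunking. It is not the code from this PR: the `Chunk` class, the `chunk_transcript` and `_make_chunk` functions, the `start_seconds` and `source` metadata keys, and the input shape (dicts with `text` and `start` keys) are all assumptions made for illustration, and the sketch aligns each chunk to a multiple of the chunk size, which may differ from how `YoutubeLoader` decides chunk boundaries.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain `Document`."""

    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def _make_chunk(texts: List[str], start_seconds: int, video_id: str) -> Chunk:
    # The URL uses YouTube's `t=<seconds>s` query parameter to jump to a time.
    return Chunk(
        page_content=" ".join(texts),
        metadata={
            "start_seconds": start_seconds,
            "source": f"https://www.youtube.com/watch?v={video_id}&t={start_seconds}s",
        },
    )


def chunk_transcript(
    pieces: Iterable[Dict[str, object]],
    video_id: str,
    chunk_size_seconds: int = 120,
) -> Iterator[Chunk]:
    """Group caption pieces into fixed-duration windows.

    Assumes each piece is a dict with "text" (str) and "start" (seconds)
    keys, and that pieces arrive in chronological order.
    """
    texts: List[str] = []
    current_window = None  # index of the window being filled
    for piece in pieces:
        window = int(float(piece["start"]) // chunk_size_seconds)
        if current_window is not None and window != current_window:
            # The piece belongs to a later window: emit the finished chunk.
            yield _make_chunk(texts, current_window * chunk_size_seconds, video_id)
            texts = []
        current_window = window
        texts.append(str(piece["text"]).strip())
    if texts:
        # Emit the final, possibly shorter, chunk.
        yield _make_chunk(texts, current_window * chunk_size_seconds, video_id)


if __name__ == "__main__":
    # Made-up caption data; "VIDEO_ID" is a placeholder, not a real video.
    sample = [
        {"text": "Hello", "start": 0.0},
        {"text": "world", "start": 50.5},
        {"text": "again", "start": 125.0},
    ]
    for chunk in chunk_transcript(sample, "VIDEO_ID", chunk_size_seconds=60):
        print(chunk.metadata["start_seconds"], chunk.page_content)
```

Aligning windows to multiples of the chunk size keeps start times predictable, at the cost of skipping windows in which no caption begins.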

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
3 months ago

| Name | Last commit | Updated |
| --- | --- | --- |
| `blob_loaders` | community[minor]: Add CloudBlobLoader that supports loading data from cloud buckets (#21957) | 4 months ago |
| `loaders` | community[patch]: upgrade to recent version of mypy (#21616) | 4 months ago |
| `parsers` | Community[minor]: Add language parser for Elixir (#22742) | 3 months ago |
| `sample_documents` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_docs` | community: Fix CSVLoader columns is None (#20701) | 4 months ago |
| `__init__.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_airbyte.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_arcgis_loader.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_assemblyai.py` | Merge pull request #18421 | 7 months ago |
| `test_bibtex.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_bshtml.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_confluence.py` | Merge pull request #18436 | 7 months ago |
| `test_couchbase.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_csv_loader.py` | community: Fix CSVLoader columns is None (#20701) | 4 months ago |
| `test_cube_semantic.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_detect_encoding.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_directory.py` | community[minor]: Implement DirectoryLoader lazy_load function (#19537) | 6 months ago |
| `test_directory_loader.py` | community: Fix CSVLoader columns is None (#20701) | 4 months ago |
| `test_evernote_loader.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_generic_loader.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_git.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_github.py` | community[patch]: upgrade to recent version of mypy (#21616) | 4 months ago |
| `test_hugging_face.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_hugging_face_model.py` | community[minor]: add hugging_face_model document loader (#17323) | 7 months ago |
| `test_imports.py` | community[minor]: Add Scrapfly Loader community integration (#22036) | 4 months ago |
| `test_json_loader.py` | community[minor]: use jq schema for content_key in json_loader (#18003) | 7 months ago |
| `test_lakefs.py` | community[minor]: import fix (#20995) | 5 months ago |
| `test_mediawikidump.py` | infra: add print rule to ruff (#16221) | 8 months ago |
| `test_mhtml.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_mongodb.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_notebook.py` | community[patch]: add NotebookLoader unit test (#17721) | 6 months ago |
| `test_obsidian.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_onenote.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_oracleadb.py` | community[minor]: add oracle autonomous database doc loader integration (#19536) | 6 months ago |
| `test_pebblo.py` | community[minor]: Add support for Pebblo cloud_api_key in PebbloSafeLoader (#19855) | 6 months ago |
| `test_psychic.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_readthedoc.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_recursive_url_loader.py` | community[patch]: recursive url loader fix and unit tests (#22521) | 4 months ago |
| `test_rspace_loader.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_rss.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_trello.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_web_base.py` | community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) | 10 months ago |
| `test_youtube.py` | community[patch]: Load YouTube transcripts (captions) as fixed-duration chunks with start times (#21710) | 3 months ago |