langchain/libs
Mr. Lance E Sloan «UMich» 84dc2dd059
community[patch]: Load YouTube transcripts (captions) as fixed-duration chunks with start times (#21710)
- **Description:** Add a new format, `CHUNKS`, to
`langchain_community.document_loaders.youtube.YoutubeLoader` which
creates multiple `Document` objects from YouTube video transcripts
(captions), each of a fixed duration. Each chunk `Document`'s metadata
includes the chunk's start time and a URL to that time in the video on
the YouTube website.
  
I had implemented this for UMich (@umich-its-ai) in a local module, but
it makes sense to contribute it to the LangChain community so that
everyone benefits and maintenance is simpler.

- **Issue:** N/A
- **Dependencies:** N/A
- **Twitter:** lsloan_umich
- **Mastodon:**
[lsloan@mastodon.social](https://mastodon.social/@lsloan)

With regard to **tests and documentation**, most existing features of
the `YoutubeLoader` class are not tested. Only the
`YoutubeLoader.extract_video_id()` static method had a test. However,
while waiting for this PR to be reviewed and merged, I had time to
add a test for the chunking feature proposed in this PR.

I have added an example of using chunking to the
`docs/docs/integrations/document_loaders/youtube_transcript.ipynb`
notebook.
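
To illustrate the idea behind the `CHUNKS` format, here is a minimal,
standalone sketch of grouping caption pieces into fixed-duration chunks
that carry a start time and a timestamped URL. It is not the code from
this PR and does not call `YoutubeLoader`: the `Chunk` class, the
`chunk_transcript` function, the `chunk_seconds` parameter, the metadata
keys, and the input shape (a list of dicts with `text` and `start`) are
all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain `Document`."""

    text: str
    metadata: dict = field(default_factory=dict)


def chunk_transcript(pieces, video_id, chunk_seconds=120):
    """Group caption pieces into fixed-duration windows.

    `pieces` is assumed to be a list of dicts with a "text" string and a
    "start" time in seconds. Each piece is assigned to the window that
    contains its start time; empty windows produce no chunk.
    """
    buckets = {}
    for piece in pieces:
        index = int(piece["start"] // chunk_seconds)
        buckets.setdefault(index, []).append(piece["text"].strip())

    chunks = []
    for index in sorted(buckets):
        start = index * chunk_seconds  # window start, in whole seconds
        chunks.append(
            Chunk(
                text=" ".join(buckets[index]),
                metadata={
                    # Assumed metadata keys, chosen for this sketch only.
                    "start_seconds": start,
                    "source": (
                        f"https://www.youtube.com/watch?v={video_id}"
                        f"&t={start}s"
                    ),
                },
            )
        )
    return chunks


if __name__ == "__main__":
    demo = [
        {"text": "hello", "start": 0.0},
        {"text": "world", "start": 50.5},
        {"text": "again", "start": 125.0},
    ]
    for chunk in chunk_transcript(demo, "VIDEO_ID", chunk_seconds=60):
        print(chunk.metadata["source"], "->", chunk.text)
```

This sketch uses the start of the time window as each chunk's start
time; using the start of the first caption piece in the window would be
an equally reasonable choice.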

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2024-06-11 17:44:36 +00:00

| Name | Last commit | Date |
|---|---|---|
| cli | couchbase: Add the initial version of Couchbase partner package (#22087) | 2024-06-07 14:04:08 -07:00 |
| community | community[patch]: Load YouTube transcripts (captions) as fixed-duration chunks with start times (#21710) | 2024-06-11 17:44:36 +00:00 |
| core | core: fix mustache falsy cases (#22747) | 2024-06-10 14:00:12 -07:00 |
| experimental | multiple: get rid of pyproject extras (#22581) | 2024-06-06 15:45:22 -07:00 |
| langchain | langchain[minor]: Add native async implementation to LLMFilter, add concurrency to both sync and async paths (#22739) | 2024-06-11 10:55:40 -04:00 |
| partners | docs: standardize ChatHuggingFace (#22693) | 2024-06-10 20:54:36 +00:00 |
| standard-tests | multiple: add stop attribute (#22573) | 2024-06-06 12:11:52 -04:00 |
| text-splitters | Community[minor]: Add language parser for Elixir (#22742) | 2024-06-10 15:56:57 +00:00 |