langchain/libs/community/langchain_community
Mr. Lance E Sloan «UMich» 84dc2dd059
community[patch]: Load YouTube transcripts (captions) as fixed-duration chunks with start times (#21710)
- **Description:** Add a new format, `CHUNKS`, to
`langchain_community.document_loaders.youtube.YoutubeLoader` which
creates multiple `Document` objects from YouTube video transcripts
(captions), each covering a fixed duration. The metadata of each chunk
`Document` includes that chunk's start time and a URL that opens the
video on the YouTube website at that time.
  
I originally implemented this for UMich (@umich-its-ai) in a local module,
but it makes sense to contribute it to the LangChain community so that
everyone benefits and maintenance is simpler.

- **Issue:** N/A
- **Dependencies:** N/A
- **Twitter:** lsloan_umich
- **Mastodon:**
[lsloan@mastodon.social](https://mastodon.social/@lsloan)

With regard to **tests and documentation**, most existing features of
the `YoutubeLoader` class are not tested. Only the
`YoutubeLoader.extract_video_id()` static method had a test. However,
while waiting for this PR to be reviewed and merged, I had time to
add a test for the chunking feature proposed in this PR.

I have added an example of using chunking to the
`docs/docs/integrations/document_loaders/youtube_transcript.ipynb`
notebook.
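
As a rough illustration of the idea (not the code in this PR), fixed-duration chunking can be sketched in plain Python. The function name `chunk_transcript`, the piece fields `text` and `start`, the metadata keys `start_seconds` and `source`, and the `&t=<seconds>s` URL form are all assumptions made for this sketch; the loader's actual parameter names and metadata keys may differ.

```python
from typing import Dict, List, Union

# One caption piece: its text plus the time (in seconds) at which it starts.
# This shape is an assumption for the sketch, not the loader's internal type.
Piece = Dict[str, Union[str, float]]


def chunk_transcript(
    pieces: List[Piece], video_id: str, chunk_size_seconds: int = 120
) -> List[Dict[str, object]]:
    """Group caption pieces into fixed-duration chunks (hypothetical helper)."""
    # Bucket each piece by the fixed-size time window its start time falls in.
    buckets: Dict[int, List[str]] = {}
    for piece in pieces:
        index = int(float(piece["start"]) // chunk_size_seconds)
        buckets.setdefault(index, []).append(str(piece["text"]).strip())

    chunks: List[Dict[str, object]] = []
    for index in sorted(buckets):
        # This sketch uses the window boundary as the chunk's start time.
        start = index * chunk_size_seconds
        chunks.append(
            {
                "text": " ".join(buckets[index]),
                "start_seconds": start,
                # Assumed URL form that jumps to the chunk's start time.
                "source": f"https://www.youtube.com/watch?v={video_id}&t={start}s",
            }
        )
    return chunks


if __name__ == "__main__":
    sample = [
        {"text": "Hello", "start": 0.0},
        {"text": "world", "start": 2.5},
        {"text": "Next part", "start": 31.0},
    ]
    for chunk in chunk_transcript(sample, "VIDEO_ID", chunk_size_seconds=30):
        print(chunk)
```

Windows that contain no captions produce no chunk in this sketch, so silent stretches of a video are simply skipped.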

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
2024-06-11 17:44:36 +00:00
adapters
agent_toolkits community[patch]: Fix remaining __inits__ in community (#22037) 2024-05-22 17:42:17 +00:00
agents community: update how OpenAIAssistantV2Runnable creates threads with tool_resources (#22549) 2024-06-05 14:19:41 -04:00
callbacks community[patch]: Add missing type annotations (#22758) 2024-06-10 16:59:28 -04:00
chains community[patch]: Add missing type annotations (#22758) 2024-06-10 16:59:28 -04:00
chat_loaders infra: rm unused # noqa violations (#22049) 2024-05-22 15:21:08 -07:00
chat_message_histories community[minor]: Add native async support to SQLChatMessageHistory (#22065) 2024-06-05 15:10:38 +00:00
chat_models Ollama vision support (#22734) 2024-06-11 16:10:19 +00:00
cross_encoders
docstore community[patch]: Fix remaining __inits__ in community (#22037) 2024-05-22 17:42:17 +00:00
document_compressors community[minor]: add Volcengine Rerank (#22700) 2024-06-10 13:41:05 -07:00
document_loaders community[patch]: Load YouTube transcripts (captions) as fixed-duration chunks with start times (#21710) 2024-06-11 17:44:36 +00:00
document_transformers infra: rm unused # noqa violations (#22049) 2024-05-22 15:21:08 -07:00
embeddings community[minor]: Add support for OVHcloud AI Endpoints Embedding (#22667) 2024-06-10 21:07:25 +00:00
example_selectors
graphs infra: rm unused # noqa violations (#22049) 2024-05-22 15:21:08 -07:00
indexes
llms community[patch]: Add missing type annotations (#22758) 2024-06-10 16:59:28 -04:00
memory community[minor]: Add Zep Cloud components + docs + examples (#21671) 2024-05-27 12:50:13 -07:00
output_parsers infra: rm unused # noqa violations (#22049) 2024-05-22 15:21:08 -07:00
query_constructors
retrievers community[patch]: Add missing type annotations (#22758) 2024-06-10 16:59:28 -04:00
storage community[minor]: fix redis store docstring and streamline initialization code (#22730) 2024-06-11 14:08:05 +00:00
tools community[patch]: Add missing type annotations (#22758) 2024-06-10 16:59:28 -04:00
utilities community[minor]: Adds a vector store for Azure Cosmos DB for NoSQL (#21676) 2024-06-11 10:34:01 -07:00
utils community[patch]: Use Custom Logger Instead of Root Logger in get_user_agent Function (#22691) 2024-06-08 02:33:07 +00:00
vectorstores community[minor]: Adds a vector store for Azure Cosmos DB for NoSQL (#21676) 2024-06-11 10:34:01 -07:00
__init__.py
cache.py
py.typed