mirror of https://github.com/hwchase17/langchain
synced 2024-11-10 01:10:59 +00:00
84dc2dd059
- **Description:** Add a new format, `CHUNKS`, to `langchain_community.document_loaders.youtube.YoutubeLoader`, which creates multiple `Document` objects from a YouTube video's transcript (captions), each covering a fixed duration. The metadata of each chunk `Document` includes its start time and a URL to that time in the video on the YouTube website. I had implemented this for UMich (@umich-its-ai) in a local module, but it makes sense to contribute it to LangChain community for all to benefit and to simplify maintenance.
- **Issue:** N/A
- **Dependencies:** N/A
- **Twitter:** lsloan_umich
- **Mastodon:** [lsloan@mastodon.social](https://mastodon.social/@lsloan)

With regard to **tests and documentation**, most existing features of the `YoutubeLoader` class are not tested. Only the `YoutubeLoader.extract_video_id()` static method had a test. However, while I was waiting for this PR to be reviewed and merged, I had time to add a test for the chunking feature proposed in this PR. I have also added an example of using chunking to the `docs/docs/integrations/document_loaders/youtube_transcript.ipynb` notebook.

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
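The chunking idea described in the commit message can be illustrated with a small, dependency-free sketch. This is not the loader's actual implementation: the function name `chunk_transcript`, the input shape (a list of dicts with `text` and `start` keys), the output dict layout, and the choice to align chunks to fixed windows starting at 0 seconds are all assumptions made for illustration. It does not call LangChain or YouTube.

```python
from typing import Dict, List


def chunk_transcript(
    pieces: List[dict], video_id: str, chunk_size_seconds: int = 120
) -> List[dict]:
    """Group caption pieces into fixed-duration chunks (hypothetical helper).

    Each piece is assumed to look like {"text": str, "start": float},
    where "start" is the offset in seconds from the beginning of the video.
    """
    if chunk_size_seconds <= 0:
        raise ValueError("chunk_size_seconds must be positive")

    # Map each window's start time (0, 120, 240, ...) to the texts inside it.
    grouped: Dict[int, List[str]] = {}
    for piece in pieces:
        window_start = int(piece["start"] // chunk_size_seconds) * chunk_size_seconds
        grouped.setdefault(window_start, []).append(piece["text"].strip())

    # Emit one document-like dict per non-empty window, in time order.
    return [
        {
            "page_content": " ".join(texts),
            "metadata": {
                "start_seconds": start,
                # Assumed URL form: a watch link with a start-time parameter.
                "source": f"https://www.youtube.com/watch?v={video_id}&t={start}s",
            },
        }
        for start, texts in sorted(grouped.items())
    ]


if __name__ == "__main__":
    sample = [
        {"text": "Hello", "start": 0.0},
        {"text": "world", "start": 45.5},
        {"text": "again", "start": 130.2},
    ]
    for chunk in chunk_transcript(sample, "VIDEO_ID", chunk_size_seconds=120):
        print(chunk["metadata"]["source"], chunk["page_content"])
```

Grouping by window index keeps the sketch short and makes every chunk's start time a multiple of the chunk size. The real loader may choose chunk boundaries differently.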
adapters
agent_toolkits
agents
callbacks
chains
chat_loaders
chat_message_histories
chat_models
cross_encoders
docstore
document_compressors
document_loaders
document_transformers
embeddings
example_selectors
graphs
indexes
llms
memory
output_parsers
query_constructors
retrievers
storage
tools
utilities
utils
vectorstores
__init__.py
cache.py
py.typed