langchain/libs/community/langchain_community/document_loaders
Mr. Lance E Sloan «UMich» 84dc2dd059
community[patch]: Load YouTube transcripts (captions) as fixed-duration chunks with start times (#21710)
- **Description:** Add a new format, `CHUNKS`, to
`langchain_community.document_loaders.youtube.YoutubeLoader`. It
creates multiple `Document` objects from a YouTube video transcript
(captions), each covering a fixed duration. The metadata of each chunk
`Document` includes its start time and a URL to that time in
the video on the YouTube website.
  
I had implemented this for UMich (@umich-its-ai) in a local module, but
it makes sense to contribute this to LangChain community for all to
benefit and to simplify maintenance.

- **Issue:** N/A
- **Dependencies:** N/A
- **Twitter:** lsloan_umich
- **Mastodon:**
[lsloan@mastodon.social](https://mastodon.social/@lsloan)

With regard to **tests and documentation**, most existing features of
the `YoutubeLoader` class are not tested. Only the
`YoutubeLoader.extract_video_id()` static method had a test. However,
while I was waiting for this PR to be reviewed and merged, I had time to
add a test for the chunking feature I've proposed in this PR.

I have added an example of using chunking to the
`docs/docs/integrations/document_loaders/youtube_transcript.ipynb`
notebook.
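
To illustrate the idea behind fixed-duration chunking, here is a minimal, standalone sketch. It is not the code in `youtube.py`. The function names `chunk_transcript` and `make_chunk`, the shape of each caption piece (`text`, `start`, `duration`), the metadata keys, and the choice to align chunks to fixed time windows are all assumptions made for this example.

```python
from typing import Dict, Iterable, Iterator, List, Union

# Assumed shape of one caption piece: {"text": str, "start": float, "duration": float}
Piece = Dict[str, Union[str, float]]


def make_chunk(texts: List[str], start_seconds: int, video_id: str) -> dict:
    """Build one chunk with its text, start time, and a URL to that time."""
    minutes, seconds = divmod(start_seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return {
        "page_content": " ".join(texts),
        "metadata": {
            # Hypothetical metadata keys, chosen for this sketch.
            "start_seconds": start_seconds,
            "start_timestamp": f"{hours:02d}:{minutes:02d}:{seconds:02d}",
            "source": (
                f"https://www.youtube.com/watch?v={video_id}&t={start_seconds}s"
            ),
        },
    }


def chunk_transcript(
    pieces: Iterable[Piece],
    video_id: str,
    chunk_size_seconds: int = 120,
) -> Iterator[dict]:
    """Group caption pieces into fixed-duration windows by their start time."""
    texts: List[str] = []
    window = 0  # index of the time window currently being filled
    for piece in pieces:
        piece_window = int(float(piece["start"]) // chunk_size_seconds)
        if piece_window != window and texts:
            # The piece starts in a later window: emit the finished chunk.
            yield make_chunk(texts, window * chunk_size_seconds, video_id)
            texts = []
        window = piece_window
        texts.append(str(piece["text"]).strip())
    if texts:
        # Emit whatever remains after the last piece.
        yield make_chunk(texts, window * chunk_size_seconds, video_id)
```

With a 30-second chunk size, pieces starting at 0 s and 29 s land in the first chunk, and a piece starting at 31.5 s opens a second chunk whose URL ends in `&t=30s`. Windows that contain no captions produce no chunk.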

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
3 months ago
blob_loaders community[patch]: Update doc-string in CloudBlobLoader (#22069) 4 months ago
parsers Community[minor]: Add language parser for Elixir (#22742) 3 months ago
__init__.py infra: rm unused # noqa violations (#22049) 4 months ago
acreom.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
airbyte.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
airbyte_json.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
airtable.py community[patch]: Airtable to allow for addtl params (#22092) 3 months ago
apify_dataset.py community[patch]: update apify integration to attribute API activity to langchain (#21909) 4 months ago
arcgis_loader.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
arxiv.py community[minor]: Implement lazy_load() for ArxivLoader (#18664) 6 months ago
assemblyai.py community[patch]: docstrings update (#20301) 5 months ago
astradb.py (all): update removal in deprecation warnings from 0.2 to 0.3 (#21265) 4 months ago
async_html.py community[minor]: add user agent for web scraping loaders (#22480) 3 months ago
athena.py community[minor]: import fix (#20995) 4 months ago
azlyrics.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
azure_ai_data.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
azure_blob_storage_container.py community[patch]: type ignore fixes (#18395) 6 months ago
azure_blob_storage_file.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
baiducloud_bos_directory.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
baiducloud_bos_file.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
base.py core: Move document loader interfaces to core (#17723) 6 months ago
base_o365.py community[minor]: Added propagation of document metadata from O365BaseLoader (#20663) 4 months ago
bibtex.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
bigquery.py (all): update removal in deprecation warnings from 0.2 to 0.3 (#21265) 4 months ago
bilibili.py community[patch]: docstrings update (#20301) 5 months ago
blackboard.py infra: rm unused # noqa violations (#22049) 4 months ago
blockchain.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
brave_search.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
browserbase.py community: updated Browserbase loader (#21757) 4 months ago
browserless.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
cassandra.py community[minor]: Add Cassandra ByteStore (#22064) 4 months ago
chatgpt.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
chm.py community[patch]: docstrings (#16810) 7 months ago
chromium.py community[minor]: add user agent for web scraping loaders (#22480) 3 months ago
college_confidential.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
concurrent.py community[patch]: import flattening fix (#20110) 5 months ago
confluence.py multiple: Remove unnecessary Ruff suppression comments (#21050) 4 months ago
conllu.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
couchbase.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
csv_loader.py community: Fix CSVLoader columns is None (#20701) 4 months ago
cube_semantic.py community[patch]: Implement lazy_load() for CubeSemanticLoader (#18535) 6 months ago
datadog_logs.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
dataframe.py community[patch]: support modin document loader (#18866) 6 months ago
diffbot.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
directory.py community: fix `DirectoryLoader` progress bar (#19821) 5 months ago
discord.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
doc_intelligence.py docs: community docstring updates (#21040) 4 months ago
docugami.py (all): update removal in deprecation warnings from 0.2 to 0.3 (#21265) 4 months ago
docusaurus.py docs: docstrings `langchain_community` update (#14889) 9 months ago
dropbox.py infra: add print rule to ruff (#16221) 7 months ago
duckdb_loader.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
email.py community[patch]: Small Fix in OutlookMessageLoader (Close the Message once Open) (#22744) 3 months ago
epub.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
etherscan.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
evernote.py infra: rm unused # noqa violations (#22049) 4 months ago
excel.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
facebook_chat.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
fauna.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
figma.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
firecrawl.py community[patch]: Update firecrawl api key name (#22183) 3 months ago
gcs_directory.py (all): update removal in deprecation warnings from 0.2 to 0.3 (#21265) 4 months ago
gcs_file.py (all): update removal in deprecation warnings from 0.2 to 0.3 (#21265) 4 months ago
generic.py community[patch]: import flattening fix (#20110) 5 months ago
geodataframe.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
git.py Merge pull request #18539 6 months ago
gitbook.py community[minor]: Implement lazy_load() for GitbookLoader (#18670) 6 months ago
github.py community: Implement lazy_load() for GithubFileLoader (#18584) 6 months ago
glue_catalog.py community[minor]: Add glue catalog loader (#20220) 5 months ago
google_speech_to_text.py (all): update removal in deprecation warnings from 0.2 to 0.3 (#21265) 4 months ago
googledrive.py (all): update removal in deprecation warnings from 0.2 to 0.3 (#21265) 4 months ago
gutenberg.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
helpers.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
hn.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
html.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
html_bs.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
hugging_face_dataset.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
hugging_face_model.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
ifixit.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
image.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
image_captions.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
imsdb.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
iugu.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
joplin.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
json_loader.py multiple: Remove unnecessary Ruff suppression comments (#21050) 4 months ago
kinetica_loader.py community[patch]: Kinetica Integrations handled error in querying; quotes in table names; updated gpudb API (#22724) 3 months ago
lakefs.py docs: docstrings `langchain_community` update (#14889) 9 months ago
larksuite.py community[minor]: Add LarkSuite wiki document loader. (#21016) 4 months ago
llmsherpa.py community[minor]: add support for llmsherpa (#19741) 5 months ago
markdown.py corrected outdated link (#15053) 9 months ago
mastodon.py Merge pull request #18671 6 months ago
max_compute.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
mediawikidump.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
merge.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
mhtml.py community[patch]: upgrade to recent version of mypy (#21616) 4 months ago
mintbase.py community[minor]: add mintbase loader to langchain (#20089) 4 months ago
modern_treasury.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
mongodb.py community[minor]: added a feature to filter documents in Mongoloader (#18253) 6 months ago
news.py multiple: Remove unnecessary Ruff suppression comments (#21050) 4 months ago
notebook.py community[patch]: add NotebookLoader unit test (#17721) 5 months ago
notion.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
notiondb.py community[patch]: Fix NotionDBLoader 400 Error by conditionally adding filter parameter (#19075) 6 months ago
nuclia.py infra: add print rule to ruff (#16221) 7 months ago
obs_directory.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
obs_file.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
obsidian.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
odt.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
onedrive.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
onedrive_file.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
onenote.py community[patch]: upgrade to recent version of mypy (#21616) 4 months ago
open_city_data.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
oracleadb_loader.py community[minor]: add oracle autonomous database doc loader integration (#19536) 6 months ago
oracleai.py community[minor]: Oraclevs integration (#21123) 4 months ago
org_mode.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
pdf.py community[patch]: upgrade to recent version of mypy (#21616) 4 months ago
pebblo.py community[minor]: Updating payload for pebblo discover API (#22309) 3 months ago
polars_dataframe.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
powerpoint.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
psychic.py multiple: Remove unnecessary Ruff suppression comments (#21050) 4 months ago
pubmed.py community[patch]: upgrade to recent version of mypy (#21616) 4 months ago
pyspark_dataframe.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
python.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
quip.py community[major]: lint for usage of xml library (#22132) 4 months ago
readthedocs.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
recursive_url_loader.py community[patch]: recursive url loader fix and unit tests (#22521) 3 months ago
reddit.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
roam.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
rocksetdb.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
rspace.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
rss.py multiple: Remove unnecessary Ruff suppression comments (#21050) 4 months ago
rst.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
rtf.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
s3_directory.py community[patch]: Skip nested directories when using S3DirectoryLoader (#17829) 6 months ago
s3_file.py community[patch]: support unstructured_kwargs for s3 loader (#15473) 6 months ago
scrapfly.py community[minor]: Add Scrapfly Loader community integration (#22036) 4 months ago
sharepoint.py community[patch]: Put authorized identities behind a feature flag in SharepointLoader (#22125) 4 months ago
sitemap.py community[minor]: Implement lazy_load() for SitemapLoader (#18667) 6 months ago
slack_directory.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
snowflake_loader.py community[patch]: upgrade to recent version of mypy (#21616) 4 months ago
spider.py doc list not empty (#21208) 4 months ago
spreedly.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
sql_database.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
srt.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
stripe.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
surrealdb.py community[patch]: SurrealDB fix for asyncio (#16092) 8 months ago
telegram.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
tencent_cos_directory.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
tencent_cos_file.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
tensorflow_datasets.py community[patch]: upgrade to recent version of mypy (#21616) 4 months ago
text.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
tidb.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
tomarkdown.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
toml.py community: Use default load() implementation in doc loaders (#18385) 6 months ago
trello.py community: Implement lazy_load() for TrelloLoader (#18658) 6 months ago
tsv.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
twitter.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
unstructured.py community[minor]: import fix (#20995) 4 months ago
url.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
url_playwright.py docs: community docstring updates (#21040) 4 months ago
url_selenium.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
vsdx.py community[patch]: import flattening fix (#20110) 5 months ago
weather.py community[patch]: upgrade to recent version of mypy (#21616) 4 months ago
web_base.py community[minor]: add user agent for web scraping loaders (#22480) 3 months ago
whatsapp_chat.py community: Implement lazy_load() for WhatsAppChatLoader (#18677) 6 months ago
wikipedia.py community[patch]: upgrade to recent version of mypy (#21616) 4 months ago
word_document.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
xml.py community: better support of pathlib paths in document loaders (#18396) 6 months ago
xorbits.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 9 months ago
youtube.py community[patch]: Load YouTube transcripts (captions) as fixed-duration chunks with start times (#21710) 3 months ago
yuque.py community[minor]: add Yuque document loader (#17924) 6 months ago