langchain/docs/modules/indexes/document_loaders/examples
Lance Martin 4092fd21dc
YoutubeAudioLoader and updates to OpenAIWhisperParser (#5772)
This introduces the `YoutubeAudioLoader`, which loads blobs from a
YouTube URL and writes them to disk. Blobs are then parsed by
`OpenAIWhisperParser()`, as shown in this
[PR](https://github.com/hwchase17/langchain/pull/5580), but we extend
the parser to split audio such that each chunk meets the 25MB OpenAI
size limit. As shown in the notebook, this enables a very simple UX:

```
# Transcribe the video to text
loader = GenericLoader(YoutubeAudioLoader([url], save_dir), OpenAIWhisperParser())
docs = loader.load()
``` 
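
The splitting step described above can be sketched as follows. This is a minimal illustration, not the parser's actual code: it assumes the audio is cut into fixed time windows, and the helper names `max_window_ms` and `chunk_windows`, the 128 kbps bitrate, the 20-minute window, and the reading of 25MB as 25 * 1024 * 1024 bytes are all assumptions made for the example.

```python
from typing import Iterator, Tuple


def max_window_ms(limit_bytes: int, bitrate_bps: int) -> int:
    """Longest window (in ms) whose encoded size stays within limit_bytes,
    assuming a constant bitrate (an assumption; real encoders may vary)."""
    if bitrate_bps <= 0:
        raise ValueError("bitrate_bps must be positive")
    return limit_bytes * 8 * 1000 // bitrate_bps


def chunk_windows(total_ms: int, chunk_ms: int) -> Iterator[Tuple[int, int]]:
    """Yield (start_ms, end_ms) windows covering [0, total_ms) without overlap."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    for start in range(0, total_ms, chunk_ms):
        yield start, min(start + chunk_ms, total_ms)


# Hypothetical numbers: 25 MiB limit, 128 kbps audio, 45-minute recording.
limit_ms = max_window_ms(25 * 1024 * 1024, 128_000)
windows = list(chunk_windows(45 * 60 * 1000, 20 * 60 * 1000))
```

Under these assumptions a 20-minute window sits comfortably below the computed ceiling, so each window could be exported and sent for transcription on its own.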

Tested on full set of Karpathy lecture videos:

```
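# Import paths as of this commit
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser

# Karpathy lecture videos
urls = ["https://youtu.be/VMj-3S1tku0",
        "https://youtu.be/PaCmpygFfXo",
        "https://youtu.be/TCH_1BHY58I",
        "https://youtu.be/P6sfmUTpUmc",
        "https://youtu.be/q8SA3rM6ckI",
        "https://youtu.be/t3YJ5hKiMQ0",
        "https://youtu.be/kCc8FmEb1nY"]

# Directory to save audio files
save_dir = "~/Downloads/YouTube"

# Transcribe the videos to text
loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser())
docs = loader.load()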
```
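
Because long audio is transcribed chunk by chunk, the loaded result may hold several documents per video. A minimal sketch of stitching them back into one transcript is shown here; `Doc` is a hypothetical stand-in for the objects returned by `loader.load()` (only a `page_content` text attribute is assumed), and `join_transcript` is an illustrative helper, not part of this change.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document (assumed shape)."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def join_transcript(docs: List[Doc]) -> str:
    """Concatenate non-empty chunk texts in list order, separated by spaces."""
    parts = (d.page_content.strip() for d in docs)
    return " ".join(p for p in parts if p)


# Made-up sample chunks, including an empty one that should be skipped.
sample = [Doc("Hello and welcome. "), Doc(""), Doc("Today we build a model.")]
transcript = join_transcript(sample)
```

The same helper could be applied to `docs` from the snippet above, provided the chunks arrive in playback order.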
1 year ago
example_data feat: add `UnstructuredExcelLoader` for `.xlsx` and `.xls` files (#5617) 1 year ago
airbyte_json.ipynb
alibaba_cloud_maxcompute.ipynb add maxcompute (#5533) 1 year ago
apify_dataset.ipynb
arxiv.ipynb docs: `ecosystem/integrations` update 1 (#5219) 1 year ago
audio.ipynb Create OpenAIWhisperParser for generating Documents from audio files (#5580) 1 year ago
aws_s3_directory.ipynb
aws_s3_file.ipynb
azlyrics.ipynb
azure_blob_storage_container.ipynb
azure_blob_storage_file.ipynb
bibtex.ipynb
bilibili.ipynb
blackboard.ipynb
blockchain.ipynb
chatgpt_loader.ipynb
college_confidential.ipynb
confluence.ipynb Implements support for Personal Access Token Authentication in the ConfluenceLoader (#5385) 1 year ago
conll-u.ipynb
copypaste.ipynb
csv.ipynb
diffbot.ipynb docs: `ecosystem/integrations` update 2 (#5282) 1 year ago
discord.ipynb docs `ecosystem/integrations` update 3 (#5470) 1 year ago
docugami.ipynb Documentation fixes (linting and broken links) (#5563) 1 year ago
duckdb.ipynb
email.ipynb
epub.ipynb
evernote.ipynb
excel.ipynb feat: add `UnstructuredExcelLoader` for `.xlsx` and `.xls` files (#5617) 1 year ago
facebook_chat.ipynb docs `ecosystem/integrations` update 3 (#5470) 1 year ago
figma.ipynb
file_directory.ipynb
git.ipynb
gitbook.ipynb
github.ipynb DocumentLoader for GitHub (#5408) 1 year ago
google_bigquery.ipynb
google_cloud_storage_directory.ipynb
google_cloud_storage_file.ipynb
google_drive.ipynb
gutenberg.ipynb
hacker_news.ipynb
html.ipynb
hugging_face_dataset.ipynb
ifixit.ipynb
image.ipynb
image_captions.ipynb
imsdb.ipynb
iugu.ipynb
joplin.ipynb
json.ipynb
jupyter_notebook.ipynb
markdown.ipynb
mastodon.ipynb
mediawikidump.ipynb
microsoft_onedrive.ipynb
microsoft_powerpoint.ipynb
microsoft_word.ipynb
modern_treasury.ipynb
notion.ipynb
notiondb.ipynb
obsidian.ipynb
odt.ipynb
pandas_dataframe.ipynb
pdf.ipynb
psychic.ipynb
pyspark_dataframe.ipynb Add minor fixes for PySpark Document Loader Docs (#5525) 1 year ago
readthedocs_documentation.ipynb
reddit.ipynb docs `ecosystem/integrations` update 3 (#5470) 1 year ago
roam.ipynb
sitemap.ipynb Add param requests_kwargs for WebBaseLoader (#5485) 1 year ago
slack.ipynb Fix a typo in the documentation for the Slack document loader (#5745) 1 year ago
spreedly.ipynb
stripe.ipynb
subtitle.ipynb
telegram.ipynb docs `ecosystem/integrations` update 4 (#5590) 1 year ago
tomarkdown.ipynb
toml.ipynb
trello.ipynb docs `ecosystem/integrations` update 4 (#5590) 1 year ago
twitter.ipynb
unstructured_file.ipynb docs: unstructured no longer requires installing detectron2 from source (#5524) 1 year ago
url.ipynb
weather.ipynb
web_base.ipynb
whatsapp_chat.ipynb docs `ecosystem/integrations` update 4 (#5590) 1 year ago
wikipedia.ipynb
youtube_audio.ipynb YoutubeAudioLoader and updates to OpenAIWhisperParser (#5772) 1 year ago
youtube_transcript.ipynb Harrison/youtube multi language (#5758) 1 year ago