langchain/tests/unit_tests/document_loaders
Lance Martin 4092fd21dc
YoutubeAudioLoader and updates to OpenAIWhisperParser (#5772)
This introduces the `YoutubeAudioLoader`, which loads audio blobs from a
YouTube URL and writes them to a local directory. The blobs are then parsed by
`OpenAIWhisperParser()`, as shown in this
[PR](https://github.com/hwchase17/langchain/pull/5580), but we extend
the parser to split the audio so that each chunk meets the 25MB OpenAI
size limit. As shown in the notebook, this enables a very simple UX:

```
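from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

# Transcribe the video to text
loader = GenericLoader(YoutubeAudioLoader([url], save_dir), OpenAIWhisperParser())
docs = loader.load()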
``` 
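
The splitting step described above can be sketched independently of the parser. This is a minimal illustration, not the code in this PR: `chunk_spans` and `MAX_UPLOAD_BYTES` are hypothetical names, the 25MB limit is read as 25 * 1024 * 1024 bytes, and the audio is assumed to be encoded at a constant bitrate so that size can be estimated from duration.

```python
from typing import List, Tuple

# Assumption: the 25MB limit from the description, interpreted as MiB.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def chunk_spans(
    total_ms: int,
    bytes_per_second: int,
    max_bytes: int = MAX_UPLOAD_BYTES,
) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) spans whose estimated size is at most max_bytes.

    Hypothetical helper: assumes constant-bitrate audio, so the encoded size
    of a span is roughly duration * bytes_per_second.
    """
    if total_ms < 0 or bytes_per_second <= 0:
        raise ValueError("total_ms must be >= 0 and bytes_per_second > 0")

    # Longest duration (in ms) whose estimated size still fits in the limit.
    max_chunk_ms = max_bytes * 1000 // bytes_per_second
    if max_chunk_ms == 0:
        raise ValueError("bitrate too high for the given size limit")

    # Walk the timeline in fixed steps; the last span is clipped to the end.
    return [
        (start, min(start + max_chunk_ms, total_ms))
        for start in range(0, total_ms, max_chunk_ms)
    ]


if __name__ == "__main__":
    # A two-hour recording at 128 kbit/s (16,000 bytes per second).
    for start, end in chunk_spans(2 * 60 * 60 * 1000, 16_000):
        print(f"{start / 1000:.1f}s -> {end / 1000:.1f}s")
```

Each span could then be exported as its own audio file and transcribed separately, with the resulting texts kept in order.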

Tested on the full set of Karpathy lecture videos:

```
# Karpathy lecture videos
urls = ["https://youtu.be/VMj-3S1tku0",
        "https://youtu.be/PaCmpygFfXo",
        "https://youtu.be/TCH_1BHY58I",
        "https://youtu.be/P6sfmUTpUmc",
        "https://youtu.be/q8SA3rM6ckI",
        "https://youtu.be/t3YJ5hKiMQ0",
        "https://youtu.be/kCc8FmEb1nY"]

# Directory to save audio files 
save_dir = "~/Downloads/YouTube"
 
# Transcribe the videos to text
loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser())
docs = loader.load()
```
1 year ago
| Name | Last commit | Updated |
| --- | --- | --- |
| `blob_loaders` | YoutubeAudioLoader and updates to OpenAIWhisperParser (#5772) | 1 year ago |
| `loaders` | fix(document_loaders/telegram): fix pandas calls + add tests (#4806) | 1 year ago |
| `parsers` | Create OpenAIWhisperParser for generating Documents from audio files (#5580) | 1 year ago |
| `sample_documents` | Bibtex integration for document loader and retriever (#5137) | 1 year ago |
| `test_docs` | Allow readthedoc loader to pass custom html tag (#5175) | 1 year ago |
| `__init__.py` | fix(document_loaders/telegram): fix pandas calls + add tests (#4806) | 1 year ago |
| `test_base.py` | fix(document_loaders/telegram): fix pandas calls + add tests (#4806) | 1 year ago |
| `test_bibtex.py` | Bibtex integration for document loader and retriever (#5137) | 1 year ago |
| `test_bshtml.py` | Add html parsers (#4874) | 1 year ago |
| `test_confluence.py` | Add Confluence Loader unit tests (#3333) | 1 year ago |
| `test_csv_loader.py` | fix(document_loaders/telegram): fix pandas calls + add tests (#4806) | 1 year ago |
| `test_detect_encoding.py` | feat #4479: TextLoader auto detect encoding and improved exceptions (#4927) | 1 year ago |
| `test_directory.py` | Add path validation to DirectoryLoader (#5327) | 1 year ago |
| `test_evernote_loader.py` | feature/4493 Improve Evernote Document Loader (#4577) | 1 year ago |
| `test_generic_loader.py` | Add a generic document loader (#4875) | 1 year ago |
| `test_github.py` | DocumentLoader for GitHub (#5408) | 1 year ago |
| `test_json_loader.py` | fix(document_loaders/telegram): fix pandas calls + add tests (#4806) | 1 year ago |
| `test_psychic.py` | Harrison/psychic (#5063) | 1 year ago |
| `test_readthedoc.py` | Allow readthedoc loader to pass custom html tag (#5175) | 1 year ago |
| `test_telegram.py` | fix(document_loaders/telegram): fix pandas calls + add tests (#4806) | 1 year ago |
| `test_trello.py` | New Trello document loader (#4767) | 1 year ago |
| `test_web_base.py` | fix(document_loaders/telegram): fix pandas calls + add tests (#4806) | 1 year ago |
| `test_youtube.py` | fix(document_loaders/telegram): fix pandas calls + add tests (#4806) | 1 year ago |