langchain/docs/modules/indexes
Lance Martin 4092fd21dc
YoutubeAudioLoader and updates to OpenAIWhisperParser (#5772)
This introduces the `YoutubeAudioLoader`, which loads blobs from a
YouTube URL and writes them to a save directory. The blobs are then parsed by
`OpenAIWhisperParser()`, as shown in this
[PR](https://github.com/hwchase17/langchain/pull/5580), but we extend
the parser to split the audio so that each chunk meets the 25MB OpenAI
size limit. As shown in the notebook, this enables a very simple UX:

```
# Transcribe the video to text
loader = GenericLoader(YoutubeAudioLoader([url], save_dir), OpenAIWhisperParser())
docs = loader.load()
``` 
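
The splitting step described above can be sketched in plain Python. This is a minimal illustration rather than the parser's actual code: the helper names `chunk_bounds` and `estimated_bytes`, the 20-minute window, and the 128 kbps bitrate are all assumptions chosen for the example. Only the 25MB limit comes from the description above.

```python
from typing import Iterator, Tuple

# 25MB upload limit, as stated in the commit message.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024

# Hypothetical values, chosen only for this sketch.
CHUNK_MS = 20 * 60 * 1000  # assumed 20-minute window
BITRATE_KBPS = 128         # assumed constant audio bitrate


def chunk_bounds(total_ms: int, chunk_ms: int) -> Iterator[Tuple[int, int]]:
    """Yield (start_ms, end_ms) windows that together cover the recording."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    for start in range(0, total_ms, chunk_ms):
        # The last window is clipped to the end of the recording.
        yield start, min(start + chunk_ms, total_ms)


def estimated_bytes(duration_ms: int, bitrate_kbps: int) -> int:
    """Rough size of a constant-bitrate chunk: kbps * ms gives bits."""
    return duration_ms * bitrate_kbps // 8


if __name__ == "__main__":
    total_ms = 50 * 60 * 1000  # a hypothetical 50-minute lecture
    for start, end in chunk_bounds(total_ms, CHUNK_MS):
        size = estimated_bytes(end - start, BITRATE_KBPS)
        # Each window would be transcribed separately and the text joined.
        print(start, end, size, size <= MAX_UPLOAD_BYTES)
```

Splitting by time rather than by bytes keeps each chunk a valid audio segment; the size check simply confirms that the assumed window stays under the limit at the assumed bitrate.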

Tested on the full set of Karpathy lecture videos:

```
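# Imports (module paths assumed to match the layout at the time of this commit)
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

# Karpathy lecture videos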
urls = ["https://youtu.be/VMj-3S1tku0",
        "https://youtu.be/PaCmpygFfXo",
        "https://youtu.be/TCH_1BHY58I",
        "https://youtu.be/P6sfmUTpUmc",
        "https://youtu.be/q8SA3rM6ckI",
        "https://youtu.be/t3YJ5hKiMQ0",
        "https://youtu.be/kCc8FmEb1nY"]

# Directory to save audio files 
save_dir = "~/Downloads/YouTube"
 
# Transcribe the videos to text
loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser())
docs = loader.load()
```
2023-06-06 15:15:08 -07:00
document_loaders/examples YoutubeAudioLoader and updates to OpenAIWhisperParser (#5772) 2023-06-06 15:15:08 -07:00
retrievers/examples Harrison/pubmed integration (#5664) 2023-06-03 16:25:28 -07:00
text_splitters refactor: extract token text splitter function (#5179) 2023-06-04 14:41:44 -07:00
vectorstores Scores are explained in vectorestore docs (#5613) 2023-06-05 20:39:49 -07:00
document_loaders.rst fix ver 191 (#5784) 2023-06-06 09:17:23 -07:00
getting_started.ipynb Update getting_started.ipynb (#4850) 2023-05-17 13:19:14 -07:00
retrievers.rst
text_splitters.rst code splitter docs (#5480) 2023-05-31 07:11:53 -07:00
vectorstores.rst