langchain

mirror of https://github.com/hwchase17/langchain synced 2024-10-29 17:07:25 +00:00

Lance Martin 4092fd21dc YoutubeAudioLoader and updates to OpenAIWhisperParser (#5772)	2023-06-06 15:15:08 -07:00

This introduces the `YoutubeAudioLoader`, which loads blobs from a YouTube URL and writes them to a local directory. The blobs are then parsed by `OpenAIWhisperParser()`, as shown in this [PR](https://github.com/hwchase17/langchain/pull/5580), but we extend the parser to split the audio so that each chunk meets the 25MB OpenAI size limit.

As shown in the notebook, this enables a very simple UX:

```
# Transcribe the video to text
loader = GenericLoader(YoutubeAudioLoader([url], save_dir), OpenAIWhisperParser())
docs = loader.load()
```

Tested on the full set of Karpathy lecture videos:

```
# Karpathy lecture videos
urls = ["https://youtu.be/VMj-3S1tku0",
        "https://youtu.be/PaCmpygFfXo",
        "https://youtu.be/TCH_1BHY58I",
        "https://youtu.be/P6sfmUTpUmc",
        "https://youtu.be/q8SA3rM6ckI",
        "https://youtu.be/t3YJ5hKiMQ0",
        "https://youtu.be/kCc8FmEb1nY"]

# Directory to save audio files
save_dir = "~/Downloads/YouTube"

# Transcribe the videos to text
loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser())
docs = loader.load()
```
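
The splitting step mentioned in the commit message can be illustrated with a minimal sketch. This is not the code from the PR: the helper name `chunk_bounds_ms`, the constant-bitrate assumption, and the reading of 25MB as 25 * 1024 * 1024 bytes are all assumptions made for illustration. It only computes time boundaries; slicing the audio itself would need an audio library.

```python
# Hypothetical helper, not part of langchain: compute time ranges whose
# encoded size stays within a byte limit, assuming a constant bitrate.
MAX_BYTES = 25 * 1024 * 1024  # assumption: "25MB" interpreted as 25 MiB


def chunk_bounds_ms(duration_ms, bitrate_kbps, max_bytes=MAX_BYTES):
    """Return a list of (start_ms, end_ms) pairs covering the whole audio."""
    if duration_ms < 0 or bitrate_kbps <= 0 or max_bytes <= 0:
        raise ValueError("duration must be >= 0; bitrate and limit must be > 0")
    # At N kbit/s the stream uses N / 8 bytes per millisecond,
    # so the longest chunk that fits is max_bytes * 8 / N milliseconds.
    max_chunk_ms = max_bytes * 8 // bitrate_kbps
    if max_chunk_ms == 0:
        raise ValueError("limit too small for even 1 ms of audio")
    return [
        (start, min(start + max_chunk_ms, duration_ms))
        for start in range(0, duration_ms, max_chunk_ms)
    ]


# One hour of audio at an assumed 128 kbit/s
bounds = chunk_bounds_ms(duration_ms=3_600_000, bitrate_kbps=128)
for start, end in bounds:
    print(start, end)
```

Each pair could then be used to slice the audio and send the pieces to the transcription endpoint one at a time. Real files with variable bitrate would need a safety margin below the limit.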
..
document_loaders/examples	YoutubeAudioLoader and updates to OpenAIWhisperParser (#5772)	2023-06-06 15:15:08 -07:00
retrievers/examples	Harrison/pubmed integration (#5664)	2023-06-03 16:25:28 -07:00
text_splitters	refactor: extract token text splitter function (#5179)	2023-06-04 14:41:44 -07:00
vectorstores	Scores are explained in vectorestore docs (#5613)	2023-06-05 20:39:49 -07:00
document_loaders.rst	fix ver 191 (#5784)	2023-06-06 09:17:23 -07:00
getting_started.ipynb	Update getting_started.ipynb (#4850)	2023-05-17 13:19:14 -07:00
retrievers.rst
text_splitters.rst	code splitter docs (#5480)	2023-05-31 07:11:53 -07:00
vectorstores.rst