This introduces the `YoutubeAudioLoader`, which loads blobs from a
YouTube URL and writes them to a save directory. The blobs are then parsed by
`OpenAIWhisperParser()`, as shown in this
[PR](https://github.com/hwchase17/langchain/pull/5580), but we extend
the parser to split audio so that each chunk meets the 25MB OpenAI
size limit. As shown in the notebook, this enables a very simple UX:
```
# Transcribe the video to text
loader = GenericLoader(YoutubeAudioLoader([url], save_dir), OpenAIWhisperParser())
docs = loader.load()
```
Tested on the full set of Karpathy lecture videos:
```
# Karpathy lecture videos
urls = [
    "https://youtu.be/VMj-3S1tku0",
    "https://youtu.be/PaCmpygFfXo",
    "https://youtu.be/TCH_1BHY58I",
    "https://youtu.be/P6sfmUTpUmc",
    "https://youtu.be/q8SA3rM6ckI",
    "https://youtu.be/t3YJ5hKiMQ0",
    "https://youtu.be/kCc8FmEb1nY",
]
# Directory to save audio files
save_dir = "~/Downloads/YouTube"
# Transcribe the videos to text
loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser())
docs = loader.load()
```
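The splitting step can be pictured as cutting the audio into fixed-length time windows so that each exported chunk stays under the 25MB limit. The sketch below only plans the window boundaries. `plan_chunks` and the 20-minute default window are illustrative assumptions, not the parser's actual implementation, and the right window length depends on the audio bitrate.

```python
from typing import List, Tuple


def plan_chunks(duration_ms: int, chunk_ms: int = 20 * 60 * 1000) -> List[Tuple[int, int]]:
    """Return (start, end) millisecond windows covering the whole recording."""
    if duration_ms < 0 or chunk_ms <= 0:
        raise ValueError("duration must be non-negative and chunk length positive")
    return [
        (start, min(start + chunk_ms, duration_ms))
        for start in range(0, duration_ms, chunk_ms)
    ]


# A 45-minute recording is cut into two full windows and one 5-minute remainder.
windows = plan_chunks(45 * 60 * 1000)
```

Each window would then be exported and transcribed separately, with the resulting texts joined in order.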
# OpenAIWhisperParser
This PR creates a new parser, `OpenAIWhisperParser`, that uses the
[OpenAI Whisper
model](https://platform.openai.com/docs/guides/speech-to-text/quickstart)
to perform transcription of audio files to text (`Documents`). Please
see the notebook for usage.
# Implements support for Personal Access Token Authentication in the ConfluenceLoader
Fixes #5191
Implements a new optional parameter for the ConfluenceLoader: `token`.
This allows the use of personal access authentication when using the
on-prem server version of Confluence.
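For illustration, personal access tokens on on-prem Confluence are typically sent as a bearer token in the `Authorization` header. The sketch below builds such a request without sending it. The helper name, base URL, and token value are hypothetical, and it does not show how the loader itself passes `token` along.

```python
from urllib.request import Request


def build_confluence_request(
    base_url: str, token: str, path: str = "/rest/api/content"
) -> Request:
    """Build (but do not send) a GET request authenticated with a personal access token."""
    url = base_url.rstrip("/") + path
    return Request(url, headers={"Authorization": f"Bearer {token}"})


# Hypothetical on-prem instance and placeholder token.
req = build_confluence_request("https://confluence.example.com/", "FAKE_TOKEN")
```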
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
@eyurtsev @Jflick58
Twitter Handle: felipe_yyc
---------
Co-authored-by: Felipe <feferreira@ea.com>
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# Unstructured Excel Loader
Adds an `UnstructuredExcelLoader` class for `.xlsx` and `.xls` files.
Works with `unstructured>=0.6.7`. A plain text representation of the
Excel file will be available under the `page_content` attribute in the
doc. If you use the loader in `"elements"` mode, an HTML representation
of the Excel file will be available under the `text_as_html` metadata
key. Each sheet in the Excel document is its own document.
### Testing
```python
from langchain.document_loaders import UnstructuredExcelLoader
loader = UnstructuredExcelLoader(
"example_data/stanley-cups.xlsx",
mode="elements"
)
docs = loader.load()
```
## Who can review?
@hwchase17
@eyurtsev
# Lint sphinx documentation and fix broken links
This PR fixes multiple warnings shown when generating the project
documentation (using "make docs_linkcheck" and "make docs_build").
Additionally, internal documentation links to (now?) non-existent files
are modified to point to the existing documents that seemed to be the
correct new targets.
The documentation content itself is not updated.
There are no source code changes.
Fixes # (issue)
- broken documentation links to other files within the project
- sphinx formatting (linting)
## Before submitting
No source code changes, so no new tests added.
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# docs: `ecosystem_integrations` update 3
The next cycle of updates to `ecosystem/integrations`:
* Added an integration `template` file
* Added missing integration files
* Fixed several document_loaders/notebooks
## Who can review?
Is it possible to assign somebody to review PRs on docs? Thanks.
# Update Unstructured docs to remove the `detectron2` install instructions
Removes `detectron2` installation instructions from the Unstructured
docs because installing `detectron2` is no longer required for
`unstructured>=0.7.0`. The `detectron2` model now runs using the ONNX
runtime.
## Who can review?
@hwchase17
@eyurtsev
# Creates GitHubLoader (#5257)
GitHubLoader is a DocumentLoader that loads issues and PRs from GitHub.
Fixes #5257
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# Added New Trello loader class and documentation
A simple loader built on top of the py-trello wrapper.
Given a board name, you can pull its cards and tweak some field
parameters for the load operation.
Documentation and examples are included.
Unit tests are included, using patch and a fixture for the py-trello
client class.
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# docs: ecosystem/integrations update
This is the first in a series of `ecosystem/integrations` updates.
The ecosystem/integrations list is missing many integrations.
I'm adding the missing integrations in a consistent format:
1. description of the integrated system
2. `Installation and Setup` section with `pip install ...`, key setup,
and other necessary settings
3. Sections like `LLM`, `Text Embedding Models`, `Chat Models`... with
links to the corresponding examples and imports of the classes used.
This PR adds docs for integrations that are present in
`docs/modules/models/text_embedding/examples` but missing from
`ecosystem/integrations`. The next PRs will cover the remaining example
sections.
Also updated `integrations.rst`: added the `Dependencies` section with a
link to the packages used in LangChain.
## Who can review?
@hwchase17
@eyurtsev
@dev2049
# docs: ecosystem/integrations update 2
#5219 - part 1
The second part of this update (parts are independent of each other! no
overlap):
- added diffbot.md
- updated confluence.ipynb; added confluence.md
- updated college_confidential.md
- updated openai.md
- added blackboard.md
- added bilibili.md
- added azure_blob_storage.md
- added azlyrics.md
- added aws_s3.md
## Who can review?
@hwchase17
@agola11
@vowelparrot
@dev2049
# Bibtex integration
Wraps `bibtexparser` to retrieve a list of docs from a bibtex file.
* Gets the metadata from the bibtex entries
* `page_content` is read from the local PDF referenced in the `file` field
of the bibtex entry, using `pymupdf`
* If there is no valid PDF file, `page_content` is set to the `abstract`
field of the bibtex entry
* Supports the Zotero flavour, using a regex to get the file path
* Added usage example in
`docs/modules/indexes/document_loaders/examples/bibtex.ipynb`
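The fallback described in the list above can be sketched with plain dictionaries standing in for parsed bibtex entries. This is a minimal illustration only: the Zotero-style `file` layout (`description:path:mimetype`), the regex, and both helper names are assumptions, and neither `bibtexparser` nor `pymupdf` is called here.

```python
import os
import re
from typing import Dict, Optional

# Assumed Zotero-style `file` field: "description:/path/to/file.pdf:application/pdf".
ZOTERO_FILE = re.compile(r"^[^:]*:(?P<path>.+):application/pdf$")


def pdf_path(entry: Dict[str, str]) -> Optional[str]:
    """Return the PDF path named in the entry's `file` field, if it exists on disk."""
    raw = entry.get("file", "")
    match = ZOTERO_FILE.match(raw)
    path = match.group("path") if match else raw
    return path if path and os.path.isfile(path) else None


def choose_content_source(entry: Dict[str, str]) -> str:
    """Prefer the local PDF; otherwise fall back to the abstract."""
    return "pdf" if pdf_path(entry) else "abstract"


# The referenced file does not exist, so the abstract is used instead.
entry = {
    "file": "Full Text PDF:/no/such/file.pdf:application/pdf",
    "abstract": "A short abstract.",
}
source = choose_content_source(entry)
```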
---------
Co-authored-by: Sébastien M. Popoff <sebastien.popoff@espci.fr>
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# Add Joplin document loader
[Joplin](https://joplinapp.org/) is an open source note-taking app.
Joplin has a [REST API](https://joplinapp.org/api/references/rest_api/)
for accessing its local database. The proposed `JoplinLoader` uses the
API to retrieve all notes in the database and their metadata. Joplin
needs to be installed and running locally, and an access token is
required.
- The PR includes an integration test.
- The PR includes an example notebook.
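As a rough sketch of how all notes can be retrieved, the Joplin API returns results in pages, each with an `items` list and a `has_more` flag. The example below walks the pages with an injected `fetch` callable so it runs without a Joplin instance. The helper names and the chosen `fields` are assumptions, not the `JoplinLoader` code.

```python
from typing import Callable, Dict, Iterator
from urllib.parse import urlencode

BASE_URL = "http://localhost:41184"  # Joplin's default local API port (assumed unchanged)


def notes_url(token: str, page: int) -> str:
    """Build the URL for one page of notes."""
    query = urlencode({"token": token, "fields": "id,title,body", "page": page})
    return f"{BASE_URL}/notes?{query}"


def iter_notes(token: str, fetch: Callable[[str], Dict]) -> Iterator[Dict]:
    """Yield notes page by page until the API reports no further pages."""
    page = 1
    while True:
        payload = fetch(notes_url(token, page))
        yield from payload.get("items", [])
        if not payload.get("has_more"):
            break
        page += 1


def fake_fetch(url: str) -> Dict:
    """Stand-in for an HTTP GET plus JSON decoding."""
    if "page=1" in url:
        return {"items": [{"id": "a", "title": "First"}], "has_more": True}
    return {"items": [{"id": "b", "title": "Second"}], "has_more": False}


titles = [note["title"] for note in iter_notes("FAKE_TOKEN", fake_fetch)]
```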
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# DOCS added missing document_loader examples
Added missing examples: `JSON`, `Open Document Format (ODT)`,
`Wikipedia`, `tomarkdown`.
Updated them to a consistent format.
## Who can review?
@hwchase17
@dev2049
# Add Mastodon toots loader.
The loader works either with public toots or with Mastodon app
credentials. Toot text and user info are loaded.
I've also added an integration test for this new loader, since it works
with public data, and a notebook with example output.
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
### Submit Multiple Files to the Unstructured API
Enables batching multiple files into a single Unstructured API request.
Support for requests with multiple files was added to both
`UnstructuredAPIFileLoader` and `UnstructuredAPIFileIOLoader`. Note that
if you submit multiple files in "single" mode, the result will be
concatenated into a single document. We recommend using this feature in
"elements" mode.
### Testing
The following should load both documents, using two of the example docs
from the integration tests folder.
```python
from langchain.document_loaders import UnstructuredAPIFileLoader
file_paths = ["examples/layout-parser-paper.pdf", "examples/whatsapp_chat.txt"]
loader = UnstructuredAPIFileLoader(
file_paths=file_paths,
api_key="FAKE_API_KEY",
strategy="fast",
mode="elements",
)
docs = loader.load()
```
# Improve Evernote Document Loader
When exporting from Evernote you may export more than one note.
Currently the Evernote loader concatenates the content of all notes in
the export into a single document and only attaches the name of the
export file as metadata on the document.
This change ensures that each note is loaded as an independent document
and that all available metadata on the note (e.g. author, title, created,
updated) is added as metadata on each document.
It also uses `html2text`, an existing optional dependency, instead of
`pypandoc`. This removes the need to download the pandoc application via
`download_pandoc()` in order to use the `pypandoc` Python bindings.
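The one-document-per-note idea can be sketched with the standard library alone. The export below is a heavily simplified, made-up stand-in for a real `.enex` file (real exports wrap note content in CDATA-encoded markup), and `notes_from_enex` is a hypothetical helper, not the loader's code.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Simplified, invented export with two notes.
ENEX = """<en-export>
  <note>
    <title>Groceries</title>
    <content>milk, eggs</content>
    <created>20230101T120000Z</created>
    <note-attributes><author>Ada</author></note-attributes>
  </note>
  <note>
    <title>Ideas</title>
    <content>write more tests</content>
    <created>20230102T090000Z</created>
    <note-attributes><author>Grace</author></note-attributes>
  </note>
</en-export>"""


def notes_from_enex(xml_text: str) -> List[Dict]:
    """Return one record per <note>, each carrying its own metadata."""
    records = []
    for note in ET.fromstring(xml_text).iter("note"):
        records.append({
            "page_content": note.findtext("content", default=""),
            "metadata": {
                "title": note.findtext("title"),
                "created": note.findtext("created"),
                "author": note.findtext("note-attributes/author"),
            },
        })
    return records


records = notes_from_enex(ENEX)
```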
Fixes #4493
Co-authored-by: Mike McGarry <mike.mcgarry@finbourne.com>
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# Fix bilibili api import error
The bilibili-api package is deprecated and there is no sync module.
Fixes #2673, #2724
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
@vowelparrot @liaokongVFX
# TextLoader auto detect encoding and enhanced exception handling
- Adds an option to enable encoding detection on `TextLoader`.
- The detection is done using `chardet`.
- Loading tries all detected encodings in order of confidence and raises
an exception if none of them works.
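The try-in-order behaviour can be sketched without `chardet`: here the candidate encodings are a hard-coded list rather than detection results, and `decode_with_fallback` is a hypothetical helper, not the `TextLoader` implementation.

```python
from typing import Iterable


def decode_with_fallback(data: bytes, encodings: Iterable[str]) -> str:
    """Try each candidate encoding in order; raise if none can decode the bytes."""
    tried = []
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            tried.append(encoding)
    raise RuntimeError(f"could not decode data with any of: {tried}")


# "café" encoded as Latin-1 is not valid UTF-8, so the second candidate is used.
text = decode_with_fallback("café".encode("latin-1"), ["utf-8", "latin-1"])
```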
### New Dependencies:
- `chardet`
Fixes #4479
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
- @eyurtsev
---------
Co-authored-by: blob42 <spike@w530>
# Load specific file types from Google Drive (issue #4878)
Adds the ability to choose which file types to load from
Google Drive.
```
loader = GoogleDriveLoader(
    folder_id="1yucgL9WGgWZdM1TOuKkeghlPizuzMYb5",
    file_types=["document", "pdf"],
    recursive=False,
)
```
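One way such a filter could translate into a Drive search is a query that combines the folder with a set of MIME types. The sketch below is only an illustration: the short names mirror those in the snippet above, but the mapping and the `drive_query` helper are assumptions and may differ from what the loader does internally.

```python
from typing import List

# Well-known MIME types; the mapping from short names is assumed for illustration.
MIME_TYPES = {
    "document": "application/vnd.google-apps.document",
    "sheet": "application/vnd.google-apps.spreadsheet",
    "pdf": "application/pdf",
}


def drive_query(folder_id: str, file_types: List[str]) -> str:
    """Build a Drive search query restricted to a folder and some file types."""
    type_clause = " or ".join(f"mimeType = '{MIME_TYPES[t]}'" for t in file_types)
    return f"'{folder_id}' in parents and ({type_clause})"


query = drive_query("FOLDER_ID", ["document", "pdf"])
```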
Fixes #4878
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
DataLoaders
- @eyurtsev
Twitter: [@UmerHAdil](https://twitter.com/@UmerHAdil) | Discord:
RicChilligerDude#7589
---------
Co-authored-by: UmerHA <40663591+UmerHA@users.noreply.github.com>
# Docs: improvements in the `retrievers/examples/` notebooks
The primary purpose of this PR is to make the Jupyter notebook examples
**consistent** and more suitable for first-time viewers.
- added links to the integration source (if applicable) with a short
description of this source;
- removed the `_retriever` suffix from the file names (where it existed) for
consistency;
- removed ` retriever` from the notebook titles (where it existed) for
consistency;
- added code to install the necessary Python package(s);
- added code to set up the necessary API key;
- made very small fixes in notebooks from other folders (for consistency):
  - docs/modules/indexes/vectorstores/examples/elasticsearch.ipynb
  - docs/modules/indexes/vectorstores/examples/pinecone.ipynb
  - docs/modules/models/llms/integrations/cohere.ipynb
- fixed a misspelling in a langchain/retrievers/time_weighted_retriever.py
comment (sorry about this change in a .py file)
## Who can review
@dev2049
# Fix Telegram API loader + add tests.
I was testing this integration and it failed with the following error:
```python
message_threads = loader._get_message_threads(df)
KeyError: False
```
Also, this particular loader had no tests or related dependency group in
poetry, so I added those as well.
@hwchase17 / @eyurtsev please take a look at this fix PR.
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
### Adds a document loader for Docugami
Specifically:
1. Adds a data loader that talks to the [Docugami](http://docugami.com)
API to download processed documents as semantic XML
2. Parses the semantic XML into chunks, with additional metadata
capturing chunk semantics
3. Adds a detailed notebook showing how you can use additional metadata
returned by Docugami for techniques like the [self-querying
retriever](https://python.langchain.com/en/latest/modules/indexes/retrievers/examples/self_query_retriever.html)
4. Adds an integration test, and related documentation
Here is an example of a result that is not possible without the
capabilities added by Docugami (from the notebook):
<img width="1585" alt="image"
src="https://github.com/hwchase17/langchain/assets/749277/bb6c1ce3-13dc-4349-a53b-de16681fdd5b">
---------
Co-authored-by: Taqi Jaffri <tjaffri@docugami.com>
Co-authored-by: Taqi Jaffri <tjaffri@gmail.com>
# ODF File Loader
Adds a data loader for handling Open Office ODT files. Requires
`unstructured>=0.6.3`.
### Testing
The following should work using the `fake.odt` example doc from the
[`unstructured` repo](https://github.com/Unstructured-IO/unstructured).
```python
from langchain.document_loaders import UnstructuredODTLoader
loader = UnstructuredODTLoader(file_path="fake.odt", mode="elements")
loader.load()
loader = UnstructuredODTLoader(file_path="fake.odt", mode="single")
loader.load()
```
- Added the `Wikipedia` document loader. It is based on the existing
`utilities/WikipediaAPIWrapper`
- Added the corresponding unit tests and an example notebook
- Sorted the list of classes in `__init__`
- made notebooks consistent: titles, service/format descriptions.
- corrected short names to full names, for example, `Word` -> `Microsoft
Word`
- added missed descriptions
- renamed notebook files to make ToC correctly sorted
This implements a loader of text passages in JSON format. The `jq`
syntax is used to define a schema for accessing the relevant contents
from the JSON file. This requires dependency on the `jq` package:
https://pypi.org/project/jq/.
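To give a feel for what such a schema selects, the sketch below evaluates a tiny jq-like path in pure Python. It is a deliberately limited stand-in, not the `jq` package: only `.key` and `[]` are supported, and the `select` helper and sample payload are invented for illustration.

```python
import json
from typing import Any, List


def select(data: Any, path: str) -> List[Any]:
    """Evaluate a tiny jq-like path such as ".messages[].content".

    Only `.key` and `[]` (iterate over a list) are supported.
    """
    current = [data]
    for step in path.strip(".").split("."):
        iterate = step.endswith("[]")
        key = step[:-2] if iterate else step
        if key:
            current = [item[key] for item in current]
        if iterate:
            current = [child for item in current for child in item]
    return current


payload = json.loads('{"messages": [{"content": "hi"}, {"content": "bye"}]}')
contents = select(payload, ".messages[].content")
```

Each selected value would then become the text of one document.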
---------
Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>