The doc loaders index was picking up a bunch of subheadings because I
mistakenly made the MD titles H1s. Fixed that.
also the easy minor warnings from docs_build
I was testing out the WhatsApp Document loader, and noticed that
sometimes the date is of the following format (notice the additional
underscore):
```
3/24/23, 1:54_PM - +91 99999 99999 joined using this group's invite link
3/24/23, 6:29_PM - +91 99999 99999: When are we starting then?
```
Wierdly, the underscore is visible in Vim, but not on editors like
VSCode. I presume it is some unusual character/line terminator.
Nevertheless, I think handling this edge case will make the document
loader more robust.
Adds a loader for Slack Exports which can be a very valuable source of
knowledge to use for internal QA bots and other use cases.
```py
# Export data from your Slack Workspace first.
from langchain.document_loaders import SLackDirectoryLoader
SLACK_WORKSPACE_URL = "https://awesome.slack.com"
loader = ("Slack_Exports", SLACK_WORKSPACE_URL)
docs = loader.load()
```
---------
Co-authored-by: Mikhail Dubov <mikhail@chattermill.io>
I've added a bilibili loader, bilibili is a very active video site in
China and I think we need this loader.
Example:
```python
from langchain.document_loaders.bilibili import BiliBiliLoader
loader = BiliBiliLoader(
["https://www.bilibili.com/video/BV1xt411o7Xu/",
"https://www.bilibili.com/video/av330407025/"]
)
docs = loader.load()
```
Co-authored-by: 了空 <568250549@qq.com>
**Description**
Add custom vector field name and text field name while indexing and
querying for OpenSearch
**Issues**
https://github.com/hwchase17/langchain/issues/2500
Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
Took me a bit to find the proper places to get the API keys. The link
earlier provided to setup search is still good, but why not provide
direct link to the Google cloud tools that give you ability to create
keys?
`combine_docs` does not go through the standard chain call path which
means that chain callbacks won't be triggered, meaning QA chains won't
be traced properly, this fixes that.
Also fix several errors in the chat_vector_db notebook
Adds a new pdf loader using the existing dependency on PDFMiner.
The new loader can be helpful for chunking texts semantically into
sections as the output html content can be parsed via `BeautifulSoup` to
get more structured and rich information about font size, page numbers,
pdf headers/footers, etc. which may not be available otherwise with
other pdf loaders
Improvements to Deep Lake Vector Store
- much faster view loading of embeddings after filters with
`fetch_chunks=True`
- 2x faster ingestion
- use np.float32 for embeddings to save 2x storage, LZ4 compression for
text and metadata storage (saves up to 4x storage for text data)
- user defined functions as filters
Docs
- Added retriever full example for analyzing twitter the-algorithm
source code with GPT4
- Added a use case for code analysis (please let us know your thoughts
how we can improve it)
---------
Co-authored-by: Davit Buniatyan <d@activeloop.ai>
## Why this PR?
Fixes#2624
There's a missing import statement in AzureOpenAI embeddings example.
## What's new in this PR?
- Import `OpenAIEmbeddings` before creating it's object.
## How it's tested?
- By running notebook and creating embedding object.
Signed-off-by: letmerecall <girishsharma001@gmail.com>