Hi there:
As I implement the AnalyticDB VectorStore use two table to store the
document before. It seems just use one table is a better way. So this
commit is try to improve AnalyticDB VectorStore implementation without
affecting user behavior:
**1. Streamline the `post_init `behavior by creating a single table with
vector indexing.
2. Update the `add_texts` API for document insertion.
3. Optimize `similarity_search_with_score_by_vector` to retrieve results
directly from the table.
4. Implement `_similarity_search_with_relevance_scores`.
5. Add `embedding_dimension` parameter to support different dimension
embedding functions.**
Users can continue using the API as before.
Test cases added before is enough to meet this commit.
<!--
Thank you for contributing to LangChain! Your PR will appear in our
release under the title you set. Please make sure it highlights your
valuable contribution.
Replace this with a description of the change, the issue it fixes (if
applicable), and relevant context. List any dependencies required for
this change.
After you're done, someone will review your PR. They may suggest
improvements. If no one reviews your PR within a few days, feel free to
@-mention the same people again, as notifications can get lost.
Finally, we'd love to show appreciation for your contribution - if you'd
like us to shout you out on Twitter, please also include your handle!
-->
<!-- Remove if not applicable -->
Fixes ##6039
#### Before submitting
<!-- If you're adding a new integration, please include:
1. a test for the integration - favor unit tests that does not rely on
network access.
2. an example notebook showing its use
See contribution guidelines for more information on how to write tests,
lint
etc:
https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md
-->
#### Who can review?
Tag maintainers/contributors who might be interested:
@hwchase17 @agola11
<!-- For a quicker response, figure out the right person to tag with @
@hwchase17 - project lead
Tracing / Callbacks
- @agola11
Async
- @agola11
DataLoaders
- @eyurtsev
Models
- @hwchase17
- @agola11
Agents / Tools / Toolkits
- @hwchase17
VectorStores / Retrievers / Memory
- @dev2049
-->
## DocArray as a Retriever
[DocArray](https://github.com/docarray/docarray) is an open-source tool
for managing your multi-modal data. It offers flexibility to store and
search through your data using various document index backends. This PR
introduces `DocArrayRetriever` - which works with any available backend
and serves as a retriever for Langchain apps.
Also, I added 2 notebooks:
DocArray Backends - intro to all 5 currently supported backends, how to
initialize, index, and use them as a retriever
DocArray Usage - showcasing what additional search parameters you can
pass to create versatile retrievers
Example:
```python
from docarray.index import InMemoryExactNNIndex
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.retrievers import DocArrayRetriever
# define document schema
class MyDoc(BaseDoc):
description: str
description_embedding: NdArray[1536]
embeddings = OpenAIEmbeddings()
# create documents
descriptions = ["description 1", "description 2"]
desc_embeddings = embeddings.embed_documents(texts=descriptions)
docs = DocList[MyDoc](
[
MyDoc(description=desc, description_embedding=embedding)
for desc, embedding in zip(descriptions, desc_embeddings)
]
)
# initialize document index with data
db = InMemoryExactNNIndex[MyDoc](docs)
# create a retriever
retriever = DocArrayRetriever(
index=db,
embeddings=embeddings,
search_field="description_embedding",
content_field="description",
)
# find the relevant document
doc = retriever.get_relevant_documents("action movies")
print(doc)
```
#### Who can review?
@dev2049
---------
Signed-off-by: jupyterjazz <saba.sturua@jina.ai>
<!--
Thank you for contributing to LangChain! Your PR will appear in our
release under the title you set. Please make sure it highlights your
valuable contribution.
Replace this with a description of the change, the issue it fixes (if
applicable), and relevant context. List any dependencies required for
this change.
After you're done, someone will review your PR. They may suggest
improvements. If no one reviews your PR within a few days, feel free to
@-mention the same people again, as notifications can get lost.
Finally, we'd love to show appreciation for your contribution - if you'd
like us to shout you out on Twitter, please also include your handle!
-->
<!-- Remove if not applicable -->
Fixes #
links to prompt templates and example selectors on the
[Prompts](https://python.langchain.com/docs/modules/model_io/prompts/)
page are invalid.
#### Before submitting
Just a small note that I tried to run `make docs_clean` and other
related commands before PR written
[here](https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md#build-documentation-locally),
it gives me an error:
```bash
langchain % make docs_clean
Traceback (most recent call last):
File "/Users/masafumi/Downloads/langchain/.venv/bin/make", line 5, in <module>
from scripts.proto import main
ModuleNotFoundError: No module named 'scripts'
make: *** [docs_clean] Error 1
# Poetry (version 1.5.1)
# Python 3.9.13
```
I couldn't figure out how to fix this, so I didn't run those command.
But links should work.
#### Who can review?
Tag maintainers/contributors who might be interested:
@hwchase17
Similar issue #6323
Co-authored-by: masafumimori <m.masafumimori@outlook.com>
# Handle Managed Motorhead Data Key
Managed motorhead will return a payload with a `data` key. we need to
handle this to properly access messages from the server.
Just adds some comments and docstring improvements.
There was some behaviour that was quite unclear to me at first like:
- "when do things get updated?"
- "why are there only entity names and no summaries?"
- "why do the entity names disappear?"
Now it can be much more obvious to many.
I am lukestanley on Twitter.
1. Changed the implementation of add_texts interface for the AwaDB
vector store in order to improve the performance
2. Upgrade the AwaDB from 0.3.2 to 0.3.3
---------
Co-authored-by: vincent <awadb.vincent@gmail.com>
Fixes https://github.com/hwchase17/langchain/issues/6172
As described in https://github.com/hwchase17/langchain/issues/6172, I'd
love to help update the dev container in this project.
**Summary of changes:**
- Dev container now builds (the current container in this repo won't
build for me)
- Dockerfile updates
- Update image to our [currently-maintained Python
image](https://github.com/devcontainers/images/tree/main/src/python/.devcontainer)
(`mcr.microsoft.com/devcontainers/python`) rather than the deprecated
image from vscode-dev-containers
- Move Dockerfile to root of repo - in order for `COPY` to work
properly, it needs the files (in this case, `pyproject.toml` and
`poetry.toml`) in the same directory
- devcontainer.json updates
- Removed `customizations` and `remoteUser` since they should be covered
by the updated image in the Dockerfile
- Update comments
- Update docker-compose.yaml to properly point to updated Dockerfile
- Add a .gitattributes to avoid line ending conversions, which can
result in hundreds of pending changes
([info](https://code.visualstudio.com/docs/devcontainers/tips-and-tricks#_resolving-git-line-ending-issues-in-containers-resulting-in-many-modified-files))
- Add a README in the .devcontainer folder and info on the dev container
in the contributing.md
**Outstanding questions:**
- Is it expected for `poetry install` to take some time? It takes about
30 minutes for this dev container to finish building in a Codespace, but
a user should only have to experience this once. Through some online
investigation, this doesn't seem unusual
- Versions of poetry newer than 1.3.2 failed every time - based on some
of the guidance in contributing.md and other online resources, it seemed
changing poetry versions might be a good solution. 1.3.2 is from Jan
2023
---------
Co-authored-by: bamurtaugh <brmurtau@microsoft.com>
Co-authored-by: Samruddhi Khandale <samruddhikhandale@github.com>
This PR refactors the ArxivAPIWrapper class making
`doc_content_chars_max` parameter optional. Additionally, tests have
been added to ensure the functionality of the doc_content_chars_max
parameter.
Fixes#6027 (issue)
There will likely be another change or two coming over the next couple
weeks as we stabilize the API, but putting this one in now which just
makes the integration a bit more flexible with the response output
format.
```
(langchain) danielking@MML-1B940F4333E2 langchain % pytest tests/integration_tests/llms/test_mosaicml.py tests/integration_tests/embeddings/test_mosaicml.py
=================================================================================== test session starts ===================================================================================
platform darwin -- Python 3.10.11, pytest-7.3.1, pluggy-1.0.0
rootdir: /Users/danielking/github/langchain
configfile: pyproject.toml
plugins: asyncio-0.20.3, mock-3.10.0, dotenv-0.5.2, cov-4.0.0, anyio-3.6.2
asyncio: mode=strict
collected 12 items
tests/integration_tests/llms/test_mosaicml.py ...... [ 50%]
tests/integration_tests/embeddings/test_mosaicml.py ...... [100%]
=================================================================================== slowest 5 durations ===================================================================================
4.76s call tests/integration_tests/llms/test_mosaicml.py::test_retry_logic
4.74s call tests/integration_tests/llms/test_mosaicml.py::test_mosaicml_llm_call
4.13s call tests/integration_tests/llms/test_mosaicml.py::test_instruct_prompt
0.91s call tests/integration_tests/llms/test_mosaicml.py::test_short_retry_does_not_loop
0.66s call tests/integration_tests/llms/test_mosaicml.py::test_mosaicml_extra_kwargs
=================================================================================== 12 passed in 19.70s ===================================================================================
```
#### Who can review?
@hwchase17
@dev2049
the current implement put the doc itself as the metadata, but the
document chatgpt plugin retriever returned already has a `metadata`
field, it's better to use that instead.
the original code will throw the following exception when using
`RetrievalQAWithSourcesChain`, becuse it can not find the field
`metadata`:
```python
Exception has occurred: ValueError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
Document prompt requires documents to have metadata variables: ['source']. Received document with missing metadata: ['source'].
File "/home/wangjie/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/combine_documents/base.py", line 27, in format_document
raise ValueError(
File "/home/wangjie/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py", line 65, in <listcomp>
doc_strings = [format_document(doc, self.document_prompt) for doc in docs]
File "/home/wangjie/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py", line 65, in _get_inputs
doc_strings = [format_document(doc, self.document_prompt) for doc in docs]
File "/home/wangjie/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py", line 85, in combine_docs
inputs = self._get_inputs(docs, **kwargs)
File "/home/wangjie/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/combine_documents/base.py", line 84, in _call
output, extra_return_dict = self.combine_docs(
File "/home/wangjie/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/base.py", line 140, in __call__
raise e
```
Additionally, the `metadata` filed in the `chatgpt plugin retriever`
have these fileds by default:
```json
{
"source": "file", //email, file or chat
"source_id": "filename.docx", // the filename
"url": "",
...
}
```
so, we should set `source_id` to `source` in the langchain metadata.
```python
metadata = d.pop("metadata", d)
if(metadata.get("source_id")):
metadata["source"] = metadata.pop("source_id")
```
#### Who can review?
@dev2049
<!-- For a quicker response, figure out the right person to tag with @
@hwchase17 - project lead
Tracing / Callbacks
- @agola11
Async
- @agola11
DataLoaders
- @eyurtsev
Models
- @hwchase17
- @agola11
Agents / Tools / Toolkits
- @vowelparrot
VectorStores / Retrievers / Memory
- @dev2049
-->
---------
Co-authored-by: wangjie <wangjie@htffund.com>
**Short Description**
Added a new argument to AutoGPT class which allows to persist the chat
history to a file.
**Changes**
1. Removed the `self.full_message_history: List[BaseMessage] = []`
2. Replaced it with `chat_history_memory` which can take any subclasses
of `BaseChatMessageHistory`
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
adding new loader for [acreom](https://acreom.com) vaults. It's based on
the Obsidian loader with some additional text processing for acreom
specific markdown elements.
@eyurtsev please take a look!
---------
Co-authored-by: rlm <pexpresss31@gmail.com>
Trying to call `ChatOpenAI.get_num_tokens_from_messages` returns the
following error for the newly announced models `gpt-3.5-turbo-0613` and
`gpt-4-0613`:
```
NotImplementedError: get_num_tokens_from_messages() is not presently implemented for model gpt-3.5-turbo-0613.See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens.
```
This adds support for counting tokens for those models, by counting
tokens the same way they're counted for the previous versions of
`gpt-3.5-turbo` and `gpt-4`.
#### reviewers
- @hwchase17
- @agola11
Confluence API supports difference format of page content. The storage
format is the raw XML representation for storage. The view format is the
HTML representation for viewing with macros rendered as though it is
viewed by users.
Add the `content_format` parameter to `ConfluenceLoader.load()` to
specify the content format, this is
set to `ContentFormat.STORAGE` by default.
#### Who can review?
Tag maintainers/contributors who might be interested: @eyurtsev
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
<!--
Thank you for contributing to LangChain! Your PR will appear in our
release under the title you set. Please make sure it highlights your
valuable contribution.
Replace this with a description of the change, the issue it fixes (if
applicable), and relevant context. List any dependencies required for
this change.
After you're done, someone will review your PR. They may suggest
improvements. If no one reviews your PR within a few days, feel free to
@-mention the same people again, as notifications can get lost.
Finally, we'd love to show appreciation for your contribution - if you'd
like us to shout you out on Twitter, please also include your handle!
-->
## Add Solidity programming language support for code splitter.
Twitter: @0xjord4n_
<!-- If you're adding a new integration, please include:
1. a test for the integration - favor unit tests that does not rely on
network access.
2. an example notebook showing its use
See contribution guidelines for more information on how to write tests,
lint
etc:
https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md
-->
#### Who can review?
Tag maintainers/contributors who might be interested:
@hwchase17
<!-- For a quicker response, figure out the right person to tag with @
@hwchase17 - project lead
Tracing / Callbacks
- @agola11
Async
- @agola11
DataLoaders
- @eyurtsev
Models
- @hwchase17
- @agola11
Agents / Tools / Toolkits
- @hwchase17
VectorStores / Retrievers / Memory
- @dev2049
-->
<!--
Thank you for contributing to LangChain! Your PR will appear in our
release under the title you set. Please make sure it highlights your
valuable contribution.
Replace this with a description of the change, the issue it fixes (if
applicable), and relevant context. List any dependencies required for
this change.
After you're done, someone will review your PR. They may suggest
improvements. If no one reviews your PR within a few days, feel free to
@-mention the same people again, as notifications can get lost.
Finally, we'd love to show appreciation for your contribution - if you'd
like us to shout you out on Twitter, please also include your handle!
-->
<!-- Remove if not applicable -->
Fixes # (issue)
#### Before submitting
<!-- If you're adding a new integration, please include:
1. a test for the integration - favor unit tests that does not rely on
network access.
2. an example notebook showing its use
See contribution guidelines for more information on how to write tests,
lint
etc:
https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md
-->
#### Who can review?
Tag maintainers/contributors who might be interested:
<!-- For a quicker response, figure out the right person to tag with @
@hwchase17 - project lead
Tracing / Callbacks
- @agola11
Async
- @agola11
DataLoaders
- @eyurtsev
Models
- @hwchase17
- @agola11
Agents / Tools / Toolkits
- @hwchase17
VectorStores / Retrievers / Memory
- @dev2049
-->
<!--
Thank you for contributing to LangChain! Your PR will appear in our
release under the title you set. Please make sure it highlights your
valuable contribution.
Replace this with a description of the change, the issue it fixes (if
applicable), and relevant context. List any dependencies required for
this change.
After you're done, someone will review your PR. They may suggest
improvements. If no one reviews your PR within a few days, feel free to
@-mention the same people again, as notifications can get lost.
Finally, we'd love to show appreciation for your contribution - if you'd
like us to shout you out on Twitter, please also include your handle!
-->
<!-- Remove if not applicable -->
Fixes # (issue)
#### Before submitting
<!-- If you're adding a new integration, please include:
1. a test for the integration - favor unit tests that does not rely on
network access.
2. an example notebook showing its use
See contribution guidelines for more information on how to write tests,
lint
etc:
https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md
-->
#### Who can review?
Tag maintainers/contributors who might be interested:
<!-- For a quicker response, figure out the right person to tag with @
@hwchase17 - project lead
Tracing / Callbacks
- @agola11
Async
- @agola11
DataLoaders
- @eyurtsev
Models
- @hwchase17
- @agola11
Agents / Tools / Toolkits
- @vowelparrot
VectorStores / Retrievers / Memory
- @dev2049
-->
This adds implementation of MMR search in pinecone; and I have two
semi-related observations about this vector store class:
- Maybe we should also have a
`similarity_search_by_vector_returning_embeddings` like in supabase, but
it's not in the base `VectorStore` class so I didn't implement
- Talking about the base class, there's
`similarity_search_with_relevance_scores`, but in pinecone it is called
`similarity_search_with_score`; maybe we should consider renaming it to
align with other `VectorStore` base and sub classes (or add that as an
alias for backward compatibility)
#### Who can review?
Tag maintainers/contributors who might be interested:
- VectorStores / Retrievers / Memory - @dev2049