This adds sqlite-vss as an option for a vector database. Contains the
code and a few tests. Tests are passing and the library sqlite-vss is
added as optional as explained in the contributing guidelines. I
adjusted the code for lint/black/ and mypy. It looks that everything is
currently passing.
Adding sqlite-vss was mentioned in this issue:
https://github.com/langchain-ai/langchain/issues/1019.
Also mentioned here in the sqlite-vss repo for the curious:
https://github.com/asg017/sqlite-vss/issues/66
Maintainer tag: @baskaryan
---------
Co-authored-by: Philippe Oger <philippe.oger@adevinta.com>
This PR fixes an issues I found when upgrading to a more recent version
of Langchain. I was using 0.0.142 before, and this issue popped up
already when the `_custom_parser` was added to `output_parsers/json`.
Anyway, the issue is that the parser tries to escape quotes when they
are double-escaped (e.g. `\\"`), leading to OutputParserException.
This is particularly undesired in my app, because I have an Agent that
uses a single input Tool, which expects as input a JSON string with the
structure:
```python
{
"foo": string,
"bar": string
}
```
The LLM (GPT3.5) response is (almost) always something like
`"action_input": "{\\"foo\\": \\"bar\\", \\"bar\\": \\"foo\\"}"` and
since the upgrade this is not correctly parsed.
---------
Co-authored-by: taamedag <Davide.Menini@swisscom.com>
Adds a call to Pydantic's `update_forward_refs` for the `Run` class (in
addition to the `ChainRun` and `ToolRun` classes, for which that method
is already called). Without it, the self-reference of child classes
(type `List[Run]`) is problematic. For example:
```python
from langchain.callbacks import StdOutCallbackHandler
from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from wandb.integration.langchain import WandbTracer
llm = OpenAI()
prompt = PromptTemplate.from_template("1 + {number} = ")
chain = LLMChain(llm=llm, prompt=prompt, callbacks=[StdOutCallbackHandler(), WandbTracer()])
print(chain.run(number=2))
```
results in the following output before the change
```
WARNING:root:Error in on_chain_start callback: field "child_runs" not yet prepared so type is still a ForwardRef, you might need to call Run.update_forward_refs().
> Entering new LLMChain chain...
Prompt after formatting:
1 + 2 =
WARNING:root:Error in on_chain_end callback: No chain Run found to be traced
> Finished chain.
3
```
but afterwards the callback error messages are gone.
Hi there!
I'm excited to open this PR to add support for using 'Tencent Cloud
VectorDB' as a vector store.
Tencent Cloud VectorDB is a fully-managed, self-developed,
enterprise-level distributed database service designed for storing,
retrieving, and analyzing multi-dimensional vector data. The database
supports multiple index types and similarity calculation methods, with a
single index supporting vector scales up to 1 billion and capable of
handling millions of QPS with millisecond-level query latency. Tencent
Cloud VectorDB not only provides external knowledge bases for large
models to improve their accuracy, but also has wide applications in AI
fields such as recommendation systems, NLP services, computer vision,
and intelligent customer service.
The PR includes:
Implementation of Vectorstore.
I have read your [contributing
guidelines](72b7d76d79/.github/CONTRIBUTING.md).
And I have passed the tests below
make format
make lint
make coverage
make test
This PR brings structural updates to `PlaywrightURLLoader`, aiming at
making the code more readable and extensible through the abstraction of
page evaluation logic. These changes also align this implementation with
a similar structure used in LangChain.js.
The key enhancements include:
1. Introduction of 'PlaywrightEvaluator', an abstract base class for all
evaluators.
2. Creation of 'UnstructuredHtmlEvaluator', a concrete class
implementing 'PlaywrightEvaluator', which uses `unstructured` library
for processing page's HTML content.
3. Extension of 'PlaywrightURLLoader' constructor to optionally accept
an evaluator of the type 'PlaywrightEvaluator'. It defaults to
'UnstructuredHtmlEvaluator' if no evaluator is provided.
4. Refactoring of 'load' and 'aload' methods to use the 'evaluate' and
'evaluate_async' methods of the provided 'PageEvaluator' for page
content handling.
This update brings flexibility to 'PlaywrightURLLoader' as it can now
utilize different evaluators for page processing depending on the
requirement. The abstraction also improves code maintainability and
readability.
Twitter: @ywkim
- Description: Add bloomz_7b, llama-2-7b, llama-2-13b, llama-2-70b to
ErnieBotChat, which only supported ERNIE-Bot-turbo and ERNIE-Bot.
- Issue: #10022,
- Dependencies: no extra dependencies
---------
Co-authored-by: hetianfeng <hetianfeng@meituan.com>
### Description
The feature for anonymizing data has been implemented. In order to
protect private data, such as when querying external APIs (OpenAI), it
is worth pseudonymizing sensitive data to maintain full privacy.
Anonynization consists of two steps:
1. **Identification:** Identify all data fields that contain personally
identifiable information (PII).
2. **Replacement**: Replace all PIIs with pseudo values or codes that do
not reveal any personal information about the individual but can be used
for reference. We're not using regular encryption, because the language
model won't be able to understand the meaning or context of the
encrypted data.
We use *Microsoft Presidio* together with *Faker* framework for
anonymization purposes because of the wide range of functionalities they
provide. The full implementation is available in `PresidioAnonymizer`.
### Future works
- **deanonymization** - add the ability to reverse anonymization. For
example, the workflow could look like this: `anonymize -> LLMChain ->
deanonymize`. By doing this, we will retain anonymity in requests to,
for example, OpenAI, and then be able restore the original data.
- **instance anonymization** - at this point, each occurrence of PII is
treated as a separate entity and separately anonymized. Therefore, two
occurrences of the name John Doe in the text will be changed to two
different names. It is therefore worth introducing support for full
instance detection, so that repeated occurrences are treated as a single
object.
### Twitter handle
@deepsense_ai / @MaksOpp
---------
Co-authored-by: MaksOpp <maks.operlejn@gmail.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
- Description: this PR adds `s3_object_key` and `s3_bucket` to the doc
metadata when loading an S3 file. This is particularly useful when using
`S3DirectoryLoader` to remove the files from the dir once they have been
processed (getting the object keys from the metadata `source` field
seems brittle)
- Dependencies: N/A
- Tag maintainer: ?
- Twitter handle: _cbornet
---------
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
This PR makes the following changes:
1. Documents become serializable using langhchain serialization
2. Make a utility to create a docstore kw store
Will help to address issue here:
https://github.com/langchain-ai/langchain/issues/9345
In the function _load_run_evaluators the function _get_keys was not
called if only custom_evaluators parameter is used
- Description: In the function _load_run_evaluators the function
_get_keys was not called if only custom_evaluators parameter is used,
- Issue: no issue created for this yet,
- Dependencies: None,
- Tag maintainer: @vowelparrot,
- Twitter handle: Buckler89
---------
Co-authored-by: ddroghini <d.droghini@mflgroup.com>
Description: This commit uses the new Service object in Selenium
webdriver as executable_path has been [deprecated and removed in
selenium version
4.11.2](9f5801c82f)
Issue: https://github.com/langchain-ai/langchain/issues/9808
Tag Maintainer: @eyurtsev
- Description: In my previous PR, I had modified the code to catch all
kinds of [SOURCES, sources, Source, Sources]. However, this change
included checking for a colon or a white space which should actually
have been only checking for a colon.
- Issue: the issue # it fixes (if applicable),
- Dependencies: any dependencies required for this change,
Adds support for [llmonitor](https://llmonitor.com) callbacks.
It enables:
- Requests tracking / logging / analytics
- Error debugging
- Cost analytics
- User tracking
Let me know if anythings neds to be changed for merge.
Thank you!
Co-authored-by: Daniel Brenot <dbrenot@pelmorex.com>
Co-authored-by: Daniel <daniel.alexander.brenot@gmail.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
- Description: the implementation for similarity_search_with_score did
not actually include a score or logic to filter. Now fixed.
- Tag maintainer: @rlancemartin
- Twitter handle: @ofermend
Recently we made the decision that PromptGuard takes a list of strings
instead of a string.
@ggroode implemented the integration change.
---------
Co-authored-by: ggroode <ggroode@berkeley.edu>
Co-authored-by: ggroode <46691276+ggroode@users.noreply.github.com>
Clearly document that the PAL and CPAL techniques involve generating
code, and that such code must be properly sandboxed and given
appropriate narrowly-scoped credentials in order to ensure security.
While our implementations include some mitigations, Python and SQL
sandboxing is well-known to be a very hard problem and our mitigations
are no replacement for proper sandboxing and permissions management. The
implementation of such techniques must be performed outside the scope of
the Python process where this package's code runs, so its correct setup
and administration must therefore be the responsibility of the user of
this code.
- Description: added the _cosine_relevance_score_fn to
_select_relevance_score_fn of faiss.py to enable the use of cosine
distance for similarity for this vector store and to comply with the
Error Message, that implies, that cosine should be a valid distance
strategy
- Issue: no relevant Issue found, but needed this function myself and
tested it in a private repo
- Dependencies: none
Neo4j has added vector index integration just recently. To allow both
ingestion and integrating it as vector RAG applications, I wrapped it as
a vector store as the implementation is completely different from
`GraphCypherQAChain`. Here, we are not generating any Cypher statements
at query time, we are simply doing the vector similarity search using
the new vector index as if we were dealing with a vector database.
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>