### Description
Add multiple language support to Anonymizer
PII detection in Microsoft Presidio relies on several components - in
addition to the usual pattern matching (e.g. using regex), the analyser
uses a model for Named Entity Recognition (NER) to extract entities such
as:
- `PERSON`
- `LOCATION`
- `DATE_TIME`
- `NRP`
- `ORGANIZATION`
[[Source]](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/spacy_recognizer.py)
To handle NER in specific languages, we utilize unique models from the
`spaCy` library, recognized for its extensive selection covering
multiple languages and sizes. However, it's not restrictive, allowing
for integration of alternative frameworks such as
[Stanza](https://microsoft.github.io/presidio/analyzer/nlp_engines/spacy_stanza/)
or
[transformers](https://microsoft.github.io/presidio/analyzer/nlp_engines/transformers/)
when necessary.
### Future works
- **automatic language detection** - instead of passing the language as
a parameter in `anonymizer.anonymize`, we could detect the language/s
beforehand and then use the corresponding NER model. We have discussed
this internally and @mateusz-wosinski-ds will look into a standalone
language detection tool/chain for LangChain 😄
### Twitter handle
@deepsense_ai / @MaksOpp
### Tag maintainer
@baskaryan @hwchase17 @hinthornw
- Description: Adding support for self-querying to Vectara integration
- Issue: per customer request
- Tag maintainer: @rlancemartin @baskaryan
- Twitter handle: @ofermend
Also updated some documentation, added self-query testing, and a demo
notebook with self-query example.
### Description
The feature for pseudonymizing data with ability to retrieve original
text (deanonymization) has been implemented. In order to protect private
data, such as when querying external APIs (OpenAI), it is worth
pseudonymizing sensitive data to maintain full privacy. But then, after
the model response, it would be good to have the data in the original
form.
I implemented the `PresidioReversibleAnonymizer`, which consists of two
parts:
1. anonymization - it works the same way as `PresidioAnonymizer`, plus
the object itself stores a mapping of made-up values to original ones,
for example:
```
{
"PERSON": {
"<anonymized>": "<original>",
"John Doe": "Slim Shady"
},
"PHONE_NUMBER": {
"111-111-1111": "555-555-5555"
}
...
}
```
2. deanonymization - using the mapping described above, it matches fake
data with original data and then substitutes it.
Between anonymization and deanonymization user can perform different
operations, for example, passing the output to LLM.
### Future works
- **instance anonymization** - at this point, each occurrence of PII is
treated as a separate entity and separately anonymized. Therefore, two
occurrences of the name John Doe in the text will be changed to two
different names. It is therefore worth introducing support for full
instance detection, so that repeated occurrences are treated as a single
object.
- **better matching and substitution of fake values for real ones** -
currently the strategy is based on matching full strings and then
substituting them. Due to the indeterminism of language models, it may
happen that the value in the answer is slightly changed (e.g. *John Doe*
-> *John* or *Main St, New York* -> *New York*) and such a substitution
is then no longer possible. Therefore, it is worth adjusting the
matching for your needs.
- **Q&A with anonymization** - when I'm done writing all the
functionality, I thought it would be a cool resource in documentation to
write a notebook about retrieval from documents using anonymization. An
iterative process, adding new recognizers to fit the data, lessons
learned and what to look out for
### Twitter handle
@deepsense_ai / @MaksOpp
---------
Co-authored-by: MaksOpp <maks.operlejn@gmail.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Squashed from #7454 with updated features
We have separated the `SQLDatabseChain` from `VectorSQLDatabseChain` and
put everything into `experimental/`.
Below is the original PR message from #7454.
-------
We have been working on features to fill up the gap among SQL, vector
search and LLM applications. Some inspiring works like self-query
retrievers for VectorStores (for example
[Weaviate](https://python.langchain.com/en/latest/modules/indexes/retrievers/examples/weaviate_self_query.html)
and
[others](https://python.langchain.com/en/latest/modules/indexes/retrievers/examples/self_query.html))
really turn those vector search databases into a powerful knowledge
base! 🚀🚀
We are thinking if we can merge all in one, like SQL and vector search
and LLMChains, making this SQL vector database memory as the only source
of your data. Here are some benefits we can think of for now, maybe you
have more 👀:
With ALL data you have: since you store all your pasta in the database,
you don't need to worry about the foreign keys or links between names
from other data source.
Flexible data structure: Even if you have changed your schema, for
example added a table, the LLM will know how to JOIN those tables and
use those as filters.
SQL compatibility: We found that vector databases that supports SQL in
the marketplace have similar interfaces, which means you can change your
backend with no pain, just change the name of the distance function in
your DB solution and you are ready to go!
### Issue resolved:
- [Feature Proposal: VectorSearch enabled
SQLChain?](https://github.com/hwchase17/langchain/issues/5122)
### Change made in this PR:
- An improved schema handling that ignore `types.NullType` columns
- A SQL output Parser interface in `SQLDatabaseChain` to enable Vector
SQL capability and further more
- A Retriever based on `SQLDatabaseChain` to retrieve data from the
database for RetrievalQAChains and many others
- Allow `SQLDatabaseChain` to retrieve data in python native format
- Includes PR #6737
- Vector SQL Output Parser for `SQLDatabaseChain` and
`SQLDatabaseChainRetriever`
- Prompts that can implement text to VectorSQL
- Corresponding unit-tests and notebook
### Twitter handle:
- @MyScaleDB
### Tag Maintainer:
Prompts / General: @hwchase17, @baskaryan
DataLoaders / VectorStores / Retrievers: @rlancemartin, @eyurtsev
### Dependencies:
No dependency added
# Description
This pull request allows to use the
[NucliaDB](https://docs.nuclia.dev/docs/docs/nucliadb/intro) as a vector
store in LangChain.
It works with both a [local NucliaDB
instance](https://docs.nuclia.dev/docs/docs/nucliadb/deploy/basics) or
with [Nuclia Cloud](https://nuclia.cloud).
# Dependencies
It requires an up-to-date version of the `nuclia` Python package.
@rlancemartin, @eyurtsev, @hinthornw, please review it when you have a
moment :)
Note: our Twitter handler is `@NucliaAI`
This PR replaces the generic `SET search_path TO` statement by `USE` for
the Trino dialect since Trino does not support `SET search_path`.
Official Trino documentation can be found
[here](https://trino.io/docs/current/sql/use.html).
With this fix, the `SQLdatabase` will now be able to set the current
schema and execute queries using the Trino engine. It will use the
catalog set as default by the connection uri.
- Description: Remove hardcoded/duplicated distance strategies in the
PGVector store.
- Issue: NA
- Dependencies: NA
- Tag maintainer: for a quicker response, tag the relevant maintainer
(see below),
- Twitter handle: @archmonkeymojo
---------
Co-authored-by: Bagatur <baskaryan@gmail.com>
- Description: Updated Additional Resources section of documentation and
added to YouTube videos with excellent playlist of Langchain content
from Sam Witteveen
- Issue: None -- updating documentation
- Dependencies: None
- Tag maintainer: @baskaryan
I have updated the code to ensure consistent error handling for
ImportError. Instead of relying on ValueError as before, I've followed
the standard practice of raising ImportError while also including
detailed error messages. This modification improves code clarity and
explicitly indicates that any issues are related to module imports.
Follow-up PR for https://github.com/langchain-ai/langchain/pull/10047,
simply adding a notebook quickstart example for the vector store with
SQLite, using the class SQLiteVSS.
Maintainer tag @baskaryan
Co-authored-by: Philippe Oger <philippe.oger@adevinta.com>
`mypy` cannot type-check code that relies on dependencies that aren't
installed.
Eventually we'll probably want to install as many optional dependencies
as possible. However, the full "extended deps" setup for langchain
creates a 3GB cache file and takes a while to unpack and install. We'll
probably want something a bit more targeted.
This is a first step toward something better.
A test file was accidentally dropping a `results.json` file in the
current working directory as a result of running `make test`.
This is undesirable, since we don't want to risk accidentally adding
stray files into the repo if we run tests locally and then do `git add
.` without inspecting the file list very closely.
Makes it easier to do recursion using regular python compositional
patterns
```py
def lambda_decorator(func):
"""Decorate function as a RunnableLambda"""
return runnable.RunnableLambda(func)
@lambda_decorator
def fibonacci(a, config: runnable.RunnableConfig) -> int:
if a <= 1:
return a
else:
return fibonacci.invoke(
a - 1, config
) + fibonacci.invoke(a - 2, config)
fibonacci.invoke(10)
```
https://smith.langchain.com/public/cb98edb4-3a09-4798-9c22-a930037faf88/r
Also makes it more natural to do things like error handle and call other
langchain objects in ways we probably don't want to support in
`with_fallbacks()`
```py
@lambda_decorator
def handle_errors(a, config: runnable.RunnableConfig) -> int:
try:
return my_chain.invoke(a, config)
except MyExceptionType as exc:
return my_other_chain.invoke({"original": a, "error": exc}, config)
```
In this case, the next chain takes in the exception object. Maybe this
could be something we toggle in `with_fallbacks` but I fear we'll get
into uglier APIs + heavier cognitive load if we try to do too much there
---------
Co-authored-by: Nuno Campos <nuno@boringbits.io>
- Description: Fix bug in SPARQL intent selection
- Issue: After the change in #7758 the intent is always set to "UPDATE".
Indeed, if the answer to the prompt contains only "SELECT" the
`find("SELECT")` operation returns a higher value w.r.t. `-1` returned
by `find("UPDATE")`.
- Dependencies: None,
- Tag maintainer: @baskaryan @aditya-29
- Twitter handle: @mario_scrock
It seems the caching action was not always correctly recreating
softlinks. At first glance, the softlinks it created seemed fine, but
they didn't always work. Possibly hitting some kind of underlying bug,
but not particularly worth debugging in depth -- we can manually create
the soft links we need.
- Revert "Temporarily disable step that seems to be transiently failing.
(#10234)"
- Refresh shell hashtable and show poetry/python location and version.
Make sure that changes to CI infrastructure get tested on CI before
being merged.
Without this PR, changes to the poetry setup action don't trigger a CI
run and in principle could break `master` when merged.
Text Generation Inference's client permits the use of a None temperature
as seen
[here](033230ae66/clients/python/text_generation/client.py (L71C9-L71C20)).
While I haved dived into TGI's server code and don't know about the
implications of using None as a temperature setting, I think we should
grant users the option to pass None as a temperature parameter to TGI.
#9304 introduced a critical bug. The S3DirectoryLoader fails completely
because boto3 checks the naming of kw arguments and one of the args is
badly named (very sorry for that)
cc @baskaryan
Changes in:
- `create_sql_agent` function so that user can easily add custom tools
as complement for the toolkit.
- updating **sql use case** notebook to showcase 2 examples of extra
tools.
Motivation for these changes is having the possibility of including
domain expert knowledge to the agent, which improves accuracy and
reduces time/tokens.
---------
Co-authored-by: Manuel Soria <manuel.soria@greyscaleai.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
## Description
### Issue
This pull request addresses a lingering issue identified in PR #7070. In
that previous pull request, an attempt was made to address the problem
of empty embeddings when using the `OpenAIEmbeddings` class. While PR
#7070 introduced a mechanism to retry requests for embeddings, it didn't
fully resolve the issue as empty embeddings still occasionally
persisted.
### Problem
In certain specific use cases, empty embeddings can be encountered when
requesting data from the OpenAI API. In some cases, these empty
embeddings can be skipped or removed without affecting the functionality
of the application. However, they might not always be resolved through
retries, and their presence can adversely affect the functionality of
applications relying on the `OpenAIEmbeddings` class.
### Solution
To provide a more robust solution for handling empty embeddings, we
propose the introduction of an optional parameter, `skip_empty`, in the
`OpenAIEmbeddings` class. When set to `True`, this parameter will enable
the behavior of automatically skipping empty embeddings, ensuring that
problematic empty embeddings do not disrupt the processing flow. The
developer will be able to optionally toggle this behavior if needed
without disrupting the application flow.
## Changes Made
- Added an optional parameter, `skip_empty`, to the `OpenAIEmbeddings`
class.
- When `skip_empty` is set to `True`, empty embeddings are automatically
skipped without causing errors or disruptions.
### Example Usage
```python
from openai.embeddings import OpenAIEmbeddings
# Initialize the OpenAIEmbeddings class with skip_empty=True
embeddings = OpenAIEmbeddings(api_key="your_api_key", skip_empty=True)
# Request embeddings, empty embeddings are automatically skipped. docs is a variable containing the already splitted text.
results = embeddings.embed_documents(docs)
# Process results without interruption from empty embeddings
```
- Description:
Add a 'download_dir' argument to VLLM model (to change the cache
download directotu when retrieving a model from HF hub)
- Issue:
On some remote machine, I want the cache dir to be in a volume where I
have space (models are heavy nowadays). Sometimes the default HF cache
dir might not be what we want.
- Dependencies:
None
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>