langchain

mirror of https://github.com/hwchase17/langchain synced 2024-11-10 01:10:59 +00:00

Sam Partee a28eea5767 Redis metadata filtering and specification, index customization (#8612 )

### Description

The previous Redis implementation did not allow the user to specify the index configuration (i.e. changing the underlying algorithm) or to add additional metadata to use for querying (i.e. hybrid or "filtered" search). This PR introduces the ability to specify custom index attributes and metadata attributes, as well as to use that metadata in filtered queries. Overall, more structure was introduced to the Redis implementation, which should allow for easier maintainability moving forward.

# New Features

The following features are now available with the Redis integration into Langchain.

## Index schema generation

The schema for the index will now be automatically generated if not specified by the user. For example, the data above has multiple metadata categories. Take the following example:

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.redis import Redis

embeddings = OpenAIEmbeddings()

rds, keys = Redis.from_texts_return_keys(
    texts,
    embeddings,
    metadatas=metadata,
    redis_url="redis://localhost:6379",
    index_name="users"
)
```

Loading the data in through this and the other ``from_documents`` and ``from_texts`` methods will now generate an index schema in Redis like the following. View the index schema with the ``redisvl`` tool.
[link](redisvl.com)

```bash
$ rvl index info -i users
```

Index Information:

| Index Name | Storage Type | Prefixes      | Index Options | Indexing |
|------------|--------------|---------------|---------------|----------|
| users      | HASH         | ['doc:users'] | []            | 0        |

Index Fields:

| Name           | Attribute      | Type    | Field Option | Option Value |
|----------------|----------------|---------|--------------|--------------|
| user           | user           | TEXT    | WEIGHT       | 1            |
| job            | job            | TEXT    | WEIGHT       | 1            |
| credit_score   | credit_score   | TEXT    | WEIGHT       | 1            |
| content        | content        | TEXT    | WEIGHT       | 1            |
| age            | age            | NUMERIC |              |              |
| content_vector | content_vector | VECTOR  |              |              |

### Custom Metadata specification

The metadata schema generation has the following rules:

1. All text fields are indexed as text fields.
2. All numeric fields are indexed as numeric fields.

To index a text field as a tag field, specify overrides like the following for the example data:

```python
# this can also be a path to a yaml file
index_schema = {
    "text": [{"name": "user"}, {"name": "job"}],
    "tag": [{"name": "credit_score"}],
    "numeric": [{"name": "age"}],
}

rds, keys = Redis.from_texts_return_keys(
    texts,
    embeddings,
    metadatas=metadata,
    redis_url="redis://localhost:6379",
    index_name="users2",
    index_schema=index_schema
)
```

This will change the index specification to

Index Information:

| Index Name | Storage Type | Prefixes       | Index Options | Indexing |
|------------|--------------|----------------|---------------|----------|
| users2     | HASH         | ['doc:users2'] | []            | 0        |

Index Fields:

| Name           | Attribute      | Type    | Field Option | Option Value |
|----------------|----------------|---------|--------------|--------------|
| user           | user           | TEXT    | WEIGHT       | 1            |
| job            | job            | TEXT    | WEIGHT       | 1            |
| content        | content        | TEXT    | WEIGHT       | 1            |
| credit_score   | credit_score   | TAG     | SEPARATOR    | ,            |
| age            | age            | NUMERIC |              |              |
| content_vector | content_vector | VECTOR  |              |              |

and log a warning to the user that the generated schema does not match the specified schema.

```text
index_schema does not match generated schema from metadata.
index_schema: {'text': [{'name': 'user'}, {'name': 'job'}], 'tag': [{'name': 'credit_score'}], 'numeric': [{'name': 'age'}]}
generated_schema: {'text': [{'name': 'user'}, {'name': 'job'}, {'name': 'credit_score'}], 'numeric': [{'name': 'age'}]}
```

As long as the mismatch is intentional, this is fine.

The schema can be defined as a yaml file or a dictionary:

```yaml
text:
  - name: user
  - name: job
tag:
  - name: credit_score
numeric:
  - name: age
```

and you pass in a path like

```python
from pathlib import Path

rds, keys = Redis.from_texts_return_keys(
    texts,
    embeddings,
    metadatas=metadata,
    redis_url="redis://localhost:6379",
    index_name="users3",
    index_schema=Path("sample1.yml").resolve()
)
```

which will create the same schema as defined in the dictionary example.

Index Information:

| Index Name | Storage Type | Prefixes       | Index Options | Indexing |
|------------|--------------|----------------|---------------|----------|
| users3     | HASH         | ['doc:users3'] | []            | 0        |

Index Fields:

| Name           | Attribute      | Type    | Field Option | Option Value |
|----------------|----------------|---------|--------------|--------------|
| user           | user           | TEXT    | WEIGHT       | 1            |
| job            | job            | TEXT    | WEIGHT       | 1            |
| content        | content        | TEXT    | WEIGHT       | 1            |
| credit_score   | credit_score   | TAG     | SEPARATOR    | ,            |
| age            | age            | NUMERIC |              |              |
| content_vector | content_vector | VECTOR  |              |              |

### Custom Vector Indexing Schema

Users with large use cases may want to change how the vector index created by Langchain is formulated. To utilize all the features of Redis for vector database use cases like this, you can now pass in index attribute modifiers, such as changing the indexing algorithm to HNSW.
```python
vector_schema = {
    "algorithm": "HNSW"
}

rds, keys = Redis.from_texts_return_keys(
    texts,
    embeddings,
    metadatas=metadata,
    redis_url="redis://localhost:6379",
    index_name="users3",
    vector_schema=vector_schema
)
```

A more complex example may look like

```python
vector_schema = {
    "algorithm": "HNSW",
    "ef_construction": 200,
    "ef_runtime": 20
}

rds, keys = Redis.from_texts_return_keys(
    texts,
    embeddings,
    metadatas=metadata,
    redis_url="redis://localhost:6379",
    index_name="users3",
    vector_schema=vector_schema
)
```

All names correspond to the arguments you would set if using Redis-py or RedisVL. (put in doc link later)

### Better Querying

Both vector queries and Range (limit) queries are now available, and metadata is returned by default. The outputs are shown.

```python
>>> query = "foo"
>>> results = rds.similarity_search(query, k=1)
>>> print(results)
[Document(page_content='foo', metadata={'user': 'derrick', 'job': 'doctor', 'credit_score': 'low', 'age': '14', 'id': 'doc:users:657a47d7db8b447e88598b83da879b9d', 'score': '7.15255737305e-07'})]

>>> results = rds.similarity_search_with_score(query, k=1, return_metadata=False)
>>> print(results) # no metadata, but with scores
[(Document(page_content='foo', metadata={}), 7.15255737305e-07)]

>>> results = rds.similarity_search_limit_score(query, k=6, score_threshold=0.0001)
>>> print(len(results)) # range query (only above threshold even if k is higher)
4
```

### Custom metadata filtering

A big advantage of Redis in this space is being able to filter on data stored alongside the vector itself. With the example above, the following is now possible in langchain. The equivalence operators are overridden to describe a new expression language that mimics that of [redisvl](redisvl.com). This allows for arbitrarily long sequences of filters, resembling SQL commands, that can be used directly with vector queries and range queries.

There are two interfaces by which to do so and both are shown.
```python
>>> from langchain.vectorstores.redis import RedisFilter, RedisNum, RedisText

>>> age_filter = RedisFilter.num("age") > 18
>>> age_filter = RedisNum("age") > 18 # equivalent
>>> results = rds.similarity_search(query, filter=age_filter)
>>> print(len(results))
3

>>> job_filter = RedisFilter.text("job") == "engineer"
>>> job_filter = RedisText("job") == "engineer" # equivalent
>>> results = rds.similarity_search(query, filter=job_filter)
>>> print(len(results))
2

# fuzzy match text search
>>> job_filter = RedisFilter.text("job") % "eng*"
>>> results = rds.similarity_search(query, filter=job_filter)
>>> print(len(results))
2

# combined filters (AND)
>>> combined = age_filter & job_filter
>>> results = rds.similarity_search(query, filter=combined)
>>> print(len(results))
1

# combined filters (OR)
>>> combined = age_filter | job_filter
>>> results = rds.similarity_search(query, filter=combined)
>>> print(len(results))
4
```

All the above filter results can be checked against the data above.

### Other

- Issue: #3967
- Dependencies: No added dependencies
- Tag maintainer: @hwchase17 @baskaryan @rlancemartin
- Twitter handle: @sampartee

---------

Co-authored-by: Naresh Rangan <naresh.rangan0@walmart.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>

2023-08-25 17:22:50 -07:00
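The commit message above says the comparison operators are overridden to form a filter expression language. As a rough illustration of that general technique only, the sketch below shows how Python operator overloading can build composable filter strings. The class names `Expr`, `NumField`, and `TextField` are hypothetical and are not the classes added by this PR; the rendered strings assume RediSearch-style query syntax (`@field:[(min +inf]` for an exclusive numeric lower bound, space for AND, `|` for OR).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """A rendered filter fragment that can be combined with & and |."""

    query: str

    def __and__(self, other: "Expr") -> "Expr":
        # Assumed syntax: space-separated clauses form an intersection.
        return Expr(f"({self.query} {other.query})")

    def __or__(self, other: "Expr") -> "Expr":
        # Assumed syntax: "|" between clauses forms a union.
        return Expr(f"({self.query} | {other.query})")


class NumField:
    """Hypothetical numeric field; `>` yields an Expr instead of a bool."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __gt__(self, value: float) -> Expr:
        # "(" before the bound marks it as exclusive.
        return Expr(f"@{self.name}:[({value} +inf]")


class TextField:
    """Hypothetical text field; `==` yields an Expr instead of a bool."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __eq__(self, value: object) -> Expr:  # type: ignore[override]
        return Expr(f'@{self.name}:("{value}")')


age_filter = NumField("age") > 18
job_filter = TextField("job") == "engineer"
print((age_filter & job_filter).query)
print((age_filter | job_filter).query)
```

Returning an expression object from `__gt__` and `__eq__` rather than a boolean is what lets filters be chained with `&` and `|` before any query is sent; the actual langchain classes may render and validate their expressions differently.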
..
faiss_index	mv module integrations docs (#8101 )	2023-07-23 23:23:16 -07:00
activeloop_deeplake.ipynb	Bagatur/deeplake docs fixes (#9275 )	2023-08-15 15:56:36 -07:00
alibabacloud_opensearch.ipynb	mv module integrations docs (#8101 )	2023-07-23 23:23:16 -07:00
analyticdb.ipynb	mv module integrations docs (#8101 )	2023-07-23 23:23:16 -07:00
annoy.ipynb	mv module integrations docs (#8101 )	2023-07-23 23:23:16 -07:00
atlas.ipynb	Updates to Nomic Atlas and GPT4All documentation (#9414 )	2023-08-23 17:49:44 -07:00
awadb.ipynb	mv module integrations docs (#8101 )	2023-07-23 23:23:16 -07:00
azuresearch.ipynb	Azure Cognitive Search - update sdk b8, mod user agent, search with scores (#9191 )	2023-08-25 02:34:09 -07:00
bageldb.ipynb	BagelDB (bageldb.ai), VectorStore integration. (#8971 )	2023-08-10 16:48:36 -07:00
cassandra.ipynb	mv module integrations docs (#8101 )	2023-07-23 23:23:16 -07:00
chroma.ipynb	fix error in chroma docker instructions (#8533 )	2023-07-31 16:32:53 -07:00
clarifai.ipynb	Improvements to the Clarifai integration (#9290 )	2023-08-21 12:53:36 -07:00
clickhouse.ipynb	docs: `integrations/providers` (#9631 )	2023-08-22 20:28:11 -07:00
dashvector.ipynb	Add dashvector vectorstore (#9163 )	2023-08-15 16:19:30 -07:00
dingo.ipynb	fix: max_marginal_relevance_search and docs in Dingo (#9244 )	2023-08-15 01:06:06 -07:00
docarray_hnsw.ipynb	docs: `integrations/providers` (#9631 )	2023-08-22 20:28:11 -07:00
docarray_in_memory.ipynb	docs: `integrations/providers` (#9631 )	2023-08-22 20:28:11 -07:00
elasticsearch.ipynb	ElasticsearchStore: improve error logging for adding documents (#9648 )	2023-08-23 07:04:09 -07:00
epsilla.ipynb	add Epsilla vectorstore (#9239 )	2023-08-21 12:51:15 -07:00
faiss.ipynb	docs: `integrations/providers` (#9631 )	2023-08-22 20:28:11 -07:00
hologres.ipynb	mv module integrations docs (#8101 )	2023-07-23 23:23:16 -07:00
index.mdx	mv module integrations docs (#8101 )	2023-07-23 23:23:16 -07:00
lancedb.ipynb	mv module integrations docs (#8101 )	2023-07-23 23:23:16 -07:00
marqo.ipynb	mv module integrations docs (#8101 )	2023-07-23 23:23:16 -07:00
matchingengine.ipynb	docs: `integrations/providers` (#9631 )	2023-08-22 20:28:11 -07:00
meilisearch.ipynb	docs: `integrations/providers` (#9631 )	2023-08-22 20:28:11 -07:00
milvus.ipynb	mv module integrations docs (#8101 )	2023-07-23 23:23:16 -07:00
mongodb_atlas.ipynb	docs: `integrations/providers` (#9631 )	2023-08-22 20:28:11 -07:00
myscale.ipynb	mv module integrations docs (#8101 )	2023-07-23 23:23:16 -07:00
opensearch.ipynb	Add summarization use-case (#8376 )	2023-08-02 14:25:11 -07:00
pgembedding.ipynb	docs: `integrations/providers` (#9631 )	2023-08-22 20:28:11 -07:00
pgvector.ipynb	mv module integrations docs (#8101 )	2023-07-23 23:23:16 -07:00
pinecone.ipynb	Minor fixes to enhance notebook usability: (#8389 )	2023-07-28 17:10:03 -07:00
qdrant.ipynb	Fix documentation for from_documents signature (#8482 )	2023-07-30 13:24:44 -07:00
redis.ipynb	Redis metadata filtering and specification, index customization (#8612 )	2023-08-25 17:22:50 -07:00
rockset.ipynb	Fix docs for Rockset (#8807 )	2023-08-06 15:04:01 -07:00
scann.ipynb	docs: `integrations/providers` (#9631 )	2023-08-22 20:28:11 -07:00
singlestoredb.ipynb	mv module integrations docs (#8101 )	2023-07-23 23:23:16 -07:00
sklearn.ipynb	Revert "add filter to sklearn vector store functions (#8113 )" (#8760 )	2023-08-04 08:13:32 -07:00
starrocks.ipynb	mv module integrations docs (#8101 )	2023-07-23 23:23:16 -07:00
supabase.ipynb	docs: `integrations/providers` (#9631 )	2023-08-22 20:28:11 -07:00
tair.ipynb	tair fix distance_type error, and add hybrid search (#9531 )	2023-08-23 16:38:31 -07:00
tigris.ipynb	mv module integrations docs (#8101 )	2023-07-23 23:23:16 -07:00
typesense.ipynb	mv module integrations docs (#8101 )	2023-07-23 23:23:16 -07:00
usearch.ipynb	docs: `integrations/providers` (#9631 )	2023-08-22 20:28:11 -07:00
vectara.ipynb	Fix typo in Vectara docs (#8925 )	2023-08-08 10:11:07 -07:00
weaviate.ipynb	Weaviate: adding auth example + fixing spelling in ReadME (#8939 )	2023-08-08 16:24:17 -07:00
xata.ipynb	docs: `integrations/providers` (#9631 )	2023-08-22 20:28:11 -07:00
zep.ipynb	zep/new ZepVectorStore (#9159 )	2023-08-16 00:23:07 -07:00
zilliz.ipynb	mv module integrations docs (#8101 )	2023-07-23 23:23:16 -07:00