mirror of
https://github.com/hwchase17/langchain
synced 2024-10-31 15:20:26 +00:00
b96ab4b763
# Docs: improvements in the `retrievers/examples/` notebooks Its primary purpose is to make the Jupyter notebook examples **consistent** and more suitable for first-time viewers. - add links to the integration source (if applicable) with a short description of this source; - removed `_retriever` suffix from the file names (where it existed) for consistency; - removed ` retriever` from the notebook title (where it existed) for consistency; - added code to install necessary Python package(s); - added code to set up the necessary API Key. - very small fixes in notebooks from other folders (for consistency): - docs/modules/indexes/vectorstores/examples/elasticsearch.ipynb - docs/modules/indexes/vectorstores/examples/pinecone.ipynb - docs/modules/models/llms/integrations/cohere.ipynb - fixed misspelling in langchain/retrievers/time_weighted_retriever.py comment (sorry, about this change in a .py file ) ## Who can review @dev2049
374 lines
12 KiB
Plaintext
374 lines
12 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "13afcae7",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Self-querying\n",
|
|
"\n",
|
|
"In the notebook we'll demo the `SelfQueryRetriever`, which, as the name suggests, has the ability to query itself. Specifically, given any natural language query, the retriever uses a query-constructing LLM chain to write a structured query and then applies that structured query to it's underlying VectorStore. This allows the retriever to not only use the user-input query for semantic similarity comparison with the contents of stored documented, but to also extract filters from the user query on the metadata of stored documents and to execute those filters."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "68e75fb9",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Creating a Pinecone index\n",
|
|
"First we'll want to create a `Pinecone` VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
|
|
"\n",
|
|
"To use Pinecone, you to have `pinecone` package installed and you must have an API key and an Environment. Here are the [installation instructions](https://docs.pinecone.io/docs/quickstart).\n",
|
|
"\n",
|
|
"NOTE: The self-query retriever requires you to have `lark` package installed."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"id": "63a8af5b",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# !pip install lark"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "2f633445-57fe-45f3-84f7-80d3941b9e53",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#!pip install pinecone-client"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"id": "3eb9c9a4",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"/Users/harrisonchase/.pyenv/versions/3.9.1/envs/langchain/lib/python3.9/site-packages/pinecone/index.py:4: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)\n",
|
|
" from tqdm.autonotebook import tqdm\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"import os\n",
|
|
"\n",
|
|
"import pinecone\n",
|
|
"\n",
|
|
"\n",
|
|
"pinecone.init(api_key=os.environ[\"PINECONE_API_KEY\"], environment=os.environ[\"PINECONE_ENV\"])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"id": "cb4a5787",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from langchain.schema import Document\n",
|
|
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
|
|
"from langchain.vectorstores import Pinecone\n",
|
|
"\n",
|
|
"embeddings = OpenAIEmbeddings()\n",
|
|
"# create new index\n",
|
|
"pinecone.create_index(\"langchain-self-retriever-demo\", dimension=1536)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"id": "bcbe04d9",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"docs = [\n",
|
|
" Document(page_content=\"A bunch of scientists bring back dinosaurs and mayhem breaks loose\", metadata={\"year\": 1993, \"rating\": 7.7, \"genre\": [\"action\", \"science fiction\"]}),\n",
|
|
" Document(page_content=\"Leo DiCaprio gets lost in a dream within a dream within a dream within a ...\", metadata={\"year\": 2010, \"director\": \"Christopher Nolan\", \"rating\": 8.2}),\n",
|
|
" Document(page_content=\"A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea\", metadata={\"year\": 2006, \"director\": \"Satoshi Kon\", \"rating\": 8.6}),\n",
|
|
" Document(page_content=\"A bunch of normal-sized women are supremely wholesome and some men pine after them\", metadata={\"year\": 2019, \"director\": \"Greta Gerwig\", \"rating\": 8.3}),\n",
|
|
" Document(page_content=\"Toys come alive and have a blast doing so\", metadata={\"year\": 1995, \"genre\": \"animated\"}),\n",
|
|
" Document(page_content=\"Three men walk into the Zone, three men walk out of the Zone\", metadata={\"year\": 1979, \"rating\": 9.9, \"director\": \"Andrei Tarkovsky\", \"genre\": [\"science fiction\", \"thriller\"], \"rating\": 9.9})\n",
|
|
"]\n",
|
|
"vectorstore = Pinecone.from_documents(\n",
|
|
" docs, embeddings, index_name=\"langchain-self-retriever-demo\"\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "5ecaab6d",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Creating our self-querying retriever\n",
|
|
"Now we can instantiate our retriever. To do this we'll need to provide some information upfront about the metadata fields that our documents support and a short description of the document contents."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"id": "86e34dbf",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from langchain.llms import OpenAI\n",
|
|
"from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
|
|
"from langchain.chains.query_constructor.base import AttributeInfo\n",
|
|
"\n",
|
|
"metadata_field_info=[\n",
|
|
" AttributeInfo(\n",
|
|
" name=\"genre\",\n",
|
|
" description=\"The genre of the movie\", \n",
|
|
" type=\"string or list[string]\", \n",
|
|
" ),\n",
|
|
" AttributeInfo(\n",
|
|
" name=\"year\",\n",
|
|
" description=\"The year the movie was released\", \n",
|
|
" type=\"integer\", \n",
|
|
" ),\n",
|
|
" AttributeInfo(\n",
|
|
" name=\"director\",\n",
|
|
" description=\"The name of the movie director\", \n",
|
|
" type=\"string\", \n",
|
|
" ),\n",
|
|
" AttributeInfo(\n",
|
|
" name=\"rating\",\n",
|
|
" description=\"A 1-10 rating for the movie\",\n",
|
|
" type=\"float\"\n",
|
|
" ),\n",
|
|
"]\n",
|
|
"document_content_description = \"Brief summary of a movie\"\n",
|
|
"llm = OpenAI(temperature=0)\n",
|
|
"retriever = SelfQueryRetriever.from_llm(llm, vectorstore, document_content_description, metadata_field_info, verbose=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "ea9df8d4",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Testing it out\n",
|
|
"And now we can try actually using our retriever!"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"id": "38a126e9",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"query='dinosaur' filter=None\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[Document(page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose', metadata={'genre': ['action', 'science fiction'], 'rating': 7.7, 'year': 1993.0}),\n",
|
|
" Document(page_content='Toys come alive and have a blast doing so', metadata={'genre': 'animated', 'year': 1995.0}),\n",
|
|
" Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006.0}),\n",
|
|
" Document(page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...', metadata={'director': 'Christopher Nolan', 'rating': 8.2, 'year': 2010.0})]"
|
|
]
|
|
},
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# This example only specifies a relevant query\n",
|
|
"retriever.get_relevant_documents(\"What are some movies about dinosaurs\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"id": "fc3f1e6e",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"query=' ' filter=Comparison(comparator=<Comparator.GT: 'gt'>, attribute='rating', value=8.5)\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[Document(page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea', metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006.0}),\n",
|
|
" Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'director': 'Andrei Tarkovsky', 'genre': ['science fiction', 'thriller'], 'rating': 9.9, 'year': 1979.0})]"
|
|
]
|
|
},
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# This example only specifies a filter\n",
|
|
"retriever.get_relevant_documents(\"I want to watch a movie rated higher than 8.5\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"id": "b19d4da0",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"query='women' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='director', value='Greta Gerwig')\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[Document(page_content='A bunch of normal-sized women are supremely wholesome and some men pine after them', metadata={'director': 'Greta Gerwig', 'rating': 8.3, 'year': 2019.0})]"
|
|
]
|
|
},
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# This example specifies a query and a filter\n",
|
|
"retriever.get_relevant_documents(\"Has Greta Gerwig directed any movies about women\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"id": "f900e40e",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"query=' ' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='genre', value='science fiction'), Comparison(comparator=<Comparator.GT: 'gt'>, attribute='rating', value=8.5)])\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[Document(page_content='Three men walk into the Zone, three men walk out of the Zone', metadata={'director': 'Andrei Tarkovsky', 'genre': ['science fiction', 'thriller'], 'rating': 9.9, 'year': 1979.0})]"
|
|
]
|
|
},
|
|
"execution_count": 8,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# This example specifies a composite filter\n",
|
|
"retriever.get_relevant_documents(\"What's a highly rated (above 8.5) science fiction film?\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"id": "12a51522",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"query='toys' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.GT: 'gt'>, attribute='year', value=1990.0), Comparison(comparator=<Comparator.LT: 'lt'>, attribute='year', value=2005.0), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='genre', value='animated')])\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[Document(page_content='Toys come alive and have a blast doing so', metadata={'genre': 'animated', 'year': 1995.0})]"
|
|
]
|
|
},
|
|
"execution_count": 9,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# This example specifies a query and composite filter\n",
|
|
"retriever.get_relevant_documents(\"What's a movie after 1990 but before 2005 that's all about toys, and preferably is animated\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "6fe7536c",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Filter k\n",
|
|
"\n",
|
|
"We can also use the self query retriever to specify `k`: the number of documents to fetch.\n",
|
|
"\n",
|
|
"We can do this by passing `enable_limit=True` to the constructor."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "3a2937c2",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"retriever = SelfQueryRetriever.from_llm(\n",
|
|
" llm, \n",
|
|
" vectorstore, \n",
|
|
" document_content_description, \n",
|
|
" metadata_field_info, \n",
|
|
" enable_limit=True,\n",
|
|
" verbose=True\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "83d233aa",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# This example only specifies a relevant query\n",
|
|
"retriever.get_relevant_documents(\"What are two movies about dinosaurs\")"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.10.6"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|