You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
langchain/docs/docs/integrations/vectorstores/weaviate.ipynb

922 lines
36 KiB
Plaintext

This file contains ambiguous Unicode characters!

This file contains ambiguous Unicode characters that may be confused with others in your current locale. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to highlight these characters.

{
"cells": [
{
"cell_type": "raw",
"id": "1957f5cb",
"metadata": {},
"source": [
"---\n",
"sidebar_label: Weaviate\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "ef1f0986",
"metadata": {},
"source": [
"# Weaviate\n",
"\n",
"This notebook covers how to get started with the Weaviate vector store in LangChain, using the `langchain-weaviate` package.\n",
"\n",
"> [Weaviate](https://weaviate.io/) is an open-source vector database. It allows you to store data objects and vector embeddings from your favorite ML-models, and scale seamlessly into billions of data objects.\n",
"\n",
"To use this integration, you need to have a running Weaviate database instance.\n",
"\n",
"## Minimum versions\n",
"\n",
"This module requires Weaviate `1.23.7` or higher. However, we recommend you use the latest version of Weaviate.\n",
"\n",
"## Connecting to Weaviate\n",
"\n",
"In this notebook, we assume that you have a local instance of Weaviate running on `http://localhost:8080` and port 50051 open for [gRPC traffic](https://weaviate.io/blog/grpc-performance-improvements). So, we will connect to Weaviate with:\n",
"\n",
"```python\n",
"weaviate_client = weaviate.connect_to_local()\n",
"```\n",
"\n",
"### Other deployment options\n",
"\n",
"Weaviate can be [deployed in many different ways](https://weaviate.io/developers/weaviate/starter-guides/which-weaviate) such as using [Weaviate Cloud Services (WCS)](https://console.weaviate.cloud), [Docker](https://weaviate.io/developers/weaviate/installation/docker-compose) or [Kubernetes](https://weaviate.io/developers/weaviate/installation/kubernetes). \n",
"\n",
"If your Weaviate instance is deployed in another way, [read more here](https://weaviate.io/developers/weaviate/client-libraries/python#instantiate-a-client) about different ways to connect to Weaviate. You can use different [helper functions](https://weaviate.io/developers/weaviate/client-libraries/python#python-client-v4-helper-functions) or [create a custom instance](https://weaviate.io/developers/weaviate/client-libraries/python#python-client-v4-explicit-connection).\n",
"\n",
"> Note that you require a `v4` client API, which will create a `weaviate.WeaviateClient` object.\n",
"\n",
"### Authentication\n",
"\n",
"Some Weaviate instances, such as those running on WCS, have authentication enabled, such as API key and/or username+password authentication.\n",
"\n",
"Read the [client authentication guide](https://weaviate.io/developers/weaviate/client-libraries/python#authentication) for more information, as well as the [in-depth authentication configuration page](https://weaviate.io/developers/weaviate/configuration/authentication)."
]
},
{
"cell_type": "markdown",
"id": "4a8437b1",
"metadata": {},
"source": [
"## Installation"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "d97b55c2",
"metadata": {},
"outputs": [],
"source": [
"# install package\n",
"# %pip install -Uqq langchain-weaviate\n",
"# %pip install openai tiktoken langchain"
]
},
{
"cell_type": "markdown",
"id": "36fdc060",
"metadata": {},
"source": [
"## Environment Setup\n",
"\n",
"This notebook uses the OpenAI API through `OpenAIEmbeddings`. We suggest obtaining an OpenAI API key and export it as an environment variable with the name `OPENAI_API_KEY`.\n",
"\n",
"Once this is done, your OpenAI API key will be read automatically. If you are new to environment variables, read more about them [here](https://docs.python.org/3/library/os.html#os.environ) or in [this guide](https://www.twilio.com/en-us/blog/environment-variables-python)."
]
},
{
"cell_type": "markdown",
"id": "a8e3a83f",
"metadata": {},
"source": [
"# Usage"
]
},
{
"cell_type": "markdown",
"id": "6efee7cd",
"metadata": {},
"source": [
"## Find objects by similarity"
]
},
{
"cell_type": "markdown",
"id": "dc37144c-208d-4ab3-9f3a-0407a69fe052",
"metadata": {
"tags": []
},
"source": [
"Here is an example of how to find objects by similarity to a query, from data import to querying the Weaviate instance.\n",
"\n",
"### Step 1: Data import\n",
"\n",
"First, we will create data to add to `Weaviate` by loading and chunking the contents of a long text file. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "9d0ab00c",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain_community.document_loaders import TextLoader\n",
"from langchain_community.embeddings.openai import OpenAIEmbeddings"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "4618779d",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The class `langchain_community.embeddings.openai.OpenAIEmbeddings` was deprecated in langchain-community 0.1.0 and will be removed in 0.2.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import OpenAIEmbeddings`.\n",
" warn_deprecated(\n"
]
}
],
"source": [
"loader = TextLoader(\"state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embeddings = OpenAIEmbeddings()"
]
},
{
"cell_type": "markdown",
"id": "ae774cf5",
"metadata": {},
"source": [
"Now, we can import the data. \n",
"\n",
"To do so, connect to the Weaviate instance and use the resulting `weaviate_client` object. For example, we can import the documents as shown below:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "3fbda8c4",
"metadata": {},
"outputs": [],
"source": [
"import weaviate\n",
"from langchain_weaviate.vectorstores import WeaviateVectorStore"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "e06f64b7",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/\n",
" warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)\n"
]
}
],
"source": [
"weaviate_client = weaviate.connect_to_local()\n",
"db = WeaviateVectorStore.from_documents(docs, embeddings, client=weaviate_client)"
]
},
{
"cell_type": "markdown",
"id": "abe29115",
"metadata": {},
"source": [
"### Step 2: Perform the search"
]
},
{
"cell_type": "markdown",
"id": "2799f5a3",
"metadata": {},
"source": [
"We can now perform a similarity search. This will return the most similar documents to the query text, based on the embeddings stored in Weaviate and an equivalent embedding generated from the query text."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "ebc3aa1e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Document 1:\n",
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Ac...\n",
"\n",
"Document 2:\n",
"And so many families are living paycheck to paycheck, struggling to keep up with the rising cost of ...\n",
"\n",
"Document 3:\n",
"Vice President Harris and I ran for office with a new economic vision for America. \n",
"\n",
"Invest in Ameri...\n",
"\n",
"Document 4:\n",
"A former top litigator in private practice. A former federal public defender. And from a family of p...\n"
]
}
],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = db.similarity_search(query)\n",
"\n",
"# Print the first 100 characters of each result\n",
"for i, doc in enumerate(docs):\n",
" print(f\"\\nDocument {i+1}:\")\n",
" print(doc.page_content[:100] + \"...\")"
]
},
{
"cell_type": "markdown",
"id": "ca1134ef",
"metadata": {},
"source": [
"You can also add filters, which will either include or exclude results based on the filter conditions. (See [more filter examples](https://weaviate.io/developers/weaviate/search/filters).)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "d1210f90",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0\n",
"4\n"
]
}
],
"source": [
"from weaviate.classes.query import Filter\n",
"\n",
"for filter_str in [\"blah.txt\", \"state_of_the_union.txt\"]:\n",
" search_filter = Filter.by_property(\"source\").equal(filter_str)\n",
" filtered_search_results = db.similarity_search(query, filters=search_filter)\n",
" print(len(filtered_search_results))\n",
" if filter_str == \"state_of_the_union.txt\":\n",
" assert len(filtered_search_results) > 0 # There should be at least one result\n",
" else:\n",
" assert len(filtered_search_results) == 0 # There should be no results"
]
},
{
"cell_type": "markdown",
"id": "2646a25f",
"metadata": {},
"source": [
"It is also possible to provide `k`, which is the upper limit of the number of results to return."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "6e53d7d5",
"metadata": {},
"outputs": [],
"source": [
"search_filter = Filter.by_property(\"source\").equal(\"state_of_the_union.txt\")\n",
"filtered_search_results = db.similarity_search(query, filters=search_filter, k=3)\n",
"assert len(filtered_search_results) <= 3"
]
},
{
"cell_type": "markdown",
"id": "26bfd9bc",
"metadata": {},
"source": [
"### Quantify result similarity"
]
},
{
"cell_type": "markdown",
"id": "3b286d60",
"metadata": {},
"source": [
"You can optionally retrieve a relevance \"score\". This is a relative score that indicates how good the particular search results is, amongst the pool of search results. \n",
"\n",
"Note that this is relative score, meaning that it should not be used to determine thresholds for relevance. However, it can be used to compare the relevance of different search results within the entire search result set."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "b3b4a2f4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.935 : For that purpose weve mobilized American ground forces, air squadrons, and ship deployments to prot...\n",
"0.500 : And built the strongest, freest, and most prosperous nation the world has ever known. \n",
"\n",
"Now is the h...\n",
"0.462 : If you travel 20 miles east of Columbus, Ohio, youll find 1,000 empty acres of land. \n",
"\n",
"It wont loo...\n",
"0.450 : And my report is this: the State of the Union is strong—because you, the American people, are strong...\n",
"0.442 : Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Ac...\n"
]
}
],
"source": [
"docs = db.similarity_search_with_score(\"country\", k=5)\n",
"\n",
"for doc in docs:\n",
" print(f\"{doc[1]:.3f}\", \":\", doc[0].page_content[:100] + \"...\")"
]
},
{
"cell_type": "markdown",
"id": "8abf9adc",
"metadata": {},
"source": [
"## Search mechanism"
]
},
{
"cell_type": "markdown",
"id": "d2d5b24a",
"metadata": {},
"source": [
"`similarity_search` uses Weaviate's [hybrid search](https://weaviate.io/developers/weaviate/api/graphql/search-operators#hybrid).\n",
"\n",
"A hybrid search combines a vector and a keyword search, with `alpha` as the weight of the vector search. The `similarity_search` function allows you to pass additional arguments as kwargs. See this [reference doc](https://weaviate.io/developers/weaviate/api/graphql/search-operators#hybrid) for the available arguments.\n",
"\n",
"So, you can perform a pure keyword search by adding `alpha=0` as shown below:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "74a7bae0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.', metadata={'source': 'state_of_the_union.txt'})"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs = db.similarity_search(query, alpha=0)\n",
"docs[0]"
]
},
{
"cell_type": "markdown",
"id": "a2b75761",
"metadata": {},
"source": [
"## Persistence"
]
},
{
"cell_type": "markdown",
"id": "5f298bc0",
"metadata": {},
"source": [
"Any data added through `langchain-weaviate` will persist in Weaviate according to its configuration. \n",
"\n",
"WCS instances, for example, are configured to persist data indefinitely, and Docker instances can be set up to persist data in a volume. Read more about [Weaviate's persistence](https://weaviate.io/developers/weaviate/configuration/persistence)."
]
},
{
"cell_type": "markdown",
"id": "da874a61",
"metadata": {},
"source": [
"## Multi-tenancy"
]
},
{
"cell_type": "markdown",
"id": "67a0719f",
"metadata": {},
"source": [
"[Multi-tenancy](https://weaviate.io/developers/weaviate/concepts/data#multi-tenancy) allows you to have a high number of isolated collections of data, with the same collection configuration, in a single Weaviate instance. This is great for multi-user environments such as building a SaaS app, where each end user will have their own isolated data collection.\n",
"\n",
"To use multi-tenancy, the vector store need to be aware of the `tenant` parameter. \n",
"\n",
"So when adding any data, provide the `tenant` parameter as shown below."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "8d365855",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2024-Mar-26 03:40 PM - langchain_weaviate.vectorstores - INFO - Tenant Foo does not exist in index LangChain_30b9273d43b3492db4fb2aba2e0d6871. Creating tenant.\n"
]
}
],
"source": [
"db_with_mt = WeaviateVectorStore.from_documents(\n",
" docs, embeddings, client=weaviate_client, tenant=\"Foo\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "2b3e6107",
"metadata": {},
"source": [
"And when performing queries, provide the `tenant` parameter also."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "49659eb3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.', metadata={'source': 'state_of_the_union.txt'}),\n",
" Document(page_content='And so many families are living paycheck to paycheck, struggling to keep up with the rising cost of food, gas, housing, and so much more. \\n\\nI understand. \\n\\nI remember when my Dad had to leave our home in Scranton, Pennsylvania to find work. I grew up in a family where if the price of food went up, you felt it. \\n\\nThats why one of the first things I did as President was fight to pass the American Rescue Plan. \\n\\nBecause people were hurting. We needed to act, and we did. \\n\\nFew pieces of legislation have done more in a critical moment in our history to lift us out of crisis. \\n\\nIt fueled our efforts to vaccinate the nation and combat COVID-19. It delivered immediate economic relief for tens of millions of Americans. \\n\\nHelped put food on their table, keep a roof over their heads, and cut the cost of health insurance. \\n\\nAnd as my Dad used to say, it gave people a little breathing room.', metadata={'source': 'state_of_the_union.txt'}),\n",
" Document(page_content='He and his Dad both have Type 1 diabetes, which means they need insulin every day. Insulin costs about $10 a vial to make. \\n\\nBut drug companies charge families like Joshua and his Dad up to 30 times more. I spoke with Joshuas mom. \\n\\nImagine what its like to look at your child who needs insulin and have no idea how youre going to pay for it. \\n\\nWhat it does to your dignity, your ability to look your child in the eye, to be the parent you expect to be. \\n\\nJoshua is here with us tonight. Yesterday was his birthday. Happy birthday, buddy. \\n\\nFor Joshua, and for the 200,000 other young people with Type 1 diabetes, lets cap the cost of insulin at $35 a month so everyone can afford it. \\n\\nDrug companies will still do very well. And while were at it let Medicare negotiate lower prices for prescription drugs, like the VA already does.', metadata={'source': 'state_of_the_union.txt'}),\n",
" Document(page_content='Putins latest attack on Ukraine was premeditated and unprovoked. \\n\\nHe rejected repeated efforts at diplomacy. \\n\\nHe thought the West and NATO wouldnt respond. And he thought he could divide us at home. Putin was wrong. We were ready. Here is what we did. \\n\\nWe prepared extensively and carefully. \\n\\nWe spent months building a coalition of other freedom-loving nations from Europe and the Americas to Asia and Africa to confront Putin. \\n\\nI spent countless hours unifying our European allies. We shared with the world in advance what we knew Putin was planning and precisely how he would try to falsely justify his aggression. \\n\\nWe countered Russias lies with truth. \\n\\nAnd now that he has acted the free world is holding him accountable. \\n\\nAlong with twenty-seven members of the European Union including France, Germany, Italy, as well as countries like the United Kingdom, Canada, Japan, Korea, Australia, New Zealand, and many others, even Switzerland.', metadata={'source': 'state_of_the_union.txt'})]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"db_with_mt.similarity_search(query, tenant=\"Foo\")"
]
},
{
"cell_type": "markdown",
"id": "24ecf858",
"metadata": {},
"source": [
"## Retriever options"
]
},
{
"cell_type": "markdown",
"id": "68e3757a",
"metadata": {},
"source": [
"Weaviate can also be used as a retriever"
]
},
{
"cell_type": "markdown",
"id": "f2a8712d",
"metadata": {},
"source": [
"### Maximal marginal relevance search (MMR)"
]
},
{
"cell_type": "markdown",
"id": "c92add51",
"metadata": {},
"source": [
"In addition to using similaritysearch in the retriever object, you can also use `mmr`."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "cb302651",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/\n",
" warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)\n"
]
},
{
"data": {
"text/plain": [
"Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.', metadata={'source': 'state_of_the_union.txt'})"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever = db.as_retriever(search_type=\"mmr\")\n",
"retriever.get_relevant_documents(query)[0]"
]
},
{
"cell_type": "markdown",
"id": "22f52ca4",
"metadata": {},
"source": [
"# Use with LangChain"
]
},
{
"cell_type": "markdown",
"id": "66690c78",
"metadata": {},
"source": [
"A known limitation of large languag models (LLMs) is that their training data can be outdated, or not include the specific domain knowledge that you require.\n",
"\n",
"Take a look at the example below:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "f74e20d8",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The class `langchain_community.chat_models.openai.ChatOpenAI` was deprecated in langchain-community 0.0.10 and will be removed in 0.2.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import ChatOpenAI`.\n",
" warn_deprecated(\n",
"/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The function `predict` was deprecated in LangChain 0.1.7 and will be removed in 0.2.0. Use invoke instead.\n",
" warn_deprecated(\n",
"/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/\n",
" warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)\n"
]
},
{
"data": {
"text/plain": [
"\"I'm sorry, I cannot provide real-time information as my responses are generated based on a mixture of licensed data, data created by human trainers, and publicly available data. The last update was in October 2021.\""
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_community.chat_models import ChatOpenAI\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-3.5-turbo\", temperature=0)\n",
"llm.predict(\"What did the president say about Justice Breyer\")"
]
},
{
"cell_type": "markdown",
"id": "829720ad",
"metadata": {},
"source": [
"Vector stores complement LLMs by providing a way to store and retrieve relevant information. This allow you to combine the strengths of LLMs and vector stores, by using LLM's reasoning and linguistic capabilities with vector stores' ability to retrieve relevant information.\n",
"\n",
"Two well-known applications for combining LLMs and vector stores are:\n",
"- Question answering\n",
"- Retrieval-augmented generation (RAG)"
]
},
{
"cell_type": "markdown",
"id": "0ae85eb4",
"metadata": {},
"source": [
"### Question Answering with Sources"
]
},
{
"cell_type": "markdown",
"id": "902d8ba7",
"metadata": {},
"source": [
"Question answering in langchain can be enhanced by the use of vector stores. Let's see how this can be done.\n",
"\n",
"This section uses the `RetrievalQAWithSourcesChain`, which does the lookup of the documents from an Index. \n",
"\n",
"First, we will chunk the text again and import them into the Weaviate vector store."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "ad91ded1",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chains import RetrievalQAWithSourcesChain\n",
"from langchain_community.llms import OpenAI"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "2438d702",
"metadata": {},
"outputs": [],
"source": [
"with open(\"state_of_the_union.txt\") as f:\n",
" state_of_the_union = f.read()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"texts = text_splitter.split_text(state_of_the_union)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "b0e106ab",
"metadata": {},
"outputs": [],
"source": [
"docsearch = WeaviateVectorStore.from_texts(\n",
" texts,\n",
" embeddings,\n",
" client=weaviate_client,\n",
" metadatas=[{\"source\": f\"{i}-pl\"} for i in range(len(texts))],\n",
")"
]
},
{
"cell_type": "markdown",
"id": "546bc958",
"metadata": {},
"source": [
"Now we can construct the chain, with the retriever specified:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "86bbb953",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The class `langchain_community.llms.openai.OpenAI` was deprecated in langchain-community 0.0.10 and will be removed in 0.2.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import OpenAI`.\n",
" warn_deprecated(\n"
]
}
],
"source": [
"chain = RetrievalQAWithSourcesChain.from_chain_type(\n",
" OpenAI(temperature=0), chain_type=\"stuff\", retriever=docsearch.as_retriever()\n",
")"
]
},
{
"cell_type": "markdown",
"id": "f8371444",
"metadata": {},
"source": [
"And run the chain, to ask the question:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "5c38cc39",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The function `__call__` was deprecated in LangChain 0.1.0 and will be removed in 0.2.0. Use invoke instead.\n",
" warn_deprecated(\n",
"/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/\n",
" warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)\n",
"/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/\n",
" warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)\n"
]
},
{
"data": {
"text/plain": [
"{'answer': ' The president thanked Justice Stephen Breyer for his service and announced his nomination of Judge Ketanji Brown Jackson to the Supreme Court.\\n',\n",
" 'sources': '31-pl'}"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain(\n",
" {\"question\": \"What did the president say about Justice Breyer\"},\n",
" return_only_outputs=True,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "fecab3b5",
"metadata": {},
"source": [
"### Retrieval-Augmented Generation\n",
"\n",
"Another very popular application of combining LLMs and vector stores is retrieval-augmented generation (RAG). This is a technique that uses a retriever to find relevant information from a vector store, and then uses an LLM to provide an output based on the retrieved data and a prompt.\n",
"\n",
"We begin with a similar setup:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "33b0a9d3",
"metadata": {},
"outputs": [],
"source": [
"with open(\"state_of_the_union.txt\") as f:\n",
" state_of_the_union = f.read()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"texts = text_splitter.split_text(state_of_the_union)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "d2ade6ae",
"metadata": {},
"outputs": [],
"source": [
"docsearch = WeaviateVectorStore.from_texts(\n",
" texts,\n",
" embeddings,\n",
" client=weaviate_client,\n",
" metadatas=[{\"source\": f\"{i}-pl\"} for i in range(len(texts))],\n",
")\n",
"\n",
"retriever = docsearch.as_retriever()"
]
},
{
"cell_type": "markdown",
"id": "39413671",
"metadata": {},
"source": [
"We need to construct a template for the RAG model so that the retrieved information will be populated in the template."
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "578570b8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"input_variables=['context', 'question'] messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template=\"You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\\nQuestion: {question}\\nContext: {context}\\nAnswer:\\n\"))]\n"
]
}
],
"source": [
"from langchain_core.prompts import ChatPromptTemplate\n",
"\n",
"template = \"\"\"You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\n",
"Question: {question}\n",
"Context: {context}\n",
"Answer:\n",
"\"\"\"\n",
"prompt = ChatPromptTemplate.from_template(template)\n",
"\n",
"print(prompt)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "74982155",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.chat_models import ChatOpenAI\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-3.5-turbo\", temperature=0)"
]
},
{
"cell_type": "markdown",
"id": "e47abe3a",
"metadata": {},
"source": [
"And running the cell, we get a very similar output."
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "fe129bdd",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/\n",
" warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)\n",
"/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/\n",
" warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)\n"
]
},
{
"data": {
"text/plain": [
"\"The president honored Justice Stephen Breyer for his service to the country as an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. The president also mentioned nominating Circuit Court of Appeals Judge Ketanji Brown Jackson to continue Justice Breyer's legacy of excellence. The president expressed gratitude towards Justice Breyer and highlighted the importance of nominating someone to serve on the United States Supreme Court.\""
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.runnables import RunnablePassthrough\n",
"\n",
"rag_chain = (\n",
" {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
" | prompt\n",
" | llm\n",
" | StrOutputParser()\n",
")\n",
"\n",
"rag_chain.invoke(\"What did the president say about Justice Breyer\")"
]
},
{
"cell_type": "markdown",
"id": "ce5a2553",
"metadata": {},
"source": [
"But note that since the template is upto you to construct, you can customize it to your needs. "
]
},
{
"cell_type": "markdown",
"id": "e7417ac5",
"metadata": {},
"source": [
"### Wrap-up & resources\n",
"\n",
"Weaviate is a scalable, production-ready vector store. \n",
"\n",
"This integration allows Weaviate to be used with LangChain to enhance the capabilities of large language models with a robust data store. Its scalability and production-readiness make it a great choice as a vector store for your LangChain applications, and it will reduce your time to production."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}