mirror of
https://github.com/hwchase17/langchain
synced 2024-10-31 15:20:26 +00:00
e1f4f9ac3e
Added missed pages for `integrations/providers` from `vectorstores`. Updated several `vectorstores` notebooks.
332 lines
9.3 KiB
Plaintext
332 lines
9.3 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "1292f057",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Postgres Embedding\n",
|
|
"\n",
|
|
"> [Postgres Embedding](https://github.com/neondatabase/pg_embedding) is an open-source vector similarity search for `Postgres` that uses `Hierarchical Navigable Small Worlds (HNSW)` for approximate nearest neighbor search.\n",
|
|
"\n",
|
|
">It supports:\n",
|
|
">- exact and approximate nearest neighbor search using HNSW\n",
|
|
">- L2 distance\n",
|
|
"\n",
|
|
"This notebook shows how to use the Postgres vector database (`PGEmbedding`).\n",
|
|
"\n",
|
|
"> The PGEmbedding integration creates the pg_embedding extension for you, but you run the following Postgres query to add it:\n",
|
|
"```sql\n",
|
|
"CREATE EXTENSION embedding;\n",
|
|
"```"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "a6214221",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Pip install necessary package\n",
|
|
"!pip install openai\n",
|
|
"!pip install psycopg2-binary\n",
|
|
"!pip install tiktoken"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "b2e49694",
|
|
"metadata": {},
|
|
"source": [
|
|
"Add the OpenAI API Key to the environment variables to use `OpenAIEmbeddings`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"id": "1dcc8d99",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"OpenAI API Key:········\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"import os\n",
|
|
"import getpass\n",
|
|
"\n",
|
|
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"id": "9719ea68",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"## Loading Environment Variables\n",
|
|
"from typing import List, Tuple"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "dfd1f38d",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
|
|
"from langchain.text_splitter import CharacterTextSplitter\n",
|
|
"from langchain.vectorstores import PGEmbedding\n",
|
|
"from langchain.document_loaders import TextLoader\n",
|
|
"from langchain.docstore.document import Document"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"id": "8fab8cc2",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Database Url:········\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"os.environ[\"DATABASE_URL\"] = getpass.getpass(\"Database Url:\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"id": "bef17115",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"loader = TextLoader(\"state_of_the_union.txt\")\n",
|
|
"documents = loader.load()\n",
|
|
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
|
|
"docs = text_splitter.split_documents(documents)\n",
|
|
"\n",
|
|
"embeddings = OpenAIEmbeddings()\n",
|
|
"connection_string = os.environ.get(\"DATABASE_URL\")\n",
|
|
"collection_name = \"state_of_the_union\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 13,
|
|
"id": "743abfaa",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"db = PGEmbedding.from_documents(\n",
|
|
" embedding=embeddings,\n",
|
|
" documents=docs,\n",
|
|
" collection_name=collection_name,\n",
|
|
" connection_string=connection_string,\n",
|
|
")\n",
|
|
"\n",
|
|
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
|
"docs_with_score: List[Tuple[Document, float]] = db.similarity_search_with_score(query)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "41ce4c4e",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"for doc, score in docs_with_score:\n",
|
|
" print(\"-\" * 80)\n",
|
|
" print(\"Score: \", score)\n",
|
|
" print(doc.page_content)\n",
|
|
" print(\"-\" * 80)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "7ef7b052",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Working with vectorstore in Postgres"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "939151f7",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Uploading a vectorstore in PG "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 32,
|
|
"id": "595ac511",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"db = PGEmbedding.from_documents(\n",
|
|
" embedding=embeddings,\n",
|
|
" documents=docs,\n",
|
|
" collection_name=collection_name,\n",
|
|
" connection_string=connection_string,\n",
|
|
" pre_delete_collection=False,\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "f9510e6b",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Create HNSW Index\n",
|
|
"By default, the extension performs a sequential scan search, with 100% recall. You might consider creating an HNSW index for approximate nearest neighbor (ANN) search to speed up `similarity_search_with_score` execution time. To create the HNSW index on your vector column, use a `create_hnsw_index` function:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "2d1981fa",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"PGEmbedding.create_hnsw_index(\n",
|
|
" max_elements=10000, dims=1536, m=8, ef_construction=16, ef_search=16\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "7adacf29",
|
|
"metadata": {},
|
|
"source": [
|
|
"The function above is equivalent to running the below SQL query:\n",
|
|
"```sql\n",
|
|
"CREATE INDEX ON vectors USING hnsw(vec) WITH (maxelements=10000, dims=1536, m=3, efconstruction=16, efsearch=16);\n",
|
|
"```\n",
|
|
"The HNSW index options used in the statement above include:\n",
|
|
"\n",
|
|
"- maxelements: Defines the maximum number of elements indexed. This is a required parameter. The example shown above has a value of 3. A real-world example would have a much large value, such as 1000000. An \"element\" refers to a data point (a vector) in the dataset, which is represented as a node in the HNSW graph. Typically, you would set this option to a value able to accommodate the number of rows in your in your dataset.\n",
|
|
"- dims: Defines the number of dimensions in your vector data. This is a required parameter. A small value is used in the example above. If you are storing data generated using OpenAI's text-embedding-ada-002 model, which supports 1536 dimensions, you would define a value of 1536, for example.\n",
|
|
"- m: Defines the maximum number of bi-directional links (also referred to as \"edges\") created for each node during graph construction.\n",
|
|
"The following additional index options are supported:\n",
|
|
"\n",
|
|
"- efConstruction: Defines the number of nearest neighbors considered during index construction. The default value is 32.\n",
|
|
"- efsearch: Defines the number of nearest neighbors considered during index search. The default value is 32.\n",
|
|
"For information about how you can configure these options to influence the HNSW algorithm, refer to [Tuning the HNSW algorithm](https://neon.tech/docs/extensions/pg_embedding#tuning-the-hnsw-algorithm)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "528893fb",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Retrieving a vectorstore in PG"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 15,
|
|
"id": "b6162b1c",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"store = PGEmbedding(\n",
|
|
" connection_string=connection_string,\n",
|
|
" embedding_function=embeddings,\n",
|
|
" collection_name=collection_name,\n",
|
|
")\n",
|
|
"\n",
|
|
"retriever = store.as_retriever()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 16,
|
|
"id": "1a5fedb1",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"VectorStoreRetriever(vectorstore=<langchain.vectorstores.pghnsw.HNSWVectoreStore object at 0x121d3c8b0>, search_type='similarity', search_kwargs={})"
|
|
]
|
|
},
|
|
"execution_count": 16,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"retriever"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 17,
|
|
"id": "0cefc938",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"db1 = PGEmbedding.from_existing_index(\n",
|
|
" embedding=embeddings,\n",
|
|
" collection_name=collection_name,\n",
|
|
" pre_delete_collection=False,\n",
|
|
" connection_string=connection_string,\n",
|
|
")\n",
|
|
"\n",
|
|
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
|
"docs_with_score: List[Tuple[Document, float]] = db1.similarity_search_with_score(query)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "85cde495",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"for doc, score in docs_with_score:\n",
|
|
" print(\"-\" * 80)\n",
|
|
" print(\"Score: \", score)\n",
|
|
" print(doc.page_content)\n",
|
|
" print(\"-\" * 80)"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.10.12"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|