mirror of
https://github.com/hwchase17/langchain
synced 2024-11-08 07:10:35 +00:00
eac4ddb4bb
Todo:
- [x] Connection options (cloud, localhost url, es_connection) support
- [x] Logging support
- [x] Customisable field support
- [x] Distance Similarity support
- [x] Metadata support
- [x] Metadata Filter support
- [x] Retrieval Strategies
  - [x] Approx
  - [x] Approx with Hybrid
  - [x] Exact
  - [x] Custom
  - [x] ELSER (excluding hybrid as we are working on RRF support)
- [x] integration tests
- [x] Documentation

👋 This is a contribution to improve the Elasticsearch integration with LangChain. It's based loosely on the changes that are in master, but with some notable changes:

## Package name & design improvements

The import name is now `ElasticsearchStore`, to aid discoverability of the VectorStore.

```py
## Before
from langchain.vectorstores.elastic_vector_search import ElasticVectorSearch, ElasticKnnSearch

## Now
from langchain.vectorstores.elasticsearch import ElasticsearchStore
```

## Retrieval Strategy support

Before, we had a number of classes, depending on the strategy you wanted: `ElasticKnnSearch` for approx, `ElasticVectorSearch` for exact / brute force.

With `ElasticsearchStore` we have retrieval strategies:

### Approx Example

The default strategy. The vast majority of developers who use Elasticsearch will be inferring the embeddings outside of Elasticsearch. Uses the kNN functionality of `_search`.

```py
texts = ["foo", "bar", "baz"]
docsearch = ElasticsearchStore.from_texts(
    texts,
    FakeEmbeddings(),
    es_url="http://localhost:9200",
    index_name="sample-index"
)
output = docsearch.similarity_search("foo", k=1)
```

### Approx, with hybrid

For developers who want to search using both the embedding and the text BM25 match. It's simple to enable.

```py
texts = ["foo", "bar", "baz"]
docsearch = ElasticsearchStore.from_texts(
    texts,
    FakeEmbeddings(),
    es_url="http://localhost:9200",
    index_name="sample-index",
    strategy=ElasticsearchStore.ApproxRetrievalStrategy(hybrid=True)
)
output = docsearch.similarity_search("foo", k=1)
```

### Approx, with `query_model_id`

For developers who want to infer within Elasticsearch, using the model loaded in the ML node.

This relies on the developer to set up the pipeline and index if they wish to embed the text in Elasticsearch. There is an example of this in the test.

```py
texts = ["foo", "bar", "baz"]
docsearch = ElasticsearchStore.from_texts(
    texts,
    FakeEmbeddings(),
    es_url="http://localhost:9200",
    index_name="sample-index",
    strategy=ElasticsearchStore.ApproxRetrievalStrategy(
        query_model_id="sentence-transformers__all-minilm-l6-v2"
    ),
)
output = docsearch.similarity_search("foo", k=1)
```

### I want to provide my own custom Elasticsearch Query

You might want to have more control over the query, to perform multi-phase retrieval such as LTR, or to linearly boost on document parameters like recently updated or geo-distance. You can do this with `custom_query`.

```py
def my_custom_query(query_body: dict, query: str) -> dict:
    return {"query": {"match": {"text": {"query": "bar"}}}}

texts = ["foo", "bar", "baz"]
docsearch = ElasticsearchStore.from_texts(
    texts, FakeEmbeddings(), **elasticsearch_connection, index_name=index_name
)
docsearch.similarity_search("foo", k=1, custom_query=my_custom_query)
```

### Exact Example

For developers who have a small dataset in Elasticsearch and don't want the cost of indexing the dims, accepting a higher cost at query time instead. Uses `script_score`.

```py
texts = ["foo", "bar", "baz"]
docsearch = ElasticsearchStore.from_texts(
    texts,
    FakeEmbeddings(),
    es_url="http://localhost:9200",
    index_name="sample-index",
    strategy=ElasticsearchStore.ExactRetrievalStrategy(),
)
output = docsearch.similarity_search("foo", k=1)
```

### ELSER Example

Elastic provides its own sparse vector model called ELSER. With these changes, it's really easy to use. The vector store creates a pipeline and index that's set up for ELSER. All the developer needs to do is configure, ingest and query via LangChain tooling.

```py
texts = ["foo", "bar", "baz"]
docsearch = ElasticsearchStore.from_texts(
    texts,
    FakeEmbeddings(),
    es_url="http://localhost:9200",
    index_name="sample-index",
    strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(),
)
output = docsearch.similarity_search("foo", k=1)
```

## Architecture

In future, we can introduce new strategies without breaking backwards compatibility (bwc) as we evolve the index / query strategy.

## Credit

On release, could you credit @elastic and @phoey1 please? Thank you!

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
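
A minimal sketch of the custom query hook described in the commit message above, assuming the hook receives the strategy-generated `query_body` and the raw `query` string, as in the `my_custom_query` example. The function name `recency_boosted_query`, the `metadata.date` field, and the `365d` decay scale are hypothetical choices for illustration; the `function_score` query with a `gauss` decay function is standard Elasticsearch Query DSL.

```python
def recency_boosted_query(query_body: dict, query: str) -> dict:
    """Return a BM25 match on `text`, multiplied by a decay that favours recent documents.

    `query_body` is the body generated by the retrieval strategy; this sketch
    ignores it and builds a replacement body instead.
    """
    return {
        "query": {
            "function_score": {
                # Lexical match on the text field.
                "query": {"match": {"text": {"query": query}}},
                # Assumed date field: the score decays the older the document is.
                "functions": [
                    {"gauss": {"metadata.date": {"origin": "now", "scale": "365d"}}}
                ],
                "boost_mode": "multiply",
            }
        }
    }


# The function is pure, so it can be inspected without a running cluster.
body = recency_boosted_query({"knn": {}}, "foo")
print(body["query"]["function_score"]["query"])
```

Under these assumptions it would be passed at search time in the same way as `my_custom_query`, for example `docsearch.similarity_search("foo", k=1, custom_query=recency_boosted_query)`.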
801 lines
30 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "683953b3",
|
||
"metadata": {
|
||
"id": "683953b3"
|
||
},
|
||
"source": [
|
||
"# Elasticsearch\n",
|
||
"\n",
|
||
">[Elasticsearch](https://www.elastic.co/elasticsearch/) is a distributed, RESTful search and analytics engine, capable of performing both vector and lexical search. It is built on top of the Apache Lucene library. \n",
|
||
"\n",
|
||
"This notebook shows how to use functionality related to the `Elasticsearch` database."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "b66c12b2-2a07-4136-ac77-ce1c9fa7a409",
|
||
"metadata": {
|
||
"id": "b66c12b2-2a07-4136-ac77-ce1c9fa7a409",
|
||
"tags": []
|
||
},
|
||
"source": [
|
||
"## Running and connecting to Elasticsearch"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "81f43794-f002-477c-9b68-4975df30e718",
|
||
"metadata": {
|
||
"id": "81f43794-f002-477c-9b68-4975df30e718"
|
||
},
|
||
"source": [
|
||
"There are two main ways to setup an Elasticsearch instance for use with:\n",
|
||
"\n",
|
||
"1. Elastic Cloud: Elastic Cloud is a managed Elasticsearch service. Signup for a [free trial](https://cloud.elastic.co/registration?storm=langchain-notebook).\n",
|
||
"\n",
|
||
"To connect to an Elasticsearch instance that does not require\n",
|
||
"login credentials (starting the docker instance with security enabled), pass the Elasticsearch URL and index name along with the\n",
|
||
"embedding object to the constructor.\n",
|
||
"\n",
|
||
"2. Local Install Elasticsearch: Get started with Elasticsearch by running it locally. The easiest way is to use the official Elasticsearch Docker image. See the [Elasticsearch Docker documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html) for more information.\n",
|
||
"\n",
|
||
"\n",
|
||
"### Running Elasticsearch via Docker \n",
|
||
"Example: Run a single-node Elasticsearch instance with security disabled. This is not recommended for production use.\n",
|
||
"\n",
|
||
"```bash\n",
|
||
" docker run -p 9200:9200 -e \"discovery.type=single-node\" -e \"xpack.security.enabled=false\" -e \"xpack.security.http.ssl.enabled=false\" docker.elastic.co/elasticsearch/elasticsearch:8.9.0\n",
|
||
"```\n",
|
||
"\n",
|
||
"Once the Elasticsearch instance is running, you can connect to it using the Elasticsearch URL and index name along with the embedding object to the constructor.\n",
|
||
"\n",
|
||
"Example:\n",
|
||
"```python\n",
|
||
" from langchain.vectorstores.elasticsearch import ElasticsearchStore\n",
|
||
" from langchain.embeddings.openai import OpenAIEmbeddings\n",
|
||
"\n",
|
||
" embedding = OpenAIEmbeddings()\n",
|
||
" elastic_vector_search = ElasticsearchStore(\n",
|
||
" es_url=\"http://localhost:9200\",\n",
|
||
" index_name=\"test_index\",\n",
|
||
" embedding=embedding\n",
|
||
" )\n",
|
||
"```\n",
|
||
"### Authentication\n",
|
||
"For production, we recommend you run with security enabled. To connect with login credentials, you can use the parameters `api_key` or `es_user` and `es_password`.\n",
|
||
"\n",
|
||
"Example:\n",
|
||
"```python\n",
|
||
" from langchain.vectorstores import ElasticsearchStore\n",
|
||
" from langchain.embeddings import OpenAIEmbeddings\n",
|
||
"\n",
|
||
" embedding = OpenAIEmbeddings()\n",
|
||
" elastic_vector_search = ElasticsearchStore(\n",
|
||
" es_url=\"http://localhost:9200\",\n",
|
||
" index_name=\"test_index\",\n",
|
||
" embedding=embedding,\n",
|
||
" es_user=\"elastic\",\n",
|
||
" es_password=\"changeme\"\n",
|
||
" )\n",
|
||
"```\n",
|
||
"\n",
|
||
"#### How to obtain a password for the default \"elastic\" user?\n",
|
||
"\n",
|
||
"To obtain your Elastic Cloud password for the default \"elastic\" user:\n",
|
||
"1. Log in to the Elastic Cloud console at https://cloud.elastic.co\n",
|
||
"2. Go to \"Security\" > \"Users\"\n",
|
||
"3. Locate the \"elastic\" user and click \"Edit\"\n",
|
||
"4. Click \"Reset password\"\n",
|
||
"5. Follow the prompts to reset the password\n",
|
||
"\n",
|
||
"#### How to obtain an API key?\n",
|
||
"\n",
|
||
"To obtain an API key:\n",
|
||
"1. Log in to the Elastic Cloud console at https://cloud.elastic.co\n",
|
||
"2. Open Kibana and go to Stack Management > API Keys\n",
|
||
"3. Click \"Create API key\"\n",
|
||
"4. Enter a name for the API key and click \"Create\"\n",
|
||
"5. Copy the API key and paste it into the `api_key` parameter\n",
|
||
"\n",
|
||
"### Elastic Cloud\n",
|
||
"To connect to an Elasticsearch instance on Elastic Cloud, you can use either the `es_cloud_id` parameter or `es_url`.\n",
|
||
"\n",
|
||
"Example:\n",
|
||
"```python\n",
|
||
" from langchain.vectorstores.elasticsearch import ElasticsearchStore\n",
|
||
" from langchain.embeddings import OpenAIEmbeddings\n",
|
||
"\n",
|
||
" embedding = OpenAIEmbeddings()\n",
|
||
" elastic_vector_search = ElasticsearchStore(\n",
|
||
" es_cloud_id=\"<cloud_id>\",\n",
|
||
" index_name=\"test_index\",\n",
|
||
" embedding=embedding,\n",
|
||
" es_user=\"elastic\",\n",
|
||
" es_password=\"changeme\"\n",
|
||
" )\n",
|
||
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"id": "d6197931-cbe5-460c-a5e6-b5eedb83887c",
|
||
"metadata": {
|
||
"id": "d6197931-cbe5-460c-a5e6-b5eedb83887c",
|
||
"tags": []
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Requirement already satisfied: elasticsearch in /Users/joe/Library/Caches/pypoetry/virtualenvs/langchain-monorepo-ln7dNLl5-py3.10/lib/python3.10/site-packages (8.9.0)\n",
|
||
"Requirement already satisfied: elastic-transport<9,>=8 in /Users/joe/Library/Caches/pypoetry/virtualenvs/langchain-monorepo-ln7dNLl5-py3.10/lib/python3.10/site-packages (from elasticsearch) (8.4.0)\n",
|
||
"Requirement already satisfied: urllib3<2,>=1.26.2 in /Users/joe/Library/Caches/pypoetry/virtualenvs/langchain-monorepo-ln7dNLl5-py3.10/lib/python3.10/site-packages (from elastic-transport<9,>=8->elasticsearch) (1.26.16)\n",
|
||
"Requirement already satisfied: certifi in /Users/joe/Library/Caches/pypoetry/virtualenvs/langchain-monorepo-ln7dNLl5-py3.10/lib/python3.10/site-packages (from elastic-transport<9,>=8->elasticsearch) (2023.7.22)\n",
|
||
"\n",
|
||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.2.1\u001b[0m\n",
|
||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"!pip install elasticsearch"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "ea167a29",
|
||
"metadata": {},
|
||
"source": [
|
||
"We want to use `OpenAIEmbeddings` so we have to get the OpenAI API Key."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"id": "67ab8afa-f7c6-4fbf-b596-cb512da949da",
|
||
"metadata": {
|
||
"id": "67ab8afa-f7c6-4fbf-b596-cb512da949da",
|
||
"outputId": "fd16b37f-cb76-40a9-b83f-eab58dd0d912",
|
||
"tags": []
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"import os\n",
|
||
"import getpass\n",
|
||
"\n",
|
||
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "f6030187-0bd7-4798-8372-a265036af5e0",
|
||
"metadata": {
|
||
"id": "f6030187-0bd7-4798-8372-a265036af5e0",
|
||
"tags": []
|
||
},
|
||
"source": [
|
||
"## Basic Example\n",
|
||
"This example we are going to load \"state_of_the_union.txt\" via the TextLoader, chunk the text into 500 word chunks, and then index each chunk into Elasticsearch.\n",
|
||
"\n",
|
||
"Once the data is indexed, we perform a simple query to find the top 4 chunks that similar to the query \"What did the president say about Ketanji Brown Jackson\".\n",
|
||
"\n",
|
||
"Elasticsearch is running locally on localhost:9200 with [docker](#running-elasticsearch-via-docker). For more details on how to connect to Elasticsearch from Elastic Cloud, see [connecting with authentication](#authentication) above.\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "aac9563e",
|
||
"metadata": {
|
||
"id": "aac9563e",
|
||
"tags": []
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
|
||
"from langchain.vectorstores import ElasticsearchStore"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"id": "a3c3999a",
|
||
"metadata": {
|
||
"id": "a3c3999a",
|
||
"tags": []
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"from langchain.document_loaders import TextLoader\n",
|
||
"from langchain.text_splitter import CharacterTextSplitter\n",
|
||
"\n",
|
||
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
|
||
"documents = loader.load()\n",
|
||
"text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)\n",
|
||
"docs = text_splitter.split_documents(documents)\n",
|
||
"\n",
|
||
"embeddings = OpenAIEmbeddings()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 9,
|
||
"id": "12eb86d8",
|
||
"metadata": {
|
||
"id": "12eb86d8",
|
||
"tags": []
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[Document(page_content='One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../modules/state_of_the_union.txt'}), Document(page_content='One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../modules/state_of_the_union.txt', 'date': '2016-01-01', 'rating': 2, 'author': 'John Doe'}), Document(page_content='One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../modules/state_of_the_union.txt', 'date': '2010-01-01', 'rating': 1, 'author': 'John Doe'}), Document(page_content='As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. \\n\\nWhile it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice.', metadata={'source': '../../modules/state_of_the_union.txt'})]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"\n",
|
||
"db = ElasticsearchStore.from_documents(\n",
|
||
" docs, embeddings, es_url=\"http://localhost:9200\", index_name=\"test-basic\", \n",
|
||
")\n",
|
||
"\n",
|
||
"db.client.indices.refresh(index=\"test-basic\")\n",
|
||
"\n",
|
||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||
"results = db.similarity_search(query)\n",
|
||
"print(results)\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "f86ec452",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Metadata\n",
|
||
"\n",
|
||
"`ElasticsearchStore` supports metadata to stored along with the document. This metadata dict object is stored in a metadata object field in the Elasticsearch document. Based on the metadata value, Elasticsearch will automatically setup the mapping by infering the data type of the metadata value. For example, if the metadata value is a string, Elasticsearch will setup the mapping for the metadata object field as a string type."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"id": "5d076412",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"{'source': '../../modules/state_of_the_union.txt', 'date': '2016-01-01', 'rating': 2, 'author': 'John Doe'}\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"\n",
|
||
"# Adding metadata to documents\n",
|
||
"for i, doc in enumerate(docs):\n",
|
||
" doc.metadata[\"date\"] = f\"{range(2010, 2020)[i % 10]}-01-01\"\n",
|
||
" doc.metadata[\"rating\"] = range(1, 6)[i % 5] \n",
|
||
" doc.metadata[\"author\"] = [\"John Doe\", \"Jane Doe\"][i % 2]\n",
|
||
"\n",
|
||
"db = ElasticsearchStore.from_documents(\n",
|
||
" docs, embeddings, es_url=\"http://localhost:9200\", index_name=\"test-metadata\"\n",
|
||
")\n",
|
||
"\n",
|
||
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
|
||
"docs = db.similarity_search(query)\n",
|
||
"print(docs[0].metadata)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "3befa9e0",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Filtering Metadata\n",
|
||
"With metadata added to the documents, you can add metadata filtering at query time. \n",
|
||
"\n",
|
||
"### Example: Filter by keyword"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 57,
|
||
"id": "b2a4bd1b",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"{'source': '../../modules/state_of_the_union.txt', 'date': '2010-01-01', 'rating': 1, 'author': 'John Doe', 'geo_location': {'lat': 40.12, 'lon': -71.34}}\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"docs = db.similarity_search(query, filter=[{ \"match\": { \"metadata.author\": \"John Doe\"}}])\n",
|
||
"print(docs[0].metadata)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "647d26ea",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Example: Filter by Date Range"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 59,
|
||
"id": "55b63a61",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"{'source': '../../modules/state_of_the_union.txt', 'date': '2012-01-01', 'rating': 3, 'author': 'John Doe', 'geo_location': {'lat': 40.12, 'lon': -71.34}}\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"docs = db.similarity_search(\"Any mention about Fred?\", filter=[{ \"range\": { \"metadata.date\": { \"gte\": \"2010-01-01\" }}}])\n",
|
||
"print(docs[0].metadata)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "08f57936",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Example: Filter by Numeric Range"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 60,
|
||
"id": "9b831b3d",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"{'source': '../../modules/state_of_the_union.txt', 'date': '2012-01-01', 'rating': 3, 'author': 'John Doe', 'geo_location': {'lat': 40.12, 'lon': -71.34}}\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"docs = db.similarity_search(\"Any mention about Fred?\", filter=[{ \"range\": { \"metadata.rating\": { \"gte\": 2 }}}])\n",
|
||
"print(docs[0].metadata)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "05e7ec40",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Example: Filter by Geo Distance\n",
|
||
"Requires an index with a geo_point mapping to be declared for `metadata.geo_location`."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "fb1482e7",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"docs = db.similarity_search(\"Any mention about Fred?\", filter=[{ \"geo_distance\": { \"distance\": \"200km\", \"metadata.geo_location\": { \"lat\": 40, \"lon\": -70 } } }])\n",
|
||
"print(docs[0].metadata)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "729eb73b",
|
||
"metadata": {},
|
||
"source": [
|
||
"Filter supports many more types of queries than above. \n",
|
||
"\n",
|
||
"Read more about them in the [documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "1d54068c",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Distance Similarity Algorithm\n",
|
||
"Elasticsearch supports the following vector distance similarity algorithms:\n",
|
||
"\n",
|
||
"- cosine\n",
|
||
"- euclidean\n",
|
||
"- dot_product\n",
|
||
"\n",
|
||
"The cosine similarity algorithm is the default.\n",
|
||
"\n",
|
||
"You can specify the similarity Algorithm needed via the similarity parameter.\n",
|
||
"\n",
|
||
"**NOTE**\n",
|
||
"Depending on the retrieval strategy, the similarity algorithm cannot be changed at query time. It is needed to be set when creating the index mapping for field. If you need to change the similarity algorithm, you need to delete the index and recreate it with the correct distance_strategy.\n",
|
||
"\n",
|
||
"```python\n",
|
||
"\n",
|
||
"db = ElasticsearchStore.from_documents(\n",
|
||
" docs, \n",
|
||
" embeddings, \n",
|
||
" es_url=\"http://localhost:9200\", \n",
|
||
" index_name=\"test\",\n",
|
||
" distance_strategy=\"COSINE\"\n",
|
||
" # distance_strategy=\"EUCLIDEAN_DISTANCE\"\n",
|
||
" # distance_strategy=\"DOT_PRODUCT\"\n",
|
||
")\n",
|
||
"\n",
|
||
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "b06da4d7",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Retrieval Strategies\n",
|
||
"Elasticsearch has big advantages over other vector only databases from its ability to support a wide range of retrieval strategies. In this notebook we will configure `ElasticsearchStore` to support some of the most common retrieval strategies. \n",
|
||
"\n",
|
||
"By default, `ElasticsearchStore` uses the `ApproxRetrievalStrategy`.\n",
|
||
"\n",
|
||
"## ApproxRetrievalStrategy\n",
|
||
"This will return the top `k` most similar vectors to the query vector. The `k` parameter is set when the `ElasticsearchStore` is initialized. The default value is `10`."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "999b5ef5",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"db = ElasticsearchStore.from_documents(\n",
|
||
" docs, \n",
|
||
" embeddings, \n",
|
||
" es_url=\"http://localhost:9200\", \n",
|
||
" index_name=\"test\",\n",
|
||
" strategy=ElasticsearchStore.ApproxRetrievalStrategy()\n",
|
||
")\n",
|
||
"\n",
|
||
"docs = db.similarity_search(query=\"What did the president say about Ketanji Brown Jackson?\", k=10)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "9b651be5",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Example: Approx with hybrid\n",
|
||
"This example will show how to configure `ElasticsearchStore` to perform a hybrid retrieval, using a combination of approximate semantic search and keyword based search. \n",
|
||
"\n",
|
||
"We use RRF to balance the two scores from different retrieval methods.\n",
|
||
"\n",
|
||
"To enable hybrid retrieval, we need to set `hybrid=True` in `ElasticsearchStore` `ApproxRetrievalStrategy` constructor.\n",
|
||
"\n",
|
||
"```python\n",
|
||
"\n",
|
||
"db = ElasticsearchStore.from_documents(\n",
|
||
" docs, \n",
|
||
" embeddings, \n",
|
||
" es_url=\"http://localhost:9200\", \n",
|
||
" index_name=\"test\",\n",
|
||
" strategy=ElasticsearchStore.ApproxRetrievalStrategy(\n",
|
||
" hybrid=True,\n",
|
||
" )\n",
|
||
")\n",
|
||
"```\n",
|
||
"\n",
|
||
"When `hybrid` is enabled, the query performed will be a combination of approximate semantic search and keyword based search. \n",
|
||
"\n",
|
||
"It will use `rrf` (Reciprocal Rank Fusion) to balance the two scores from different retrieval methods.\n",
|
||
"\n",
|
||
"**Note** RRF requires Elasticsearch 8.9.0 or above.\n",
|
||
"\n",
|
||
"```json\n",
|
||
"{\n",
|
||
" \"knn\": {\n",
|
||
" \"field\": \"vector\",\n",
|
||
" \"filter\": [],\n",
|
||
" \"k\": 1,\n",
|
||
" \"num_candidates\": 50,\n",
|
||
" \"query_vector\": [1.0, ..., 0.0],\n",
|
||
" },\n",
|
||
" \"query\": {\n",
|
||
" \"bool\": {\n",
|
||
" \"filter\": [],\n",
|
||
" \"must\": [{\"match\": {\"text\": {\"query\": \"foo\"}}}],\n",
|
||
" }\n",
|
||
" },\n",
|
||
" \"rank\": {\"rrf\": {}},\n",
|
||
"}\n",
|
||
"```\n",
|
||
"\n",
|
||
"### Example: Approx with Embedding Model in Elasticsearch\n",
|
||
"This example will show how to configure `ElasticsearchStore` to use the embedding model deployed in Elasticsearch for approximate retrieval. \n",
|
||
"\n",
|
||
"To use this, specify the model_id in `ElasticsearchStore` `ApproxRetrievalStrategy` constructor via the `query_model_id` argument.\n",
|
||
"\n",
|
||
"**NOTE** This requires the model to be deployed and running in Elasticsearch ml node. \n",
|
||
"\n",
|
||
"```python\n",
|
||
"\n",
|
||
"# Note: This does not have an embedding function specified\n",
|
||
"# Instead, we will use the embedding model deployed in Elasticsearch\n",
|
||
"db = ElasticsearchStore( \n",
|
||
" es_url=\"http://localhost:9200\", \n",
|
||
" index_name=\"test\",\n",
|
||
" strategy=ElasticsearchStore.ApproxRetrievalStrategy(\n",
|
||
" query_model_id=\"sentence-transformers__all-minilm-l6-v2\"\n",
|
||
" )\n",
|
||
")\n",
|
||
"\n",
|
||
"# Perform search\n",
|
||
"db.similarity_search(\"hello world\", k=10)\n",
|
||
"```\n",
|
||
"\n",
|
||
"## SparseVectorRetrievalStrategy (ELSER)\n",
|
||
"This strategy uses Elasticsearch's sparse vector retrieval to retrieve the top-k results. We only support our own \"ELSER\" embedding model for now.\n",
|
||
"\n",
|
||
"**NOTE** This requires the ELSER model to be deployed and running in Elasticsearch ml node. \n",
|
||
"\n",
|
||
"To use this, specify `SparseVectorRetrievalStrategy` in `ElasticsearchStore` constructor."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"id": "38a63256",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"page_content='One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.' metadata={'source': '../../modules/state_of_the_union.txt'}\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"\n",
|
||
"# Note that this example doesn't have an embedding function. This is because we infer the tokens at index time and at query time within Elasticsearch. \n",
|
||
"# This requires the ELSER model to be loaded and running in Elasticsearch.\n",
|
||
"db = ElasticsearchStore.from_documents(\n",
|
||
" docs, \n",
|
||
" es_cloud_id=\"My_deployment:dXMtY2VudHJhbDEuZ2NwLmNsb3VkLmVzLmlvOjQ0MyQ2OGJhMjhmNDc1M2Y0MWVjYTk2NzI2ZWNkMmE5YzRkNyQ3NWI4ODRjNWQ2OTU0MTYzODFjOTkxNmQ1YzYxMGI1Mw==\",\n",
|
||
" es_user=\"elastic\",\n",
|
||
" es_password=\"GgUPiWKwEzgHIYdHdgPk1Lwi\",\n",
|
||
" index_name=\"test-elser\",\n",
|
||
" strategy=ElasticsearchStore.SparseVectorRetrievalStrategy()\n",
|
||
")\n",
|
||
"\n",
|
||
"db.client.indices.refresh(index=\"test-elser\")\n",
|
||
"\n",
|
||
"results = db.similarity_search(\"What did the president say about Ketanji Brown Jackson\", k=4)\n",
|
||
"print(results[0])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "edf3a093",
|
||
"metadata": {},
|
||
"source": [
|
||
"## ExactRetrievalStrategy\n",
|
||
"This strategy uses Elasticsearch's exact retrieval (also known as brute force) to retrieve the top-k results.\n",
|
||
"\n",
|
||
"To use this, specify `ExactRetrievalStrategy` in `ElasticsearchStore` constructor.\n",
|
||
"\n",
|
||
"```python\n",
|
||
"\n",
|
||
"db = ElasticsearchStore.from_documents(\n",
|
||
" docs, \n",
|
||
" embeddings, \n",
|
||
" es_url=\"http://localhost:9200\", \n",
|
||
" index_name=\"test\",\n",
|
||
" strategy=ElasticsearchStore.ExactRetrievalStrategy()\n",
|
||
")\n",
|
||
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "0960fa0a",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Customise the Query\n",
|
||
"With `custom_query` parameter at search, you are able to adjust the query that is used to retrieve documents from Elasticsearch. This is useful if you want to want to use a more complex query, to support linear boosting of fields."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 11,
|
||
"id": "b926a606",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Query Retriever created by the retrieval strategy:\n",
|
||
"{'query': {'bool': {'must': [{'text_expansion': {'vector.tokens': {'model_id': '.elser_model_1', 'model_text': 'What did the president say about Ketanji Brown Jackson'}}}], 'filter': []}}}\n",
|
||
"\n",
|
||
"Query thats actually used in Elasticsearch:\n",
|
||
"{'query': {'match': {'text': 'What did the president say about Ketanji Brown Jackson'}}}\n",
|
||
"\n",
|
||
"Results:\n",
|
||
"page_content='One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.' metadata={'source': '../../modules/state_of_the_union.txt'}\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# Example of a custom query thats just doing a BM25 search on the text field.\n",
|
||
"def custom_query(query_body: dict, query: str):\n",
|
||
" \"\"\"Custom query to be used in Elasticsearch.\n",
|
||
" Args:\n",
|
||
" query_body (dict): Elasticsearch query body.\n",
|
||
" query (str): Query string.\n",
|
||
" Returns:\n",
|
||
" dict: Elasticsearch query body.\n",
|
||
" \"\"\"\n",
|
||
" print(\"Query Retriever created by the retrieval strategy:\")\n",
|
||
" print(query_body)\n",
|
||
" print()\n",
|
||
"\n",
|
||
" new_query_body = {\n",
|
||
" \"query\": {\n",
|
||
" \"match\": {\n",
|
||
" \"text\": query\n",
|
||
" }\n",
|
||
" }\n",
|
||
" }\n",
|
||
"\n",
|
||
" print(\"Query thats actually used in Elasticsearch:\")\n",
|
||
" print(new_query_body)\n",
|
||
" print()\n",
|
||
"\n",
|
||
" return new_query_body\n",
|
||
"\n",
|
||
"results = db.similarity_search(\"What did the president say about Ketanji Brown Jackson\", k=4, custom_query=custom_query)\n",
|
||
"print(\"Results:\")\n",
|
||
"print(results[0])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "604c66ea",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Upgrading to ElasticsearchStore\n",
|
||
"\n",
|
||
"If you're already using Elasticsearch in your langchain based project, you may be using the old implementations: `ElasticVectorSearch` and `ElasticKNNSearch` which are now deprecated. We've introduced a new implementation called `ElasticsearchStore` which is more flexible and easier to use. This notebook will guide you through the process of upgrading to the new implementation.\n",
|
||
"\n",
|
||
"## What's new?\n",
|
||
"\n",
|
||
"The new implementation is now one class called `ElasticsearchStore` which can be used for approx, exact, and ELSER search retrieval, via strategies.\n",
|
||
"\n",
|
||
"## Im using ElasticKNNSearch\n",
|
||
"\n",
|
||
"Old implementation:\n",
|
||
"\n",
|
||
"```python\n",
|
||
"\n",
|
||
"from langchain.vectorstores.elastic_vector_search import ElasticKNNSearch\n",
|
||
"\n",
|
||
"db = ElasticKNNSearch(\n",
|
||
" elasticsearch_url=\"http://localhost:9200\",\n",
|
||
" index_name=\"test_index\",\n",
|
||
" embedding=embedding\n",
|
||
")\n",
|
||
"\n",
|
||
"```\n",
|
||
"\n",
|
||
"New implementation:\n",
|
||
"\n",
|
||
"```python\n",
|
||
"\n",
|
||
"from langchain.vectorstores.elasticsearch import ElasticsearchStore\n",
|
||
"\n",
|
||
"db = ElasticsearchStore(\n",
|
||
" es_url=\"http://localhost:9200\",\n",
|
||
" index_name=\"test_index\",\n",
|
||
" embedding=embedding,\n",
|
||
" # if you use the model_id\n",
|
||
" # strategy=ElasticsearchStore.ApproxRetrievalStrategy( query_model_id=\"test_model\" )\n",
|
||
" # if you use hybrid search\n",
|
||
" # strategy=ElasticsearchStore.ApproxRetrievalStrategy( hybrid=True )\n",
|
||
")\n",
|
||
"\n",
|
||
"```\n",
|
||
"\n",
|
||
"## Im using ElasticVectorSearch\n",
|
||
"\n",
|
||
"Old implementation:\n",
|
||
"\n",
|
||
"```python\n",
|
||
"\n",
|
||
"from langchain.vectorstores.elastic_vector_search import ElasticVectorSearch\n",
|
||
"\n",
|
||
"db = ElasticVectorSearch(\n",
|
||
" elasticsearch_url=\"http://localhost:9200\",\n",
|
||
" index_name=\"test_index\",\n",
|
||
" embedding=embedding\n",
|
||
")\n",
|
||
"\n",
|
||
"```\n",
|
||
"\n",
|
||
"New implementation:\n",
|
||
"\n",
|
||
"```python\n",
|
||
"\n",
|
||
"from langchain.vectorstores.elasticsearch import ElasticsearchStore\n",
|
||
"\n",
|
||
"db = ElasticsearchStore(\n",
|
||
" es_url=\"http://localhost:9200\",\n",
|
||
" index_name=\"test_index\",\n",
|
||
" embedding=embedding,\n",
|
||
" strategy=ElasticsearchStore.ExactRetrievalStrategy()\n",
|
||
")\n",
|
||
"\n",
|
||
"```"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"id": "3cff8421",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"ObjectApiResponse({'acknowledged': True})"
|
||
]
|
||
},
|
||
"execution_count": 6,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"db.client.indices.delete(index='test-metadata, test-elser, test-basic', ignore_unavailable=True, allow_no_indices=True)"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"colab": {
|
||
"provenance": []
|
||
},
|
||
"kernelspec": {
|
||
"display_name": "Python 3 (ipykernel)",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.10.3"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 5
|
||
}
|