{
"cells": [
{
"cell_type": "markdown",
"id": "683953b3",
"metadata": {},
"source": [
"# OpenSearch\n",
"\n",
"> [OpenSearch](https://opensearch.org/) is a scalable, flexible, and extensible open-source software suite for search, analytics, and observability applications licensed under Apache 2.0. `OpenSearch` is a distributed search and analytics engine based on `Apache Lucene`.\n",
"\n",
"\n",
"This notebook shows how to use functionality related to the `OpenSearch` database.\n",
"\n",
"To run, you should have an OpenSearch instance up and running: [see here for an easy Docker installation](https://hub.docker.com/r/opensearchproject/opensearch).\n",
"\n",
"By default, `similarity_search` performs Approximate k-NN search, which uses one of several engines (`lucene`, `nmslib`, or `faiss`) and is recommended for\n",
"large datasets. For brute-force (exact) search, two other search methods are available: Script Scoring and Painless Scripting.\n",
"See the [k-NN index documentation](https://opensearch.org/docs/latest/search-plugins/knn/index/) for more details."
]
},
{
"cell_type": "markdown",
"id": "94963977-9dfc-48b7-872a-53f2947f46c6",
"metadata": {},
"source": [
"## Installation\n",
"Install the Python client."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e606066-9386-4427-8a87-1b93f435c57e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%pip install --upgrade --quiet opensearch-py"
]
},
{
"cell_type": "markdown",
"id": "b1fa637e-4fbf-4d5a-9188-2cad826a193e",
"metadata": {},
"source": [
"This notebook uses `OpenAIEmbeddings`, so you need to provide an OpenAI API key."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "28e5455e-322d-4010-9e3b-491d522ef5db",
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "aac9563e",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import TextLoader\n",
"from langchain_community.vectorstores import OpenSearchVectorSearch\n",
"from langchain_openai import OpenAIEmbeddings\n",
"from langchain_text_splitters import CharacterTextSplitter"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3c3999a",
"metadata": {},
"outputs": [],
"source": [
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embeddings = OpenAIEmbeddings()"
]
},
{
"cell_type": "markdown",
"id": "01a9a035",
"metadata": {},
"source": [
"## similarity_search using Approximate k-NN\n",
"\n",
"Run `similarity_search` with `Approximate k-NN` search, first with the default settings and then with custom parameters."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "803fe12b",
"metadata": {},
"outputs": [],
"source": [
"docsearch = OpenSearchVectorSearch.from_documents(\n",
" docs, embeddings, opensearch_url=\"http://localhost:9200\"\n",
")\n",
"\n",
"# If using the default Docker installation, use this instantiation instead:\n",
"# docsearch = OpenSearchVectorSearch.from_documents(\n",
"# docs,\n",
"# embeddings,\n",
"# opensearch_url=\"https://localhost:9200\",\n",
"# http_auth=(\"admin\", \"admin\"),\n",
"#     use_ssl=False,\n",
"#     verify_certs=False,\n",
"#     ssl_assert_hostname=False,\n",
"#     ssl_show_warn=False,\n",
"# )"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "db3fa309",
"metadata": {},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = docsearch.similarity_search(query, k=10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c160d5bb",
"metadata": {},
"outputs": [],
"source": [
"print(docs[0].page_content)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "96215c90",
"metadata": {},
"outputs": [],
"source": [
"docsearch = OpenSearchVectorSearch.from_documents(\n",
" docs,\n",
" embeddings,\n",
" opensearch_url=\"http://localhost:9200\",\n",
" engine=\"faiss\",\n",
" space_type=\"innerproduct\",\n",
" ef_construction=256,\n",
" m=48,\n",
")\n",
"\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = docsearch.similarity_search(query)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "62a7cea0",
"metadata": {},
"outputs": [],
"source": [
"print(docs[0].page_content)"
]
},
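{
"cell_type": "markdown",
"id": "space-type-sketch-md",
"metadata": {},
"source": [
"The `space_type` argument used above selects the measure for comparing vectors. As a rough illustration only (this is not how OpenSearch computes scores internally), the sketch below evaluates three common measures on toy vectors with NumPy: L2 distance, inner product, and cosine similarity. The helper name `compare_spaces` and the vectors are hypothetical. Note that scaling a vector changes its L2 distance and inner product but not its cosine similarity."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "space-type-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def compare_spaces(a, b):\n",
"    # Hypothetical helper: compare two vectors under three measures.\n",
"    a = np.asarray(a, dtype=float)\n",
"    b = np.asarray(b, dtype=float)\n",
"    return {\n",
"        # Euclidean distance: smaller means closer.\n",
"        \"l2\": float(np.linalg.norm(a - b)),\n",
"        # Dot product: larger means closer.\n",
"        \"innerproduct\": float(a @ b),\n",
"        # Cosine of the angle between the vectors: larger means closer.\n",
"        \"cosinesimil\": float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))),\n",
"    }\n",
"\n",
"\n",
"# Two toy vectors pointing in the same direction with different lengths.\n",
"print(compare_spaces([1.0, 0.0], [2.0, 0.0]))"
]
},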
{
"cell_type": "markdown",
"id": "0d0cd877",
"metadata": {},
"source": [
"## similarity_search using Script Scoring\n",
"\n",
"Run `similarity_search` with `Script Scoring` (brute-force search) and custom parameters."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a8e3c0e",
"metadata": {},
"outputs": [],
"source": [
"docsearch = OpenSearchVectorSearch.from_documents(\n",
" docs, embeddings, opensearch_url=\"http://localhost:9200\", is_appx_search=False\n",
")\n",
"\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = docsearch.similarity_search(\n",
"    query,\n",
" k=1,\n",
" search_type=\"script_scoring\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "92bc40db",
"metadata": {},
"outputs": [],
"source": [
"print(docs[0].page_content)"
]
},
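{
"cell_type": "markdown",
"id": "brute-force-sketch-md",
"metadata": {},
"source": [
"Conceptually, a brute-force search such as Script Scoring compares the query vector with every stored vector and keeps the best `k`. The cell below is a minimal NumPy sketch of that idea using cosine similarity on toy vectors. It is an illustration under stated assumptions, not OpenSearch's implementation, and the name `brute_force_knn` is hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "brute-force-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def brute_force_knn(query_vec, doc_vecs, k=1):\n",
"    # Hypothetical helper: exact k-NN by scoring every document vector.\n",
"    docs = np.asarray(doc_vecs, dtype=float)\n",
"    query = np.asarray(query_vec, dtype=float)\n",
"    # Cosine similarity is the dot product of L2-normalized vectors.\n",
"    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)\n",
"    query = query / np.linalg.norm(query)\n",
"    scores = docs @ query\n",
"    # Sort by descending score and keep the top k indices.\n",
"    top = np.argsort(-scores)[:k]\n",
"    return top.tolist(), scores[top].tolist()\n",
"\n",
"\n",
"toy_docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]\n",
"indices, scores = brute_force_knn([1.0, 0.1], toy_docs, k=2)\n",
"print(indices, scores)"
]
},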
{
"cell_type": "markdown",
"id": "a4af96cc",
"metadata": {},
"source": [
"## similarity_search using Painless Scripting\n",
"\n",
"Run `similarity_search` with `Painless Scripting` (brute-force search), custom parameters, and a pre-filter."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d9f436e",
"metadata": {},
"outputs": [],
"source": [
"docsearch = OpenSearchVectorSearch.from_documents(\n",
" docs, embeddings, opensearch_url=\"http://localhost:9200\", is_appx_search=False\n",
")\n",
"filter = {\"bool\": {\"filter\": {\"term\": {\"text\": \"smuggling\"}}}}\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = docsearch.similarity_search(\n",
"    query,\n",
" search_type=\"painless_scripting\",\n",
" space_type=\"cosineSimilarity\",\n",
" pre_filter=filter,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8ca50bce",
"metadata": {},
"outputs": [],
"source": [
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"id": "4f8fb0d0",
"metadata": {},
"source": [
"## Maximal marginal relevance search (MMR)\n",
"If you'd like to look up similar documents but also receive diverse results, MMR is a method you should consider. Maximal marginal relevance optimizes for similarity to the query and diversity among the selected documents."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ba85e092",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = docsearch.max_marginal_relevance_search(query, k=2, fetch_k=10, lambda_param=0.5)"
]
},
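{
"cell_type": "markdown",
"id": "mmr-sketch-md",
"metadata": {},
"source": [
"To make the trade-off concrete, here is a minimal NumPy sketch of greedy MMR selection on toy vectors. It assumes cosine similarity and the usual MMR score, `lambda_param * relevance - (1 - lambda_param) * redundancy`. The function name `mmr_select` is hypothetical, and this is not necessarily the exact code path used by `max_marginal_relevance_search`. With `lambda_param=1.0` the selection reduces to plain similarity ranking, while lower values favor documents that differ from those already picked."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "mmr-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def mmr_select(query_vec, doc_vecs, k=2, lambda_param=0.5):\n",
"    # Hypothetical helper: greedy maximal marginal relevance selection.\n",
"    docs = np.asarray(doc_vecs, dtype=float)\n",
"    query = np.asarray(query_vec, dtype=float)\n",
"    # Normalize so that dot products are cosine similarities.\n",
"    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)\n",
"    query = query / np.linalg.norm(query)\n",
"    relevance = docs @ query\n",
"    # Start with the single most relevant document.\n",
"    selected = [int(np.argmax(relevance))]\n",
"    while len(selected) < min(k, len(docs)):\n",
"        # Redundancy: highest similarity to any already selected document.\n",
"        redundancy = (docs @ docs[selected].T).max(axis=1)\n",
"        mmr = lambda_param * relevance - (1 - lambda_param) * redundancy\n",
"        # Never pick the same document twice.\n",
"        mmr[selected] = -np.inf\n",
"        selected.append(int(np.argmax(mmr)))\n",
"    return selected\n",
"\n",
"\n",
"# The first two toy vectors are near-duplicates; the third is different.\n",
"toy_docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]\n",
"print(mmr_select([1.0, 0.2], toy_docs, k=2, lambda_param=0.5))\n",
"print(mmr_select([1.0, 0.2], toy_docs, k=2, lambda_param=1.0))"
]
},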
{
"cell_type": "markdown",
"id": "73264864",
"metadata": {},
"source": [
"## Using a preexisting OpenSearch instance\n",
"\n",
"It's also possible to use a preexisting OpenSearch instance with documents that already have vectors present."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "82a23440",
"metadata": {},
"outputs": [],
"source": [
"# This is just an example; change these values to point to your own OpenSearch instance.\n",
"docsearch = OpenSearchVectorSearch(\n",
" index_name=\"index-*\",\n",
" embedding_function=embeddings,\n",
" opensearch_url=\"http://localhost:9200\",\n",
")\n",
"\n",
"# you can specify custom field names to match the fields you're using to store your embedding, document text value, and metadata\n",
"docs = docsearch.similarity_search(\n",
" \"Who was asking about getting lunch today?\",\n",
" search_type=\"script_scoring\",\n",
" space_type=\"cosinesimil\",\n",
" vector_field=\"message_embedding\",\n",
" text_field=\"message\",\n",
" metadata_field=\"message_metadata\",\n",
")"
]
},
{
"cell_type": "markdown",
"id": "5f590d35",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Using AOSS (Amazon OpenSearch Service Serverless)\n",
"\n",
"This is an example of using `AOSS` with the `faiss` engine and `efficient_filter`.\n",
"\n",
"First, install the required Python packages."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "279bfc5c-b7f4-4553-ad15-2df7baebec47",
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet boto3 requests requests-aws4auth"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "de397be7",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"import boto3\n",
"from opensearchpy import RequestsHttpConnection\n",
"from requests_aws4auth import AWS4Auth\n",
"\n",
"service = \"aoss\" # must set the service as 'aoss'\n",
"region = \"us-east-2\"\n",
"credentials = boto3.Session(\n",
" aws_access_key_id=\"xxxxxx\", aws_secret_access_key=\"xxxxx\"\n",
").get_credentials()\n",
"awsauth = AWS4Auth(\"xxxxx\", \"xxxxxx\", region, service, session_token=credentials.token)\n",
"\n",
"docsearch = OpenSearchVectorSearch.from_documents(\n",
" docs,\n",
" embeddings,\n",
" opensearch_url=\"host url\",\n",
" http_auth=awsauth,\n",
" timeout=300,\n",
" use_ssl=True,\n",
" verify_certs=True,\n",
" connection_class=RequestsHttpConnection,\n",
" index_name=\"test-index-using-aoss\",\n",
" engine=\"faiss\",\n",
")\n",
"\n",
"docs = docsearch.similarity_search(\n",
" \"What is feature selection\",\n",
" efficient_filter=filter,\n",
" k=200,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "0aa012c8",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"## Using AOS (Amazon OpenSearch Service)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4b02cd8d-f182-476b-935a-737f9c05d8e4",
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet boto3 requests-aws4auth"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2c47e408",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# This is just an example of how to use Amazon OpenSearch Service; you need to set proper values.\n",
"import boto3\n",
"from opensearchpy import RequestsHttpConnection\n",
"from requests_aws4auth import AWS4Auth\n",
"\n",
"service = \"es\" # must set the service as 'es'\n",
"region = \"us-east-2\"\n",
"credentials = boto3.Session(\n",
" aws_access_key_id=\"xxxxxx\", aws_secret_access_key=\"xxxxx\"\n",
").get_credentials()\n",
"awsauth = AWS4Auth(\"xxxxx\", \"xxxxxx\", region, service, session_token=credentials.token)\n",
"\n",
"docsearch = OpenSearchVectorSearch.from_documents(\n",
" docs,\n",
" embeddings,\n",
" opensearch_url=\"host url\",\n",
" http_auth=awsauth,\n",
" timeout=300,\n",
" use_ssl=True,\n",
" verify_certs=True,\n",
" connection_class=RequestsHttpConnection,\n",
" index_name=\"test-index\",\n",
")\n",
"\n",
"docs = docsearch.similarity_search(\n",
" \"What is feature selection\",\n",
" k=200,\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}