mirror of
https://github.com/hwchase17/langchain
synced 2024-11-10 01:10:59 +00:00
8021d2a2ab
Thank you for contributing to LangChain! - Oracle AI Vector Search Oracle AI Vector Search is designed for Artificial Intelligence (AI) workloads that allows you to query data based on semantics, rather than keywords. One of the biggest benefit of Oracle AI Vector Search is that semantic search on unstructured data can be combined with relational search on business data in one single system. This is not only powerful but also significantly more effective because you don't need to add a specialized vector database, eliminating the pain of data fragmentation between multiple systems. - Oracle AI Vector Search is designed for Artificial Intelligence (AI) workloads that allows you to query data based on semantics, rather than keywords. One of the biggest benefit of Oracle AI Vector Search is that semantic search on unstructured data can be combined with relational search on business data in one single system. This is not only powerful but also significantly more effective because you don't need to add a specialized vector database, eliminating the pain of data fragmentation between multiple systems. This Pull Requests Adds the following functionalities Oracle AI Vector Search : Vector Store Oracle AI Vector Search : Document Loader Oracle AI Vector Search : Document Splitter Oracle AI Vector Search : Summary Oracle AI Vector Search : Oracle Embeddings - We have added unit tests and have our own local unit test suite which verifies all the code is correct. We have made sure to add guides for each of the components and one end to end guide that shows how the entire thing runs. - We have made sure that make format and make lint run clean. Additional guidelines: - Make sure optional dependencies are imported within a function. - Please do not add dependencies to pyproject.toml files (even optional ones) unless they are required for unit tests. - Most PRs should not touch more than one package. - Changes should be backwards compatible. - If you are adding something to community, do not re-import it in langchain. If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, hwchase17. --------- Co-authored-by: skmishraoracle <shailendra.mishra@oracle.com> Co-authored-by: hroyofc <harichandan.roy@oracle.com> Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com> Co-authored-by: Bagatur <baskaryan@gmail.com>
873 lines
33 KiB
Plaintext
873 lines
33 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Oracle AI Vector Search with Document Processing\n",
|
|
"Oracle AI Vector Search is designed for Artificial Intelligence (AI) workloads that allows you to query data based on semantics, rather than keywords.\n",
|
|
"One of the biggest benefit of Oracle AI Vector Search is that semantic search on unstructured data can be combined with relational search on business data in one single system. This is not only powerful but also significantly more effective because you don't need to add a specialized vector database, eliminating the pain of data fragmentation between multiple systems.\n",
|
|
"\n",
|
|
"In addition, because Oracle has been building database technologies for so long, your vectors can benefit from all of Oracle Database's most powerful features, like the following:\n",
|
|
"\n",
|
|
" * Partitioning Support\n",
|
|
" * Real Application Clusters scalability\n",
|
|
" * Exadata smart scans\n",
|
|
" * Shard processing across geographically distributed databases\n",
|
|
" * Transactions\n",
|
|
" * Parallel SQL\n",
|
|
" * Disaster recovery\n",
|
|
" * Security\n",
|
|
" * Oracle Machine Learning\n",
|
|
" * Oracle Graph Database\n",
|
|
" * Oracle Spatial and Graph\n",
|
|
" * Oracle Blockchain\n",
|
|
" * JSON\n",
|
|
"\n",
|
|
"This guide demonstrates how Oracle AI Vector Search can be used with Langchain to serve an end-to-end RAG pipeline. This guide goes through examples of:\n",
|
|
"\n",
|
|
" * Loading the documents from various sources using OracleDocLoader\n",
|
|
" * Summarizing them within/outside the database using OracleSummary\n",
|
|
" * Generating embeddings for them within/outside the database using OracleEmbeddings\n",
|
|
" * Chunking them according to different requirements using Advanced Oracle Capabilities from OracleTextSplitter\n",
|
|
" * Storing and Indexing them in a Vector Store and querying them for queries in OracleVS"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Prerequisites\n",
|
|
"\n",
|
|
"Please install Oracle Python Client driver to use Langchain with Oracle AI Vector Search. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# pip install oracledb"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Create Demo User\n",
|
|
"First, create a demo user with all the required privileges. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 37,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Connection successful!\n",
|
|
"User setup done!\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"import sys\n",
|
|
"\n",
|
|
"import oracledb\n",
|
|
"\n",
|
|
"# please update with your username, password, hostname and service_name\n",
|
|
"# please make sure this user has sufficient privileges to perform all below\n",
|
|
"username = \"\"\n",
|
|
"password = \"\"\n",
|
|
"dsn = \"\"\n",
|
|
"\n",
|
|
"try:\n",
|
|
" conn = oracledb.connect(user=username, password=password, dsn=dsn)\n",
|
|
" print(\"Connection successful!\")\n",
|
|
"\n",
|
|
" cursor = conn.cursor()\n",
|
|
" cursor.execute(\n",
|
|
" \"\"\"\n",
|
|
" begin\n",
|
|
" -- drop user\n",
|
|
" begin\n",
|
|
" execute immediate 'drop user testuser cascade';\n",
|
|
" exception\n",
|
|
" when others then\n",
|
|
" dbms_output.put_line('Error setting up user.');\n",
|
|
" end;\n",
|
|
" execute immediate 'create user testuser identified by testuser';\n",
|
|
" execute immediate 'grant connect, unlimited tablespace, create credential, create procedure, create any index to testuser';\n",
|
|
" execute immediate 'create or replace directory DEMO_PY_DIR as ''/scratch/hroy/view_storage/hroy_devstorage/demo/orachain''';\n",
|
|
" execute immediate 'grant read, write on directory DEMO_PY_DIR to public';\n",
|
|
" execute immediate 'grant create mining model to testuser';\n",
|
|
"\n",
|
|
" -- network access\n",
|
|
" begin\n",
|
|
" DBMS_NETWORK_ACL_ADMIN.APPEND_HOST_ACE(\n",
|
|
" host => '*',\n",
|
|
" ace => xs$ace_type(privilege_list => xs$name_list('connect'),\n",
|
|
" principal_name => 'testuser',\n",
|
|
" principal_type => xs_acl.ptype_db));\n",
|
|
" end;\n",
|
|
" end;\n",
|
|
" \"\"\"\n",
|
|
" )\n",
|
|
" print(\"User setup done!\")\n",
|
|
" cursor.close()\n",
|
|
" conn.close()\n",
|
|
"except Exception as e:\n",
|
|
" print(\"User setup failed!\")\n",
|
|
" cursor.close()\n",
|
|
" conn.close()\n",
|
|
" sys.exit(1)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Process Documents using Oracle AI\n",
|
|
"Let's think about a scenario that the users have some documents in Oracle Database or in a file system. They want to use the data for Oracle AI Vector Search using Langchain.\n",
|
|
"\n",
|
|
"For that, the users need to do some document preprocessing. The first step would be to read the documents, generate their summary(if needed) and then chunk/split them if needed. After that, they need to generate the embeddings for those chunks and store into Oracle AI Vector Store. Finally, the users will perform some semantic queries on those data. \n",
|
|
"\n",
|
|
"Oracle AI Vector Search Langchain library provides a range of document processing functionalities including document loading, splitting, generating summary and embeddings.\n",
|
|
"\n",
|
|
"In the following sections, we will go through how to use Oracle AI Langchain APIs to achieve each of these functionalities individually. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Connect to Demo User\n",
|
|
"The following sample code will show how to connect to Oracle Database. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 45,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Connection successful!\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"import sys\n",
|
|
"\n",
|
|
"import oracledb\n",
|
|
"\n",
|
|
"# please update with your username, password, hostname and service_name\n",
|
|
"username = \"\"\n",
|
|
"password = \"\"\n",
|
|
"dsn = \"\"\n",
|
|
"\n",
|
|
"try:\n",
|
|
" conn = oracledb.connect(user=username, password=password, dsn=dsn)\n",
|
|
" print(\"Connection successful!\")\n",
|
|
"except Exception as e:\n",
|
|
" print(\"Connection failed!\")\n",
|
|
" sys.exit(1)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Populate a Demo Table\n",
|
|
"Create a demo table and insert some sample documents."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 46,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Table created and populated.\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"try:\n",
|
|
" cursor = conn.cursor()\n",
|
|
"\n",
|
|
" drop_table_sql = \"\"\"drop table demo_tab\"\"\"\n",
|
|
" cursor.execute(drop_table_sql)\n",
|
|
"\n",
|
|
" create_table_sql = \"\"\"create table demo_tab (id number, data clob)\"\"\"\n",
|
|
" cursor.execute(create_table_sql)\n",
|
|
"\n",
|
|
" insert_row_sql = \"\"\"insert into demo_tab values (:1, :2)\"\"\"\n",
|
|
" rows_to_insert = [\n",
|
|
" (\n",
|
|
" 1,\n",
|
|
" \"If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.\",\n",
|
|
" ),\n",
|
|
" (\n",
|
|
" 2,\n",
|
|
" \"A tablespace can be online (accessible) or offline (not accessible) whenever the database is open.\\nA tablespace is usually online so that its data is available to users. The SYSTEM tablespace and temporary tablespaces cannot be taken offline.\",\n",
|
|
" ),\n",
|
|
" (\n",
|
|
" 3,\n",
|
|
" \"The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table.\\nSometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.\",\n",
|
|
" ),\n",
|
|
" ]\n",
|
|
" cursor.executemany(insert_row_sql, rows_to_insert)\n",
|
|
"\n",
|
|
" conn.commit()\n",
|
|
"\n",
|
|
" print(\"Table created and populated.\")\n",
|
|
" cursor.close()\n",
|
|
"except Exception as e:\n",
|
|
" print(\"Table creation failed.\")\n",
|
|
" cursor.close()\n",
|
|
" conn.close()\n",
|
|
" sys.exit(1)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"\n",
|
|
"\n",
|
|
"Now that we have a demo user and a demo table with some data, we just need to do one more setup. For embedding and summary, we have a few provider options that the users can choose from such as database, 3rd party providers like ocigenai, huggingface, openai, etc. If the users choose to use 3rd party provider, they need to create a credential with corresponding authentication information. On the other hand, if the users choose to use 'database' as provider, they need to load an onnx model to Oracle Database for embeddings; however, for summary, they don't need to do anything."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Load ONNX Model\n",
|
|
"\n",
|
|
"To generate embeddings, Oracle provides a few provider options for users to choose from. The users can choose 'database' provider or some 3rd party providers like OCIGENAI, HuggingFace, etc.\n",
|
|
"\n",
|
|
"***Note*** If the users choose database option, they need to load an ONNX model to Oracle Database. The users do not need to load an ONNX model to Oracle Database if they choose to use 3rd party provider to generate embeddings.\n",
|
|
"\n",
|
|
"One of the core benefits of using an ONNX model is that the users do not need to transfer their data to 3rd party to generate embeddings. And also, since it does not involve any network or REST API calls, it may provide better performance.\n",
|
|
"\n",
|
|
"Here is the sample code to load an ONNX model to Oracle Database:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 47,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"ONNX model loaded.\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from langchain_community.embeddings.oracleai import OracleEmbeddings\n",
|
|
"\n",
|
|
"# please update with your related information\n",
|
|
"# make sure that you have onnx file in the system\n",
|
|
"onnx_dir = \"DEMO_PY_DIR\"\n",
|
|
"onnx_file = \"tinybert.onnx\"\n",
|
|
"model_name = \"demo_model\"\n",
|
|
"\n",
|
|
"try:\n",
|
|
" OracleEmbeddings.load_onnx_model(conn, onnx_dir, onnx_file, model_name)\n",
|
|
" print(\"ONNX model loaded.\")\n",
|
|
"except Exception as e:\n",
|
|
" print(\"ONNX model loading failed!\")\n",
|
|
" sys.exit(1)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Create Credential\n",
|
|
"\n",
|
|
"On the other hand, if the users choose to use 3rd party provider to generate embeddings and summary, they need to create credential to access 3rd party provider's end points.\n",
|
|
"\n",
|
|
"***Note:*** The users do not need to create any credential if they choose to use 'database' provider to generate embeddings and summary. Should the users choose to 3rd party provider, they need to create credential for the 3rd party provider they want to use. \n",
|
|
"\n",
|
|
"Here is a sample example:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"try:\n",
|
|
" cursor = conn.cursor()\n",
|
|
" cursor.execute(\n",
|
|
" \"\"\"\n",
|
|
" declare\n",
|
|
" jo json_object_t;\n",
|
|
" begin\n",
|
|
" -- HuggingFace\n",
|
|
" dbms_vector_chain.drop_credential(credential_name => 'HF_CRED');\n",
|
|
" jo := json_object_t();\n",
|
|
" jo.put('access_token', '<access_token>');\n",
|
|
" dbms_vector_chain.create_credential(\n",
|
|
" credential_name => 'HF_CRED',\n",
|
|
" params => json(jo.to_string));\n",
|
|
"\n",
|
|
" -- OCIGENAI\n",
|
|
" dbms_vector_chain.drop_credential(credential_name => 'OCI_CRED');\n",
|
|
" jo := json_object_t();\n",
|
|
" jo.put('user_ocid','<user_ocid>');\n",
|
|
" jo.put('tenancy_ocid','<tenancy_ocid>');\n",
|
|
" jo.put('compartment_ocid','<compartment_ocid>');\n",
|
|
" jo.put('private_key','<private_key>');\n",
|
|
" jo.put('fingerprint','<fingerprint>');\n",
|
|
" dbms_vector_chain.create_credential(\n",
|
|
" credential_name => 'OCI_CRED',\n",
|
|
" params => json(jo.to_string));\n",
|
|
" end;\n",
|
|
" \"\"\"\n",
|
|
" )\n",
|
|
" cursor.close()\n",
|
|
" print(\"Credentials created.\")\n",
|
|
"except Exception as ex:\n",
|
|
" cursor.close()\n",
|
|
" raise"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Load Documents\n",
|
|
"The users can load the documents from Oracle Database or a file system or both. They just need to set the loader parameters accordingly. Please refer to the Oracle AI Vector Search Guide book for complete information about these parameters.\n",
|
|
"\n",
|
|
"The main benefit of using OracleDocLoader is that it can handle 150+ different file formats. You don't need to use different types of loader for different file formats. Here is the list formats that we support: [Oracle Text Supported Document Formats](https://docs.oracle.com/en/database/oracle/oracle-database/23/ccref/oracle-text-supported-document-formats.html)\n",
|
|
"\n",
|
|
"The following sample code will show how to do that:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 48,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Number of docs loaded: 3\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from langchain_community.document_loaders.oracleai import OracleDocLoader\n",
|
|
"from langchain_core.documents import Document\n",
|
|
"\n",
|
|
"# loading from Oracle Database table\n",
|
|
"# make sure you have the table with this specification\n",
|
|
"loader_params = {}\n",
|
|
"loader_params = {\n",
|
|
" \"owner\": \"testuser\",\n",
|
|
" \"tablename\": \"demo_tab\",\n",
|
|
" \"colname\": \"data\",\n",
|
|
"}\n",
|
|
"\n",
|
|
"\"\"\" load the docs \"\"\"\n",
|
|
"loader = OracleDocLoader(conn=conn, params=loader_params)\n",
|
|
"docs = loader.load()\n",
|
|
"\n",
|
|
"\"\"\" verify \"\"\"\n",
|
|
"print(f\"Number of docs loaded: {len(docs)}\")\n",
|
|
"# print(f\"Document-0: {docs[0].page_content}\") # content"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Generate Summary\n",
|
|
"Now that the user loaded the documents, they may want to generate a summary for each document. The Oracle AI Vector Search Langchain library provides an API to do that. There are a few summary generation provider options including Database, OCIGENAI, HuggingFace and so on. The users can choose their preferred provider to generate a summary. Like before, they just need to set the summary parameters accordingly. Please refer to the Oracle AI Vector Search Guide book for complete information about these parameters."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"***Note:*** The users may need to set proxy if they want to use some 3rd party summary generation providers other than Oracle's in-house and default provider: 'database'. If you don't have proxy, please remove the proxy parameter when you instantiate the OracleSummary."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 22,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# proxy to be used when we instantiate summary and embedder object\n",
|
|
"proxy = \"\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The following sample code will show how to generate summary:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 49,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Number of Summaries: 3\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from langchain_community.utilities.oracleai import OracleSummary\n",
|
|
"from langchain_core.documents import Document\n",
|
|
"\n",
|
|
"# using 'database' provider\n",
|
|
"summary_params = {\n",
|
|
" \"provider\": \"database\",\n",
|
|
" \"glevel\": \"S\",\n",
|
|
" \"numParagraphs\": 1,\n",
|
|
" \"language\": \"english\",\n",
|
|
"}\n",
|
|
"\n",
|
|
"# get the summary instance\n",
|
|
"# Remove proxy if not required\n",
|
|
"summ = OracleSummary(conn=conn, params=summary_params, proxy=proxy)\n",
|
|
"\n",
|
|
"list_summary = []\n",
|
|
"for doc in docs:\n",
|
|
" summary = summ.get_summary(doc.page_content)\n",
|
|
" list_summary.append(summary)\n",
|
|
"\n",
|
|
"\"\"\" verify \"\"\"\n",
|
|
"print(f\"Number of Summaries: {len(list_summary)}\")\n",
|
|
"# print(f\"Summary-0: {list_summary[0]}\") #content"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Split Documents\n",
|
|
"The documents can be in different sizes: small, medium, large, or very large. The users like to split/chunk their documents into smaller pieces to generate embeddings. There are lots of different splitting customizations the users can do. Please refer to the Oracle AI Vector Search Guide book for complete information about these parameters.\n",
|
|
"\n",
|
|
"The following sample code will show how to do that:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 50,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Number of Chunks: 3\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from langchain_community.document_loaders.oracleai import OracleTextSplitter\n",
|
|
"from langchain_core.documents import Document\n",
|
|
"\n",
|
|
"# split by default parameters\n",
|
|
"splitter_params = {\"normalize\": \"all\"}\n",
|
|
"\n",
|
|
"\"\"\" get the splitter instance \"\"\"\n",
|
|
"splitter = OracleTextSplitter(conn=conn, params=splitter_params)\n",
|
|
"\n",
|
|
"list_chunks = []\n",
|
|
"for doc in docs:\n",
|
|
" chunks = splitter.split_text(doc.page_content)\n",
|
|
" list_chunks.extend(chunks)\n",
|
|
"\n",
|
|
"\"\"\" verify \"\"\"\n",
|
|
"print(f\"Number of Chunks: {len(list_chunks)}\")\n",
|
|
"# print(f\"Chunk-0: {list_chunks[0]}\") # content"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Generate Embeddings\n",
|
|
"Now that the documents are chunked as per requirements, the users may want to generate embeddings for these chunks. Oracle AI Vector Search provides a number of ways to generate embeddings. The users can load an ONNX embedding model to Oracle Database and use it to generate embeddings or use some 3rd party API's end points to generate embeddings. Please refer to the Oracle AI Vector Search Guide book for complete information about these parameters."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"***Note:*** The users may need to set proxy if they want to use some 3rd party embedding generation providers other than 'database' provider (aka using ONNX model)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 12,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# proxy to be used when we instantiate summary and embedder object\n",
|
|
"proxy = \"\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The following sample code will show how to generate embeddings:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 51,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Number of embeddings: 3\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from langchain_community.embeddings.oracleai import OracleEmbeddings\n",
|
|
"from langchain_core.documents import Document\n",
|
|
"\n",
|
|
"# using ONNX model loaded to Oracle Database\n",
|
|
"embedder_params = {\"provider\": \"database\", \"model\": \"demo_model\"}\n",
|
|
"\n",
|
|
"# get the embedding instance\n",
|
|
"# Remove proxy if not required\n",
|
|
"embedder = OracleEmbeddings(conn=conn, params=embedder_params, proxy=proxy)\n",
|
|
"\n",
|
|
"embeddings = []\n",
|
|
"for doc in docs:\n",
|
|
" chunks = splitter.split_text(doc.page_content)\n",
|
|
" for chunk in chunks:\n",
|
|
" embed = embedder.embed_query(chunk)\n",
|
|
" embeddings.append(embed)\n",
|
|
"\n",
|
|
"\"\"\" verify \"\"\"\n",
|
|
"print(f\"Number of embeddings: {len(embeddings)}\")\n",
|
|
"# print(f\"Embedding-0: {embeddings[0]}\") # content"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Create Oracle AI Vector Store\n",
|
|
"Now that you know how to use Oracle AI Langchain library APIs individually to process the documents, let us show how to integrate with Oracle AI Vector Store to facilitate the semantic searches."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"First, let's import all the dependencies."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 52,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import sys\n",
|
|
"\n",
|
|
"import oracledb\n",
|
|
"from langchain_community.document_loaders.oracleai import (\n",
|
|
" OracleDocLoader,\n",
|
|
" OracleTextSplitter,\n",
|
|
")\n",
|
|
"from langchain_community.embeddings.oracleai import OracleEmbeddings\n",
|
|
"from langchain_community.utilities.oracleai import OracleSummary\n",
|
|
"from langchain_community.vectorstores import oraclevs\n",
|
|
"from langchain_community.vectorstores.oraclevs import OracleVS\n",
|
|
"from langchain_community.vectorstores.utils import DistanceStrategy\n",
|
|
"from langchain_core.documents import Document"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Next, let's combine all document processing stages together. Here is the sample code below:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 53,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Connection successful!\n",
|
|
"ONNX model loaded.\n",
|
|
"Number of total chunks with metadata: 3\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"\"\"\"\n",
|
|
"In this sample example, we will use 'database' provider for both summary and embeddings.\n",
|
|
"So, we don't need to do the followings:\n",
|
|
" - set proxy for 3rd party providers\n",
|
|
" - create credential for 3rd party providers\n",
|
|
"\n",
|
|
"If you choose to use 3rd party provider, \n",
|
|
"please follow the necessary steps for proxy and credential.\n",
|
|
"\"\"\"\n",
|
|
"\n",
|
|
"# oracle connection\n",
|
|
"# please update with your username, password, hostname, and service_name\n",
|
|
"username = \"\"\n",
|
|
"password = \"\"\n",
|
|
"dsn = \"\"\n",
|
|
"\n",
|
|
"try:\n",
|
|
" conn = oracledb.connect(user=username, password=password, dsn=dsn)\n",
|
|
" print(\"Connection successful!\")\n",
|
|
"except Exception as e:\n",
|
|
" print(\"Connection failed!\")\n",
|
|
" sys.exit(1)\n",
|
|
"\n",
|
|
"\n",
|
|
"# load onnx model\n",
|
|
"# please update with your related information\n",
|
|
"onnx_dir = \"DEMO_PY_DIR\"\n",
|
|
"onnx_file = \"tinybert.onnx\"\n",
|
|
"model_name = \"demo_model\"\n",
|
|
"try:\n",
|
|
" OracleEmbeddings.load_onnx_model(conn, onnx_dir, onnx_file, model_name)\n",
|
|
" print(\"ONNX model loaded.\")\n",
|
|
"except Exception as e:\n",
|
|
" print(\"ONNX model loading failed!\")\n",
|
|
" sys.exit(1)\n",
|
|
"\n",
|
|
"\n",
|
|
"# params\n",
|
|
"# please update necessary fields with related information\n",
|
|
"loader_params = {\n",
|
|
" \"owner\": \"testuser\",\n",
|
|
" \"tablename\": \"demo_tab\",\n",
|
|
" \"colname\": \"data\",\n",
|
|
"}\n",
|
|
"summary_params = {\n",
|
|
" \"provider\": \"database\",\n",
|
|
" \"glevel\": \"S\",\n",
|
|
" \"numParagraphs\": 1,\n",
|
|
" \"language\": \"english\",\n",
|
|
"}\n",
|
|
"splitter_params = {\"normalize\": \"all\"}\n",
|
|
"embedder_params = {\"provider\": \"database\", \"model\": \"demo_model\"}\n",
|
|
"\n",
|
|
"# instantiate loader, summary, splitter, and embedder\n",
|
|
"loader = OracleDocLoader(conn=conn, params=loader_params)\n",
|
|
"summary = OracleSummary(conn=conn, params=summary_params)\n",
|
|
"splitter = OracleTextSplitter(conn=conn, params=splitter_params)\n",
|
|
"embedder = OracleEmbeddings(conn=conn, params=embedder_params)\n",
|
|
"\n",
|
|
"# process the documents\n",
|
|
"chunks_with_mdata = []\n",
|
|
"for id, doc in enumerate(docs, start=1):\n",
|
|
" summ = summary.get_summary(doc.page_content)\n",
|
|
" chunks = splitter.split_text(doc.page_content)\n",
|
|
" for ic, chunk in enumerate(chunks, start=1):\n",
|
|
" chunk_metadata = doc.metadata.copy()\n",
|
|
" chunk_metadata[\"id\"] = chunk_metadata[\"_oid\"] + \"$\" + str(id) + \"$\" + str(ic)\n",
|
|
" chunk_metadata[\"document_id\"] = str(id)\n",
|
|
" chunk_metadata[\"document_summary\"] = str(summ[0])\n",
|
|
" chunks_with_mdata.append(\n",
|
|
" Document(page_content=str(chunk), metadata=chunk_metadata)\n",
|
|
" )\n",
|
|
"\n",
|
|
"\"\"\" verify \"\"\"\n",
|
|
"print(f\"Number of total chunks with metadata: {len(chunks_with_mdata)}\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"At this point, we have processed the documents and generated chunks with metadata. Next, we will create Oracle AI Vector Store with those chunks.\n",
|
|
"\n",
|
|
"Here is the sample code how to do that:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 55,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Vector Store Table: oravs\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# create Oracle AI Vector Store\n",
|
|
"vectorstore = OracleVS.from_documents(\n",
|
|
" chunks_with_mdata,\n",
|
|
" embedder,\n",
|
|
" client=conn,\n",
|
|
" table_name=\"oravs\",\n",
|
|
" distance_strategy=DistanceStrategy.DOT_PRODUCT,\n",
|
|
")\n",
|
|
"\n",
|
|
"\"\"\" verify \"\"\"\n",
|
|
"print(f\"Vector Store Table: {vectorstore.table_name}\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The above example creates a vector store with DOT_PRODUCT distance strategy. \n",
|
|
"\n",
|
|
"However, the users can create Oracle AI Vector Store provides different distance strategies. Please see the [comprehensive guide](/docs/integrations/vectorstores/oracle) for more information."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now that we have embeddings stored in vector stores, let's create an index on them to get better semantic search performance during query time.\n",
|
|
"\n",
|
|
"***Note*** If you are getting some insufficient memory error, please increase ***vector_memory_size*** in your database.\n",
|
|
"\n",
|
|
"Here is the sample code to create an index:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 56,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"oraclevs.create_index(\n",
|
|
" conn, vectorstore, params={\"idx_name\": \"hnsw_oravs\", \"idx_type\": \"HNSW\"}\n",
|
|
")\n",
|
|
"\n",
|
|
"print(\"Index created.\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The above example creates a default HNSW index on the embeddings stored in 'oravs' table. The users can set different parameters as per their requirements. Please refer to the Oracle AI Vector Search Guide book for complete information about these parameters.\n",
|
|
"\n",
|
|
"Also, there are different types of vector indices that the users can create. Please see the [comprehensive guide](/docs/integrations/vectorstores/oracle) for more information.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Perform Semantic Search\n",
|
|
"All set!\n",
|
|
"\n",
|
|
"We have processed the documents, stored them to vector store, and then created index to get better query performance. Now let's do some semantic searches.\n",
|
|
"\n",
|
|
"Here is the sample code for this:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 58,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"[Document(page_content='The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table. Sometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.', metadata={'_oid': '662f2f257677f3c2311a8ff999fd34e5', '_rowid': 'AAAR/xAAEAAAAAnAAC', 'id': '662f2f257677f3c2311a8ff999fd34e5$3$1', 'document_id': '3', 'document_summary': 'Sometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.\\n\\n'})]\n",
|
|
"[]\n",
|
|
"[(Document(page_content='The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table. Sometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.', metadata={'_oid': '662f2f257677f3c2311a8ff999fd34e5', '_rowid': 'AAAR/xAAEAAAAAnAAC', 'id': '662f2f257677f3c2311a8ff999fd34e5$3$1', 'document_id': '3', 'document_summary': 'Sometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.\\n\\n'}), 0.055675752460956573)]\n",
|
|
"[]\n",
|
|
"[Document(page_content='If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.', metadata={'_oid': '662f2f253acf96b33b430b88699490a2', '_rowid': 'AAAR/xAAEAAAAAnAAA', 'id': '662f2f253acf96b33b430b88699490a2$1$1', 'document_id': '1', 'document_summary': 'If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.\\n\\n'})]\n",
|
|
"[Document(page_content='If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.', metadata={'_oid': '662f2f253acf96b33b430b88699490a2', '_rowid': 'AAAR/xAAEAAAAAnAAA', 'id': '662f2f253acf96b33b430b88699490a2$1$1', 'document_id': '1', 'document_summary': 'If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.\\n\\n'})]\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"query = \"What is Oracle AI Vector Store?\"\n",
|
|
"filter = {\"document_id\": [\"1\"]}\n",
|
|
"\n",
|
|
"# Similarity search without a filter\n",
|
|
"print(vectorstore.similarity_search(query, 1))\n",
|
|
"\n",
|
|
"# Similarity search with a filter\n",
|
|
"print(vectorstore.similarity_search(query, 1, filter=filter))\n",
|
|
"\n",
|
|
"# Similarity search with relevance score\n",
|
|
"print(vectorstore.similarity_search_with_score(query, 1))\n",
|
|
"\n",
|
|
"# Similarity search with relevance score with filter\n",
|
|
"print(vectorstore.similarity_search_with_score(query, 1, filter=filter))\n",
|
|
"\n",
|
|
"# Max marginal relevance search\n",
|
|
"print(vectorstore.max_marginal_relevance_search(query, 1, fetch_k=20, lambda_mult=0.5))\n",
|
|
"\n",
|
|
"# Max marginal relevance search with filter\n",
|
|
"print(\n",
|
|
" vectorstore.max_marginal_relevance_search(\n",
|
|
" query, 1, fetch_k=20, lambda_mult=0.5, filter=filter\n",
|
|
" )\n",
|
|
")"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.11.9"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 4
|
|
}
|