mirror of
https://github.com/hwchase17/langchain
synced 2024-10-31 15:20:26 +00:00
1f751343e2
Thank you for contributing to LangChain! - [ ] **PR title**: "package: description" - Where "package" is whichever of langchain, community, core, experimental, etc. is being modified. Use "docs: ..." for purely docs changes, "templates: ..." for template changes, "infra: ..." for CI changes. - Example: "community: add foobar LLM" "community/embeddings: update oracleai.py" - [ ] **PR message**: ***Delete this entire checklist*** and replace with - **Description:** a description of the change - **Issue:** the issue # it fixes, if applicable - **Dependencies:** any dependencies required for this change - **Twitter handle:** if your PR gets announced, and you'd like a mention, we'll gladly shout you out! Adding oracle VECTOR_ARRAY_T support. - [ ] **Add tests and docs**: If you're adding a new integration, please include 1. a test for the integration, preferably unit tests that do not rely on network access, 2. an example notebook showing its use. It lives in `docs/docs/integrations` directory. Tests are not impacted. - [ ] **Lint and test**: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified. See contribution guidelines for more: https://python.langchain.com/docs/contributing/ Done. Additional guidelines: - Make sure optional dependencies are imported within a function. - Please do not add dependencies to pyproject.toml files (even optional ones) unless they are required for unit tests. - Most PRs should not touch more than one package. - Changes should be backwards compatible. - If you are adding something to community, do not re-import it in langchain. If no one reviews your PR within a few days, please @-mention one of baskaryan, efriis, eyurtsev, ccurme, vbarda, hwchase17.
877 lines
37 KiB
Plaintext
877 lines
37 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Oracle AI Vector Search with Document Processing\n",
|
||
"Oracle AI Vector Search is designed for Artificial Intelligence (AI) workloads that allows you to query data based on semantics, rather than keywords.\n",
|
||
"One of the biggest benefits of Oracle AI Vector Search is that semantic search on unstructured data can be combined with relational search on business data in one single system.\n",
|
||
"This is not only powerful but also significantly more effective because you don't need to add a specialized vector database, eliminating the pain of data fragmentation between multiple systems.\n",
|
||
"\n",
|
||
"In addition, your vectors can benefit from all of Oracle Database’s most powerful features, like the following:\n",
|
||
"\n",
|
||
" * [Partitioning Support](https://www.oracle.com/database/technologies/partitioning.html)\n",
|
||
" * [Real Application Clusters scalability](https://www.oracle.com/database/real-application-clusters/)\n",
|
||
" * [Exadata smart scans](https://www.oracle.com/database/technologies/exadata/software/smartscan/)\n",
|
||
" * [Shard processing across geographically distributed databases](https://www.oracle.com/database/distributed-database/)\n",
|
||
" * [Transactions](https://docs.oracle.com/en/database/oracle/oracle-database/23/cncpt/transactions.html)\n",
|
||
" * [Parallel SQL](https://docs.oracle.com/en/database/oracle/oracle-database/21/vldbg/parallel-exec-intro.html#GUID-D28717E4-0F77-44F5-BB4E-234C31D4E4BA)\n",
|
||
" * [Disaster recovery](https://www.oracle.com/database/data-guard/)\n",
|
||
" * [Security](https://www.oracle.com/security/database-security/)\n",
|
||
" * [Oracle Machine Learning](https://www.oracle.com/artificial-intelligence/database-machine-learning/)\n",
|
||
" * [Oracle Graph Database](https://www.oracle.com/database/integrated-graph-database/)\n",
|
||
" * [Oracle Spatial and Graph](https://www.oracle.com/database/spatial/)\n",
|
||
" * [Oracle Blockchain](https://docs.oracle.com/en/database/oracle/oracle-database/23/arpls/dbms_blockchain_table.html#GUID-B469E277-978E-4378-A8C1-26D3FF96C9A6)\n",
|
||
" * [JSON](https://docs.oracle.com/en/database/oracle/oracle-database/23/adjsn/json-in-oracle-database.html)\n",
|
||
"\n",
|
||
"This guide demonstrates how Oracle AI Vector Search can be used with Langchain to serve an end-to-end RAG pipeline. This guide goes through examples of:\n",
|
||
"\n",
|
||
" * Loading the documents from various sources using OracleDocLoader\n",
|
||
" * Summarizing them within/outside the database using OracleSummary\n",
|
||
" * Generating embeddings for them within/outside the database using OracleEmbeddings\n",
|
||
" * Chunking them according to different requirements using Advanced Oracle Capabilities from OracleTextSplitter\n",
|
||
" * Storing and Indexing them in a Vector Store and querying them for queries in OracleVS"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"If you are just starting with Oracle Database, consider exploring the [free Oracle 23 AI](https://www.oracle.com/database/free/#resources) which provides a great introduction to setting up your database environment. While working with the database, it is often advisable to avoid using the system user by default; instead, you can create your own user for enhanced security and customization. For detailed steps on user creation, refer to our [end-to-end guide](https://github.com/langchain-ai/langchain/blob/master/cookbook/oracleai_demo.ipynb) which also shows how to set up a user in Oracle. Additionally, understanding user privileges is crucial for managing database security effectively. You can learn more about this topic in the official [Oracle guide](https://docs.oracle.com/en/database/oracle/oracle-database/19/admqs/administering-user-accounts-and-security.html#GUID-36B21D72-1BBB-46C9-A0C9-F0D2A8591B8D) on administering user accounts and security."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Prerequisites\n",
|
||
"\n",
|
||
"Please install Oracle Python Client driver to use Langchain with Oracle AI Vector Search. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# pip install oracledb"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Create Demo User\n",
|
||
"First, create a demo user with all the required privileges. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 37,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Connection successful!\n",
|
||
"User setup done!\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"import sys\n",
|
||
"\n",
|
||
"import oracledb\n",
|
||
"\n",
|
||
"# please update with your username, password, hostname and service_name\n",
|
||
"# please make sure this user has sufficient privileges to perform all below\n",
|
||
"username = \"\"\n",
|
||
"password = \"\"\n",
|
||
"dsn = \"\"\n",
|
||
"\n",
|
||
"try:\n",
|
||
" conn = oracledb.connect(user=username, password=password, dsn=dsn)\n",
|
||
" print(\"Connection successful!\")\n",
|
||
"\n",
|
||
" cursor = conn.cursor()\n",
|
||
" cursor.execute(\n",
|
||
" \"\"\"\n",
|
||
" begin\n",
|
||
" -- drop user\n",
|
||
" begin\n",
|
||
" execute immediate 'drop user testuser cascade';\n",
|
||
" exception\n",
|
||
" when others then\n",
|
||
" dbms_output.put_line('Error setting up user.');\n",
|
||
" end;\n",
|
||
" execute immediate 'create user testuser identified by testuser';\n",
|
||
" execute immediate 'grant connect, unlimited tablespace, create credential, create procedure, create any index to testuser';\n",
|
||
" execute immediate 'create or replace directory DEMO_PY_DIR as ''/scratch/hroy/view_storage/hroy_devstorage/demo/orachain''';\n",
|
||
" execute immediate 'grant read, write on directory DEMO_PY_DIR to public';\n",
|
||
" execute immediate 'grant create mining model to testuser';\n",
|
||
"\n",
|
||
" -- network access\n",
|
||
" begin\n",
|
||
" DBMS_NETWORK_ACL_ADMIN.APPEND_HOST_ACE(\n",
|
||
" host => '*',\n",
|
||
" ace => xs$ace_type(privilege_list => xs$name_list('connect'),\n",
|
||
" principal_name => 'testuser',\n",
|
||
" principal_type => xs_acl.ptype_db));\n",
|
||
" end;\n",
|
||
" end;\n",
|
||
" \"\"\"\n",
|
||
" )\n",
|
||
" print(\"User setup done!\")\n",
|
||
" cursor.close()\n",
|
||
" conn.close()\n",
|
||
"except Exception as e:\n",
|
||
" print(\"User setup failed!\")\n",
|
||
" cursor.close()\n",
|
||
" conn.close()\n",
|
||
" sys.exit(1)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Process Documents using Oracle AI\n",
|
||
"Consider the following scenario: users possess documents stored either in an Oracle Database or a file system and intend to utilize this data with Oracle AI Vector Search powered by Langchain.\n",
|
||
"\n",
|
||
"To prepare the documents for analysis, a comprehensive preprocessing workflow is necessary. Initially, the documents must be retrieved, summarized (if required), and chunked as needed. Subsequent steps involve generating embeddings for these chunks and integrating them into the Oracle AI Vector Store. Users can then conduct semantic searches on this data.\n",
|
||
"\n",
|
||
"The Oracle AI Vector Search Langchain library encompasses a suite of document processing tools that facilitate document loading, chunking, summary generation, and embedding creation.\n",
|
||
"\n",
|
||
"In the sections that follow, we will detail the utilization of Oracle AI Langchain APIs to effectively implement each of these processes."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Connect to Demo User\n",
|
||
"The following sample code will show how to connect to Oracle Database. By default, python-oracledb runs in a ‘Thin’ mode which connects directly to Oracle Database. This mode does not need Oracle Client libraries. However, some additional functionality is available when python-oracledb uses them. Python-oracledb is said to be in ‘Thick’ mode when Oracle Client libraries are used. Both modes have comprehensive functionality supporting the Python Database API v2.0 Specification. See the following [guide](https://python-oracledb.readthedocs.io/en/latest/user_guide/appendix_a.html#featuresummary) that talks about features supported in each mode. You might want to switch to thick-mode if you are unable to use thin-mode."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 45,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Connection successful!\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"import sys\n",
|
||
"\n",
|
||
"import oracledb\n",
|
||
"\n",
|
||
"# please update with your username, password, hostname and service_name\n",
|
||
"username = \"\"\n",
|
||
"password = \"\"\n",
|
||
"dsn = \"\"\n",
|
||
"\n",
|
||
"try:\n",
|
||
" conn = oracledb.connect(user=username, password=password, dsn=dsn)\n",
|
||
" print(\"Connection successful!\")\n",
|
||
"except Exception as e:\n",
|
||
" print(\"Connection failed!\")\n",
|
||
" sys.exit(1)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Populate a Demo Table\n",
|
||
"Create a demo table and insert some sample documents."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 46,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Table created and populated.\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"try:\n",
|
||
" cursor = conn.cursor()\n",
|
||
"\n",
|
||
" drop_table_sql = \"\"\"drop table demo_tab\"\"\"\n",
|
||
" cursor.execute(drop_table_sql)\n",
|
||
"\n",
|
||
" create_table_sql = \"\"\"create table demo_tab (id number, data clob)\"\"\"\n",
|
||
" cursor.execute(create_table_sql)\n",
|
||
"\n",
|
||
" insert_row_sql = \"\"\"insert into demo_tab values (:1, :2)\"\"\"\n",
|
||
" rows_to_insert = [\n",
|
||
" (\n",
|
||
" 1,\n",
|
||
" \"If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.\",\n",
|
||
" ),\n",
|
||
" (\n",
|
||
" 2,\n",
|
||
" \"A tablespace can be online (accessible) or offline (not accessible) whenever the database is open.\\nA tablespace is usually online so that its data is available to users. The SYSTEM tablespace and temporary tablespaces cannot be taken offline.\",\n",
|
||
" ),\n",
|
||
" (\n",
|
||
" 3,\n",
|
||
" \"The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table.\\nSometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.\",\n",
|
||
" ),\n",
|
||
" ]\n",
|
||
" cursor.executemany(insert_row_sql, rows_to_insert)\n",
|
||
"\n",
|
||
" conn.commit()\n",
|
||
"\n",
|
||
" print(\"Table created and populated.\")\n",
|
||
" cursor.close()\n",
|
||
"except Exception as e:\n",
|
||
" print(\"Table creation failed.\")\n",
|
||
" cursor.close()\n",
|
||
" conn.close()\n",
|
||
" sys.exit(1)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"With the inclusion of a demo user and a populated sample table, the remaining configuration involves setting up embedding and summary functionalities. Users are presented with multiple provider options, including local database solutions and third-party services such as Ocigenai, Hugging Face, and OpenAI. Should users opt for a third-party provider, they are required to establish credentials containing the necessary authentication details. Conversely, if selecting a database as the provider for embeddings, it is necessary to upload an ONNX model to the Oracle Database. No additional setup is required for summary functionalities when using the database option."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Load ONNX Model\n",
|
||
"\n",
|
||
"Oracle accommodates a variety of embedding providers, enabling users to choose between proprietary database solutions and third-party services such as OCIGENAI and HuggingFace. This selection dictates the methodology for generating and managing embeddings.\n",
|
||
"\n",
|
||
"***Important*** : Should users opt for the database option, they must upload an ONNX model into the Oracle Database. Conversely, if a third-party provider is selected for embedding generation, uploading an ONNX model to Oracle Database is not required.\n",
|
||
"\n",
|
||
"A significant advantage of utilizing an ONNX model directly within Oracle is the enhanced security and performance it offers by eliminating the need to transmit data to external parties. Additionally, this method avoids the latency typically associated with network or REST API calls.\n",
|
||
"\n",
|
||
"Below is the example code to upload an ONNX model into Oracle Database:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 47,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"ONNX model loaded.\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"from langchain_community.embeddings.oracleai import OracleEmbeddings\n",
|
||
"\n",
|
||
"# please update with your related information\n",
|
||
"# make sure that you have onnx file in the system\n",
|
||
"onnx_dir = \"DEMO_PY_DIR\"\n",
|
||
"onnx_file = \"tinybert.onnx\"\n",
|
||
"model_name = \"demo_model\"\n",
|
||
"\n",
|
||
"try:\n",
|
||
" OracleEmbeddings.load_onnx_model(conn, onnx_dir, onnx_file, model_name)\n",
|
||
" print(\"ONNX model loaded.\")\n",
|
||
"except Exception as e:\n",
|
||
" print(\"ONNX model loading failed!\")\n",
|
||
" sys.exit(1)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Create Credential\n",
|
||
"\n",
|
||
"When selecting third-party providers for generating embeddings, users are required to establish credentials to securely access the provider's endpoints.\n",
|
||
"\n",
|
||
"***Important:*** No credentials are necessary when opting for the 'database' provider to generate embeddings. However, should users decide to utilize a third-party provider, they must create credentials specific to the chosen provider.\n",
|
||
"\n",
|
||
"Below is an illustrative example:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"try:\n",
|
||
" cursor = conn.cursor()\n",
|
||
" cursor.execute(\n",
|
||
" \"\"\"\n",
|
||
" declare\n",
|
||
" jo json_object_t;\n",
|
||
" begin\n",
|
||
" -- HuggingFace\n",
|
||
" dbms_vector_chain.drop_credential(credential_name => 'HF_CRED');\n",
|
||
" jo := json_object_t();\n",
|
||
" jo.put('access_token', '<access_token>');\n",
|
||
" dbms_vector_chain.create_credential(\n",
|
||
" credential_name => 'HF_CRED',\n",
|
||
" params => json(jo.to_string));\n",
|
||
"\n",
|
||
" -- OCIGENAI\n",
|
||
" dbms_vector_chain.drop_credential(credential_name => 'OCI_CRED');\n",
|
||
" jo := json_object_t();\n",
|
||
" jo.put('user_ocid','<user_ocid>');\n",
|
||
" jo.put('tenancy_ocid','<tenancy_ocid>');\n",
|
||
" jo.put('compartment_ocid','<compartment_ocid>');\n",
|
||
" jo.put('private_key','<private_key>');\n",
|
||
" jo.put('fingerprint','<fingerprint>');\n",
|
||
" dbms_vector_chain.create_credential(\n",
|
||
" credential_name => 'OCI_CRED',\n",
|
||
" params => json(jo.to_string));\n",
|
||
" end;\n",
|
||
" \"\"\"\n",
|
||
" )\n",
|
||
" cursor.close()\n",
|
||
" print(\"Credentials created.\")\n",
|
||
"except Exception as ex:\n",
|
||
" cursor.close()\n",
|
||
" raise"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Load Documents\n",
|
||
"Users have the flexibility to load documents from either the Oracle Database, a file system, or both, by appropriately configuring the loader parameters. For comprehensive details on these parameters, please consult the [Oracle AI Vector Search Guide](https://docs.oracle.com/en/database/oracle/oracle-database/23/arpls/dbms_vector_chain1.html#GUID-73397E89-92FB-48ED-94BB-1AD960C4EA1F).\n",
|
||
"\n",
|
||
"A significant advantage of utilizing OracleDocLoader is its capability to process over 150 distinct file formats, eliminating the need for multiple loaders for different document types. For a complete list of the supported formats, please refer to the [Oracle Text Supported Document Formats](https://docs.oracle.com/en/database/oracle/oracle-database/23/ccref/oracle-text-supported-document-formats.html).\n",
|
||
"\n",
|
||
"Below is a sample code snippet that demonstrates how to use OracleDocLoader"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 48,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Number of docs loaded: 3\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"from langchain_community.document_loaders.oracleai import OracleDocLoader\n",
|
||
"from langchain_core.documents import Document\n",
|
||
"\n",
|
||
"# loading from Oracle Database table\n",
|
||
"# make sure you have the table with this specification\n",
|
||
"loader_params = {}\n",
|
||
"loader_params = {\n",
|
||
" \"owner\": \"testuser\",\n",
|
||
" \"tablename\": \"demo_tab\",\n",
|
||
" \"colname\": \"data\",\n",
|
||
"}\n",
|
||
"\n",
|
||
"\"\"\" load the docs \"\"\"\n",
|
||
"loader = OracleDocLoader(conn=conn, params=loader_params)\n",
|
||
"docs = loader.load()\n",
|
||
"\n",
|
||
"\"\"\" verify \"\"\"\n",
|
||
"print(f\"Number of docs loaded: {len(docs)}\")\n",
|
||
"# print(f\"Document-0: {docs[0].page_content}\") # content"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Generate Summary\n",
|
||
"Now that the user loaded the documents, they may want to generate a summary for each document. The Oracle AI Vector Search Langchain library offers a suite of APIs designed for document summarization. It supports multiple summarization providers such as Database, OCIGENAI, HuggingFace, among others, allowing users to select the provider that best meets their needs. To utilize these capabilities, users must configure the summary parameters as specified. For detailed information on these parameters, please consult the [Oracle AI Vector Search Guide book](https://docs.oracle.com/en/database/oracle/oracle-database/23/arpls/dbms_vector_chain1.html#GUID-EC9DDB58-6A15-4B36-BA66-ECBA20D2CE57)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"***Note:*** The users may need to set proxy if they want to use some 3rd party summary generation providers other than Oracle's in-house and default provider: 'database'. If you don't have proxy, please remove the proxy parameter when you instantiate the OracleSummary."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 22,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# proxy to be used when we instantiate summary and embedder object\n",
|
||
"proxy = \"\""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The following sample code will show how to generate summary:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 49,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Number of Summaries: 3\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"from langchain_community.utilities.oracleai import OracleSummary\n",
|
||
"from langchain_core.documents import Document\n",
|
||
"\n",
|
||
"# using 'database' provider\n",
|
||
"summary_params = {\n",
|
||
" \"provider\": \"database\",\n",
|
||
" \"glevel\": \"S\",\n",
|
||
" \"numParagraphs\": 1,\n",
|
||
" \"language\": \"english\",\n",
|
||
"}\n",
|
||
"\n",
|
||
"# get the summary instance\n",
|
||
"# Remove proxy if not required\n",
|
||
"summ = OracleSummary(conn=conn, params=summary_params, proxy=proxy)\n",
|
||
"\n",
|
||
"list_summary = []\n",
|
||
"for doc in docs:\n",
|
||
" summary = summ.get_summary(doc.page_content)\n",
|
||
" list_summary.append(summary)\n",
|
||
"\n",
|
||
"\"\"\" verify \"\"\"\n",
|
||
"print(f\"Number of Summaries: {len(list_summary)}\")\n",
|
||
"# print(f\"Summary-0: {list_summary[0]}\") #content"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Split Documents\n",
|
||
"The documents may vary in size, ranging from small to very large. Users often prefer to chunk their documents into smaller sections to facilitate the generation of embeddings. A wide array of customization options is available for this splitting process. For comprehensive details regarding these parameters, please consult the [Oracle AI Vector Search Guide](https://docs.oracle.com/en/database/oracle/oracle-database/23/arpls/dbms_vector_chain1.html#GUID-4E145629-7098-4C7C-804F-FC85D1F24240).\n",
|
||
"\n",
|
||
"Below is a sample code illustrating how to implement this:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 50,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Number of Chunks: 3\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"from langchain_community.document_loaders.oracleai import OracleTextSplitter\n",
|
||
"from langchain_core.documents import Document\n",
|
||
"\n",
|
||
"# split by default parameters\n",
|
||
"splitter_params = {\"normalize\": \"all\"}\n",
|
||
"\n",
|
||
"\"\"\" get the splitter instance \"\"\"\n",
|
||
"splitter = OracleTextSplitter(conn=conn, params=splitter_params)\n",
|
||
"\n",
|
||
"list_chunks = []\n",
|
||
"for doc in docs:\n",
|
||
" chunks = splitter.split_text(doc.page_content)\n",
|
||
" list_chunks.extend(chunks)\n",
|
||
"\n",
|
||
"\"\"\" verify \"\"\"\n",
|
||
"print(f\"Number of Chunks: {len(list_chunks)}\")\n",
|
||
"# print(f\"Chunk-0: {list_chunks[0]}\") # content"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"### Generate Embeddings\n",
|
||
"Now that the documents are chunked as per requirements, the users may want to generate embeddings for these chunks. Oracle AI Vector Search provides multiple methods for generating embeddings, utilizing either locally hosted ONNX models or third-party APIs. For comprehensive instructions on configuring these alternatives, please refer to the [Oracle AI Vector Search Guide](https://docs.oracle.com/en/database/oracle/oracle-database/23/arpls/dbms_vector_chain1.html#GUID-C6439E94-4E86-4ECD-954E-4B73D53579DE)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"***Note:*** Users may need to configure a proxy to utilize third-party embedding generation providers, excluding the 'database' provider that utilizes an ONNX model."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 12,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# proxy to be used when we instantiate summary and embedder object\n",
|
||
"proxy = \"\""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The following sample code will show how to generate embeddings:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 51,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Number of embeddings: 3\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"from langchain_community.embeddings.oracleai import OracleEmbeddings\n",
|
||
"from langchain_core.documents import Document\n",
|
||
"\n",
|
||
"# using ONNX model loaded to Oracle Database\n",
|
||
"embedder_params = {\"provider\": \"database\", \"model\": \"demo_model\"}\n",
|
||
"\n",
|
||
"# get the embedding instance\n",
|
||
"# Remove proxy if not required\n",
|
||
"embedder = OracleEmbeddings(conn=conn, params=embedder_params, proxy=proxy)\n",
|
||
"\n",
|
||
"embeddings = []\n",
|
||
"for doc in docs:\n",
|
||
" chunks = splitter.split_text(doc.page_content)\n",
|
||
" for chunk in chunks:\n",
|
||
" embed = embedder.embed_query(chunk)\n",
|
||
" embeddings.append(embed)\n",
|
||
"\n",
|
||
"\"\"\" verify \"\"\"\n",
|
||
"print(f\"Number of embeddings: {len(embeddings)}\")\n",
|
||
"# print(f\"Embedding-0: {embeddings[0]}\") # content"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Create Oracle AI Vector Store\n",
|
||
"Now that you know how to use Oracle AI Langchain library APIs individually to process the documents, let us show how to integrate with Oracle AI Vector Store to facilitate the semantic searches."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"First, let's import all the dependencies."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 52,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"import sys\n",
|
||
"\n",
|
||
"import oracledb\n",
|
||
"from langchain_community.document_loaders.oracleai import (\n",
|
||
" OracleDocLoader,\n",
|
||
" OracleTextSplitter,\n",
|
||
")\n",
|
||
"from langchain_community.embeddings.oracleai import OracleEmbeddings\n",
|
||
"from langchain_community.utilities.oracleai import OracleSummary\n",
|
||
"from langchain_community.vectorstores import oraclevs\n",
|
||
"from langchain_community.vectorstores.oraclevs import OracleVS\n",
|
||
"from langchain_community.vectorstores.utils import DistanceStrategy\n",
|
||
"from langchain_core.documents import Document"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"Next, let's combine all document processing stages together. Here is the sample code below:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 53,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Connection successful!\n",
|
||
"ONNX model loaded.\n",
|
||
"Number of total chunks with metadata: 3\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"\"\"\"\n",
|
||
"In this sample example, we will use 'database' provider for both summary and embeddings.\n",
|
||
"So, we don't need to do the followings:\n",
|
||
" - set proxy for 3rd party providers\n",
|
||
" - create credential for 3rd party providers\n",
|
||
"\n",
|
||
"If you choose to use 3rd party provider, \n",
|
||
"please follow the necessary steps for proxy and credential.\n",
|
||
"\"\"\"\n",
|
||
"\n",
|
||
"# oracle connection\n",
|
||
"# please update with your username, password, hostname, and service_name\n",
|
||
"username = \"\"\n",
|
||
"password = \"\"\n",
|
||
"dsn = \"\"\n",
|
||
"\n",
|
||
"try:\n",
|
||
" conn = oracledb.connect(user=username, password=password, dsn=dsn)\n",
|
||
" print(\"Connection successful!\")\n",
|
||
"except Exception as e:\n",
|
||
" print(\"Connection failed!\")\n",
|
||
" sys.exit(1)\n",
|
||
"\n",
|
||
"\n",
|
||
"# load onnx model\n",
|
||
"# please update with your related information\n",
|
||
"onnx_dir = \"DEMO_PY_DIR\"\n",
|
||
"onnx_file = \"tinybert.onnx\"\n",
|
||
"model_name = \"demo_model\"\n",
|
||
"try:\n",
|
||
" OracleEmbeddings.load_onnx_model(conn, onnx_dir, onnx_file, model_name)\n",
|
||
" print(\"ONNX model loaded.\")\n",
|
||
"except Exception as e:\n",
|
||
" print(\"ONNX model loading failed!\")\n",
|
||
" sys.exit(1)\n",
|
||
"\n",
|
||
"\n",
|
||
"# params\n",
|
||
"# please update necessary fields with related information\n",
|
||
"loader_params = {\n",
|
||
" \"owner\": \"testuser\",\n",
|
||
" \"tablename\": \"demo_tab\",\n",
|
||
" \"colname\": \"data\",\n",
|
||
"}\n",
|
||
"summary_params = {\n",
|
||
" \"provider\": \"database\",\n",
|
||
" \"glevel\": \"S\",\n",
|
||
" \"numParagraphs\": 1,\n",
|
||
" \"language\": \"english\",\n",
|
||
"}\n",
|
||
"splitter_params = {\"normalize\": \"all\"}\n",
|
||
"embedder_params = {\"provider\": \"database\", \"model\": \"demo_model\"}\n",
|
||
"\n",
|
||
"# instantiate loader, summary, splitter, and embedder\n",
|
||
"loader = OracleDocLoader(conn=conn, params=loader_params)\n",
|
||
"summary = OracleSummary(conn=conn, params=summary_params)\n",
|
||
"splitter = OracleTextSplitter(conn=conn, params=splitter_params)\n",
|
||
"embedder = OracleEmbeddings(conn=conn, params=embedder_params)\n",
|
||
"\n",
|
||
"# process the documents\n",
|
||
"chunks_with_mdata = []\n",
|
||
"for id, doc in enumerate(docs, start=1):\n",
|
||
" summ = summary.get_summary(doc.page_content)\n",
|
||
" chunks = splitter.split_text(doc.page_content)\n",
|
||
" for ic, chunk in enumerate(chunks, start=1):\n",
|
||
" chunk_metadata = doc.metadata.copy()\n",
|
||
" chunk_metadata[\"id\"] = chunk_metadata[\"_oid\"] + \"$\" + str(id) + \"$\" + str(ic)\n",
|
||
" chunk_metadata[\"document_id\"] = str(id)\n",
|
||
" chunk_metadata[\"document_summary\"] = str(summ[0])\n",
|
||
" chunks_with_mdata.append(\n",
|
||
" Document(page_content=str(chunk), metadata=chunk_metadata)\n",
|
||
" )\n",
|
||
"\n",
|
||
"\"\"\" verify \"\"\"\n",
|
||
"print(f\"Number of total chunks with metadata: {len(chunks_with_mdata)}\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"At this point, we have processed the documents and generated chunks with metadata. Next, we will create Oracle AI Vector Store with those chunks.\n",
|
||
"\n",
|
||
"Here is the sample code how to do that:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 55,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Vector Store Table: oravs\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# create Oracle AI Vector Store\n",
|
||
"vectorstore = OracleVS.from_documents(\n",
|
||
" chunks_with_mdata,\n",
|
||
" embedder,\n",
|
||
" client=conn,\n",
|
||
" table_name=\"oravs\",\n",
|
||
" distance_strategy=DistanceStrategy.DOT_PRODUCT,\n",
|
||
")\n",
|
||
"\n",
|
||
"\"\"\" verify \"\"\"\n",
|
||
"print(f\"Vector Store Table: {vectorstore.table_name}\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"The example provided illustrates the creation of a vector store using the DOT_PRODUCT distance strategy. Users have the flexibility to employ various distance strategies with the Oracle AI Vector Store, as detailed in our [comprehensive guide](https://python.langchain.com/v0.1/docs/integrations/vectorstores/oracle/)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"With embeddings now stored in vector stores, it is advisable to establish an index to enhance semantic search performance during query execution.\n",
|
||
"\n",
|
||
"***Note*** Should you encounter an \"insufficient memory\" error, it is recommended to increase the ***vector_memory_size*** in your database configuration\n",
|
||
"\n",
|
||
"Below is a sample code snippet for creating an index:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 56,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"oraclevs.create_index(\n",
|
||
" conn, vectorstore, params={\"idx_name\": \"hnsw_oravs\", \"idx_type\": \"HNSW\"}\n",
|
||
")\n",
|
||
"\n",
|
||
"print(\"Index created.\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"This example demonstrates the creation of a default HNSW index on embeddings within the 'oravs' table. Users may adjust various parameters according to their specific needs. For detailed information on these parameters, please consult the [Oracle AI Vector Search Guide book](https://docs.oracle.com/en/database/oracle/oracle-database/23/vecse/manage-different-categories-vector-indexes.html).\n",
|
||
"\n",
|
||
"Additionally, various types of vector indices can be created to meet diverse requirements. More details can be found in our [comprehensive guide](https://python.langchain.com/v0.1/docs/integrations/vectorstores/oracle/).\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Perform Semantic Search\n",
|
||
"All set!\n",
|
||
"\n",
|
||
"We have successfully processed the documents and stored them in the vector store, followed by the creation of an index to enhance query performance. We are now prepared to proceed with semantic searches.\n",
|
||
"\n",
|
||
"Below is the sample code for this process:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 58,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[Document(page_content='The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table. Sometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.', metadata={'_oid': '662f2f257677f3c2311a8ff999fd34e5', '_rowid': 'AAAR/xAAEAAAAAnAAC', 'id': '662f2f257677f3c2311a8ff999fd34e5$3$1', 'document_id': '3', 'document_summary': 'Sometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.\\n\\n'})]\n",
|
||
"[]\n",
|
||
"[(Document(page_content='The database stores LOBs differently from other data types. Creating a LOB column implicitly creates a LOB segment and a LOB index. The tablespace containing the LOB segment and LOB index, which are always stored together, may be different from the tablespace containing the table. Sometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.', metadata={'_oid': '662f2f257677f3c2311a8ff999fd34e5', '_rowid': 'AAAR/xAAEAAAAAnAAC', 'id': '662f2f257677f3c2311a8ff999fd34e5$3$1', 'document_id': '3', 'document_summary': 'Sometimes the database can store small amounts of LOB data in the table itself rather than in a separate LOB segment.\\n\\n'}), 0.055675752460956573)]\n",
|
||
"[]\n",
|
||
"[Document(page_content='If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.', metadata={'_oid': '662f2f253acf96b33b430b88699490a2', '_rowid': 'AAAR/xAAEAAAAAnAAA', 'id': '662f2f253acf96b33b430b88699490a2$1$1', 'document_id': '1', 'document_summary': 'If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.\\n\\n'})]\n",
|
||
"[Document(page_content='If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.', metadata={'_oid': '662f2f253acf96b33b430b88699490a2', '_rowid': 'AAAR/xAAEAAAAAnAAA', 'id': '662f2f253acf96b33b430b88699490a2$1$1', 'document_id': '1', 'document_summary': 'If the answer to any preceding questions is yes, then the database stops the search and allocates space from the specified tablespace; otherwise, space is allocated from the database default shared temporary tablespace.\\n\\n'})]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"query = \"What is Oracle AI Vector Store?\"\n",
|
||
"filter = {\"document_id\": [\"1\"]}\n",
|
||
"\n",
|
||
"# Similarity search without a filter\n",
|
||
"print(vectorstore.similarity_search(query, 1))\n",
|
||
"\n",
|
||
"# Similarity search with a filter\n",
|
||
"print(vectorstore.similarity_search(query, 1, filter=filter))\n",
|
||
"\n",
|
||
"# Similarity search with relevance score\n",
|
||
"print(vectorstore.similarity_search_with_score(query, 1))\n",
|
||
"\n",
|
||
"# Similarity search with relevance score with filter\n",
|
||
"print(vectorstore.similarity_search_with_score(query, 1, filter=filter))\n",
|
||
"\n",
|
||
"# Max marginal relevance search\n",
|
||
"print(vectorstore.max_marginal_relevance_search(query, 1, fetch_k=20, lambda_mult=0.5))\n",
|
||
"\n",
|
||
"# Max marginal relevance search with filter\n",
|
||
"print(\n",
|
||
" vectorstore.max_marginal_relevance_search(\n",
|
||
" query, 1, fetch_k=20, lambda_mult=0.5, filter=filter\n",
|
||
" )\n",
|
||
")"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3 (ipykernel)",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.11.9"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 4
|
||
}
|