{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Xata\n", "\n", "> [Xata](https://xata.io) is a serverless data platform, based on PostgreSQL. It provides a Python SDK for interacting with your database, and a UI for managing your data.\n", "> Xata has a native vector type, which can be added to any table, and supports similarity search. LangChain inserts vectors directly into Xata and queries it for the nearest neighbors of a given vector, so that you can use all the LangChain Embeddings integrations with Xata." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook shows you how to use Xata as a VectorStore." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "### Create a database to use as a vector store\n", "\n", "In the [Xata UI](https://app.xata.io) create a new database. You can name it whatever you want; in this notebook we'll use `langchain`.\n", "Then create a table. Again, you can name it anything, but we will use `vectors`. Add the following columns via the UI:\n", "\n", "* `content` of type \"Text\". This is used to store the `Document.page_content` values.\n", "* `embedding` of type \"Vector\". Set the dimension to match the embedding model you plan to use. In this notebook we use OpenAI embeddings, which have 1536 dimensions.\n", "* `source` of type \"Text\". This is used as a metadata column by this example.\n", "* any other columns you want to use as metadata. They are populated from the `Document.metadata` object. 
For example, if the `Document.metadata` object has a `title` property, you can create a `title` column in the table and it will be populated.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's first install our dependencies:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "!pip install xata openai tiktoken langchain" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's load the OpenAI key into the environment. If you don't have one, you can create an OpenAI account and create a key on this [page](https://platform.openai.com/account/api-keys)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "import os\n", "import getpass\n", "\n", "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, we need the API key and database URL for Xata. You can create a new API key by visiting your [account settings](https://app.xata.io/settings). To find the database URL, go to the Settings page of the database that you have created. The database URL should look something like this: `https://demo-uni3q8.eu-west-1.xata.sh/db/langchain`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "api_key = getpass.getpass(\"Xata API key: \")\n", "db_url = input(\"Xata database URL (copy it from your DB settings):\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "from langchain.embeddings.openai import OpenAIEmbeddings\n", "from langchain.text_splitter import CharacterTextSplitter\n", "from langchain.document_loaders import TextLoader\n", "from langchain.vectorstores.xata import XataVectorStore\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the Xata vector store\n", "Let's import our test dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "loader = TextLoader(\"../../../state_of_the_union.txt\")\n", "documents = loader.load()\n", "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n", "docs = text_splitter.split_documents(documents)\n", "\n", "embeddings = OpenAIEmbeddings()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now create the actual vector store, backed by the Xata table." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "vector_store = XataVectorStore.from_documents(docs, embeddings, api_key=api_key, db_url=db_url, table_name=\"vectors\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After running the above command, if you go to the Xata UI, you should see the documents loaded together with their embeddings." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Similarity Search" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "query = \"What did the president say about Ketanji Brown Jackson\"\n", "found_docs = vector_store.similarity_search(query)\n", "print(found_docs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Similarity Search with score (vector distance)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "query = \"What did the president say about Ketanji Brown Jackson\"\n", "result = vector_store.similarity_search_with_score(query)\n", "for doc, score in result:\n", "    print(f\"document={doc}, score={score}\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 4 }