langchain/docs/extras/integrations/vectorstores/xata.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Xata\n",
    "\n",
    "> [Xata](https://xata.io) is a serverless data platform, based on PostgreSQL. It provides a Python SDK for interacting with your database, and a UI for managing your data.\n",
    "> Xata has a native vector type, which can be added to any table, and supports similarity search. LangChain inserts vectors directly to Xata, and queries it for the nearest neighbors of a given vector, so that you can use all the LangChain Embeddings integrations with Xata."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook guides you how to use Xata as a VectorStore."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "### Create a database to use as a vector store\n",
    "\n",
    "In the [Xata UI](https://app.xata.io) create a new database. You can name it whatever you want, in this notepad we'll use `langchain`.\n",
    "Create a table, again you can name it anything, but we will use `vectors`. Add the following columns via the UI:\n",
    "\n",
    "* `content` of type \"Text\". This is used to store the `Document.pageContent` values.\n",
    "* `embedding` of type \"Vector\". Use the dimension used by the model you plan to use. In this notebook we use OpenAI embeddings, which have 1536 dimensions.\n",
    "* `search` of type \"Text\". This is used as a metadata column by this example.\n",
    "* any other columns you want to use as metadata. They are populated from the `Document.metadata` object. For example, if in the `Document.metadata` object you have a `title` property, you can create a `title` column in the table and it will be populated.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's first install our dependencies:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "!pip install xata openai tiktoken langchain"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's load the OpenAI key to the environemnt. If you don't have one you can create an OpenAI account and create a key on this [page](https://platform.openai.com/account/api-keys)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import getpass\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly, we need to get the environment variables for Xata. You can create a new API key by visiting your [account settings](https://app.xata.io/settings). To find the database URL, go to the Settings page of the database that you have created. The database URL should look something like this: `https://demo-uni3q8.eu-west-1.xata.sh/db/langchain`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "api_key = getpass.getpass(\"Xata API key: \")\n",
    "db_url = input(\"Xata database URL (copy it from your DB settings):\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "from langchain.document_loaders import TextLoader\n",
    "from langchain.vectorstores.xata import XataVectorStore\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create the Xata vector store\n",
    "Let's import our test dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "loader = TextLoader(\"../../../state_of_the_union.txt\")\n",
    "documents = loader.load()\n",
    "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
    "docs = text_splitter.split_documents(documents)\n",
    "\n",
    "embeddings = OpenAIEmbeddings()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now create the actual vector store, backed by the Xata table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "vector_store = XataVectorStore.from_documents(docs, embeddings, api_key=api_key, db_url=db_url, table_name=\"vectors\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After running the above command, if you go to the Xata UI, you should see the documents loaded together with their embeddings."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Similarity Search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "found_docs = vector_store.similarity_search(query)\n",
    "print(found_docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Similarity Search with score (vector distance)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "query = \"What did the president say about Ketanji Brown Jackson\"\n",
    "result = vector_store.similarity_search_with_score(query)\n",
    "for doc, score in result:\n",
    "    print(f\"document={doc}, score={score}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
Add support for Xata as a vector store (#8822) This adds support for [Xata](https://xata.io) (data platform based on Postgres) as a vector store. We have recently added [Xata to Langchain.js](https://github.com/hwchase17/langchainjs/pull/2125) and would love to have the equivalent in the Python project as well. The PR includes integration tests and a Jupyter notebook as docs. Please let me know if anything else would be needed or helpful. I have added the xata python SDK as an optional dependency. ## To run the integration tests You will need to create a DB in xata (see the docs), then run something like: ``` OPENAI_API_KEY=sk-... XATA_API_KEY=xau_... XATA_DB_URL='https://....xata.sh/db/langchain' poetry run pytest tests/integration_tests/vectorstores/test_xata.py ``` <!-- Thank you for contributing to LangChain! Replace this comment with: - Description: a description of the change, - Issue: the issue # it fixes (if applicable), - Dependencies: any dependencies required for this change, - Tag maintainer: for a quicker response, tag the relevant maintainer (see below), - Twitter handle: we announce bigger features on Twitter. If your PR gets announced and you'd like a mention, we'll gladly shout you out! Please make sure you're PR is passing linting and testing before submitting. Run `make format`, `make lint` and `make test` to check this locally. If you're adding a new integration, please include: 1. a test for the integration, preferably unit tests that do not rely on network access, 2. an example notebook showing its use. Maintainer responsibilities: - General / Misc / if you don't know who to tag: @baskaryan - DataLoaders / VectorStores / Retrievers: @rlancemartin, @eyurtsev - Models / Prompts: @hwchase17, @baskaryan - Memory: @hwchase17 - Agents / Tools / Toolkits: @hinthornw - Tracing / Callbacks: @agola11 - Async: @agola11 If no one reviews your PR within a few days, feel free to @-mention the same people again. See contribution guidelines for more information on how to write/run tests, lint, etc: https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md --> --------- Co-authored-by: Harrison Chase <hw.chase.17@gmail.com> Co-authored-by: Philip Krauss <35487337+philkra@users.noreply.github.com> Co-authored-by: Bagatur <baskaryan@gmail.com> 1 year ago			`{`
			`"cells": [`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Xata\n",`
			`"\n",`
			`"> [Xata](https://xata.io) is a serverless data platform, based on PostgreSQL. It provides a Python SDK for interacting with your database, and a UI for managing your data.\n",`
			`"> Xata has a native vector type, which can be added to any table, and supports similarity search. LangChain inserts vectors directly to Xata, and queries it for the nearest neighbors of a given vector, so that you can use all the LangChain Embeddings integrations with Xata."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"This notebook guides you how to use Xata as a VectorStore."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Setup\n",`
			`"\n",`
			`"### Create a database to use as a vector store\n",`
			`"\n",`
			"In the [Xata UI](https://app.xata.io) create a new database. You can name it whatever you want, in this notepad we'll use `langchain`.\n",
			"Create a table, again you can name it anything, but we will use `vectors`. Add the following columns via the UI:\n",
			`"\n",`
			"* `content` of type \"Text\". This is used to store the `Document.pageContent` values.\n",
			"* `embedding` of type \"Vector\". Use the dimension used by the model you plan to use. In this notebook we use OpenAI embeddings, which have 1536 dimensions.\n",
			"* `search` of type \"Text\". This is used as a metadata column by this example.\n",
			"* any other columns you want to use as metadata. They are populated from the `Document.metadata` object. For example, if in the `Document.metadata` object you have a `title` property, you can create a `title` column in the table and it will be populated.\n"
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Let's first install our dependencies:"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {`
			`"collapsed": false,`
			`"jupyter": {`
			`"outputs_hidden": false`
			`}`
			`},`
			`"outputs": [],`
			`"source": [`
Pre-release Xata version no longer required (#9915) Tiny PR: Since we've released version 1.0.0 of the python SDK, we no longer need to specify the pre-release version when pip installing. 1 year ago			`"!pip install xata openai tiktoken langchain"`
Add support for Xata as a vector store (#8822) This adds support for [Xata](https://xata.io) (data platform based on Postgres) as a vector store. We have recently added [Xata to Langchain.js](https://github.com/hwchase17/langchainjs/pull/2125) and would love to have the equivalent in the Python project as well. The PR includes integration tests and a Jupyter notebook as docs. Please let me know if anything else would be needed or helpful. I have added the xata python SDK as an optional dependency. ## To run the integration tests You will need to create a DB in xata (see the docs), then run something like: ``` OPENAI_API_KEY=sk-... XATA_API_KEY=xau_... XATA_DB_URL='https://....xata.sh/db/langchain' poetry run pytest tests/integration_tests/vectorstores/test_xata.py ``` <!-- Thank you for contributing to LangChain! Replace this comment with: - Description: a description of the change, - Issue: the issue # it fixes (if applicable), - Dependencies: any dependencies required for this change, - Tag maintainer: for a quicker response, tag the relevant maintainer (see below), - Twitter handle: we announce bigger features on Twitter. If your PR gets announced and you'd like a mention, we'll gladly shout you out! Please make sure you're PR is passing linting and testing before submitting. Run `make format`, `make lint` and `make test` to check this locally. If you're adding a new integration, please include: 1. a test for the integration, preferably unit tests that do not rely on network access, 2. an example notebook showing its use. Maintainer responsibilities: - General / Misc / if you don't know who to tag: @baskaryan - DataLoaders / VectorStores / Retrievers: @rlancemartin, @eyurtsev - Models / Prompts: @hwchase17, @baskaryan - Memory: @hwchase17 - Agents / Tools / Toolkits: @hinthornw - Tracing / Callbacks: @agola11 - Async: @agola11 If no one reviews your PR within a few days, feel free to @-mention the same people again. See contribution guidelines for more information on how to write/run tests, lint, etc: https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md --> --------- Co-authored-by: Harrison Chase <hw.chase.17@gmail.com> Co-authored-by: Philip Krauss <35487337+philkra@users.noreply.github.com> Co-authored-by: Bagatur <baskaryan@gmail.com> 1 year ago			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Let's load the OpenAI key to the environemnt. If you don't have one you can create an OpenAI account and create a key on this [page](https://platform.openai.com/account/api-keys)."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {`
			`"collapsed": false,`
			`"jupyter": {`
			`"outputs_hidden": false`
			`}`
			`},`
			`"outputs": [],`
			`"source": [`
			`"import os\n",`
			`"import getpass\n",`
			`"\n",`
			`"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			"Similarly, we need to get the environment variables for Xata. You can create a new API key by visiting your [account settings](https://app.xata.io/settings). To find the database URL, go to the Settings page of the database that you have created. The database URL should look something like this: `https://demo-uni3q8.eu-west-1.xata.sh/db/langchain`."
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"api_key = getpass.getpass(\"Xata API key: \")\n",`
			`"db_url = input(\"Xata database URL (copy it from your DB settings):\")"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {`
			`"collapsed": false,`
			`"jupyter": {`
			`"outputs_hidden": false`
			`}`
			`},`
			`"outputs": [],`
			`"source": [`
			`"from langchain.embeddings.openai import OpenAIEmbeddings\n",`
			`"from langchain.text_splitter import CharacterTextSplitter\n",`
			`"from langchain.document_loaders import TextLoader\n",`
			`"from langchain.vectorstores.xata import XataVectorStore\n"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"### Create the Xata vector store\n",`
			`"Let's import our test dataset:"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {`
			`"collapsed": false,`
			`"jupyter": {`
			`"outputs_hidden": false`
			`}`
			`},`
			`"outputs": [],`
			`"source": [`
			`"loader = TextLoader(\"../../../state_of_the_union.txt\")\n",`
			`"documents = loader.load()\n",`
			`"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",`
			`"docs = text_splitter.split_documents(documents)\n",`
			`"\n",`
			`"embeddings = OpenAIEmbeddings()"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Now create the actual vector store, backed by the Xata table."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {`
			`"collapsed": false,`
			`"jupyter": {`
			`"outputs_hidden": false`
			`}`
			`},`
			`"outputs": [],`
			`"source": [`
			`"vector_store = XataVectorStore.from_documents(docs, embeddings, api_key=api_key, db_url=db_url, table_name=\"vectors\")"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"After running the above command, if you go to the Xata UI, you should see the documents loaded together with their embeddings."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"### Similarity Search"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {`
			`"collapsed": false,`
			`"jupyter": {`
			`"outputs_hidden": false`
			`}`
			`},`
			`"outputs": [],`
			`"source": [`
			`"query = \"What did the president say about Ketanji Brown Jackson\"\n",`
			`"found_docs = vector_store.similarity_search(query)\n",`
			`"print(found_docs)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"### Similarity Search with score (vector distance)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {`
			`"collapsed": false,`
			`"jupyter": {`
			`"outputs_hidden": false`
			`}`
			`},`
			`"outputs": [],`
			`"source": [`
			`"query = \"What did the president say about Ketanji Brown Jackson\"\n",`
			`"result = vector_store.similarity_search_with_score(query)\n",`
			`"for doc, score in result:\n",`
			`" print(f\"document={doc}, score={score}\")"`
			`]`
			`}`
			`],`
			`"metadata": {`
			`"kernelspec": {`
			`"display_name": "Python 3 (ipykernel)",`
			`"language": "python",`
			`"name": "python3"`
			`},`
			`"language_info": {`
			`"codemirror_mode": {`
			`"name": "ipython",`
			`"version": 3`
			`},`
			`"file_extension": ".py",`
			`"mimetype": "text/x-python",`
			`"name": "python",`
			`"nbconvert_exporter": "python",`
			`"pygments_lexer": "ipython3",`
docs: `integrations/providers` (#9631) Added missed pages for `integrations/providers` from `vectorstores`. Updated several `vectorstores` notebooks. 1 year ago			`"version": "3.10.12"`
Add support for Xata as a vector store (#8822) This adds support for [Xata](https://xata.io) (data platform based on Postgres) as a vector store. We have recently added [Xata to Langchain.js](https://github.com/hwchase17/langchainjs/pull/2125) and would love to have the equivalent in the Python project as well. The PR includes integration tests and a Jupyter notebook as docs. Please let me know if anything else would be needed or helpful. I have added the xata python SDK as an optional dependency. ## To run the integration tests You will need to create a DB in xata (see the docs), then run something like: ``` OPENAI_API_KEY=sk-... XATA_API_KEY=xau_... XATA_DB_URL='https://....xata.sh/db/langchain' poetry run pytest tests/integration_tests/vectorstores/test_xata.py ``` <!-- Thank you for contributing to LangChain! Replace this comment with: - Description: a description of the change, - Issue: the issue # it fixes (if applicable), - Dependencies: any dependencies required for this change, - Tag maintainer: for a quicker response, tag the relevant maintainer (see below), - Twitter handle: we announce bigger features on Twitter. If your PR gets announced and you'd like a mention, we'll gladly shout you out! Please make sure you're PR is passing linting and testing before submitting. Run `make format`, `make lint` and `make test` to check this locally. If you're adding a new integration, please include: 1. a test for the integration, preferably unit tests that do not rely on network access, 2. an example notebook showing its use. Maintainer responsibilities: - General / Misc / if you don't know who to tag: @baskaryan - DataLoaders / VectorStores / Retrievers: @rlancemartin, @eyurtsev - Models / Prompts: @hwchase17, @baskaryan - Memory: @hwchase17 - Agents / Tools / Toolkits: @hinthornw - Tracing / Callbacks: @agola11 - Async: @agola11 If no one reviews your PR within a few days, feel free to @-mention the same people again. See contribution guidelines for more information on how to write/run tests, lint, etc: https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md --> --------- Co-authored-by: Harrison Chase <hw.chase.17@gmail.com> Co-authored-by: Philip Krauss <35487337+philkra@users.noreply.github.com> Co-authored-by: Bagatur <baskaryan@gmail.com> 1 year ago			`}`
			`},`
			`"nbformat": 4,`
			`"nbformat_minor": 4`
			`}`