{ "cells": [ { "cell_type": "markdown", "id": "683953b3", "metadata": {}, "source": [ "# Dingo\n", "\n", ">[Dingo](https://dingodb.readthedocs.io/en/latest/) is a distributed multi-mode vector database, which combines the characteristics of data lakes and vector databases, and can store data of any type and size (Key-Value, PDF, audio, video, etc.). It has real-time low-latency processing capabilities to achieve rapid insight and response, and can efficiently conduct instant analysis and process multi-modal data.\n", "\n", "This notebook shows how to use functionality related to the DingoDB vector database.\n", "\n", "To run it, you need a [DingoDB instance up and running](https://github.com/dingodb/dingo-deploy/blob/main/README.md)." ] }, { "cell_type": "code", "execution_count": null, "id": "a62cff8a-bcf7-4e33-bbbc-76999c2e3e20", "metadata": { "tags": [] }, "outputs": [], "source": [ "!pip install dingodb\n", "# or install the latest version:\n", "!pip install git+https://git@github.com/dingodb/pydingo.git" ] }, { "cell_type": "markdown", "id": "7a0f9e02-8eb0-4aef-b11f-8861360472ee", "metadata": {}, "source": [ "We want to use `OpenAIEmbeddings`, so we need an OpenAI API key."
] }, { "cell_type": "code", "execution_count": 1, "id": "8b6ed9cd-81b9-46e5-9c20-5aafca2844d0", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "OpenAI API Key:········\n" ] } ], "source": [ "import os\n", "import getpass\n", "\n", "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")" ] }, { "cell_type": "code", "execution_count": 2, "id": "aac9563e", "metadata": { "tags": [] }, "outputs": [], "source": [ "from langchain.embeddings.openai import OpenAIEmbeddings\n", "from langchain.text_splitter import CharacterTextSplitter\n", "from langchain.vectorstores import Dingo\n", "from langchain.document_loaders import TextLoader" ] }, { "cell_type": "code", "execution_count": 3, "id": "a3c3999a", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Load the document and split it into chunks of up to 1000 characters\n", "loader = TextLoader(\"../../../state_of_the_union.txt\")\n", "documents = loader.load()\n", "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n", "docs = text_splitter.split_documents(documents)\n", "\n", "embeddings = OpenAIEmbeddings()" ] }, { "cell_type": "code", "execution_count": 4, "id": "dcf88bdf", "metadata": { "tags": [] }, "outputs": [], "source": [ "from dingodb import DingoDB\n", "\n", "index_name = \"langchain-demo\"\n", "\n", "# Connect to DingoDB; adjust user, password and host to your own deployment\n", "dingo_client = DingoDB(user=\"\", password=\"\", host=[\"127.0.0.1:13000\"])\n", "# First, check if our index already exists. 
If it doesn't, we create it\n", "if index_name not in dingo_client.get_index():\n", "    # we create a new index, modify to your own\n", "    dingo_client.create_index(\n", "        index_name=index_name,\n", "        dimension=1536,\n", "        metric_type='cosine',\n", "        auto_id=False\n", "    )\n", "\n", "# The OpenAI embedding model `text-embedding-ada-002` uses 1536 dimensions\n", "docsearch = Dingo.from_documents(docs, embeddings, client=dingo_client, index_name=index_name)\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "a8c513ab", "metadata": {}, "outputs": [], "source": [ "query = \"What did the president say about Ketanji Brown Jackson\"\n", "docs = docsearch.similarity_search(query)" ] }, { "cell_type": "code", "execution_count": 2, "id": "fc516993", "metadata": {}, "outputs": [], "source": [ "print(docs[0].page_content)" ] }, { "cell_type": "markdown", "id": "1eca81e4", "metadata": {}, "source": [ "### Adding More Text to an Existing Index\n", "\n", "More text can be embedded and upserted into an existing Dingo index using the `add_texts` function." ] }, { "cell_type": "code", "execution_count": null, "id": "e40d558b", "metadata": {}, "outputs": [], "source": [ "vectorstore = Dingo(embeddings, \"text\", client=dingo_client, index_name=index_name)\n", "\n", "vectorstore.add_texts([\"More text!\"])" ] }, { "cell_type": "markdown", "id": "bcb858a8", "metadata": {}, "source": [ "### Maximal Marginal Relevance Searches\n", "\n", "In addition to similarity search, the retriever object can also use maximal marginal relevance (`mmr`) as its search type."
] }, { "cell_type": "code", "execution_count": null, "id": "649083ab", "metadata": {}, "outputs": [], "source": [ "retriever = docsearch.as_retriever(search_type=\"mmr\")\n", "matched_docs = retriever.get_relevant_documents(query)\n", "for i, d in enumerate(matched_docs):\n", "    print(f\"\\n## Document {i}\\n\")\n", "    print(d.page_content)" ] }, { "cell_type": "markdown", "id": "7d3831ad", "metadata": {}, "source": [ "Or use `max_marginal_relevance_search` directly:" ] }, { "cell_type": "code", "execution_count": null, "id": "732f58b1", "metadata": {}, "outputs": [], "source": [ "# Fetch 10 candidates, then select 2 that balance relevance and diversity\n", "found_docs = docsearch.max_marginal_relevance_search(query, k=2, fetch_k=10)\n", "for i, doc in enumerate(found_docs):\n", "    print(f\"{i + 1}.\", doc.page_content, \"\\n\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.11" } }, "nbformat": 4, "nbformat_minor": 5 }