Added notebook tutorial on using Yellowbrick as a vector store with LangChain (#14509)

- **Description:** a notebook documenting Yellowbrick as a vector store usage --------- Co-authored-by: markcusack <markcusack@markcusacksmac.lan> Co-authored-by: markcusack <markcusack@Mark-Cusack-sMac.local>
10 months ago · 158dda440b
parent 0dc432aa95
commit 158dda440b
1 changed files with 441 additions and 0 deletions
--- a/docs/docs/integrations/vectorstores/yellowbrick.ipynb
+++ b/docs/docs/integrations/vectorstores/yellowbrick.ipynb
@ -0,0 +1,441 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "id": "7e80d338-091b-421c-ac66-5950b14944b2",
   "metadata": {},
   "source": [
    "# Yellowbrick\n",
    "\n",
    "[Yellowbrick](https://yellowbrick.com/yellowbrick-data-warehouse/) is an elastic, massively parallel processing (MPP) SQL database that runs in the cloud and on-premises, using kubernetes for scale, resilience and cloud portability. Yellowbrick is designed to address the largest and most complex business-critical data warehousing use cases. The efficiency at scale that Yellowbrick provides also enables it to be used as a high performance and scalable vector database to store and search vectors with SQL. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9291d9e5-d404-405f-8307-87d80d0233f2",
   "metadata": {},
   "source": [
    "## Using Yellowbrick as the vector store for ChatGpt\n",
    "\n",
    "This tutorial demonstrates how to create a simple chatbot backed by ChatGpt that uses Yellowbrick as a vector store to support Retrieval Augmented Generation (RAG). What you'll need:\n",
    "\n",
    "1. An account on the [Yellowbrick sandbox](https://cloudlabs.yellowbrick.com/)\n",
    "2. An api key from [OpenAI](https://platform.openai.com/)\n",
    "\n",
    "The tutorial is divided into five parts. First we'll use langchain to create a baseline chatbot to interact with ChatGpt without a vector store. Second, we'll create an embeddings table in Yellowbrick that will represent the vector store. Third, we'll load a series of documents (the Administration chapter of the Yellowbrick Manual). Fourth, we'll create the vector representation of those documents and store in a Yellowbrick table.  Lastly, we'll send the same queries to the improved chatbox to see the results.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "924d1c25",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install all needed libraries\n",
    "%pip install langchain\n",
    "%pip install openai\n",
    "%pip install psycopg2-binary\n",
    "%pip install tiktoken"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5928e9c7-7666-4282-9cb4-00d919228ce0",
   "metadata": {},
   "source": [
    "## Setup: Enter the information used to connect to Yellowbrick and OpenAI API\n",
    "\n",
    "Our chatbot integrates with ChatGpt via the langchain library, so you'll need an API key from OpenAI first:\n",
    "\n",
    "To get an api key for OpenAI:\n",
    "1. Register at https://platform.openai.com/\n",
    "2. Add a payment method - You're unlikely to go over free quota\n",
    "3. Create an API key\n",
    "\n",
    "You'll also need your Username, Password, and Database name from the welcome email when you sign up for the Yellowbrick Sandbox Account.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aaf215bb",
   "metadata": {},
   "source": [
    "The following should be modified to include the information for your Yellowbrick database and OpenAPI Key"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a4393d8d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Modify these values to match your Yellowbrick Sandbox and OpenAI API Key\n",
    "YBUSER = \"[SANDBOX USER]\"\n",
    "YBPASSWORD = \"[SANDBOX PASSWORD]\"\n",
    "YBDATABASE = \"[SANDBOX_DATABASE]\"\n",
    "YBHOST = \"trialsandbox.sandbox.aws.yellowbrickcloud.com\"\n",
    "\n",
    "OPENAI_API_KEY = \"[OPENAI API KEY]\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c186f99b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import libraries and setup keys / login info\n",
    "import os\n",
    "import pathlib\n",
    "import re\n",
    "import sys\n",
    "import urllib.parse as urlparse\n",
    "from getpass import getpass\n",
    "\n",
    "import psycopg2\n",
    "from IPython.display import Markdown, display\n",
    "from langchain.chains import LLMChain, RetrievalQAWithSourcesChain\n",
    "from langchain.chat_models import ChatOpenAI\n",
    "from langchain.docstore.document import Document\n",
    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "from langchain.vectorstores import Yellowbrick\n",
    "\n",
    "# Establish connection parameters to Yellowbrick.  If you've signed up for Sandbox, fill in the information from your welcome mail here:\n",
    "yellowbrick_connection_string = (\n",
    "    f\"postgres://{urlparse.quote(YBUSER)}:{YBPASSWORD}@{YBHOST}:5432/{YBDATABASE}\"\n",
    ")\n",
    "\n",
    "YB_DOC_DATABASE = \"sample_data\"\n",
    "YB_DOC_TABLE = \"yellowbrick_documentation\"\n",
    "embedding_table = \"my_embeddings\"\n",
    "\n",
    "# API Key for OpenAI.  Signup at https://platform.openai.com\n",
    "os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY\n",
    "\n",
    "from langchain.prompts.chat import (\n",
    "    ChatPromptTemplate,\n",
    "    HumanMessagePromptTemplate,\n",
    "    SystemMessagePromptTemplate,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e955b19b",
   "metadata": {},
   "source": [
    "## Part 1: Creating a baseline chatbot backed by ChatGpt without a Vector Store\n",
    "\n",
    "We will use langchain to query ChatGPT.  As there is no Vector Store, ChatGPT will have no context in which to answer the question.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "538f8b96-1b54-4f2f-9239-dfb5cc7fd259",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set up the chat model and specific prompt\n",
    "system_template = \"\"\"If you don't know the answer, Make up your best guess.\"\"\"\n",
    "messages = [\n",
    "    SystemMessagePromptTemplate.from_template(system_template),\n",
    "    HumanMessagePromptTemplate.from_template(\"{question}\"),\n",
    "]\n",
    "prompt = ChatPromptTemplate.from_messages(messages)\n",
    "\n",
    "chain_type_kwargs = {\"prompt\": prompt}\n",
    "llm = ChatOpenAI(\n",
    "    model_name=\"gpt-3.5-turbo\",  # Modify model_name if you have access to GPT-4\n",
    "    temperature=0,\n",
    "    max_tokens=256,\n",
    ")\n",
    "\n",
    "chain = LLMChain(\n",
    "    llm=llm,\n",
    "    prompt=prompt,\n",
    "    verbose=False,\n",
    ")\n",
    "\n",
    "\n",
    "def print_result_simple(query):\n",
    "    result = chain(query)\n",
    "    output_text = f\"\"\"### Question:\n",
    "  {query}\n",
    "  ### Answer: \n",
    "  {result['text']}\n",
    "    \"\"\"\n",
    "    display(Markdown(output_text))\n",
    "\n",
    "\n",
    "# Use the chain to query\n",
    "print_result_simple(\"How many databases can be in a Yellowbrick Instance?\")\n",
    "\n",
    "print_result_simple(\"What's an easy way to add users in bulk to Yellowbrick?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "798c7aa6-5904-4860-b4a9-896fe7681a45",
   "metadata": {},
   "source": [
    "## Part 2: Connect to Yellowbrick and create the embedding tables\n",
    "\n",
    "To load your document embeddings into Yellowbrick, you should create your own table for storing them in. Note that the \n",
    "Yellowbrick database that the table is in has to be UTF-8 encoded. \n",
    "\n",
    "Create a table in a UTF-8 database with the following schema, providing a table name of your choice:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e72daf30-6160-4ff3-921f-c4c9da329991",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Establish a connection to the Yellowbrick database\n",
    "try:\n",
    "    conn = psycopg2.connect(yellowbrick_connection_string)\n",
    "except psycopg2.Error as e:\n",
    "    print(f\"Error connecting to the database: {e}\")\n",
    "    exit(1)\n",
    "\n",
    "# Create a cursor object using the connection\n",
    "cursor = conn.cursor()\n",
    "\n",
    "# Define the SQL statement to create a table\n",
    "create_table_query = f\"\"\"\n",
    "CREATE TABLE if not exists {embedding_table} (\n",
    "    id uuid,\n",
    "    embedding_id integer,\n",
    "    text character varying(60000),\n",
    "    metadata character varying(1024),\n",
    "    embedding double precision\n",
    ")\n",
    "DISTRIBUTE ON (id);\n",
    "truncate table {embedding_table};\n",
    "\"\"\"\n",
    "\n",
    "# Execute the SQL query to create a table\n",
    "try:\n",
    "    cursor.execute(create_table_query)\n",
    "    print(f\"Table '{embedding_table}' created successfully!\")\n",
    "except psycopg2.Error as e:\n",
    "    print(f\"Error creating table: {e}\")\n",
    "    conn.rollback()\n",
    "\n",
    "# Commit changes and close the cursor and connection\n",
    "conn.commit()\n",
    "cursor.close()\n",
    "conn.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8690ac3d-a775-4b0c-9499-9825885f3c82",
   "metadata": {},
   "source": [
    "## Part 3: Extract the documents to index from an existing table in Yellowbrick\n",
    "Extract document paths and contents from an existing Yellowbrick table. We'll use these documents to create embeddings from in the next step.\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60ab85bb-7901-44cf-b149-10fcde2ab91d",
   "metadata": {},
   "outputs": [],
   "source": [
    "yellowbrick_doc_connection_string = (\n",
    "    f\"postgres://{urlparse.quote(YBUSER)}:{YBPASSWORD}@{YBHOST}:5432/{YB_DOC_DATABASE}\"\n",
    ")\n",
    "\n",
    "# Establish a connection to the Yellowbrick database\n",
    "conn = psycopg2.connect(yellowbrick_doc_connection_string)\n",
    "\n",
    "# Create a cursor object\n",
    "cursor = conn.cursor()\n",
    "\n",
    "# Query to select all documents from the table\n",
    "query = f\"SELECT path, document FROM {YB_DOC_TABLE}\"\n",
    "\n",
    "# Execute the query\n",
    "cursor.execute(query)\n",
    "\n",
    "# Fetch all documents\n",
    "yellowbrick_documents = cursor.fetchall()\n",
    "\n",
    "print(f\"Extracted {len(yellowbrick_documents)} documents successfully!\")\n",
    "\n",
    "# Close the cursor and connection\n",
    "cursor.close()\n",
    "conn.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b62b4150-2aa3-453e-a4db-81a2f8a11e70",
   "metadata": {},
   "source": [
    "## Part 4: Load the Yellowbrick Vector Store with Documents\n",
    "Go through documents, split them into digestable chunks, create the embedding and insert into the Yellowbrick table. This takes around 5 minutes.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "de914b10-850e-4c5b-a09b-c6a14006637c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split documents into chunks for conversion to embeddings\n",
    "DOCUMENT_BASE_URL = \"https://docs.yellowbrick.com/6.7.1/\"  # Actual URL\n",
    "\n",
    "\n",
    "separator = \"\\n## \"  # This separator assumes Markdown docs from the repo uses ### as logical main header most of the time\n",
    "chunk_size_limit = 2000\n",
    "max_chunk_overlap = 200\n",
    "\n",
    "documents = [\n",
    "    Document(\n",
    "        page_content=document[1],\n",
    "        metadata={\"source\": DOCUMENT_BASE_URL + document[0].replace(\".md\", \".html\")},\n",
    "    )\n",
    "    for document in yellowbrick_documents\n",
    "]\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=chunk_size_limit,\n",
    "    chunk_overlap=max_chunk_overlap,\n",
    "    separators=[separator, \"\\nn\", \"\\n\", \",\", \" \", \"\"],\n",
    ")\n",
    "split_docs = text_splitter.split_documents(documents)\n",
    "\n",
    "docs_text = [doc.page_content for doc in split_docs]\n",
    "\n",
    "embeddings = OpenAIEmbeddings()\n",
    "vector_store = Yellowbrick.from_documents(\n",
    "    documents=split_docs,\n",
    "    embedding=embeddings,\n",
    "    connection_string=yellowbrick_connection_string,\n",
    "    table=embedding_table,\n",
    ")\n",
    "\n",
    "print(f\"Created vector store with {len(documents)} documents\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "beee89f5-0f1e-4c6e-91a9-44c10762d466",
   "metadata": {},
   "source": [
    "## Part 5: Creating a chatbot that uses Yellowbrick as the vector store\n",
    "\n",
    "Next, we add Yellowbrick as a vector store. The vector store has been populated with embeddings representing the administrative chapter of the Yellowbrick product documentation.\n",
    "\n",
    "We'll send the same queries as above to see the impoved responses.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7daa9d4f-7804-4cfa-9873-415998d5e0f5",
   "metadata": {},
   "outputs": [],
   "source": [
    "system_template = \"\"\"Use the following pieces of context to answer the users question.\n",
    "Take note of the sources and include them in the answer in the format: \"SOURCES: source1 source2\", use \"SOURCES\" in capital letters regardless of the number of sources.\n",
    "If you don't know the answer, just say that \"I don't know\", don't try to make up an answer.\n",
    "----------------\n",
    "{summaries}\"\"\"\n",
    "messages = [\n",
    "    SystemMessagePromptTemplate.from_template(system_template),\n",
    "    HumanMessagePromptTemplate.from_template(\"{question}\"),\n",
    "]\n",
    "prompt = ChatPromptTemplate.from_messages(messages)\n",
    "\n",
    "vector_store = Yellowbrick(\n",
    "    OpenAIEmbeddings(),\n",
    "    yellowbrick_connection_string,\n",
    "    embedding_table,  # Change the table name to reflect your embeddings\n",
    ")\n",
    "\n",
    "chain_type_kwargs = {\"prompt\": prompt}\n",
    "llm = ChatOpenAI(\n",
    "    model_name=\"gpt-3.5-turbo\",  # Modify model_name if you have access to GPT-4\n",
    "    temperature=0,\n",
    "    max_tokens=256,\n",
    ")\n",
    "chain = RetrievalQAWithSourcesChain.from_chain_type(\n",
    "    llm=llm,\n",
    "    chain_type=\"stuff\",\n",
    "    retriever=vector_store.as_retriever(search_kwargs={\"k\": 5}),\n",
    "    return_source_documents=True,\n",
    "    chain_type_kwargs=chain_type_kwargs,\n",
    ")\n",
    "\n",
    "\n",
    "def print_result_sources(query):\n",
    "    result = chain(query)\n",
    "    output_text = f\"\"\"### Question: \n",
    "  {query}\n",
    "  ### Answer: \n",
    "  {result['answer']}\n",
    "  ### Sources: \n",
    "  {result['sources']}\n",
    "  ### All relevant sources:\n",
    "  {', '.join(list(set([doc.metadata['source'] for doc in result['source_documents']])))}\n",
    "    \"\"\"\n",
    "    display(Markdown(output_text))\n",
    "\n",
    "\n",
    "# Use the chain to query\n",
    "\n",
    "print_result_sources(\"How many databases can be in a Yellowbrick Instance?\")\n",
    "\n",
    "print_result_sources(\"Whats an easy way to add users in bulk to Yellowbrick?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "697c8a38",
   "metadata": {},
   "source": [
    "## Next Steps:\n",
    "\n",
    "This code can be modified to ask different questions. You can also load your own documents into the vector store. The langchain module is very flexible and can parse a large variety of files (including HTML, PDF, etc).\n",
    "\n",
    "You can also modify this to use Huggingface embeddings models and Meta's Llama 2 LLM for a completely private chatbox experience."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "langchain_venv",
   "language": "python",
   "name": "langchain_venv"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
 }