From 158dda440b307f695c7bcff263a4a259a3fab746 Mon Sep 17 00:00:00 2001
From: Mark Cusack
Date: Tue, 12 Dec 2023 19:59:05 -0500
Subject: [PATCH] Added notebook tutorial on using Yellowbrick as a vector store with LangChain (#14509)

- **Description:** a notebook documenting usage of Yellowbrick as a vector store

---------

Co-authored-by: markcusack
Co-authored-by: markcusack
---
 .../vectorstores/yellowbrick.ipynb | 441 ++++++++++++++++++
 1 file changed, 441 insertions(+)
 create mode 100644 docs/docs/integrations/vectorstores/yellowbrick.ipynb

diff --git a/docs/docs/integrations/vectorstores/yellowbrick.ipynb b/docs/docs/integrations/vectorstores/yellowbrick.ipynb
new file mode 100644
index 0000000000..34658b912c
--- /dev/null
+++ b/docs/docs/integrations/vectorstores/yellowbrick.ipynb
@@ -0,0 +1,441 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "7e80d338-091b-421c-ac66-5950b14944b2",
+   "metadata": {},
+   "source": [
+    "# Yellowbrick\n",
+    "\n",
+    "[Yellowbrick](https://yellowbrick.com/yellowbrick-data-warehouse/) is an elastic, massively parallel processing (MPP) SQL database that runs in the cloud and on-premises, using Kubernetes for scale, resilience, and cloud portability. Yellowbrick is designed to address the largest and most complex business-critical data warehousing use cases. The efficiency at scale that Yellowbrick provides also enables it to be used as a high-performance, scalable vector database to store and search vectors with SQL.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9291d9e5-d404-405f-8307-87d80d0233f2",
+   "metadata": {},
+   "source": [
+    "## Using Yellowbrick as the vector store for ChatGPT\n",
+    "\n",
+    "This tutorial demonstrates how to create a simple chatbot backed by ChatGPT that uses Yellowbrick as a vector store to support Retrieval Augmented Generation (RAG). What you'll need:\n",
+    "\n",
+    "1. An account on the [Yellowbrick sandbox](https://cloudlabs.yellowbrick.com/)\n",
+    "2. An API key from [OpenAI](https://platform.openai.com/)\n",
+    "\n",
+    "The tutorial is divided into five parts. First, we'll use LangChain to create a baseline chatbot that interacts with ChatGPT without a vector store. Second, we'll create an embeddings table in Yellowbrick that will represent the vector store. Third, we'll load a series of documents (the Administration chapter of the Yellowbrick Manual). Fourth, we'll create the vector representation of those documents and store them in a Yellowbrick table. Lastly, we'll send the same queries to the improved chatbot to see the results.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "924d1c25",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Install all needed libraries\n",
+    "%pip install langchain\n",
+    "%pip install openai\n",
+    "%pip install psycopg2-binary\n",
+    "%pip install tiktoken"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5928e9c7-7666-4282-9cb4-00d919228ce0",
+   "metadata": {},
+   "source": [
+    "## Setup: Enter the information used to connect to Yellowbrick and the OpenAI API\n",
+    "\n",
+    "Our chatbot integrates with ChatGPT via the LangChain library, so you'll need an API key from OpenAI first.\n",
+    "\n",
+    "To get an API key for OpenAI:\n",
+    "1. Register at https://platform.openai.com/\n",
+    "2. Add a payment method - you're unlikely to exceed the free quota\n",
+    "3. Create an API key\n",
+    "\n",
+    "You'll also need your username, password, and database name from the welcome email you receive when you sign up for the Yellowbrick Sandbox account.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "aaf215bb",
+   "metadata": {},
+   "source": [
+    "Modify the following to include the information for your Yellowbrick database and your OpenAI API key:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a4393d8d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Modify these values to match your Yellowbrick Sandbox and OpenAI API key\n",
+    "YBUSER = \"[SANDBOX USER]\"\n",
+    "YBPASSWORD = \"[SANDBOX PASSWORD]\"\n",
+    "YBDATABASE = \"[SANDBOX_DATABASE]\"\n",
+    "YBHOST = \"trialsandbox.sandbox.aws.yellowbrickcloud.com\"\n",
+    "\n",
+    "OPENAI_API_KEY = \"[OPENAI API KEY]\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c186f99b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Import libraries and set up keys / login info\n",
+    "import os\n",
+    "import urllib.parse as urlparse\n",
+    "\n",
+    "import psycopg2\n",
+    "from IPython.display import Markdown, display\n",
+    "from langchain.chains import LLMChain, RetrievalQAWithSourcesChain\n",
+    "from langchain.chat_models import ChatOpenAI\n",
+    "from langchain.docstore.document import Document\n",
+    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
+    "from langchain.prompts.chat import (\n",
+    "    ChatPromptTemplate,\n",
+    "    HumanMessagePromptTemplate,\n",
+    "    SystemMessagePromptTemplate,\n",
+    ")\n",
+    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+    "from langchain.vectorstores import Yellowbrick\n",
+    "\n",
+    "# Establish connection parameters to Yellowbrick. If you've signed up for the Sandbox, fill in the information from your welcome email in the cell above.\n",
+    "yellowbrick_connection_string = (\n",
+    "    f\"postgres://{urlparse.quote(YBUSER)}:{YBPASSWORD}@{YBHOST}:5432/{YBDATABASE}\"\n",
+    ")\n",
+    "\n",
+    "YB_DOC_DATABASE = \"sample_data\"\n",
+    "YB_DOC_TABLE = \"yellowbrick_documentation\"\n",
+    "embedding_table = \"my_embeddings\"\n",
+    "\n",
+    "# API key for OpenAI. Sign up at https://platform.openai.com\n",
+    "os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e955b19b",
+   "metadata": {},
+   "source": [
+    "## Part 1: Creating a baseline chatbot backed by ChatGPT without a vector store\n",
+    "\n",
+    "We will use LangChain to query ChatGPT. As there is no vector store, ChatGPT will have no context in which to answer the question.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "538f8b96-1b54-4f2f-9239-dfb5cc7fd259",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Set up the chat model and specific prompt\n",
+    "system_template = \"\"\"If you don't know the answer, make up your best guess.\"\"\"\n",
+    "messages = [\n",
+    "    SystemMessagePromptTemplate.from_template(system_template),\n",
+    "    HumanMessagePromptTemplate.from_template(\"{question}\"),\n",
+    "]\n",
+    "prompt = ChatPromptTemplate.from_messages(messages)\n",
+    "\n",
+    "llm = ChatOpenAI(\n",
+    "    model_name=\"gpt-3.5-turbo\",  # Modify model_name if you have access to GPT-4\n",
+    "    temperature=0,\n",
+    "    max_tokens=256,\n",
+    ")\n",
+    "\n",
+    "chain = LLMChain(\n",
+    "    llm=llm,\n",
+    "    prompt=prompt,\n",
+    "    verbose=False,\n",
+    ")\n",
+    "\n",
+    "\n",
+    "def print_result_simple(query):\n",
+    "    result = chain(query)\n",
+    "    output_text = f\"\"\"### Question:\n",
+    "    {query}\n",
+    "    ### Answer:\n",
+    "    {result['text']}\n",
+    "    \"\"\"\n",
+    "    display(Markdown(output_text))\n",
+    "\n",
+    "\n",
+    "# Use the chain to query\n",
+    "print_result_simple(\"How many databases can be in a Yellowbrick Instance?\")\n",
+    "\n",
+    "print_result_simple(\"What's an easy way to add users in bulk to Yellowbrick?\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "798c7aa6-5904-4860-b4a9-896fe7681a45",
+   "metadata": {},
+   "source": [
+    "## Part 2: Connect to Yellowbrick and create the embedding tables\n",
+    "\n",
+    "To load your document embeddings into Yellowbrick, create a table to store them in. Note that the Yellowbrick database holding the table must be UTF-8 encoded.\n",
+    "\n",
+    "Create a table in a UTF-8 database with the following schema, providing a table name of your choice:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e72daf30-6160-4ff3-921f-c4c9da329991",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Establish a connection to the Yellowbrick database\n",
+    "try:\n",
+    "    conn = psycopg2.connect(yellowbrick_connection_string)\n",
+    "except psycopg2.Error as e:\n",
+    "    print(f\"Error connecting to the database: {e}\")\n",
+    "    exit(1)\n",
+    "\n",
+    "# Create a cursor object using the connection\n",
+    "cursor = conn.cursor()\n",
+    "\n",
+    "# Define the SQL statement to create the embeddings table\n",
+    "create_table_query = f\"\"\"\n",
+    "CREATE TABLE IF NOT EXISTS {embedding_table} (\n",
+    "    id uuid,\n",
+    "    embedding_id integer,\n",
+    "    text character varying(60000),\n",
+    "    metadata character varying(1024),\n",
+    "    embedding double precision\n",
+    ")\n",
+    "DISTRIBUTE ON (id);\n",
+    "TRUNCATE TABLE {embedding_table};\n",
+    "\"\"\"\n",
+    "\n",
+    "# Execute the SQL query to create the table\n",
+    "try:\n",
+    "    cursor.execute(create_table_query)\n",
+    "    print(f\"Table '{embedding_table}' created successfully!\")\n",
+    "except psycopg2.Error as e:\n",
+    "    print(f\"Error creating table: {e}\")\n",
+    "    conn.rollback()\n",
+    "\n",
+    "# Commit changes and close the cursor and connection\n",
+    "conn.commit()\n",
+    "cursor.close()\n",
+    "conn.close()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8690ac3d-a775-4b0c-9499-9825885f3c82",
+   "metadata": {},
+   "source": [
+    "## Part 3: Extract the documents to index from an existing table in Yellowbrick\n",
+    "\n",
+    "Extract document paths and contents from an existing Yellowbrick table. We'll create embeddings from these documents in the next step.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "60ab85bb-7901-44cf-b149-10fcde2ab91d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "yellowbrick_doc_connection_string = (\n",
+    "    f\"postgres://{urlparse.quote(YBUSER)}:{YBPASSWORD}@{YBHOST}:5432/{YB_DOC_DATABASE}\"\n",
+    ")\n",
+    "\n",
+    "# Establish a connection to the Yellowbrick database\n",
+    "conn = psycopg2.connect(yellowbrick_doc_connection_string)\n",
+    "\n",
+    "# Create a cursor object\n",
+    "cursor = conn.cursor()\n",
+    "\n",
+    "# Query to select all documents from the table\n",
+    "query = f\"SELECT path, document FROM {YB_DOC_TABLE}\"\n",
+    "\n",
+    "# Execute the query\n",
+    "cursor.execute(query)\n",
+    "\n",
+    "# Fetch all documents\n",
+    "yellowbrick_documents = cursor.fetchall()\n",
+    "\n",
+    "print(f\"Extracted {len(yellowbrick_documents)} documents successfully!\")\n",
+    "\n",
+    "# Close the cursor and connection\n",
+    "cursor.close()\n",
+    "conn.close()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b62b4150-2aa3-453e-a4db-81a2f8a11e70",
+   "metadata": {},
+   "source": [
+    "## Part 4: Load the Yellowbrick vector store with documents\n",
+    "\n",
+    "Go through the documents, split them into digestible chunks, create the embeddings, and insert them into the Yellowbrick table. This takes around 5 minutes.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "de914b10-850e-4c5b-a09b-c6a14006637c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Split documents into chunks for conversion to embeddings\n",
+    "DOCUMENT_BASE_URL = \"https://docs.yellowbrick.com/6.7.1/\"  # Actual URL\n",
+    "\n",
+    "separator = \"\\n## \"  # This separator assumes the Markdown docs from the repo use '## ' as the logical main header most of the time\n",
+    "chunk_size_limit = 2000\n",
+    "max_chunk_overlap = 200\n",
+    "\n",
+    "documents = [\n",
+    "    Document(\n",
+    "        page_content=document[1],\n",
+    "        metadata={\"source\": DOCUMENT_BASE_URL + document[0].replace(\".md\", \".html\")},\n",
+    "    )\n",
+    "    for document in yellowbrick_documents\n",
+    "]\n",
+    "\n",
+    "text_splitter = RecursiveCharacterTextSplitter(\n",
+    "    chunk_size=chunk_size_limit,\n",
+    "    chunk_overlap=max_chunk_overlap,\n",
+    "    separators=[separator, \"\\n\\n\", \"\\n\", \",\", \" \", \"\"],\n",
+    ")\n",
+    "split_docs = text_splitter.split_documents(documents)\n",
+    "\n",
+    "embeddings = OpenAIEmbeddings()\n",
+    "vector_store = Yellowbrick.from_documents(\n",
+    "    documents=split_docs,\n",
+    "    embedding=embeddings,\n",
+    "    connection_string=yellowbrick_connection_string,\n",
+    "    table=embedding_table,\n",
+    ")\n",
+    "\n",
+    "print(f\"Created a vector store with {len(split_docs)} chunks from {len(documents)} documents\")"
+   ]
+  },
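+  {
+   "cell_type": "markdown",
+   "id": "b7c3e9f1",
+   "metadata": {},
+   "source": [
+    "As an optional sanity check, you can query the loaded vector store directly. The cell below is a minimal sketch: it assumes the `vector_store` object created above and uses LangChain's standard `similarity_search` method; the example question and the value of `k` are arbitrary.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d4a8c2e6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional check: run a similarity search directly against the vector store\n",
+    "# to confirm the embeddings were loaded and are searchable.\n",
+    "results = vector_store.similarity_search(\n",
+    "    \"How do I add users in bulk to Yellowbrick?\", k=3\n",
+    ")\n",
+    "for doc in results:\n",
+    "    print(doc.metadata[\"source\"])\n",
+    "    print(doc.page_content[:200])\n",
+    "    print()"
+   ]
+  },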
+  {
+   "cell_type": "markdown",
+   "id": "beee89f5-0f1e-4c6e-91a9-44c10762d466",
+   "metadata": {},
+   "source": [
+    "## Part 5: Creating a chatbot that uses Yellowbrick as the vector store\n",
+    "\n",
+    "Next, we add Yellowbrick as a vector store. The vector store has been populated with embeddings representing the Administration chapter of the Yellowbrick product documentation.\n",
+    "\n",
+    "We'll send the same queries as above to see the improved responses.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7daa9d4f-7804-4cfa-9873-415998d5e0f5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "system_template = \"\"\"Use the following pieces of context to answer the user's question.\n",
+    "Take note of the sources and include them in the answer in the format: \"SOURCES: source1 source2\", use \"SOURCES\" in capital letters regardless of the number of sources.\n",
+    "If you don't know the answer, just say \"I don't know\"; don't try to make up an answer.\n",
+    "----------------\n",
+    "{summaries}\"\"\"\n",
+    "messages = [\n",
+    "    SystemMessagePromptTemplate.from_template(system_template),\n",
+    "    HumanMessagePromptTemplate.from_template(\"{question}\"),\n",
+    "]\n",
+    "prompt = ChatPromptTemplate.from_messages(messages)\n",
+    "\n",
+    "vector_store = Yellowbrick(\n",
+    "    OpenAIEmbeddings(),\n",
+    "    yellowbrick_connection_string,\n",
+    "    embedding_table,  # Change the table name to reflect your embeddings\n",
+    ")\n",
+    "\n",
+    "chain_type_kwargs = {\"prompt\": prompt}\n",
+    "llm = ChatOpenAI(\n",
+    "    model_name=\"gpt-3.5-turbo\",  # Modify model_name if you have access to GPT-4\n",
+    "    temperature=0,\n",
+    "    max_tokens=256,\n",
+    ")\n",
+    "chain = RetrievalQAWithSourcesChain.from_chain_type(\n",
+    "    llm=llm,\n",
+    "    chain_type=\"stuff\",\n",
+    "    retriever=vector_store.as_retriever(search_kwargs={\"k\": 5}),\n",
+    "    return_source_documents=True,\n",
+    "    chain_type_kwargs=chain_type_kwargs,\n",
+    ")\n",
+    "\n",
+    "\n",
+    "def print_result_sources(query):\n",
+    "    result = chain(query)\n",
+    "    output_text = f\"\"\"### Question:\n",
+    "    {query}\n",
+    "    ### Answer:\n",
+    "    {result['answer']}\n",
+    "    ### Sources:\n",
+    "    {result['sources']}\n",
+    "    ### All relevant sources:\n",
+    "    {', '.join(list(set([doc.metadata['source'] for doc in result['source_documents']])))}\n",
+    "    \"\"\"\n",
+    "    display(Markdown(output_text))\n",
+    "\n",
+    "\n",
+    "# Use the chain to query\n",
+    "print_result_sources(\"How many databases can be in a Yellowbrick Instance?\")\n",
+    "\n",
+    "print_result_sources(\"What's an easy way to add users in bulk to Yellowbrick?\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "697c8a38",
+   "metadata": {},
+   "source": [
+    "## Next steps\n",
+    "\n",
+    "This code can be modified to ask different questions. You can also load your own documents into the vector store. The LangChain library is very flexible and can parse a large variety of file formats (including HTML, PDF, etc.).\n",
+    "\n",
+    "You can also modify this to use Hugging Face embedding models and Meta's Llama 2 LLM for a completely private chatbot experience; a minimal sketch of swapping in Hugging Face embeddings follows below."
+   ]
+  },
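+  {
+   "cell_type": "markdown",
+   "id": "e5f1a7c3",
+   "metadata": {},
+   "source": [
+    "The cell below is a minimal sketch of the Hugging Face option, not part of the tutorial flow above. It assumes the `sentence-transformers` package is installed and uses a hypothetical table name, `my_hf_embeddings`. Because a different embedding model produces vectors of a different dimensionality, create and use a separate table (as in Part 2) rather than reusing the OpenAI embeddings table. The ChatOpenAI LLM could similarly be swapped for a locally hosted model.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f6b2d8e4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sketch: swap the OpenAI embeddings for a local Hugging Face model.\n",
+    "# Requires: %pip install sentence-transformers\n",
+    "from langchain.embeddings import HuggingFaceEmbeddings\n",
+    "\n",
+    "hf_embeddings = HuggingFaceEmbeddings(\n",
+    "    model_name=\"sentence-transformers/all-MiniLM-L6-v2\"\n",
+    ")\n",
+    "\n",
+    "# Use a separate, pre-created table (see Part 2) because these vectors have a\n",
+    "# different dimensionality than the OpenAI embeddings stored earlier.\n",
+    "hf_vector_store = Yellowbrick.from_documents(\n",
+    "    documents=split_docs,\n",
+    "    embedding=hf_embeddings,\n",
+    "    connection_string=yellowbrick_connection_string,\n",
+    "    table=\"my_hf_embeddings\",  # hypothetical table name\n",
+    ")"
+   ]
+  }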
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "langchain_venv",
+   "language": "python",
+   "name": "langchain_venv"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}