From 158dda440b307f695c7bcff263a4a259a3fab746 Mon Sep 17 00:00:00 2001
From: Mark Cusack
Date: Tue, 12 Dec 2023 19:59:05 -0500
Subject: [PATCH] Added notebook tutorial on using Yellowbrick as a vector store with LangChain (#14509)

- **Description:** a notebook documenting usage of Yellowbrick as a vector store

---------

Co-authored-by: markcusack
Co-authored-by: markcusack
---
 .../vectorstores/yellowbrick.ipynb | 441 ++++++++++++++++++
 1 file changed, 441 insertions(+)
 create mode 100644 docs/docs/integrations/vectorstores/yellowbrick.ipynb

diff --git a/docs/docs/integrations/vectorstores/yellowbrick.ipynb b/docs/docs/integrations/vectorstores/yellowbrick.ipynb
new file mode 100644
index 0000000000..34658b912c
--- /dev/null
+++ b/docs/docs/integrations/vectorstores/yellowbrick.ipynb
@@ -0,0 +1,441 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "7e80d338-091b-421c-ac66-5950b14944b2",
+   "metadata": {},
+   "source": [
+    "# Yellowbrick\n",
+    "\n",
+    "[Yellowbrick](https://yellowbrick.com/yellowbrick-data-warehouse/) is an elastic, massively parallel processing (MPP) SQL database that runs in the cloud and on-premises, using Kubernetes for scale, resilience, and cloud portability. Yellowbrick is designed to address the largest and most complex business-critical data warehousing use cases. The efficiency at scale that Yellowbrick provides also enables it to be used as a high-performance, scalable vector database to store and search vectors with SQL.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9291d9e5-d404-405f-8307-87d80d0233f2",
+   "metadata": {},
+   "source": [
+    "## Using Yellowbrick as the vector store for ChatGPT\n",
+    "\n",
+    "This tutorial demonstrates how to create a simple chatbot backed by ChatGPT that uses Yellowbrick as a vector store to support Retrieval Augmented Generation (RAG). What you'll need:\n",
+    "\n",
+    "1. An account on the [Yellowbrick sandbox](https://cloudlabs.yellowbrick.com/)\n",
+    "2. An API key from [OpenAI](https://platform.openai.com/)\n",
+    "\n",
+    "The tutorial is divided into five parts. First, we'll use LangChain to create a baseline chatbot that interacts with ChatGPT without a vector store. Second, we'll create an embeddings table in Yellowbrick that will represent the vector store. Third, we'll load a series of documents (the Administration chapter of the Yellowbrick Manual). Fourth, we'll create the vector representation of those documents and store them in a Yellowbrick table. Lastly, we'll send the same queries to the improved chatbot to see the results.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "924d1c25",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Install all needed libraries\n",
+    "%pip install langchain\n",
+    "%pip install openai\n",
+    "%pip install psycopg2-binary\n",
+    "%pip install tiktoken"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5928e9c7-7666-4282-9cb4-00d919228ce0",
+   "metadata": {},
+   "source": [
+    "## Setup: Enter the information used to connect to Yellowbrick and the OpenAI API\n",
+    "\n",
+    "Our chatbot integrates with ChatGPT via the LangChain library, so you'll need an API key from OpenAI first.\n",
+    "\n",
+    "To get an API key for OpenAI:\n",
+    "1. Register at https://platform.openai.com/\n",
+    "2. Add a payment method - you're unlikely to exceed the free quota\n",
+    "3. Create an API key\n",
+    "\n",
+    "You'll also need your username, password, and database name from the welcome email you receive when you sign up for the Yellowbrick Sandbox account.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "aaf215bb",
+   "metadata": {},
+   "source": [
+    "Modify the following to include the information for your Yellowbrick database and your OpenAI API key:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a4393d8d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Modify these values to match your Yellowbrick Sandbox and OpenAI API key\n",
+    "YBUSER = \"[SANDBOX USER]\"\n",
+    "YBPASSWORD = \"[SANDBOX PASSWORD]\"\n",
+    "YBDATABASE = \"[SANDBOX_DATABASE]\"\n",
+    "YBHOST = \"trialsandbox.sandbox.aws.yellowbrickcloud.com\"\n",
+    "\n",
+    "OPENAI_API_KEY = \"[OPENAI API KEY]\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c186f99b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Import libraries and set up keys / login info\n",
+    "import os\n",
+    "import urllib.parse as urlparse\n",
+    "\n",
+    "import psycopg2\n",
+    "from IPython.display import Markdown, display\n",
+    "from langchain.chains import LLMChain, RetrievalQAWithSourcesChain\n",
+    "from langchain.chat_models import ChatOpenAI\n",
+    "from langchain.docstore.document import Document\n",
+    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
+    "from langchain.prompts.chat import (\n",
+    "    ChatPromptTemplate,\n",
+    "    HumanMessagePromptTemplate,\n",
+    "    SystemMessagePromptTemplate,\n",
+    ")\n",
+    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+    "from langchain.vectorstores import Yellowbrick\n",
+    "\n",
+    "# Establish connection parameters to Yellowbrick. If you've signed up for the Sandbox, fill in the information from your welcome email in the cell above.\n",
+    "yellowbrick_connection_string = (\n",
+    "    f\"postgres://{urlparse.quote(YBUSER)}:{YBPASSWORD}@{YBHOST}:5432/{YBDATABASE}\"\n",
+    ")\n",
+    "\n",
+    "YB_DOC_DATABASE = \"sample_data\"\n",
+    "YB_DOC_TABLE = \"yellowbrick_documentation\"\n",
+    "embedding_table = \"my_embeddings\"\n",
+    "\n",
+    "# API key for OpenAI. Sign up at https://platform.openai.com\n",
+    "os.environ[\"OPENAI_API_KEY\"] = OPENAI_API_KEY"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e955b19b",
+   "metadata": {},
+   "source": [
+    "## Part 1: Creating a baseline chatbot backed by ChatGPT without a vector store\n",
+    "\n",
+    "We will use LangChain to query ChatGPT. As there is no vector store, ChatGPT will have no context in which to answer the question.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "538f8b96-1b54-4f2f-9239-dfb5cc7fd259",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Set up the chat model and specific prompt\n",
+    "system_template = \"\"\"If you don't know the answer, make up your best guess.\"\"\"\n",
+    "messages = [\n",
+    "    SystemMessagePromptTemplate.from_template(system_template),\n",
+    "    HumanMessagePromptTemplate.from_template(\"{question}\"),\n",
+    "]\n",
+    "prompt = ChatPromptTemplate.from_messages(messages)\n",
+    "\n",
+    "llm = ChatOpenAI(\n",
+    "    model_name=\"gpt-3.5-turbo\",  # Modify model_name if you have access to GPT-4\n",
+    "    temperature=0,\n",
+    "    max_tokens=256,\n",
+    ")\n",
+    "\n",
+    "chain = LLMChain(\n",
+    "    llm=llm,\n",
+    "    prompt=prompt,\n",
+    "    verbose=False,\n",
+    ")\n",
+    "\n",
+    "\n",
+    "def print_result_simple(query):\n",
+    "    result = chain(query)\n",
+    "    output_text = f\"\"\"### Question:\n",
+    "    {query}\n",
+    "    ### Answer:\n",
+    "    {result['text']}\n",
+    "    \"\"\"\n",
+    "    display(Markdown(output_text))\n",
+    "\n",
+    "\n",
+    "# Use the chain to query\n",
+    "print_result_simple(\"How many databases can be in a Yellowbrick Instance?\")\n",
+    "\n",
+    "print_result_simple(\"What's an easy way to add users in bulk to Yellowbrick?\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "798c7aa6-5904-4860-b4a9-896fe7681a45",
+   "metadata": {},
+   "source": [
+    "## Part 2: Connect to Yellowbrick and create the embedding tables\n",
+    "\n",
+    "To load your document embeddings into Yellowbrick, create a table to store them in. Note that the Yellowbrick database holding the table must be UTF-8 encoded.\n",
+    "\n",
+    "Create a table in a UTF-8 database with the following schema, providing a table name of your choice:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e72daf30-6160-4ff3-921f-c4c9da329991",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Establish a connection to the Yellowbrick database\n",
+    "try:\n",
+    "    conn = psycopg2.connect(yellowbrick_connection_string)\n",
+    "except psycopg2.Error as e:\n",
+    "    print(f\"Error connecting to the database: {e}\")\n",
+    "    exit(1)\n",
+    "\n",
+    "# Create a cursor object using the connection\n",
+    "cursor = conn.cursor()\n",
+    "\n",
+    "# Define the SQL statement to create the embeddings table\n",
+    "create_table_query = f\"\"\"\n",
+    "CREATE TABLE IF NOT EXISTS {embedding_table} (\n",
+    "    id uuid,\n",
+    "    embedding_id integer,\n",
+    "    text character varying(60000),\n",
+    "    metadata character varying(1024),\n",
+    "    embedding double precision\n",
+    ")\n",
+    "DISTRIBUTE ON (id);\n",
+    "TRUNCATE TABLE {embedding_table};\n",
+    "\"\"\"\n",
+    "\n",
+    "# Execute the SQL query to create the table\n",
+    "try:\n",
+    "    cursor.execute(create_table_query)\n",
+    "    print(f\"Table '{embedding_table}' created successfully!\")\n",
+    "except psycopg2.Error as e:\n",
+    "    print(f\"Error creating table: {e}\")\n",
+    "    conn.rollback()\n",
+    "\n",
+    "# Commit changes and close the cursor and connection\n",
+    "conn.commit()\n",
+    "cursor.close()\n",
+    "conn.close()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8690ac3d-a775-4b0c-9499-9825885f3c82",
+   "metadata": {},
+   "source": [
+    "## Part 3: Extract the documents to index from an existing table in Yellowbrick\n",
+    "\n",
+    "Extract document paths and contents from an existing Yellowbrick table. We'll create embeddings from these documents in the next step.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "60ab85bb-7901-44cf-b149-10fcde2ab91d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "yellowbrick_doc_connection_string = (\n",
+    "    f\"postgres://{urlparse.quote(YBUSER)}:{YBPASSWORD}@{YBHOST}:5432/{YB_DOC_DATABASE}\"\n",
+    ")\n",
+    "\n",
+    "# Establish a connection to the Yellowbrick database\n",
+    "conn = psycopg2.connect(yellowbrick_doc_connection_string)\n",
+    "\n",
+    "# Create a cursor object\n",
+    "cursor = conn.cursor()\n",
+    "\n",
+    "# Query to select all documents from the table\n",
+    "query = f\"SELECT path, document FROM {YB_DOC_TABLE}\"\n",
+    "\n",
+    "# Execute the query\n",
+    "cursor.execute(query)\n",
+    "\n",
+    "# Fetch all documents\n",
+    "yellowbrick_documents = cursor.fetchall()\n",
+    "\n",
+    "print(f\"Extracted {len(yellowbrick_documents)} documents successfully!\")\n",
+    "\n",
+    "# Close the cursor and connection\n",
+    "cursor.close()\n",
+    "conn.close()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b62b4150-2aa3-453e-a4db-81a2f8a11e70",
+   "metadata": {},
+   "source": [
+    "## Part 4: Load the Yellowbrick vector store with documents\n",
+    "\n",
+    "Go through the documents, split them into digestible chunks, create the embeddings, and insert them into the Yellowbrick table. This takes around 5 minutes.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "de914b10-850e-4c5b-a09b-c6a14006637c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Split documents into chunks for conversion to embeddings\n",
+    "DOCUMENT_BASE_URL = \"https://docs.yellowbrick.com/6.7.1/\"  # Actual URL\n",
+    "\n",
+    "separator = \"\\n## \"  # This separator assumes the Markdown docs from the repo use '## ' as the logical main header most of the time\n",
+    "chunk_size_limit = 2000\n",
+    "max_chunk_overlap = 200\n",
+    "\n",
+    "documents = [\n",
+    "    Document(\n",
+    "        page_content=document[1],\n",
+    "        metadata={\"source\": DOCUMENT_BASE_URL + document[0].replace(\".md\", \".html\")},\n",
+    "    )\n",
+    "    for document in yellowbrick_documents\n",
+    "]\n",
+    "\n",
+    "text_splitter = RecursiveCharacterTextSplitter(\n",
+    "    chunk_size=chunk_size_limit,\n",
+    "    chunk_overlap=max_chunk_overlap,\n",
+    "    separators=[separator, \"\\n\\n\", \"\\n\", \",\", \" \", \"\"],\n",
+    ")\n",
+    "split_docs = text_splitter.split_documents(documents)\n",
+    "\n",
+    "embeddings = OpenAIEmbeddings()\n",
+    "vector_store = Yellowbrick.from_documents(\n",
+    "    documents=split_docs,\n",
+    "    embedding=embeddings,\n",
+    "    connection_string=yellowbrick_connection_string,\n",
+    "    table=embedding_table,\n",
+    ")\n",
+    "\n",
+    "print(f\"Created a vector store with {len(split_docs)} chunks from {len(documents)} documents\")"
+   ]
+  },
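+  {
+   "cell_type": "markdown",
+   "id": "b7c3e9f1",
+   "metadata": {},
+   "source": [
+    "As an optional sanity check, you can query the loaded vector store directly. The cell below is a minimal sketch: it assumes the `vector_store` object created above and uses LangChain's standard `similarity_search` method; the example question and the value of `k` are arbitrary.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d4a8c2e6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional check: run a similarity search directly against the vector store\n",
+    "# to confirm the embeddings were loaded and are searchable.\n",
+    "results = vector_store.similarity_search(\n",
+    "    \"How do I add users in bulk to Yellowbrick?\", k=3\n",
+    ")\n",
+    "for doc in results:\n",
+    "    print(doc.metadata[\"source\"])\n",
+    "    print(doc.page_content[:200])\n",
+    "    print()"
+   ]
+  },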
+  {
+   "cell_type": "markdown",
+   "id": "beee89f5-0f1e-4c6e-91a9-44c10762d466",
+   "metadata": {},
+   "source": [
+    "## Part 5: Creating a chatbot that uses Yellowbrick as the vector store\n",
+    "\n",
+    "Next, we add Yellowbrick as a vector store. The vector store has been populated with embeddings representing the Administration chapter of the Yellowbrick product documentation.\n",
+    "\n",
+    "We'll send the same queries as above to see the improved responses.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7daa9d4f-7804-4cfa-9873-415998d5e0f5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "system_template = \"\"\"Use the following pieces of context to answer the user's question.\n",
+    "Take note of the sources and include them in the answer in the format: \"SOURCES: source1 source2\", use \"SOURCES\" in capital letters regardless of the number of sources.\n",
+    "If you don't know the answer, just say \"I don't know\"; don't try to make up an answer.\n",
+    "----------------\n",
+    "{summaries}\"\"\"\n",
+    "messages = [\n",
+    "    SystemMessagePromptTemplate.from_template(system_template),\n",
+    "    HumanMessagePromptTemplate.from_template(\"{question}\"),\n",
+    "]\n",
+    "prompt = ChatPromptTemplate.from_messages(messages)\n",
+    "\n",
+    "vector_store = Yellowbrick(\n",
+    "    OpenAIEmbeddings(),\n",
+    "    yellowbrick_connection_string,\n",
+    "    embedding_table,  # Change the table name to reflect your embeddings\n",
+    ")\n",
+    "\n",
+    "chain_type_kwargs = {\"prompt\": prompt}\n",
+    "llm = ChatOpenAI(\n",
+    "    model_name=\"gpt-3.5-turbo\",  # Modify model_name if you have access to GPT-4\n",
+    "    temperature=0,\n",
+    "    max_tokens=256,\n",
+    ")\n",
+    "chain = RetrievalQAWithSourcesChain.from_chain_type(\n",
+    "    llm=llm,\n",
+    "    chain_type=\"stuff\",\n",
+    "    retriever=vector_store.as_retriever(search_kwargs={\"k\": 5}),\n",
+    "    return_source_documents=True,\n",
+    "    chain_type_kwargs=chain_type_kwargs,\n",
+    ")\n",
+    "\n",
+    "\n",
+    "def print_result_sources(query):\n",
+    "    result = chain(query)\n",
+    "    output_text = f\"\"\"### Question:\n",
+    "    {query}\n",
+    "    ### Answer:\n",
+    "    {result['answer']}\n",
+    "    ### Sources:\n",
+    "    {result['sources']}\n",
+    "    ### All relevant sources:\n",
+    "    {', '.join(list(set([doc.metadata['source'] for doc in result['source_documents']])))}\n",
+    "    \"\"\"\n",
+    "    display(Markdown(output_text))\n",
+    "\n",
+    "\n",
+    "# Use the chain to query\n",
+    "print_result_sources(\"How many databases can be in a Yellowbrick Instance?\")\n",
+    "\n",
+    "print_result_sources(\"What's an easy way to add users in bulk to Yellowbrick?\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "697c8a38",
+   "metadata": {},
+   "source": [
+    "## Next steps\n",
+    "\n",
+    "This code can be modified to ask different questions. You can also load your own documents into the vector store. The LangChain library is very flexible and can parse a large variety of file formats (including HTML, PDF, etc.).\n",
+    "\n",
+    "You can also modify this to use Hugging Face embedding models and Meta's Llama 2 LLM for a completely private chatbot experience; a minimal sketch of swapping in Hugging Face embeddings follows below."
+   ]
+  },
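+  {
+   "cell_type": "markdown",
+   "id": "e5f1a7c3",
+   "metadata": {},
+   "source": [
+    "The cell below is a minimal sketch of the Hugging Face option, not part of the tutorial flow above. It assumes the `sentence-transformers` package is installed and uses a hypothetical table name, `my_hf_embeddings`. Because a different embedding model produces vectors of a different dimensionality, create and use a separate table (as in Part 2) rather than reusing the OpenAI embeddings table. The ChatOpenAI LLM could similarly be swapped for a locally hosted model.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f6b2d8e4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sketch: swap the OpenAI embeddings for a local Hugging Face model.\n",
+    "# Requires: %pip install sentence-transformers\n",
+    "from langchain.embeddings import HuggingFaceEmbeddings\n",
+    "\n",
+    "hf_embeddings = HuggingFaceEmbeddings(\n",
+    "    model_name=\"sentence-transformers/all-MiniLM-L6-v2\"\n",
+    ")\n",
+    "\n",
+    "# Use a separate, pre-created table (see Part 2) because these vectors have a\n",
+    "# different dimensionality than the OpenAI embeddings stored earlier.\n",
+    "hf_vector_store = Yellowbrick.from_documents(\n",
+    "    documents=split_docs,\n",
+    "    embedding=hf_embeddings,\n",
+    "    connection_string=yellowbrick_connection_string,\n",
+    "    table=\"my_hf_embeddings\",  # hypothetical table name\n",
+    ")"
+   ]
+  }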
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "langchain_venv",
+   "language": "python",
+   "name": "langchain_venv"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}