{ "cells": [ { "cell_type": "markdown", "id": "63785634", "metadata": {}, "source": [ "# Power your products with ChatGPT and your own data\n", "\n", "This walkthrough shows how to build starter Q&A and chatbot applications using the ChatGPT API and your own data.\n", "\n", "It is laid out in these sections:\n", "- **Setup:** \n", " - Initialise variables and source the data\n", "- **Lay the foundations:**\n", " - Set up the vector database to accept vectors and data\n", " - Load the dataset, chunk the data for embedding and store it in the vector database\n", "- **Make it a product:**\n", " - Add a retrieval step where users provide queries and we return the most relevant entries\n", " - Summarise search results with GPT-3\n", " - Test out this basic Q&A app in Streamlit\n", "- **Build your moat:**\n", " - Create an Assistant class to manage context and interact with our bot\n", " - Use the Chatbot to answer questions using semantic search context\n", " - Test out this basic Chatbot app in Streamlit\n", " \n", "Upon completion, you will have the building blocks to create your own production chatbot or Q&A application using OpenAI APIs and a vector database.\n", "\n", "This notebook was originally presented with [these slides](https://drive.google.com/file/d/1dB-RQhZC_Q1iAsHkNNdkqtxxXqYODFYy/view?usp=share_link), which provide visual context for this journey." 
] }, { "cell_type": "code", "execution_count": 1, "id": "59f08ea7", "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "id": "13649895", "metadata": {}, "source": [ "## Setup\n", "\n", "First we'll set up our libraries and environment variables." ] }, { "cell_type": "code", "execution_count": 14, "id": "7590fbfc", "metadata": {}, "outputs": [], "source": [ "import openai\n", "import os\n", "import requests\n", "import numpy as np\n", "import pandas as pd\n", "from typing import Iterator\n", "import tiktoken\n", "import textract\n", "from numpy import array, average\n", "\n", "from database import get_redis_connection\n", "\n", "# Set our default models and chunking size\n", "from config import COMPLETIONS_MODEL, EMBEDDINGS_MODEL, CHAT_MODEL, TEXT_EMBEDDING_CHUNK_SIZE, VECTOR_FIELD_NAME\n", "\n", "# Ignore unclosed SSL socket warnings - optional in case you get these errors\n", "import warnings\n", "\n", "warnings.filterwarnings(action=\"ignore\", message=\"unclosed\", category=ResourceWarning)\n", "warnings.filterwarnings(\"ignore\", category=DeprecationWarning) " ] }, { "cell_type": "code", "execution_count": 3, "id": "760efc1e", "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_colwidth', 0)" ] }, { "cell_type": "code", "execution_count": 4, "id": "3f90817d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[\"FIA Practice Directions - Competitor's Staff Registration System.pdf\",\n", " 'fia_2022_formula_1_sporting_regulations_-_issue_9_-_2022-10-19_0.pdf',\n", " 'fia_2023_formula_1_technical_regulations_-_issue_4_-_2022-12-07.pdf',\n", " 'fia_f1_power_unit_financial_regulations_issue_1_-_2022-08-16.pdf',\n", " 'fia_formula_1_financial_regulations_iss.13.pdf']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_dir = os.path.join(os.curdir,'data')\n", "pdf_files = sorted([x for x in os.listdir(data_dir) if 'DS_Store' not in 
x])\n", "pdf_files" ] }, { "cell_type": "markdown", "id": "5dc4018c", "metadata": {}, "source": [ "## Laying the foundations" ] }, { "cell_type": "markdown", "id": "632b82ed", "metadata": {}, "source": [ "### Storage\n", "\n", "We're going to use Redis as our database for both document contents and the vector embeddings. You will need the full Redis Stack to enable use of RediSearch, the module that enables semantic search. More detail is in the [docs for Redis Stack](https://redis.io/docs/stack/get-started/install/docker/).\n", "\n", "To set this up locally, you will need to install Docker and then run the following command: ```docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest```.\n", "\n", "The code used here draws heavily on [this repo](https://github.com/RedisAI/vecsim-demo).\n", "\n", "After setting up the Docker instance of Redis Stack, you can follow the instructions below to initiate a Redis connection and create a Hierarchical Navigable Small World (HNSW) index for semantic search." ] }, { "cell_type": "code", "execution_count": 5, "id": "17d6b886", "metadata": {}, "outputs": [], "source": [ "# Setup Redis\n", "from redis import Redis\n", "from redis.commands.search.query import Query\n", "from redis.commands.search.field import (\n", " TextField,\n", " VectorField,\n", " NumericField\n", ")\n", "from redis.commands.search.indexDefinition import (\n", " IndexDefinition,\n", " IndexType\n", ")\n", "\n", "redis_client = get_redis_connection()" ] }, { "cell_type": "code", "execution_count": 6, "id": "4f3d3e6b", "metadata": {}, "outputs": [], "source": [ "# Constants\n", "VECTOR_DIM = 1536 #len(data['title_vector'][0]) # length of the vectors\n", "#VECTOR_NUMBER = len(data) # initial number of vectors\n", "PREFIX = \"sportsdoc\" # prefix for the document keys\n", "DISTANCE_METRIC = \"COSINE\" # distance metric for the vectors (ex. 
COSINE, IP, L2)" ] }, { "cell_type": "code", "execution_count": 10, "id": "d3c352ca", "metadata": {}, "outputs": [], "source": [ "# Create search index\n", "\n", "# Index\n", "INDEX_NAME = \"f1-index\" # name of the search index\n", "VECTOR_FIELD_NAME = 'content_vector'\n", "\n", "# Define RediSearch fields for each of the columns in the dataset\n", "# This is where you should add any additional metadata you want to capture\n", "filename = TextField(\"filename\")\n", "text_chunk = TextField(\"text_chunk\")\n", "file_chunk_index = NumericField(\"file_chunk_index\")\n", "\n", "# define RediSearch vector fields to use HNSW index\n", "\n", "text_embedding = VectorField(VECTOR_FIELD_NAME,\n", " \"HNSW\", {\n", " \"TYPE\": \"FLOAT32\",\n", " \"DIM\": VECTOR_DIM,\n", " \"DISTANCE_METRIC\": DISTANCE_METRIC\n", " }\n", ")\n", "# Add all our field objects to a list to be created as an index\n", "fields = [filename,text_chunk,file_chunk_index,text_embedding]" ] }, { "cell_type": "code", "execution_count": 8, "id": "a6c78b7e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "redis_client.ping()" ] }, { "cell_type": "code", "execution_count": 304, "id": "cf3ad41f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Not there yet. Creating\n" ] } ], "source": [ "# Optional step to drop the index if it already exists\n", "#redis_client.ft(INDEX_NAME).dropindex()\n", "\n", "# Check if index exists\n", "try:\n", " redis_client.ft(INDEX_NAME).info()\n", " print(\"Index already exists\")\n", "except Exception as e:\n", " print(e)\n", " # Create RediSearch Index\n", " print('Not there yet. 
Creating')\n", " redis_client.ft(INDEX_NAME).create_index(\n", " fields = fields,\n", " definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)\n", " )" ] }, { "cell_type": "markdown", "id": "f74ebeb5", "metadata": {}, "source": [ "### Ingestion\n", "\n", "We'll load up our PDFs and do the following\n", "- Initiate our tokenizer\n", "- Run a processing pipeline to:\n", " - Mine the text from each PDF\n", " - Split them into chunks and embed them\n", " - Store them in Redis" ] }, { "cell_type": "code", "execution_count": 11, "id": "ed23bf9d", "metadata": {}, "outputs": [], "source": [ "# The transformers.py file contains all of the transforming functions, including ones to chunk, embed and load data\n", "# For more details, check the file and work through each function individually\n", "from transformers import handle_file_string" ] }, { "cell_type": "code", "execution_count": 15, "id": "31f299f6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "./data/FIA Practice Directions - Competitor's Staff Registration System.pdf\n", "./data/fia_2022_formula_1_sporting_regulations_-_issue_9_-_2022-10-19_0.pdf\n", "./data/fia_2023_formula_1_technical_regulations_-_issue_4_-_2022-12-07.pdf\n", "./data/fia_f1_power_unit_financial_regulations_issue_1_-_2022-08-16.pdf\n", "./data/fia_formula_1_financial_regulations_iss.13.pdf\n", "CPU times: user 3.16 s, sys: 372 ms, total: 3.54 s\n", "Wall time: 45.8 s\n" ] } ], "source": [ "%%time\n", "# This step takes about 5 minutes\n", "\n", "# Initialise tokenizer\n", "tokenizer = tiktoken.get_encoding(\"cl100k_base\")\n", "\n", "# Process each PDF file and prepare for embedding\n", "for pdf_file in pdf_files:\n", " \n", " pdf_path = os.path.join(data_dir,pdf_file)\n", " print(pdf_path)\n", " \n", " # Extract the raw text from each PDF using textract\n", " text = textract.process(pdf_path, method='pdfminer')\n", " \n", " # Chunk each document, embed the contents and load to Redis\n", " 
handle_file_string((pdf_file,text.decode(\"utf-8\")),tokenizer,redis_client,VECTOR_FIELD_NAME,INDEX_NAME)" ] }, { "cell_type": "code", "execution_count": 16, "id": "22aff597", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'829'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check that our docs have been inserted\n", "redis_client.ft(INDEX_NAME).info()['num_docs']" ] }, { "cell_type": "markdown", "id": "6b12cb6e", "metadata": {}, "source": [ "## Make it a product\n", "\n", "Now we can test that our search works as intended by:\n", "- Querying our data in Redis using semantic search and verifying results\n", "- Adding a step to pass the results to GPT-3 for summarisation" ] }, { "cell_type": "code", "execution_count": 17, "id": "e921ac96", "metadata": {}, "outputs": [], "source": [ "from database import get_redis_results" ] }, { "cell_type": "code", "execution_count": 18, "id": "cb9dfacf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.85 ms, sys: 2.61 ms, total: 8.45 ms\n", "Wall time: 240 ms\n" ] }, { "data": { "text/html": [ "
\n", " | id | \n", "result | \n", "certainty | \n", "
---|---|---|---|
0 | \n", "0 | \n", "The IT will, therefore, be competent to establish the existence, or not, of a breach of the FIA regulations and to impose any sanction upon the person and Competitor concerned (see the process governed by the FIA Judicial and Disciplinary Rules). The President of the FIA, in its capacity as prosecuting authority, will ask, in respect of every disciplinary procedure: - - for the imposition of a suspension upon Competitor’s Staff Certificate of Registration holders who have contravened the FIA Code of Good Standing or the withdrawal of the Competitor’s Staff Certificate of Registration (any withdrawal can only be imposed for the remaining period of the current season of the FIA Formula One World Championship) and that these same people not be fined. The person and/or Competitor sanctioned may bring an appeal before the ICA against the IT’s decision. ********* The FIA will inform the relevant Competitor of any proceedings instigated against any member of its staff. It is the responsibility to the relevant Competitor to send the IT a written request to be heard, and if granted, it shall be permitted to submit written observations. The FIA undertakes to support before the IT and/or the ICA any request from the Competitor to intervene as a third party within the framework of a disciplinary procedure. The right to deprive any duly registered member of a Competitor’s staff of access to the Reserved Areas at events forming part of the FIA Formula One World Championship is subject to the procedure set forth in the FIA Judicial and Disciplinary Rules. The Stewards during the course of an Event or otherwise will have no authority to suspend or withdraw a Competitor’s Staff Certificate of Registration for any breach or alleged breach of the FIA Code of Good Standing. | \n", "0.205749571323 | \n", "
1 | \n", "1 | \n", "The following sets out examples of the type of behaviours which might constitute an infringement of the FIA Code of Good Standing (non-exhaustive list of examples) in relation to a person who is subject to the Code of Good Standing: - - - - giving instructions to a driver or other member of a Competitor’s staff with the intention or with the likely result of causing an accident, collision or crash or a race to be stopped or suspended any action which is likely to endanger or materially compromise the safety of any driver, other members of the Competitor’s staff, other participants in a race, Officials or any spectators or other members of the public who attend an event giving instructions to make any changes to a car in breach of any safety requirements or regulations giving instructions to tamper with or adversely affect the set-up or performance of the car of any other Competitor 4 / 5 \f", "FIA Legal Department Practice Directions - Competitor’s Staff Registration System 17 March 2011 - - giving instructions to a driver or otherwise taking any action by which the result or course of a race may be influenced or affected for the purpose of profiting or assisting someone to profit through betting on the outcome of a race or any part of a race or being convicted of a criminal offence (other than a driving offence) which carries a maximum prison sentence of five years. VII. AMENDMENTS TO THE COMPETITOR’S STAFF REGISTRATION SYSTEM The FIA will not make any amendments with regard to the Competitor’s Staff Registration System, either to the International Sporting Code or to the Practice Directions, prior consultation with the Competitors entered in the FIA Formula One World Championship and adequate opportunity to provide input on the proposed amendments. | \n", "0.206525266171 | \n", "