{
  "cells": [
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "v0to-QXCQjsm"
      },
      "source": [
        "# Retrieval Augmented Generative Question Answering with Pinecone\n",
        "\n",
        "#### Fixing LLMs that Hallucinate\n",
        "\n",
        "In this notebook we will learn how to query relevant contexts to our queries from Pinecone, and pass these to a generative OpenAI model to generate an answer backed by real data sources.\n",
        "\n",
        "A common problem with using GPT-3 to factually answer questions is that GPT-3 can sometimes make things up. The GPT models have a broad range of general knowledge, but this does not necessarily apply to more specific information. For that we use the Pinecone vector database as our _\"external knowledge base\"_ — like *long-term memory* for GPT-3.\n",
        "\n",
        "Required installs for this notebook are:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "VpMvHAYRQf9N",
        "outputId": "f2b1a704-1b38-4985-f5cf-be0479f2ac31"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\u001b[?25l     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/55.3 KB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m55.3/55.3 KB\u001b[0m \u001b[31m1.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25h  Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
            "  Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
            "  Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m170.6/170.6 KB\u001b[0m \u001b[31m13.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m452.9/452.9 KB\u001b[0m \u001b[31m30.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.3/58.3 KB\u001b[0m \u001b[31m6.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m213.0/213.0 KB\u001b[0m \u001b[31m17.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m132.0/132.0 KB\u001b[0m \u001b[31m13.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m182.4/182.4 KB\u001b[0m \u001b[31m18.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m140.6/140.6 KB\u001b[0m \u001b[31m6.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25h  Building wheel for openai (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n"
          ]
        }
      ],
      "source": [
        "!pip install -qU openai pinecone-client datasets"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "aEreHNxYkDbK"
      },
      "outputs": [],
      "source": [
        "import openai\n",
        "\n",
        "# get API key from top-right dropdown on OpenAI website\n",
        "openai.api_key = \"OPENAI_API_KEY\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "seS2VDFz0BCI"
      },
      "source": [
        "For many questions *state-of-the-art (SOTA)* LLMs are more than capable of answering correctly."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "9FEDn7LvkDYj",
        "outputId": "dea469a8-55ab-491f-f645-356e86d361ac"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'The 12th person on the moon was Harrison Schmitt, and he landed on December 11, 1972.'"
            ]
          },
          "execution_count": 3,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "query = \"who was the 12th person on the moon and when did they land?\"\n",
        "\n",
        "# now query text-davinci-003 WITHOUT context\n",
        "res = openai.Completion.create(\n",
        "    engine='text-davinci-003',\n",
        "    prompt=query,\n",
        "    temperature=0,\n",
        "    max_tokens=400,\n",
        "    top_p=1,\n",
        "    frequency_penalty=0,\n",
        "    presence_penalty=0,\n",
        "    stop=None\n",
        ")\n",
        "\n",
        "res['choices'][0]['text'].strip()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "pfHwX1qSldhY"
      },
      "source": [
        "However, that isn't always the case. First let's first rewrite the above into a simple function so we're not rewriting this every time."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "id": "SczFSfnjmNji"
      },
      "outputs": [],
      "source": [
        "def complete(prompt):\n",
        "    # query text-davinci-003\n",
        "    res = openai.Completion.create(\n",
        "        engine='text-davinci-003',\n",
        "        prompt=prompt,\n",
        "        temperature=0,\n",
        "        max_tokens=400,\n",
        "        top_p=1,\n",
        "        frequency_penalty=0,\n",
        "        presence_penalty=0,\n",
        "        stop=None\n",
        "    )\n",
        "    return res['choices'][0]['text'].strip()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "YC6csbA40UW3"
      },
      "source": [
        "Now let's ask a more specific question about training a type of transformer model called a *sentence transformer*. The ideal answer we'd be looking for is _\"Multiple Negatives Ranking (MNR) loss\"_.\n",
        "\n",
        "Don't worry if this is a new term to you, it isn't required to understand what we're doing or demoing here."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 89
        },
        "id": "H2fUC8BtxCt_",
        "outputId": "01beb42c-1f32-4e08-afc5-127e2dc5597a"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'If you only have pairs of related sentences, then the best training method to use for sentence transformers is the supervised learning approach. This approach involves providing the model with labeled data, such as pairs of related sentences, and then training the model to learn the relationships between the sentences. This approach is often used for tasks such as natural language inference, semantic similarity, and paraphrase identification.'"
            ]
          },
          "execution_count": 5,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "query = (\n",
        "    \"Which training method should I use for sentence transformers when \" +\n",
        "    \"I only have pairs of related sentences?\"\n",
        ")\n",
        "\n",
        "complete(query)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "k7ut_DNBwIk1"
      },
      "source": [
        "One of the common answers we get to this is:\n",
        "\n",
        "```\n",
        "The best training method to use for fine-tuning a pre-trained model with sentence transformers is the Masked Language Model (MLM) training. MLM training involves randomly masking some of the words in a sentence and then training the model to predict the masked words. This helps the model to learn the context of the sentence and better understand the relationships between words.\n",
        "```\n",
        "\n",
        "This answer seems pretty convincing right? Yet, it's wrong. MLM is typically used in the pretraining step of a transformer model but *\"cannot\"* be used to fine-tune a sentence-transformer, and has nothing to do with having _\"pairs of related sentences\"_.\n",
        "\n",
        "An alternative answer we receive (and the one we returned above) is about `supervised learning approach` being the most suitable. This is completely true, but it's not specific and doesn't answer the question.\n",
        "\n",
        "We have two options for enabling our LLM in understanding and correctly answering this question:\n",
        "\n",
        "1. We fine-tune the LLM on text data covering the topic mentioned, likely on articles and papers talking about sentence transformers, semantic search training methods, etc.\n",
        "\n",
        "2. We use **R**etrieval **A**ugmented **G**eneration (RAG), a technique that implements an information retrieval component to the generation process. Allowing us to retrieve relevant information and feed this information into the generation model as a *secondary* source of information.\n",
        "\n",
        "We will demonstrate option **2**."
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "NhWnLkHqmeWI"
      },
      "source": [
        "---\n",
        "\n",
        "## Building a Knowledge Base\n",
        "\n",
        "With option **2** the retrieval of relevant information requires an external _\"Knowledge Base\"_, a place where we can store and use to efficiently retrieve information. We can think of this as the external _long-term memory_ of our LLM.\n",
        "\n",
        "We will need to retrieve information that is semantically related to our queries, to do this we need to use _\"dense vector embeddings\"_. These can be thought of as numerical representations of the *meaning* behind our sentences.\n",
        "\n",
        "To create these dense vectors we use the `text-embedding-ada-002` model.\n",
        "\n",
        "We have already authenticated our OpenAI connection, to create an embedding we just do:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "id": "EI2iYxq16or9"
      },
      "outputs": [],
      "source": [
        "embed_model = \"text-embedding-ada-002\"\n",
        "\n",
        "res = openai.Embedding.create(\n",
        "    input=[\n",
        "        \"Sample document text goes here\",\n",
        "        \"there will be several phrases in each batch\"\n",
        "    ], engine=embed_model\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZnHpGP5R60Fv"
      },
      "source": [
        "In the response `res` we will find a JSON-like object containing our new embeddings within the `'data'` field."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "57smZFmz61tj",
        "outputId": "30745411-1f44-4abb-ac36-20abcfdbb343"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "dict_keys(['object', 'data', 'model', 'usage'])"
            ]
          },
          "execution_count": 7,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "res.keys()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MwSk-wiK62KO"
      },
      "source": [
        "Inside `'data'` we will find two records, one for each of the two sentences we just embedded. Each vector embedding contains `1536` dimensions (the output dimensionality of the `text-embedding-ada-002` model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "36D4ipOR63AW",
        "outputId": "10a3d6ba-a646-4ebd-d74f-90868d04a6f6"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "2"
            ]
          },
          "execution_count": 8,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "len(res['data'])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "dPyGLhDX62t4",
        "outputId": "f5d38bb2-f863-4d39-c8f6-d75579634ec9"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "(1536, 1536)"
            ]
          },
          "execution_count": 9,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "len(res['data'][0]['embedding']), len(res['data'][1]['embedding'])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "byxj1rgy68k1"
      },
      "source": [
        "We will apply this same embedding logic to a dataset containing information relevant to our query (and many other queries on the topics of ML and AI).\n",
        "\n",
        "### Data Preparation\n",
        "\n",
        "The dataset we will be using is the `jamescalam/youtube-transcriptions` from Hugging Face _Datasets_. It contains transcribed audio from several ML and tech YouTube channels. We download it with:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "t7uzxkGz73Ov",
        "outputId": "995123eb-8f78-44b0-b325-e0ce2284b168"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "Dataset({\n",
              "    features: ['title', 'published', 'url', 'video_id', 'channel_id', 'id', 'text', 'start', 'end'],\n",
              "    num_rows: 208619\n",
              "})"
            ]
          },
          "execution_count": 2,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from datasets import load_dataset\n",
        "\n",
        "data = load_dataset('jamescalam/youtube-transcriptions', split='train')\n",
        "data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'title': 'Training and Testing an Italian BERT - Transformers From Scratch #4',\n",
              " 'published': '2021-07-06 13:00:03 UTC',\n",
              " 'url': 'https://youtu.be/35Pdoyi6ZoQ',\n",
              " 'video_id': '35Pdoyi6ZoQ',\n",
              " 'channel_id': 'UCv83tO5cePwHMt1952IVVHw',\n",
              " 'id': '35Pdoyi6ZoQ-t0.0',\n",
              " 'text': 'Hi, welcome to the video.',\n",
              " 'start': 0.0,\n",
              " 'end': 9.36}"
            ]
          },
          "execution_count": 3,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "data[0]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TWWdWjA273qF"
      },
      "source": [
        "The dataset contains many small snippets of text data. We will need to merge many snippets from each video to create more substantial chunks of text that contain more information."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 49,
          "referenced_widgets": [
            "8b7d062ee1c14bf6b0c55da89ff4b551",
            "8f03d894148346bb90897fb39d6ec686",
            "a115589785f34bc38e1730e8b497eef6",
            "f205f29abe8d47f6b28627684f947bcd",
            "d5aeb124d44744d2aaaa7d5b213caca7",
            "2af9f8bae68d406d8cd4f56acf3db9e4",
            "88f0b5625c9a4ce89d8a30fdf28efd90",
            "a21a6992c8a744d49826ab0f56b867ed",
            "6aa795a589714b058783f5f3eb5983e1",
            "1d70ba5c815a4473939665061e52ae6e",
            "fbd86b292a484498a61acf0ea7f5e814"
          ]
        },
        "id": "uG9ZTI0o-9cJ",
        "outputId": "30b65907-eea0-4de0-c457-d69531e388c3"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "8b7d062ee1c14bf6b0c55da89ff4b551",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "  0%|          | 0/52155 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "from tqdm.auto import tqdm\n",
        "\n",
        "new_data = []\n",
        "\n",
        "window = 20  # number of sentences to combine\n",
        "stride = 4  # number of sentences to 'stride' over, used to create overlap\n",
        "\n",
        "for i in tqdm(range(0, len(data), stride)):\n",
        "    i_end = min(len(data)-1, i+window)\n",
        "    if data[i]['title'] != data[i_end]['title']:\n",
        "        # in this case we skip this entry as we have start/end of two videos\n",
        "        continue\n",
        "    text = ' '.join(data[i:i_end]['text'])\n",
        "    # create the new merged dataset\n",
        "    new_data.append({\n",
        "        'start': data[i]['start'],\n",
        "        'end': data[i_end]['end'],\n",
        "        'title': data[i]['title'],\n",
        "        'text': text,\n",
        "        'id': data[i]['id'],\n",
        "        'url': data[i]['url'],\n",
        "        'published': data[i]['published'],\n",
        "        'channel_id': data[i]['channel_id']\n",
        "    })"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "wN0BuMWSnqId",
        "outputId": "2b733986-c26b-487b-a4cb-336602a1a3dc"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'start': 0.0,\n",
              " 'end': 74.12,\n",
              " 'title': 'Training and Testing an Italian BERT - Transformers From Scratch #4',\n",
              " 'text': \"Hi, welcome to the video. So this is the fourth video in a Transformers from Scratch mini series. So if you haven't been following along, we've essentially covered what you can see on the screen. So we got some data. We built a tokenizer with it. And then we've set up our input pipeline ready to begin actually training our model, which is what we're going to cover in this video. So let's move over to the code. And we see here that we have essentially everything we've done so far. So we've built our input data, our input pipeline. And we're now at a point where we have a data loader, PyTorch data loader, ready. And we can begin training a model with it. So there are a few things to be aware of. So I mean, first, let's just have a quick look at the structure of our data.\",\n",
              " 'id': '35Pdoyi6ZoQ-t0.0',\n",
              " 'url': 'https://youtu.be/35Pdoyi6ZoQ',\n",
              " 'published': '2021-07-06 13:00:03 UTC',\n",
              " 'channel_id': 'UCv83tO5cePwHMt1952IVVHw'}"
            ]
          },
          "execution_count": 12,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "new_data[0]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VMyJjt1cnwcH"
      },
      "source": [
        "Now we need a place to store these embeddings and enable a efficient _vector search_ through them all. To do that we use **`Pinecone`**, we can get a [free API key](https://app.pinecone.io) and enter it below where we will initialize our connection to `Pinecone` and create a new index."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "UPNwQTH0RNcl",
        "outputId": "c5d22baf-0e69-4039-fda0-624ce22cd740"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'dimension': 1536,\n",
              " 'index_fullness': 0.0,\n",
              " 'namespaces': {},\n",
              " 'total_vector_count': 0}"
            ]
          },
          "execution_count": 13,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "import pinecone\n",
        "\n",
        "index_name = 'openai-youtube-transcriptions'\n",
        "\n",
        "# initialize connection to pinecone (get API key at app.pinecone.io)\n",
        "pinecone.init(\n",
        "    api_key=\"PINECONE_API_KEY\",\n",
        "    environment=\"us-east1-gcp\"  # may be different, check at app.pinecone.io\n",
        ")\n",
        "\n",
        "# check if index already exists (it shouldn't if this is first time)\n",
        "if index_name not in pinecone.list_indexes():\n",
        "    # if does not exist, create index\n",
        "    pinecone.create_index(\n",
        "        index_name,\n",
        "        dimension=len(res['data'][0]['embedding']),\n",
        "        metric='cosine',\n",
        "        metadata_config={'indexed': ['channel_id', 'published']}\n",
        "    )\n",
        "# connect to index\n",
        "index = pinecone.Index(index_name)\n",
        "# view index stats\n",
        "index.describe_index_stats()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nELBmqxxzeqL"
      },
      "source": [
        "We can see the index is currently empty with a `total_vector_count` of `0`. We can begin populating it with OpenAI `text-embedding-ada-002` built embeddings like so:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 14,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 49,
          "referenced_widgets": [
            "bb392ce2d1e047daa1747c4a0f5e89b7",
            "c298c3dc46ed4f2e85e34a9972b3faf4",
            "b335ce0994e045df8a886ca32e3ebb76",
            "b3f2dde1b97c4989b0e6e5ea3365a270",
            "1cbbf96f9b7f46c29ebbd696e5777e82",
            "3172ad39260d41aea64fba5df2c13961",
            "7c20e179ec504d4caed56e17e3f53e02",
            "1a7b9a94c88a4496a24d4e57d4801047",
            "2b2409a6c2024d57b2b6ddb0e76c7068",
            "d26a5434beda435aa5d62c5c11de2bb8",
            "c56384bfac4c4f5596787f04fd76b86c"
          ]
        },
        "id": "vPb9liovzrc8",
        "outputId": "bb69dbce-c140-49af-840f-2c03dd940e2a"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "bb392ce2d1e047daa1747c4a0f5e89b7",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "  0%|          | 0/487 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "from tqdm.auto import tqdm\n",
        "from time import sleep\n",
        "\n",
        "batch_size = 100  # how many embeddings we create and insert at once\n",
        "\n",
        "for i in tqdm(range(0, len(new_data), batch_size)):\n",
        "    # find end of batch\n",
        "    i_end = min(len(new_data), i+batch_size)\n",
        "    meta_batch = new_data[i:i_end]\n",
        "    # get ids\n",
        "    ids_batch = [x['id'] for x in meta_batch]\n",
        "    # get texts to encode\n",
        "    texts = [x['text'] for x in meta_batch]\n",
        "    # create embeddings (try-except added to avoid RateLimitError)\n",
        "    try:\n",
        "        res = openai.Embedding.create(input=texts, engine=embed_model)\n",
        "    except:\n",
        "        done = False\n",
        "        while not done:\n",
        "            sleep(5)\n",
        "            try:\n",
        "                res = openai.Embedding.create(input=texts, engine=embed_model)\n",
        "                done = True\n",
        "            except:\n",
        "                pass\n",
        "    embeds = [record['embedding'] for record in res['data']]\n",
        "    # cleanup metadata\n",
        "    meta_batch = [{\n",
        "        'start': x['start'],\n",
        "        'end': x['end'],\n",
        "        'title': x['title'],\n",
        "        'text': x['text'],\n",
        "        'url': x['url'],\n",
        "        'published': x['published'],\n",
        "        'channel_id': x['channel_id']\n",
        "    } for x in meta_batch]\n",
        "    to_upsert = list(zip(ids_batch, embeds, meta_batch))\n",
        "    # upsert to Pinecone\n",
        "    index.upsert(vectors=to_upsert)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2yiF91IbyGYo"
      },
      "source": [
        "Now we search, for this we need to create a _query vector_ `xq`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 15,
      "metadata": {
        "id": "LF1U_yZGojRJ"
      },
      "outputs": [],
      "source": [
        "res = openai.Embedding.create(\n",
        "    input=[query],\n",
        "    engine=embed_model\n",
        ")\n",
        "\n",
        "# retrieve from Pinecone\n",
        "xq = res['data'][0]['embedding']\n",
        "\n",
        "# get relevant contexts (including the questions)\n",
        "res = index.query(xq, top_k=2, include_metadata=True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "GH_DkmsNomww",
        "outputId": "fc84b83b-164d-45a7-9e80-9b11519bf25b"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'matches': [{'id': 'pNvujJ1XyeQ-t418.88',\n",
              "              'metadata': {'channel_id': 'UCv83tO5cePwHMt1952IVVHw',\n",
              "                           'end': 568.4,\n",
              "                           'published': datetime.date(2021, 11, 24),\n",
              "                           'start': 418.88,\n",
              "                           'text': 'pairs of related sentences you can go '\n",
              "                                   'ahead and actually try training or '\n",
              "                                   'fine-tuning using NLI with multiple '\n",
              "                                   \"negative ranking loss. If you don't have \"\n",
              "                                   'that fine. Another option is that you have '\n",
              "                                   'a semantic textual similarity data set or '\n",
              "                                   'STS and what this is is you have so you '\n",
              "                                   'have sentence A here, sentence B here and '\n",
              "                                   'then you have a score from from 0 to 1 '\n",
              "                                   'that tells you the similarity between '\n",
              "                                   'those two scores and you would train this '\n",
              "                                   'using something like cosine similarity '\n",
              "                                   \"loss. Now if that's not an option and your \"\n",
              "                                   'focus or use case is on building a '\n",
              "                                   'sentence transformer for another language '\n",
              "                                   'where there is no current sentence '\n",
              "                                   'transformer you can use multilingual '\n",
              "                                   'parallel data. So what I mean by that is '\n",
              "                                   'so parallel data just means translation '\n",
              "                                   'pairs so if you have for example a English '\n",
              "                                   'sentence and then you have another '\n",
              "                                   'language here so it can it can be anything '\n",
              "                                   \"I'm just going to put XX and that XX is \"\n",
              "                                   'your target language you can fine-tune a '\n",
              "                                   'model using something called multilingual '\n",
              "                                   'knowledge distillation and what that does '\n",
              "                                   'is takes a monolingual model for example '\n",
              "                                   'in English and using those translation '\n",
              "                                   'pairs it distills the knowledge the '\n",
              "                                   'semantic similarity knowledge from that '\n",
              "                                   'monolingual English model into a '\n",
              "                                   'multilingual model which can handle both '\n",
              "                                   'English and your target language. So '\n",
              "                                   \"they're three options quite popular very \"\n",
              "                                   'common that you can go for and as a '\n",
              "                                   'supervised methods the chances are that '\n",
              "                                   'probably going to outperform anything you '\n",
              "                                   'do with unsupervised training at least for '\n",
              "                                   'now. So if none of those sound like '\n",
              "                                   'something',\n",
              "                           'title': 'Today Unsupervised Sentence Transformers, '\n",
              "                                    'Tomorrow Skynet (how TSDAE works)',\n",
              "                           'url': 'https://youtu.be/pNvujJ1XyeQ'},\n",
              "              'score': 0.865277052,\n",
              "              'sparseValues': {},\n",
              "              'values': []},\n",
              "             {'id': 'WS1uVMGhlWQ-t737.28',\n",
              "              'metadata': {'channel_id': 'UCv83tO5cePwHMt1952IVVHw',\n",
              "                           'end': 900.72,\n",
              "                           'published': datetime.date(2021, 10, 20),\n",
              "                           'start': 737.28,\n",
              "                           'text': \"were actually more accurate. So we can't \"\n",
              "                                   \"really do that. We can't use this what is \"\n",
              "                                   'called a mean pooling approach. Or we '\n",
              "                                   \"can't use it in its current form. Now the \"\n",
              "                                   'solution to this problem was introduced by '\n",
              "                                   'two people in 2019 Nils Reimers and Irenia '\n",
              "                                   'Gurevich. They introduced what is the '\n",
              "                                   'first sentence transformer or sentence '\n",
              "                                   'BERT. And it was found that sentence BERT '\n",
              "                                   'or S BERT outformed all of the previous '\n",
              "                                   'Save the Art models on pretty much all '\n",
              "                                   'benchmarks. Not all of them but most of '\n",
              "                                   'them. And it did it in a very quick time. '\n",
              "                                   'So if we compare it to BERT, if we wanted '\n",
              "                                   'to find the most similar sentence pair '\n",
              "                                   'from 10,000 sentences in that 2019 paper '\n",
              "                                   'they found that with BERT that took 65 '\n",
              "                                   'hours. With S BERT embeddings they could '\n",
              "                                   'create all the embeddings in just around '\n",
              "                                   'five seconds. And then they could compare '\n",
              "                                   'all those with cosine similarity in 0.01 '\n",
              "                                   \"seconds. So it's a lot faster. We go from \"\n",
              "                                   '65 hours to just over five seconds which '\n",
              "                                   'is I think pretty incredible. Now I think '\n",
              "                                   \"that's pretty much all the context we need \"\n",
              "                                   'behind sentence transformers. And what we '\n",
              "                                   'do now is dive into a little bit of how '\n",
              "                                   'they actually work. Now we said before we '\n",
              "                                   'have the core transform models and what S '\n",
              "                                   'BERT does is fine tunes on sentence pairs '\n",
              "                                   'using what is called a Siamese '\n",
              "                                   'architecture or Siamese network. What we '\n",
              "                                   'mean by a Siamese network is that we have '\n",
              "                                   'what we can see, what can view as two BERT '\n",
              "                                   'models that are identical and the weights '\n",
              "                                   'between those two models are tied. Now in '\n",
              "                                   'reality when implementing this we just use '\n",
              "                                   'a single BERT model. And what we do is we '\n",
              "                                   'process one sentence, a sentence A through '\n",
              "                                   'the model and then we process another '\n",
              "                                   'sentence, sentence B through the model. '\n",
              "                                   \"And that's the sentence pair. So with our \"\n",
              "                                   'cross-linked we were processing the '\n",
              "                                   'sentence pair together. We were putting '\n",
              "                                   'them both together, processing them all at '\n",
              "                                   'once. This time we process them '\n",
              "                                   'separately. And during training what '\n",
              "                                   'happens is the weights',\n",
              "                           'title': 'Intro to Sentence Embeddings with '\n",
              "                                    'Transformers',\n",
              "                           'url': 'https://youtu.be/WS1uVMGhlWQ'},\n",
              "              'score': 0.85855335,\n",
              "              'sparseValues': {},\n",
              "              'values': []}],\n",
              " 'namespace': ''}"
            ]
          },
          "execution_count": 16,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "res"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 30,
      "metadata": {
        "id": "92NmGGJ1TKQp"
      },
      "outputs": [],
      "source": [
        "limit = 3750  # max characters of retrieved context to place in the prompt\n",
        "\n",
        "def retrieve(query):\n",
        "    # create the query embedding\n",
        "    res = openai.Embedding.create(\n",
        "        input=[query],\n",
        "        engine=embed_model\n",
        "    )\n",
        "    xq = res['data'][0]['embedding']\n",
        "\n",
        "    # retrieve the most relevant contexts from Pinecone\n",
        "    res = index.query(xq, top_k=3, include_metadata=True)\n",
        "    contexts = [\n",
        "        x['metadata']['text'] for x in res['matches']\n",
        "    ]\n",
        "\n",
        "    # build our prompt with the retrieved contexts included\n",
        "    prompt_start = (\n",
        "        \"Answer the question based on the context below.\\n\\n\"+\n",
        "        \"Context:\\n\"\n",
        "    )\n",
        "    prompt_end = (\n",
        "        f\"\\n\\nQuestion: {query}\\nAnswer:\"\n",
        "    )\n",
        "    # append contexts until hitting limit\n",
        "    separator = \"\\n\\n---\\n\\n\"\n",
        "    selected = []\n",
        "    for context in contexts:\n",
        "        # stop before the joined contexts would reach the character limit\n",
        "        if len(separator.join(selected + [context])) >= limit:\n",
        "            break\n",
        "        selected.append(context)\n",
        "    prompt = prompt_start + separator.join(selected) + prompt_end\n",
        "    return prompt"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 31,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 142
        },
        "id": "LwsZuxiTvU2d",
        "outputId": "7e3acf8b-7356-41bc-8c9e-5405a21153e0"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "\"Answer the question based on the context below.\\n\\nContext:\\npairs of related sentences you can go ahead and actually try training or fine-tuning using NLI with multiple negative ranking loss. If you don't have that fine. Another option is that you have a semantic textual similarity data set or STS and what this is is you have so you have sentence A here, sentence B here and then you have a score from from 0 to 1 that tells you the similarity between those two scores and you would train this using something like cosine similarity loss. Now if that's not an option and your focus or use case is on building a sentence transformer for another language where there is no current sentence transformer you can use multilingual parallel data. So what I mean by that is so parallel data just means translation pairs so if you have for example a English sentence and then you have another language here so it can it can be anything I'm just going to put XX and that XX is your target language you can fine-tune a model using something called multilingual knowledge distillation and what that does is takes a monolingual model for example in English and using those translation pairs it distills the knowledge the semantic similarity knowledge from that monolingual English model into a multilingual model which can handle both English and your target language. So they're three options quite popular very common that you can go for and as a supervised methods the chances are that probably going to outperform anything you do with unsupervised training at least for now. So if none of those sound like something\\n\\n---\\n\\nwere actually more accurate. So we can't really do that. We can't use this what is called a mean pooling approach. Or we can't use it in its current form. Now the solution to this problem was introduced by two people in 2019 Nils Reimers and Irenia Gurevich. They introduced what is the first sentence transformer or sentence BERT. 
And it was found that sentence BERT or S BERT outformed all of the previous Save the Art models on pretty much all benchmarks. Not all of them but most of them. And it did it in a very quick time. So if we compare it to BERT, if we wanted to find the most similar sentence pair from 10,000 sentences in that 2019 paper they found that with BERT that took 65 hours. With S BERT embeddings they could create all the embeddings in just around five seconds. And then they could compare all those with cosine similarity in 0.01 seconds. So it's a lot faster. We go from 65 hours to just over five seconds which is I think pretty incredible. Now I think that's pretty much all the context we need behind sentence transformers. And what we do now is dive into a little bit of how they actually work. Now we said before we have the core transform models and what S BERT does is fine tunes on sentence pairs using what is called a Siamese architecture or Siamese network. What we mean by a Siamese network is that we have what we can see, what can view as two BERT models that are identical and the weights between those two models are tied. Now in reality when implementing this we just use a single BERT model. And what we do is we process one sentence, a sentence A through the model and then we process another sentence, sentence B through the model. And that's the sentence pair. So with our cross-linked we were processing the sentence pair together. We were putting them both together, processing them all at once. This time we process them separately. And during training what happens is the weights\\n\\n---\\n\\nTransformer-based Sequential Denoising Autoencoder. So what we'll do is jump straight into it and take a look at where we might want to use this training approach and and how we can actually implement it. So the first question we need to ask is do we really need to resort to unsupervised training? 
Now what we're going to do here is just have a look at a few of the most popular training approaches and what sort of data we need for that. So the first one we're looking at here is Natural Language Inferen
            ]
          },
          "execution_count": 31,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# first we retrieve relevant items from Pinecone\n",
        "query_with_contexts = retrieve(query)\n",
        "query_with_contexts"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 32,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "id": "ioDVGF7lkDQL",
        "outputId": "88bbbd48-89b1-4485-f511-cc5014bf3a5b"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'You should use Natural Language Inference (NLI) with multiple negative ranking loss.'"
            ]
          },
          "execution_count": 32,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# then we complete the context-infused query\n",
        "complete(query_with_contexts)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OPO36aN8QoPZ"
      },
      "source": [
        "And we get a pretty great answer straight away: the model recommends training on NLI data with _multiple negatives ranking loss_ (MNR loss), one of the supervised options described in the retrieved contexts."
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "ml",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.12 (main, Apr  5 2022, 01:52:34) \n[Clang 12.0.0 ]"
    },
    "vscode": {
      "interpreter": {
        "hash": "b8e7999f96e1b425e2d542f21b571f5a4be3e97158b0b46ea1b2500df63956ce"
      }
    },
    "widgets": {}
  },
  "nbformat": 4,
  "nbformat_minor": 0
}