# Enterprise Knowledge Retrieval

This notebook contains an end-to-end workflow to set up an Enterprise Knowledge Retrieval solution from scratch.

### Problem Statement

LLMs have great conversational ability, but their knowledge is general and often out of date. Relevant knowledge often exists, but is kept in disparate datastores that are hard to surface with current search solutions.


### Objective

We want to deliver an outstanding user experience where the user is presented with the right knowledge when they need it, in a clear and conversational way. To accomplish this we need an LLM-powered solution that knows our organizational context and data, and that can retrieve the right knowledge when the user needs it.


## Solution

![title](img/enterprise_knowledge_retrieval.png)

We'll build a knowledge retrieval solution that will embed a corpus of knowledge (in our case a database of Wikipedia articles) and use it to answer user questions.

### Learning Path

#### Walkthrough

You can follow this solution either through the video recorded here or through the text walkthrough below. We'll build out the solution in the following stages:
- **Setup:** Initiate variables and connect to a vector database.
- **Storage:** Configure the database, prepare our data and store embeddings and metadata for retrieval.
- **Search:** Extract relevant documents back out with a basic search function and use an LLM to summarise results into a concise reply.
- **Answer:** Add a more sophisticated agent which will process the user's query and maintain a memory for follow-up questions.
- **Evaluate:** Take a sample of evaluated question/answer pairs using our service and plot them to scope out remedial action.

## Walkthrough

In [1]:
%load_ext autoreload
%autoreload 2

## Setup

Import libraries and set up a connection to a Redis vector database for our knowledge base.

You can swap Redis for any other vectorstore or database - there is a [selection](https://python.langchain.com/en/latest/modules/indexes/vectorstores.html) that is supported by Langchain natively, while connectors for other databases will need to be developed yourself.

In [2]:
!pip install redis
!pip install openai
!pip install tiktoken
!pip install wget



In [3]:
from ast import literal_eval
import concurrent.futures
import openai
import os
import numpy as np
from numpy import array, average
import pandas as pd
from tenacity import retry, wait_random_exponential, stop_after_attempt
import tiktoken
from tqdm import tqdm
from typing import List, Iterator
import wget

# Redis imports
from redis import Redis as r
from redis.commands.search.query import Query
from redis.commands.search.field import (
    TextField,
    VectorField,
    NumericField
)
from redis.commands.search.indexDefinition import (
    IndexDefinition,
    IndexType
)

# Langchain imports
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA

CHAT_MODEL = "gpt-3.5-turbo"

In [4]:
pd.set_option('display.max_colwidth', 0)

In [5]:
embeddings_url = 'https://cdn.openai.com/API/examples/data/wikipedia_articles_2000.csv'

# The file is ~4.5 MB so this may take a moment
wget.download(embeddings_url)



'wikipedia_articles_2000 (2).csv'

In [6]:
article_df = pd.read_csv('./wikipedia_articles_2000.csv')
article_df.head()

Unnamed: 0.1,Unnamed: 0,id,url,title,text
0,878,3661,https://simple.wikipedia.org/wiki/Photon,Photon,"Photons (from Greek φως, meaning light), in many atomic models in physics, are particles which transmit light. In other words, light is carried over space by photons. Photon is an elementary particle that is its own antiparticle. In quantum mechanics each photon has a characteristic quantum of energy that depends on frequency: A photon associated with light at a higher frequency will have more energy (and be associated with light at a shorter wavelength).\n\nPhotons have a rest mass of 0 (zero). However, Einstein's theory of relativity says that they do have a certain amount of momentum. Before the photon got its name, Einstein revived the proposal that light is separate pieces of energy (particles). These particles came to be known as photons. \n\nA photon is usually given the symbol γ (gamma),\n\nProperties \n\nPhotons are fundamental particles. Although they can be created and destroyed, their lifetime is infinite.\n\nIn a vacuum, all photons move at the speed of light, c, which is equal to 299,792,458 meters (approximately 300,000 kilometers) per second.\n\nA photon has a given frequency, which determines its color. Radio technology makes great use of frequency. Beyond the visible range, frequency is less discussed, for example it is little used in distinguishing between X-Ray photons and infrared. Frequency is equivalent to the quantum energy of the photon, as related by the Planck constant equation,\n\n,\n\nwhere is the photon's energy, is the Plank constant, and is the frequency of the light associated with the photon. This frequency, , is typically measured in cycles per second, or equivalently, in Hz. The quantum energy of different photons is often used in cameras, and other machines that use visible and higher than visible radiation. This because these photons are energetic enough to ionize atoms. \n\nAnother property of a photon is its wavelength. 
The frequency , wavelength , and speed of light are related by the equation,\n\n,\n\nwhere (lambda) is the wavelength, or length of the wave (typically measured in meters.)\n\nAnother important property of a photon is its polarity. If you saw a giant photon coming straight at you, it could appear as a swath whipping vertically, horizontally, or somewhere in between. Polarized sunglasses stop photons swinging up and down from passing. This is how they reduce glare as light bouncing off of surfaces tend to fly that way. Liquid crystal displays also use polarity to control which light passes through. Some animals can see light polarization. \n\nFinally, a photon has a property called spin. Spin is related to light's circular polarization.\n\nPhoton interactions with matter\nLight is often created or absorbed when an electron gains or loses energy. This energy can be in the form of heat, kinetic energy, or other form. For example, an incandescent light bulb uses heat. The increase of energy can push an electron up one level in a shell called a ""valence"". This makes it unstable, and like everything, it wants to be in the lowest energy state. (If being in the lowest energy state is confusing, pick up a pencil and drop it. Once on the ground, the pencil will be in a lower energy state). When the electron drops back down to a lower energy state, it needs to release the energy that hit it, and it must obey the conservation of energy (energy can neither be created nor destroyed). Electrons release this energy as photons, and at higher intensities, this photon can be seen as visible light.\n\nPhotons and the electromagnetic force\nIn particle physics, photons are responsible for electromagnetic force. Electromagnetism is an idea that combines electricity with magnetism. One common way that we experience electromagnetism in our daily lives is light, which is caused by electromagnetism. 
Electromagnetism is also responsible for charge, which is the reason that you can not push your hand through a table. Since photons are the force-carrying particle of electromagnetism, they are also gauge bosons. Some matter–called dark matter–is not believed to be affected by electromagnetism. This would mean that dark matter does not have a charge, and does not give off light.\n\nRelated pages\n Particle physics\n\nBasic physics ideas\nElectromagnetism\nLight\nElementary particles"
1,2425,7796,https://simple.wikipedia.org/wiki/Thomas%20Dolby,Thomas Dolby,"Thomas Dolby (born Thomas Morgan Robertson; 14 October 1958) is a British musican and computer designer. He is probably most famous for his 1982 hit, ""She Blinded me with Science"".\n\nHe married actress Kathleen Beller in 1988. The couple have three children together.\n\nDiscography\n\nSingles\n\nA Track did not chart in North America until 1983, after the success of ""She Blinded Me With Science"".\n\nAlbums\n\nStudio albums\n\nEPs\n\nReferences\n\nEnglish musicians\nLiving people\n1958 births\nNew wave musicians\nWarner Bros. Records artists"
2,18059,67912,https://simple.wikipedia.org/wiki/Embroidery,Embroidery,"Embroidery is the art of decorating fabric or other materials with designs stitched in strands of thread or yarn using a needle. Embroidery may also incorporate other materials such as metal strips, pearls, beads, quills, and sequins. Sewing machines can be used to create machine embroidery.\n\nQualifications \nCity and Guilds qualification in Embroidery allows embroiderers to become recognized for their skill. This qualification also gives them the credibility to teach. For example, the notable textiles artist, Kathleen Laurel Sage, began her teaching career by getting the City and Guilds Embroidery 1 and 2 qualifications. She has now gone on to write a book on the subject.\n\nReferences\n\nOther websites\n The Crimson Thread of Kinship at the National Museum of Australia\n\nNeedlework"
3,12045,44309,https://simple.wikipedia.org/wiki/Consecutive%20integer,Consecutive integer,"Consecutive numbers are numbers that follow each other in order. They have a difference of 1 between every two numbers. In a set of consecutive numbers, the mean and the median are equal. \n\nIf n is a number, then the next numbers will be n+1 and n+2. \n\nExamples \n\nConsecutive numbers that follow each other in order:\n\n 1, 2, 3, 4, 5\n -3, −2, −1, 0, 1, 2, 3, 4\n 6, 7, 8, 9, 10, 11, 12, 13\n\nConsecutive even numbers \nConsecutive even numbers are even numbers that follow each other. They have a difference of 2 between every two numbers.\n\nIf n is an even integer, then n, n+2, n+4 and n+6 will be consecutive even numbers.\n\nFor example - 2,4,6,8,10,12,14,18 etc.\n\nConsecutive odd numbers\nConsecutive odd numbers are odd numbers that follow each other. Like consecutive odd numbers, they have a difference of 2 between every two numbers.\n\nIf n is an odd integer, then n, n+2, n+4 and n+6 will be consecutive odd numbers.\n\nExamples\n\n3, 5, 7, 9, 11, 13, etc.\n\n−23, −21, −19, −17, −15, -13, -11\n\nIntegers"
4,11477,41741,https://simple.wikipedia.org/wiki/German%20Empire,German Empire,"The German Empire (""Deutsches Reich"" or ""Deutsches Kaiserreich"" in the German language) is the name for a group of German countries from January 18, 1871 to November 9, 1918. This is from the Unification of Germany when Wilhelm I of Prussia was made German Kaiser to when the third Emperor Wilhelm II was removed from power at the end of the First World War. In the 1920s, German nationalists started to call it the ""Second Reich"".\n\nThe name of Germany was ""Deutsches Reich"" until 1945. ""Reich"" can mean many things, empire, kingdom, state, ""richness"" or ""wealth"". Most members of the Empire were previously members of the North German Confederation. \n\nAt different times, there were three groups of smaller countries, each group was later called a ""Reich"" by some Germans. The first was the Holy Roman Empire. The second was the German Empire. The third was the Third Reich.\n\nThe words ""Second Reich"" were used for the German Empire by Arthur Moeller van den Bruck, a nationalist writer in the 1920s. He was trying to make a link with the earlier Holy Roman Empire which had once been very strong. Germany had lost First World War and was suffering big problems. van den Bruck wanted to start a ""Third Reich"" to unite the country. These words were later used by the Nazis to make themselves appear stronger.\n\nStates in the Empire\n\nRelated pages\n Germany\n Holy Roman Empire\n Nazi Germany, or ""Drittes Reich""\n\n1870s establishments in Germany\n \nStates and territories disestablished in the 20th century\nStates and territories established in the 19th century\n1871 establishments in Europe\n1918 disestablishments in Germany"


## Storage

We'll initialise our vector database first. Which database you choose and how you store data in it is a key decision point, and we've collated a few principles to aid your decision here:

#### How much data to store
How much metadata do you want to include in the index? Metadata can be used to filter your queries or to bring back more information upon retrieval for your application to use, but larger indices will be slower, so there is a trade-off.

There are two common design patterns here:
- **All-in-one:** Store your metadata with the vector embeddings so you perform semantic search and retrieval on the same database. This is easier to set up and run, but can run into scaling issues when your index grows.
- **Vectors only:** Store just the embeddings and any IDs/references needed to locate the metadata that goes with the vector in a different database or location. In this pattern the vector database is only used to locate the most relevant IDs, then those are looked up from a different database. This can be more scalable if your vector database is going to be extremely large, or if you have large volumes of metadata with each vector.
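
To make the two patterns concrete, here is a minimal sketch that uses plain Python dictionaries as stand-ins for the vector database and the separate metadata store. The names `all_in_one_store`, `vector_only_store` and `metadata_db`, and the toy 3-dimensional vectors, are illustrative assumptions and are not part of the Redis pipeline built in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pattern 1 - all-in-one: each record holds the vector AND its metadata
all_in_one_store = {
    "doc-1": {"vector": [1.0, 0.0, 0.0], "title": "Photon"},
    "doc-2": {"vector": [0.0, 1.0, 0.0], "title": "Embroidery"},
}

# Pattern 2 - vectors only: the vector store holds IDs and vectors,
# while the metadata lives in a different (hypothetical) database
vector_only_store = {"doc-1": [1.0, 0.0, 0.0], "doc-2": [0.0, 1.0, 0.0]}
metadata_db = {
    "doc-1": {"title": "Photon"},
    "doc-2": {"title": "Embroidery"},
}

def search_all_in_one(query_vector):
    """One lookup: the best match already carries its metadata."""
    best_id = max(
        all_in_one_store,
        key=lambda k: cosine_similarity(query_vector, all_in_one_store[k]["vector"]),
    )
    return {"id": best_id, "title": all_in_one_store[best_id]["title"]}

def search_vectors_only(query_vector):
    """Two lookups: find the best ID, then fetch metadata from the other store."""
    best_id = max(
        vector_only_store,
        key=lambda k: cosine_similarity(query_vector, vector_only_store[k]),
    )
    return {"id": best_id, **metadata_db[best_id]}

print(search_all_in_one([0.9, 0.1, 0.0]))
print(search_vectors_only([0.9, 0.1, 0.0]))
```

Both functions return the same answer; the difference is where the metadata is read from, which is what drives the scaling trade-off described above.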

#### Which vector database to use

The vector database market is wide and varied, so we won't recommend one over the other. For a few options you can review [this cookbook](./vector_databases/Using_vector_databases_for_embeddings_search.ipynb) and the sub-folders, which have examples supplied by many of the vector database providers in the market. 

We're going to use Redis as our database for both document contents and the vector embeddings. You will need the full Redis Stack to enable use of RediSearch, which is the module that allows semantic search - more detail is in the [docs for Redis Stack](https://redis.io/docs/stack/get-started/install/docker/).

To set this up locally, you will need to:
- Install an appropriate version of [Docker](https://docs.docker.com/desktop/) for your OS
- Ensure Docker is running, e.g. by running ```docker run hello-world```
- Run the following command: ```docker run -d --name redis-stack -p 6379:6379 -p 8001:8001 redis/redis-stack:latest```.

The code used here draws heavily on [this repo](https://github.com/RedisAI/vecsim-demo).

After setting up the Docker instance of Redis Stack, you can follow the below instructions to initiate a Redis connection and create a Hierarchical Navigable Small World (HNSW) index for semantic search.

In [7]:
# Setup Redis


REDIS_HOST = 'localhost'
REDIS_PORT = 6379
REDIS_DB = 0

redis_client = r(host=REDIS_HOST, port=REDIS_PORT, db=REDIS_DB, decode_responses=False)


# Constants
VECTOR_DIM = 1536 # length of the vectors
PREFIX = "wiki" # prefix for the document keys
DISTANCE_METRIC = "COSINE" # distance metric for the vectors (ex. COSINE, IP, L2)

In [8]:
# Create search index

# Index
INDEX_NAME = "wiki-index"           # name of the search index
VECTOR_FIELD_NAME = 'content_vector'

# Define RediSearch fields for each of the columns in the dataset
# This is where you should add any additional metadata you want to capture
id = TextField("id")
url = TextField("url")
title = TextField("title")
text_chunk = TextField("content")
file_chunk_index = NumericField("file_chunk_index")

# define RediSearch vector fields to use HNSW index

text_embedding = VectorField(VECTOR_FIELD_NAME,
    "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": DISTANCE_METRIC
    }
)
# Add all our field objects to a list to be created as an index
fields = [url,title,text_chunk,file_chunk_index,text_embedding]

redis_client.ping()

True

Optional step to drop the index if it already exists

```redis_client.ft(INDEX_NAME).dropindex()```

If you want to clear the whole DB use:

```redis_client.flushall()```

In [9]:
# Check if index exists
try:
    redis_client.ft(INDEX_NAME).info()
    print("Index already exists")
except Exception as e:
    print(e)
    # Create RediSearch Index
    print('Not there yet. Creating')
    redis_client.ft(INDEX_NAME).create_index(
        fields = fields,
        definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)
    )

Unknown Index name
Not there yet. Creating


### Data preparation

The next step is to prepare your data. There are a few decisions to keep in mind here:

#### Chunking your data

In this context, "chunking" means cutting up the text into reasonable sizes so that the content will fit into the context length of the language model you choose. If your data is small enough or your LLM has a large enough context limit then you can proceed with no chunking, but in many cases you'll need to chunk your data. I'll share two main design patterns here:
- **Token-based:** Chunking your data based on some common token threshold, e.g. 300, 500 or 1000 depending on your use case. This approach works best with a grid-search evaluation to decide the optimal chunking logic over a set of evaluation questions. Variables to consider are whether chunks have overlaps, and whether you extend or truncate a section to keep full sentences and paragraphs together.
- **Deterministic:** Deterministic chunking uses some common delimiter, like a page break, paragraph end or section header, to chunk. This can work well if you have data of reasonably uniform structure, or if you can use GPT to help annotate the data first so you can guarantee common delimiters. However, the resulting chunks can be difficult to handle when you stuff them into the prompt, because you need to cater for many different lengths of content, so consider that in your application design.
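
The two patterns can be sketched roughly as follows. This is a simplified illustration rather than the chunking logic used later in this notebook: the helper names `token_chunks` and `delimiter_chunks` are hypothetical, and whitespace-separated words stand in for real tokens (in practice you would pass in the token IDs produced by a tokenizer such as `tiktoken`).

```python
def token_chunks(tokens, chunk_size, overlap=0):
    """Split a token list into windows of chunk_size tokens.

    Each window shares `overlap` tokens with the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if not tokens:
        return []
    step = chunk_size - overlap
    # Stop early enough that the final window is never just the overlap region
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + chunk_size] for i in range(0, last_start, step)]

def delimiter_chunks(text, delimiter="\n\n"):
    """Split text on a fixed delimiter (here: blank lines), dropping empty parts."""
    return [part.strip() for part in text.split(delimiter) if part.strip()]

# Assumption: words act as stand-in tokens for this example
words = "Photons are particles which transmit light in many atomic models".split()
print(token_chunks(words, chunk_size=4, overlap=1))
print(delimiter_chunks("First paragraph.\n\nSecond paragraph.\n\nThird paragraph."))
```

Token-based windows have a predictable maximum size, while delimiter-based chunks follow the document's structure but vary in length, which is the prompt-handling concern raised above.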

#### Which vectors should you store

It is critical to think through the user experience you're building towards because this will inform both the number and content of your vectors. Here are two example use cases that show how these can pan out:
- **Tool Manual Knowledge Base:** We have a database of manuals that our customers want to search over. For this use case, we want a vector to allow the user to identify the right manual, before searching a different set of vectors to interrogate the content of the manual to avoid any cross-pollination of similar content between different manuals. 
    - **Title Vector:** Could include title, author name, brand and abstract.
    - **Content Vector:** Includes content only.
- **Investor Reports:** We have a database of investor reports that contain financial information about public companies. I want relevant snippets pulled out and summarised so I can decide how to invest. In this instance we want one set of content vectors, so that the retrieval can pull multiple entries on a company or industry, and summarise them to form a composite analysis.
    - **Content Vector:** Includes content only, or content supplemented by other features that improve search quality such as author, industry etc.
    
For this walkthrough we'll go with 1000 token-based chunking of text content with no overlap, and embed them with the article title included as a prefix.

In [10]:
# We'll use 1000 token chunks with some intelligence to avoid splitting in the middle of a sentence
TEXT_EMBEDDING_CHUNK_SIZE = 1000
EMBEDDINGS_MODEL = "text-embedding-ada-002"

In [11]:
## Chunking Logic

# Split a text into smaller chunks of size n, preferably ending at the end of a sentence
def chunks(text, n, tokenizer):
    """Yield successive chunks of roughly n tokens from text, preferably ending at a sentence boundary."""
    tokens = tokenizer.encode(text)
    i = 0
    while i < len(tokens):
        # Find the nearest end of sentence within a range of 0.5 * n and 1.5 * n tokens
        j = min(i + int(1.5 * n), len(tokens))
        while j > i + int(0.5 * n):
            # Decode the tokens and check for full stop or newline
            chunk = tokenizer.decode(tokens[i:j])
            if chunk.endswith(".") or chunk.endswith("\n"):
                break
            j -= 1
        # If no end of sentence found, use n tokens as the chunk size
        if j == i + int(0.5 * n):
            j = min(i + n, len(tokens))
        yield tokens[i:j]
        i = j
        
def get_unique_id_for_file_chunk(title, chunk_index):
    return str(title+"-!"+str(chunk_index))

def chunk_text(x,text_list):
    """Chunk one article row and append a dict (id plus metadata) per chunk to text_list."""
    url = x['url']
    title = x['title']
    file_body_string = x['text']
        
    token_chunks = list(chunks(file_body_string, TEXT_EMBEDDING_CHUNK_SIZE, tokenizer))
    text_chunks = [f'Title: {title};\n'+ tokenizer.decode(chunk) for chunk in token_chunks]
    
    #embeddings_response = openai.Embedding.create(input=text_chunks, model=EMBEDDINGS_MODEL)

    #embeddings = [embedding["embedding"] for embedding in embeddings_response['data']]
    #text_embeddings = list(zip(text_chunks, embeddings))

    # Build one entry per chunk: a unique id plus a metadata dict
    # Metadata is a dict with keys: url, title, content, file_chunk_index
    
    for i, text_chunk in enumerate(text_chunks):
        id = get_unique_id_for_file_chunk(title, i)
        text_list.append(({'id': id
                         , 'metadata': {"url": x['url']
                                      ,"title": title
                                      , "content": text_chunk
                                      , "file_chunk_index": i}}))

In [12]:
## Batch Embedding Logic

# Simple function to take in a list of text objects and return them as a list of embeddings
@retry(wait=wait_random_exponential(min=1, max=40), stop=stop_after_attempt(10))
def get_embeddings(input: List):
    response = openai.Embedding.create(
        input=input,
        model=EMBEDDINGS_MODEL,
    )["data"]
    return [data["embedding"] for data in response]

def batchify(iterable, n=1):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx : min(ndx + n, l)]

# Function for batching and parallel processing the embeddings
def embed_corpus(
    corpus: List[str],
    batch_size=64,
    num_workers=8,
    max_context_len=8191,
):

    # Encode the corpus, truncating to max_context_len
    encoding = tiktoken.get_encoding("cl100k_base")
    encoded_corpus = [
        encoded_article[:max_context_len] for encoded_article in encoding.encode_batch(corpus)
    ]

    # Calculate corpus statistics: the number of inputs, the total number of tokens, and the estimated cost to embed
    num_tokens = sum(len(article) for article in encoded_corpus)
    cost_to_embed_tokens = num_tokens / 1_000 * 0.0004
    print(
        f"num_articles={len(encoded_corpus)}, num_tokens={num_tokens}, est_embedding_cost={cost_to_embed_tokens:.2f} USD"
    )

    # Embed the corpus
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        
        futures = [
            executor.submit(get_embeddings, text_batch)
            for text_batch in batchify(encoded_corpus, batch_size)
        ]

        with tqdm(total=len(encoded_corpus)) as pbar:
            for _ in concurrent.futures.as_completed(futures):
                pbar.update(batch_size)

        embeddings = []
        for future in futures:
            data = future.result()
            embeddings.extend(data)

        return embeddings

In [13]:
%%time
# Initialise tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")

# List to hold vectors
text_list = []

# Process each article and prepare for embedding
x = article_df.apply(lambda x: chunk_text(x, text_list),axis = 1)

CPU times: user 1.04 s, sys: 131 ms, total: 1.17 s
Wall time: 1.2 s


In [14]:
text_list[0]

{'id': 'Photon-!0',
 'metadata': {'url': 'https://simple.wikipedia.org/wiki/Photon',
  'title': 'Photon',
  'content': 'Title: Photon;\nPhotons  (from Greek φως, meaning light), in many atomic models in physics,  are particles which transmit light. In other words, light is carried over space by photons. Photon is an elementary particle that is its own antiparticle. In quantum mechanics each photon has a characteristic quantum of energy that depends on frequency: A photon associated with light at a higher frequency will have more energy (and be associated with light at a shorter wavelength).\n\nPhotons have a rest mass of 0 (zero). However, Einstein\'s theory of relativity says that they do have a certain amount of momentum. Before the photon got its name, Einstein revived the proposal that light is separate pieces of energy (particles). These particles came to be known as photons. \n\nA photon is usually given the symbol γ (gamma),\n\nProperties \n\nPhotons are fundamental particles. A

In [15]:
# Batch embed our chunked text - this will cost you about $0.50
embeddings = embed_corpus([text["metadata"]['content'] for text in text_list])

num_articles=2693, num_tokens=1046988, est_embedding_cost=0.42 USD


2752it [00:10, 271.48it/s]                                                                               


In [16]:
# Join up embeddings with our original list
embeddings_list = [{"embedding": v} for v in embeddings]
for i,x in enumerate(embeddings_list):
    text_list[i].update(x)
text_list[0]

{'id': 'Photon-!0',
 'metadata': {'url': 'https://simple.wikipedia.org/wiki/Photon',
  'title': 'Photon',
  'content': 'Title: Photon;\nPhotons  (from Greek φως, meaning light), in many atomic models in physics,  are particles which transmit light. In other words, light is carried over space by photons. Photon is an elementary particle that is its own antiparticle. In quantum mechanics each photon has a characteristic quantum of energy that depends on frequency: A photon associated with light at a higher frequency will have more energy (and be associated with light at a shorter wavelength).\n\nPhotons have a rest mass of 0 (zero). However, Einstein\'s theory of relativity says that they do have a certain amount of momentum. Before the photon got its name, Einstein revived the proposal that light is separate pieces of energy (particles). These particles came to be known as photons. \n\nA photon is usually given the symbol γ (gamma),\n\nProperties \n\nPhotons are fundamental particles. A

In [17]:
# Create a Redis pipeline to load all the vectors and their metadata
def load_vectors(client:r, input_list, vector_field_name):
    p = client.pipeline(transaction=False)
    for text in input_list:    
        #hash key
        key=f"{PREFIX}:{text['id']}"
        
        #hash values
        item_metadata = text['metadata']
        #
        item_keywords_vector = np.array(text['embedding'],dtype= 'float32').tobytes()
        item_metadata[vector_field_name]=item_keywords_vector
        
        # HSET
        p.hset(key,mapping=item_metadata)
            
    p.execute()

In [18]:
batch_size = 100  # how many vectors we insert at once

for i in tqdm(range(0, len(text_list), batch_size)):
    # find end of batch
    i_end = min(len(text_list), i+batch_size)
    meta_batch = text_list[i:i_end]
    
    load_vectors(redis_client,meta_batch,vector_field_name=VECTOR_FIELD_NAME)

100%|████████████████████████████████████████████████████████████████████| 27/27 [00:07<00:00,  3.40it/s]


In [19]:
redis_client.ft(INDEX_NAME).info()['num_docs']

'2693'

### Search

We can now use our knowledge base to bring back search results. This is one of the areas of highest friction in enterprise knowledge retrieval use cases, with the most common complaint being that the system is not retrieving what you intuitively think are the most relevant documents. There are a few ways of tackling this - I'll share a few options here, as well as some resources to take your research further:

#### Vector search, keyword search or a hybrid

Despite the strong capabilities that vector search gives out of the box, search is still not a solved problem. There are well-proven [Lucene-based](https://en.wikipedia.org/wiki/Apache_Lucene) search solutions such as Elasticsearch and Solr that use methods that work well for certain use cases, as well as the sparse vector methods of traditional NLP such as [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). If your retrieval is poor, the answer may be one of these in particular, or a combination:
- **Vector search:** Converts your text into vector embeddings which can be searched using KNN, SVM or some other model to return the most relevant results. This is the approach we take in this workbook, using a RediSearch vector DB which employs a KNN search under the hood.
- **Keyword search:** This method uses any keyword-based search approach to return a score - it could use Elasticsearch/Solr out-of-the-box, or a TF-IDF approach like BM25.
- **Hybrid search:** This last approach is a mix of the two, where you produce both a vector search and keyword search result, before using an ```alpha``` between 0 and 1 to weight the outputs. There is a great example of this explained by the Weaviate team [here](https://weaviate.io/blog/hybrid-search-explained).
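
As a rough illustration of the ```alpha``` weighting, the sketch below combines two sets of scores for the same documents. It is one possible scheme under stated assumptions, not the method of any particular database: the helper names are hypothetical, both score sets are min-max normalized so they are comparable, higher scores mean "more relevant" (vector distances would need converting to similarities first), and ```alpha=1``` is taken to mean pure vector search.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} dict to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def hybrid_scores(vector_scores, keyword_scores, alpha=0.5):
    """Blend vector and keyword scores; returns (doc_id, score) pairs, best first.

    alpha=1.0 uses only the vector scores, alpha=0.0 only the keyword scores.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    v = min_max_normalize(vector_scores)
    k = min_max_normalize(keyword_scores)
    combined = {
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * k.get(doc_id, 0.0)
        for doc_id in set(v) | set(k)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Made-up scores for three documents, for illustration only
vector_scores = {"a": 0.9, "b": 0.5, "c": 0.1}     # e.g. cosine similarity
keyword_scores = {"a": 1.0, "b": 12.0, "c": 3.0}   # e.g. a BM25-style score

for alpha in (1.0, 0.5, 0.0):
    print(alpha, [doc_id for doc_id, _ in hybrid_scores(vector_scores, keyword_scores, alpha)])
```

Sweeping ```alpha``` over a set of evaluation questions is a simple way to find out whether your retrieval benefits from leaning towards keywords or towards semantics.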

#### Hypothetical Document Embeddings (HyDE)

This is a novel approach from [this paper](https://arxiv.org/abs/2212.10496), which states that a hypothetical answer to a question is more semantically similar to the real answer than the question is. In practice this means that your search would use GPT to generate a hypothetical answer, then embed that and use it for search. I've seen success with this both as a pure search, and as a retry step if the initial retrieval fails to retrieve relevant content. A simple example implementation is here:
```
def answer_question_hyde(question,prompt):
    
    hyde_prompt = '''You are OracleGPT, a helpful expert who answers user questions to the best of their ability.
    Provide a confident answer to their question. If you don't know the answer, make the best guess you can based on the context of the question.

    User question: USER_QUESTION_HERE
    
    Answer:'''
    
    hypothetical_answer = openai.Completion.create(model=COMPLETIONS_MODEL,prompt=hyde_prompt.replace('USER_QUESTION_HERE',question))['choices'][0]['text']
    
    search_results = get_redis_results(redis_client,hypothetical_answer)
    
    return search_results
```
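
The retry variant mentioned above could look something like the sketch below. It is an illustration under assumptions rather than a tested recipe: `search_fn` and `generate_answer_fn` are hypothetical callables standing in for the Redis search and the GPT call, results are assumed to be `(document, distance)` pairs sorted best-first with lower distance meaning a closer match, and the `max_distance` threshold is a made-up value you would need to tune.

```python
def search_with_hyde_retry(question, search_fn, generate_answer_fn, max_distance=0.2):
    """Search with the raw question first; retry with a hypothetical answer if the best hit is weak."""
    results = search_fn(question)
    if results and results[0][1] <= max_distance:
        return results, "direct"
    # Initial retrieval was empty or too distant, so embed a hypothetical answer instead
    hypothetical_answer = generate_answer_fn(question)
    return search_fn(hypothetical_answer), "hyde"


# Stand-ins so the control flow can be exercised without Redis or the OpenAI API
def fake_search(text):
    return [("doc", 0.1)] if "synth" in text else [("doc", 0.4)]


def fake_generate(question):
    return "He is a synth-pop musician."


results, strategy = search_with_hyde_retry(
    "What is Thomas Dolby known for?", fake_search, fake_generate
)
```

Returning the strategy alongside the results makes it easy to log how often the fallback fires, which is useful when deciding whether HyDE is worth its extra completion call.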

#### Fine-tuning embeddings

This next approach leverages the learning you gain from real question/answer pairs that your users will generate during the evaluation approach. It works by:
- Creating a dataset of positive (and optionally negative) question and answer pairs. Positive examples would be a correct retrieval to a question, while negative would be poor retrievals.
- Calculating the embeddings for both questions and answers and the cosine similarity between them.
- Training a model to optimize the embeddings matrix and testing retrieval, picking the best one.
- Performing a matrix multiplication of the base Ada embeddings by this new best matrix, creating a new fine-tuned embedding to use for retrieval.

There is a great walkthrough of both the approach and the code to perform it in [this cookbook](./Customizing_embeddings.ipynb).
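
To give a feel for the final step, here is a toy sketch of multiplying embeddings by a matrix and comparing cosine similarity before and after. The three-dimensional vectors and the hand-written matrix are invented purely for illustration; a real matrix would be learned from your question/answer pairs as described in the linked cookbook.

```python
import math


def apply_matrix(embedding, matrix):
    """Multiply a length-n embedding by an n x m matrix, giving a length-m embedding."""
    return [
        sum(value * row[col] for value, row in zip(embedding, matrix))
        for col in range(len(matrix[0]))
    ]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings for a question and its correct answer (made-up values)
question_embedding = [1.0, 0.0, 1.0]
answer_embedding = [1.0, 1.0, 0.0]

# Hand-written stand-in for a learned matrix: keeps dimension 0, damps the others
toy_matrix = [
    [1.0, 0.0, 0.0],
    [0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1],
]

base_similarity = cosine_similarity(question_embedding, answer_embedding)
tuned_similarity = cosine_similarity(
    apply_matrix(question_embedding, toy_matrix),
    apply_matrix(answer_embedding, toy_matrix),
)
```

In this toy case the matrix damps the dimensions where the pair disagrees, so the positive pair moves closer together. The trained matrix plays the same role across the full embedding dimensions.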

#### Reranking

One other well-proven method from traditional search solutions that can be applied to any of the above approaches is reranking, where we over-fetch our search results, and then deterministically rerank based on a modifier or set of modifiers.

An example is investor reports again - it is highly likely that if we have 3 reports on Apple, we'll want to make our investment decisions based on the latest one. In this instance a ```recency``` modifier could be applied to the vector scores to sort them, giving us the latest one on the top even if it is not the most semantically similar to our search question. 
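
A recency modifier of this kind might be sketched as follows. The report titles, dates, similarity values and the `recency_weight` penalty are all hypothetical, and the sketch assumes scores where higher means more similar (a distance such as the Redis `vector_score` would need converting first).

```python
from datetime import date


def rerank_by_recency(results, today, recency_weight=0.1):
    """Sort over-fetched results by similarity minus a penalty per year of age."""
    def adjusted_score(result):
        age_in_years = (today - result["published"]).days / 365.25
        return result["similarity"] - recency_weight * age_in_years

    return sorted(results, key=adjusted_score, reverse=True)


# Made-up search results: the oldest report is the closest semantic match
reports = [
    {"title": "Report A", "similarity": 0.92, "published": date(2021, 10, 29)},
    {"title": "Report B", "similarity": 0.90, "published": date(2022, 10, 28)},
    {"title": "Report C", "similarity": 0.88, "published": date(2023, 11, 3)},
]
reranked = rerank_by_recency(reports, today=date(2024, 1, 1))
```

Here the newest report ends up first even though it has the lowest raw similarity. Because the adjustment is deterministic, it is easy to log and reason about when tuning the weight.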

For this walkthrough we'll stick with a basic semantic search bringing back the top 5 chunks for a user question, and providing a summarised response using GPT.

In [20]:
# Make query to Redis
def query_redis(redis_conn,query,index_name, top_k=5):
    
    

    ## Creates embedding vector from user query
    embedded_query = np.array(openai.Embedding.create(
                                                input=query,
                                                model=EMBEDDINGS_MODEL,
                                            )["data"][0]['embedding'], dtype=np.float32).tobytes()

    # Prepare a KNN query that returns the top_k nearest chunks, closest first
    q = (
        Query(f'*=>[KNN {top_k} @{VECTOR_FIELD_NAME} $vec_param AS vector_score]')
        .sort_by('vector_score')
        .paging(0, top_k)
        .return_fields('vector_score', 'url', 'title', 'content', 'text_chunk_index')
        .dialect(2)
    )
    params_dict = {"vec_param": embedded_query}

    
    #Execute the query
    results = redis_conn.ft(index_name).search(q, query_params = params_dict)
    
    return results

# Get mapped documents from Redis results
def get_redis_results(redis_conn,query,index_name):
    
    # Get most relevant documents from Redis
    query_result = query_redis(redis_conn,query,index_name)
    
    # Extract info into a list
    query_result_list = []
    for i, result in enumerate(query_result.docs):
        result_order = i
        url = result.url
        title = result.title
        text = result.content
        score = result.vector_score
        query_result_list.append((result_order,url,title,text,score))
        
    # Display result as a DataFrame for ease of use
    result_df = pd.DataFrame(query_result_list)
    result_df.columns = ['id','url','title','result','certainty']
    return result_df

In [21]:
%%time

wiki_query='What is Thomas Dolby known for?'

result_df = get_redis_results(redis_client,wiki_query,index_name=INDEX_NAME)
result_df.head(2)

CPU times: user 7.1 ms, sys: 2.35 ms, total: 9.45 ms
Wall time: 495 ms


| | id | url | title | result | certainty |
|---|---|---|---|---|---|
| 0 | 0 | https://simple.wikipedia.org/wiki/Thomas%20Dolby | Thomas Dolby | Title: Thomas Dolby;\nThomas Dolby (born Thomas Morgan Robertson; 14 October 1958) is a British musican and computer designer. He is probably most famous for his 1982 hit, "She Blinded me with Science".\n\nHe married actress Kathleen Beller in 1988. The couple have three children together.\n\nDiscography\n\nSingles\n\nA Track did not chart in North America until 1983, after the success of "She Blinded Me With Science".\n\nAlbums\n\nStudio albums\n\nEPs\n\nReferences\n\nEnglish musicians\nLiving people\n1958 births\nNew wave musicians\nWarner Bros. Records artists | 0.132723689079 |
| 1 | 1 | https://simple.wikipedia.org/wiki/Synthesizer | Synthesizer | Title: Synthesizer;\nAudio technology | 0.223129153252 |


In [22]:
# Build a prompt to provide the original query, the result and ask to summarise for the user
retrieval_prompt = '''Use the content to answer the search query the customer has sent.
If you can't answer the user's question, say "Sorry, I am unable to answer the question with the content". Do not guess.

Search query: 

SEARCH_QUERY_HERE

Content: 

SEARCH_CONTENT_HERE

Answer:
'''

def answer_user_question(query):
    
    results = get_redis_results(redis_client,query,INDEX_NAME)
    
    retrieval_prepped = retrieval_prompt.replace('SEARCH_QUERY_HERE',query).replace('SEARCH_CONTENT_HERE',results['result'][0])
    retrieval = openai.ChatCompletion.create(model=CHAT_MODEL,messages=[{'role':"user",'content': retrieval_prepped}],max_tokens=500)
    
    # Response provided by GPT-3.5
    return retrieval['choices'][0]['message']['content']

In [23]:
print(answer_user_question(wiki_query))

Thomas Dolby is known for his music, particularly his 1982 hit "She Blinded Me With Science". He is also a computer designer.


### Answer

We've now created a knowledge base that can answer user questions on Wikipedia. However, the user experience could be better, and this is where the Answer layer comes in, where an LLM Agent is used to interact with the user.

#### Choosing the user experience and architecture

There are different levels of complexity in building a knowledge retrieval experience leveraging an LLM; there is an experience vs. effort trade-off to consider when selecting the right type of interaction. There are many patterns, but I'll highlight a few of the most common here:
- **Q&A:** Your classic search engine use case, where the user inputs a question and your LLM gives them an answer either using its knowledge or, much more commonly, using a knowledge base that you prepare using the steps we've covered already. This simple use case assumes no memory of past queries is required, and no ability to clarify with the human or ask for more information.
- **Chat:** I think of Chat as being Q&A + memory - this is a slightly more sophisticated interaction where the LLM remembers what was previously asked and can delve deeper on something already covered.
- **Agent:** The most sophisticated is what LangChain calls an Agent. Agents leverage large language models to process and produce human-like results through a variety of tools, chaining queries together dynamically until the LLM judges that it has an appropriate answer to the user's question. However, for every "turn" you allow between Agent and user you increase the risk of loss of context, hallucination, or parsing errors, so be clear about the exact requirements your users have before embarking on building the Answer layer.

Q&A use cases are the simplest to implement, while Agents can give the most sophisticated user experience - in this notebook we'll build an Agent with memory and two Tools to give an appreciation for the flexibility prompt chaining gives you in getting a more complete answer for your users.
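
To make the "Q&A + memory" idea concrete, here is a minimal sketch of a sliding-window memory that keeps only the last `k` turns and renders them as a history string for the prompt. The `WindowMemory` class is a hypothetical stand-in written for illustration; it is not LangChain's implementation, which we use later in the notebook.

```python
from collections import deque


class WindowMemory:
    """Keep the most recent k question/answer turns."""

    def __init__(self, k=2):
        # A deque with maxlen drops the oldest turn automatically
        self.turns = deque(maxlen=k)

    def save(self, question, answer):
        self.turns.append((question, answer))

    def as_history(self):
        return "\n".join(f"Human: {q}\nAI: {a}" for q, a in self.turns)


memory_sketch = WindowMemory(k=2)
memory_sketch.save("Who is Thomas Dolby?", "A British musician.")
memory_sketch.save("What is his best-known song?", "She Blinded Me With Science.")
memory_sketch.save("When was it released?", "1982.")
history = memory_sketch.as_history()
```

After the third turn the first one has been dropped, so a follow-up question can still resolve "it" to the song but no longer has the original question in context. Choosing `k` is a trade-off between context and prompt length.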

#### Ensuring reliability

The more complexity you add, the more chance your LLM will fail to respond correctly, or a response will come back in the wrong format and break your Answer pipeline. We'll share a few methods our customers have used elsewhere to help "channel" the Agent down a more deterministic path, and to deal with issues when they do crop up:
- **Prompt chaining:** Prompting the model to take a step-by-step approach and think aloud using a scratchpad has been proven to deliver more consistent results. It also means that as a developer you can break up one complex prompt into many simpler, more deterministic prompts, with the output of one prompt becoming the input for the next. This approach is known as Chain-of-Thought (CoT) reasoning - I'd suggest digging deeper as this is a dynamic new area of research, with a few of the key papers referenced here:
    - Chain of thought prompting [paper](https://arxiv.org/abs/2201.11903)
    - Self-reflecting agent [paper](https://arxiv.org/abs/2303.11366)
- **Self-referencing:** You can return references for the LLM's answer through either your application logic, or by prompt engineering it to return references. I would generally suggest doing it in your application logic, although if you have multiple chunks then a hybrid approach where you ask the LLM to return the key of the chunk it used could be advisable. I view this as a UX opportunity, where for many search use cases giving the "raw" output of the chunks retrieved as well as the summarised answer can give the user the best of both worlds, but please go with whatever is most appropriate for your users.
- **Discriminator models:** The best control for unwanted outputs is undoubtedly preventing them from happening with prompt engineering, prompt chaining and retrieval. However, when all these fail, a discriminator model is a useful detective control. This is a classifier trained on past unwanted outputs that flags the Agent's response to the user as Safe or Not, enabling you to perform some business logic to either retry, pass to a human, or say it doesn't know. 
    - There is an example in our [Help Center](https://help.openai.com/en/articles/5528730-fine-tuning-a-classifier-to-improve-truthfulness).
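
The business logic wrapped around a discriminator could be sketched like this. Everything here is illustrative: `answer_fn` stands in for the Agent, `is_safe_fn` for the trained classifier, and the retry count and fallback message are assumptions you would replace with your own policy (for example, handing off to a human instead).

```python
def respond_with_guardrail(question, answer_fn, is_safe_fn, max_retries=1,
                           fallback="Sorry, I don't know the answer to that."):
    """Return the first answer the discriminator accepts, or a fallback after retries."""
    attempts = 0
    for _ in range(max_retries + 1):
        attempts += 1
        candidate = answer_fn(question)
        if is_safe_fn(question, candidate):
            return {"answer": candidate, "status": "answered", "attempts": attempts}
    # Every attempt was flagged, so fall back rather than show a flagged answer
    return {"answer": fallback, "status": "fallback", "attempts": attempts}


# Stand-ins for the Agent and the discriminator so the flow can be run offline
canned_answers = iter(["Inventing electricity", "His 1982 hit single"])


def fake_agent(question):
    return next(canned_answers)


def fake_discriminator(question, answer):
    return "electricity" not in answer.lower()


outcome = respond_with_guardrail(
    "What is Thomas Dolby known for?", fake_agent, fake_discriminator
)
```

Returning a status and attempt count alongside the answer gives you something to log, which helps when deciding whether the discriminator or the upstream prompt needs tuning.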

This is a dynamic topic that has still not consolidated to a clear design that works best above all others, so for ease of implementation we will use LangChain, which supplies a framework with implementations for most of the concepts we've discussed above.

We'll create an Agent with access to our knowledge base, give it a prompt template and a custom parser for extracting the answers, set up a prompt chain and then let it answer our Wikipedia questions.

Our work here draws heavily on LangChain's great documentation, in particular [this guide](https://python.langchain.com/en/latest/modules/agents/agents/custom_llm_chat_agent.html).

In [24]:
from langchain.agents import Tool, AgentExecutor, LLMSingleActionAgent, AgentOutputParser
from langchain.prompts import BaseChatPromptTemplate
from langchain import SerpAPIWrapper, LLMChain
from langchain.chat_models import ChatOpenAI
from typing import List, Union
from langchain.schema import AgentAction, AgentFinish, HumanMessage
from langchain.memory import ConversationBufferWindowMemory
import re

In [25]:
def ask_gpt(query):
    response = openai.ChatCompletion.create(model=CHAT_MODEL,messages=[{"role":"user","content":"Please answer my question.\nQuestion: {}".format(query)}],temperature=0)
    return response['choices'][0]['message']['content']

In [26]:
# Define which tools the agent can use to answer user queries
tools = [
    Tool(
        name = "Search",
        func=answer_user_question,
        description="Useful for when you need to answer general knowledge questions. Input should be a fully formed question."
    ),
    Tool(
        name = "Knowledge",
        func = ask_gpt,
        description = "Useful for any other questions. Input should be a fully formed question."
    )
]

In [27]:
# Set up the base template
template = """You are WikiGPT, a helpful bot who answers questions using your tools or your own knowledge.
You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Previous conversation history:
{history}

New question: {input}
{agent_scratchpad}"""

In [28]:
# Set up a prompt template
class CustomPromptTemplate(BaseChatPromptTemplate):
    # The template to use
    template: str
    # The list of tools available
    tools: List[Tool]
    
    def format_messages(self, **kwargs) -> List[HumanMessage]:
        # Get the intermediate steps (AgentAction, Observation tuples)
        # Format them in a particular way
        intermediate_steps = kwargs.pop("intermediate_steps")
        thoughts = ""
        for action, observation in intermediate_steps:
            thoughts += action.log
            thoughts += f"\nObservation: {observation}\nThought: "
        # Set the agent_scratchpad variable to that value
        kwargs["agent_scratchpad"] = thoughts
        # Create a tools variable from the list of tools provided
        kwargs["tools"] = "\n".join([f"{tool.name}: {tool.description}" for tool in self.tools])
        # Create a list of tool names for the tools provided
        kwargs["tool_names"] = ", ".join([tool.name for tool in self.tools])
        formatted = self.template.format(**kwargs)
        return [HumanMessage(content=formatted)]
    
    
class CustomOutputParser(AgentOutputParser):
    
    def parse(self, llm_output: str) -> Union[AgentAction, AgentFinish]:
        # Check if agent should finish
        if "Final Answer:" in llm_output:
            return AgentFinish(
                # Return values is generally always a dictionary with a single `output` key
                # It is not recommended to try anything else at the moment :)
                return_values={"output": llm_output.split("Final Answer:")[-1].strip()},
                log=llm_output,
            )
        # Parse out the action and action input
        regex = r"Action\s*\d*\s*:(.*?)\nAction\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"
        match = re.search(regex, llm_output, re.DOTALL)
        if not match:
            raise ValueError(f"Could not parse LLM output: `{llm_output}`")
        action = match.group(1).strip()
        action_input = match.group(2)
        # Return the action and action input
        return AgentAction(tool=action, tool_input=action_input.strip(" ").strip('"'), log=llm_output)
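
To see what the regular expression in `CustomOutputParser` extracts, it can be exercised on its own without LangChain. The helper below is a hypothetical standalone function that reuses the same pattern and the same quote-stripping as the parser above; the sample text is made up to mirror the prompt format.

```python
import re

# Same pattern as in CustomOutputParser.parse
ACTION_REGEX = r"Action\s*\d*\s*:(.*?)\nAction\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"


def extract_action(llm_output):
    """Return (tool_name, tool_input), or None if the output does not match the format."""
    match = re.search(ACTION_REGEX, llm_output, re.DOTALL)
    if not match:
        return None
    tool_name = match.group(1).strip()
    tool_input = match.group(2).strip(" ").strip('"')
    return tool_name, tool_input


sample_output = (
    "Thought: I should search for this.\n"
    "Action: Search\n"
    'Action Input: "What is Thomas Dolby known for?"'
)
parsed = extract_action(sample_output)
unparsed = extract_action("I am not sure what to do next.")
```

The second call returns `None`, which corresponds to the case where the parser raises a `ValueError`. This is the kind of parsing failure that becomes more likely as you allow more turns, so it is worth deciding up front how your application handles it.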

In [29]:
prompt = CustomPromptTemplate(
    template=template,
    tools=tools,
    # This omits the `agent_scratchpad`, `tools`, and `tool_names` variables because those are generated dynamically
    # The history template includes "history" as an input variable so we can interpolate it into the prompt
    input_variables=["input", "intermediate_steps", "history"]
)

# Initiate the memory with k=2 to keep the last two turns
# Provide the memory to the agent
memory = ConversationBufferWindowMemory(k=2)

In [30]:
output_parser = CustomOutputParser()

llm = ChatOpenAI(temperature=0)

# LLM chain consisting of the LLM and a prompt
llm_chain = LLMChain(llm=llm, prompt=prompt)

tool_names = [tool.name for tool in tools]
agent = LLMSingleActionAgent(
    llm_chain=llm_chain, 
    output_parser=output_parser,
    stop=["\nObservation:"], 
    allowed_tools=tool_names
)

In [31]:
agent_executor = AgentExecutor.from_agent_and_tools(agent=agent, tools=tools, verbose=True, memory=memory)

In [32]:
agent_executor.run(wiki_query)



> Entering new AgentExecutor chain...
Thought: I'm not sure who Thomas Dolby is, I should probably search for more information.
Action: Search
Action Input: "What is Thomas Dolby known for?"

Observation: Thomas Dolby is known for being a British musician and computer designer, and his 1982 hit "She Blinded Me With Science".
Now that I know who Thomas Dolby is, I can answer the question.
Final Answer: Thomas Dolby is known for being a British musician and computer designer, and his 1982 hit "She Blinded Me With Science".

> Finished chain.


'Thomas Dolby is known for being a British musician and computer designer, and his 1982 hit "She Blinded Me With Science".'

In [33]:
agent_executor.run('What is 5 + 5')



> Entering new AgentExecutor chain...
Thought: This is a simple math question.
Action: Knowledge
Action Input: What is the sum of 5 and 5?

Observation: The sum of 5 and 5 is 10.
I now know the final answer.
Final Answer: The sum of 5 and 5 is 10.

> Finished chain.


'The sum of 5 and 5 is 10.'

### Evaluation

Last comes the not-so-fun bit that will make the difference between nifty prototype and production application - the process of evaluating and tuning your results. 

The key takeaway here is to make a framework that saves the results of each evaluation, as well as the parameters. Evaluation can be a difficult task that takes significant resources, so it is best to start prepared to handle multiple iterations. Some useful principles we've seen successful deployments use are:
- **Assign clear product ownership and metrics:** Ensure you have a team aligned from the start to annotate the outputs and determine whether they're bad or good. This may seem an obvious step, but too often the focus is on the engineering challenge of successfully retrieving content rather than the product challenge of providing retrieval results that are useful.
- **Log everything:** Store all requests and responses to and from your LLM and retrieval service if you can. This builds a great base for fine-tuning both the embeddings and any fine-tuned models or few-shot LLMs in future.
- **Use GPT-4 as a labeller:** When running evaluations, it can help to use GPT-4 as a gatekeeper for human annotation. Human annotation is costly and time-consuming, so doing an initial evaluation run with GPT-4 can help set a quality bar that needs to be met to justify human labelling. At this stage I would not suggest using GPT-4 as your only labeller, but it can certainly ease the burden.
    - This approach is outlined further in [this paper](https://arxiv.org/abs/2108.13487).
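
As one possible shape for "log everything", the sketch below appends one JSON line per evaluated question together with the parameters used for the run. The file name, parameter keys and model name are hypothetical, and a production system would more likely write to a database or logging service than a local file.

```python
import json
import tempfile
import time
import uuid
from pathlib import Path


def log_evaluation_run(results, params, log_path):
    """Append one JSON line per question so that runs can be compared later.

    results: iterable of (question, answer, evaluation) tuples
    params: dict describing the settings used for this run
    """
    run_id = uuid.uuid4().hex
    timestamp = time.time()
    with open(log_path, "a", encoding="utf-8") as log_file:
        for question, answer, evaluation in results:
            record = {
                "run_id": run_id,
                "timestamp": timestamp,
                "params": params,
                "question": question,
                "answer": answer,
                "evaluation": evaluation,
            }
            log_file.write(json.dumps(record) + "\n")
    return run_id


# Example usage with a temporary file and made-up parameters
log_path = Path(tempfile.mkdtemp()) / "evaluation_runs.jsonl"
run_id = log_evaluation_run(
    results=[("What is the capital of Australia?", "The capital of Australia is Canberra.", "correct")],
    params={"chat_model": "example-model", "top_k": 5, "memory_k": 2},
    log_path=log_path,
)
logged = [json.loads(line) for line in log_path.read_text(encoding="utf-8").splitlines()]
```

Storing the parameters on every record is deliberately redundant: it means any single line can be traced back to the exact configuration that produced it.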

We'll use these principles to make a quick evaluation framework where we will:
- Use GPT-4 to make a list of hypothetical questions on our topic
- Ask our Agent the questions and save question/answer tuples
    - These two above steps simulate the actual users interacting with your application
- Get GPT-4 to evaluate whether the answers correctly respond to the questions
- Look at our results to measure how well the Agent answered the questions
- Plan remedial action

In [34]:
import time

# Build a prompt asking the model to generate a list of general knowledge questions to evaluate our Agent with
evaluation_question_prompt = '''You are a helpful Wikipedia assistant who will generate a list of 10 creative general knowledge questions in markdown format.

Example:
- Explain how photons work
- What is Thomas Dolby known for?
- What are some key events of the 20th century?

Begin!
'''

try:
    # We'll use our model to generate 10 hypothetical questions to evaluate
    question = openai.ChatCompletion.create(model=CHAT_MODEL
                                            ,messages=[{"role":"user","content":evaluation_question_prompt}]
                                            ,temperature=0.9)
    evaluation_questions = question['choices'][0]['message']['content']
except Exception as e:
    print(e)


In [35]:
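# Split the response into one question per line, dropping any blank lines
cleaned_questions = [line.strip() for line in evaluation_questions.split('\n') if line.strip()]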
print(cleaned_questions)

['1. What is the difference between weather and climate?', '2. Who designed the Eiffel Tower?', '3. What is the capital of Australia?', '4. What is the chemical symbol for gold?', '5. Who invented the telephone?', '6. What is the largest organ in the human body?', '7. Which famous artist painted the Mona Lisa?', '8. What is the highest mountain in Africa?', '9. What famous building was destroyed during the September 11th attacks?', '10. Who wrote the novel "To Kill a Mockingbird"?']


In [36]:
# We'll use our agent to answer the generated questions to simulate users interacting with the system
question_answer_pairs = []

for question in cleaned_questions:
    memory = ConversationBufferWindowMemory(k=2)
    
    agent_executor = AgentExecutor.from_agent_and_tools(agent=agent
                                                        , tools=tools
                                                        , verbose=False
                                                        ,memory=memory)
    try:
        
        answer = agent_executor.run(question)
    except Exception as e:
        print(question)
        print(e)
        answer = 'Unable to answer question'
    question_answer_pairs.append((question,answer))
    time.sleep(2)

In [37]:
len(question_answer_pairs), question_answer_pairs[:5]

(10,
 [('1. What is the difference between weather and climate?',
   'Weather refers to short-term atmospheric conditions in a specific area, while climate refers to long-term patterns and trends of weather in a particular region over a period of time.'),
  ('2. Who designed the Eiffel Tower?',
   'Gustave Eiffel designed the Eiffel Tower.'),
  ('3. What is the capital of Australia?',
   'The capital of Australia is Canberra.'),
  ('4. What is the chemical symbol for gold?',
   'The chemical symbol for gold is Au.'),
  ('5. Who invented the telephone?',
   'Alexander Graham Bell invented the telephone.')])

In [38]:
# Build a prompt to provide the original query, the result and ask to evaluate for the user
gpt_evaluator_system = '''You are WikiGPT, a helpful Wikipedia expert.
You will be presented with general knowledge questions our users have asked.

Think about this step by step:
- You need to decide whether the answer adequately answers the question
- If it answers the question, you will say "Correct"
- If it doesn't answer the question, you will say one of the following:
    - If it couldn't answer at all, you will say "Unable to answer"
    - If the answer was provided but was incorrect, you will say "Incorrect" 
- If none of these rules are met, say "Unable to evaluate"

Evaluation can only be "Correct", "Incorrect", "Unable to answer", and "Unable to evaluate"

Example 1:

Question: What is the cost cap for the 2023 season of Formula 1?

Answer: The cost cap for 2023 is 95m USD.

Evaluation: Correct

Example 2:

Question: What is Thomas Dolby known for?

Answer: Inventing electricity

Evaluation: Incorrect

Begin!'''

# We'll provide our evaluator the questions and answers we've generated and get it to evaluate them as one of our four evaluation categories.
gpt_evaluator_message = '''
Question: {question}

Answer: {answer}

Evaluation:'''

In [39]:
evaluation_output = []

In [40]:
for pair in question_answer_pairs:
    
    message = gpt_evaluator_message.format(question=pair[0]
                                           ,answer=pair[1])
    evaluation = openai.ChatCompletion.create(model=CHAT_MODEL
                                              ,messages=[{"role":"system","content":gpt_evaluator_system}
                                                         ,{"role":"user","content":message}]
                                              ,temperature=0)
    
    evaluation_output.append((pair[0]
                              ,pair[1]
                              ,evaluation['choices'][0]['message']['content']))

In [41]:
# We'll smooth the results for a simpler evaluation matrix
# In a real scenario we would take time and tune our prompt/add few shot examples to ensure consistent output from the evaluation step
def collate_results(x):
    text = x.lower()
    
    if 'incorrect' in text:
        return 'incorrect'
    elif 'correct' in text:
        return 'correct'
    else: 
        return 'unable to answer'

In [42]:
eval_df = pd.DataFrame(evaluation_output)
eval_df.columns = ['question','answer','evaluation']
# Collapse any label that is neither "correct" nor "incorrect" (including "unable to evaluate") into "unable to answer"
eval_df['evaluation'] = eval_df['evaluation'].apply(collate_results)
eval_df.evaluation.value_counts()

correct    10
Name: evaluation, dtype: int64

#### Analysis

Depending on how GPT did here you may have actually gotten some good responses, but in all likelihood in the real world you'll end up with incorrect or unable to answer results, and will need to tune your search, LLM or another aspect of the pipeline.

Your remediation plan could be as follows:
- **Incorrect answers:** Either prompt engineering to help the model work out how to answer better (maybe even a bigger model like GPT-4), or search optimisation to return more relevant chunks. Chunking/embedding changes may help this as well - larger chunks may give more context, allowing the model to formulate a better answer.
- **Unable to answer:** This is either a retrieval problem, or the data doesn't exist in our knowledge base. We can prompt engineer to classify questions that are "out-of-bounds" and give the user a stock reply, or we can tune our search so the relevant data is returned.

This is the framework we'll build on to get our knowledge retrieval solution to production - again, log everything and store each run down to a question level so you can track regressions and iterate towards your production solution.
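
Tracking regressions at question level can be as simple as comparing the labels from two runs. The sketch below assumes each run has been reduced to a dict mapping question to evaluation label; the function name and the example labels are made up for illustration.

```python
def find_regressions(previous_run, current_run):
    """Return questions that were correct in the previous run but are not in the current one."""
    return sorted(
        question
        for question, label in current_run.items()
        if previous_run.get(question) == "correct" and label != "correct"
    )


previous_run = {
    "Who designed the Eiffel Tower?": "correct",
    "What is the capital of Australia?": "correct",
}
current_run = {
    "Who designed the Eiffel Tower?": "correct",
    "What is the capital of Australia?": "unable to answer",
}
regressions = find_regressions(previous_run, current_run)
```

A check like this can run after every change to the prompt, chunking or search settings, so that an improvement on one set of questions does not silently break another.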

## Conclusion

This concludes our Enterprise Knowledge Retrieval walkthrough. We hope you've found it useful, and that you're now in a position to build enterprise knowledge retrieval solutions, and have a few tricks to start you down the road of putting them into production.