{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Ol5OkztZqoAW"
},
"source": [
"# Question Answering with LangChain, Deep Lake, & OpenAI\n",
"\n",
"This notebook shows how to implement a question answering system using LangChain, with [Deep Lake](https://activeloop.ai/) as the vector store and OpenAI embeddings. We will take the following steps:\n",
"\n",
"1. Load a Deep Lake text dataset\n",
"2. Initialize a [Deep Lake vector store with LangChain](https://docs.activeloop.ai/tutorials/vector-store/deep-lake-vector-store-in-langchain)\n",
"3. Add text to the vector store\n",
"4. Run queries on the database\n",
"5. Done!\n",
"\n",
"You can also follow other tutorials, such as [chatting with any data](https://www.activeloop.ai/resources/data-chad-an-ai-app-with-lang-chain-deep-lake-to-chat-with-any-data/) (PDFs, JSON, CSV, text) stored in Deep Lake, [code understanding](https://www.activeloop.ai/resources/lang-chain-gpt-4-for-code-understanding-twitter-algorithm/), [question answering over PDFs](https://www.activeloop.ai/resources/ultimate-guide-to-lang-chain-deep-lake-build-chat-gpt-to-answer-questions-on-your-financial-data/), or [recommending songs](https://www.activeloop.ai/resources/3-ways-to-build-a-recommendation-engine-for-songs-with-lang-chain/)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6uKh5KahrBs3"
},
"source": [
"## Install requirements\n",
"Let's install the following packages."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cPsdluAqqnRH"
},
"outputs": [],
"source": [
"!pip install deeplake langchain openai tiktoken"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "IUm1NzURrGte"
},
"source": [
"## Authentication\n",
"Provide your OpenAI API key here:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Q_-OiwJzrJ8m",
"outputId": "b11b0d5c-cbd4-469d-95d1-fcd7149bd493"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"··········\n"
]
}
],
"source": [
"import getpass\n",
"import os\n",
"\n",
"os.environ['OPENAI_API_KEY'] = getpass.getpass()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ok-hgiotrLmS"
},
"source": [
"## Load a Deep Lake text dataset\n",
"We will use a 20,000-sample subset of the [cohere-wikipedia-22](https://app.activeloop.ai/davitbun/cohere-wikipedia-22) dataset for this example."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "cIj5g4smrwOm",
"outputId": "6315bd53-8a2f-40ef-b2f5-2687c90b2231"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Opening dataset in read-only mode as you don't have write permissions.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/activeloop/cohere-wikipedia-22-sample\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"hub://activeloop/cohere-wikipedia-22-sample loaded successfully.\n",
"\n",
"Dataset(path='hub://activeloop/cohere-wikipedia-22-sample', read_only=True, tensors=['ids', 'metadata', 'text'])\n",
"\n",
" tensor htype shape dtype compression\n",
" ------- ------- ------- ------- ------- \n",
" ids text (20000, 1) str None \n",
" metadata json (20000, 1) str None \n",
" text text (20000, 1) str None \n"
]
}
],
"source": [
"import deeplake\n",
"\n",
"ds = deeplake.load(\"hub://activeloop/cohere-wikipedia-22-sample\")\n",
"ds.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oY6FHqovHPfJ"
},
"source": [
"Let's take a look at a few samples:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "IWPYDrtUHPEr",
"outputId": "91e1b13e-abd0-4709-f65c-87986e90181a"
},
"outputs": [
{
"data": {
"text/plain": [
"['The 24-hour clock is a way of telling the time in which the day runs from midnight to midnight and is divided into 24 hours, numbered from 0 to 23. It does not use a.m. or p.m. This system is also referred to (only in the US and the English speaking parts of Canada) as military time or (only in the United Kingdom and now very rarely) as continental time. In some parts of the world, it is called railway time. Also, the international standard notation of time (ISO 8601) is based on this format.',\n",
" 'A time in the 24-hour clock is written in the form hours:minutes (for example, 01:23), or hours:minutes:seconds (01:23:45). Numbers under 10 have a zero in front (called a leading zero); e.g. 09:07. Under the 24-hour clock system, the day begins at midnight, 00:00, and the last minute of the day begins at 23:59 and ends at 24:00, which is identical to 00:00 of the following day. 12:00 can only be mid-day. Midnight is called 24:00 and is used to mean the end of the day and 00:00 is used to mean the beginning of the day. For example, you would say \"Tuesday at 24:00\" and \"Wednesday at 00:00\" to mean exactly the same time.',\n",
" 'However, the US military prefers not to say 24:00 - they do not like to have two names for the same thing, so they always say \"23:59\", which is one minute before midnight.']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds[:3].text.data()[\"value\"]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JRFPjoDaGcSa"
},
"source": [
"## LangChain's Deep Lake vector store\n",
"Let's define a `dataset_path`: this is where your Deep Lake vector store will house the text embeddings."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "Klobw6_T257K"
},
"outputs": [],
"source": [
"dataset_path = 'wikipedia-embeddings-deeplake'"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "IW6BZubFGgu2"
},
"source": [
"We will set up OpenAI's `text-embedding-3-small` as our embedding function and initialize a Deep Lake vector store at `dataset_path`..."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ykE3HgSl5mcg",
"outputId": "dde4d6bb-6c82-473e-f37d-3f03a358ee8b"
},
"outputs": [],
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.vectorstores import DeepLake\n",
"\n",
"embedding = OpenAIEmbeddings(model=\"text-embedding-3-small\")\n",
"db = DeepLake(dataset_path, embedding=embedding, overwrite=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6mt2S1XpGj-D"
},
"source": [
"... and populate it with samples, one batch at a time, using the `add_texts` method."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 275,
"referenced_widgets": [
"30a05f9f55ae454ba75137634896e82a",
"0add33db728844a59c1ffa53e18fab98",
"26bf0f01ac414ab0b0da34971ba8cbdf",
"b595729257c34311a1c21b103a20bbb8",
"6a75dce7a6b84148a0515e30f116ee07",
"1dbe1466e8ba47b1898864ca5aa22f30",
"90c56b9af48d480b93c027032e44c9dd",
"06099626b6e34bf6acf06e53673d08e7",
"b8af7a2bffad44cea5264191b5079995",
"d397a65b169647588cf2eaf8342dde5e",
"2f9e6758a17441359021a6b66cff1dea"
]
},
"id": "hFJTvNGE53lS",
"outputId": "200e3808-1309-4520-9b42-6b59cfc506e6"
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "30a05f9f55ae454ba75137634896e82a",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"creating embeddings: 0%| | 0/1 [00:00<?, ?it/s]\n",
"creating embeddings: 100%|██████████| 1/1 [00:02<00:00, 2.11s/it]\n",
"\n",
"100%|██████████| 10/10 [00:00<00:00, 462.42it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset(path='wikipedia-embeddings-deeplake', tensors=['text', 'metadata', 'embedding', 'id'])\n",
"\n",
" tensor htype shape dtype compression\n",
" ------- ------- ------- ------- ------- \n",
" text text (10, 1) str None \n",
" metadata json (10, 1) str None \n",
" embedding embedding (10, 1536) float32 None \n",
" id text (10, 1) str None \n"
]
}
],
"source": [
"from tqdm.auto import tqdm\n",
"\n",
"batch_size = 100\n",
"\n",
"nsamples = 10 # for testing. Replace with len(ds) to append everything\n",
"for i in tqdm(range(0, nsamples, batch_size)):\n",
" # find end of batch\n",
" i_end = min(nsamples, i + batch_size)\n",
"\n",
" batch = ds[i:i_end]\n",
" id_batch = batch.ids.data()[\"value\"]\n",
" text_batch = batch.text.data()[\"value\"]\n",
" meta_batch = batch.metadata.data()[\"value\"]\n",
"\n",
" db.add_texts(text_batch, metadatas=meta_batch, ids=id_batch)"
]
},
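{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above slices the dataset into consecutive index ranges. As a minimal sketch of just that index arithmetic (plain Python, no Deep Lake calls; `batch_bounds` is a hypothetical helper name that is not part of either library):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def batch_bounds(n_samples, batch_size):\n",
"    # Hypothetical helper: yield (start, end) index pairs covering range(n_samples) in order\n",
"    for start in range(0, n_samples, batch_size):\n",
"        # The last batch may be shorter than batch_size\n",
"        yield start, min(n_samples, start + batch_size)\n",
"\n",
"# With the values used above, everything fits in a single batch\n",
"print(list(batch_bounds(10, 100)))   # [(0, 10)]\n",
"# A larger run is split into full batches plus a shorter final one\n",
"print(list(batch_bounds(250, 100)))  # [(0, 100), (100, 200), (200, 250)]"
]
},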
{
"cell_type": "markdown",
"metadata": {
"id": "4AwidW4MGnNH"
},
"source": [
"## Run user queries on the database\n",
"The underlying Deep Lake dataset object is accessible through `db.vectorstore.dataset`, and the data structure can be summarized using `db.vectorstore.summary()`, which shows four tensors with 10 samples each:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "RSp2aF9nGrgj",
"outputId": "d8d370ff-52ee-42ed-ceb2-c48e8c4ada8f"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset(path='wikipedia-embeddings-deeplake', tensors=['text', 'metadata', 'embedding', 'id'])\n",
"\n",
" tensor htype shape dtype compression\n",
" ------- ------- ------- ------- ------- \n",
" text text (10, 1) str None \n",
" metadata json (10, 1) str None \n",
" embedding embedding (10, 1536) float32 None \n",
" id text (10, 1) str None \n"
]
}
],
"source": [
"db.vectorstore.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NMcH9pRsGrUW"
},
"source": [
"We will now set up QA on our vector store with GPT-3.5-Turbo as our LLM."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"id": "ywS3cL5oUHGL"
},
"outputs": [],
"source": [
"from langchain.chains import RetrievalQA\n",
"from langchain.chat_models import ChatOpenAI\n",
"\n",
"# Re-load the vector store in case it's no longer initialized\n",
"# db = DeepLake(dataset_path=dataset_path, embedding=embedding)\n",
"\n",
"qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(model='gpt-3.5-turbo'), chain_type=\"stuff\", retriever=db.as_retriever())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZysWCch7Gwf_"
},
"source": [
"Let's try running a prompt and check the output. Internally, the chain runs an embedding search over the vector store to find the most relevant texts and feeds them into the LLM context."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 36
},
"id": "7VaBJgKrFOXu",
"outputId": "951fd6d7-d749-4fd9-9c7b-2ee4c422f65a"
},
"outputs": [
{
"data": {
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
},
"text/plain": [
"'The military prefers not to say 24:00 because they do not like to have two names for the same thing. Instead, they always say \"23:59\", which is one minute before midnight.'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = 'Why does the military not say 24:00?'\n",
"qa.run(query)"
]
},
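{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the retrieval step concrete, here is a toy sketch of an embedding search: rank stored vectors by cosine similarity to a query vector and keep the top `k`. The three-dimensional vectors and the `cosine` and `top_k` helpers are made up for illustration. This is not Deep Lake's or LangChain's implementation, and the real embeddings stored above have 1536 dimensions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def cosine(a, b):\n",
"    # Cosine similarity between two equal-length vectors\n",
"    dot = sum(x * y for x, y in zip(a, b))\n",
"    norm_a = math.sqrt(sum(x * x for x in a))\n",
"    norm_b = math.sqrt(sum(y * y for y in b))\n",
"    return dot / (norm_a * norm_b)\n",
"\n",
"def top_k(query_vec, store, k=2):\n",
"    # Return the k texts whose vectors are most similar to query_vec\n",
"    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)\n",
"    return [text for text, _ in ranked[:k]]\n",
"\n",
"# Made-up (text, vector) pairs standing in for embedded passages\n",
"toy_store = [\n",
"    ('24-hour clock overview', [0.9, 0.1, 0.0]),\n",
"    ('military time and 23:59', [0.8, 0.6, 0.0]),\n",
"    ('unrelated passage', [0.0, 0.1, 0.9]),\n",
"]\n",
"toy_query = [0.7, 0.7, 0.0]  # made-up query embedding\n",
"\n",
"print(top_k(toy_query, toy_store))  # ['military time and 23:59', '24-hour clock overview']"
]
},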
{
"cell_type": "markdown",
"metadata": {
"id": "xX0lLg9xG0Rk"
},
"source": [
"Et voilà!"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}