{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Summarization with Controllable Detail"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The objective of this notebook is to demonstrate how to summarize large documents with a controllable level of detail.\n",
    " \n",
    "If you give a GPT model the task of summarizing a long document (e.g. 10k or more tokens), you'll tend to get back a relatively short summary that isn't proportional to the length of the document. For instance, a summary of a 20k token document will not be twice as long as a summary of a 10k token document. One way we can fix this is to split our document up into pieces, and produce a summary piecewise. After many queries to a GPT model, the full summary can be reconstructed. By controlling the number of text chunks and their sizes, we can ultimately control the level of detail in the output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-10T05:19:35.305706Z",
     "start_time": "2024-04-10T05:19:35.303535Z"
    }
   },
   "outputs": [],
   "source": [
    "import os\n",
    "from typing import List, Tuple, Optional\n",
    "\n",
    "from openai import OpenAI\n",
    "import tiktoken\n",
    "from tqdm import tqdm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-10T05:19:35.325026Z",
     "start_time": "2024-04-10T05:19:35.322414Z"
    }
   },
   "outputs": [],
   "source": [
    "# open dataset containing part of the text of the Wikipedia page for the United States\n",
    "with open(\"data/united_states_wikipedia.txt\", \"r\") as file:\n",
    "    united_states_wikipedia_text = file.read()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-10T05:19:35.364483Z",
     "start_time": "2024-04-10T05:19:35.348213Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": "15781"
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# load encoding and check the length of dataset\n",
    "encoding = tiktoken.encoding_for_model('gpt-3.5-turbo')\n",
    "len(encoding.encode(united_states_wikipedia_text))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll define a simple utility to wrap calls to the OpenAI API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-10T05:19:35.375619Z",
     "start_time": "2024-04-10T05:19:35.365818Z"
    }
   },
   "outputs": [],
   "source": [
    "client = OpenAI(api_key=os.getenv(\"OPENAI_API_KEY\"))\n",
    "\n",
    "def get_chat_completion(messages, model='gpt-3.5-turbo'):\n",
    "    response = client.chat.completions.create(\n",
    "        model=model,\n",
    "        messages=messages,\n",
    "        temperature=0,\n",
    "    )\n",
    "    return response.choices[0].message.content"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we'll define some utilities to chunk a large document into smaller pieces."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-10T05:19:35.382790Z",
     "start_time": "2024-04-10T05:19:35.376721Z"
    }
   },
   "outputs": [],
   "source": [
    "def tokenize(text: str) -> List[str]:\n",
    "    encoding = tiktoken.encoding_for_model('gpt-3.5-turbo')\n",
    "    return encoding.encode(text)\n",
    "\n",
    "\n",
    "# This function chunks a text into smaller pieces based on a maximum token count and a delimiter.\n",
    "def chunk_on_delimiter(input_string: str,\n",
    "                       max_tokens: int, delimiter: str) -> List[str]:\n",
    "    chunks = input_string.split(delimiter)\n",
    "    combined_chunks, _, dropped_chunk_count = combine_chunks_with_no_minimum(\n",
    "        chunks, max_tokens, chunk_delimiter=delimiter, add_ellipsis_for_overflow=True\n",
    "    )\n",
    "    if dropped_chunk_count > 0:\n",
    "        print(f\"warning: {dropped_chunk_count} chunks were dropped due to overflow\")\n",
    "    combined_chunks = [f\"{chunk}{delimiter}\" for chunk in combined_chunks]\n",
    "    return combined_chunks\n",
    "\n",
    "\n",
    "# This function combines text chunks into larger blocks without exceeding a specified token count. It returns the combined text blocks, their original indices, and the count of chunks dropped due to overflow.\n",
    "def combine_chunks_with_no_minimum(\n",
    "        chunks: List[str],\n",
    "        max_tokens: int,\n",
    "        chunk_delimiter=\"\\n\\n\",\n",
    "        header: Optional[str] = None,\n",
    "        add_ellipsis_for_overflow=False,\n",
    ") -> Tuple[List[str], List[List[int]], int]:\n",
    "    dropped_chunk_count = 0\n",
    "    output = []  # list to hold the final combined chunks\n",
    "    output_indices = []  # list to hold the indices of the final combined chunks\n",
    "    candidate = (\n",
    "        [] if header is None else [header]\n",
    "    )  # list to hold the current combined chunk candidate\n",
    "    candidate_indices = []\n",
    "    for chunk_i, chunk in enumerate(chunks):\n",
    "        chunk_with_header = [chunk] if header is None else [header, chunk]\n",
    "        if len(tokenize(chunk_delimiter.join(chunk_with_header))) > max_tokens:\n",
    "            print(\"warning: chunk overflow\")\n",
    "            if (\n",
    "                add_ellipsis_for_overflow\n",
    "                and len(tokenize(chunk_delimiter.join(candidate + [\"...\"]))) <= max_tokens\n",
    "            ):\n",
    "                candidate.append(\"...\")\n",
    "                dropped_chunk_count += 1\n",
    "            continue  # this case would break downstream assumptions\n",
    "        # estimate token count with the current chunk added\n",
    "        extended_candidate_token_count = len(tokenize(chunk_delimiter.join(candidate + [chunk])))\n",
    "        # If the token count exceeds max_tokens, add the current candidate to output and start a new candidate\n",
    "        if extended_candidate_token_count > max_tokens:\n",
    "            output.append(chunk_delimiter.join(candidate))\n",
    "            output_indices.append(candidate_indices)\n",
    "            candidate = chunk_with_header  # re-initialize candidate\n",
    "            candidate_indices = [chunk_i]\n",
    "        # otherwise keep extending the candidate\n",
    "        else:\n",
    "            candidate.append(chunk)\n",
    "            candidate_indices.append(chunk_i)\n",
    "    # add the remaining candidate to output if it's not empty\n",
    "    if (header is not None and len(candidate) > 1) or (header is None and len(candidate) > 0):\n",
    "        output.append(chunk_delimiter.join(candidate))\n",
    "        output_indices.append(candidate_indices)\n",
    "    return output, output_indices, dropped_chunk_count"
   ]
  },
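  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick illustration (a sketch added for this walkthrough, not part of the original pipeline), we can call `chunk_on_delimiter` on a toy string to see how sentences get packed into chunks under a small token budget. The toy text and budget here are made up purely for demonstration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# illustrative only: pack a few short sentences into chunks of at most 12 tokens each\n",
    "toy_text = \"First sentence. Second sentence. Third sentence.\"\n",
    "toy_chunks = chunk_on_delimiter(toy_text, max_tokens=12, delimiter=\".\")\n",
    "print(toy_chunks)"
   ]
  },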
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can define a utility to summarize text with a controllable level of detail (note the `detail` parameter).\n",
    "\n",
    "The function first determines the number of chunks by interpolating between a minimum and a maximum chunk count based on a controllable `detail` parameter. It then splits the text into chunks and summarizes each chunk."
   ]
  },
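  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the interpolation concrete (an illustrative aside mirroring the formula used inside `summarize`): with `min_chunks = 1` and a hypothetical `max_chunks = 33`, the chunk count scales linearly with `detail`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# illustrative only: how `detail` interpolates between min and max chunk counts\n",
    "min_chunks, max_chunks = 1, 33  # max_chunks depends on the document; 33 is hypothetical\n",
    "for detail in [0, 0.25, 0.5, 1.0]:\n",
    "    print(detail, int(min_chunks + detail * (max_chunks - min_chunks)))"
   ]
  },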
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-10T05:19:35.390876Z",
     "start_time": "2024-04-10T05:19:35.385076Z"
    }
   },
   "outputs": [],
   "source": [
    "def summarize(text: str,\n",
    "              detail: float = 0,\n",
    "              model: str = 'gpt-3.5-turbo',\n",
    "              additional_instructions: Optional[str] = None,\n",
    "              minimum_chunk_size: Optional[int] = 500,\n",
    "              chunk_delimiter: str = \".\",\n",
    "              summarize_recursively: bool = False,\n",
    "              verbose: bool = False):\n",
    "    \"\"\"\n",
    "    Summarizes a given text by splitting it into chunks, each of which is summarized individually. \n",
    "    The level of detail in the summary can be adjusted, and the process can optionally be made recursive.\n",
    "\n",
    "    Parameters:\n",
    "    - text (str): The text to be summarized.\n",
    "    - detail (float, optional): A value between 0 and 1 indicating the desired level of detail in the summary.\n",
    "      0 leads to a higher level summary, and 1 results in a more detailed summary. Defaults to 0.\n",
    "    - model (str, optional): The model to use for generating summaries. Defaults to 'gpt-3.5-turbo'.\n",
    "    - additional_instructions (Optional[str], optional): Additional instructions to provide to the model for customizing summaries.\n",
    "    - minimum_chunk_size (Optional[int], optional): The minimum size for text chunks. Defaults to 500.\n",
    "    - chunk_delimiter (str, optional): The delimiter used to split the text into chunks. Defaults to \".\".\n",
    "    - summarize_recursively (bool, optional): If True, summaries are generated recursively, using previous summaries for context.\n",
    "    - verbose (bool, optional): If True, prints detailed information about the chunking process.\n",
    "\n",
    "    Returns:\n",
    "    - str: The final compiled summary of the text.\n",
    "\n",
    "    The function first determines the number of chunks by interpolating between a minimum and a maximum chunk count based on the `detail` parameter. \n",
    "    It then splits the text into chunks and summarizes each chunk. If `summarize_recursively` is True, each summary is based on the previous summaries, \n",
    "    adding more context to the summarization process. The function returns a compiled summary of all chunks.\n",
    "    \"\"\"\n",
    "\n",
    "    # check detail is set correctly\n",
    "    assert 0 <= detail <= 1\n",
    "\n",
    "    # interpolate the number of chunks to get the specified level of detail\n",
    "    max_chunks = len(chunk_on_delimiter(text, minimum_chunk_size, chunk_delimiter))\n",
    "    min_chunks = 1\n",
    "    num_chunks = int(min_chunks + detail * (max_chunks - min_chunks))\n",
    "\n",
    "    # adjust chunk_size based on interpolated number of chunks\n",
    "    document_length = len(tokenize(text))\n",
    "    chunk_size = max(minimum_chunk_size, document_length // num_chunks)\n",
    "    text_chunks = chunk_on_delimiter(text, chunk_size, chunk_delimiter)\n",
    "    if verbose:\n",
    "        print(f\"Splitting the text into {len(text_chunks)} chunks to be summarized.\")\n",
    "        print(f\"Chunk lengths are {[len(tokenize(x)) for x in text_chunks]}\")\n",
    "\n",
    "    # set system message\n",
    "    system_message_content = \"Summarize the following text.\"\n",
    "    if additional_instructions is not None:\n",
    "        system_message_content += f\"\\n\\n{additional_instructions}\"\n",
    "\n",
    "    accumulated_summaries = []\n",
    "    for chunk in tqdm(text_chunks):\n",
    "        if summarize_recursively and accumulated_summaries:\n",
    "            # Creating a structured prompt for recursive summarization\n",
    "            accumulated_summaries_string = '\\n\\n'.join(accumulated_summaries)\n",
    "            user_message_content = f\"Previous summaries:\\n\\n{accumulated_summaries_string}\\n\\nText to summarize next:\\n\\n{chunk}\"\n",
    "        else:\n",
    "            # Directly passing the chunk for summarization without recursive context\n",
    "            user_message_content = chunk\n",
    "\n",
    "        # Constructing messages based on whether recursive summarization is applied\n",
    "        messages = [\n",
    "            {\"role\": \"system\", \"content\": system_message_content},\n",
    "            {\"role\": \"user\", \"content\": user_message_content}\n",
    "        ]\n",
    "\n",
    "        # Summarize this chunk\n",
    "        response = get_chat_completion(messages, model=model)\n",
    "        accumulated_summaries.append(response)\n",
    "\n",
    "    # Compile final summary from partial summaries\n",
    "    final_summary = '\\n\\n'.join(accumulated_summaries)\n",
    "\n",
    "    return final_summary"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "Now we can use this utility to produce summaries with varying levels of detail. By increasing `detail` from 0 to 1 we get progressively longer summaries of the underlying document. A higher value for the `detail` parameter results in a more detailed summary because the utility first splits the document into a greater number of chunks. Each chunk is then summarized, and the final summary is a concatenation of all the chunk summaries."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-10T05:19:47.541096Z",
     "start_time": "2024-04-10T05:19:35.391911Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Splitting the text into 1 chunks to be summarized.\n",
      "Chunk lengths are [15781]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 1/1 [00:07<00:00, 7.31s/it]\n"
     ]
    }
   ],
   "source": [
    "summary_with_detail_0 = summarize(united_states_wikipedia_text, detail=0, verbose=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-10T05:19:58.724212Z",
     "start_time": "2024-04-10T05:19:47.542129Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Splitting the text into 5 chunks to be summarized.\n",
      "Chunk lengths are [3945, 3941, 3943, 3915, 37]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 5/5 [00:09<00:00, 1.97s/it]\n"
     ]
    }
   ],
   "source": [
    "summary_with_detail_pt1 = summarize(united_states_wikipedia_text, detail=0.1, verbose=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-10T05:20:16.216023Z",
     "start_time": "2024-04-10T05:19:58.725014Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Splitting the text into 8 chunks to be summarized.\n",
      "Chunk lengths are [2214, 2253, 2249, 2255, 2254, 2255, 2221, 84]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 8/8 [00:16<00:00, 2.08s/it]\n"
     ]
    }
   ],
   "source": [
    "summary_with_detail_pt2 = summarize(united_states_wikipedia_text, detail=0.2, verbose=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-10T05:20:46.941240Z",
     "start_time": "2024-04-10T05:20:16.225524Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Splitting the text into 14 chunks to be summarized.\n",
      "Chunk lengths are [1198, 1209, 1210, 1209, 1212, 1192, 1176, 1205, 1212, 1201, 1210, 1210, 1192, 154]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 14/14 [00:30<00:00, 2.15s/it]\n"
     ]
    }
   ],
   "source": [
    "summary_with_detail_pt4 = summarize(united_states_wikipedia_text, detail=0.4, verbose=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-10T05:21:44.913140Z",
     "start_time": "2024-04-10T05:20:46.953285Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Splitting the text into 27 chunks to be summarized.\n",
      "Chunk lengths are [602, 596, 601, 601, 604, 598, 572, 594, 592, 592, 604, 593, 578, 582, 597, 600, 596, 555, 582, 601, 582, 587, 581, 595, 598, 568, 445]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 27/27 [00:57<00:00, 2.13s/it]\n"
     ]
    }
   ],
   "source": [
    "summary_with_detail_pt8 = summarize(united_states_wikipedia_text, detail=0.8, verbose=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-10T05:22:57.760218Z",
     "start_time": "2024-04-10T05:21:44.921275Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Splitting the text into 33 chunks to be summarized.\n",
      "Chunk lengths are [490, 443, 475, 490, 501, 470, 472, 487, 479, 477, 447, 442, 490, 468, 488, 477, 493, 493, 472, 491, 490, 501, 493, 468, 500, 500, 474, 460, 489, 462, 490, 482, 445]\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 33/33 [01:12<00:00, 2.20s/it]\n"
     ]
    }
   ],
   "source": [
    "summary_with_detail_1 = summarize(united_states_wikipedia_text, detail=1.0, verbose=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The original document is ~15k tokens long. Notice how large the gap is between the lengths of `summary_with_detail_0` and `summary_with_detail_1`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-10T05:22:57.782389Z",
     "start_time": "2024-04-10T05:22:57.763041Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": "[307, 494, 839, 1662, 3552, 4128]"
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# lengths of summaries\n",
    "[len(tokenize(x)) for x in\n",
    " [summary_with_detail_0, summary_with_detail_pt1, summary_with_detail_pt2, summary_with_detail_pt4,\n",
    "  summary_with_detail_pt8, summary_with_detail_1]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's inspect the summaries to see how the level of detail changes with the `detail` parameter set to 0, 0.1, 0.2, 0.4, 0.8, and 1 respectively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-10T05:22:57.785881Z",
     "start_time": "2024-04-10T05:22:57.783455Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The United States of America is a diverse country located in North America, with a population exceeding 334 million. It is a federation of 50 states, a federal capital district, and various territories. The country has a rich history, from the migration of Paleo-Indians over 12,000 years ago to the British colonization and the American Revolution. The U.S. has gone through significant events like the Civil War, World War II, and the Cold War, emerging as a superpower after the collapse of the Soviet Union.\n",
      "\n",
      "The U.S. government is a presidential constitutional republic with three separate branches: legislative, executive, and judicial. The country has a strong emphasis on liberty, equality under the law, individualism, and limited government. Economically, the U.S. has the largest nominal GDP in the world and is a leader in economic competitiveness, innovation, and human rights. The U.S. is also a founding member of various international organizations like the UN, World Bank, and NATO.\n",
      "\n",
      "The U.S. has a rich cultural landscape, with influences from various ethnic groups and traditions. American literature, music, cinema, and theater have made significant contributions to global culture. The country is known for its diverse cuisine, with dishes influenced by various immigrant groups. The U.S. also has a strong presence in the fashion industry, with New York City being a global fashion capital.\n",
      "\n",
      "Overall, the United States is a country with a rich history, diverse population, strong economy, and significant cultural influence on the world stage.\n"
     ]
    }
   ],
   "source": [
    "print(summary_with_detail_0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-10T05:22:57.788969Z",
     "start_time": "2024-04-10T05:22:57.786691Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The United States of America is a country located in North America, consisting of 50 states, a federal capital district, and various territories. It has a rich history, from the arrival of Paleo-Indians over 12,000 years ago to British colonization and the American Revolution. The U.S. expanded across North America, faced sectional divisions over slavery, and emerged as a global superpower after World War II. The country has a presidential constitutional republic with three branches of government and a strong emphasis on liberty, equality, and limited government. Economically, the U.S. is a major player with the largest nominal GDP in the world and significant influence in various international organizations. The country's name, history, and expansion are detailed, including key events like the Declaration of Independence, the Revolutionary War, and the Louisiana Purchase.\n",
      "\n",
      "The text discusses key events in American history, including the Missouri Compromise, Indian removal policies, the Civil War, Reconstruction era, post-Civil War developments, rise as a superpower, Cold War era, and contemporary history. It highlights significant events such as the Trail of Tears, California Gold Rush, Reconstruction Amendments, immigration waves, World Wars, Cold War tensions, civil rights movement, economic developments, technological advancements, and major conflicts like the Gulf War and War on Terror. The text also mentions social changes, economic challenges like the Great Depression and Great Recession, and political developments leading to increased polarization in the 2010s.\n",
      "\n",
      "The text discusses the geography, climate, biodiversity, conservation efforts, government, politics, political parties, subdivisions, and foreign relations of the United States. It highlights the country's physical features, climate diversity, environmental issues, governmental structure, political parties, state subdivisions, and diplomatic relations. The text also mentions the historical context of the country's political system, including the development of political parties and the structure of the federal government.\n",
      "\n",
      "The text discusses the United States' international relations, military capabilities, law enforcement, crime rates, economy, and science and technology advancements. It highlights the country's membership in various international organizations, its military strength, economic dominance, income inequality, and technological innovations. The United States has strong diplomatic ties with several countries, a significant military presence globally, a large economy with high GDP, and is a leader in technological advancements and scientific research.\n",
      "\n",
      "The text discusses various aspects of the United States, including its scientific and innovation rankings, energy consumption, transportation infrastructure, demographics, language diversity, immigration, religion, urbanization, and healthcare. It highlights the country's achievements in scientific research, energy usage, transportation systems, population demographics, language diversity, immigration statistics, religious affiliations, urbanization trends, and healthcare facilities.\n",
      "\n",
      "The text discusses various aspects of life in the United States, including changes in life expectancy, the healthcare system, education, culture, society, literature, and mass media. It highlights the impact of the COVID-19 pandemic on life expectancy, the disparities in healthcare outcomes, the structure of the education system, the cultural diversity and values in American society, the development of American literature, and the media landscape in the country. The text also touches on issues such as healthcare coverage, education spending, student loan debt, and the protection of free speech in the U.S.\n",
      "\n",
      "The text discusses various aspects of American culture, including alternative newspapers in major cities, popular websites, the video game market, theater, visual arts, music, fashion, cinema, and cuisine. It highlights the influence of American culture globally, such as in music, fashion, cinema, and cuisine. The text also mentions significant figures and events in American cultural history, such as the Tony Awards, Broadway theater, the Hudson River School in visual arts, and the Golden Age of Hollywood in cinema. Additionally, it touches on the development of American cuisine, including traditional dishes and the impact of immigrant groups on American food culture.\n",
      "\n",
      "The American fast-food industry, known for pioneering the drive-through format in the 1940s, is considered a symbol of U.S. marketing dominance. Major American companies like McDonald's, Burger King, Pizza Hut, Kentucky Fried Chicken, and Domino's Pizza have a significant global presence with numerous outlets worldwide.\n"
     ]
    }
   ],
   "source": [
    "print(summary_with_detail_pt2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that this utility also allows passing additional instructions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-10T05:33:18.789246Z",
     "start_time": "2024-04-10T05:22:57.789764Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 5/5 [10:19<00:00, 123.94s/it]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "- The USA is a federation of 50 states, a federal capital district, and 326 Indian reservations.\n",
      "- It has sovereignty over five major unincorporated island territories and various uninhabited islands.\n",
      "- The country has a population exceeding 334 million.\n",
      "- The USA has the world's third-largest land area and the largest maritime exclusive economic zone.\n",
      "- The USA has had the largest nominal GDP in the world since 1890.\n",
      "- In 2023, the USA accounted for over 25% of the global economy based on nominal GDP and 15% based on PPP.\n",
      "- The USA has the highest median income per capita of any non-microstate.\n",
      "- The USA ranks high in economic competitiveness, productivity, innovation, human rights, and higher education.\n",
      "- The USA is a founding member of various international organizations such as the World Bank, IMF, NATO, and the UN Security Council.\n",
      "\n",
      "- The Great Society plan of President Lyndon Johnson's administration in the early 1960s resulted in groundbreaking laws and policies to counteract institutional racism.\n",
      "- By 1985, the majority of women aged 16 and older in the U.S. were employed.\n",
      "- In the 1990s, the U.S. saw the longest economic expansion in its history, with advances in technology such as the World Wide Web and the first gene therapy trial.\n",
      "- The U.S. spent $877 billion on its military in 2022, the largest amount globally, making up 39% of global military spending and 3.5% of the country's GDP.\n",
      "- The U.S. has the third-largest combined armed forces in the world, with about 800 bases and facilities abroad and deployments in 25 foreign countries.\n",
      "- As of January 2023, the U.S. had the sixth highest per-capita incarceration rate globally, with almost 2 million people incarcerated.\n",
      "- The U.S. had a nominal GDP of $27 trillion in 2023, the largest in the world, constituting over 25% of the global economy.\n",
      "\n",
      "- Real compounded annual GDP growth in the U.S. was 3.3%, compared to 2.3% for the rest of the Group of Seven.\n",
      "- The U.S. ranks first in the world by disposable income per capita and nominal GDP, second by GDP (PPP) after China, and ninth by GDP (PPP) per capita.\n",
      "- The U.S. has 136 of the world's 500 largest companies headquartered there.\n",
      "- The U.S. dollar is the most used currency in international transactions and is the world's foremost reserve currency.\n",
      "- The U.S. ranked second in the Global Competitiveness Report in 2019, after Singapore.\n",
      "- The U.S. is the second-largest manufacturing country after China as of 2021.\n",
      "- Americans have the highest average household and employee income among OECD member states.\n",
      "- The U.S. has 735 billionaires and nearly 22 million millionaires as of 2023.\n",
      "- In 2022, there were about 582,500 sheltered and unsheltered homeless persons in the U.S.\n",
      "- The U.S. receives approximately 81% of its energy from fossil fuels.\n",
      "- The U.S. has the highest vehicle ownership per capita in the world, with 910 vehicles per 1000 people.\n",
      "- The U.S. has the third-highest number of patent applications and ranked 3rd in the Global Innovation Index in 2023.\n",
      "- The U.S. has the third-highest number of published scientific papers in 2022.\n",
      "- The U.S. has a diverse population with 37 ancestry groups having more than one million members.\n",
      "- The U.S. has the largest Christian population in the world.\n",
      "- The average American life expectancy at birth was 77.5 years in 2022.\n",
      "- The U.S. spends more on education per student than any other country in the world.\n",
      "\n",
      "- The United States has the most Nobel Prize winners in history, with 411 awards won.\n",
      "- American higher education is dominated by state university systems, with private universities enrolling about 20% of students.\n",
      "- The U.S. spends more per student on higher education than the OECD average and all other nations in combined public and private spending.\n",
      "- Student loan debt in the U.S. has increased by 102% in the last decade, exceeding 1.7 trillion dollars as of 2022.\n",
      "- Americans donated 1.44% of total GDP to charity, the highest rate in the world.\n",
      "- The U.S. has the world's largest music market with a total retail value of $15.9 billion in 2022.\n",
      "- The United States restaurant industry was projected at $899 billion in sales for 2020, employing over 15 million people.\n",
      "- The U.S. is home to over 220 Michelin Star rated restaurants, with 70 in New York City alone.\n",
      "- California alone has 444 publishers, developers, and hardware companies in the video game market.\n",
      "- The U.S. fast-food industry pioneered the drive-through format in the 1940s.\n",
      "\n",
      "- American companies mentioned: McDonald's, Burger King, Pizza Hut, Kentucky Fried Chicken, Domino's Pizza\n",
      "- These companies have numerous outlets around the world\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "summary_with_additional_instructions = summarize(united_states_wikipedia_text, detail=0.1,\n",
    "                                                 additional_instructions=\"Write in point form and focus on numerical data.\")\n",
    "print(summary_with_additional_instructions)"
   ]
  },
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Finally, note that the utility allows for recursive summarization, where each chunk's summary is conditioned on the summaries that came before it, adding more context to the summarization process. Enable this by setting the `summarize_recursively` parameter to True. It is more computationally expensive, but can increase the consistency and coherence of the combined summary."
],
"metadata": {
"collapsed": false
}
},
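The controllable-detail idea above can be sketched in isolation. Assuming a `detail` knob in [0, 1], one simple mapping (illustrative only, not necessarily the notebook's exact `summarize` implementation) interpolates linearly between a single chunk and one chunk per fixed token budget:

```python
def num_chunks_for_detail(document_tokens: int, detail: float, chunk_size: int = 500) -> int:
    """Map a detail level in [0, 1] to a chunk count.

    detail=0 -> 1 chunk (tersest summary); detail=1 -> roughly one chunk
    per `chunk_size` tokens (most detailed summary).
    """
    if not 0 <= detail <= 1:
        raise ValueError("detail must be between 0 and 1")
    max_chunks = max(1, document_tokens // chunk_size)
    # Linear interpolation between the two extremes.
    return int(round(1 + detail * (max_chunks - 1)))
```

Each chunk is then summarized with its own model call and the pieces are concatenated, so the chunk count directly controls the length and granularity of the final summary.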
{
"cell_type": "code",
"execution_count": 57,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 5/5 [00:09<00:00, 1.99s/it]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"The text provides an overview of the United States, including its geography, history, government structure, economic status, and global influence. It covers the country's origins, colonization, independence, expansion, Civil War, post-war era, rise as a superpower, and involvement in the Cold War. The U.S. is described as a presidential constitutional republic with a strong emphasis on individual rights, liberty, and limited government. The text also highlights the country's economic prowess, cultural influence, and membership in various international organizations.\n",
"\n",
"The text discusses the United States from the early 1960s to the present day, highlighting significant events such as President Lyndon Johnson's Great Society plan, the counterculture movement, societal changes, the end of the Cold War, the economic expansion of the 1990s, the September 11 attacks, the Great Recession, and political polarization. It also covers the country's geography, climate, biodiversity, conservation efforts, government structure, political parties, foreign relations, military strength, law enforcement, crime rates, and the economy, including its status as the world's largest economy.\n",
"\n",
"The text discusses the economic status of the United States, highlighting its GDP growth, ranking in various economic indicators, dominance in global trade, and technological advancements. It also covers income distribution, poverty rates, and social issues like homelessness and food insecurity. The text further delves into the country's energy consumption, transportation infrastructure, demographics, immigration trends, religious diversity, urbanization, healthcare system, life expectancy, and education system.\n",
"\n",
"The text discusses various aspects of American culture and society, including education, literature, mass media, theater, visual arts, music, fashion, cinema, and cuisine. It highlights the country's achievements in education, with a focus on higher education and federal financial aid for students. The text also delves into American cultural values, ethnic diversity, and the country's strong protections of free speech. Additionally, it covers the development of American literature, mass media landscape, theater scene, visual arts movements, music genres, fashion industry, cinema history, and culinary traditions. The influence of American culture globally, as well as the economic impact of industries like music and restaurants, is also discussed.\n",
"\n",
"American fast-food chains like McDonald's, Burger King, Pizza Hut, Kentucky Fried Chicken, and Domino's Pizza have a widespread global presence with numerous outlets worldwide.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"recursive_summary = summarize(united_states_wikipedia_text, detail=0.1, summarize_recursively=True,\n",
"    additional_instructions=\"Don't overuse repetitive phrases to introduce each section\")\n",
"print(recursive_summary)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-04-10T05:33:30.123036Z",
"start_time": "2024-04-10T05:33:18.791253Z"
}
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "w9w5JBaUL-lO"
},
"source": [
"# Multimodal RAG with CLIP Embeddings and GPT-4 Vision\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3CCjcFSiMbvf"
},
"source": [
"Multimodal RAG integrates additional modalities into traditional text-based RAG, enhancing LLMs' question-answering by providing extra context and grounding textual data for improved understanding.\n",
"\n",
"Adopting the approach from the [clothing matchmaker cookbook](https://cookbook.openai.com/examples/how_to_combine_gpt4v_with_rag_outfit_assistant), we directly embed images for similarity search, bypassing the lossy process of text captioning, to boost retrieval accuracy.\n",
"\n",
"Using CLIP-based embeddings further allows fine-tuning with specific data or updating with unseen images.\n",
"\n",
"This technique is showcased through searching an enterprise knowledge base with user-provided tech images to deliver pertinent information."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "T-Mpdxit4x49"
},
"source": [
"# Installations"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nbt3evfHUJTZ"
},
"source": [
"First, let's install the relevant packages."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7hgrcVEl0Ma1"
},
"outputs": [],
"source": [
"# installations\n",
"%pip install torch\n",
"%pip install pillow\n",
"%pip install faiss-cpu\n",
"%pip install numpy\n",
"%pip install git+https://github.com/openai/CLIP.git\n",
"%pip install openai"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GgrlBLTpT0si"
},
"source": [
"Then let's import all the needed packages.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "pN1cWF-iyLUg"
},
"outputs": [],
"source": [
"# model imports\n",
"import faiss\n",
"import json\n",
"import torch\n",
"from openai import OpenAI\n",
"import torch.nn as nn\n",
"from torch.utils.data import DataLoader\n",
"import clip\n",
"client = OpenAI()\n",
"\n",
"# helper imports\n",
"from tqdm import tqdm\n",
"import os\n",
"import numpy as np\n",
"import pickle\n",
"from typing import List, Union, Tuple\n",
"\n",
"# visualisation imports\n",
"from PIL import Image\n",
"import matplotlib.pyplot as plt\n",
"import base64"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9fONcWxRqll8"
},
"source": [
"Now let's load the CLIP model."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "_Ua9y98NRk70"
},
"outputs": [],
"source": [
"# load the model on a device; inference/training runs on the CPU here, or a GPU if you have one\n",
"device = \"cpu\"\n",
"model, preprocess = clip.load(\"ViT-B/32\", device=device)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Dev-zjfJ774W"
},
"source": [
"\n",
"We will now:\n",
"1. Create the image embedding database\n",
"2. Set up a query to the vision model\n",
"3. Perform the semantic search\n",
"4. Pass a user query to the image\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5Y1v2jkS42TS"
},
"source": [
"# Create image embedding database"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wVBAMyhesAyi"
},
"source": [
"Next we will create our image embeddings knowledge base from a directory of images. This will be the knowledge base of technology that we search through to provide information to the user for an image they upload.\n",
"\n",
"We pass in the directory in which we store our images (as JPEGs) and loop through each to create our embeddings.\n",
"\n",
"We also have a description.json file with an entry for every image in our knowledge base. Each entry has two keys, 'image_path' and 'description', mapping each image to a useful description that helps answer the user's question."
]
},
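Because the loading code later reads description.json one line at a time with `json.loads`, each record is expected to be a standalone JSON object on its own line (JSON Lines). A hypothetical entry might look like this (the path and description text here are illustrative, not from the actual dataset):

```python
import json

# A hypothetical description.json record in JSON Lines form: one object per
# line, carrying the two keys the notebook relies on.
entry = {
    "image_path": "image_database/train1.jpeg",  # illustrative path
    "description": "DELTA Pro Ultra whole-house battery generator.",  # illustrative text
}
line = json.dumps(entry)   # one line of description.json
parsed = json.loads(line)  # what the loading loop recovers per line
```

Writing one `json.dumps` result per line keeps each record independently parseable, which is exactly what the line-by-line loader below expects.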
{
"cell_type": "markdown",
"metadata": {
"id": "fDCz76gr8yAu"
},
"source": [
"First let's write a function to get all the image paths in a given directory. We will then collect all the JPEGs from a directory called 'image_database'."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "vE9i3zLuRk5c"
},
"outputs": [],
"source": [
"def get_image_paths(directory: str, number: int = None) -> List[str]:\n",
"    image_paths = []\n",
"    count = 0\n",
"    for filename in os.listdir(directory):\n",
"        if filename.endswith('.jpeg'):\n",
"            image_paths.append(os.path.join(directory, filename))\n",
"            # when `number` is given, return only the single image at that index\n",
"            if number is not None and count == number:\n",
"                return [image_paths[-1]]\n",
"            count += 1\n",
"    return image_paths\n",
"\n",
"direc = 'image_database/'\n",
"image_paths = get_image_paths(direc)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hMldfjn189vC"
},
"source": [
"Next we will write a function to get the image embeddings from the CLIP model given a series of paths.\n",
"\n",
"We first preprocess each image using the preprocess function we got earlier. This performs a few operations to ensure the input to the CLIP model has the right format and dimensionality, including resizing, normalization, and colour channel adjustment.\n",
"\n",
"We then stack these preprocessed images together so we can pass them into the model at once rather than in a loop, and finally return the model output, which is an array of embeddings."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "fd3I_fPh8qvi"
},
"outputs": [],
"source": [
"def get_features_from_image_path(image_paths):\n",
"    images = [preprocess(Image.open(image_path).convert(\"RGB\")) for image_path in image_paths]\n",
"    image_input = torch.tensor(np.stack(images))\n",
"    with torch.no_grad():\n",
"        image_features = model.encode_image(image_input).float()\n",
"    return image_features\n",
"\n",
"image_features = get_features_from_image_path(image_paths)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UH_kyZAE-kHe"
},
"source": [
"We can now create our vector database."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "TIeqpndF8tZk"
},
"outputs": [],
"source": [
"index = faiss.IndexFlatIP(image_features.shape[1])\n",
"index.add(image_features)\n"
]
},
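One caveat worth noting: `IndexFlatIP` ranks by raw inner product, so embedding magnitude affects the ranking. If you want cosine similarity instead, L2-normalize the embeddings before adding them to the index, and normalize queries the same way (faiss also provides a `normalize_L2` helper that does this in place). A minimal numpy sketch of the normalization, independent of faiss:

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Row-wise L2 normalization; inner products of the results are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)  # clip guards against zero vectors

vecs = np.array([[3.0, 4.0], [1.0, 0.0]])
unit = l2_normalize(vecs)
```

With normalized vectors, the inner-product scores returned by the index coincide with cosine similarity, which is usually the intended metric for CLIP embeddings.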
{
"cell_type": "markdown",
"metadata": {
"id": "swDe1c4v-mbz"
},
"source": [
"We also ingest our JSON for the image-description mapping, creating a list of JSON objects, along with a helper function to search this list for a given image so we can obtain its description."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"id": "tdjlXQqC8uNE"
},
"outputs": [],
"source": [
"data = []\n",
"image_path = 'train1.jpeg'\n",
"with open('description.json', 'r') as file:\n",
"    for line in file:\n",
"        data.append(json.loads(line))\n",
"\n",
"def find_entry(data, key, value):\n",
"    for entry in data:\n",
"        if entry.get(key) == value:\n",
"            return entry\n",
"    return None"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fJXfCtPD5_63"
},
"source": [
"Let's display an example image; this will be the user-uploaded image. It is a piece of tech that was unveiled at the 2024 CES: the DELTA Pro Ultra Whole House Battery Generator."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "RtkZ7W3g5sED"
},
"outputs": [],
"source": [
"im = Image.open(image_path)\n",
"plt.imshow(im)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5ivECCKSdbBy"
},
"source": [
"![Delta Pro](../images/train1.jpeg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Sidjylki7Kye"
},
"source": [
"# Querying the vision model"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "H8O7X6ml7t38"
},
"source": [
"Now let's have a look at what GPT-4 Vision (which wouldn't have seen this technology before) will label it as.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "r4uDjS-gQAqm"
},
"source": [
"First we will need to write a function to encode our image in base64, as this is the format we will pass into the vision model. Then we will create a generic image_query function that lets us query the LLM with an image input."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"id": "87gf6_xO8Y4i",
"outputId": "99be865f-12e8-4ef0-c2f5-5fd6e5c787f3"
},
"outputs": [
{
"data": {
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
},
"text/plain": [
"'Autonomous Delivery Robot'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def encode_image(image_path):\n",
"    with open(image_path, 'rb') as image_file:\n",
"        encoded_image = base64.b64encode(image_file.read())\n",
"    return encoded_image.decode('utf-8')\n",
"\n",
"def image_query(query, image_path):\n",
"    response = client.chat.completions.create(\n",
"        model='gpt-4-vision-preview',\n",
"        messages=[\n",
"            {\n",
"                \"role\": \"user\",\n",
"                \"content\": [\n",
"                    {\n",
"                        \"type\": \"text\",\n",
"                        \"text\": query,\n",
"                    },\n",
"                    {\n",
"                        \"type\": \"image_url\",\n",
"                        \"image_url\": {\n",
"                            \"url\": f\"data:image/jpeg;base64,{encode_image(image_path)}\",\n",
"                        },\n",
"                    }\n",
"                ],\n",
"            }\n",
"        ],\n",
"        max_tokens=300,\n",
"    )\n",
"    # Extract the text content from the response\n",
"    return response.choices[0].message.content\n",
"\n",
"image_query('Write a short label of what is shown in this image?', image_path)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yfG_7c-jQAqm"
},
"source": [
"As we can see, the model tries its best with what it was trained on, but it makes a mistake because it has not seen anything similar in its training data: the image is ambiguous, which makes it difficult to extrapolate and deduce what the object is."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "szWZqTqf7SrA"
},
"source": [
"# Performing semantic search"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eV8LaOncGH3j"
},
"source": [
"Now let's perform a similarity search to find the two most similar images in our knowledge base. We do this by embedding the user-provided image_path and retrieving the indices and similarity scores of the closest images in our database. Since the index scores by inner product, a larger score means more similar, so we sort the results in descending order."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"id": "GzNEhKJ04D-F"
},
"outputs": [],
"source": [
"image_search_embedding = get_features_from_image_path([image_path])\n",
"distances, indices = index.search(image_search_embedding.reshape(1, -1), 2)  # 2 is the number of most similar images to bring back\n",
"distances = distances[0]\n",
"indices = indices[0]\n",
"indices_distances = list(zip(indices, distances))\n",
"indices_distances.sort(key=lambda x: x[1], reverse=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0O-GYQ-1QAqm"
},
"source": [
"We need the indices because we will use them to search through our image directory, selecting the image at the location of each index to feed into the vision model for RAG."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9-6SVzwSJVuT"
},
"source": [
"And let's see what it brought back (we display these in order of similarity):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Lt1ZYuKDFeww"
},
"outputs": [],
"source": [
"# display similar images\n",
"for idx, distance in indices_distances:\n",
"    print(idx)\n",
"    path = get_image_paths(direc, idx)[0]\n",
"    im = Image.open(path)\n",
"    plt.imshow(im)\n",
"    plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GPTvKIUJ2tgz"
},
"source": [
"![Delta Pro2](../images/train2.jpeg)\n",
"\n",
"![Delta Pro3](../images/train17.jpeg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "x4kF2-MJQAqm"
},
"source": [
"We can see that the search brought back two images containing the DELTA Pro Ultra Whole House Battery Generator. One of them includes background clutter that could be distracting, yet the search still finds the right image."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Qc2sOKzY7yv3"
},
"source": [
"# User querying the most similar image"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8Sio6OR4MDjI"
},
"source": [
"Now we pass the most similar image, along with its description, to GPT-4 Vision together with the user's query so they can inquire about the technology they may have bought. This is where the power of the vision model comes in: you can ask general questions the model hasn't been explicitly trained on, and it responds with high accuracy."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uPzsRk66QAqn"
},
"source": [
"In our example below, we will inquire as to the capacity of the item in question."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 87
},
"id": "-_5W_xwitbr3",
"outputId": "99a40617-0153-492a-d8b0-6782b8421e40"
},
"outputs": [
{
"data": {
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
},
"text/plain": [
"'The portable home battery DELTA Pro has a base capacity of 3.6kWh. This capacity can be expanded up to 25kWh with additional batteries. The image showcases the DELTA Pro, which has an impressive 3600W power capacity for AC output as well.'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"similar_path = get_image_paths(direc, indices_distances[0][0])[0]\n",
"element = find_entry(data, 'image_path', similar_path)\n",
"\n",
"user_query = 'What is the capacity of this item?'\n",
"prompt = f\"\"\"\n",
"Below is a user query, I want you to answer the query using the description and image provided.\n",
"\n",
"user query:\n",
"{user_query}\n",
"\n",
"description:\n",
"{element['description']}\n",
"\"\"\"\n",
"image_query(prompt, similar_path)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VIInamGaAG9L"
},
"source": [
"And we see it is able to answer the question. This was only possible by matching images directly and, from there, gathering the relevant description as context."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Ljrf0VKR_2q9"
},
"source": [
"# Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PexvxTF5_7ay"
},
"source": [
"In this notebook, we have gone through how to use the CLIP model, created an image embedding database with it, performed semantic search, and finally answered a user query with the retrieved context."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gOgRBeh6eMiq"
},
"source": [
"This usage pattern applies across many application domains, and the technique is easy to extend. For example, you could fine-tune CLIP, improve the retrieval process just as in text-based RAG, or prompt engineer GPT-4 Vision.\n"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.10.14"
}
},
"nbformat": 4,
"nbformat_minor": 0
}