{ "cells": [ { "cell_type": "markdown", "id": "30995a82", "metadata": {}, "source": [ "# How to build a tool-using agent with LangChain\n", "\n", "This notebook takes you through how to use LangChain to augment an OpenAI model with access to external tools. In particular, you'll be able to create LLM agents that use custom tools to answer user queries.\n", "\n", "\n", "## What is Langchain?\n", "[LangChain](https://python.langchain.com/en/latest/index.html) is a framework for developing applications powered by language models. Their framework enables you to build layered LLM-powered applications that are context-aware and able to interact dynamically with their environment as agents, leading to simplified code for you and a more dynamic user experience for your customers.\n", "\n", "## Why do LLMs need to use Tools?\n", "One of the most common challenges with LLMs is overcoming the lack of recency and specificity in their training data - answers can be out of date, and they are prone to hallucinations given the huge variety in their knowledge base. Tools are a great method of allowing an LLM to answer within a controlled context that draws on your existing knowledge bases and internal APIs - instead of trying to prompt engineer the LLM all the way to your intended answer, you allow it access to tools that it calls on dynamically for info, parses, and serves to customer. \n", "\n", "Providing LLMs access to tools can enable them to answer questions with context directly from search engines, APIs or your own databases. Instead of answering directly, an LLM with access to tools can perform intermediate steps to gather relevant information. Tools can also be used in combination. [For example](https://python.langchain.com/en/latest/modules/agents/agents/examples/mrkl_chat.html), a language model can be made to use a search tool to lookup quantitative information and a calculator to execute calculations.\n", "\n", "## Notebook Sections\n", "\n", "- **Setup:** Import packages and connect to a Pinecone vector database.\n", "- **LLM Agent:** Build an agent that leverages a modified version of the [ReAct](https://react-lm.github.io/) framework to do chain-of-thought reasoning.\n", "- **LLM Agent with History:** Provide the LLM with access to previous steps in the conversation.\n", "- **Knowledge Base:** Create a knowledge base of \"Stuff You Should Know\" podcast episodes, to be accessed through a tool.\n", "- **LLM Agent with Tools:** Extend the agent with access to multiple tools and test that it uses them to answer questions." ] }, { "cell_type": "code", "execution_count": 32, "id": "9c069980", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The autoreload extension is already loaded. To reload it, use:\n", " %reload_ext autoreload\n" ] } ], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "id": "7a254003", "metadata": {}, "source": [ "# Setup\n", "\n", "Import libraries and set up a connection to a [Pinecone](https://www.pinecone.io) vector database.\n", "\n", "You can substitute Pinecone for any other vectorstore or database - there are a [selection](https://python.langchain.com/en/latest/modules/indexes/vectorstores.html) that are supported by Langchain natively, while other connectors will need to be developed yourself." 
] }, { "cell_type": "code", "execution_count": null, "id": "f55905f7", "metadata": {}, "outputs": [], "source": [ "!pip install openai\n", "!pip install pinecone-client\n", "!pip install pandas\n", "!pip install tqdm\n", "!pip install langchain\n", "!pip install wget" ] }, { "cell_type": "code", "execution_count": null, "id": "18be3d9f", "metadata": {}, "outputs": [], "source": [ "import datetime\n", "import json\n", "import openai\n", "import os\n", "import pandas as pd\n", "import pinecone\n", "import re\n", "from tqdm.auto import tqdm\n", "from typing import List, Union\n", "import zipfile\n", "\n", "# LangChain imports\n", "from langchain.agents import Tool, AgentExecutor, LLMSingleActionAgent, AgentOutputParser\n", "from langchain.prompts import BaseChatPromptTemplate, ChatPromptTemplate\n", "from langchain import SerpAPIWrapper, LLMChain\n", "from langchain.schema import AgentAction, AgentFinish, HumanMessage, SystemMessage\n", "# LLM wrapper\n", "from langchain.chat_models import ChatOpenAI\n", "from langchain import OpenAI\n", "# Conversational memory\n", "from langchain.memory import ConversationBufferWindowMemory\n", "# Embeddings and vectorstore\n", "from langchain.embeddings.openai import OpenAIEmbeddings\n", "from langchain.vectorstores import Pinecone\n", "\n", "# Vectorstore index\n", "index_name = 'podcasts'" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9c85f298", "metadata": {}, "source": [ "To get an API key for Pinecone, you can set up a [free account](https://app.pinecone.io/) and store the key in the `api_key` variable below or in your environment variables under `PINECONE_API_KEY`." ] }, { "cell_type": "code", "execution_count": 4, "id": "af825644", "metadata": {}, "outputs": [], "source": [ "api_key = os.getenv(\"PINECONE_API_KEY\") or \"PINECONE_API_KEY\"\n", "\n", "# find environment next to your API key in the Pinecone console\n", "env = os.getenv(\"PINECONE_ENVIRONMENT\") or \"PINECONE_ENVIRONMENT\"\n", "\n", "pinecone.init(api_key=api_key, environment=env)\n", "pinecone.whoami()" ] }, { "cell_type": "code", "execution_count": 5, "id": "c0518596", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['podcasts']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pinecone.list_indexes()" ] }, { "cell_type": "markdown", "id": "7e7644cd", "metadata": {}, "source": [ "Run this code block if you want to clear the index, or if the index doesn't exist yet.\n", "\n", "```\n", "# Check whether an index with the same name already exists - if so, delete it\n", "if index_name in pinecone.list_indexes():\n", "    pinecone.delete_index(index_name)\n", "\n", "# Create a new index - 1536 is the dimensionality of OpenAI's text-embedding-ada-002 embeddings\n", "pinecone.create_index(name=index_name, dimension=1536)\n", "index = pinecone.Index(index_name=index_name)\n", "\n", "# Confirm our index was created\n", "pinecone.list_indexes()\n", "```" ] }, { "cell_type": "markdown", "id": "0eb63e37", "metadata": {}, "source": [ "## LLM Agent\n", "\n", "An [LLM agent](https://python.langchain.com/en/latest/modules/agents/agents/custom_llm_agent.html) in LangChain has many configurable components, which are detailed in the LangChain documentation.\n", "\n", "We'll employ a few of the core concepts to make an agent that talks in the way we want, can use tools to answer questions, and uses the appropriate language model to power the conversation.\n", "- **Prompt Template:** The input template to control the LLM's behaviour and how it accepts inputs and produces outputs - this is the brain that drives your application ([docs](https://python.langchain.com/en/latest/modules/prompts/prompt_templates.html)).\n", "- **Output Parser:** A method of parsing the output from the prompt. If the LLM produces output using certain headers, you can enable complex interactions where variables are generated by the LLM in its response and passed into the next step of the chain ([docs](https://python.langchain.com/en/latest/modules/prompts/output_parsers.html)).\n", "- **LLM Chain:** A Chain brings together a prompt template with an LLM that will execute it - in this case we'll be using ```gpt-3.5-turbo```, but this framework can be used with OpenAI completions models or other LLMs entirely ([docs](https://python.langchain.com/en/latest/modules/chains.html)).\n", "- **Tool:** An external service that the LLM can use to retrieve information or execute commands should the user require it ([docs](https://python.langchain.com/en/latest/modules/agents/tools.html)).\n", "- **Agent:** The glue that brings all of this together: an agent can call multiple LLM Chains, each with its own tools. Agents can be extended with your own logic to allow retries, error handling and any other methods you choose to add reliability to your application ([docs](https://python.langchain.com/en/latest/modules/agents.html)).\n", "\n", "**NB:** Before using this cookbook with the Search tool you'll need to sign up on https://serpapi.com/ and generate an API key. Once you have it, store it in an environment variable named ```SERPAPI_API_KEY```." ] },
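{ "cell_type": "markdown", "id": "0eb63e38", "metadata": {}, "source": [ "Before wiring up the real Search tool, here is a minimal sketch of what a custom tool looks like - a ```Tool``` simply wraps a callable. The ```WordLength``` tool below is purely illustrative (it isn't used elsewhere in this notebook), but the same pattern works for your own APIs and knowledge bases." ] }, { "cell_type": "code", "execution_count": null, "id": "0eb63e39", "metadata": {}, "outputs": [], "source": [ "# Illustrative only: any function that takes a string and returns a string can become a tool\n", "def get_word_length(word: str) -> str:\n", "    return str(len(word))\n", "\n", "word_length_tool = Tool(\n", "    name=\"WordLength\",\n", "    func=get_word_length,\n", "    description=\"useful for when you need to count the letters in a word\"\n", ")" ] },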
{ "cell_type": "code", "execution_count": 34, "id": "3f813668", "metadata": {}, "outputs": [], "source": [ "# Initialize a Search tool - note you'll need to have set SERPAPI_API_KEY as an environment variable as per the above instructions\n", "search = SerpAPIWrapper()\n", "\n", "# Define a list of tools\n", "tools = [\n", "    Tool(\n", "        name=\"Search\",\n", "        func=search.run,\n", "        description=\"useful for when you need to answer questions about current events\"\n", "    )\n", "]" ] }, { "cell_type": "code", "execution_count": 35, "id": "effae642", "metadata": {}, "outputs": [], "source": [ "# Set up the prompt with input variables for tools, user input and a scratchpad for the model to record its workings\n", "template = \"\"\"Answer the following questions as best you can, but speaking as a pirate might speak. You have access to the following tools:\n", "\n", "{tools}\n", "\n", "Use the following format:\n", "\n", "Question: the input question you must answer\n", "Thought: you should always think about what to do\n", "Action: the action to take, should be one of [{tool_names}]\n", "Action Input: the input to the action\n", "Observation: the result of the action\n", "... (this Thought/Action/Action Input/Observation can repeat N times)\n", "Thought: I now know the final answer\n", "Final Answer: the final answer to the original input question\n", "\n",
"Begin! Remember to speak as a pirate when giving your final answer. Use lots of \"Arg\"s\n", "\n", "Question: {input}\n", "{agent_scratchpad}\"\"\"" ] }, { "cell_type": "code", "execution_count": 36, "id": "b4efd85d", "metadata": {}, "outputs": [], "source": [ "# Set up a prompt template\n", "class CustomPromptTemplate(BaseChatPromptTemplate):\n", "    # The template to use\n", "    template: str\n", "    # The list of tools available\n", "    tools: List[Tool]\n", "\n", "    def format_messages(self, **kwargs) -> List[HumanMessage]:\n", "        # Get the intermediate steps (AgentAction, Observation tuples)\n", "        # and format them in a particular way\n", "        intermediate_steps = kwargs.pop(\"intermediate_steps\")\n", "        thoughts = \"\"\n", "        for action, observation in intermediate_steps:\n", "            thoughts += action.log\n", "            thoughts += f\"\\nObservation: {observation}\\nThought: \"\n", "\n", "        # Set the agent_scratchpad variable to that value\n", "        kwargs[\"agent_scratchpad\"] = thoughts\n", "\n", "        # Create a tools variable from the list of tools provided\n", "        kwargs[\"tools\"] = \"\\n\".join([f\"{tool.name}: {tool.description}\" for tool in self.tools])\n", "\n", "        # Create a list of tool names for the tools provided\n", "        kwargs[\"tool_names\"] = \", \".join([tool.name for tool in self.tools])\n", "        formatted = self.template.format(**kwargs)\n", "        return [HumanMessage(content=formatted)]\n", "\n", "prompt = CustomPromptTemplate(\n", "    template=template,\n", "    tools=tools,\n", "    # This omits the `agent_scratchpad`, `tools`, and `tool_names` variables because those are generated dynamically\n", "    # This includes the `intermediate_steps` variable because that is needed\n", "    input_variables=[\"input\", \"intermediate_steps\"]\n", ")" ] }, { "cell_type": "code", "execution_count": 37, "id": "533657a5", "metadata": {}, "outputs": [], "source": [ "class CustomOutputParser(AgentOutputParser):\n", "\n", "    def parse(self, llm_output: str) -> Union[AgentAction, AgentFinish]:\n", "\n", "        # Check if agent should finish\n", "        if \"Final Answer:\" in llm_output:\n", "            return AgentFinish(\n", "                # Return values is generally always a dictionary with a single `output` key\n", "                # It is not recommended to try anything else at the moment :)\n", "                return_values={\"output\": llm_output.split(\"Final Answer:\")[-1].strip()},\n", "                log=llm_output,\n", "            )\n", "\n", "        # Parse out the action and action input\n", "        regex = r\"Action: (.*?)[\\n]*Action Input:[\\s]*(.*)\"\n", "        match = re.search(regex, llm_output, re.DOTALL)\n", "\n", "        # If it can't parse the output it raises an error\n", "        # You can add your own logic here to handle errors in a different way, e.g. pass to a human, give a canned response\n", "        if not match:\n", "            raise ValueError(f\"Could not parse LLM output: `{llm_output}`\")\n", "        action = match.group(1).strip()\n", "        action_input = match.group(2)\n", "\n", "        # Return the action and action input\n", "        return AgentAction(tool=action, tool_input=action_input.strip(\" \").strip('\"'), log=llm_output)\n", "\n", "output_parser = CustomOutputParser()" ] },
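{ "cell_type": "markdown", "id": "533657a6", "metadata": {}, "source": [ "As a quick sanity check, you can run the parser on a hand-written response and confirm that it extracts the tool name and tool input (the response below is made up for illustration):" ] }, { "cell_type": "code", "execution_count": null, "id": "533657a7", "metadata": {}, "outputs": [], "source": [ "# Illustrative only: parse a fabricated model response into an AgentAction\n", "sample_llm_output = \"\"\"Thought: I should search for this.\n", "Action: Search\n", "Action Input: population of Canada\"\"\"\n", "\n", "output_parser.parse(sample_llm_output)" ] },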
{ "cell_type": "code", "execution_count": 38, "id": "4c7d93d1", "metadata": {}, "outputs": [], "source": [ "# Initialize our LLM - the default model is 'gpt-3.5-turbo'\n", "llm = ChatOpenAI(temperature=0)\n", "\n", "# LLM chain consisting of the LLM and a prompt\n", "llm_chain = LLMChain(llm=llm, prompt=prompt)\n", "\n", "# Using tools, the LLM chain and output_parser to make an agent\n", "tool_names = [tool.name for tool in tools]\n", "\n", "agent = LLMSingleActionAgent(\n", "    llm_chain=llm_chain,\n", "    output_parser=output_parser,\n", "    # We use \"Observation\" as our stop sequence so it will stop when it receives Tool output\n", "    # If you change your prompt template you'll need to adjust this as well\n", "    stop=[\"\\nObservation:\"],\n", "    allowed_tools=tool_names\n", ")" ] }, { "cell_type": "code", "execution_count": 39, "id": "dd893fa5", "metadata": {}, "outputs": [], "source": [ "# Initialize the agent that will respond to our queries\n", "# Set verbose=True to share the CoT reasoning the LLM goes through\n", "agent_executor = AgentExecutor.from_agent_and_tools(agent=agent, tools=tools, verbose=True)" ] }, { "cell_type": "code", "execution_count": 40, "id": "c7ca3ee2", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n", "\u001b[32;1m\u001b[1;3mThought: Hmm, I be not sure of the answer to that one. Let me think.\n", "Action: Search\n", "Action Input: \"Canada population 2023\"\u001b[0m\n", "\n", "Observation:\u001b[36;1m\u001b[1;3m39,566,248\u001b[0m\u001b[32;1m\u001b[1;3mAhoy, that be a lot of people! But I need to make sure this be true.\n", "Action: Search\n", "Action Input: \"Canada population 2023 official source\"\u001b[0m\n", "\n", "Observation:\u001b[36;1m\u001b[1;3mThe current population of Canada is 38,664,637 as of Wednesday, April 19, 2023, based on Worldometer elaboration of the latest United Nations data.\u001b[0m\u001b[32;1m\u001b[1;3mArrr, that be the official number! I be confident in me answer now.\n", "Final Answer: The population of Canada as of 2023 is 38,664,637. Arg!\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n" ] }, { "data": { "text/plain": [ "'The population of Canada as of 2023 is 38,664,637. Arg!'" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "agent_executor.run(\"How many people live in canada as of 2023?\")" ] }, { "cell_type": "code", "execution_count": 41, "id": "9ff96040", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n", "\u001b[32;1m\u001b[1;3mThought: Hmm, I'm not sure what this question is asking about. I better use the search tool.\n", "Action: Search\n", "Action Input: \"2022 events\"\u001b[0m\n", "\n", "Observation:\u001b[36;1m\u001b[1;3m8. Humanitarian Crises Deepen · 7. Latin America Moves Left. · 6. Iranians Protest. · 5. COVID Eases. · 4. Inflation Returns. · 3. Climate Change ...\u001b[0m\u001b[32;1m\u001b[1;3mAhoy, it looks like this be a question about what be happenin' in 2022. Let me search again.\n", "Action: Search\n", "Action Input: \"2022 calendar\"\u001b[0m\n", "\n", "Observation:\u001b[36;1m\u001b[1;3mUnited States 2022 – Calendar with American holidays. Yearly calendar showing months for the year 2022. Calendars – online and print friendly – for any year ...\u001b[0m\u001b[32;1m\u001b[1;3mShiver me timbers, it looks like this be a question about the year 2022. Let me search one more time.\n", "Action: Search\n", "Action Input: \"What be happenin' in 2022?\"\u001b[0m\n", "\n", "Observation:\u001b[36;1m\u001b[1;3m8. Humanitarian Crises Deepen · 7. Latin America Moves Left. · 6. Iranians Protest. · 5. COVID Eases. · 4. Inflation Returns. · 3. Climate Change ...\u001b[0m\u001b[32;1m\u001b[1;3mAvast ye, it looks like the same results be comin' up. I reckon there be no clear answer to this question.\n", "Final Answer: Arg, I be sorry matey, but I can't give ye a clear answer to that question.\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n" ] }, { "data": { "text/plain": [ "\"Arg, I be sorry matey, but I can't give ye a clear answer to that question.\"" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "agent_executor.run(\"How many in 2022?\")" ] }, { "cell_type": "markdown", "id": "c3b51817", "metadata": {}, "source": [ "## LLM Agent with History\n", "\n", "Extend the LLM Agent with the ability to retain a [memory](https://python.langchain.com/en/latest/modules/agents/agents/custom_llm_agent.html#adding-memory) and use it as context as it continues the conversation.\n", "\n", "For this example we use a simple ```ConversationBufferWindowMemory``` that keeps a rolling window of the last two conversation turns. LangChain has other [memory options](https://python.langchain.com/en/latest/modules/memory.html), with different tradeoffs suitable for different use cases." ] },
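{ "cell_type": "markdown", "id": "c3b51818", "metadata": {}, "source": [ "To see what that rolling window does, here is a small illustrative sketch (the conversation turns are made up): with ```k=2```, only the last two input/output pairs survive in the ```history``` variable." ] }, { "cell_type": "code", "execution_count": null, "id": "c3b51819", "metadata": {}, "outputs": [], "source": [ "# Illustrative only: save three turns, then confirm only the last two remain\n", "demo_memory = ConversationBufferWindowMemory(k=2)\n", "demo_memory.save_context({\"input\": \"ahoy\"}, {\"output\": \"ahoy matey\"})\n", "demo_memory.save_context({\"input\": \"first question\"}, {\"output\": \"first answer\"})\n", "demo_memory.save_context({\"input\": \"second question\"}, {\"output\": \"second answer\"})\n", "\n", "demo_memory.load_memory_variables({})" ] },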
{ "cell_type": "code", "execution_count": 1, "id": "85737eea", "metadata": {}, "outputs": [], "source": [ "# Set up a prompt template which can interpolate the history\n", "template_with_history = \"\"\"You are SearchGPT, a professional search engine who provides informative answers to users. Answer the following questions as best you can. You have access to the following tools:\n", "\n", "{tools}\n", "\n", "Use the following format:\n", "\n", "Question: the input question you must answer\n", "Thought: you should always think about what to do\n", "Action: the action to take, should be one of [{tool_names}]\n", "Action Input: the input to the action\n", "Observation: the result of the action\n", "... (this Thought/Action/Action Input/Observation can repeat N times)\n", "Thought: I now know the final answer\n", "Final Answer: the final answer to the original input question\n", "\n", "Begin! Remember to give detailed, informative answers\n", "\n", "Previous conversation history:\n", "{history}\n", "\n", "New question: {input}\n", "{agent_scratchpad}\"\"\"" ] }, { "cell_type": "code", "execution_count": 43, "id": "4d85c309", "metadata": {}, "outputs": [], "source": [ "prompt_with_history = CustomPromptTemplate(\n", "    template=template_with_history,\n", "    tools=tools,\n", "    # The history template includes \"history\" as an input variable so we can interpolate it into the prompt\n", "    input_variables=[\"input\", \"intermediate_steps\", \"history\"]\n", ")\n", "\n", "llm_chain = LLMChain(llm=llm, prompt=prompt_with_history)\n", "tool_names = [tool.name for tool in tools]\n", "agent = LLMSingleActionAgent(\n", "    llm_chain=llm_chain,\n", "    output_parser=output_parser,\n", "    stop=[\"\\nObservation:\"],\n", "    allowed_tools=tool_names\n", ")" ] }, { "cell_type": "code", "execution_count": 44, "id": "4f8c8fea", "metadata": {}, "outputs": [], "source": [ "# Initialize the memory with k=2 to keep the last two turns\n", "# Provide the memory to the agent\n", "memory = ConversationBufferWindowMemory(k=2)\n", "agent_executor = AgentExecutor.from_agent_and_tools(agent=agent, tools=tools, verbose=True, memory=memory)" ] }, { "cell_type": "code", "execution_count": 45, "id": "e1f031bb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n", "\u001b[32;1m\u001b[1;3mThought: I need to find the most recent population data for Canada.\n", "Action: Search\n", "Action Input: \"Canada population 2023\"\u001b[0m\n", "\n", "Observation:\u001b[36;1m\u001b[1;3m39,566,248\u001b[0m\u001b[32;1m\u001b[1;3mThis data seems reliable, but I should double-check the source.\n", "Action: Search\n", "Action Input: \"Source of Canada population 2023\"\u001b[0m\n", "\n", "Observation:\u001b[36;1m\u001b[1;3mThe current population of Canada is 38,664,637 as of Wednesday, April 19, 2023, based on Worldometer elaboration of the latest United Nations data. Canada 2020 population is estimated at 37,742,154 people at mid year according to UN data. Canada population is equivalent to 0.48% of the total world population.\u001b[0m\u001b[32;1m\u001b[1;3mI now know the final answer\n", "Final Answer: As of April 19, 2023, the population of Canada is 38,664,637.\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n" ] }, { "data": { "text/plain": [ "'As of April 19, 2023, the population of Canada is 38,664,637.'" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "agent_executor.run(\"How many people live in canada as of 2023?\")" ] }, { "cell_type": "code", "execution_count": 46, "id": "9d28d73f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n", "\u001b[32;1m\u001b[1;3mThought: I need to search for the current population of Mexico.\n", "Action: Search\n", "Action Input: \"current population of Mexico\"\u001b[0m\n", "\n", "Observation:\u001b[36;1m\u001b[1;3mMexico, officially the United Mexican States, is a country in the southern portion of North America.
It is bordered to the north by the United States; to the south and west by the Pacific Ocean; to the southeast by Guatemala, Belize, and the Caribbean Sea; and to the east by the Gulf of Mexico.\u001b[0m\u001b[32;1m\u001b[1;3mThat's not the answer to the question, I need to refine my search.\n", "Action: Search\n", "Action Input: \"population of Mexico 2023\"\u001b[0m\n", "\n", "Observation:\u001b[36;1m\u001b[1;3m132,709,512\u001b[0m\u001b[32;1m\u001b[1;3mI now know the final answer.\n", "Final Answer: As of 2023, the population of Mexico is 132,709,512.\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n" ] }, { "data": { "text/plain": [ "'As of 2023, the population of Mexico is 132,709,512.'" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "agent_executor.run(\"how about in mexico?\")" ] }, { "cell_type": "markdown", "id": "62356eed", "metadata": {}, "source": [ "## Knowledge base\n", "\n", "Create a custom vectorstore for the Agent to use as a tool to answer questions with. We'll store the results in [Pinecone](https://docs.pinecone.io/docs/quickstart), which is supported by LangChain ([Docs](https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/pinecone.html), [API reference](https://python.langchain.com/en/latest/reference/modules/vectorstore.html)). For help getting started with Pinecone or other vector databases, see our [cookbook](https://github.com/openai/openai-cookbook/blob/colin/examples/vector_databases/Using_vector_databases_for_embeddings_search.ipynb).\n", "\n", "You can check the LangChain documentation to see what other [vectorstores](https://python.langchain.com/en/latest/modules/indexes/vectorstores.html) and [databases](https://python.langchain.com/en/latest/modules/chains/examples/sqlite.html) are available.\n", "\n", "For this example we'll use the transcripts of the Stuff You Should Know podcast, which were made available thanks to OSF DOI [10.17605/OSF.IO/VM9NT](https://doi.org/10.17605/OSF.IO/VM9NT)." ] }, { "cell_type": "code", "execution_count": 18, "id": "87f86008", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100% [......................................................................] 571275039 / 571275039" ] }, { "data": { "text/plain": [ "'sysk_podcast_transcripts_embedded.json.zip'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import wget\n", "\n", "# Here is a URL to a zip archive containing the transcribed podcasts\n", "# Note that this data has already been split into chunks, and embeddings from OpenAI's text-embedding-ada-002 embedding model are included\n", "content_url = 'https://cdn.openai.com/API/examples/data/sysk_podcast_transcripts_embedded.json.zip'\n", "\n", "# Download the file (it is ~545 MB so this will take some time)\n", "wget.download(content_url)" ] }, { "cell_type": "code", "execution_count": 19, "id": "4889c292", "metadata": {}, "outputs": [], "source": [ "# Load podcasts\n", "with zipfile.ZipFile(\"sysk_podcast_transcripts_embedded.json.zip\", \"r\") as zip_ref:\n", "    zip_ref.extractall(\"./data\")\n", "\n", "with open('./data/sysk_podcast_transcripts_embedded.json') as f:\n", "    processed_podcasts = json.load(f)" ] },
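{ "cell_type": "markdown", "id": "4889c293", "metadata": {}, "source": [ "Each entry in ```processed_podcasts``` should be a record whose fields match the columns in the preview below - a quick, purely illustrative way to check is:" ] }, { "cell_type": "code", "execution_count": null, "id": "4889c294", "metadata": {}, "outputs": [], "source": [ "# Illustrative only: inspect the fields of the first processed chunk\n", "list(processed_podcasts[0].keys())" ] },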
\n", " | id | \n", "filename | \n", "title | \n", "url | \n", "text_chunk | \n", "embedding | \n", "cleaned_id | \n", "
---|---|---|---|---|---|---|---|
0 | \n", "sysk_with_transcripts_SYSK Selects How Crime S... | \n", "sysk_with_transcripts_SYSK Selects How Crime S... | \n", "\\n\\nSYSK Selects How Crime Scene Cleanup Works | \n", "https://chtbl.com/track/5899E/podtrac.com/pts/... | \n", "Title: sysk_with_transcripts_SYSK Selects How ... | \n", "[0.021279960870742798, -0.005817972123622894, ... | \n", "sysk_with_transcripts_SYSK Selects How Crime S... | \n", "
1 | \n", "sysk_with_transcripts_SYSK Selects How Crime S... | \n", "sysk_with_transcripts_SYSK Selects How Crime S... | \n", "\\n\\nSYSK Selects How Crime Scene Cleanup Works | \n", "https://chtbl.com/track/5899E/podtrac.com/pts/... | \n", "Title: sysk_with_transcripts_SYSK Selects How ... | \n", "[0.013859338127076626, 0.00857278611510992, 0.... | \n", "sysk_with_transcripts_SYSK Selects How Crime S... | \n", "
2 | \n", "sysk_with_transcripts_SYSK Selects How Crime S... | \n", "sysk_with_transcripts_SYSK Selects How Crime S... | \n", "\\n\\nSYSK Selects How Crime Scene Cleanup Works | \n", "https://chtbl.com/track/5899E/podtrac.com/pts/... | \n", "Title: sysk_with_transcripts_SYSK Selects How ... | \n", "[0.015242221765220165, 0.016030369326472282, 0... | \n", "sysk_with_transcripts_SYSK Selects How Crime S... | \n", "
3 | \n", "sysk_with_transcripts_SYSK Selects How Crime S... | \n", "sysk_with_transcripts_SYSK Selects How Crime S... | \n", "\\n\\nSYSK Selects How Crime Scene Cleanup Works | \n", "https://chtbl.com/track/5899E/podtrac.com/pts/... | \n", "Title: sysk_with_transcripts_SYSK Selects How ... | \n", "[0.004371842369437218, -0.003036574460566044, ... | \n", "sysk_with_transcripts_SYSK Selects How Crime S... | \n", "
4 | \n", "sysk_with_transcripts_SYSK Selects How Crime S... | \n", "sysk_with_transcripts_SYSK Selects How Crime S... | \n", "\\n\\nSYSK Selects How Crime Scene Cleanup Works | \n", "https://chtbl.com/track/5899E/podtrac.com/pts/... | \n", "Title: sysk_with_transcripts_SYSK Selects How ... | \n", "[0.017309172078967094, 0.015154214575886726, 0... | \n", "sysk_with_transcripts_SYSK Selects How Crime S... | \n", "