openai-cookbook/examples/third_party/financial_document_analysis...

{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "0f67904b-5fd6-443f-bf10-d49a69b25fcd",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Financial Document Analysis with LlamaIndex"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "9aebd935-07e0-44bb-a682-cf1d7b2a9b07",
   "metadata": {
    "tags": []
   },
   "source": [
    "In this example notebook, we showcase how to perform financial analysis over [**10-K**](https://en.wikipedia.org/wiki/Form_10-K) documents with the [**LlamaIndex**](https://gpt-index.readthedocs.io/en/latest/) framework with just a few lines of code."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "cb6f657b-b692-44ed-9b8f-2b277e07828d",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Notebook Outline\n",
    "* [Introduction](#Introduction)\n",
    "* [Setup](#Setup)\n",
    "* [Data Loading & Indexing](#Data-Loading-and-Indexing)\n",
    "* [Simple QA](#Simple-QA)\n",
    "* [Advanced QA - Compare and Contrast](#Advanced-QA---Compare-and-Contrast)\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "1e38922b-10ad-4442-bff3-32ee2a384c4d",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Introduction\n",
    "\n",
    "### LLamaIndex\n",
    "[LlamaIndex](https://gpt-index.readthedocs.io/en/latest/) is a data framework for LLM applications.\n",
    "You can get started with just a few lines of code and build a retrieval-augmented generation (RAG) system in minutes.\n",
    "For more advanced users, LlamaIndex offers a rich toolkit for ingesting and indexing your data, modules for retrieval and re-ranking, and composable components for building custom query engines.\n",
    "\n",
    "See [full documentation](https://gpt-index.readthedocs.io/en/latest/) for more details.\n",
    "\n",
    "### Financial Analysis over 10-K documents\n",
    "A key part of a financial analyst's job is to extract information and synthesize insight from long financial documents.\n",
    "A great example is the 10-K form - an annual report required by the U.S. Securities and Exchange Commission (SEC), that gives a comprehensive summary of a company's financial performance.\n",
    "These documents typically run hundred of pages in length, and contain domain-specific terminology that makes it challenging for a layperson to digest quickly.\n",
    "\n",
    "\n",
    "We showcase how LlamaIndex can support a financial analyst in quickly extracting information and synthesize insights **across multiple documents** with very little coding. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "ad72f9c1-e35c-476d-94d8-ca9619d0cc08",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Setup"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "15284a54-35b3-4129-9a7a-57b7547bfaf4",
   "metadata": {
    "tags": []
   },
   "source": [
    "To begin, we need to install the llama-index library"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d1062e24-6177-459a-94e0-8e77b3e0859b",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "!pip install llama-index pypdf"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f51cb5df-9c98-44d8-8c69-1a0b784de3e3",
   "metadata": {},
   "source": [
    "Now, we import all modules used in this tutorial"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "09fbec4c-1864-4d76-9dbf-3d213ba58fc8",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from langchain import OpenAI\n",
    "\n",
    "from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex\n",
    "from llama_index import set_global_service_context\n",
    "from llama_index.response.pprint_utils import pprint_response\n",
    "from llama_index.tools import QueryEngineTool, ToolMetadata\n",
    "from llama_index.query_engine import SubQuestionQueryEngine"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c743f504-f28c-4802-89b6-ad152b74b0eb",
   "metadata": {
    "tags": []
   },
   "source": [
    "Before we start, we can configure the LLM provider and model that will power our RAG system.  \n",
    "Here, we pick `gpt-3.5-turbo-instruct` from OpenAI.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "c4ec8b0a-d5fa-4f74-a2cc-5cc52e009bc6",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "llm = OpenAI(temperature=0, model_name=\"gpt-3.5-turbo-instruct\", max_tokens=-1)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c2810e8c-1c88-49f5-aada-c49eccded166",
   "metadata": {},
   "source": [
    "We construct a `ServiceContext` and set it as the global default, so all subsequent operations that depends on LLM calls will use the model we configured here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "05e016f9-2055-4885-8416-cc3aa2968242",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "service_context = ServiceContext.from_defaults(llm=llm)\n",
    "set_global_service_context(service_context=service_context)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "71fddd07-ff4c-44d4-82af-64e2e416e853",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Data Loading and Indexing"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e3c1a778-fd5a-4249-9704-580c52cacb10",
   "metadata": {
    "tags": []
   },
   "source": [
    "Now, we load and parse 2 PDFs (one for Uber 10-K in 2021 and another for Lyft 10-k in 2021).    \n",
    "Under the hood, the PDFs are converted to plain text `Document` objects, separate by page.  \n",
    "\n",
    "> Note: this operation might take a while to run, since each document is more than 100 pages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "dd0ba028-1e70-4164-8af1-5f1df0ea76a9",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "lyft_docs = SimpleDirectoryReader(input_files=[\"../data/10k/lyft_2021.pdf\"]).load_data()\n",
    "uber_docs = SimpleDirectoryReader(input_files=[\"../data/10k/uber_2021.pdf\"]).load_data()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "d026ef11-ebc5-4ec3-9aab-8e065cd7f8a9",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded lyft 10-K with 238 pages\n",
      "Loaded Uber 10-K with 307 pages\n"
     ]
    }
   ],
   "source": [
    "print(f'Loaded lyft 10-K with {len(lyft_docs)} pages')\n",
    "print(f'Loaded Uber 10-K with {len(uber_docs)} pages')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "fd122d0d-2da6-4f46-aa2a-8a0049ad8694",
   "metadata": {},
   "source": [
    "Now, we can build an (in-memory) `VectorStoreIndex` over the documents that we've loaded.  \n",
    "\n",
    "> Note: this operation might take a while to run, since it calls OpenAI API for computing vector embedding over document chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "1e0b6e4c-2255-42cf-be88-0fe75a945d85",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "lyft_index = VectorStoreIndex.from_documents(lyft_docs)\n",
    "uber_index = VectorStoreIndex.from_documents(uber_docs)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "8f5fdd5b-4c02-43b3-9ac6-cb42003350ca",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Simple QA"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a86b9ba7-e9b9-498d-88f5-c6991c7cafa7",
   "metadata": {
    "tags": []
   },
   "source": [
    "Now we are ready to run some queries against our indices!  \n",
    "To do so, we first configure a `QueryEngine`, which just captures a set of configurations for how we want to query the underlying index.\n",
    "\n",
    "For a `VectorStoreIndex`, the most common configuration to adjust is `similarity_top_k` which controls how many document chunks (which we call `Node` objects) are retrieved to use as context for answering our question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "82466534-c3d8-4619-ab1b-4abcd05c8ba7",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "lyft_engine = lyft_index.as_query_engine(similarity_top_k=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "ff449977-2c7c-433f-b303-ff1d7b66c7b3",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "uber_engine = uber_index.as_query_engine(similarity_top_k=3)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "633d6126-8b94-4653-980f-0946d92df36a",
   "metadata": {
    "tags": []
   },
   "source": [
    "Let's see some queries in action!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "id": "18df061f-238d-4a27-8fd6-1037b0098ae8",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "response = await lyft_engine.aquery('What is the revenue of Lyft in 2021? Answer in millions with page reference')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "id": "0e2ab622-e76f-43b6-aea3-122c8a6946de",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "$3,208.3 million (page 63)\n"
     ]
    }
   ],
   "source": [
    "print(response)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "id": "2e101199-454b-4aca-913b-20c9631909b8",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "response = await uber_engine.aquery('What is the revenue of Uber in 2021? Answer in millions, with page reference')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "id": "82b9cced-f7cf-49e4-965a-ee7c45baae7f",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "$17,455 (page 53)\n"
     ]
    }
   ],
   "source": [
    "print(response)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "ee379703-68d1-4a88-a21b-a6e74e78cc91",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Advanced QA - Compare and Contrast"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "668dffa8-1eb3-4209-913a-ed7debe7bee8",
   "metadata": {
    "tags": []
   },
   "source": [
    "For more complex financial analysis, one often needs to reference multiple documents.  \n",
    "\n",
    "As a example, let's take a look at how to do compare-and-contrast queries over both Lyft and Uber financials.  \n",
    "For this, we build a `SubQuestionQueryEngine`, which breaks down a complex compare-and-contrast query, into simpler sub-questions to execute on respective sub query engine backed by individual indices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "id": "8775650f-b164-478c-8129-9a8e6a0cdc97",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "query_engine_tools = [\n",
    "    QueryEngineTool(\n",
    "        query_engine=lyft_engine, \n",
    "        metadata=ToolMetadata(name='lyft_10k', description='Provides information about Lyft financials for year 2021')\n",
    "    ),\n",
    "    QueryEngineTool(\n",
    "        query_engine=uber_engine, \n",
    "        metadata=ToolMetadata(name='uber_10k', description='Provides information about Uber financials for year 2021')\n",
    "    ),\n",
    "]\n",
    "\n",
    "s_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=query_engine_tools)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "6981caf5-38bb-4d5e-9068-b4874c62bfc9",
   "metadata": {
    "tags": []
   },
   "source": [
    "Let's see these queries in action!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "id": "edd4bbb7-eef9-4b53-b05d-f91033635ac2",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generated 4 sub questions.\n",
      "\u001b[36;1m\u001b[1;3m[uber_10k] Q: What customer segments grew the fastest for Uber\n",
      "\u001b[0m\u001b[36;1m\u001b[1;3m[uber_10k] A: in 2021?\n",
      "\n",
      "The customer segments that grew the fastest for Uber in 2021 were its Mobility Drivers, Couriers, Riders, and Eaters. These segments experienced growth due to the continued stay-at-home order demand related to COVID-19, as well as Uber's introduction of its Uber One, Uber Pass, Eats Pass, and Rides Pass membership programs. Additionally, Uber's marketplace-centric advertising helped to connect merchants and brands with its platform network, further driving growth.\n",
      "\u001b[0m\u001b[33;1m\u001b[1;3m[uber_10k] Q: What geographies grew the fastest for Uber\n",
      "\u001b[0m\u001b[33;1m\u001b[1;3m[uber_10k] A: \n",
      "Based on the context information, it appears that Uber experienced the most growth in large metropolitan areas, such as Chicago, Miami, New York City, Sao Paulo, and London. Additionally, Uber experienced growth in suburban and rural areas, as well as in countries such as Argentina, Germany, Italy, Japan, South Korea, and Spain.\n",
      "\u001b[0m\u001b[38;5;200m\u001b[1;3m[lyft_10k] Q: What customer segments grew the fastest for Lyft\n",
      "\u001b[0m\u001b[38;5;200m\u001b[1;3m[lyft_10k] A: \n",
      "The customer segments that grew the fastest for Lyft were ridesharing, light vehicles, and public transit. Ridesharing grew as Lyft was able to predict demand and proactively incentivize drivers to be available for rides in the right place at the right time. Light vehicles grew as users were looking for options that were more active, usually lower-priced, and often more efficient for short trips during heavy traffic. Public transit grew as Lyft integrated third-party public transit data into the Lyft App to offer users a robust view of transportation options around them.\n",
      "\u001b[0m\u001b[32;1m\u001b[1;3m[lyft_10k] Q: What geographies grew the fastest for Lyft\n",
      "\u001b[0m\u001b[32;1m\u001b[1;3m[lyft_10k] A: \n",
      "It is not possible to answer this question with the given context information.\n",
      "\u001b[0m"
     ]
    }
   ],
   "source": [
    "response = await s_engine.aquery('Compare and contrast the customer segments and geographies that grew the fastest')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "id": "b631d68b-dd17-4afd-9ed7-da0131041c8b",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "The customer segments that grew the fastest for Uber in 2021 were its Mobility Drivers, Couriers, Riders, and Eaters. These segments experienced growth due to the continued stay-at-home order demand related to COVID-19, as well as Uber's introduction of its Uber One, Uber Pass, Eats Pass, and Rides Pass membership programs. Additionally, Uber's marketplace-centric advertising helped to connect merchants and brands with its platform network, further driving growth. Uber experienced the most growth in large metropolitan areas, such as Chicago, Miami, New York City, Sao Paulo, and London. Additionally, Uber experienced growth in suburban and rural areas, as well as in countries such as Argentina, Germany, Italy, Japan, South Korea, and Spain.\n",
      "\n",
      "The customer segments that grew the fastest for Lyft were ridesharing, light vehicles, and public transit. Ridesharing grew as Lyft was able to predict demand and proactively incentivize drivers to be available for rides in the right place at the right time. Light vehicles grew as users were looking for options that were more active, usually lower-priced, and often more efficient for short trips during heavy traffic. Public transit grew as Lyft integrated third-party public transit data into the Lyft App to offer users a robust view of transportation options around them. It is not possible to answer the question of which geographies grew the fastest for Lyft with the given context information.\n",
      "\n",
      "In summary, Uber and Lyft both experienced growth in customer segments related to mobility, couriers, riders, and eaters. Uber experienced the most growth in large metropolitan areas, as well as in suburban and rural areas, and in countries such as Argentina, Germany, Italy, Japan, South Korea, and Spain. Lyft experienced the most growth in ridesharing, light vehicles, and public transit. It is not possible to answer the question of which geographies grew the fastest for Lyft with the given context information.\n"
     ]
    }
   ],
   "source": [
    "print(response)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "id": "6bbbdd5b-0076-48c8-b233-e2ba43d7a6de",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generated 2 sub questions.\n",
      "\u001b[36;1m\u001b[1;3m[uber_10k] Q: What is the revenue growth of Uber from 2020 to 2021\n",
      "\u001b[0m\u001b[36;1m\u001b[1;3m[uber_10k] A: \n",
      "The revenue growth of Uber from 2020 to 2021 was 57%, or 54% on a constant currency basis.\n",
      "\u001b[0m\u001b[33;1m\u001b[1;3m[lyft_10k] Q: What is the revenue growth of Lyft from 2020 to 2021\n",
      "\u001b[0m\u001b[33;1m\u001b[1;3m[lyft_10k] A: \n",
      "The revenue growth of Lyft from 2020 to 2021 is 36%, increasing from $2,364,681 thousand to $3,208,323 thousand.\n",
      "\u001b[0m"
     ]
    }
   ],
   "source": [
    "response = await s_engine.aquery('Compare revenue growth of Uber and Lyft from 2020 to 2021')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "id": "fadf421e-5938-4031-81df-cfbfd347b674",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "The revenue growth of Uber from 2020 to 2021 was 57%, or 54% on a constant currency basis, while the revenue growth of Lyft from 2020 to 2021 was 36%. This means that Uber had a higher revenue growth than Lyft from 2020 to 2021.\n"
     ]
    }
   ],
   "source": [
    "print(response)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}