{
"cells": [
{
"attachments": {
"7b5c5a30-393c-4b27-8fa1-688306ef2aef.png": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABngAAAGCCAYAAADDr81aAAAMQGlDQ1BJQ0MgUHJvZmlsZQAASImVVwdYU8kWnluSkJAQIICAlNCbIFIDSAmhBZBeBBshCRBKjIGgYkcXFVy7iIANXRVR7IDYETuLYu+LBRVlXSzYlTcpoOu+8r35vrnz33/O/OfMuTP33gGAfpwnkeSimgDkiQukcaGBzNEpqUzSU0AEdEAFVkCLx8+XsGNiIgEsA+3fy7vrAJG3VxzlWv/s/69FSyDM5wOAxECcLsjn50G8HwC8mi+RFgBAlPMWkwskcgwr0JHCACFeIMeZSlwtx+lKvFthkxDHgbgVADUqjyfNBEDjEuSZhfxMqKHRC7GzWCASA0BnQuyXlzdRAHEaxLbQRgKxXJ+V/oNO5t800wc1ebzMQayci6KoBYnyJbm8qf9nOv53ycuVDfiwhpWaJQ2Lk88Z5u1mzsQIOaZC3CNOj4qGWBviDyKBwh5ilJIlC0tU2qNG/HwOzBnQg9hZwAuKgNgI4hBxblSkik/PEIVwIYYrBJ0iKuAmQKwP8QJhfnC8ymaDdGKcyhfakCHlsFX8WZ5U4Vfu674sJ5Gt0n+dJeSq9DGNoqyEZIgpEFsWipKiINaA2Ck/Jz5CZTOyKIsTNWAjlcXJ47eEOE4oDg1U6mOFGdKQOJV9aV7+wHyxDVkibpQK7y3ISghT5gdr5fMU8cO5YJeEYnbigI4wf3TkwFwEwqBg5dyxZ0JxYrxK54OkIDBOORanSHJjVPa4uTA3VM6bQ+yWXxivGosnFcAFqdTHMyQFMQnKOPGibF54jDIefCmIBBwQBJhABms6mAiygai9p7EH3il7QgAPSEEmEAJHFTMwIlnRI4bXeFAE/oRICPIHxwUqeoWgEPJfB1nl1RFkKHoLFSNywBOI80AEyIX3MsUo8aC3JPAYMqJ/eOfByofx5sIq7//3/AD7nWFDJlLFyAY8MukDlsRgYhAxjBhCtMMNcT/cB4+E1wBYXXAW7jUwj+/2hCeEDsJDwjVCJ+HWBFGx9KcoR4FOqB+iykX6j7nAraGmOx6I+0J1qIzr4YbAEXeDfti4P/TsDlmOKm55Vpg/af9tBj88DZUd2ZmMkoeQA8i2P4/UsNdwH1SR5/rH/ChjTR/MN2ew52f/nB+yL4BtxM+W2AJsH3YGO4Gdww5jjYCJHcOasDbsiBwPrq7HitU14C1OEU8O1BH9w9/Ak5VnMt+5zrnb+Yuyr0A4Rf6OBpyJkqlSUWZWAZMNvwhCJlfMdxrGdHF2cQVA/n1Rvr7exCq+G4he23du7h8A+B7r7+8/9J0LPwbAHk+4/Q9+52xZ8NOhDsDZg3yZtFDJ4fILAb4l6HCnGQATYAFs4XxcgAfwAQEgGISDaJAAUsB4GH0WXOdSMBlMB3NACSgDS8EqUAnWg01gG9gJ9oJGcBicAKfBBXAJXAN34OrpAi9AL3gHPiMIQkJoCAMxQEwRK8QBcUFYiB8SjEQicUgKkoZkImJEhkxH5iJlyHKkEtmI1CJ7kIPICeQc0oHcQh4g3chr5BOKoVRUBzVGrdHhKAtloxFoAjoOzUQnoUXoPHQxWoHWoDvQBvQEegG9hnaiL9A+DGDqmB5mhjliLIyDRWOpWAYmxWZipVg5VoPVY83wOV/BOrEe7CNOxBk4E3eEKzgMT8T5+CR8Jr4Ir8S34Q14K34Ff4D34t8INIIRwYHgTeASRhMyCZMJJYRywhbCAcIpuJe6CO+IRKIe0YboCfdiCjGbOI24iLiWuIt4nNhBfETsI5FIBiQHki8pmsQjFZBKSGtIO0jHSJdJXaQPaupqpmouaiFqqWpitWK1crXtakfVLqs9VftM1iRbkb3J0WQBeSp5CXkzuZl8kdxF/kzRothQfCkJlGzKHEoFpZ5yinKX8kZdXd1c3Us9Vl2kPlu9Qn23+ln1B+ofqdpUeyqHOpYqoy6mbqUep96ivqHRaNa0AFoqrYC2mFZLO0m7T/ugwdBw0uBqCDRmaVRpNGhc1nhJJ9Ot6Gz6eHoRvZy+j36R3qNJ1rTW5GjyNGdqVmke1Lyh2afF0BqhFa2Vp7VIa7vWOa1n2iRta+1gbYH2PO1N2ie1HzEwhgWDw+Az5jI2M04xunSIOjY6XJ1snTKdnTrtOr262rpuukm6U3SrdI/oduphetZ6XL1cvSV6e/Wu630aYjyEPUQ4ZOGQ+iGXh7zXH6ofoC/UL9XfpX9N/5MB0yDYIMdgmUGjwT1D3NDeMNZwsuE6w1OGPUN1hvoM5Q8tHbp36G0j1MjeKM5omtEmozajPmMT41BjifEa45PGPSZ6JgEm2SYrTY6adJsyTP1MRaYrTY+ZPmfqMtnMXGYFs5XZa2ZkFmYmM9to1m722dzGPNG82HyX+T0LigXLIsNipUWLRa+lqeUoy+mWdZa3rchWLKssq9VWZ6zeW9tYJ1vPt260fmajb8O1KbKps7lrS7P1t51kW2N71Y5ox7LLsVtrd8ketXe3z7Kvsr/ogDp4OIgc1jp0DCMM8xomHlYz7IYj1ZHtWOhY5/jASc8p0qnYqdHp5XDL4anDlw0/M/ybs7tzrvNm5zsjtEeEjyge0TzitYu9C9+lyuWqK801xHWWa5PrKzcHN6HbOreb7gz3Ue7z3Vvcv3p4ekg96j26PS090zyrPW+wdFgxrEWss14Er0CvWV6HvT56e3gXeO/1/svH0SfHZ7vPs5E2I4UjN4985Gvuy/Pd6Nvpx/RL89vg1+lv5s/zr/F/GGARIAjYEvCUbcfOZu9gvwx0DpQGHgh8z/HmzOAcD8KCQoNKg9qDtYMTgyuD74eYh2SG1IX0hrqHTgs9HkYIiwhbFnaDa8zlc2u5veGe4TPCWyOoEfERlREPI+0jpZHNo9BR4aNWjLobZRUljmqMBtHc6BXR92JsYibFHIolxsbEVsU+iRsRNz3uTDwjfkL89vh3CYEJSxLuJNomyhJbkuhJY5Nqk94nByUvT+4cPXz0jNEXUgxTRClNqaTUpNQtqX1jgsesGtM11n1sydjr42zGTRl3brzh+NzxRybQJ/Am7EsjpCWnbU/7wovm1fD60rnp1em9fA5/Nf+FIECwUtAt9BUuFz7N8M1YnvEs0zdzRWZ3ln9WeVaPiCOqFL3KDsten/0+Jzpna05/bnLurjy1vLS8g2JtcY64daLJxCkTOyQOkhJJ5yTvSasm9UojpFvykfxx+U0FOvBHvk1mK/tF9qDQr7Cq8MPkpMn7pmhNEU9pm2o/deHUp0UhRb9Nw6fxp7VMN5s+Z/qDGewZG2ciM9NntsyymDVvVtfs0Nnb5lDm5Mz5vdi5eHnx27nJc5vnGc+bPe/RL6G/1JVolEhLbsz3mb9+Ab5AtKB9oevCNQu/lQpKz5c5l5WXfVnEX3T+1xG/VvzavzhjcfsSjyXrlhKXipdeX+a/bNtyreVFyx+tGLWiYSVzZenKt6smrDpX7la+fjVltWx1Z0VkRdMayzVL13ypzKq8VhVYtavaqHph9fu1grWX1wWsq19vvL5s/
acNog03N4ZubKixrinfRNxUuOnJ5qTNZ35j/Va7xXBL2ZavW8VbO7fFbWut9ayt3W60fUkdWier694xdselnUE7m+od6zfu0ttVthvslu1+vidtz/W9EXtb9rH21e+32l99gHGgtAFpmNrQ25jV2NmU0tRxMPxgS7NP84FDToe2HjY7XHVE98iSo5Sj8472Hys61ndccrznROaJRy0TWu6cHH3yamtsa/upiFNnT4ecPnmGfebYWd+zh895nzt4nnW+8YLHhYY297YDv7v/fqDdo73houfFpktel5o7RnYcvex/+cSVoCunr3KvXrgWda3jeuL1mzfG3ui8Kbj57FburVe3C29/vjP7LuFu6T3Ne+X3je7X/GH3x65Oj84jD4IetD2Mf3jnEf/Ri8f5j790zXtCe1L+1PRp7TOXZ4e7Q7ovPR/zvOuF5MXnnpI/tf6sfmn7cv9fAX+19Y7u7XolfdX/etEbgzdb37q9bemL6bv/Lu/d5/elHww+bPvI+njmU/Knp58nfyF9qfhq97X5
}
},
"cell_type": "markdown",
"id": "b6d466cc-aa8b-4baf-a80a-fef01921ca8d",
"metadata": {},
"source": [
"## Semi-structured RAG\n",
2023-10-13 15:45:54 +00:00
"\n",
"Many documents contain a mixture of content types, including text and tables. \n",
"\n",
"Semi-structured data can be challenging for conventional RAG for at least two reasons: \n",
"\n",
"* Text splitting may break up tables, corrupting the data in retrieval\n",
"* Embedding tables may pose challenges for semantic similarity search \n",
"\n",
"This cookbook shows how to perform RAG on documents with semi-structured data: \n",
"\n",
"* We will use [Unstructured](https://unstructured.io/) to parse both text and tables from documents (PDFs).\n",
"* We will use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) to store raw tables, text along with table summaries better suited for retrieval.\n",
"* We will use [LCEL](https://python.langchain.com/docs/expression_language/) to implement the chains used.\n",
"\n",
"The overall flow is here:\n",
"\n",
"![MVR.png](attachment:7b5c5a30-393c-4b27-8fa1-688306ef2aef.png)\n",
"\n",
"## Packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5740fc70-c513-4ff4-9d72-cfc098f85fef",
"metadata": {},
"outputs": [],
"source": [
"! pip install langchain unstructured[all-docs] pydantic lxml langchainhub"
]
},
{
"cell_type": "markdown",
"id": "44349a83-e1dc-4eed-ba75-587f309d8c88",
"metadata": {},
"source": [
"The PDF partitioning used by Unstructured will use: \n",
"\n",
"* `tesseract` for Optical Character Recognition (OCR)\n",
"* `poppler` for PDF rendering and processing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f7880871-4949-4ea2-aed8-540a09188a41",
"metadata": {},
"outputs": [],
"source": [
"! brew install tesseract\n",
"! brew install poppler"
]
},
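{
"cell_type": "markdown",
"id": "1a7c2f3e-9b4d-4c6a-8e2f-5d0b1c3a4e6f",
"metadata": {},
"source": [
"The `brew` commands above assume macOS. On Debian/Ubuntu, the equivalent is likely the following (package names can vary by distribution):\n",
"\n",
"```\n",
"! sudo apt-get install tesseract-ocr\n",
"! sudo apt-get install poppler-utils\n",
"```"
]
},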
{
"cell_type": "markdown",
"id": "7c24efa9-b6f6-4dc2-bfe3-70819ba3ef75",
"metadata": {},
"source": [
"## Data Loading\n",
"\n",
"### Partition PDF tables and text\n",
"\n",
"We will apply this to the [`LLaMA2`](https://arxiv.org/pdf/2307.09288.pdf) paper. \n",
"\n",
"We use Unstructured's [`partition_pdf`](https://unstructured-io.github.io/unstructured/core/partition.html#partition-pdf), which segments a PDF document using a layout model. \n",
"\n",
"This layout model makes it possible to extract elements, such as tables, from PDFs. \n",
"\n",
"We can also use `Unstructured` chunking, which:\n",
"\n",
"* Tries to identify document sections (e.g., Introduction)\n",
"* Then builds text blocks that keep sections intact while also honoring user-defined chunk sizes"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "62cf502b-407d-4645-a72c-24498fd55130",
"metadata": {},
"outputs": [],
"source": [
"path = \"/Users/rlm/Desktop/Papers/LLaMA2/\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3867a654-61ba-4759-9a64-de953a429ced",
"metadata": {},
"outputs": [],
"source": [
"from typing import Any\n",
"\n",
"from pydantic import BaseModel\n",
"from unstructured.partition.pdf import partition_pdf\n",
"\n",
"# Get elements\n",
"raw_pdf_elements = partition_pdf(\n",
" filename=path + \"LLaMA2.pdf\",\n",
" # Unstructured first finds embedded image blocks\n",
" extract_images_in_pdf=False,\n",
" # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles\n",
" # Titles are any sub-section of the document\n",
" infer_table_structure=True,\n",
" # Post processing to aggregate text once we have the title\n",
" chunking_strategy=\"by_title\",\n",
" # Chunking params to aggregate text blocks\n",
" # Attempt to create a new chunk 3800 chars\n",
" # Attempt to keep chunks > 2000 chars\n",
" max_characters=4000,\n",
" new_after_n_chars=3800,\n",
" combine_text_under_n_chars=2000,\n",
" image_output_dir_path=path,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "b09cd727-aeab-49af-8a51-0dc377321e7c",
"metadata": {},
"source": [
"We can examine the elements extracted by `partition_pdf`.\n",
"\n",
"`CompositeElement` are aggregated chunks."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "628abfc6-4057-434b-b880-d88e3ba44657",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{\"<class 'unstructured.documents.elements.CompositeElement'>\": 184,\n",
" \"<class 'unstructured.documents.elements.Table'>\": 47,\n",
" \"<class 'unstructured.documents.elements.TableChunk'>\": 2}"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a dictionary to store counts of each type\n",
"category_counts = {}\n",
"\n",
"for element in raw_pdf_elements:\n",
" category = str(type(element))\n",
" if category in category_counts:\n",
" category_counts[category] += 1\n",
" else:\n",
" category_counts[category] = 1\n",
"\n",
"# Unique_categories will have unique elements\n",
"unique_categories = set(category_counts.keys())\n",
"category_counts"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "5462f29e-fd59-4e0e-9493-ea3b560e523e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"49\n",
"184\n"
]
}
],
"source": [
"class Element(BaseModel):\n",
" type: str\n",
" text: Any\n",
"\n",
"\n",
"# Categorize by type\n",
"categorized_elements = []\n",
"for element in raw_pdf_elements:\n",
" if \"unstructured.documents.elements.Table\" in str(type(element)):\n",
" categorized_elements.append(Element(type=\"table\", text=str(element)))\n",
" elif \"unstructured.documents.elements.CompositeElement\" in str(type(element)):\n",
" categorized_elements.append(Element(type=\"text\", text=str(element)))\n",
"\n",
"# Tables\n",
"table_elements = [e for e in categorized_elements if e.type == \"table\"]\n",
"print(len(table_elements))\n",
"\n",
"# Text\n",
"text_elements = [e for e in categorized_elements if e.type == \"text\"]\n",
"print(len(text_elements))"
]
},
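{
"cell_type": "markdown",
"id": "2b8d3e4f-0c5e-4d7b-9f30-6e1c2d4b5f70",
"metadata": {},
"source": [
"Since we set `infer_table_structure=True`, each `Table` element should also carry an HTML rendering of the table in its metadata. A sketch of collecting it (assuming the `text_as_html` field is populated in your `unstructured` version):\n",
"\n",
"```\n",
"# HTML renderings of the parsed tables; an HTML table can sometimes be\n",
"# a better LLM input than the flattened text\n",
"tables_html = [\n",
"    str(el.metadata.text_as_html)\n",
"    for el in raw_pdf_elements\n",
"    if \"unstructured.documents.elements.Table\" in str(type(el))\n",
"]\n",
"```"
]
},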
{
"cell_type": "markdown",
"id": "731b3dfc-7ddf-4a11-9a30-9a79b7c66e16",
"metadata": {},
"source": [
"## Multi-vector retriever\n",
"\n",
"Use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) to produce summaries of tables and, optionally, text. \n",
"\n",
"Alongside each summary, we will also store the raw table or text element.\n",
"\n",
"The summaries are used to improve the quality of retrieval, [as explained in the multi-vector retriever docs](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector).\n",
"\n",
"The raw tables are then passed to the LLM, providing the full table context for answer generation. \n",
"\n",
"### Summaries"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "8e275736-3408-4d7a-990e-4362c88e81f8",
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_openai import ChatOpenAI"
]
},
{
"cell_type": "markdown",
"id": "37b65677-aeb4-44fd-b06d-4539341ede97",
"metadata": {},
"source": [
"We create a simple summarize chain for each element.\n",
"\n",
"You can also see, re-use, or modify the prompt in the Hub [here](https://smith.langchain.com/hub/rlm/multi-vector-retriever-summarization).\n",
"\n",
"```\n",
"from langchain import hub\n",
"obj = hub.pull(\"rlm/multi-vector-retriever-summarization\")\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "1b12536a-1303-41ad-9948-4eb5a5f32614",
"metadata": {},
"outputs": [],
"source": [
"# Prompt\n",
"prompt_text = \"\"\"You are an assistant tasked with summarizing tables and text. \\ \n",
"Give a concise summary of the table or text. Table or text chunk: {element} \"\"\"\n",
"prompt = ChatPromptTemplate.from_template(prompt_text)\n",
"\n",
"# Summary chain\n",
"model = ChatOpenAI(temperature=0, model=\"gpt-4\")\n",
"summarize_chain = {\"element\": lambda x: x} | prompt | model | StrOutputParser()"
]
},
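{
"cell_type": "markdown",
"id": "3c9e4f50-1d6f-4e8c-a041-7f2d3e5c6081",
"metadata": {},
"source": [
"As a quick sanity check, the chain can be invoked on a single toy input before running it over all elements (illustrative; requires `OPENAI_API_KEY` to be set):\n",
"\n",
"```\n",
"summarize_chain.invoke(\"The LLaMA2 models were trained on 2.0T tokens of publicly available data.\")\n",
"```"
]
},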
{
"cell_type": "code",
"execution_count": null,
"id": "8d8b567c-b442-4bf0-b639-04bd89effc62",
"metadata": {},
"outputs": [],
"source": [
"# Apply to tables\n",
"tables = [i.text for i in table_elements]\n",
"table_summaries = summarize_chain.batch(tables, {\"max_concurrency\": 5})"
]
},
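{
"cell_type": "markdown",
"id": "4daf5061-2e70-4f9d-b152-803e4f6d7192",
"metadata": {},
"source": [
"A quick spot-check pairing one raw table with its generated summary (illustrative; the summary text depends on the model):\n",
"\n",
"```\n",
"# Compare a raw table string with its summary\n",
"print(tables[0][:300])\n",
"print(\"---\")\n",
"print(table_summaries[0])\n",
"```"
]
},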
{
"cell_type": "code",
"execution_count": 26,
"id": "3e9c176c-3d46-4034-b169-0d7305d42d27",
"metadata": {},
"outputs": [],
"source": [
"# Apply to texts\n",
"texts = [i.text for i in text_elements]\n",
"text_summaries = summarize_chain.batch(texts, {\"max_concurrency\": 5})"
]
},
{
"cell_type": "markdown",
"id": "60524010-754f-4924-ad75-78cb54ca7257",
"metadata": {},
"source": [
"### Add to vectorstore\n",
"\n",
"Use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector#summary) with summaries: \n",
"\n",
"* `InMemoryStore` stores the raw text and tables\n",
"* `vectorstore` stores the embedded summaries"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "346c3a02-8fea-4f75-a69e-fc9542b99dbc",
"metadata": {},
"outputs": [],
"source": [
"import uuid\n",
"\n",
"from langchain.retrievers.multi_vector import MultiVectorRetriever\n",
"from langchain.storage import InMemoryStore\n",
"from langchain_community.vectorstores import Chroma\n",
"from langchain_core.documents import Document\n",
"from langchain_openai import OpenAIEmbeddings\n",
"\n",
"# The vectorstore to use to index the child chunks\n",
"vectorstore = Chroma(collection_name=\"summaries\", embedding_function=OpenAIEmbeddings())\n",
"\n",
"# The storage layer for the parent documents\n",
"store = InMemoryStore()\n",
"id_key = \"doc_id\"\n",
"\n",
"# The retriever (empty to start)\n",
"retriever = MultiVectorRetriever(\n",
" vectorstore=vectorstore,\n",
" docstore=store,\n",
" id_key=id_key,\n",
")\n",
"\n",
"# Add texts\n",
"doc_ids = [str(uuid.uuid4()) for _ in texts]\n",
"summary_texts = [\n",
" Document(page_content=s, metadata={id_key: doc_ids[i]})\n",
" for i, s in enumerate(text_summaries)\n",
"]\n",
"retriever.vectorstore.add_documents(summary_texts)\n",
"retriever.docstore.mset(list(zip(doc_ids, texts)))\n",
"\n",
"# Add tables\n",
"table_ids = [str(uuid.uuid4()) for _ in tables]\n",
"summary_tables = [\n",
" Document(page_content=s, metadata={id_key: table_ids[i]})\n",
" for i, s in enumerate(table_summaries)\n",
"]\n",
"retriever.vectorstore.add_documents(summary_tables)\n",
"retriever.docstore.mset(list(zip(table_ids, tables)))"
]
},
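{
"cell_type": "markdown",
"id": "5eb06172-3f81-4a0e-9263-914f507e82a3",
"metadata": {},
"source": [
"Before wiring up the full chain, we can sanity-check retrieval directly. Because we stored raw strings in the docstore, the retriever returns strings rather than `Document` objects here (the query below is just an example):\n",
"\n",
"```\n",
"# Fetch the raw parent content for a table-related question\n",
"retrieved = retriever.get_relevant_documents(\"How many tokens was LLaMA2 trained on?\")\n",
"print(retrieved[0][:300])\n",
"```"
]
},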
{
"cell_type": "markdown",
"id": "1d8bbbd9-009b-4b34-a206-5874a60adbda",
"metadata": {},
"source": [
"## RAG\n",
"\n",
"Run [RAG pipeline](https://python.langchain.com/docs/expression_language/cookbook/retrieval)."
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "f2489de4-51e3-48b4-bbcd-ed9171deadf3",
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.runnables import RunnablePassthrough\n",
"\n",
"# Prompt template\n",
"template = \"\"\"Answer the question based only on the following context, which can include text and tables:\n",
"{context}\n",
"Question: {question}\n",
"\"\"\"\n",
"prompt = ChatPromptTemplate.from_template(template)\n",
"\n",
"# LLM\n",
"model = ChatOpenAI(temperature=0, model=\"gpt-4\")\n",
"\n",
"# RAG pipeline\n",
"chain = (\n",
" {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
" | prompt\n",
" | model\n",
" | StrOutputParser()\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "90e3d100-10e8-4ee6-ae46-2480b1524ec8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'The number of training tokens for LLaMA2 is 2.0T.'"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chain.invoke(\"What is the number of training tokens for LLaMA2?\")"
]
},
{
"cell_type": "markdown",
"id": "37f46054-e239-4ba8-af81-22d0d6a9bc32",
"metadata": {},
"source": [
"We can check the [trace](https://smith.langchain.com/public/4739ae7c-1a13-406d-bc4e-3462670ebc01/r) to see what chunks were retrieved:\n",
"\n",
"This includes Table 1 of the paper, showing the Tokens used for training.\n",
"\n",
"```\n",
"Training Data Params Context GQA Tokens LR Length 7B 2k 1.0T 3.0x 10-4 See Touvron et al. 13B 2k 1.0T 3.0 x 10-4 LiaMa 1 (2023) 33B 2k 14T 1.5 x 10-4 65B 2k 1.4T 1.5 x 10-4 7B 4k 2.0T 3.0x 10-4 Liama 2 A new mix of publicly 13B 4k 2.0T 3.0 x 10-4 available online data 34B 4k v 2.0T 1.5 x 10-4 70B 4k v 2.0T 1.5 x 10-4\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}