{
"cells": [
{
"attachments": {
"semantic-chunking-rag.png": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAACkYAAAWjCAYAAACHMB+WAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAAEnQAABJ0Ad5mH3gAAABlaVRYdFNuaXBNZXRhZGF0YQAAAAAAeyJjbGlwUG9pbnRzIjpbeyJ4IjowLCJ5IjowfSx7IngiOjI2MzAsInkiOjB9LHsieCI6MjYzMCwieSI6MTQ0NH0seyJ4IjowLCJ5IjoxNDQ0fV19hYXkiwAA/zRJREFUeF7s/QmwLNd933n+q+ru29t3POw7Qd4nEqBEa+vWQou25U3tlkaWRpow2gHZGkeMO0RMB9tj2Q5GB6i2HOGQbYQbjpDHsqfV7Xa3gxOhMG25e6LbGjVBSSCHTVELAYqSQIIUARDrW+69NfWrqgOcl8jl5FqZWd8PkK/q1pJ5tjx5MvNfmYP9/f2xAQAAAAAAAAAAAAAAAAAA9MBw/ggAAAAAAAAAAAAAAAAAANB5BEYCAAAAAAAAAAAAAAAAAIDeIDASAAAAAAAAAAAAAAAAAAD0BoGRAAAAAAAAAAAAAAAAAACgNwiMBAAAAAAAAAAAAAAAAAAAvUFgJAAAAAAAAAAAAAAAAAAA6A0CIwEAAAAAAAAAAAAAAAAAQG8QGAkAAAAAAAAAAAAAAAAAAHqDwEgAAAAAAAAAAAAAAAAAANAbBEYCAAAAAAAAAAAAAAAAAIDeIDASAAAAAAAAAAAAAAAAAAD0BoGRAAAAAAAAAAAAAAAAAACgNwiMBAAAAAAAAAAAAAAAAAAAvUFgJAAAAAAAAAAAAAAAAAAA6A0CIwEAAAAAAAAAAAAAAAAAQG8QGAkAAAAAAAAAAAAAAAAAAHqDwEgAAAAAAAAAAAAAAAAAANAbBEYCAAAAAAAAAAAAAAAAAIDeIDASAAAAAAAAAAAAAAAAAAD0BoGRAAAAAAAAAAAAAAAAAACgNwiMBAAAAAAAAAAAAAAAAAAAvUFgJAAAAAAAAAAAAAAAAAAA6A0CIwEAAAAAAAAAAAAAAAAAQG8QGAkAAAAAAAAAAAAAAAAAAHqDwEgAAAAAAAAAAAAAAAAAANAbBEYCAAAAAAAAAAAAAAAAAIDeIDASAAAAAAAAAAAAAAAAAAD0BoGRAAAAAAAAAAAAAAAAAACgNwiMBAAAAAAAAAAAAAAAAAAAvUFgJAAAAAAAAAAAAAAAAAAA6A0CIwEAAAAAAAAAAAAAAAAAQG8QGAkAHbbzvX9mOgFAm9FXwaEtAO3AuggAAAAA3cZ+HQAAAJBtsL+/P54/BwB0hA54nPnrf2v+18y13/mcvfLxX7TX/u2/nr+CPrjjl56ZP0v23IeuzJ9hWZz+v/207X7wz87/ivfqJ/4n+6O/99PzvxaDvgoObQFoB9ZFZGHsCQDoIrZfAJbJxb//L2z9ngfnf8204TggAAAA0EZcMRIAOkYBUdET2qKDIXpdB0YAYNHoq+DQFoB2YF0EAAAAgG5TIHg0KFL0A2q9xxUkAQAAgJtxxcie0M7OxkPfZGt33Dt/ZXaCK4muCiLXn/vt6ePVz/4GVwgBOiDuKj9xvvazf5N1ugaurxX1tyH9rKivLdrPctUDxGn7FSPpq+DQFoB2YF1EKMaeAIAuWqbtV/TYlOQ5PiVcVQ7oprgrRUZpnX/+r/3w/C8AuJnOKwjxBACAZUJgZIdp8JIVmJOXgigY1ADtFRIMJRwAqYY72BxS5qFUN9qJDD0IzclpxGl7YCR9FRzaQn1Ctg95RQ92csK0P1gXEarPY8/Q9UAYXwNAt/T92Ekdx6ck7zEqAIsVehyAH7whKi2oNu04wLIee+rbvqPyU3U8gYTGFGT9WHeR5zEAAMuBW2l3kAYwGoxqUFb1IEbzDLmSCIDF8H/FhfpoR00HC9QfVn3QWf225ql+nFtWoq/oq+DQFrpF2yi3nfK3Vdr/QLexLgKz4x2h6PcAAIumY1PuPEAdx6ckeoxKywTQTqyfKCotKFJe+fgvzp81owvHnvqw76h0qVxVvspPWhsoSvPVGEXLSCsHBU4qYDuJ5kMfBwCoE4GRHaJBgRvAAACq5/pZ7czVsaMIAEDXuIPVBEgC6LK8J1k47gIAWBRtszTurisYMonG/e4W3QCAftA2Je08R1uuLtqmY09d33d0AZFKV5PnuLLKQe1MV4ZMonFP3rIHACAUgZEdoUGMBgUAuskd1HRT3dxtCBCOfhZoHn0VHNpC+/kHqZeZG8v1tRxYF9FXRQI9mthvA7B8+j6WQDlqG00HRAKoV9PnBYTbaMNJO9+hW1q3ra204dhTV/cd1deo3JoOiMxDt8t2t1KPw/k5AEBdCIzsAA1k2jqIAZDMHfRwVyDUDknZA5uhtzW4+tnfmD9DFtWT6oh+FqgOfRUc2kL/aHup7aa2n8tC4zntk7mr97f5IHMS1kUsuyL7YdyCHkBV+jCWQL38YAYA3VfHeQEFsKUFFDkhn8Fy0HYlzfN/7Yfnz9pnkceeiqyri953VH+jvqYL48us41NZ7RYAgCIIjGw5DQA4UAZ0R9xBjyrpAIhub5CmLbc/6AK3wwigWvRVcGgL/aXtZ5+DI6MBDF3fJ2NdxDIr2ldxLAZAGX0bS6A+2k51JZgBQLK6zwuIAtnSAh/1XpuD3dActcW07UrW8YG2aPrYUxf3HTXerKO/qYuOO6XdUltl2WSdAwCWA4GRLVY2KFI7QXETgProMvt174S4E9vR9Vk7E5zQDte1HUaga+ir4NAW+mvv+39w/qx/+hjAwLqIZVXmh1A6oQgARfRxLIHquaBIAN3XxHkBUeBjNKhI+3h6jaBIOGltUe2lS/v/TR576tq+Y9k4gkXRLbXTMDYCAFRtsL+/P54/R4sUOSiiwez15347c0DhaBnaWZPoIPm5D12ZPwOQh3Z+sg6A6CBF6HqKelQReK7+1r/dZPRggt/H6lYKccvTfLIOWOlXxlnos5cPfQ2AkO1D9GRJEnfLnyLbxr72NWx/saz61vbLBpyEjNcBIA5jiWZ1sbzLbqM0DnfHpbICXNwxqqTjUw7HEYDiOFaHtsg695Fne7hMx566tu9Y5hyXf34rZAwhIcHfedpWVnnTXwIAqkRgZEvlGdBoAPPKx3+x1C983MERN6jhwBxQDAdA2i+kjuJU0ddq2f5B6JCdZU6mIA59DYCQ7UORqwAW2U72cTvE9hfLqm9tP6tP03g869gL6zqAIhhLNKtr5V00+KKKY1NO9BiVcBwBKI5jdWiDrO1LyPkI3zIde+rSvmPRoMiq+qBoTIGTN/9Z+eDuJgCAqnAr7ZYKHdC4QWzZgYG+r8GQBi0aGAFAH2mHLe8Ot/pZ7YBV0deqn9V8XF+rX+UBANAm2lZpu6ftXygdPAaANsoa+yu4JAt9HACgSkWCIqs8NuW4Y1Q6PpVn7A8AaK+s206r32+DNh576sq+Y5GgSG3rdU5K5V4FF1OgOiwTU5BVpu6ObAAAlEVgZAuFDpw0YKxjEFvVwAgA2qTIgeeqDzr71NfS3wIA2kjbvTz7Ge52SADQJ
hr/Z1F/l3UyLu8PqwAASNOmY1OiY1OaPxdLAIBu07nltIC5tvXzbTr21JV9x6w6jlJ6qwyIjFKZaN55g1ydrDJVeYbUDQAAWQiMbKHQwV2eASMALLusX0tGaWeuroPOAAB0gbaFIfL+Uh0AmpB1dQl3YjDkKu5NXPkDANB/uspTHk0em+IHvADQbVnnltvax7fh2FMX9h0VIJgn8FIBh03FEWisUnRZXDUSANAEAiM7qsgvLwBgWWmnMc+OM0GRAACE/Rre4RfcANom66TR1c/+xvQx5AQhV8YFAJTFsSkAQF2ytjFtPqfchmNPXdh3zHPhjyaDIsvKGutw1UgAQBUIjGwhrrgCANXKc5siDjwDAPC2kF/DA0DbhFylwx/zZ52I4zgNAKAsjk0BAOqSFTTX9iC5RR576sK+Y55baHcpKNLJumooV40EAJQ12N/fH8+foyXu+KVn5s+SdXFgIxq8uV/LxA3ilC8NgPXrmzYd/ElLd1VpbmIZofTrGzfQVJqSBtxKlyhtbbsMv8tDXPrr
}
},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Retrieval Augmented Generation (RAG)\n",
"\n",
"This notebook demonstrates an example of using [LangChain](https://www.langchain.com/) to delvelop a Retrieval Augmented Generation (RAG) pattern. It uses Azure AI Document Intelligence as document loader, which can extracts tables, paragraphs, and layout information from pdf, image, office and html files. The output markdown can be used in LangChain's markdown header splitter, which enables semantic chunking of the documents. Then the chunked documents are indexed into Azure AI Search vectore store. Given a user query, it will use Azure AI Search to get the relevant chunks, then feed the context into the prompt with the query to generate the answer.\n",
"\n",
"![semantic-chunking-rag.png](attachment:semantic-chunking-rag.png)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"- An Azure AI Document Intelligence resource in one of the 3 preview regions: **East US**, **West US2**, **West Europe** - follow [this document](https://learn.microsoft.com/azure/ai-services/document-intelligence/create-document-intelligence-resource?view=doc-intel-4.0.0) to create one if you don't have.\n",
"- An Azure AI Search resource - follow [this document](https://learn.microsoft.com/azure/search/search-create-service-portal) to create one if you don't have.\n",
"- An Azure OpenAI resource and deployments for embeddings model and chat model - follow [this document](https://learn.microsoft.com/azure/ai-services/openai/how-to/create-resource?pivots=web-portal) to create one if you don't have.\n",
"\n",
"Well use an Azure OpenAI chat model and embeddings and Azure AI Search in this walkthrough, but everything shown here works with any ChatModel or LLM, Embeddings, and VectorStore or Retriever."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install python-dotenv langchain langchain-community langchain-openai langchainhub openai tiktoken azure-ai-documentintelligence azure-identity azure-search-documents==11.4.0b8"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"This code loads environment variables using the `dotenv` library and sets the necessary environment variables for Azure services.\n",
"The environment variables are loaded from the `.env` file in the same directory as this notebook.\n",
"\"\"\"\n",
"import os\n",
"\n",
"from dotenv import load_dotenv\n",
"\n",
"load_dotenv()\n",
"\n",
"os.environ[\"AZURE_OPENAI_ENDPOINT\"] = os.getenv(\"AZURE_OPENAI_ENDPOINT\")\n",
"os.environ[\"AZURE_OPENAI_API_KEY\"] = os.getenv(\"AZURE_OPENAI_API_KEY\")\n",
"doc_intelligence_endpoint = os.getenv(\"AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT\")\n",
"doc_intelligence_key = os.getenv(\"AZURE_DOCUMENT_INTELLIGENCE_KEY\")"
]
},
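{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (a minimal sketch, not part of the original walkthrough), the cell below lists any of the environment variables read in this notebook - including the Azure AI Search variables used further down - that were not found after loading the `.env` file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: report any expected environment variables that were\n",
"# not loaded from the .env file. The names below are the ones read in this notebook.\n",
"required_vars = [\n",
" \"AZURE_OPENAI_ENDPOINT\",\n",
" \"AZURE_OPENAI_API_KEY\",\n",
" \"AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT\",\n",
" \"AZURE_DOCUMENT_INTELLIGENCE_KEY\",\n",
" \"AZURE_SEARCH_ENDPOINT\",\n",
" \"AZURE_SEARCH_ADMIN_KEY\",\n",
"]\n",
"missing = [name for name in required_vars if not os.getenv(name)]\n",
"print(\"Missing environment variables:\", missing)"
]
},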
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from langchain import hub\n",
"from langchain.schema import StrOutputParser\n",
"from langchain.schema.runnable import RunnablePassthrough\n",
"from langchain.text_splitter import MarkdownHeaderTextSplitter\n",
"from langchain.vectorstores.azuresearch import AzureSearch\n",
"from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader\n",
"from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load a document and split it into semantic chunks"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Initiate Azure AI Document Intelligence to load the document. You can either specify file_path or url_path to load the document.\n",
"loader = AzureAIDocumentIntelligenceLoader(\n",
" file_path=\"<path to your file>\",\n",
" api_key=doc_intelligence_key,\n",
" api_endpoint=doc_intelligence_endpoint,\n",
" api_model=\"prebuilt-layout\",\n",
")\n",
"docs = loader.load()\n",
"\n",
"# Split the document into chunks base on markdown headers.\n",
"headers_to_split_on = [\n",
" (\"#\", \"Header 1\"),\n",
" (\"##\", \"Header 2\"),\n",
" (\"###\", \"Header 3\"),\n",
"]\n",
"text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
"\n",
"docs_string = docs[0].page_content\n",
"splits = text_splitter.split_text(docs_string)\n",
"\n",
"print(\"Length of splits: \" + str(len(splits)))"
]
},
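{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each chunk produced by `MarkdownHeaderTextSplitter` carries the headers it was split under in its metadata, which is what makes the chunking semantic. The optional cell below (a small sketch; adjust the index as needed) inspects one chunk."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inspect one chunk: metadata holds the section headers it was split under,\n",
"# and page_content holds the markdown text of that section.\n",
"print(splits[0].metadata)\n",
"print(splits[0].page_content[:500])"
]
},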
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Embed and index the chunks"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Embed the splitted documents and insert into Azure Search vector store\n",
"\n",
"aoai_embeddings = AzureOpenAIEmbeddings(\n",
" azure_deployment=\"<Azure OpenAI embeddings model>\",\n",
" openai_api_version=\"<Azure OpenAI API version>\", # e.g., \"2023-07-01-preview\"\n",
")\n",
"\n",
"vector_store_address: str = os.getenv(\"AZURE_SEARCH_ENDPOINT\")\n",
"vector_store_password: str = os.getenv(\"AZURE_SEARCH_ADMIN_KEY\")\n",
"\n",
"index_name: str = \"<your index name>\"\n",
"vector_store: AzureSearch = AzureSearch(\n",
" azure_search_endpoint=vector_store_address,\n",
" azure_search_key=vector_store_password,\n",
" index_name=index_name,\n",
" embedding_function=aoai_embeddings.embed_query,\n",
")\n",
"\n",
"vector_store.add_documents(documents=splits)"
]
},
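{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally (a minimal sketch), you can run a direct similarity search against the index to confirm the chunks were embedded and stored before wiring up the retrieval chain. Replace the placeholder with a question about your document."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Query the vector store directly to confirm the indexed chunks are searchable.\n",
"results = vector_store.similarity_search(query=\"<your question>\", k=3)\n",
"print(results[0].page_content)"
]
},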
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrive relevant chunks based on a question"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Retrieve relevant chunks based on the question\n",
"\n",
"retriever = vector_store.as_retriever(search_type=\"similarity\", search_kwargs={\"k\": 3})\n",
"\n",
"retrieved_docs = retriever.get_relevant_documents(\"<your question>\")\n",
"\n",
"print(retrieved_docs[0].page_content)\n",
"\n",
"# Use a prompt for RAG that is checked into the LangChain prompt hub (https://smith.langchain.com/hub/rlm/rag-prompt?organizationId=989ad331-949f-4bac-9694-660074a208a7)\n",
"prompt = hub.pull(\"rlm/rag-prompt\")\n",
"llm = AzureChatOpenAI(\n",
" openai_api_version=\"<Azure OpenAI API version>\", # e.g., \"2023-07-01-preview\"\n",
" azure_deployment=\"<your chat model deployment name>\",\n",
" temperature=0,\n",
")\n",
"\n",
"\n",
"def format_docs(docs):\n",
" return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
"\n",
"\n",
"rag_chain = (\n",
" {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
" | prompt\n",
" | llm\n",
" | StrOutputParser()\n",
")"
]
},
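{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to see what the RAG prompt pulled from the LangChain hub looks like before invoking the chain, the optional cell below prints its template."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Show the RAG prompt template pulled from the LangChain prompt hub.\n",
"print(prompt)"
]
},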
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Document Q&A"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Ask a question about the document\n",
"\n",
"rag_chain.invoke(\"<your question>\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Doucment Q&A with references"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Return the retrieved documents or certain source metadata from the documents\n",
"\n",
"from operator import itemgetter\n",
"\n",
"from langchain.schema.runnable import RunnableMap\n",
"\n",
"rag_chain_from_docs = (\n",
" {\n",
" \"context\": lambda input: format_docs(input[\"documents\"]),\n",
" \"question\": itemgetter(\"question\"),\n",
" }\n",
" | prompt\n",
" | llm\n",
" | StrOutputParser()\n",
")\n",
"rag_chain_with_source = RunnableMap(\n",
" {\"documents\": retriever, \"question\": RunnablePassthrough()}\n",
") | {\n",
" \"documents\": lambda input: [doc.metadata for doc in input[\"documents\"]],\n",
" \"answer\": rag_chain_from_docs,\n",
"}\n",
"\n",
"rag_chain_with_source.invoke(\"<your question>\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 4
}