diff --git a/docs/docs/integrations/providers/pebblo/index.md b/docs/docs/integrations/providers/pebblo/index.md new file mode 100644 index 0000000000..d0ed9a69b2 --- /dev/null +++ b/docs/docs/integrations/providers/pebblo/index.md @@ -0,0 +1,21 @@ +# Pebblo + +[Pebblo](https://www.daxa.ai/pebblo) enables developers to safely load and retrieve data to promote their Gen AI app to deployment without +worrying about the organization’s compliance and security requirements. The Pebblo SafeLoader identifies semantic topics and entities found in the +loaded data, and the Pebblo SafeRetriever enforces identity and semantic controls on the retrieved context. The results are +summarized in the UI or in a PDF report. + + +## Pebblo Overview + +`Pebblo` provides a safe way to load and retrieve data for Gen AI applications. +It includes: +1. **Identity-aware Safe Loader** that loads data and identifies semantic topics and entities. +2. **SafeRetrieval** that enforces identity and semantic controls on the retrieved context. +3. **User Data Report** that summarizes the data loaded and retrieved. + +## Example Notebooks + +For more detailed examples of using Pebblo, see the following notebooks: +* [PebbloSafeLoader](/docs/integrations/document_loaders/pebblo) shows how to use the Pebblo loader to safely load data. +* [PebbloRetrievalQA](/docs/integrations/providers/pebblo/pebblo_retrieval_qa) shows how to use the Pebblo retrieval QA chain to safely retrieve data. 
diff --git a/docs/docs/integrations/providers/pebblo/pebblo_retrieval_qa.ipynb b/docs/docs/integrations/providers/pebblo/pebblo_retrieval_qa.ipynb new file mode 100644 index 0000000000..14cd3c1603 --- /dev/null +++ b/docs/docs/integrations/providers/pebblo/pebblo_retrieval_qa.ipynb @@ -0,0 +1,584 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "3ce451e9-f8f1-4f27-8c6b-4a93a406d504", + "metadata": {}, + "source": [ + "# Identity-enabled RAG using PebbloRetrievalQA\n", + "\n", + "> PebbloRetrievalQA is a retrieval chain with Identity & Semantic Enforcement for question-answering\n", + "against a vector database.\n", + "\n", + "This notebook covers how to retrieve documents using Identity & Semantic Enforcement (Deny Topics/Entities).\n", + "For more details on Pebblo and its SafeRetriever feature, visit the [Pebblo documentation](https://daxa-ai.github.io/pebblo/retrieval_chain/).\n", + "\n", + "### Steps:\n", + "\n", + "1. **Loading Documents:**\n", + "We will load documents with authorization and semantic metadata into an in-memory Qdrant vectorstore. This vectorstore will be used as a retriever in PebbloRetrievalQA. \n", + "\n", + "> **Note:** It is recommended to use [PebbloSafeLoader](https://daxa-ai.github.io/pebblo/rag) as the counterpart for loading documents with authentication and semantic metadata on the ingestion side. `PebbloSafeLoader` guarantees the secure and efficient loading of documents while maintaining the integrity of the metadata.\n", + "\n", + "\n", + "\n", + "2. **Testing Enforcement Mechanisms**:\n", + " We will test Identity and Semantic Enforcement separately. 
For each use case, we will define a specific \"ask\" function with the required contexts (*auth_context* and *semantic_context*) and then pose our questions.\n" + ] + }, + { + "cell_type": "markdown", + "id": "4ee16b6b-5dac-4b5c-bb69-3ec87398a33c", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "### Dependencies\n", + "\n", + "We'll use an OpenAI LLM, OpenAI embeddings and a Qdrant vector store in this walkthrough.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e68494fa-f387-4481-9a6c-58294865d7b7", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install --upgrade --quiet langchain langchain_core langchain-community langchain-openai qdrant_client" + ] + }, + { + "cell_type": "markdown", + "id": "61498d51-0c38-40e2-adcd-19dfdf4d37ef", + "metadata": {}, + "source": [ + "### Identity-aware Data Ingestion\n", + "\n", + "Here we are using Qdrant as a vector database; however, you can use any of the supported vector databases.\n", + "\n", + "**PebbloRetrievalQA chain supports the following vector databases:**\n", + "- Qdrant\n", + "- Pinecone\n", + "\n", + "\n", + "**Load vector database with authorization and semantic information in metadata:**\n", + "\n", + "In this step, we capture the authorization and semantic information of the source document into the `authorized_identities`, `pebblo_semantic_topics`, and `pebblo_semantic_entities` fields within the metadata of the VectorDB entry for each chunk. \n", + "\n", + "\n", + "*NOTE: To use the PebbloRetrievalQA chain, you must always place authorization and semantic metadata in the specified fields. 
These fields must contain a list of strings.*" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "ae4fcbc1-bdc3-40d2-b2df-8c82cad1f89c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Vectordb loaded.\n" + ] + } + ], + "source": [ + "from langchain_community.vectorstores.qdrant import Qdrant\n", + "from langchain_core.documents import Document\n", + "from langchain_openai.embeddings import OpenAIEmbeddings\n", + "from langchain_openai.llms import OpenAI\n", + "\n", + "llm = OpenAI()\n", + "embeddings = OpenAIEmbeddings()\n", + "collection_name = \"pebblo-identity-and-semantic-rag\"\n", + "\n", + "page_content = \"\"\"\n", + "**ACME Corp Financial Report**\n", + "\n", + "**Overview:**\n", + "ACME Corp, a leading player in the merger and acquisition industry, presents its financial report for the fiscal year ending December 31, 2020. \n", + "Despite a challenging economic landscape, ACME Corp demonstrated robust performance and strategic growth.\n", + "\n", + "**Financial Highlights:**\n", + "Revenue soared to $50 million, marking a 15% increase from the previous year, driven by successful deal closures and expansion into new markets. \n", + "Net profit reached $12 million, showcasing a healthy margin of 24%.\n", + "\n", + "**Key Metrics:**\n", + "Total assets surged to $80 million, reflecting a 20% growth, highlighting ACME Corp's strong financial position and asset base. \n", + "Additionally, the company maintained a conservative debt-to-equity ratio of 0.5, ensuring sustainable financial stability.\n", + "\n", + "**Future Outlook:**\n", + "ACME Corp remains optimistic about the future, with plans to capitalize on emerging opportunities in the global M&A landscape. 
\n", + "The company is committed to delivering value to shareholders while maintaining ethical business practices.\n", + "\n", + "**Bank Account Details:**\n", + "For inquiries or transactions, please refer to ACME Corp's US bank account:\n", + "Account Number: 123456789012\n", + "Bank Name: Fictitious Bank of America\n", + "\"\"\"\n", + "\n", + "documents = [\n", + " Document(\n", + " **{\n", + " \"page_content\": page_content,\n", + " \"metadata\": {\n", + " \"pebblo_semantic_topics\": [\"financial-report\"],\n", + " \"pebblo_semantic_entities\": [\"us-bank-account-number\"],\n", + " \"authorized_identities\": [\"finance-team\", \"exec-leadership\"],\n", + " \"page\": 0,\n", + " \"source\": \"https://drive.google.com/file/d/xxxxxxxxxxxxx/view\",\n", + " \"title\": \"ACME Corp Financial Report.pdf\",\n", + " },\n", + " }\n", + " )\n", + "]\n", + "\n", + "vectordb = Qdrant.from_documents(\n", + " documents,\n", + " embeddings,\n", + " location=\":memory:\",\n", + " collection_name=collection_name,\n", + ")\n", + "\n", + "print(\"Vectordb loaded.\")" + ] + }, + { + "cell_type": "markdown", + "id": "f630bb8b-67ba-41f9-8715-76d006207e75", + "metadata": {}, + "source": [ + "## Retrieval with Identity Enforcement\n", + "\n", + "The PebbloRetrievalQA chain uses SafeRetrieval to ensure that the snippets used as in-context information are retrieved only from documents the user is authorized to access. \n", + "To achieve this, the Gen-AI application needs to provide an authorization context for this retrieval chain. \n", + "This *auth_context* should be filled with the identity and authorization groups of the user accessing the Gen-AI app.\n", + "\n", + "\n", + "Here is sample code for `PebbloRetrievalQA` with `user_auth` (a list of user authorizations, which may include the user's ID and \n", + " the groups they are part of) from the user accessing the RAG application, passed in `auth_context`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "e978bee6-3a8c-459f-ab82-d380d7499b36", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_community.chains import PebbloRetrievalQA\n", + "from langchain_community.chains.pebblo_retrieval.models import AuthContext, ChainInput\n", + "\n", + "# Initialize PebbloRetrievalQA chain\n", + "qa_chain = PebbloRetrievalQA.from_chain_type(\n", + " llm=llm,\n", + " retriever=vectordb.as_retriever(),\n", + " app_name=\"pebblo-identity-rag\",\n", + " description=\"Identity Enforcement app using PebbloRetrievalQA\",\n", + " owner=\"ACME Corp\",\n", + ")\n", + "\n", + "\n", + "def ask(question: str, auth_context: dict):\n", + " \"\"\"\n", + " Ask a question to the PebbloRetrievalQA chain\n", + " \"\"\"\n", + " auth_context_obj = AuthContext(**auth_context) if auth_context else None\n", + " chain_input_obj = ChainInput(query=question, auth_context=auth_context_obj)\n", + " return qa_chain.invoke(chain_input_obj.dict())" + ] + }, + { + "cell_type": "markdown", + "id": "7a267e96-70cb-468f-b830-83b65e9b7f6f", + "metadata": {}, + "source": [ + "### 1. Questions by Authorized User\n", + "\n", + "We ingested data for authorized identities `[\"finance-team\", \"exec-leadership\"]`, so a user with the authorized identity/group `finance-team` should receive the correct answer." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "2688fc18-1eac-45a5-be55-aabbe6b25af5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Question: Share the financial performance of ACME Corp for the year 2020\n", + "\n", + "Answer: \n", + "Revenue: $50 million (15% increase from previous year)\n", + "Net profit: $12 million (24% margin)\n", + "Total assets: $80 million (20% growth)\n", + "Debt-to-equity ratio: 0.5\n" + ] + } + ], + "source": [ + "auth = {\n", + " \"user_id\": \"finance-user@acme.org\",\n", + " \"user_auth\": [\n", + " \"finance-team\",\n", + " ],\n", + "}\n", + "\n", + "question = \"Share the financial performance of ACME Corp for the year 2020\"\n", + "resp = ask(question, auth)\n", + "print(f\"Question: {question}\\n\\nAnswer: {resp['result']}\")" + ] + }, + { + "cell_type": "markdown", + "id": "b4db6566-6562-4a49-b19c-6d99299b374e", + "metadata": {}, + "source": [ + "### 2. Questions by Unauthorized User\n", + "\n", + "Since the user's authorized identity/group `eng-support` is not included in the authorized identities `[\"finance-team\", \"exec-leadership\"]`, we should not receive an answer." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "2d736ce3-6e05-48d3-a5e1-fb4e7cccc1ee", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Question: Share the financial performance of ACME Corp for the year 2020\n", + "\n", + "Answer: I don't know.\n" + ] + } + ], + "source": [ + "auth = {\n", + " \"user_id\": \"eng-user@acme.org\",\n", + " \"user_auth\": [\n", + " \"eng-support\",\n", + " ],\n", + "}\n", + "\n", + "question = \"Share the financial performance of ACME Corp for the year 2020\"\n", + "resp = ask(question, auth)\n", + "print(f\"Question: {question}\\n\\nAnswer: {resp['result']}\")" + ] + }, + { + "cell_type": "markdown", + "id": "33a8afe1-3071-4118-9714-a17cba809ee4", + "metadata": {}, + "source": [ + "### 3. Using PromptTemplate to provide additional instructions\n", + "You can use PromptTemplate to provide additional instructions to the LLM for generating a custom response." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "59c055ba-fdd1-48c6-9bc9-2793eb47438d", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_core.prompts import PromptTemplate\n", + "\n", + "prompt_template = PromptTemplate.from_template(\n", + " \"\"\"\n", + "Answer the question using the provided context. 
\n", + "If no context is provided, just say \"I'm sorry, but that information is unavailable, or Access to it is restricted.\".\n", + "\n", + "Question: {question}\n", + "\"\"\"\n", + ")\n", + "\n", + "question = \"Share the financial performance of ACME Corp for the year 2020\"\n", + "prompt = prompt_template.format(question=question)" + ] + }, + { + "cell_type": "markdown", + "id": "c4d27c00-73d9-4ce8-bc70-29535deaf0e2", + "metadata": {}, + "source": [ + "#### 3.1 Questions by Authorized User" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "e68a13a4-b735-421d-9655-2a9a087ba9e5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Question: Share the financial performance of ACME Corp for the year 2020\n", + "\n", + "Answer: \n", + "Revenue soared to $50 million, marking a 15% increase from the previous year, and net profit reached $12 million, showcasing a healthy margin of 24%. Total assets also grew by 20% to $80 million, and the company maintained a conservative debt-to-equity ratio of 0.5.\n" + ] + } + ], + "source": [ + "auth = {\n", + " \"user_id\": \"finance-user@acme.org\",\n", + " \"user_auth\": [\n", + " \"finance-team\",\n", + " ],\n", + "}\n", + "resp = ask(prompt, auth)\n", + "print(f\"Question: {question}\\n\\nAnswer: {resp['result']}\")" + ] + }, + { + "cell_type": "markdown", + "id": "7b97a9ca-bdc6-400a-923d-65a8536658be", + "metadata": {}, + "source": [ + "#### 3.2 Questions by Unauthorized Users" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "438e48c6-96a2-4d5e-81db-47f8c8f37739", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Question: Share the financial performance of ACME Corp for the year 2020\n", + "\n", + "Answer: \n", + "I'm sorry, but that information is unavailable, or Access to it is restricted.\n" + ] + } + ], + "source": [ + "auth = {\n", + " \"user_id\": \"eng-user@acme.org\",\n", + " 
\"user_auth\": [\n", + " \"eng-support\",\n", + " ],\n", + "}\n", + "resp = ask(prompt, auth)\n", + "print(f\"Question: {question}\\n\\nAnswer: {resp['result']}\")" + ] + }, + { + "cell_type": "markdown", + "id": "4306cab3-d070-405f-a23b-5c6011a61c50", + "metadata": {}, + "source": [ + "## Retrieval with Semantic Enforcement" + ] + }, + { + "cell_type": "markdown", + "id": "1c3757cf-832f-483e-aafe-cb09b5130ec0", + "metadata": {}, + "source": [ + "The PebbloRetrievalQA chain uses SafeRetrieval to ensure that the snippets used in context are retrieved only from documents that comply with the\n", + "provided semantic context.\n", + "To achieve this, the Gen-AI application must provide a semantic context for this retrieval chain.\n", + "This `semantic_context` should include the topics and entities that should be denied for the user accessing the Gen-AI app.\n", + "\n", + "Below is sample code for PebbloRetrievalQA with `topics_to_deny` and `entities_to_deny`. These are passed in `semantic_context` to the chain input." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "daf37bf7-9a16-4102-8893-5b698cae1b07", + "metadata": {}, + "outputs": [], + "source": [ + "from typing import List, Optional\n", + "\n", + "from langchain_community.chains import PebbloRetrievalQA\n", + "from langchain_community.chains.pebblo_retrieval.models import (\n", + " ChainInput,\n", + " SemanticContext,\n", + ")\n", + "\n", + "# Initialize PebbloRetrievalQA chain\n", + "qa_chain = PebbloRetrievalQA.from_chain_type(\n", + " llm=llm,\n", + " retriever=vectordb.as_retriever(),\n", + " app_name=\"pebblo-semantic-rag\",\n", + " description=\"Semantic Enforcement app using PebbloRetrievalQA\",\n", + " owner=\"ACME Corp\",\n", + ")\n", + "\n", + "\n", + "def ask(\n", + " question: str,\n", + " topics_to_deny: Optional[List[str]] = None,\n", + " entities_to_deny: Optional[List[str]] = None,\n", + "):\n", + " \"\"\"\n", + " Ask a question to the PebbloRetrievalQA chain\n", + " \"\"\"\n", + " semantic_context = dict()\n", + " if topics_to_deny:\n", + " semantic_context[\"pebblo_semantic_topics\"] = {\"deny\": topics_to_deny}\n", + " if entities_to_deny:\n", + " semantic_context[\"pebblo_semantic_entities\"] = {\"deny\": entities_to_deny}\n", + "\n", + " semantic_context_obj = (\n", + " SemanticContext(**semantic_context) if semantic_context else None\n", + " )\n", + " chain_input_obj = ChainInput(query=question, semantic_context=semantic_context_obj)\n", + " return qa_chain.invoke(chain_input_obj.dict())" + ] + }, + { + "cell_type": "markdown", + "id": "9718819b-f5cd-4212-9947-d18cd507c8b7", + "metadata": {}, + "source": [ + "### 1. 
Without semantic enforcement\n", + "\n", + "Since no semantic enforcement is applied, the system should return the answer without excluding any context due to the semantic labels associated with the context.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "69158be1-f223-4d14-b61f-f4afdf5af526", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Topics to deny: []\n", + "Entities to deny: []\n", + "Question: Share the financial performance of ACME Corp for the year 2020\n", + "Answer: \n", + "Revenue for ACME Corp increased by 15% to $50 million in 2020, with a net profit of $12 million and a strong asset base of $80 million. The company also maintained a conservative debt-to-equity ratio of 0.5.\n" + ] + } + ], + "source": [ + "topic_to_deny = []\n", + "entities_to_deny = []\n", + "question = \"Share the financial performance of ACME Corp for the year 2020\"\n", + "resp = ask(question, topics_to_deny=topic_to_deny, entities_to_deny=entities_to_deny)\n", + "print(\n", + " f\"Topics to deny: {topic_to_deny}\\nEntities to deny: {entities_to_deny}\\n\"\n", + " f\"Question: {question}\\nAnswer: {resp['result']}\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "c8789c58-0d64-404e-bc09-92f6952022ac", + "metadata": {}, + "source": [ + "### 2. Deny financial-report topic\n", + "\n", + "Data has been ingested with the topics: `[\"financial-report\"]`.\n", + "Therefore, an app that denies the `financial-report` topic should not receive an answer." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "9b17b2fc-eefb-4229-a41e-2f943d2eb48e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Topics to deny: ['financial-report']\n", + "Entities to deny: []\n", + "Question: Share the financial performance of ACME Corp for the year 2020\n", + "Answer: \n", + "\n", + "Unfortunately, I do not have access to the financial performance of ACME Corp for the year 2020.\n" + ] + } + ], + "source": [ + "topic_to_deny = [\"financial-report\"]\n", + "entities_to_deny = []\n", + "question = \"Share the financial performance of ACME Corp for the year 2020\"\n", + "resp = ask(question, topics_to_deny=topic_to_deny, entities_to_deny=entities_to_deny)\n", + "print(\n", + " f\"Topics to deny: {topic_to_deny}\\nEntities to deny: {entities_to_deny}\\n\"\n", + " f\"Question: {question}\\nAnswer: {resp['result']}\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "894f21b0-2913-4ef6-b5ed-cbca8f74214d", + "metadata": {}, + "source": [ + "### 3. Deny us-bank-account-number entity\n", + "Since the entity `us-bank-account-number` is denied, the system should not return the answer." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "2b8abce3-7af3-437f-8999-2866a4b9beda", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Topics to deny: []\n", + "Entities to deny: ['us-bank-account-number']\n", + "Question: Share the financial performance of ACME Corp for the year 2020\n", + "Answer: I don't have information about ACME Corp's financial performance for 2020.\n" + ] + } + ], + "source": [ + "topic_to_deny = []\n", + "entities_to_deny = [\"us-bank-account-number\"]\n", + "question = \"Share the financial performance of ACME Corp for the year 2020\"\n", + "resp = ask(question, topics_to_deny=topic_to_deny, entities_to_deny=entities_to_deny)\n", + "print(\n", + " f\"Topics to deny: {topic_to_deny}\\nEntities to deny: {entities_to_deny}\\n\"\n", + " f\"Question: {question}\\nAnswer: {resp['result']}\"\n", + ")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.18" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}