docs: Added providers page for Pebblo and docs for PebbloRetrievalQA (#20746)

- **Description:** Added providers page for Pebblo and docs for
PebbloRetrievalQA
- **Issue:** NA
- **Dependencies:** None
- **Unit tests**: NA
Rajendra Kadam 2024-06-25 22:16:11 +05:30 committed by GitHub
parent a75b32a54a
commit d3520a784f
2 changed files with 605 additions and 0 deletions


@ -0,0 +1,21 @@
# Pebblo
[Pebblo](https://www.daxa.ai/pebblo) enables developers to safely load and retrieve data so they can promote their Gen AI app to deployment without
worrying about the organization's compliance and security requirements. The Pebblo SafeLoader identifies semantic topics and entities found in the
loaded data, and the Pebblo SafeRetriever enforces identity and semantic controls on the retrieved context. The results are
summarized in the UI or in a PDF report.
## Pebblo Overview
`Pebblo` provides a safe way to load and retrieve data for Gen AI applications.
It includes:
1. **Identity-aware Safe Loader** that loads data and identifies semantic topics and entities.
2. **SafeRetrieval** that enforces identity and semantic controls on the retrieved context.
3. **User Data Report** that summarizes the data loaded and retrieved.
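
The identity and semantic controls listed above can be pictured as a metadata filter applied to each retrieved chunk. The following is a minimal sketch in plain Python and is not Pebblo's implementation: the function names `is_chunk_allowed` and `filter_chunks` and the chunk dictionary layout are hypothetical, and only the metadata field names (`authorized_identities`, `pebblo_semantic_topics`, `pebblo_semantic_entities`) are taken from the PebbloRetrievalQA notebook.

```python
from typing import Any, Dict, Iterable, List


def is_chunk_allowed(
    metadata: Dict[str, Any],
    user_auth: Iterable[str],
    topics_to_deny: Iterable[str] = (),
    entities_to_deny: Iterable[str] = (),
) -> bool:
    """Return True if a chunk passes both the identity and the semantic check."""
    authorized = set(metadata.get("authorized_identities", []))
    topics = set(metadata.get("pebblo_semantic_topics", []))
    entities = set(metadata.get("pebblo_semantic_entities", []))

    # Identity check: the user must share at least one identity with the chunk.
    if not authorized & set(user_auth):
        return False
    # Semantic check: the chunk must not carry any denied topic or entity.
    if topics & set(topics_to_deny) or entities & set(entities_to_deny):
        return False
    return True


def filter_chunks(
    chunks: List[Dict[str, Any]],
    user_auth: Iterable[str],
    topics_to_deny: Iterable[str] = (),
    entities_to_deny: Iterable[str] = (),
) -> List[Dict[str, Any]]:
    """Keep only the chunks that the (hypothetical) policy allows."""
    user_auth = list(user_auth)
    topics_to_deny = list(topics_to_deny)
    entities_to_deny = list(entities_to_deny)
    return [
        chunk
        for chunk in chunks
        if is_chunk_allowed(
            chunk["metadata"], user_auth, topics_to_deny, entities_to_deny
        )
    ]
```

In this sketch a chunk is dropped as soon as one check fails, so a denied topic removes a chunk even for a user who is otherwise authorized to read it.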
## Example Notebooks
For more detailed examples of using Pebblo, see the following notebooks:
* [PebbloSafeLoader](/docs/integrations/document_loaders/pebblo) shows how to use Pebblo loader to safely load data.
* [PebbloRetrievalQA](/docs/integrations/providers/pebblo/pebblo_retrieval_qa) shows how to use Pebblo retrieval QA chain to safely retrieve data.
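
The PebbloRetrievalQA notebook notes that the authorization and semantic metadata must be placed in the `authorized_identities`, `pebblo_semantic_topics`, and `pebblo_semantic_entities` fields, each as a list of strings. A small pre-ingestion check along these lines may help catch malformed metadata early. This is only a sketch: `find_metadata_problems` is a hypothetical helper, not part of Pebblo or LangChain, and treating a missing field as a problem is an assumption.

```python
from typing import Any, Dict, List

# Field names as given in the PebbloRetrievalQA notebook.
REQUIRED_LIST_FIELDS = (
    "authorized_identities",
    "pebblo_semantic_topics",
    "pebblo_semantic_entities",
)


def find_metadata_problems(metadata: Dict[str, Any]) -> List[str]:
    """Return a description of each field that is missing or not a list of strings."""
    problems = []
    for field in REQUIRED_LIST_FIELDS:
        value = metadata.get(field)
        if value is None:
            # Assumption: an absent field is reported rather than silently accepted.
            problems.append(f"{field}: missing")
        elif not isinstance(value, list) or not all(
            isinstance(item, str) for item in value
        ):
            problems.append(f"{field}: expected a list of strings")
    return problems
```

An empty return value means all three fields have the expected shape; other metadata keys such as `page` or `source` are ignored.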


@ -0,0 +1,584 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "3ce451e9-f8f1-4f27-8c6b-4a93a406d504",
"metadata": {},
"source": [
"# Identity-enabled RAG using PebbloRetrievalQA\n",
"\n",
"> PebbloRetrievalQA is a Retrieval chain with Identity & Semantic Enforcement for question-answering\n",
"against a vector database.\n",
"\n",
"This notebook covers how to retrieve documents using Identity & Semantic Enforcement (Deny Topics/Entities).\n",
"For more details on Pebblo and its SafeRetriever feature, see the [Pebblo documentation](https://daxa-ai.github.io/pebblo/retrieval_chain/).\n",
"\n",
"### Steps:\n",
"\n",
"1. **Loading Documents:**\n",
"We will load documents with authorization and semantic metadata into an in-memory Qdrant vectorstore. This vectorstore will be used as a retriever in PebbloRetrievalQA. \n",
"\n",
"> **Note:** It is recommended to use [PebbloSafeLoader](https://daxa-ai.github.io/pebblo/rag) as the counterpart for loading documents with authorization and semantic metadata on the ingestion side. `PebbloSafeLoader` guarantees the secure and efficient loading of documents while maintaining the integrity of the metadata.\n",
"\n",
"\n",
"\n",
"2. **Testing Enforcement Mechanisms:**\n",
" We will test Identity and Semantic Enforcement separately. For each use case, we will define a specific \"ask\" function with the required contexts (*auth_context* and *semantic_context*) and then pose our questions.\n"
]
},
{
"cell_type": "markdown",
"id": "4ee16b6b-5dac-4b5c-bb69-3ec87398a33c",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"### Dependencies\n",
"\n",
"We'll use an OpenAI LLM, OpenAI embeddings and a Qdrant vector store in this walkthrough.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e68494fa-f387-4481-9a6c-58294865d7b7",
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet langchain langchain_core langchain-community langchain-openai qdrant_client"
]
},
{
"cell_type": "markdown",
"id": "61498d51-0c38-40e2-adcd-19dfdf4d37ef",
"metadata": {},
"source": [
"### Identity-aware Data Ingestion\n",
"\n",
"Here we are using Qdrant as a vector database; however, you can use any of the supported vector databases.\n",
"\n",
"**PebbloRetrievalQA chain supports the following vector databases:**\n",
"- Qdrant\n",
"- Pinecone\n",
"\n",
"\n",
"**Load vector database with authorization and semantic information in metadata:**\n",
"\n",
"In this step, we capture the authorization and semantic information of the source document into the `authorized_identities`, `pebblo_semantic_topics`, and `pebblo_semantic_entities` fields within the metadata of the VectorDB entry for each chunk. \n",
"\n",
"\n",
"*NOTE: To use the PebbloRetrievalQA chain, you must always place authorization and semantic metadata in the specified fields. These fields must contain a list of strings.*"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "ae4fcbc1-bdc3-40d2-b2df-8c82cad1f89c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Vectordb loaded.\n"
]
}
],
"source": [
"from langchain_community.vectorstores.qdrant import Qdrant\n",
"from langchain_core.documents import Document\n",
"from langchain_openai.embeddings import OpenAIEmbeddings\n",
"from langchain_openai.llms import OpenAI\n",
"\n",
"llm = OpenAI()\n",
"embeddings = OpenAIEmbeddings()\n",
"collection_name = \"pebblo-identity-and-semantic-rag\"\n",
"\n",
"page_content = \"\"\"\n",
"**ACME Corp Financial Report**\n",
"\n",
"**Overview:**\n",
"ACME Corp, a leading player in the merger and acquisition industry, presents its financial report for the fiscal year ending December 31, 2020. \n",
"Despite a challenging economic landscape, ACME Corp demonstrated robust performance and strategic growth.\n",
"\n",
"**Financial Highlights:**\n",
"Revenue soared to $50 million, marking a 15% increase from the previous year, driven by successful deal closures and expansion into new markets. \n",
"Net profit reached $12 million, showcasing a healthy margin of 24%.\n",
"\n",
"**Key Metrics:**\n",
"Total assets surged to $80 million, reflecting a 20% growth, highlighting ACME Corp's strong financial position and asset base. \n",
"Additionally, the company maintained a conservative debt-to-equity ratio of 0.5, ensuring sustainable financial stability.\n",
"\n",
"**Future Outlook:**\n",
"ACME Corp remains optimistic about the future, with plans to capitalize on emerging opportunities in the global M&A landscape. \n",
"The company is committed to delivering value to shareholders while maintaining ethical business practices.\n",
"\n",
"**Bank Account Details:**\n",
"For inquiries or transactions, please refer to ACME Corp's US bank account:\n",
"Account Number: 123456789012\n",
"Bank Name: Fictitious Bank of America\n",
"\"\"\"\n",
"\n",
"documents = [\n",
" Document(\n",
" **{\n",
" \"page_content\": page_content,\n",
" \"metadata\": {\n",
" \"pebblo_semantic_topics\": [\"financial-report\"],\n",
" \"pebblo_semantic_entities\": [\"us-bank-account-number\"],\n",
" \"authorized_identities\": [\"finance-team\", \"exec-leadership\"],\n",
" \"page\": 0,\n",
" \"source\": \"https://drive.google.com/file/d/xxxxxxxxxxxxx/view\",\n",
" \"title\": \"ACME Corp Financial Report.pdf\",\n",
" },\n",
" }\n",
" )\n",
"]\n",
"\n",
"vectordb = Qdrant.from_documents(\n",
" documents,\n",
" embeddings,\n",
" location=\":memory:\",\n",
" collection_name=collection_name,\n",
")\n",
"\n",
"print(\"Vectordb loaded.\")"
]
},
{
"cell_type": "markdown",
"id": "f630bb8b-67ba-41f9-8715-76d006207e75",
"metadata": {},
"source": [
"## Retrieval with Identity Enforcement\n",
"\n",
"The PebbloRetrievalQA chain uses SafeRetrieval to ensure that the snippets used in context are retrieved only from documents the user is authorized to access. \n",
"To achieve this, the Gen-AI application needs to provide an authorization context for this retrieval chain. \n",
"This *auth_context* should be filled with the identity and authorization groups of the user accessing the Gen-AI app.\n",
"\n",
"\n",
"Here is sample code for `PebbloRetrievalQA` in which `user_auth` (a list of the user's authorizations, which may include their user ID and \n",
" the groups they belong to) for the user accessing the RAG application is passed in `auth_context`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e978bee6-3a8c-459f-ab82-d380d7499b36",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.chains import PebbloRetrievalQA\n",
"from langchain_community.chains.pebblo_retrieval.models import AuthContext, ChainInput\n",
"\n",
"# Initialize PebbloRetrievalQA chain\n",
"qa_chain = PebbloRetrievalQA.from_chain_type(\n",
" llm=llm,\n",
" retriever=vectordb.as_retriever(),\n",
" app_name=\"pebblo-identity-rag\",\n",
" description=\"Identity Enforcement app using PebbloRetrievalQA\",\n",
" owner=\"ACME Corp\",\n",
")\n",
"\n",
"\n",
"def ask(question: str, auth_context: dict):\n",
" \"\"\"\n",
" Ask a question to the PebbloRetrievalQA chain\n",
" \"\"\"\n",
" auth_context_obj = AuthContext(**auth_context) if auth_context else None\n",
" chain_input_obj = ChainInput(query=question, auth_context=auth_context_obj)\n",
" return qa_chain.invoke(chain_input_obj.dict())"
]
},
{
"cell_type": "markdown",
"id": "7a267e96-70cb-468f-b830-83b65e9b7f6f",
"metadata": {},
"source": [
"### 1. Questions by Authorized User\n",
"\n",
"We ingested data for authorized identities `[\"finance-team\", \"exec-leadership\"]`, so a user with the authorized identity/group `finance-team` should receive the correct answer."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "2688fc18-1eac-45a5-be55-aabbe6b25af5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Question: Share the financial performance of ACME Corp for the year 2020\n",
"\n",
"Answer: \n",
"Revenue: $50 million (15% increase from previous year)\n",
"Net profit: $12 million (24% margin)\n",
"Total assets: $80 million (20% growth)\n",
"Debt-to-equity ratio: 0.5\n"
]
}
],
"source": [
"auth = {\n",
" \"user_id\": \"finance-user@acme.org\",\n",
" \"user_auth\": [\n",
" \"finance-team\",\n",
" ],\n",
"}\n",
"\n",
"question = \"Share the financial performance of ACME Corp for the year 2020\"\n",
"resp = ask(question, auth)\n",
"print(f\"Question: {question}\\n\\nAnswer: {resp['result']}\")"
]
},
{
"cell_type": "markdown",
"id": "b4db6566-6562-4a49-b19c-6d99299b374e",
"metadata": {},
"source": [
"### 2. Questions by Unauthorized User\n",
"\n",
"Since the user's authorized identity/group `eng-support` is not included in the authorized identities `[\"finance-team\", \"exec-leadership\"]`, we should not receive an answer."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "2d736ce3-6e05-48d3-a5e1-fb4e7cccc1ee",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Question: Share the financial performance of ACME Corp for the year 2020\n",
"\n",
"Answer: I don't know.\n"
]
}
],
"source": [
"auth = {\n",
" \"user_id\": \"eng-user@acme.org\",\n",
" \"user_auth\": [\n",
" \"eng-support\",\n",
" ],\n",
"}\n",
"\n",
"question = \"Share the financial performance of ACME Corp for the year 2020\"\n",
"resp = ask(question, auth)\n",
"print(f\"Question: {question}\\n\\nAnswer: {resp['result']}\")"
]
},
{
"cell_type": "markdown",
"id": "33a8afe1-3071-4118-9714-a17cba809ee4",
"metadata": {},
"source": [
"### 3. Using PromptTemplate to provide additional instructions\n",
"You can use PromptTemplate to provide additional instructions to the LLM for generating a custom response."
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "59c055ba-fdd1-48c6-9bc9-2793eb47438d",
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.prompts import PromptTemplate\n",
"\n",
"prompt_template = PromptTemplate.from_template(\n",
" \"\"\"\n",
"Answer the question using the provided context. \n",
"If no context is provided, just say \"I'm sorry, but that information is unavailable, or Access to it is restricted.\".\n",
"\n",
"Question: {question}\n",
"\"\"\"\n",
")\n",
"\n",
"question = \"Share the financial performance of ACME Corp for the year 2020\"\n",
"prompt = prompt_template.format(question=question)"
]
},
{
"cell_type": "markdown",
"id": "c4d27c00-73d9-4ce8-bc70-29535deaf0e2",
"metadata": {},
"source": [
"#### 3.1 Questions by Authorized User"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "e68a13a4-b735-421d-9655-2a9a087ba9e5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Question: Share the financial performance of ACME Corp for the year 2020\n",
"\n",
"Answer: \n",
"Revenue soared to $50 million, marking a 15% increase from the previous year, and net profit reached $12 million, showcasing a healthy margin of 24%. Total assets also grew by 20% to $80 million, and the company maintained a conservative debt-to-equity ratio of 0.5.\n"
]
}
],
"source": [
"auth = {\n",
" \"user_id\": \"finance-user@acme.org\",\n",
" \"user_auth\": [\n",
" \"finance-team\",\n",
" ],\n",
"}\n",
"resp = ask(prompt, auth)\n",
"print(f\"Question: {question}\\n\\nAnswer: {resp['result']}\")"
]
},
{
"cell_type": "markdown",
"id": "7b97a9ca-bdc6-400a-923d-65a8536658be",
"metadata": {},
"source": [
"#### 3.2 Questions by Unauthorized Users"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "438e48c6-96a2-4d5e-81db-47f8c8f37739",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Question: Share the financial performance of ACME Corp for the year 2020\n",
"\n",
"Answer: \n",
"I'm sorry, but that information is unavailable, or Access to it is restricted.\n"
]
}
],
"source": [
"auth = {\n",
" \"user_id\": \"eng-user@acme.org\",\n",
" \"user_auth\": [\n",
" \"eng-support\",\n",
" ],\n",
"}\n",
"resp = ask(prompt, auth)\n",
"print(f\"Question: {question}\\n\\nAnswer: {resp['result']}\")"
]
},
{
"cell_type": "markdown",
"id": "4306cab3-d070-405f-a23b-5c6011a61c50",
"metadata": {},
"source": [
"## Retrieval with Semantic Enforcement"
]
},
{
"cell_type": "markdown",
"id": "1c3757cf-832f-483e-aafe-cb09b5130ec0",
"metadata": {},
"source": [
"The PebbloRetrievalQA chain uses SafeRetrieval to ensure that the snippets used in context are retrieved only from documents that comply with the\n",
"provided semantic context.\n",
"To achieve this, the Gen-AI application must provide a semantic context for this retrieval chain.\n",
"This `semantic_context` should include the topics and entities that should be denied for the user accessing the Gen-AI app.\n",
"\n",
"Below is sample code for PebbloRetrievalQA with `topics_to_deny` and `entities_to_deny`. These are passed to the chain input in `semantic_context`."
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "daf37bf7-9a16-4102-8893-5b698cae1b07",
"metadata": {},
"outputs": [],
"source": [
"from typing import List, Optional\n",
"\n",
"from langchain_community.chains import PebbloRetrievalQA\n",
"from langchain_community.chains.pebblo_retrieval.models import (\n",
" ChainInput,\n",
" SemanticContext,\n",
")\n",
"\n",
"# Initialize PebbloRetrievalQA chain\n",
"qa_chain = PebbloRetrievalQA.from_chain_type(\n",
" llm=llm,\n",
" retriever=vectordb.as_retriever(),\n",
" app_name=\"pebblo-semantic-rag\",\n",
" description=\"Semantic Enforcement app using PebbloRetrievalQA\",\n",
" owner=\"ACME Corp\",\n",
")\n",
"\n",
"\n",
"def ask(\n",
" question: str,\n",
" topics_to_deny: Optional[List[str]] = None,\n",
" entities_to_deny: Optional[List[str]] = None,\n",
"):\n",
" \"\"\"\n",
" Ask a question to the PebbloRetrievalQA chain\n",
" \"\"\"\n",
" semantic_context = dict()\n",
" if topics_to_deny:\n",
" semantic_context[\"pebblo_semantic_topics\"] = {\"deny\": topics_to_deny}\n",
" if entities_to_deny:\n",
" semantic_context[\"pebblo_semantic_entities\"] = {\"deny\": entities_to_deny}\n",
"\n",
" semantic_context_obj = (\n",
" SemanticContext(**semantic_context) if semantic_context else None\n",
" )\n",
" chain_input_obj = ChainInput(query=question, semantic_context=semantic_context_obj)\n",
" return qa_chain.invoke(chain_input_obj.dict())"
]
},
{
"cell_type": "markdown",
"id": "9718819b-f5cd-4212-9947-d18cd507c8b7",
"metadata": {},
"source": [
"### 1. Without semantic enforcement\n",
"\n",
"Since no semantic enforcement is applied, the system should return the answer without excluding any context due to the semantic labels associated with the context.\n"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "69158be1-f223-4d14-b61f-f4afdf5af526",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Topics to deny: []\n",
"Entities to deny: []\n",
"Question: Share the financial performance of ACME Corp for the year 2020\n",
"Answer: \n",
"Revenue for ACME Corp increased by 15% to $50 million in 2020, with a net profit of $12 million and a strong asset base of $80 million. The company also maintained a conservative debt-to-equity ratio of 0.5.\n"
]
}
],
"source": [
"topic_to_deny = []\n",
"entities_to_deny = []\n",
"question = \"Share the financial performance of ACME Corp for the year 2020\"\n",
"resp = ask(question, topics_to_deny=topic_to_deny, entities_to_deny=entities_to_deny)\n",
"print(\n",
" f\"Topics to deny: {topic_to_deny}\\nEntities to deny: {entities_to_deny}\\n\"\n",
" f\"Question: {question}\\nAnswer: {resp['result']}\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "c8789c58-0d64-404e-bc09-92f6952022ac",
"metadata": {},
"source": [
"### 2. Deny financial-report topic\n",
"\n",
"Data has been ingested with the topics: `[\"financial-report\"]`.\n",
"Therefore, an app that denies the `financial-report` topic should not receive an answer."
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "9b17b2fc-eefb-4229-a41e-2f943d2eb48e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Topics to deny: ['financial-report']\n",
"Entities to deny: []\n",
"Question: Share the financial performance of ACME Corp for the year 2020\n",
"Answer: \n",
"\n",
"Unfortunately, I do not have access to the financial performance of ACME Corp for the year 2020.\n"
]
}
],
"source": [
"topic_to_deny = [\"financial-report\"]\n",
"entities_to_deny = []\n",
"question = \"Share the financial performance of ACME Corp for the year 2020\"\n",
"resp = ask(question, topics_to_deny=topic_to_deny, entities_to_deny=entities_to_deny)\n",
"print(\n",
" f\"Topics to deny: {topic_to_deny}\\nEntities to deny: {entities_to_deny}\\n\"\n",
" f\"Question: {question}\\nAnswer: {resp['result']}\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "894f21b0-2913-4ef6-b5ed-cbca8f74214d",
"metadata": {},
"source": [
"### 3. Deny us-bank-account-number entity\n",
"Since the entity `us-bank-account-number` is denied, the system should not return the answer."
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "2b8abce3-7af3-437f-8999-2866a4b9beda",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Topics to deny: []\n",
"Entities to deny: ['us-bank-account-number']\n",
"Question: Share the financial performance of ACME Corp for the year 2020\n",
"Answer: I don't have information about ACME Corp's financial performance for 2020.\n"
]
}
],
"source": [
"topic_to_deny = []\n",
"entities_to_deny = [\"us-bank-account-number\"]\n",
"question = \"Share the financial performance of ACME Corp for the year 2020\"\n",
"resp = ask(question, topics_to_deny=topic_to_deny, entities_to_deny=entities_to_deny)\n",
"print(\n",
" f\"Topics to deny: {topic_to_deny}\\nEntities to deny: {entities_to_deny}\\n\"\n",
" f\"Question: {question}\\nAnswer: {resp['result']}\"\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
}
},
"nbformat": 4,
"nbformat_minor": 5
}