You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
langchain/docs/docs/guides/privacy/presidio_data_anonymization/qa_privacy_protection.ipynb

995 lines
34 KiB
Plaintext

{
"cells": [
{
"cell_type": "raw",
"metadata": {},
"source": [
"---\n",
"sidebar_position: 3\n",
"title: QA with private data protection\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# QA with private data protection\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/guides/privacy/presidio_data_anonymization/qa_privacy_protection.ipynb)\n",
"\n",
"\n",
"In this notebook, we will look at building a basic system for question answering, based on private data. Before feeding the LLM with this data, we need to protect it so that it doesn't go to an external API (e.g. OpenAI, Anthropic). Then, after receiving the model output, we would like the data to be restored to its original form. Below you can observe an example flow of this QA system:\n",
"\n",
"<img src=\"/img/qa_privacy_protection.png\" width=\"900\"/>\n",
"\n",
"\n",
"In the following notebook, we will not go into the details of how the anonymizer works. If you are interested, please visit [this part of the documentation](https://python.langchain.com/docs/guides/privacy/presidio_data_anonymization/).\n",
"\n",
"## Quickstart\n",
"\n",
"### Iterative process of upgrading the anonymizer"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"%pip install --upgrade --quiet langchain langchain-experimental langchain-openai presidio-analyzer presidio-anonymizer spacy Faker faiss-cpu tiktoken"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Download model\n",
"! python -m spacy download en_core_web_lg"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"document_content = \"\"\"Date: October 19, 2021\n",
" Witness: John Doe\n",
" Subject: Testimony Regarding the Loss of Wallet\n",
"\n",
" Testimony Content:\n",
"\n",
" Hello Officer,\n",
"\n",
" My name is John Doe and on October 19, 2021, my wallet was stolen in the vicinity of Kilmarnock during a bike trip. This wallet contains some very important things to me.\n",
"\n",
" Firstly, the wallet contains my credit card with number 4111 1111 1111 1111, which is registered under my name and linked to my bank account, PL61109010140000071219812874.\n",
"\n",
" Additionally, the wallet had a driver's license - DL No: 999000680 issued to my name. It also houses my Social Security Number, 602-76-4532.\n",
"\n",
" What's more, I had my polish identity card there, with the number ABC123456.\n",
"\n",
" I would like this data to be secured and protected in all possible ways. I believe It was stolen at 9:30 AM.\n",
"\n",
" In case any information arises regarding my wallet, please reach out to me on my phone number, 999-888-7777, or through my personal email, johndoe@example.com.\n",
"\n",
" Please consider this information to be highly confidential and respect my privacy.\n",
"\n",
" The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, support@bankname.com.\n",
" My representative there is Victoria Cherry (her business phone: 987-654-3210).\n",
"\n",
" Thank you for your assistance,\n",
"\n",
" John Doe\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.documents import Document\n",
"\n",
"documents = [Document(page_content=document_content)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We only have one document, so before we move on to creating a QA system, let's focus on its content to begin with.\n",
"\n",
"You may observe that the text contains many different PII values, some types occur repeatedly (names, phone numbers, emails), and some specific PIIs are repeated (John Doe)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Util function for coloring the PII markers\n",
"# NOTE: It will not be visible on documentation page, only in the notebook\n",
"import re\n",
"\n",
"\n",
"def print_colored_pii(string):\n",
" colored_string = re.sub(\n",
" r\"(<[^>]*>)\", lambda m: \"\\033[31m\" + m.group(1) + \"\\033[0m\", string\n",
" )\n",
" print(colored_string)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's proceed and try to anonymize the text with the default settings. For now, we don't replace the data with synthetic, we just mark it with markers (e.g. `<PERSON>`), so we set `add_default_faker_operators=False`:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Date: \u001b[31m<DATE_TIME>\u001b[0m\n",
"Witness: \u001b[31m<PERSON>\u001b[0m\n",
"Subject: Testimony Regarding the Loss of Wallet\n",
"\n",
"Testimony Content:\n",
"\n",
"Hello Officer,\n",
"\n",
"My name is \u001b[31m<PERSON>\u001b[0m and on \u001b[31m<DATE_TIME>\u001b[0m, my wallet was stolen in the vicinity of \u001b[31m<LOCATION>\u001b[0m during a bike trip. This wallet contains some very important things to me.\n",
"\n",
"Firstly, the wallet contains my credit card with number \u001b[31m<CREDIT_CARD>\u001b[0m, which is registered under my name and linked to my bank account, \u001b[31m<IBAN_CODE>\u001b[0m.\n",
"\n",
"Additionally, the wallet had a driver's license - DL No: \u001b[31m<US_DRIVER_LICENSE>\u001b[0m issued to my name. It also houses my Social Security Number, \u001b[31m<US_SSN>\u001b[0m. \n",
"\n",
"What's more, I had my polish identity card there, with the number ABC123456.\n",
"\n",
"I would like this data to be secured and protected in all possible ways. I believe It was stolen at \u001b[31m<DATE_TIME_2>\u001b[0m.\n",
"\n",
"In case any information arises regarding my wallet, please reach out to me on my phone number, \u001b[31m<PHONE_NUMBER>\u001b[0m, or through my personal email, \u001b[31m<EMAIL_ADDRESS>\u001b[0m.\n",
"\n",
"Please consider this information to be highly confidential and respect my privacy. \n",
"\n",
"The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, \u001b[31m<EMAIL_ADDRESS_2>\u001b[0m.\n",
"My representative there is \u001b[31m<PERSON_2>\u001b[0m (her business phone: \u001b[31m<UK_NHS>\u001b[0m).\n",
"\n",
"Thank you for your assistance,\n",
"\n",
"\u001b[31m<PERSON>\u001b[0m\n"
]
}
],
"source": [
"from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer\n",
"\n",
"anonymizer = PresidioReversibleAnonymizer(\n",
" add_default_faker_operators=False,\n",
")\n",
"\n",
"print_colored_pii(anonymizer.anonymize(document_content))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's also look at the mapping between original and anonymized values:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'CREDIT_CARD': {'<CREDIT_CARD>': '4111 1111 1111 1111'},\n",
" 'DATE_TIME': {'<DATE_TIME>': 'October 19, 2021', '<DATE_TIME_2>': '9:30 AM'},\n",
" 'EMAIL_ADDRESS': {'<EMAIL_ADDRESS>': 'johndoe@example.com',\n",
" '<EMAIL_ADDRESS_2>': 'support@bankname.com'},\n",
" 'IBAN_CODE': {'<IBAN_CODE>': 'PL61109010140000071219812874'},\n",
" 'LOCATION': {'<LOCATION>': 'Kilmarnock'},\n",
" 'PERSON': {'<PERSON>': 'John Doe', '<PERSON_2>': 'Victoria Cherry'},\n",
" 'PHONE_NUMBER': {'<PHONE_NUMBER>': '999-888-7777'},\n",
" 'UK_NHS': {'<UK_NHS>': '987-654-3210'},\n",
" 'US_DRIVER_LICENSE': {'<US_DRIVER_LICENSE>': '999000680'},\n",
" 'US_SSN': {'<US_SSN>': '602-76-4532'}}\n"
]
}
],
"source": [
"import pprint\n",
"\n",
"pprint.pprint(anonymizer.deanonymizer_mapping)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In general, the anonymizer works pretty well, but I can observe two things to improve here:\n",
"\n",
"1. Datetime redundancy - we have two different entities recognized as `DATE_TIME`, but they contain different type of information. The first one is a date (*October 19, 2021*), the second one is a time (*9:30 AM*). We can improve this by adding a new recognizer to the anonymizer, which will treat time separately from the date.\n",
"2. Polish ID - polish ID has unique pattern, which is not by default part of anonymizer recognizers. The value *ABC123456* is not anonymized.\n",
"\n",
"The solution is simple: we need to add a new recognizers to the anonymizer. You can read more about it in [presidio documentation](https://microsoft.github.io/presidio/analyzer/adding_recognizers/).\n",
"\n",
"\n",
"Let's add new recognizers:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# Define the regex pattern in a Presidio `Pattern` object:\n",
"from presidio_analyzer import Pattern, PatternRecognizer\n",
"\n",
"polish_id_pattern = Pattern(\n",
" name=\"polish_id_pattern\",\n",
" regex=\"[A-Z]{3}\\d{6}\",\n",
" score=1,\n",
")\n",
"time_pattern = Pattern(\n",
" name=\"time_pattern\",\n",
" regex=\"(1[0-2]|0?[1-9]):[0-5][0-9] (AM|PM)\",\n",
" score=1,\n",
")\n",
"\n",
"# Define the recognizer with one or more patterns\n",
"polish_id_recognizer = PatternRecognizer(\n",
" supported_entity=\"POLISH_ID\", patterns=[polish_id_pattern]\n",
")\n",
"time_recognizer = PatternRecognizer(supported_entity=\"TIME\", patterns=[time_pattern])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And now, we're adding recognizers to our anonymizer:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"anonymizer.add_recognizer(polish_id_recognizer)\n",
"anonymizer.add_recognizer(time_recognizer)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that our anonymization instance remembers previously detected and anonymized values, including those that were not detected correctly (e.g., *\"9:30 AM\"* taken as `DATE_TIME`). So it's worth removing this value, or resetting the entire mapping now that our recognizers have been updated:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"anonymizer.reset_deanonymizer_mapping()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's anonymize the text and see the results:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Date: \u001b[31m<DATE_TIME>\u001b[0m\n",
"Witness: \u001b[31m<PERSON>\u001b[0m\n",
"Subject: Testimony Regarding the Loss of Wallet\n",
"\n",
"Testimony Content:\n",
"\n",
"Hello Officer,\n",
"\n",
"My name is \u001b[31m<PERSON>\u001b[0m and on \u001b[31m<DATE_TIME>\u001b[0m, my wallet was stolen in the vicinity of \u001b[31m<LOCATION>\u001b[0m during a bike trip. This wallet contains some very important things to me.\n",
"\n",
"Firstly, the wallet contains my credit card with number \u001b[31m<CREDIT_CARD>\u001b[0m, which is registered under my name and linked to my bank account, \u001b[31m<IBAN_CODE>\u001b[0m.\n",
"\n",
"Additionally, the wallet had a driver's license - DL No: \u001b[31m<US_DRIVER_LICENSE>\u001b[0m issued to my name. It also houses my Social Security Number, \u001b[31m<US_SSN>\u001b[0m. \n",
"\n",
"What's more, I had my polish identity card there, with the number \u001b[31m<POLISH_ID>\u001b[0m.\n",
"\n",
"I would like this data to be secured and protected in all possible ways. I believe It was stolen at \u001b[31m<TIME>\u001b[0m.\n",
"\n",
"In case any information arises regarding my wallet, please reach out to me on my phone number, \u001b[31m<PHONE_NUMBER>\u001b[0m, or through my personal email, \u001b[31m<EMAIL_ADDRESS>\u001b[0m.\n",
"\n",
"Please consider this information to be highly confidential and respect my privacy. \n",
"\n",
"The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, \u001b[31m<EMAIL_ADDRESS_2>\u001b[0m.\n",
"My representative there is \u001b[31m<PERSON_2>\u001b[0m (her business phone: \u001b[31m<UK_NHS>\u001b[0m).\n",
"\n",
"Thank you for your assistance,\n",
"\n",
"\u001b[31m<PERSON>\u001b[0m\n"
]
}
],
"source": [
"print_colored_pii(anonymizer.anonymize(document_content))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'CREDIT_CARD': {'<CREDIT_CARD>': '4111 1111 1111 1111'},\n",
" 'DATE_TIME': {'<DATE_TIME>': 'October 19, 2021'},\n",
" 'EMAIL_ADDRESS': {'<EMAIL_ADDRESS>': 'johndoe@example.com',\n",
" '<EMAIL_ADDRESS_2>': 'support@bankname.com'},\n",
" 'IBAN_CODE': {'<IBAN_CODE>': 'PL61109010140000071219812874'},\n",
" 'LOCATION': {'<LOCATION>': 'Kilmarnock'},\n",
" 'PERSON': {'<PERSON>': 'John Doe', '<PERSON_2>': 'Victoria Cherry'},\n",
" 'PHONE_NUMBER': {'<PHONE_NUMBER>': '999-888-7777'},\n",
" 'POLISH_ID': {'<POLISH_ID>': 'ABC123456'},\n",
" 'TIME': {'<TIME>': '9:30 AM'},\n",
" 'UK_NHS': {'<UK_NHS>': '987-654-3210'},\n",
" 'US_DRIVER_LICENSE': {'<US_DRIVER_LICENSE>': '999000680'},\n",
" 'US_SSN': {'<US_SSN>': '602-76-4532'}}\n"
]
}
],
"source": [
"pprint.pprint(anonymizer.deanonymizer_mapping)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, our new recognizers work as expected. The anonymizer has replaced the time and Polish ID entities with the `<TIME>` and `<POLISH_ID>` markers, and the deanonymizer mapping has been updated accordingly.\n",
"\n",
"Now, when all PII values are detected correctly, we can proceed to the next step, which is replacing the original values with synthetic ones. To do this, we need to set `add_default_faker_operators=True` (or just remove this parameter, because it's set to `True` by default):"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Date: 1986-04-18\n",
"Witness: Brian Cox DVM\n",
"Subject: Testimony Regarding the Loss of Wallet\n",
"\n",
"Testimony Content:\n",
"\n",
"Hello Officer,\n",
"\n",
"My name is Brian Cox DVM and on 1986-04-18, my wallet was stolen in the vicinity of New Rita during a bike trip. This wallet contains some very important things to me.\n",
"\n",
"Firstly, the wallet contains my credit card with number 6584801845146275, which is registered under my name and linked to my bank account, GB78GSWK37672423884969.\n",
"\n",
"Additionally, the wallet had a driver's license - DL No: 781802744 issued to my name. It also houses my Social Security Number, 687-35-1170. \n",
"\n",
"What's more, I had my polish identity card there, with the number \u001b[31m<POLISH_ID>\u001b[0m.\n",
"\n",
"I would like this data to be secured and protected in all possible ways. I believe It was stolen at \u001b[31m<TIME>\u001b[0m.\n",
"\n",
"In case any information arises regarding my wallet, please reach out to me on my phone number, 7344131647, or through my personal email, jamesmichael@example.com.\n",
"\n",
"Please consider this information to be highly confidential and respect my privacy. \n",
"\n",
"The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, blakeerik@example.com.\n",
"My representative there is Cristian Santos (her business phone: 2812140441).\n",
"\n",
"Thank you for your assistance,\n",
"\n",
"Brian Cox DVM\n"
]
}
],
"source": [
"anonymizer = PresidioReversibleAnonymizer(\n",
" add_default_faker_operators=True,\n",
" # Faker seed is used here to make sure the same fake data is generated for the test purposes\n",
" # In production, it is recommended to remove the faker_seed parameter (it will default to None)\n",
" faker_seed=42,\n",
")\n",
"\n",
"anonymizer.add_recognizer(polish_id_recognizer)\n",
"anonymizer.add_recognizer(time_recognizer)\n",
"\n",
"print_colored_pii(anonymizer.anonymize(document_content))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, almost all values have been replaced with synthetic ones. The only exception is the Polish ID number and time, which are not supported by the default faker operators. We can add new operators to the anonymizer, which will generate random data. You can read more about custom operators [here](https://microsoft.github.io/presidio/tutorial/11_custom_anonymization/)."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'VTC592627'"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from faker import Faker\n",
"\n",
"fake = Faker()\n",
"\n",
"\n",
"def fake_polish_id(_=None):\n",
" return fake.bothify(text=\"???######\").upper()\n",
"\n",
"\n",
"fake_polish_id()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'03:14 PM'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def fake_time(_=None):\n",
" return fake.time(pattern=\"%I:%M %p\")\n",
"\n",
"\n",
"fake_time()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's add newly created operators to the anonymizer:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"from presidio_anonymizer.entities import OperatorConfig\n",
"\n",
"new_operators = {\n",
" \"POLISH_ID\": OperatorConfig(\"custom\", {\"lambda\": fake_polish_id}),\n",
" \"TIME\": OperatorConfig(\"custom\", {\"lambda\": fake_time}),\n",
"}\n",
"\n",
"anonymizer.add_operators(new_operators)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And anonymize everything once again:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Date: 1974-12-26\n",
"Witness: Jimmy Murillo\n",
"Subject: Testimony Regarding the Loss of Wallet\n",
"\n",
"Testimony Content:\n",
"\n",
"Hello Officer,\n",
"\n",
"My name is Jimmy Murillo and on 1974-12-26, my wallet was stolen in the vicinity of South Dianeshire during a bike trip. This wallet contains some very important things to me.\n",
"\n",
"Firstly, the wallet contains my credit card with number 213108121913614, which is registered under my name and linked to my bank account, GB17DBUR01326773602606.\n",
"\n",
"Additionally, the wallet had a driver's license - DL No: 532311310 issued to my name. It also houses my Social Security Number, 690-84-1613. \n",
"\n",
"What's more, I had my polish identity card there, with the number UFB745084.\n",
"\n",
"I would like this data to be secured and protected in all possible ways. I believe It was stolen at 11:54 AM.\n",
"\n",
"In case any information arises regarding my wallet, please reach out to me on my phone number, 876.931.1656, or through my personal email, briannasmith@example.net.\n",
"\n",
"Please consider this information to be highly confidential and respect my privacy. \n",
"\n",
"The bank has been informed about the stolen credit card and necessary actions have been taken from their end. They will be reachable at their official email, samuel87@example.org.\n",
"My representative there is Joshua Blair (her business phone: 3361388464).\n",
"\n",
"Thank you for your assistance,\n",
"\n",
"Jimmy Murillo\n"
]
}
],
"source": [
"anonymizer.reset_deanonymizer_mapping()\n",
"print_colored_pii(anonymizer.anonymize(document_content))"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'CREDIT_CARD': {'213108121913614': '4111 1111 1111 1111'},\n",
" 'DATE_TIME': {'1974-12-26': 'October 19, 2021'},\n",
" 'EMAIL_ADDRESS': {'briannasmith@example.net': 'johndoe@example.com',\n",
" 'samuel87@example.org': 'support@bankname.com'},\n",
" 'IBAN_CODE': {'GB17DBUR01326773602606': 'PL61109010140000071219812874'},\n",
" 'LOCATION': {'South Dianeshire': 'Kilmarnock'},\n",
" 'PERSON': {'Jimmy Murillo': 'John Doe', 'Joshua Blair': 'Victoria Cherry'},\n",
" 'PHONE_NUMBER': {'876.931.1656': '999-888-7777'},\n",
" 'POLISH_ID': {'UFB745084': 'ABC123456'},\n",
" 'TIME': {'11:54 AM': '9:30 AM'},\n",
" 'UK_NHS': {'3361388464': '987-654-3210'},\n",
" 'US_DRIVER_LICENSE': {'532311310': '999000680'},\n",
" 'US_SSN': {'690-84-1613': '602-76-4532'}}\n"
]
}
],
"source": [
"pprint.pprint(anonymizer.deanonymizer_mapping)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Voilà! Now all values are replaced with synthetic ones. Note that the deanonymizer mapping has been updated accordingly."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question-answering system with PII anonymization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's wrap it up together and create full question-answering system, based on `PresidioReversibleAnonymizer` and LangChain Expression Language (LCEL)."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# 1. Initialize anonymizer\n",
"anonymizer = PresidioReversibleAnonymizer(\n",
" # Faker seed is used here to make sure the same fake data is generated for the test purposes\n",
" # In production, it is recommended to remove the faker_seed parameter (it will default to None)\n",
" faker_seed=42,\n",
")\n",
"\n",
"anonymizer.add_recognizer(polish_id_recognizer)\n",
"anonymizer.add_recognizer(time_recognizer)\n",
"\n",
"anonymizer.add_operators(new_operators)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"from langchain_community.vectorstores import FAISS\n",
"from langchain_openai import OpenAIEmbeddings\n",
"\n",
"# 2. Load the data: In our case data's already loaded\n",
"# 3. Anonymize the data before indexing\n",
"for doc in documents:\n",
" doc.page_content = anonymizer.anonymize(doc.page_content)\n",
"\n",
"# 4. Split the documents into chunks\n",
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)\n",
"chunks = text_splitter.split_documents(documents)\n",
"\n",
"# 5. Index the chunks (using OpenAI embeddings, because the data is already anonymized)\n",
"embeddings = OpenAIEmbeddings()\n",
"docsearch = FAISS.from_documents(chunks, embeddings)\n",
"retriever = docsearch.as_retriever()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"from operator import itemgetter\n",
"\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_core.runnables import (\n",
" RunnableLambda,\n",
" RunnableParallel,\n",
" RunnablePassthrough,\n",
")\n",
"from langchain_openai import ChatOpenAI\n",
"\n",
"# 6. Create anonymizer chain\n",
"template = \"\"\"Answer the question based only on the following context:\n",
"{context}\n",
"\n",
"Question: {anonymized_question}\n",
"\"\"\"\n",
"prompt = ChatPromptTemplate.from_template(template)\n",
"\n",
"model = ChatOpenAI(temperature=0.3)\n",
"\n",
"\n",
"_inputs = RunnableParallel(\n",
" question=RunnablePassthrough(),\n",
" # It is important to remember about question anonymization\n",
" anonymized_question=RunnableLambda(anonymizer.anonymize),\n",
")\n",
"\n",
"anonymizer_chain = (\n",
" _inputs\n",
" | {\n",
" \"context\": itemgetter(\"anonymized_question\") | retriever,\n",
" \"anonymized_question\": itemgetter(\"anonymized_question\"),\n",
" }\n",
" | prompt\n",
" | model\n",
" | StrOutputParser()\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'The theft of the wallet occurred in the vicinity of New Rita during a bike trip. It was stolen from Brian Cox DVM. The time of the theft was 02:22 AM.'"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"anonymizer_chain.invoke(\n",
" \"Where did the theft of the wallet occur, at what time, and who was it stolen from?\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The theft of the wallet occurred in the vicinity of Kilmarnock during a bike trip. It was stolen from John Doe. The time of the theft was 9:30 AM.\n"
]
}
],
"source": [
"# 7. Add deanonymization step to the chain\n",
"chain_with_deanonymization = anonymizer_chain | RunnableLambda(anonymizer.deanonymize)\n",
"\n",
"print(\n",
" chain_with_deanonymization.invoke(\n",
" \"Where did the theft of the wallet occur, at what time, and who was it stolen from?\"\n",
" )\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The content of the wallet included a credit card with the number 4111 1111 1111 1111, registered under the name of John Doe and linked to the bank account PL61109010140000071219812874. It also contained a driver's license with the number 999000680 issued to John Doe, as well as his Social Security Number 602-76-4532. Additionally, the wallet had a Polish identity card with the number ABC123456.\n"
]
}
],
"source": [
"print(\n",
" chain_with_deanonymization.invoke(\"What was the content of the wallet in detail?\")\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The phone number 999-888-7777 belongs to John Doe.\n"
]
}
],
"source": [
"print(chain_with_deanonymization.invoke(\"Whose phone number is it: 999-888-7777?\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Alternative approach: local embeddings + anonymizing the context after indexing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If for some reason you would like to index the data in its original form, or simply use custom embeddings, below is an example of how to do it:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"anonymizer = PresidioReversibleAnonymizer(\n",
" # Faker seed is used here to make sure the same fake data is generated for the test purposes\n",
" # In production, it is recommended to remove the faker_seed parameter (it will default to None)\n",
" faker_seed=42,\n",
")\n",
"\n",
"anonymizer.add_recognizer(polish_id_recognizer)\n",
"anonymizer.add_recognizer(time_recognizer)\n",
"\n",
"anonymizer.add_operators(new_operators)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.embeddings import HuggingFaceBgeEmbeddings\n",
"\n",
"model_name = \"BAAI/bge-base-en-v1.5\"\n",
"# model_kwargs = {'device': 'cuda'}\n",
"encode_kwargs = {\"normalize_embeddings\": True} # set True to compute cosine similarity\n",
"local_embeddings = HuggingFaceBgeEmbeddings(\n",
" model_name=model_name,\n",
" # model_kwargs=model_kwargs,\n",
" encode_kwargs=encode_kwargs,\n",
" query_instruction=\"Represent this sentence for searching relevant passages:\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"documents = [Document(page_content=document_content)]\n",
"\n",
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)\n",
"chunks = text_splitter.split_documents(documents)\n",
"\n",
"docsearch = FAISS.from_documents(chunks, local_embeddings)\n",
"retriever = docsearch.as_retriever()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"template = \"\"\"Answer the question based only on the following context:\n",
"{context}\n",
"\n",
"Question: {anonymized_question}\n",
"\"\"\"\n",
"prompt = ChatPromptTemplate.from_template(template)\n",
"\n",
"model = ChatOpenAI(temperature=0.2)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"from langchain.prompts.prompt import PromptTemplate\n",
"from langchain_core.prompts import format_document\n",
"\n",
"DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template=\"{page_content}\")\n",
"\n",
"\n",
"def _combine_documents(\n",
" docs, document_prompt=DEFAULT_DOCUMENT_PROMPT, document_separator=\"\\n\\n\"\n",
"):\n",
" doc_strings = [format_document(doc, document_prompt) for doc in docs]\n",
" return document_separator.join(doc_strings)\n",
"\n",
"\n",
"chain_with_deanonymization = (\n",
" RunnableParallel({\"question\": RunnablePassthrough()})\n",
" | {\n",
" \"context\": itemgetter(\"question\")\n",
" | retriever\n",
" | _combine_documents\n",
" | anonymizer.anonymize,\n",
" \"anonymized_question\": lambda x: anonymizer.anonymize(x[\"question\"]),\n",
" }\n",
" | prompt\n",
" | model\n",
" | StrOutputParser()\n",
" | RunnableLambda(anonymizer.deanonymize)\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The theft of the wallet occurred in the vicinity of Kilmarnock during a bike trip. It was stolen from John Doe. The time of the theft was 9:30 AM.\n"
]
}
],
"source": [
"print(\n",
" chain_with_deanonymization.invoke(\n",
" \"Where did the theft of the wallet occur, at what time, and who was it stolen from?\"\n",
" )\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The content of the wallet included:\n",
"1. Credit card number: 4111 1111 1111 1111\n",
"2. Bank account number: PL61109010140000071219812874\n",
"3. Driver's license number: 999000680\n",
"4. Social Security Number: 602-76-4532\n",
"5. Polish identity card number: ABC123456\n"
]
}
],
"source": [
"print(\n",
" chain_with_deanonymization.invoke(\"What was the content of the wallet in detail?\")\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The phone number 999-888-7777 belongs to John Doe.\n"
]
}
],
"source": [
"print(chain_with_deanonymization.invoke(\"Whose phone number is it: 999-888-7777?\"))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "langchain-py-env",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}