mirror of
https://github.com/hwchase17/langchain
synced 2024-10-29 17:07:25 +00:00
4cc4534d81
### Description The feature for pseudonymizing data with ability to retrieve original text (deanonymization) has been implemented. In order to protect private data, such as when querying external APIs (OpenAI), it is worth pseudonymizing sensitive data to maintain full privacy. But then, after the model response, it would be good to have the data in the original form. I implemented the `PresidioReversibleAnonymizer`, which consists of two parts: 1. anonymization - it works the same way as `PresidioAnonymizer`, plus the object itself stores a mapping of made-up values to original ones, for example: ``` { "PERSON": { "<anonymized>": "<original>", "John Doe": "Slim Shady" }, "PHONE_NUMBER": { "111-111-1111": "555-555-5555" } ... } ``` 2. deanonymization - using the mapping described above, it matches fake data with original data and then substitutes it. Between anonymization and deanonymization user can perform different operations, for example, passing the output to LLM. ### Future works - **instance anonymization** - at this point, each occurrence of PII is treated as a separate entity and separately anonymized. Therefore, two occurrences of the name John Doe in the text will be changed to two different names. It is therefore worth introducing support for full instance detection, so that repeated occurrences are treated as a single object. - **better matching and substitution of fake values for real ones** - currently the strategy is based on matching full strings and then substituting them. Due to the indeterminism of language models, it may happen that the value in the answer is slightly changed (e.g. *John Doe* -> *John* or *Main St, New York* -> *New York*) and such a substitution is then no longer possible. Therefore, it is worth adjusting the matching for your needs. - **Q&A with anonymization** - when I'm done writing all the functionality, I thought it would be a cool resource in documentation to write a notebook about retrieval from documents using anonymization. An iterative process, adding new recognizers to fit the data, lessons learned and what to look out for ### Twitter handle @deepsense_ai / @MaksOpp --------- Co-authored-by: MaksOpp <maks.operlejn@gmail.com> Co-authored-by: Bagatur <baskaryan@gmail.com>
470 lines
14 KiB
Plaintext
470 lines
14 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Data anonymization with Microsoft Presidio\n",
|
|
"\n",
|
|
"[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/presidio_data_anonymization.ipynb)\n",
|
|
"\n",
|
|
"## Use case\n",
|
|
"\n",
|
|
"Data anonymization is crucial before passing information to a language model like GPT-4 because it helps protect privacy and maintain confidentiality. If data is not anonymized, sensitive information such as names, addresses, contact numbers, or other identifiers linked to specific individuals could potentially be learned and misused. Hence, by obscuring or removing this personally identifiable information (PII), data can be used freely without compromising individuals' privacy rights or breaching data protection laws and regulations.\n",
|
|
"\n",
|
|
"## Overview\n",
|
|
"\n",
|
|
"Anonynization consists of two steps:\n",
|
|
"\n",
|
|
"1. **Identification:** Identify all data fields that contain personally identifiable information (PII).\n",
|
|
"2. **Replacement**: Replace all PIIs with pseudo values or codes that do not reveal any personal information about the individual but can be used for reference. We're not using regular encryption, because the language model won't be able to understand the meaning or context of the encrypted data.\n",
|
|
"\n",
|
|
"We use *Microsoft Presidio* together with *Faker* framework for anonymization purposes because of the wide range of functionalities they provide. The full implementation is available in `PresidioAnonymizer`.\n",
|
|
"\n",
|
|
"## Quickstart\n",
|
|
"\n",
|
|
"Below you will find the use case on how to leverage anonymization in LangChain."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Install necessary packages\n",
|
|
"# ! pip install langchain langchain-experimental openai presidio-analyzer presidio-anonymizer spacy Faker\n",
|
|
"# ! python -m spacy download en_core_web_lg"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"\\\n",
|
|
"Let's see how PII anonymization works using a sample sentence:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'My name is Laura Ruiz, call me at +1-412-982-8374x13414 or email me at javierwatkins@example.net'"
|
|
]
|
|
},
|
|
"execution_count": 2,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"from langchain_experimental.data_anonymizer import PresidioAnonymizer\n",
|
|
"\n",
|
|
"anonymizer = PresidioAnonymizer()\n",
|
|
"\n",
|
|
"anonymizer.anonymize(\n",
|
|
" \"My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com\"\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Using with LangChain Expression Language\n",
|
|
"\n",
|
|
"With LCEL we can easily chain together anonymization with the rest of our application."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Set env var OPENAI_API_KEY or load from a .env file:\n",
|
|
"# import dotenv\n",
|
|
"\n",
|
|
"# dotenv.load_dotenv()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"text = f\"\"\"Slim Shady recently lost his wallet. \n",
|
|
"Inside is some cash and his credit card with the number 4916 0387 9536 0861. \n",
|
|
"If you would find it, please call at 313-666-7440 or write an email here: real.slim.shady@gmail.com.\"\"\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Dear Sir/Madam,\n",
|
|
"\n",
|
|
"We regret to inform you that Richard Fields has recently misplaced his wallet, which contains a sum of cash and his credit card bearing the number 30479847307774. \n",
|
|
"\n",
|
|
"Should you happen to come across it, we kindly request that you contact us immediately at 6439182672 or via email at frank45@example.com.\n",
|
|
"\n",
|
|
"Thank you for your attention to this matter.\n",
|
|
"\n",
|
|
"Yours faithfully,\n",
|
|
"\n",
|
|
"[Your Name]\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from langchain.prompts.prompt import PromptTemplate\n",
|
|
"from langchain.chat_models import ChatOpenAI\n",
|
|
"\n",
|
|
"anonymizer = PresidioAnonymizer()\n",
|
|
"\n",
|
|
"template = \"\"\"Rewrite this text into an official, short email:\n",
|
|
"\n",
|
|
"{anonymized_text}\"\"\"\n",
|
|
"prompt = PromptTemplate.from_template(template)\n",
|
|
"llm = ChatOpenAI(temperature=0)\n",
|
|
"\n",
|
|
"chain = {\"anonymized_text\": anonymizer.anonymize} | prompt | llm\n",
|
|
"response = chain.invoke(text)\n",
|
|
"print(response.content)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Customization\n",
|
|
"We can specify ``analyzed_fields`` to only anonymize particular types of data."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'My name is Adrian Fleming, call me at 313-666-7440 or email me at real.slim.shady@gmail.com'"
|
|
]
|
|
},
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"anonymizer = PresidioAnonymizer(analyzed_fields=[\"PERSON\"])\n",
|
|
"\n",
|
|
"anonymizer.anonymize(\n",
|
|
" \"My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com\"\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"As can be observed, the name was correctly identified and replaced with another. The `analyzed_fields` attribute is responsible for what values are to be detected and substituted. We can add *PHONE_NUMBER* to the list:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'My name is Justin Miller, call me at 761-824-1889 or email me at real.slim.shady@gmail.com'"
|
|
]
|
|
},
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"anonymizer = PresidioAnonymizer(analyzed_fields=[\"PERSON\", \"PHONE_NUMBER\"])\n",
|
|
"anonymizer.anonymize(\n",
|
|
" \"My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com\"\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"\\\n",
|
|
"If no analyzed_fields are specified, by default the anonymizer will detect all supported formats. Below is the full list of them:\n",
|
|
"\n",
|
|
"`['PERSON', 'EMAIL_ADDRESS', 'PHONE_NUMBER', 'IBAN_CODE', 'CREDIT_CARD', 'CRYPTO', 'IP_ADDRESS', 'LOCATION', 'DATE_TIME', 'NRP', 'MEDICAL_LICENSE', 'URL', 'US_BANK_NUMBER', 'US_DRIVER_LICENSE', 'US_ITIN', 'US_PASSPORT', 'US_SSN']`\n",
|
|
"\n",
|
|
"**Disclaimer:** We suggest carefully defining the private data to be detected - Presidio doesn't work perfectly and it sometimes makes mistakes, so it's better to have more control over the data."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'My name is Dr. Jennifer Baker, call me at (508)839-9329x232 or email me at ehamilton@example.com'"
|
|
]
|
|
},
|
|
"execution_count": 8,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"anonymizer = PresidioAnonymizer()\n",
|
|
"anonymizer.anonymize(\n",
|
|
" \"My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com\"\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"\\\n",
|
|
"It may be that the above list of detected fields is not sufficient. For example, the already available *PHONE_NUMBER* field does not support polish phone numbers and confuses it with another field:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'My polish phone number is NRGN41434238921378'"
|
|
]
|
|
},
|
|
"execution_count": 9,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"anonymizer = PresidioAnonymizer()\n",
|
|
"anonymizer.anonymize(\"My polish phone number is 666555444\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"\\\n",
|
|
"You can then write your own recognizers and add them to the pool of those present. How exactly to create recognizers is described in the [Presidio documentation](https://microsoft.github.io/presidio/samples/python/customizing_presidio_analyzer/)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Define the regex pattern in a Presidio `Pattern` object:\n",
|
|
"from presidio_analyzer import Pattern, PatternRecognizer\n",
|
|
"\n",
|
|
"\n",
|
|
"polish_phone_numbers_pattern = Pattern(\n",
|
|
" name=\"polish_phone_numbers_pattern\",\n",
|
|
" regex=\"(?<!\\w)(\\(?(\\+|00)?48\\)?)?[ -]?\\d{3}[ -]?\\d{3}[ -]?\\d{3}(?!\\w)\",\n",
|
|
" score=1,\n",
|
|
")\n",
|
|
"\n",
|
|
"# Define the recognizer with one or more patterns\n",
|
|
"polish_phone_numbers_recognizer = PatternRecognizer(\n",
|
|
" supported_entity=\"POLISH_PHONE_NUMBER\", patterns=[polish_phone_numbers_pattern]\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"\\\n",
|
|
"Now, we can add recognizer by calling `add_recognizer` method on the anonymizer:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"anonymizer.add_recognizer(polish_phone_numbers_recognizer)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"\\\n",
|
|
"And voilà! With the added pattern-based recognizer, the anonymizer now handles polish phone numbers."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 12,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"My polish phone number is <POLISH_PHONE_NUMBER>\n",
|
|
"My polish phone number is <POLISH_PHONE_NUMBER>\n",
|
|
"My polish phone number is <POLISH_PHONE_NUMBER>\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"print(anonymizer.anonymize(\"My polish phone number is 666555444\"))\n",
|
|
"print(anonymizer.anonymize(\"My polish phone number is 666 555 444\"))\n",
|
|
"print(anonymizer.anonymize(\"My polish phone number is +48 666 555 444\"))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"\\\n",
|
|
"The problem is - even though we recognize polish phone numbers now, we don't have a method (operator) that would tell how to substitute a given field - because of this, in the outpit we only provide string `<POLISH_PHONE_NUMBER>` We need to create a method to replace it correctly: "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 13,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'511 622 683'"
|
|
]
|
|
},
|
|
"execution_count": 13,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"from faker import Faker\n",
|
|
"\n",
|
|
"fake = Faker(locale=\"pl_PL\")\n",
|
|
"\n",
|
|
"\n",
|
|
"def fake_polish_phone_number(_=None):\n",
|
|
" return fake.phone_number()\n",
|
|
"\n",
|
|
"\n",
|
|
"fake_polish_phone_number()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"\\\n",
|
|
"We used Faker to create pseudo data. Now we can create an operator and add it to the anonymizer. For complete information about operators and their creation, see the Presidio documentation for [simple](https://microsoft.github.io/presidio/tutorial/10_simple_anonymization/) and [custom](https://microsoft.github.io/presidio/tutorial/11_custom_anonymization/) anonymization."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 14,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from presidio_anonymizer.entities import OperatorConfig\n",
|
|
"\n",
|
|
"new_operators = {\n",
|
|
" \"POLISH_PHONE_NUMBER\": OperatorConfig(\n",
|
|
" \"custom\", {\"lambda\": fake_polish_phone_number}\n",
|
|
" )\n",
|
|
"}"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 15,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"anonymizer.add_operators(new_operators)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 16,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'My polish phone number is +48 734 630 977'"
|
|
]
|
|
},
|
|
"execution_count": 16,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"anonymizer.anonymize(\"My polish phone number is 666555444\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Future works\n",
|
|
"\n",
|
|
"- **deanonymization** - add the ability to reverse anonymization. For example, the workflow could look like this: `anonymize -> LLMChain -> deanonymize`. By doing this, we will retain anonymity in requests to, for example, OpenAI, and then be able restore the original data.\n",
|
|
"- **instance anonymization** - at this point, each occurrence of PII is treated as a separate entity and separately anonymized. Therefore, two occurrences of the name John Doe in the text will be changed to two different names. It is therefore worth introducing support for full instance detection, so that repeated occurrences are treated as a single object."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.11.4"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 4
|
|
}
|