mirror of https://github.com/hwchase17/langchain
Data deanonymization (#10093)
### Description

The feature for pseudonymizing data with the ability to retrieve the original text (deanonymization) has been implemented. To protect private data, for example when querying external APIs (OpenAI), it is worth pseudonymizing sensitive data to maintain full privacy. After the model responds, however, it is useful to have the data back in its original form.

I implemented the `PresidioReversibleAnonymizer`, which consists of two parts:

1. anonymization - it works the same way as `PresidioAnonymizer`, plus the object itself stores a mapping of made-up values to original ones, for example:
```
    {
        "PERSON": {
            "<anonymized>": "<original>",
            "John Doe": "Slim Shady"
        },
        "PHONE_NUMBER": {
            "111-111-1111": "555-555-5555"
        }
        ...
    }
```

2. deanonymization - using the mapping described above, it matches fake data with original data and then substitutes it.

Between anonymization and deanonymization, the user can perform other operations, for example passing the output to an LLM.

### Future work

- **instance anonymization** - at this point, each occurrence of PII is treated as a separate entity and anonymized separately. Therefore, two occurrences of the name John Doe in the text will be changed to two different names. It is therefore worth introducing support for full instance detection, so that repeated occurrences are treated as a single object.
- **better matching and substitution of fake values for real ones** - currently the strategy is based on matching full strings and then substituting them. Because language models are nondeterministic, the value in the answer may be slightly changed (e.g. *John Doe* -> *John* or *Main St, New York* -> *New York*), and the substitution is then no longer possible. Therefore, it is worth adjusting the matching to your needs.
- **Q&A with anonymization** - once all the functionality is written, a notebook about retrieval from documents using anonymization would be a useful documentation resource: an iterative process of adding new recognizers to fit the data, with lessons learned and what to look out for.

### Twitter handle

@deepsense_ai / @MaksOpp

---------

Co-authored-by: MaksOpp <maks.operlejn@gmail.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
pull/10238/head^2
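The anonymize/deanonymize round trip described in the commit message can be sketched without Presidio or Faker. This is only a toy illustration under stated assumptions, not the `PresidioReversibleAnonymizer` implementation: the helpers `toy_anonymize` and `toy_deanonymize` and the hard-coded replacement table are hypothetical, and real PII detection is replaced by a plain substring lookup.

```python
from typing import Dict, Tuple

# Same shape as the mapping shown above: entity type -> {fake value: original value}
Mapping = Dict[str, Dict[str, str]]


def toy_anonymize(
    text: str, replacements: Dict[str, Tuple[str, str]]
) -> Tuple[str, Mapping]:
    """Swap each known original value for a fake one and record fake -> original."""
    mapping: Mapping = {}
    for original, (entity_type, fake) in replacements.items():
        if original in text:
            text = text.replace(original, fake)
            mapping.setdefault(entity_type, {})[fake] = original
    return text, mapping


def toy_deanonymize(text: str, mapping: Mapping) -> str:
    """Substitute every recorded fake value with its original (full-string match)."""
    for fakes in mapping.values():
        for fake, original in fakes.items():
            text = text.replace(fake, original)
    return text


# Hypothetical replacement table standing in for Presidio detection + Faker output
replacements = {
    "Slim Shady": ("PERSON", "John Doe"),
    "555-555-5555": ("PHONE_NUMBER", "111-111-1111"),
}

source = "Slim Shady can be reached at 555-555-5555."
anonymized, mapping = toy_anonymize(source, replacements)
restored = toy_deanonymize(anonymized, mapping)

print(anonymized)
print(mapping)
print(restored)
```

Anything that happens between the two calls, such as sending `anonymized` to an LLM, only ever sees the fake values; the mapping stays on the caller's side.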
parent 67696fe3ba
commit 4cc4534d81
@ -0,0 +1,461 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reversible data anonymization with Microsoft Presidio\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/reversible_anonymization.ipynb)\n",
    "\n",
    "\n",
    "## Use case\n",
    "\n",
    "We have already written about the importance of anonymizing sensitive data in the previous section. **Reversible Anonymization** is an equally essential technology when sharing information with language models, as it balances data protection with data usability. This technique masks sensitive personally identifiable information (PII), yet it can be reversed so that the original data is restored when authorized users need it. Its main advantage is that, while it conceals individual identities to prevent misuse, it also allows the concealed data to be accurately unmasked should that be necessary for legal or compliance purposes.\n",
    "\n",
    "## Overview\n",
    "\n",
    "We implemented the `PresidioReversibleAnonymizer`, which consists of two parts:\n",
    "\n",
    "1. anonymization - it works the same way as `PresidioAnonymizer`, plus the object itself stores a mapping of made-up values to original ones, for example:\n",
    "```\n",
    "    {\n",
    "        \"PERSON\": {\n",
    "            \"<anonymized>\": \"<original>\",\n",
    "            \"John Doe\": \"Slim Shady\"\n",
    "        },\n",
    "        \"PHONE_NUMBER\": {\n",
    "            \"111-111-1111\": \"555-555-5555\"\n",
    "        }\n",
    "        ...\n",
    "    }\n",
    "```\n",
    "\n",
    "2. deanonymization - using the mapping described above, it matches fake data with original data and then substitutes it.\n",
    "\n",
    "Between anonymization and deanonymization, the user can perform other operations, for example passing the output to an LLM.\n",
    "\n",
    "## Quickstart\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install necessary packages\n",
    "# ! pip install langchain langchain-experimental openai presidio-analyzer presidio-anonymizer spacy Faker\n",
    "# ! python -m spacy download en_core_web_lg"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`PresidioReversibleAnonymizer` is not significantly different from its predecessor (`PresidioAnonymizer`) in terms of anonymization:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'My name is Maria Lynch, call me at 7344131647 or email me at jamesmichael@example.com. By the way, my card number is: 4838637940262'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer\n",
    "\n",
    "anonymizer = PresidioReversibleAnonymizer(\n",
    "    analyzed_fields=[\"PERSON\", \"PHONE_NUMBER\", \"EMAIL_ADDRESS\", \"CREDIT_CARD\"],\n",
    "    # Faker seed is used here to make sure the same fake data is generated for the test purposes\n",
    "    # In production, it is recommended to remove the faker_seed parameter (it will default to None)\n",
    "    faker_seed=42,\n",
    ")\n",
    "\n",
    "anonymizer.anonymize(\n",
    "    \"My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com. \"\n",
    "    \"By the way, my card number is: 4916 0387 9536 0861\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is what the full string we want to deanonymize looks like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Maria Lynch recently lost his wallet. \n",
      "Inside is some cash and his credit card with the number 4838637940262. \n",
      "If you would find it, please call at 7344131647 or write an email here: jamesmichael@example.com.\n",
      "Maria Lynch would be very grateful!\n"
     ]
    }
   ],
   "source": [
    "# We know this data, as we set the faker_seed parameter\n",
    "fake_name = \"Maria Lynch\"\n",
    "fake_phone = \"7344131647\"\n",
    "fake_email = \"jamesmichael@example.com\"\n",
    "fake_credit_card = \"4838637940262\"\n",
    "\n",
    "anonymized_text = f\"\"\"{fake_name} recently lost his wallet. \n",
    "Inside is some cash and his credit card with the number {fake_credit_card}. \n",
    "If you would find it, please call at {fake_phone} or write an email here: {fake_email}.\n",
    "{fake_name} would be very grateful!\"\"\"\n",
    "\n",
    "print(anonymized_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now, using the `deanonymize` method, we can reverse the process:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Slim Shady recently lost his wallet. \n",
      "Inside is some cash and his credit card with the number 4916 0387 9536 0861. \n",
      "If you would find it, please call at 313-666-7440 or write an email here: real.slim.shady@gmail.com.\n",
      "Slim Shady would be very grateful!\n"
     ]
    }
   ],
   "source": [
    "print(anonymizer.deanonymize(anonymized_text))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using with LangChain Expression Language\n",
    "\n",
    "With LCEL we can easily chain together anonymization and deanonymization with the rest of our application. This is an example of using the anonymization mechanism with a query to an LLM (without deanonymization for now):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = f\"\"\"Slim Shady recently lost his wallet. \n",
    "Inside is some cash and his credit card with the number 4916 0387 9536 0861. \n",
    "If you would find it, please call at 313-666-7440 or write an email here: real.slim.shady@gmail.com.\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dear Sir/Madam,\n",
      "\n",
      "We regret to inform you that Mr. Dana Rhodes has reported the loss of his wallet. The wallet contains a sum of cash and his credit card, bearing the number 4397528473885757. \n",
      "\n",
      "If you happen to come across the aforementioned wallet, we kindly request that you contact us immediately at 258-481-7074x714 or via email at laurengoodman@example.com.\n",
      "\n",
      "Your prompt assistance in this matter would be greatly appreciated.\n",
      "\n",
      "Yours faithfully,\n",
      "\n",
      "[Your Name]\n"
     ]
    }
   ],
   "source": [
    "from langchain.prompts.prompt import PromptTemplate\n",
    "from langchain.chat_models import ChatOpenAI\n",
    "\n",
    "anonymizer = PresidioReversibleAnonymizer()\n",
    "\n",
    "template = \"\"\"Rewrite this text into an official, short email:\n",
    "\n",
    "{anonymized_text}\"\"\"\n",
    "prompt = PromptTemplate.from_template(template)\n",
    "llm = ChatOpenAI(temperature=0)\n",
    "\n",
    "chain = {\"anonymized_text\": anonymizer.anonymize} | prompt | llm\n",
    "response = chain.invoke(text)\n",
    "print(response.content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's add a **deanonymization step** to our sequence:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dear Sir/Madam,\n",
      "\n",
      "We regret to inform you that Mr. Slim Shady has recently misplaced his wallet. The wallet contains a sum of cash and his credit card, bearing the number 4916 0387 9536 0861. \n",
      "\n",
      "If by any chance you come across the lost wallet, kindly contact us immediately at 313-666-7440 or send an email to real.slim.shady@gmail.com.\n",
      "\n",
      "Your prompt assistance in this matter would be greatly appreciated.\n",
      "\n",
      "Yours faithfully,\n",
      "\n",
      "[Your Name]\n"
     ]
    }
   ],
   "source": [
    "chain = chain | (lambda ai_message: anonymizer.deanonymize(ai_message.content))\n",
    "response = chain.invoke(text)\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Only anonymized data was passed to the model, so the original values were never exposed to the outside world. The model's response was then processed, and each fake value was replaced with the real one."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extra knowledge"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`PresidioReversibleAnonymizer` stores the mapping of the fake values to the original values in the `deanonymizer_mapping` attribute, where each key is a fake PII value and each value is the original one:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'PERSON': {'Maria Lynch': 'Slim Shady'},\n",
       " 'PHONE_NUMBER': {'7344131647': '313-666-7440'},\n",
       " 'EMAIL_ADDRESS': {'jamesmichael@example.com': 'real.slim.shady@gmail.com'},\n",
       " 'CREDIT_CARD': {'4838637940262': '4916 0387 9536 0861'}}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer\n",
    "\n",
    "anonymizer = PresidioReversibleAnonymizer(\n",
    "    analyzed_fields=[\"PERSON\", \"PHONE_NUMBER\", \"EMAIL_ADDRESS\", \"CREDIT_CARD\"],\n",
    "    # Faker seed is used here to make sure the same fake data is generated for the test purposes\n",
    "    # In production, it is recommended to remove the faker_seed parameter (it will default to None)\n",
    "    faker_seed=42,\n",
    ")\n",
    "\n",
    "anonymizer.anonymize(\n",
    "    \"My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com. \"\n",
    "    \"By the way, my card number is: 4916 0387 9536 0861\"\n",
    ")\n",
    "\n",
    "anonymizer.deanonymizer_mapping"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Anonymizing more texts will result in new mapping entries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Do you have his VISA card number? Yep, it's 3537672423884966. I'm William Bowman by the way.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'PERSON': {'Maria Lynch': 'Slim Shady', 'William Bowman': 'John Doe'},\n",
       " 'PHONE_NUMBER': {'7344131647': '313-666-7440'},\n",
       " 'EMAIL_ADDRESS': {'jamesmichael@example.com': 'real.slim.shady@gmail.com'},\n",
       " 'CREDIT_CARD': {'4838637940262': '4916 0387 9536 0861',\n",
       "                 '3537672423884966': '4001 9192 5753 7193'}}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(\n",
    "    anonymizer.anonymize(\n",
    "        \"Do you have his VISA card number? Yep, it's 4001 9192 5753 7193. I'm John Doe by the way.\"\n",
    "    )\n",
    ")\n",
    "\n",
    "anonymizer.deanonymizer_mapping"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can save the mapping itself to a file for future use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We can save the deanonymizer mapping as a JSON or YAML file\n",
    "\n",
    "anonymizer.save_deanonymizer_mapping(\"deanonymizer_mapping.json\")\n",
    "# anonymizer.save_deanonymizer_mapping(\"deanonymizer_mapping.yaml\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And then, load it in another `PresidioReversibleAnonymizer` instance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "anonymizer = PresidioReversibleAnonymizer()\n",
    "\n",
    "anonymizer.deanonymizer_mapping"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'PERSON': {'Maria Lynch': 'Slim Shady', 'William Bowman': 'John Doe'},\n",
       " 'PHONE_NUMBER': {'7344131647': '313-666-7440'},\n",
       " 'EMAIL_ADDRESS': {'jamesmichael@example.com': 'real.slim.shady@gmail.com'},\n",
       " 'CREDIT_CARD': {'4838637940262': '4916 0387 9536 0861',\n",
       "                 '3537672423884966': '4001 9192 5753 7193'}}"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "anonymizer.load_deanonymizer_mapping(\"deanonymizer_mapping.json\")\n",
    "\n",
    "anonymizer.deanonymizer_mapping"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Future work\n",
    "\n",
    "- **instance anonymization** - at this point, each occurrence of PII is treated as a separate entity and anonymized separately. Therefore, two occurrences of the name John Doe in the text will be changed to two different names. It is therefore worth introducing support for full instance detection, so that repeated occurrences are treated as a single object.\n",
    "- **better matching and substitution of fake values for real ones** - currently the strategy is based on matching full strings and then substituting them. Because language models are nondeterministic, the value in the answer may be slightly changed (e.g. *John Doe* -> *John* or *Main St, New York* -> *New York*), and the substitution is then no longer possible. Therefore, it is worth adjusting the matching to your needs."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
@ -1,4 +1,7 @@
 """Data anonymizer package"""
-from langchain_experimental.data_anonymizer.presidio import PresidioAnonymizer
+from langchain_experimental.data_anonymizer.presidio import (
+    PresidioAnonymizer,
+    PresidioReversibleAnonymizer,
+)
 
-__all__ = ["PresidioAnonymizer"]
+__all__ = ["PresidioAnonymizer", "PresidioReversibleAnonymizer"]
@ -0,0 +1,21 @@
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict

MappingDataType = Dict[str, Dict[str, str]]


@dataclass
class DeanonymizerMapping:
    mapping: MappingDataType = field(
        default_factory=lambda: defaultdict(lambda: defaultdict(str))
    )

    @property
    def data(self) -> MappingDataType:
        """Return the deanonymizer mapping"""
        return {k: dict(v) for k, v in self.mapping.items()}

    def update(self, new_mapping: MappingDataType) -> None:
        for entity_type, values in new_mapping.items():
            self.mapping[entity_type].update(values)
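To illustrate why the nested `defaultdict` is convenient here, the following standalone sketch repeats the dataclass above and feeds it two successive updates. The sample values are borrowed from the notebook; the variable names `store` and `snapshot` are hypothetical, and nothing in this sketch depends on Presidio.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict

MappingDataType = Dict[str, Dict[str, str]]


@dataclass
class DeanonymizerMapping:
    # Nested defaultdict: an unseen entity type gets an empty inner dict on first access
    mapping: MappingDataType = field(
        default_factory=lambda: defaultdict(lambda: defaultdict(str))
    )

    @property
    def data(self) -> MappingDataType:
        """Return the mapping converted to plain dicts."""
        return {k: dict(v) for k, v in self.mapping.items()}

    def update(self, new_mapping: MappingDataType) -> None:
        # No "if entity_type not in mapping" check is needed thanks to defaultdict
        for entity_type, values in new_mapping.items():
            self.mapping[entity_type].update(values)


store = DeanonymizerMapping()
store.update({"PERSON": {"Maria Lynch": "Slim Shady"}})
store.update(
    {
        "PERSON": {"William Bowman": "John Doe"},
        "CREDIT_CARD": {"3537672423884966": "4001 9192 5753 7193"},
    }
)

# Entries for an existing entity type are merged; new entity types are added
snapshot = store.data
print(snapshot)
```

Because `data` returns plain `dict` objects, the snapshot can be serialized or compared directly without carrying the `defaultdict` behavior along.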
@ -0,0 +1,17 @@
from langchain_experimental.data_anonymizer.presidio import MappingDataType


def default_matching_strategy(text: str, deanonymizer_mapping: MappingDataType) -> str:
    """
    Default matching strategy for deanonymization.
    It replaces all the anonymized entities with the original ones.

    Args:
        text: text to deanonymize
        deanonymizer_mapping: mapping between anonymized entities and original ones"""

    # Iterate over all the entities (PERSON, EMAIL_ADDRESS, etc.)
    for entity_type in deanonymizer_mapping:
        for anonymized, original in deanonymizer_mapping[entity_type].items():
            text = text.replace(anonymized, original)
    return text
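As the commit description notes, the default strategy matches full strings only, so any change the model makes to a fake value defeats the substitution. The sketch below repeats the default strategy as a standalone function and contrasts it with one possible relaxation. The function `case_insensitive_matching_strategy` is hypothetical and not part of this commit; it handles only changes in letter case, not truncations such as *John Doe* -> *John*.

```python
import re
from typing import Dict

MappingDataType = Dict[str, Dict[str, str]]


def default_matching_strategy(text: str, deanonymizer_mapping: MappingDataType) -> str:
    """Exact, case-sensitive full-string replacement (same logic as above)."""
    for entity_type in deanonymizer_mapping:
        for anonymized, original in deanonymizer_mapping[entity_type].items():
            text = text.replace(anonymized, original)
    return text


def case_insensitive_matching_strategy(
    text: str, deanonymizer_mapping: MappingDataType
) -> str:
    """Hypothetical variant: match fake values regardless of letter case."""
    for entity_type in deanonymizer_mapping:
        for anonymized, original in deanonymizer_mapping[entity_type].items():
            pattern = re.compile(re.escape(anonymized), flags=re.IGNORECASE)
            # A function replacement keeps backslashes in `original` literal
            text = pattern.sub(lambda _match, value=original: value, text)
    return text


mapping = {"PERSON": {"Maria Lynch": "Slim Shady"}}

exact = default_matching_strategy("Maria Lynch lost a wallet.", mapping)
shouted = default_matching_strategy("MARIA LYNCH lost a wallet.", mapping)
relaxed = case_insensitive_matching_strategy("MARIA LYNCH lost a wallet.", mapping)

print(exact)
print(shouted)
print(relaxed)
```

Looser matching raises the risk of replacing text that was never anonymized, so any such variant is a trade-off to evaluate against the data at hand.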
@ -0,0 +1,154 @@
import os
from typing import Iterator, List

import pytest


@pytest.fixture(scope="module", autouse=True)
def check_spacy_model() -> Iterator[None]:
    import spacy

    if not spacy.util.is_package("en_core_web_lg"):
        pytest.skip(reason="Spacy model 'en_core_web_lg' not installed")
    yield


@pytest.mark.requires("presidio_analyzer", "presidio_anonymizer", "faker")
@pytest.mark.parametrize(
    "analyzed_fields,should_contain",
    [(["PERSON"], False), (["PHONE_NUMBER"], True), (None, False)],
)
def test_anonymize(analyzed_fields: List[str], should_contain: bool) -> None:
    """Test anonymizing a name in a simple sentence"""
    from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer

    text = "Hello, my name is John Doe."
    anonymizer = PresidioReversibleAnonymizer(analyzed_fields=analyzed_fields)
    anonymized_text = anonymizer.anonymize(text)
    assert ("John Doe" in anonymized_text) == should_contain


@pytest.mark.requires("presidio_analyzer", "presidio_anonymizer", "faker")
def test_anonymize_multiple() -> None:
    """Test anonymizing multiple items in a sentence"""
    from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer

    text = "John Smith's phone number is 313-666-7440 and email is johnsmith@gmail.com"
    anonymizer = PresidioReversibleAnonymizer()
    anonymized_text = anonymizer.anonymize(text)
    for phrase in ["John Smith", "313-666-7440", "johnsmith@gmail.com"]:
        assert phrase not in anonymized_text


@pytest.mark.requires("presidio_analyzer", "presidio_anonymizer", "faker")
def test_anonymize_with_custom_operator() -> None:
    """Test anonymize a name with a custom operator"""
    from presidio_anonymizer.entities import OperatorConfig

    from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer

    custom_operator = {"PERSON": OperatorConfig("replace", {"new_value": "<name>"})}
    anonymizer = PresidioReversibleAnonymizer(operators=custom_operator)

    text = "Jane Doe was here."

    anonymized_text = anonymizer.anonymize(text)
    assert anonymized_text == "<name> was here."


@pytest.mark.requires("presidio_analyzer", "presidio_anonymizer", "faker")
def test_add_recognizer_operator() -> None:
    """
    Test add recognizer and anonymize a new type of entity and with a custom operator
    """
    from presidio_analyzer import PatternRecognizer
    from presidio_anonymizer.entities import OperatorConfig

    from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer

    anonymizer = PresidioReversibleAnonymizer(analyzed_fields=[])
    titles_list = ["Sir", "Madam", "Professor"]
    custom_recognizer = PatternRecognizer(
        supported_entity="TITLE", deny_list=titles_list
    )
    anonymizer.add_recognizer(custom_recognizer)

    # anonymizing with custom recognizer
    text = "Madam Jane Doe was here."
    anonymized_text = anonymizer.anonymize(text)
    assert anonymized_text == "<TITLE> Jane Doe was here."

    # anonymizing with custom recognizer and operator
    custom_operator = {"TITLE": OperatorConfig("replace", {"new_value": "Dear"})}
    anonymizer.add_operators(custom_operator)
    anonymized_text = anonymizer.anonymize(text)
    assert anonymized_text == "Dear Jane Doe was here."


@pytest.mark.requires("presidio_analyzer", "presidio_anonymizer", "faker")
def test_deanonymizer_mapping() -> None:
    """Test if deanonymizer mapping is correctly populated"""
    from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer

    anonymizer = PresidioReversibleAnonymizer(
        analyzed_fields=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "CREDIT_CARD"]
    )

    anonymizer.anonymize("Hello, my name is John Doe and my number is 444 555 6666.")

    # ["PERSON", "PHONE_NUMBER"]
    assert len(anonymizer.deanonymizer_mapping.keys()) == 2
    assert "John Doe" in anonymizer.deanonymizer_mapping.get("PERSON", {}).values()
    assert (
        "444 555 6666"
        in anonymizer.deanonymizer_mapping.get("PHONE_NUMBER", {}).values()
    )

    text_to_anonymize = (
        "And my name is Jane Doe, my email is jane@gmail.com and "
        "my credit card is 4929 5319 6292 5362."
    )
    anonymizer.anonymize(text_to_anonymize)

    # ["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "CREDIT_CARD"]
    assert len(anonymizer.deanonymizer_mapping.keys()) == 4
    assert "Jane Doe" in anonymizer.deanonymizer_mapping.get("PERSON", {}).values()
    assert (
        "jane@gmail.com"
        in anonymizer.deanonymizer_mapping.get("EMAIL_ADDRESS", {}).values()
    )
    assert (
        "4929 5319 6292 5362"
        in anonymizer.deanonymizer_mapping.get("CREDIT_CARD", {}).values()
    )


@pytest.mark.requires("presidio_analyzer", "presidio_anonymizer", "faker")
def test_deanonymize() -> None:
    """Test deanonymizing a name in a simple sentence"""
    from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer

    text = "Hello, my name is John Doe."
    anonymizer = PresidioReversibleAnonymizer(analyzed_fields=["PERSON"])
    anonymized_text = anonymizer.anonymize(text)
    deanonymized_text = anonymizer.deanonymize(anonymized_text)
    assert deanonymized_text == text


@pytest.mark.requires("presidio_analyzer", "presidio_anonymizer", "faker")
def test_save_load_deanonymizer_mapping() -> None:
    from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer

    anonymizer = PresidioReversibleAnonymizer(analyzed_fields=["PERSON"])
    anonymizer.anonymize("Hello, my name is John Doe.")
    try:
        anonymizer.save_deanonymizer_mapping("test_file.json")
        assert os.path.isfile("test_file.json")

        anonymizer = PresidioReversibleAnonymizer()
        anonymizer.load_deanonymizer_mapping("test_file.json")

        assert "John Doe" in anonymizer.deanonymizer_mapping.get("PERSON", {}).values()

    finally:
        os.remove("test_file.json")