Add multilingual data anon chain (#10346)

bagatur/konko
Bagatur 1 year ago committed by GitHub
commit 8826293c88
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -6,7 +6,7 @@
"source": [
"# Data anonymization with Microsoft Presidio\n",
"\n",
"[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/presidio_data_anonymization.ipynb)\n",
"[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/presidio_data_anonymization/index.ipynb)\n",
"\n",
"## Use case\n",
"\n",
@ -439,8 +439,6 @@
"metadata": {},
"source": [
"## Future works\n",
"\n",
"- **deanonymization** - add the ability to reverse anonymization. For example, the workflow could look like this: `anonymize -> LLMChain -> deanonymize`. By doing this, we will retain anonymity in requests to, for example, OpenAI, and then be able restore the original data.\n",
"- **instance anonymization** - at this point, each occurrence of PII is treated as a separate entity and separately anonymized. Therefore, two occurrences of the name John Doe in the text will be changed to two different names. It is therefore worth introducing support for full instance detection, so that repeated occurrences are treated as a single object."
]
}
@ -461,7 +459,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
"version": "3.9.1"
}
},
"nbformat": 4,

@ -0,0 +1,520 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Mutli-language data anonymization with Microsoft Presidio\n",
"\n",
"[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/presidio_data_anonymization/multi_language.ipynb)\n",
"\n",
"\n",
"## Use case\n",
"\n",
"Multi-language support in data pseudonymization is essential due to differences in language structures and cultural contexts. Different languages may have varying formats for personal identifiers. For example, the structure of names, locations and dates can differ greatly between languages and regions. Furthermore, non-alphanumeric characters, accents, and the direction of writing can impact pseudonymization processes. Without multi-language support, data could remain identifiable or be misinterpreted, compromising data privacy and accuracy. Hence, it enables effective and precise pseudonymization suited for global operations.\n",
"\n",
"## Overview\n",
"\n",
"PII detection in Microsoft Presidio relies on several components - in addition to the usual pattern matching (e.g. using regex), the analyser uses a model for Named Entity Recognition (NER) to extract entities such as:\n",
"- `PERSON`\n",
"- `LOCATION`\n",
"- `DATE_TIME`\n",
"- `NRP`\n",
"- `ORGANIZATION`\n",
"\n",
"[[Source]](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/spacy_recognizer.py)\n",
"\n",
"To handle NER in specific languages, we utilize unique models from the `spaCy` library, recognized for its extensive selection covering multiple languages and sizes. However, it's not restrictive, allowing for integration of alternative frameworks such as [Stanza](https://microsoft.github.io/presidio/analyzer/nlp_engines/spacy_stanza/) or [transformers](https://microsoft.github.io/presidio/analyzer/nlp_engines/transformers/) when necessary.\n",
"\n",
"\n",
"## Quickstart\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Install necessary packages\n",
"# ! pip install langchain langchain-experimental openai presidio-analyzer presidio-anonymizer spacy Faker\n",
"# ! python -m spacy download en_core_web_lg"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer\n",
"\n",
"anonymizer = PresidioReversibleAnonymizer(\n",
" analyzed_fields=[\"PERSON\"],\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, `PresidioAnonymizer` and `PresidioReversibleAnonymizer` use a model trained on English texts, so they handle other languages moderately well. \n",
"\n",
"For example, here the model did not detect the person:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Me llamo Sofía'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"anonymizer.anonymize(\"Me llamo Sofía\") # \"My name is Sofía\" in Spanish"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"They may also take words from another language as actual entities. Here, both the word *'Yo'* (*'I'* in Spanish) and *Sofía* have been classified as `PERSON`:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Bridget Kirk soy Sally Knight'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"anonymizer.anonymize(\"Yo soy Sofía\") # \"I am Sofía\" in Spanish"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to anonymise texts from other languages, you need to download other models and add them to the anonymiser configuration:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Download the models for the languages you want to use\n",
"# ! python -m spacy download en_core_web_md\n",
"# ! python -m spacy download es_core_news_md"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"nlp_config = {\n",
" \"nlp_engine_name\": \"spacy\",\n",
" \"models\": [\n",
" {\"lang_code\": \"en\", \"model_name\": \"en_core_web_md\"},\n",
" {\"lang_code\": \"es\", \"model_name\": \"es_core_news_md\"},\n",
" ],\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have therefore added a Spanish language model. Note also that we have downloaded an alternative model for English as well - in this case we have replaced the large model `en_core_web_lg` (560MB) with its smaller version `en_core_web_md` (40MB) - the size is therefore reduced by 14 times! If you care about the speed of anonymisation, it is worth considering it.\n",
"\n",
"All models for the different languages can be found in the [spaCy documentation](https://spacy.io/usage/models).\n",
"\n",
"Now pass the configuration as the `languages_config` parameter to Anonymiser. As you can see, both previous examples work flawlessly:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Me llamo Michelle Smith\n",
"Yo soy Rachel Wright\n"
]
}
],
"source": [
"anonymizer = PresidioReversibleAnonymizer(\n",
" analyzed_fields=[\"PERSON\"],\n",
" languages_config=nlp_config,\n",
")\n",
"\n",
"print(\n",
" anonymizer.anonymize(\"Me llamo Sofía\", language=\"es\")\n",
") # \"My name is Sofía\" in Spanish\n",
"print(anonymizer.anonymize(\"Yo soy Sofía\", language=\"es\")) # \"I am Sofía\" in Spanish"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, the language indicated first in the configuration will be used when anonymising text (in this case English):"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"My name is Ronnie Ayala\n"
]
}
],
"source": [
"print(anonymizer.anonymize(\"My name is John\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Advanced usage\n",
"\n",
"### Custom labels in NER model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It may be that the spaCy model has different class names than those supported by the Microsoft Presidio by default. Take Polish, for example:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Text: Wiktoria, Start: 12, End: 20, Label: persName\n"
]
}
],
"source": [
"# ! python -m spacy download pl_core_news_md\n",
"\n",
"import spacy\n",
"\n",
"nlp = spacy.load(\"pl_core_news_md\")\n",
"doc = nlp(\"Nazywam się Wiktoria\") # \"My name is Wiktoria\" in Polish\n",
"\n",
"for ent in doc.ents:\n",
" print(\n",
" f\"Text: {ent.text}, Start: {ent.start_char}, End: {ent.end_char}, Label: {ent.label_}\"\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The name *Victoria* was classified as `persName`, which does not correspond to the default class names `PERSON`/`PER` implemented in Microsoft Presidio (look for `CHECK_LABEL_GROUPS` in [SpacyRecognizer implementation](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/spacy_recognizer.py)). \n",
"\n",
"You can find out more about custom labels in spaCy models (including your own, trained ones) in [this thread](https://github.com/microsoft/presidio/issues/851).\n",
"\n",
"That's why our sentence will not be anonymized:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Nazywam się Wiktoria\n"
]
}
],
"source": [
"nlp_config = {\n",
" \"nlp_engine_name\": \"spacy\",\n",
" \"models\": [\n",
" {\"lang_code\": \"en\", \"model_name\": \"en_core_web_md\"},\n",
" {\"lang_code\": \"es\", \"model_name\": \"es_core_news_md\"},\n",
" {\"lang_code\": \"pl\", \"model_name\": \"pl_core_news_md\"},\n",
" ],\n",
"}\n",
"\n",
"anonymizer = PresidioReversibleAnonymizer(\n",
" analyzed_fields=[\"PERSON\", \"LOCATION\", \"DATE_TIME\"],\n",
" languages_config=nlp_config,\n",
")\n",
"\n",
"print(\n",
" anonymizer.anonymize(\"Nazywam się Wiktoria\", language=\"pl\")\n",
") # \"My name is Wiktoria\" in Polish"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To address this, create your own `SpacyRecognizer` with your own class mapping and add it to the anonymizer:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"from presidio_analyzer.predefined_recognizers import SpacyRecognizer\n",
"\n",
"polish_check_label_groups = [\n",
" ({\"LOCATION\"}, {\"placeName\", \"geogName\"}),\n",
" ({\"PERSON\"}, {\"persName\"}),\n",
" ({\"DATE_TIME\"}, {\"date\", \"time\"}),\n",
"]\n",
"\n",
"spacy_recognizer = SpacyRecognizer(\n",
" supported_language=\"pl\",\n",
" check_label_groups=polish_check_label_groups,\n",
")\n",
"\n",
"anonymizer.add_recognizer(spacy_recognizer)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now everything works smoothly:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Nazywam się Morgan Walters\n"
]
}
],
"source": [
"print(\n",
" anonymizer.anonymize(\"Nazywam się Wiktoria\", language=\"pl\")\n",
") # \"My name is Wiktoria\" in Polish"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try on more complex example:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Nazywam się Ernest Liu. New Taylorburgh to moje miasto rodzinne. Urodziłam się 1987-01-19\n"
]
}
],
"source": [
"print(\n",
" anonymizer.anonymize(\n",
" \"Nazywam się Wiktoria. Płock to moje miasto rodzinne. Urodziłam się dnia 6 kwietnia 2001 roku\",\n",
" language=\"pl\",\n",
" )\n",
") # \"My name is Wiktoria. Płock is my home town. I was born on 6 April 2001\" in Polish"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, thanks to class mapping, the anonymiser can cope with different types of entities. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Custom language-specific operators\n",
"\n",
"In the example above, the sentence has been anonymised correctly, but the fake data does not fit the Polish language at all. Custom operators can therefore be added, which will resolve the issue:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"from faker import Faker\n",
"from presidio_anonymizer.entities import OperatorConfig\n",
"\n",
"fake = Faker(locale=\"pl_PL\") # Setting faker to provide Polish data\n",
"\n",
"new_operators = {\n",
" \"PERSON\": OperatorConfig(\"custom\", {\"lambda\": lambda _: fake.first_name_female()}),\n",
" \"LOCATION\": OperatorConfig(\"custom\", {\"lambda\": lambda _: fake.city()}),\n",
"}\n",
"\n",
"anonymizer.add_operators(new_operators)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Nazywam się Marianna. Szczecin to moje miasto rodzinne. Urodziłam się 1976-11-16\n"
]
}
],
"source": [
"print(\n",
" anonymizer.anonymize(\n",
" \"Nazywam się Wiktoria. Płock to moje miasto rodzinne. Urodziłam się dnia 6 kwietnia 2001 roku\",\n",
" language=\"pl\",\n",
" )\n",
") # \"My name is Wiktoria. Płock is my home town. I was born on 6 April 2001\" in Polish"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Limitations\n",
"\n",
"Remember - results are as good as your recognizers and as your NER models!\n",
"\n",
"Look at the example below - we downloaded the small model for Spanish (12MB) and it no longer performs as well as the medium version (40MB):"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model: es_core_news_sm. Result: Me llamo Sofía\n",
"Model: es_core_news_md. Result: Me llamo Lawrence Davis\n"
]
}
],
"source": [
"# ! python -m spacy download es_core_news_sm\n",
"\n",
"for model in [\"es_core_news_sm\", \"es_core_news_md\"]:\n",
" nlp_config = {\n",
" \"nlp_engine_name\": \"spacy\",\n",
" \"models\": [\n",
" {\"lang_code\": \"es\", \"model_name\": model},\n",
" ],\n",
" }\n",
"\n",
" anonymizer = PresidioReversibleAnonymizer(\n",
" analyzed_fields=[\"PERSON\"],\n",
" languages_config=nlp_config,\n",
" )\n",
"\n",
" print(\n",
" f\"Model: {model}. Result: {anonymizer.anonymize('Me llamo Sofía', language='es')}\"\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In many cases, even the larger models from spaCy will not be sufficient - there are already other, more complex and better methods of detecting named entities, based on transformers. You can read more about this [here](https://microsoft.github.io/presidio/analyzer/nlp_engines/transformers/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Future works\n",
"\n",
"- **automatic language detection** - instead of passing the language as a parameter in `anonymizer.anonymize`, we could detect the language/s beforehand and then use the corresponding NER model."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

@ -6,7 +6,7 @@
"source": [
"# Reversible data anonymization with Microsoft Presidio\n",
"\n",
"[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/presidio_reversible_anonymization.ipynb)\n",
"[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/presidio_data_anonymization/reversible.ipynb)\n",
"\n",
"\n",
"## Use case\n",
@ -453,7 +453,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
"version": "3.9.1"
}
},
"nbformat": 4,

@ -1,4 +1,5 @@
from abc import ABC, abstractmethod
from typing import Optional
class AnonymizerBase(ABC):
@ -8,12 +9,12 @@ class AnonymizerBase(ABC):
wrapping the behavior for all methods in a base class.
"""
def anonymize(self, text: str) -> str:
def anonymize(self, text: str, language: Optional[str] = None) -> str:
"""Anonymize text"""
return self._anonymize(text)
return self._anonymize(text, language)
@abstractmethod
def _anonymize(self, text: str) -> str:
def _anonymize(self, text: str, language: Optional[str]) -> str:
"""Abstract method to anonymize text"""

@ -27,8 +27,8 @@ def get_pseudoanonymizer_mapping(seed: Optional[int] = None) -> Dict[str, Callab
fake.random_choices(string.ascii_lowercase + string.digits, length=26)
),
"IP_ADDRESS": lambda _: fake.ipv4_public(),
"LOCATION": lambda _: fake.address(),
"DATE_TIME": lambda _: fake.iso8601(),
"LOCATION": lambda _: fake.city(),
"DATE_TIME": lambda _: fake.date(),
"NRP": lambda _: str(fake.random_number(digits=8, fix_len=True)),
"MEDICAL_LICENSE": lambda _: fake.bothify(text="??######").upper(),
"URL": lambda _: fake.url(),

@ -24,6 +24,8 @@ from langchain_experimental.data_anonymizer.faker_presidio_mapping import (
try:
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider
except ImportError as e:
raise ImportError(
"Could not import presidio_analyzer, please install with "
@ -44,12 +46,29 @@ if TYPE_CHECKING:
from presidio_analyzer import EntityRecognizer, RecognizerResult
from presidio_anonymizer.entities import EngineResult
# Configuring Anonymizer for multiple languages
# Detailed description and examples can be found here:
# langchain/docs/extras/guides/privacy/multi_language_anonymization.ipynb
DEFAULT_LANGUAGES_CONFIG = {
# You can also use Stanza or transformers library.
# See https://microsoft.github.io/presidio/analyzer/customizing_nlp_models/
"nlp_engine_name": "spacy",
"models": [
{"lang_code": "en", "model_name": "en_core_web_lg"},
# {"lang_code": "de", "model_name": "de_core_news_md"},
# {"lang_code": "es", "model_name": "es_core_news_md"},
# ...
# List of available models: https://spacy.io/usage/models
],
}
class PresidioAnonymizerBase(AnonymizerBase):
def __init__(
self,
analyzed_fields: Optional[List[str]] = None,
operators: Optional[Dict[str, OperatorConfig]] = None,
languages_config: Dict = DEFAULT_LANGUAGES_CONFIG,
faker_seed: Optional[int] = None,
):
"""
@ -60,6 +79,11 @@ class PresidioAnonymizerBase(AnonymizerBase):
Operators allow for custom anonymization of detected PII.
Learn more:
https://microsoft.github.io/presidio/tutorial/10_simple_anonymization/
languages_config: Configuration for the NLP engine.
First language in the list will be used as the main language
in self.anonymize(...) when no language is specified.
Learn more:
https://microsoft.github.io/presidio/analyzer/customizing_nlp_models/
faker_seed: Seed used to initialize faker.
Defaults to None, in which case faker will be seeded randomly
and provide random values.
@ -81,7 +105,15 @@ class PresidioAnonymizerBase(AnonymizerBase):
).items()
}
)
self._analyzer = AnalyzerEngine()
provider = NlpEngineProvider(nlp_configuration=languages_config)
nlp_engine = provider.create_engine()
self.supported_languages = list(nlp_engine.nlp.keys())
self._analyzer = AnalyzerEngine(
supported_languages=self.supported_languages, nlp_engine=nlp_engine
)
self._anonymizer = AnonymizerEngine()
def add_recognizer(self, recognizer: EntityRecognizer) -> None:
@ -103,18 +135,31 @@ class PresidioAnonymizerBase(AnonymizerBase):
class PresidioAnonymizer(PresidioAnonymizerBase):
def _anonymize(self, text: str) -> str:
def _anonymize(self, text: str, language: Optional[str] = None) -> str:
"""Anonymize text.
Each PII entity is replaced with a fake value.
Each time fake values will be different, as they are generated randomly.
Args:
text: text to anonymize
language: language to use for analysis of PII
If None, the first (main) language in the list
of languages specified in the configuration will be used.
"""
if language is None:
language = self.supported_languages[0]
if language not in self.supported_languages:
raise ValueError(
f"Language '{language}' is not supported. "
f"Supported languages are: {self.supported_languages}. "
"Change your language configuration file to add more languages."
)
results = self._analyzer.analyze(
text,
entities=self.analyzed_fields,
language="en",
language=language,
)
return self._anonymizer.anonymize(
@ -129,9 +174,10 @@ class PresidioReversibleAnonymizer(PresidioAnonymizerBase, ReversibleAnonymizerB
self,
analyzed_fields: Optional[List[str]] = None,
operators: Optional[Dict[str, OperatorConfig]] = None,
languages_config: Dict = DEFAULT_LANGUAGES_CONFIG,
faker_seed: Optional[int] = None,
):
super().__init__(analyzed_fields, operators, faker_seed)
super().__init__(analyzed_fields, operators, languages_config, faker_seed)
self._deanonymizer_mapping = DeanonymizerMapping()
@property
@ -191,7 +237,7 @@ class PresidioReversibleAnonymizer(PresidioAnonymizerBase, ReversibleAnonymizerB
self._deanonymizer_mapping.update(new_deanonymizer_mapping)
def _anonymize(self, text: str) -> str:
def _anonymize(self, text: str, language: Optional[str] = None) -> str:
"""Anonymize text.
Each PII entity is replaced with a fake value.
Each time fake values will be different, as they are generated randomly.
@ -200,11 +246,24 @@ class PresidioReversibleAnonymizer(PresidioAnonymizerBase, ReversibleAnonymizerB
Args:
text: text to anonymize
language: language to use for analysis of PII
If None, the first (main) language in the list
of languages specified in the configuration will be used.
"""
if language is None:
language = self.supported_languages[0]
if language not in self.supported_languages:
raise ValueError(
f"Language '{language}' is not supported. "
f"Supported languages are: {self.supported_languages}. "
"Change your language configuration file to add more languages."
)
analyzer_results = self._analyzer.analyze(
text,
entities=self.analyzed_fields,
language="en",
language=language,
)
filtered_analyzer_results = (

Loading…
Cancel
Save