langchain/libs/experimental/langchain_experimental/data_anonymizer/faker_presidio_mapping.py
maks-operlejn-ds 274c3dc3a8
Multilingual anonymization (#10327)
### Description

Add multi-language support to the anonymizer.

PII detection in Microsoft Presidio relies on several components. In
addition to the usual pattern matching (e.g. using regex), the analyzer
uses a Named Entity Recognition (NER) model to extract entities such
as:
- `PERSON`
- `LOCATION`
- `DATE_TIME`
- `NRP`
- `ORGANIZATION`


[[Source]](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/spacy_recognizer.py)

To handle NER in specific languages, we use language-specific models
from the `spaCy` library, which offers a wide selection of models across
languages and sizes. The approach is not tied to spaCy, however:
alternative frameworks such as
[Stanza](https://microsoft.github.io/presidio/analyzer/nlp_engines/spacy_stanza/)
or
[transformers](https://microsoft.github.io/presidio/analyzer/nlp_engines/transformers/)
can be integrated when necessary.
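
As a rough illustration of the per-language model setup described above, the sketch below builds an NLP configuration dictionary in the shape Presidio's `NlpEngineProvider` accepts (`nlp_engine_name` plus a list of `lang_code`/`model_name` pairs). The helper `build_nlp_config` and the chosen model names are assumptions made for illustration, not part of this PR, and the spaCy models themselves would need to be downloaded separately.

```python
from typing import Dict, List, Union

NlpConfig = Dict[str, Union[str, List[Dict[str, str]]]]


def build_nlp_config(
    models_by_language: Dict[str, str], engine: str = "spacy"
) -> NlpConfig:
    """Map language codes to NER model names.

    Hypothetical helper: only the shape of the returned dictionary
    follows Presidio's NLP engine configuration.
    """
    return {
        "nlp_engine_name": engine,
        "models": [
            {"lang_code": lang, "model_name": model}
            for lang, model in models_by_language.items()
        ],
    }


# The model names below are assumptions; any installed spaCy model works.
nlp_config = build_nlp_config({"en": "en_core_web_md", "es": "es_core_news_md"})

if __name__ == "__main__":
    print(nlp_config)
```

With Presidio installed, a dictionary like this can be passed to `NlpEngineProvider(nlp_configuration=nlp_config).create_engine()`, and the resulting engine handed to `AnalyzerEngine` together with a matching `supported_languages` list.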

### Future work

- **automatic language detection** - instead of passing the language as
a parameter to `anonymizer.anonymize`, we could detect the language(s)
beforehand and then use the corresponding NER model. We have discussed
this internally, and @mateusz-wosinski-ds will look into a standalone
language detection tool/chain for LangChain 😄
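
The detection idea above could be sketched roughly as follows. Everything here is hypothetical: `detect` stands in for whatever language detection tool is eventually chosen, `anonymize` stands in for a call such as `anonymizer.anonymize(text, language=...)`, and falling back to a default language is only one possible design choice.

```python
from typing import Callable, Iterable


def anonymize_with_detection(
    text: str,
    detect: Callable[[str], str],
    anonymize: Callable[[str, str], str],
    supported_languages: Iterable[str],
    default_language: str = "en",
) -> str:
    """Detect the language first, then anonymize with the matching NER model."""
    language = detect(text)
    # Fall back to the default when no NER model is configured for the language.
    if language not in set(supported_languages):
        language = default_language
    return anonymize(text, language)


# Stand-in callables (assumptions) so the sketch runs without external tools.
def toy_detect(text: str) -> str:
    return "es" if "hola" in text.lower() else "en"


def toy_anonymize(text: str, language: str) -> str:
    return f"[{language}] {text}"


sample = "Hola, me llamo Sofia"
result_es = anonymize_with_detection(sample, toy_detect, toy_anonymize, ["en", "es"])
result_fallback = anonymize_with_detection(sample, toy_detect, toy_anonymize, ["en"])
```

Passing the detector and anonymizer in as callables keeps the dispatch logic independent of any particular detection library, which seems useful while that choice is still open.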

### Twitter handle
@deepsense_ai / @MaksOpp

### Tag maintainer
@baskaryan @hwchase17 @hinthornw
2023-09-07 14:42:24 -07:00


import string
from typing import Callable, Dict, Optional


def get_pseudoanonymizer_mapping(seed: Optional[int] = None) -> Dict[str, Callable]:
    try:
        from faker import Faker
    except ImportError as e:
        raise ImportError(
            "Could not import faker, please install it with `pip install Faker`."
        ) from e

    fake = Faker()
    fake.seed_instance(seed)

    # Listed entities supported by Microsoft Presidio (for now, global and US only)
    # Source: https://microsoft.github.io/presidio/supported_entities/
    return {
        # Global entities
        "PERSON": lambda _: fake.name(),
        "EMAIL_ADDRESS": lambda _: fake.email(),
        "PHONE_NUMBER": lambda _: fake.phone_number(),
        "IBAN_CODE": lambda _: fake.iban(),
        "CREDIT_CARD": lambda _: fake.credit_card_number(),
        "CRYPTO": lambda _: "bc1"
        + "".join(
            fake.random_choices(string.ascii_lowercase + string.digits, length=26)
        ),
        "IP_ADDRESS": lambda _: fake.ipv4_public(),
        "LOCATION": lambda _: fake.city(),
        "DATE_TIME": lambda _: fake.date(),
        "NRP": lambda _: str(fake.random_number(digits=8, fix_len=True)),
        "MEDICAL_LICENSE": lambda _: fake.bothify(text="??######").upper(),
        "URL": lambda _: fake.url(),
        # US-specific entities
        "US_BANK_NUMBER": lambda _: fake.bban(),
        "US_DRIVER_LICENSE": lambda _: str(fake.random_number(digits=9, fix_len=True)),
        "US_ITIN": lambda _: fake.bothify(text="9##-7#-####"),
        "US_PASSPORT": lambda _: fake.bothify(text="#####??").upper(),
        "US_SSN": lambda _: fake.ssn(),
    }
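
To show how a mapping of this shape might be consumed, here is a minimal sketch that replaces detected spans with the output of the matching callable. The `Detection` tuple, the `replace_entities` and `find_span` helpers, and the stand-in mapping are all assumptions for illustration; in practice the spans would come from the Presidio analyzer and the callables from `get_pseudoanonymizer_mapping`.

```python
from typing import Callable, Dict, List, NamedTuple


class Detection(NamedTuple):
    """Hypothetical detected entity: its type and character offsets."""

    entity_type: str
    start: int
    end: int


def replace_entities(
    text: str, detections: List[Detection], mapping: Dict[str, Callable]
) -> str:
    # Work right to left so earlier offsets stay valid after each replacement.
    for det in sorted(detections, key=lambda d: d.start, reverse=True):
        operator = mapping.get(det.entity_type)
        if operator is None:
            continue  # leave entity types without a generator untouched
        original = text[det.start : det.end]
        text = text[: det.start] + operator(original) + text[det.end :]
    return text


def find_span(text: str, entity_type: str, value: str) -> Detection:
    # Toy substitute for a real analyzer: locate a known value in the text.
    start = text.index(value)
    return Detection(entity_type, start, start + len(value))


# Stand-in mapping with the same Dict[str, Callable] shape as the real one;
# each callable ignores its argument, like the lambdas above.
stub_mapping: Dict[str, Callable] = {
    "PERSON": lambda _: "Jane Doe",
    "EMAIL_ADDRESS": lambda _: "jane@example.com",
}

text = "Contact John Smith at john@corp.io"
detections = [
    find_span(text, "PERSON", "John Smith"),
    find_span(text, "EMAIL_ADDRESS", "john@corp.io"),
]
anonymized = replace_entities(text, detections, stub_mapping)
```

Fixed replacement values are used here so the result is deterministic; with the Faker-backed mapping, passing a `seed` serves the same purpose.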