langchain/libs/experimental/langchain_experimental/data_anonymizer/deanonymizer_mapping.py

from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict

# Maps an entity type (e.g. "PERSON") to {anonymized value: original value}.
MappingDataType = Dict[str, Dict[str, str]]


@dataclass
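class DeanonymizerMapping:
    """Store anonymized-to-original value pairs, grouped by entity type."""

    # Nested defaultdicts let update() add a new entity type without setup.
    mapping: MappingDataType = field(
        default_factory=lambda: defaultdict(lambda: defaultdict(str))
    )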

    @property
    def data(self) -> MappingDataType:
        """Return the deanonymizer mapping as plain nested dicts."""
        return {k: dict(v) for k, v in self.mapping.items()}

    def update(self, new_mapping: MappingDataType) -> None:
        """Merge new_mapping into the stored mapping, per entity type."""
        for entity_type, values in new_mapping.items():
            self.mapping[entity_type].update(values)
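A minimal usage sketch of the class above, assuming it is used as a plain in-memory store. The class is repeated so the snippet runs on its own, and the names and numbers are made-up example values. It shows that `update` merges entries per entity type, that `data` returns plain `dict` copies, and that reading a missing key from the underlying `defaultdict` creates an empty entry as a side effect.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict

MappingDataType = Dict[str, Dict[str, str]]


@dataclass
class DeanonymizerMapping:
    # Copy of the class above, repeated so this snippet is self-contained.
    mapping: MappingDataType = field(
        default_factory=lambda: defaultdict(lambda: defaultdict(str))
    )

    @property
    def data(self) -> MappingDataType:
        return {k: dict(v) for k, v in self.mapping.items()}

    def update(self, new_mapping: MappingDataType) -> None:
        for entity_type, values in new_mapping.items():
            self.mapping[entity_type].update(values)


store = DeanonymizerMapping()

# Two separate updates: entries for the same entity type are merged.
store.update({"PERSON": {"John Doe": "Slim Shady"}})
store.update(
    {
        "PERSON": {"Jane Roe": "Marshall Mathers"},
        "PHONE_NUMBER": {"111-111-1111": "555-555-5555"},
    }
)

# `data` builds new plain dicts, so the snapshot is detached from the store.
snapshot = store.data

# Caveat: merely reading a missing entity type on the defaultdict inserts it.
_ = store.mapping["EMAIL_ADDRESS"]
after_read = store.data  # now also contains an empty "EMAIL_ADDRESS" entry
```

Because `data` converts every inner mapping with `dict(v)`, callers can serialize or modify the snapshot without touching the stored mapping.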
Data deanonymization (#10093)

### Description

The feature for pseudonymizing data with the ability to retrieve the original text (deanonymization) has been implemented. To protect private data, for example when querying external APIs (OpenAI), it is worth pseudonymizing sensitive data to maintain full privacy. After the model responds, however, it is useful to have the data back in its original form.

I implemented the `PresidioReversibleAnonymizer`, which consists of two parts:

1. anonymization - it works the same way as `PresidioAnonymizer`, plus the object itself stores a mapping of made-up values to original ones, for example:

```
{
    "PERSON": {
        "<anonymized>": "<original>",
        "John Doe": "Slim Shady"
    },
    "PHONE_NUMBER": {
        "111-111-1111": "555-555-5555"
    }
    ...
}
```

2. deanonymization - using the mapping described above, it matches fake data with original data and then substitutes it.

Between anonymization and deanonymization the user can perform other operations, for example passing the output to an LLM.

### Future work

- instance anonymization - at this point, each occurrence of PII is treated as a separate entity and anonymized separately. As a result, two occurrences of the name John Doe in the text will be changed to two different names. It is therefore worth introducing support for full instance detection, so that repeated occurrences are treated as a single object.
- better matching and substitution of fake values for real ones - currently the strategy is based on matching full strings and then substituting them. Because language models are nondeterministic, the value in the answer may be slightly changed (e.g. John Doe -> John or Main St, New York -> New York), and such a substitution is then no longer possible. It is therefore worth adjusting the matching to your needs.
- Q&A with anonymization - once all the functionality is written, a notebook about retrieval from documents using anonymization would be a useful documentation resource: an iterative process of adding new recognizers to fit the data, lessons learned, and what to look out for.

### Twitter handle

@deepsense_ai / @MaksOpp

---------

Co-authored-by: MaksOpp <maks.operlejn@gmail.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
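The deanonymization step described in the commit message (match full fake strings, then substitute the originals) can be sketched roughly as follows. `deanonymize_exact` is a hypothetical helper written for illustration; it is not the `PresidioReversibleAnonymizer` implementation, and replacing longer values first is an assumption added here to avoid partial overlaps. The second call illustrates the limitation noted under future work: a shortened value such as "John" no longer matches.

```python
from typing import Dict

MappingDataType = Dict[str, Dict[str, str]]


def deanonymize_exact(text: str, mapping: MappingDataType) -> str:
    """Hypothetical helper: swap each fake value back for its original."""
    # Flatten {entity_type: {fake: original}} into a list of (fake, original).
    pairs = [
        (fake, original)
        for values in mapping.values()
        for fake, original in values.items()
    ]
    # Assumption: handle longer fake values first so a short value cannot
    # overwrite part of a longer one.
    for fake, original in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        text = text.replace(fake, original)
    return text


mapping = {
    "PERSON": {"John Doe": "Slim Shady"},
    "PHONE_NUMBER": {"111-111-1111": "555-555-5555"},
}

# Full-string matches are restored.
restored = deanonymize_exact("Call John Doe at 111-111-1111.", mapping)

# A value the model shortened ("John") is not in the mapping, so the text
# comes back unchanged.
partial = deanonymize_exact("John will call back.", mapping)
```

Sequential `str.replace` calls are the simplest form of exact matching. A more tolerant strategy, such as fuzzy matching, would need its own handling of false positives.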