mirror of
https://github.com/hwchase17/langchain
synced 2024-10-29 17:07:25 +00:00
274c3dc3a8
### Description

Add multiple language support to Anonymizer.

PII detection in Microsoft Presidio relies on several components. In addition to the usual pattern matching (e.g. using regex), the analyser uses a model for Named Entity Recognition (NER) to extract entities such as:

- `PERSON`
- `LOCATION`
- `DATE_TIME`
- `NRP`
- `ORGANIZATION`

[[Source]](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/spacy_recognizer.py)

To handle NER in specific languages, we use dedicated models from the `spaCy` library, which offers models for many languages and in several sizes. spaCy is not a hard requirement: alternative frameworks such as [Stanza](https://microsoft.github.io/presidio/analyzer/nlp_engines/spacy_stanza/) or [transformers](https://microsoft.github.io/presidio/analyzer/nlp_engines/transformers/) can be integrated when necessary.

### Future works

- **automatic language detection** - instead of passing the language as a parameter in `anonymizer.anonymize`, we could detect the language(s) beforehand and then use the corresponding NER model. We have discussed this internally and @mateusz-wosinski-ds will look into a standalone language detection tool/chain for LangChain 😄

### Twitter handle

@deepsense_ai / @MaksOpp

### Tag maintainer

@baskaryan @hwchase17 @hinthornw
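The idea of choosing a recognizer per language can be pictured with a minimal, self-contained sketch. It deliberately does not use Presidio or spaCy: the `RECOGNIZERS` table, the toy regex patterns, and the `detect_entities` helper are all hypothetical stand-ins for real per-language NER models, and only show the dispatch on the `language` argument.

```python
import re
from typing import Callable, Dict, List, Tuple

Entity = Tuple[str, str]  # (entity_type, matched_text)


def _english_recognizer(text: str) -> List[Entity]:
    # Toy rule (assumption): an English honorific followed by a capitalized word.
    pattern = r"\b(?:Mr|Ms|Mrs)\. [A-Z][a-z]+"
    return [("PERSON", m.group(0)) for m in re.finditer(pattern, text)]


def _polish_recognizer(text: str) -> List[Entity]:
    # Toy rule (assumption): a Polish honorific followed by a capitalized word.
    pattern = r"\b(?:Pan|Pani) [A-ZĆŁŚŹŻ][a-ząćęłńóśźż]+"
    return [("PERSON", m.group(0)) for m in re.finditer(pattern, text)]


# Hypothetical registry: language code -> recognizer for that language.
RECOGNIZERS: Dict[str, Callable[[str], List[Entity]]] = {
    "en": _english_recognizer,
    "pl": _polish_recognizer,
}


def detect_entities(text: str, language: str = "en") -> List[Entity]:
    """Run the recognizer registered for `language`, or fail loudly."""
    try:
        recognizer = RECOGNIZERS[language]
    except KeyError:
        raise ValueError(
            f"No recognizer configured for language {language!r}"
        ) from None
    return recognizer(text)
```

Failing on an unconfigured language, rather than silently falling back to English, is one possible design choice; a real setup might prefer a default model instead.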
33 lines
905 B
Python
from abc import ABC, abstractmethod
from typing import Optional


class AnonymizerBase(ABC):
    """
    Base abstract class for anonymizers.
    It is public and non-virtual because it allows
        wrapping the behavior for all methods in a base class.
    """

    def anonymize(self, text: str, language: Optional[str] = None) -> str:
        """Anonymize text"""
        return self._anonymize(text, language)

    @abstractmethod
    def _anonymize(self, text: str, language: Optional[str]) -> str:
        """Abstract method to anonymize text"""


class ReversibleAnonymizerBase(AnonymizerBase):
    """
    Base abstract class for reversible anonymizers.
    """

    def deanonymize(self, text: str) -> str:
        """Deanonymize text"""
        return self._deanonymize(text)

    @abstractmethod
    def _deanonymize(self, text: str) -> str:
        """Abstract method to deanonymize text"""
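To illustrate how these base classes are meant to be subclassed, here is a minimal sketch of a concrete reversible anonymizer. The `EmailPlaceholderAnonymizer` class, its simplified email regex, and the `<EMAIL_n>` placeholder format are hypothetical and not part of this file; the two base classes are repeated in condensed form only so the example runs on its own.

```python
import re
from abc import ABC, abstractmethod
from typing import Dict, Optional


# Condensed copies of the base classes above, so the sketch is self-contained.
class AnonymizerBase(ABC):
    def anonymize(self, text: str, language: Optional[str] = None) -> str:
        return self._anonymize(text, language)

    @abstractmethod
    def _anonymize(self, text: str, language: Optional[str]) -> str:
        """Abstract method to anonymize text"""


class ReversibleAnonymizerBase(AnonymizerBase):
    def deanonymize(self, text: str) -> str:
        return self._deanonymize(text)

    @abstractmethod
    def _deanonymize(self, text: str) -> str:
        """Abstract method to deanonymize text"""


class EmailPlaceholderAnonymizer(ReversibleAnonymizerBase):
    """Hypothetical example: swaps email addresses for numbered placeholders."""

    # Deliberately simplified pattern (assumption), not a full email grammar.
    _EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def __init__(self) -> None:
        self._mapping: Dict[str, str] = {}  # placeholder -> original value

    def _anonymize(self, text: str, language: Optional[str]) -> str:
        # `language` is ignored here; a real implementation could use it
        # to select a language-specific NER model.
        def _replace(match: "re.Match[str]") -> str:
            original = match.group(0)
            # Reuse the placeholder if this address was already seen.
            for placeholder, value in self._mapping.items():
                if value == original:
                    return placeholder
            placeholder = f"<EMAIL_{len(self._mapping) + 1}>"
            self._mapping[placeholder] = original
            return placeholder

        return self._EMAIL.sub(_replace, text)

    def _deanonymize(self, text: str) -> str:
        # Restore every known placeholder to its original value.
        for placeholder, original in self._mapping.items():
            text = text.replace(placeholder, original)
        return text
```

Callers only use the public `anonymize` and `deanonymize` methods; subclasses implement the underscored hooks, which leaves room for the base class to wrap shared behavior around every implementation.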