langchain

mirror of https://github.com/hwchase17/langchain synced 2024-11-18 09:25:54 +00:00

Author	SHA1	Message	Date
Suresh Kumar Ponnusamy	70f7558db2	langchain-experimental: Add allow_list support in experimental/data_anonymizer (#11597 ) - Description: Add allow_list support in langchain experimental data-anonymizer package - Issue: no - Dependencies: no - Tag maintainer: @hwchase17 - Twitter handle:	2023-10-11 14:50:41 -07:00
maks-operlejn-ds	4d62def9ff	Better deanonymizer matching strategy (#11557 ) @baskaryan, @hwchase17	2023-10-09 11:10:29 -07:00
Bagatur	8fafa1af91	merge	2023-10-05 18:09:35 -07:00
maks-operlejn-ds	2aae1102b0	Instance anonymization (#10501 ) ### Description Add instance anonymization - if `John Doe` will appear twice in the text, it will be treated as the same entity. The difference between `PresidioAnonymizer` and `PresidioReversibleAnonymizer` is that only the second one has a built-in memory, so it will remember anonymization mapping for multiple texts: ``` >>> anonymizer = PresidioAnonymizer() >>> anonymizer.anonymize("My name is John Doe. Hi John Doe!") 'My name is Noah Rhodes. Hi Noah Rhodes!' >>> anonymizer.anonymize("My name is John Doe. Hi John Doe!") 'My name is Brett Russell. Hi Brett Russell!' ``` ``` >>> anonymizer = PresidioReversibleAnonymizer() >>> anonymizer.anonymize("My name is John Doe. Hi John Doe!") 'My name is Noah Rhodes. Hi Noah Rhodes!' >>> anonymizer.anonymize("My name is John Doe. Hi John Doe!") 'My name is Noah Rhodes. Hi Noah Rhodes!' ``` ### Twitter handle @deepsense_ai / @MaksOpp ### Tag maintainer @baskaryan @hwchase17 @hinthornw --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-10-05 11:23:02 -07:00
olgavrou	30d02e3a34	fix linting	2023-09-11 13:36:01 -04:00
olgavrou	42d0d485a9	black formatting	2023-09-11 13:33:43 -04:00
olgavrou	7185fdc990	check if libcublas is available before running extended tests	2023-09-11 13:26:41 -04:00
maks-operlejn-ds	a8f804a618	Add data anonymizer (#9863 ) ### Description The feature for anonymizing data has been implemented. In order to protect private data, such as when querying external APIs (OpenAI), it is worth pseudonymizing sensitive data to maintain full privacy. Anonynization consists of two steps: 1. Identification: Identify all data fields that contain personally identifiable information (PII). 2. Replacement: Replace all PIIs with pseudo values or codes that do not reveal any personal information about the individual but can be used for reference. We're not using regular encryption, because the language model won't be able to understand the meaning or context of the encrypted data. We use Microsoft Presidio together with Faker framework for anonymization purposes because of the wide range of functionalities they provide. The full implementation is available in `PresidioAnonymizer`. ### Future works - deanonymization - add the ability to reverse anonymization. For example, the workflow could look like this: `anonymize -> LLMChain -> deanonymize`. By doing this, we will retain anonymity in requests to, for example, OpenAI, and then be able restore the original data. - instance anonymization - at this point, each occurrence of PII is treated as a separate entity and separately anonymized. Therefore, two occurrences of the name John Doe in the text will be changed to two different names. It is therefore worth introducing support for full instance detection, so that repeated occurrences are treated as a single object. ### Twitter handle @deepsense_ai / @MaksOpp --------- Co-authored-by: MaksOpp <maks.operlejn@gmail.com> Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-08-30 10:39:44 -07:00

8 Commits