Add new types of document transformers (#7379)

- Description: Add two new document transformers that translate
documents into different languages and convert documents into Q&A
format to improve vector search results. Uses OpenAI function calling
via the [doctran](https://github.com/psychic-api/doctran/tree/main)
library.
  - Issue: N/A
  - Dependencies: `doctran = "^0.0.5"`
  - Tag maintainer: @rlancemartin @eyurtsev @hwchase17 
  - Twitter handle: @psychicapi or @jfan001

Notes
- Adheres to the `DocumentTransformer` abstraction set by @dev2049 in
#3182
- Refactored `EmbeddingsRedundantFilter` into its own file under a new
`document_transformers` module
- Added basic docs for `DocumentInterrogator`, `DocumentTransformer` as
well as the existing `EmbeddingsRedundantFilter`
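
A rough sketch of the transformer shape referred to above, assuming it exposes a synchronous `transform_documents` method alongside the async `atransform_documents` that the notebooks in this commit call. The `Document` dataclass and `UppercaseTransformer` below are hypothetical stand-ins written for illustration; they are not the library's classes.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence


@dataclass
class Document:
    """Simplified stand-in for a document (assumption, not the library class)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class UppercaseTransformer:
    """Toy transformer; the name and behavior are illustrative only."""

    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> List[Document]:
        # Build new documents instead of mutating the inputs.
        return [
            Document(doc.page_content.upper(), dict(doc.metadata))
            for doc in documents
        ]

    async def atransform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> List[Document]:
        # Async counterpart; this sketch simply delegates to the sync method.
        return self.transform_documents(documents, **kwargs)


docs = [Document("hello", {"source": "demo"})]
result = asyncio.run(UppercaseTransformer().atransform_documents(docs))
print(result[0].page_content)
```

Returning fresh documents keeps the inputs unchanged, so the same list can be passed through several transformers in turn.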

---------

Co-authored-by: Lance Martin <lance@langchain.dev>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Jason Fan 2023-07-12 20:53:30 -07:00 committed by GitHub
parent f11d845dee
commit 8effd90be0
GPG Key ID: 4AEE18F83AFDEB23
17 changed files with 985 additions and 6 deletions

@ -24,7 +24,7 @@ That means there are two different axes along which you can customize your text
 1. How the text is split
 2. How the chunk size is measured
-## Get started with text splitters
+### Get started with text splitters
 import GetStarted from "@snippets/modules/data_connection/document_transformers/get_started.mdx"

@ -1 +1,2 @@
 label: 'Text splitters'
+position: 0

@ -8,7 +8,7 @@ Many LLM applications require user-specific data that is not part of the model's
 building blocks to load, transform, store and query your data via:
 - [Document loaders](/docs/modules/data_connection/document_loaders/): Load documents from many different sources
-- [Document transformers](/docs/modules/data_connection/document_transformers/): Split documents, drop redundant documents, and more
+- [Document transformers](/docs/modules/data_connection/document_transformers/): Split documents, convert documents into Q&A format, drop redundant documents, and more
 - [Text embedding models](/docs/modules/data_connection/text_embedding/): Take unstructured text and turn it into a list of floating point numbers
 - [Vector stores](/docs/modules/data_connection/vectorstores/): Store and search over embedded data
 - [Retrievers](/docs/modules/data_connection/retrievers/): Query your data

@ -0,0 +1 @@
label: 'Integrations'

@ -0,0 +1,269 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Doctran Extract Properties\n",
"\n",
"We can extract useful features of documents using the [Doctran](https://github.com/psychic-api/doctran) library, which uses OpenAI's function calling feature to extract specific metadata.\n",
"\n",
"Extracting metadata from documents is helpful for a variety of tasks, including:\n",
"* Classification: Classify documents into different categories\n",
"* Data mining: Extract structured data that can be used for data analysis\n",
"* Style transfer: Change the way text is written to more closely match expected user input, improving vector search results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install doctran"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"import json\n",
"from langchain.schema import Document\n",
"from langchain.document_transformers import DoctranPropertyExtractor"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from dotenv import load_dotenv\n",
"\n",
"load_dotenv()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Input\n",
"This is the document we'll extract properties from."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Generated with ChatGPT]\n",
"\n",
"Confidential Document - For Internal Use Only\n",
"\n",
"Date: July 1, 2023\n",
"\n",
"Subject: Updates and Discussions on Various Topics\n",
"\n",
"Dear Team,\n",
"\n",
"I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.\n",
"\n",
"Security and Privacy Measures\n",
"As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.\n",
"\n",
"HR Updates and Employee Benefits\n",
"Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: 049-45-5928) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: 418-492-3850, email: michael.johnson@example.com).\n",
"\n",
"Marketing Initiatives and Campaigns\n",
"Our marketing team has been actively working on developing new strategies to increase brand awareness and drive customer engagement. We would like to thank Sarah Thompson (phone: 415-555-1234) for her exceptional efforts in managing our social media platforms. Sarah has successfully increased our follower base by 20% in the past month alone. Moreover, please mark your calendars for the upcoming product launch event on July 15th. We encourage all team members to attend and support this exciting milestone for our company.\n",
"\n",
"Research and Development Projects\n",
"In our pursuit of innovation, our research and development department has been working tirelessly on various projects. I would like to acknowledge the exceptional work of David Rodriguez (email: david.rodriguez@example.com) in his role as project lead. David's contributions to the development of our cutting-edge technology have been instrumental. Furthermore, we would like to remind everyone to share their ideas and suggestions for potential new projects during our monthly R&D brainstorming session, scheduled for July 10th.\n",
"\n",
"Please treat the information in this document with utmost confidentiality and ensure that it is not shared with unauthorized individuals. If you have any questions or concerns regarding the topics discussed, please do not hesitate to reach out to me directly.\n",
"\n",
"Thank you for your attention, and let's continue to work together to achieve our goals.\n",
"\n",
"Best regards,\n",
"\n",
"Jason Fan\n",
"Cofounder & CEO\n",
"Psychic\n",
"jason@psychic.dev\n",
"\n"
]
}
],
"source": [
"sample_text = \"\"\"[Generated with ChatGPT]\n",
"\n",
"Confidential Document - For Internal Use Only\n",
"\n",
"Date: July 1, 2023\n",
"\n",
"Subject: Updates and Discussions on Various Topics\n",
"\n",
"Dear Team,\n",
"\n",
"I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.\n",
"\n",
"Security and Privacy Measures\n",
"As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.\n",
"\n",
"HR Updates and Employee Benefits\n",
"Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: 049-45-5928) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: 418-492-3850, email: michael.johnson@example.com).\n",
"\n",
"Marketing Initiatives and Campaigns\n",
"Our marketing team has been actively working on developing new strategies to increase brand awareness and drive customer engagement. We would like to thank Sarah Thompson (phone: 415-555-1234) for her exceptional efforts in managing our social media platforms. Sarah has successfully increased our follower base by 20% in the past month alone. Moreover, please mark your calendars for the upcoming product launch event on July 15th. We encourage all team members to attend and support this exciting milestone for our company.\n",
"\n",
"Research and Development Projects\n",
"In our pursuit of innovation, our research and development department has been working tirelessly on various projects. I would like to acknowledge the exceptional work of David Rodriguez (email: david.rodriguez@example.com) in his role as project lead. David's contributions to the development of our cutting-edge technology have been instrumental. Furthermore, we would like to remind everyone to share their ideas and suggestions for potential new projects during our monthly R&D brainstorming session, scheduled for July 10th.\n",
"\n",
"Please treat the information in this document with utmost confidentiality and ensure that it is not shared with unauthorized individuals. If you have any questions or concerns regarding the topics discussed, please do not hesitate to reach out to me directly.\n",
"\n",
"Thank you for your attention, and let's continue to work together to achieve our goals.\n",
"\n",
"Best regards,\n",
"\n",
"Jason Fan\n",
"Cofounder & CEO\n",
"Psychic\n",
"jason@psychic.dev\n",
"\"\"\"\n",
"print(sample_text)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"documents = [Document(page_content=sample_text)]\n",
"properties = [\n",
" {\n",
" \"name\": \"category\",\n",
" \"description\": \"What type of email this is.\",\n",
" \"type\": \"string\",\n",
" \"enum\": [\"update\", \"action_item\", \"customer_feedback\", \"announcement\", \"other\"],\n",
" \"required\": True,\n",
" },\n",
" {\n",
" \"name\": \"mentions\",\n",
" \"description\": \"A list of all people mentioned in this email.\",\n",
" \"type\": \"array\",\n",
" \"items\": {\n",
" \"name\": \"full_name\",\n",
" \"description\": \"The full name of the person mentioned.\",\n",
" \"type\": \"string\",\n",
" },\n",
" \"required\": True,\n",
" },\n",
" {\n",
" \"name\": \"eli5\",\n",
" \"description\": \"Explain this email to me like I'm 5 years old.\",\n",
" \"type\": \"string\",\n",
" \"required\": True,\n",
" },\n",
"]\n",
"property_extractor = DoctranPropertyExtractor(properties=properties)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Output\n",
"After extracting properties from a document, the result is returned as a new document with the extracted properties provided in the metadata."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"extracted_document = await property_extractor.atransform_documents(\n",
" documents, properties=properties\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"extracted_properties\": {\n",
" \"category\": \"update\",\n",
" \"mentions\": [\n",
" \"John Doe\",\n",
" \"Jane Smith\",\n",
" \"Michael Johnson\",\n",
" \"Sarah Thompson\",\n",
" \"David Rodriguez\",\n",
" \"Jason Fan\"\n",
" ],\n",
" \"eli5\": \"This is an email from the CEO, Jason Fan, giving updates about different areas in the company. He talks about new security measures and praises John Doe for his work. He also mentions new hires and praises Jane Smith for her work in customer service. The CEO reminds everyone about the upcoming benefits enrollment and says to contact Michael Johnson with any questions. He talks about the marketing team's work and praises Sarah Thompson for increasing their social media followers. There's also a product launch event on July 15th. Lastly, he talks about the research and development projects and praises David Rodriguez for his work. There's a brainstorming session on July 10th.\"\n",
" }\n",
"}\n"
]
}
],
"source": [
"print(json.dumps(extracted_document[0].metadata, indent=2))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
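
The extracted metadata can be sanity-checked against the property schema before it is relied on downstream. The snippet below is a minimal sketch under stated assumptions: `check_properties` and `PYTHON_TYPES` are hypothetical helpers written for this example (not part of doctran or LangChain), the schema is an abridged copy of the notebook's `properties` list, and the metadata is an abridged copy of the notebook output.

```python
from typing import Any, Dict, List

# Abridged copy of the notebook's `properties` schema.
properties = [
    {
        "name": "category",
        "type": "string",
        "enum": ["update", "action_item", "customer_feedback", "announcement", "other"],
        "required": True,
    },
    {"name": "mentions", "type": "array", "required": True},
    {"name": "eli5", "type": "string", "required": True},
]

# Abridged copy of the metadata printed by the notebook.
metadata = {
    "extracted_properties": {
        "category": "update",
        "mentions": ["John Doe", "Jane Smith"],
        "eli5": "This is an email from the CEO giving updates.",
    }
}

# Mapping from schema type names to Python types (covers only the types used here).
PYTHON_TYPES = {"string": str, "array": list}


def check_properties(extracted: Dict[str, Any], schema: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    for prop in schema:
        name = prop["name"]
        if name not in extracted:
            if prop.get("required"):
                problems.append(f"missing required property: {name}")
            continue
        value = extracted[name]
        if not isinstance(value, PYTHON_TYPES[prop["type"]]):
            problems.append(f"{name}: expected {prop['type']}")
        elif "enum" in prop and value not in prop["enum"]:
            problems.append(f"{name}: {value!r} not in enum")
    return problems


problems = check_properties(metadata["extracted_properties"], properties)
print(problems)
```

Because the values come from a model, a check like this is a cheap guard before the metadata is used for filtering or classification.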

@ -0,0 +1,266 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Doctran Interrogate Documents\n",
"Documents used in a vector store knowledge base are typically stored in narrative or conversational format. However, most user queries are in question format. If we convert documents into Q&A format before vectorizing them, we can increase the likelihood of retrieving relevant documents and decrease the likelihood of retrieving irrelevant documents.\n",
"\n",
"We can accomplish this using the [Doctran](https://github.com/psychic-api/doctran) library, which uses OpenAI's function calling feature to \"interrogate\" documents.\n",
"\n",
"See [this notebook](https://github.com/psychic-api/doctran/blob/main/benchmark.ipynb) for benchmarks on vector similarity scores for various queries based on raw documents versus interrogated documents."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install doctran"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"import json\n",
"from langchain.schema import Document\n",
"from langchain.document_transformers import DoctranQATransformer"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from dotenv import load_dotenv\n",
"\n",
"load_dotenv()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Input\n",
"This is the document we'll interrogate."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Generated with ChatGPT]\n",
"\n",
"Confidential Document - For Internal Use Only\n",
"\n",
"Date: July 1, 2023\n",
"\n",
"Subject: Updates and Discussions on Various Topics\n",
"\n",
"Dear Team,\n",
"\n",
"I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.\n",
"\n",
"Security and Privacy Measures\n",
"As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.\n",
"\n",
"HR Updates and Employee Benefits\n",
"Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: 049-45-5928) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: 418-492-3850, email: michael.johnson@example.com).\n",
"\n",
"Marketing Initiatives and Campaigns\n",
"Our marketing team has been actively working on developing new strategies to increase brand awareness and drive customer engagement. We would like to thank Sarah Thompson (phone: 415-555-1234) for her exceptional efforts in managing our social media platforms. Sarah has successfully increased our follower base by 20% in the past month alone. Moreover, please mark your calendars for the upcoming product launch event on July 15th. We encourage all team members to attend and support this exciting milestone for our company.\n",
"\n",
"Research and Development Projects\n",
"In our pursuit of innovation, our research and development department has been working tirelessly on various projects. I would like to acknowledge the exceptional work of David Rodriguez (email: david.rodriguez@example.com) in his role as project lead. David's contributions to the development of our cutting-edge technology have been instrumental. Furthermore, we would like to remind everyone to share their ideas and suggestions for potential new projects during our monthly R&D brainstorming session, scheduled for July 10th.\n",
"\n",
"Please treat the information in this document with utmost confidentiality and ensure that it is not shared with unauthorized individuals. If you have any questions or concerns regarding the topics discussed, please do not hesitate to reach out to me directly.\n",
"\n",
"Thank you for your attention, and let's continue to work together to achieve our goals.\n",
"\n",
"Best regards,\n",
"\n",
"Jason Fan\n",
"Cofounder & CEO\n",
"Psychic\n",
"jason@psychic.dev\n",
"\n"
]
}
],
"source": [
"sample_text = \"\"\"[Generated with ChatGPT]\n",
"\n",
"Confidential Document - For Internal Use Only\n",
"\n",
"Date: July 1, 2023\n",
"\n",
"Subject: Updates and Discussions on Various Topics\n",
"\n",
"Dear Team,\n",
"\n",
"I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.\n",
"\n",
"Security and Privacy Measures\n",
"As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.\n",
"\n",
"HR Updates and Employee Benefits\n",
"Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: 049-45-5928) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: 418-492-3850, email: michael.johnson@example.com).\n",
"\n",
"Marketing Initiatives and Campaigns\n",
"Our marketing team has been actively working on developing new strategies to increase brand awareness and drive customer engagement. We would like to thank Sarah Thompson (phone: 415-555-1234) for her exceptional efforts in managing our social media platforms. Sarah has successfully increased our follower base by 20% in the past month alone. Moreover, please mark your calendars for the upcoming product launch event on July 15th. We encourage all team members to attend and support this exciting milestone for our company.\n",
"\n",
"Research and Development Projects\n",
"In our pursuit of innovation, our research and development department has been working tirelessly on various projects. I would like to acknowledge the exceptional work of David Rodriguez (email: david.rodriguez@example.com) in his role as project lead. David's contributions to the development of our cutting-edge technology have been instrumental. Furthermore, we would like to remind everyone to share their ideas and suggestions for potential new projects during our monthly R&D brainstorming session, scheduled for July 10th.\n",
"\n",
"Please treat the information in this document with utmost confidentiality and ensure that it is not shared with unauthorized individuals. If you have any questions or concerns regarding the topics discussed, please do not hesitate to reach out to me directly.\n",
"\n",
"Thank you for your attention, and let's continue to work together to achieve our goals.\n",
"\n",
"Best regards,\n",
"\n",
"Jason Fan\n",
"Cofounder & CEO\n",
"Psychic\n",
"jason@psychic.dev\n",
"\"\"\"\n",
"print(sample_text)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"documents = [Document(page_content=sample_text)]\n",
"qa_transformer = DoctranQATransformer()\n",
"transformed_document = await qa_transformer.atransform_documents(documents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Output\n",
"After interrogating a document, the result will be returned as a new document with questions and answers provided in the metadata."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"questions_and_answers\": [\n",
" {\n",
" \"question\": \"What is the purpose of this document?\",\n",
" \"answer\": \"The purpose of this document is to provide important updates and discuss various topics that require the team's attention.\"\n",
" },\n",
" {\n",
" \"question\": \"Who is responsible for enhancing the network security?\",\n",
" \"answer\": \"John Doe from the IT department is responsible for enhancing the network security.\"\n",
" },\n",
" {\n",
" \"question\": \"Where should potential security risks or incidents be reported?\",\n",
" \"answer\": \"Potential security risks or incidents should be reported to the dedicated team at security@example.com.\"\n",
" },\n",
" {\n",
" \"question\": \"Who has been recognized for outstanding performance in customer service?\",\n",
" \"answer\": \"Jane Smith has been recognized for her outstanding performance in customer service.\"\n",
" },\n",
" {\n",
" \"question\": \"When is the open enrollment period for the employee benefits program?\",\n",
" \"answer\": \"The document does not specify the exact dates for the open enrollment period for the employee benefits program, but it mentions that it is fast approaching.\"\n",
" },\n",
" {\n",
" \"question\": \"Who should be contacted for questions or assistance regarding the employee benefits program?\",\n",
" \"answer\": \"For questions or assistance regarding the employee benefits program, the HR representative, Michael Johnson, should be contacted.\"\n",
" },\n",
" {\n",
" \"question\": \"Who has been acknowledged for managing the company's social media platforms?\",\n",
" \"answer\": \"Sarah Thompson has been acknowledged for managing the company's social media platforms.\"\n",
" },\n",
" {\n",
" \"question\": \"When is the upcoming product launch event?\",\n",
" \"answer\": \"The upcoming product launch event is on July 15th.\"\n",
" },\n",
" {\n",
" \"question\": \"Who has been recognized for their contributions to the development of the company's technology?\",\n",
" \"answer\": \"David Rodriguez has been recognized for his contributions to the development of the company's technology.\"\n",
" },\n",
" {\n",
" \"question\": \"When is the monthly R&D brainstorming session?\",\n",
" \"answer\": \"The monthly R&D brainstorming session is scheduled for July 10th.\"\n",
" },\n",
" {\n",
" \"question\": \"Who should be contacted for questions or concerns regarding the topics discussed in the document?\",\n",
" \"answer\": \"For questions or concerns regarding the topics discussed in the document, Jason Fan, the Cofounder & CEO, should be contacted.\"\n",
" }\n",
" ]\n",
"}\n"
]
}
],
"source": [
"print(json.dumps(transformed_document[0].metadata, indent=2))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
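
Since the questions and answers land in the metadata rather than in `page_content`, they need to be turned into text before they can be vectorized. The snippet below is one possible sketch of that step: `qa_pairs_to_texts` is a hypothetical helper written for this example (not part of doctran or LangChain), and the `Q:`/`A:` layout is an arbitrary choice. The two entries are copied from the notebook output; the rest are omitted.

```python
from typing import Any, Dict, List

# Two entries copied from the notebook output; the remaining pairs are omitted.
metadata = {
    "questions_and_answers": [
        {
            "question": "When is the upcoming product launch event?",
            "answer": "The upcoming product launch event is on July 15th.",
        },
        {
            "question": "When is the monthly R&D brainstorming session?",
            "answer": "The monthly R&D brainstorming session is scheduled for July 10th.",
        },
    ]
}


def qa_pairs_to_texts(meta: Dict[str, Any]) -> List[str]:
    """Render each question/answer pair as one string suitable for embedding."""
    return [
        f"Q: {pair['question']}\nA: {pair['answer']}"
        for pair in meta.get("questions_and_answers", [])
    ]


texts = qa_pairs_to_texts(metadata)
print(texts[0])
```

Each resulting string can then be embedded on its own, so that a user's question is compared against question-shaped text.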

@ -0,0 +1,208 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Doctran Translate Documents\n",
"Comparing documents through embeddings has the benefit of working across multiple languages. \"Harrison says hello\" and \"Harrison dice hola\" will occupy similar positions in the vector space because they have the same meaning semantically.\n",
"\n",
"However, it can still be useful to use an LLM to translate documents into other languages before vectorizing them. This is especially helpful when users are expected to query the knowledge base in different languages, or when state-of-the-art embedding models are not available for a given language.\n",
"\n",
"We can accomplish this using the [Doctran](https://github.com/psychic-api/doctran) library, which uses OpenAI's function calling feature to translate documents between languages."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install doctran"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from langchain.schema import Document\n",
"from langchain.document_transformers import DoctranTextTranslator"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from dotenv import load_dotenv\n",
"\n",
"load_dotenv()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Input\n",
"This is the document we'll translate."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"sample_text = \"\"\"[Generated with ChatGPT]\n",
"\n",
"Confidential Document - For Internal Use Only\n",
"\n",
"Date: July 1, 2023\n",
"\n",
"Subject: Updates and Discussions on Various Topics\n",
"\n",
"Dear Team,\n",
"\n",
"I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.\n",
"\n",
"Security and Privacy Measures\n",
"As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe@example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security@example.com.\n",
"\n",
"HR Updates and Employee Benefits\n",
"Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: 049-45-5928) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: 418-492-3850, email: michael.johnson@example.com).\n",
"\n",
"Marketing Initiatives and Campaigns\n",
"Our marketing team has been actively working on developing new strategies to increase brand awareness and drive customer engagement. We would like to thank Sarah Thompson (phone: 415-555-1234) for her exceptional efforts in managing our social media platforms. Sarah has successfully increased our follower base by 20% in the past month alone. Moreover, please mark your calendars for the upcoming product launch event on July 15th. We encourage all team members to attend and support this exciting milestone for our company.\n",
"\n",
"Research and Development Projects\n",
"In our pursuit of innovation, our research and development department has been working tirelessly on various projects. I would like to acknowledge the exceptional work of David Rodriguez (email: david.rodriguez@example.com) in his role as project lead. David's contributions to the development of our cutting-edge technology have been instrumental. Furthermore, we would like to remind everyone to share their ideas and suggestions for potential new projects during our monthly R&D brainstorming session, scheduled for July 10th.\n",
"\n",
"Please treat the information in this document with utmost confidentiality and ensure that it is not shared with unauthorized individuals. If you have any questions or concerns regarding the topics discussed, please do not hesitate to reach out to me directly.\n",
"\n",
"Thank you for your attention, and let's continue to work together to achieve our goals.\n",
"\n",
"Best regards,\n",
"\n",
"Jason Fan\n",
"Cofounder & CEO\n",
"Psychic\n",
"jason@psychic.dev\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"documents = [Document(page_content=sample_text)]\n",
"qa_translator = DoctranTextTranslator(language=\"spanish\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Output\n",
"After translating a document, the result is returned as a new document with its `page_content` translated into the target language."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"translated_document = await qa_translator.atransform_documents(documents)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Generado con ChatGPT]\n",
"\n",
"Documento confidencial - Solo para uso interno\n",
"\n",
"Fecha: 1 de julio de 2023\n",
"\n",
"Asunto: Actualizaciones y discusiones sobre varios temas\n",
"\n",
"Estimado equipo,\n",
"\n",
"Espero que este correo electrónico les encuentre bien. En este documento, me gustaría proporcionarles algunas actualizaciones importantes y discutir varios temas que requieren nuestra atención. Por favor, traten la información contenida aquí como altamente confidencial.\n",
"\n",
"Medidas de seguridad y privacidad\n",
"Como parte de nuestro compromiso continuo para garantizar la seguridad y privacidad de los datos de nuestros clientes, hemos implementado medidas robustas en todos nuestros sistemas. Nos gustaría elogiar a John Doe (correo electrónico: john.doe@example.com) del departamento de TI por su diligente trabajo en mejorar nuestra seguridad de red. En adelante, recordamos amablemente a todos que se adhieran estrictamente a nuestras políticas y directrices de protección de datos. Además, si se encuentran con cualquier riesgo de seguridad o incidente potencial, por favor repórtelo inmediatamente a nuestro equipo dedicado en security@example.com.\n",
"\n",
"Actualizaciones de RRHH y beneficios para empleados\n",
"Recientemente, dimos la bienvenida a varios nuevos miembros del equipo que han hecho contribuciones significativas a sus respectivos departamentos. Me gustaría reconocer a Jane Smith (SSN: 049-45-5928) por su sobresaliente rendimiento en el servicio al cliente. Jane ha recibido constantemente comentarios positivos de nuestros clientes. Además, recuerden que el período de inscripción abierta para nuestro programa de beneficios para empleados se acerca rápidamente. Si tienen alguna pregunta o necesitan asistencia, por favor contacten a nuestro representante de RRHH, Michael Johnson (teléfono: 418-492-3850, correo electrónico: michael.johnson@example.com).\n",
"\n",
"Iniciativas y campañas de marketing\n",
"Nuestro equipo de marketing ha estado trabajando activamente en el desarrollo de nuevas estrategias para aumentar la conciencia de marca y fomentar la participación del cliente. Nos gustaría agradecer a Sarah Thompson (teléfono: 415-555-1234) por sus excepcionales esfuerzos en la gestión de nuestras plataformas de redes sociales. Sarah ha aumentado con éxito nuestra base de seguidores en un 20% solo en el último mes. Además, por favor marquen sus calendarios para el próximo evento de lanzamiento de producto el 15 de julio. Animamos a todos los miembros del equipo a asistir y apoyar este emocionante hito para nuestra empresa.\n",
"\n",
"Proyectos de investigación y desarrollo\n",
"En nuestra búsqueda de la innovación, nuestro departamento de investigación y desarrollo ha estado trabajando incansablemente en varios proyectos. Me gustaría reconocer el excepcional trabajo de David Rodríguez (correo electrónico: david.rodriguez@example.com) en su papel de líder de proyecto. Las contribuciones de David al desarrollo de nuestra tecnología de vanguardia han sido fundamentales. Además, nos gustaría recordar a todos que compartan sus ideas y sugerencias para posibles nuevos proyectos durante nuestra sesión de lluvia de ideas de I+D mensual, programada para el 10 de julio.\n",
"\n",
"Por favor, traten la información de este documento con la máxima confidencialidad y asegúrense de que no se comparte con personas no autorizadas. Si tienen alguna pregunta o inquietud sobre los temas discutidos, no duden en ponerse en contacto conmigo directamente.\n",
"\n",
"Gracias por su atención, y sigamos trabajando juntos para alcanzar nuestros objetivos.\n",
"\n",
"Saludos cordiales,\n",
"\n",
"Jason Fan\n",
"Cofundador y CEO\n",
"Psychic\n",
"jason@psychic.dev\n"
]
}
],
"source": [
"print(translated_document[0].page_content)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@ -45,3 +45,13 @@ print(texts[1])
```
</CodeOutputBlock>
## Other transformations:
### Filter redundant docs, translate docs, extract metadata, and more
We can perform a number of transformations on docs beyond simply splitting the text. With the
`EmbeddingsRedundantFilter` we can identify similar documents and filter out redundancies. With integrations like
[doctran](https://github.com/psychic-api/doctran/tree/main) we can translate documents from one language
to another, extract desired properties and add them to metadata, and convert conversational dialogue into a
Q/A-format set of documents.
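To make the idea of redundancy filtering concrete, here is a minimal pure-Python sketch. It is not the `EmbeddingsRedundantFilter` implementation: the function names, the toy 2-D vectors, and the `0.95` threshold are all assumptions chosen for illustration. It greedily keeps an embedding only if it is not too similar to one already kept.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filter_redundant(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.95
) -> List[int]:
    """Return the indices of embeddings to keep, dropping near-duplicates.

    An embedding is dropped when its similarity to an already-kept
    embedding exceeds `threshold` (a hypothetical default).
    """
    kept: List[int] = []
    for i, candidate in enumerate(embeddings):
        if all(
            cosine_similarity(candidate, embeddings[j]) <= threshold for j in kept
        ):
            kept.append(i)
    return kept


# Toy "embeddings": the second vector is nearly parallel to the first.
toy_embeddings = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
kept_indices = filter_redundant(toy_embeddings)
```

In a real pipeline the vectors would come from an embedding model, and the kept indices would select which documents to pass downstream.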

View File

@ -0,0 +1,19 @@
from langchain.document_transformers.doctran_text_extract import (
    DoctranPropertyExtractor,
)
from langchain.document_transformers.doctran_text_qa import DoctranQATransformer
from langchain.document_transformers.doctran_text_translate import DoctranTextTranslator
from langchain.document_transformers.embeddings_redundant_filter import (
    EmbeddingsClusteringFilter,
    EmbeddingsRedundantFilter,
    get_stateful_documents,
)

__all__ = [
    "DoctranQATransformer",
    "DoctranTextTranslator",
    "DoctranPropertyExtractor",
    "EmbeddingsClusteringFilter",
    "EmbeddingsRedundantFilter",
    "get_stateful_documents",
]

View File

@ -0,0 +1,88 @@
from typing import Any, List, Optional, Sequence

from langchain.schema import BaseDocumentTransformer, Document
from langchain.utils import get_from_env


class DoctranPropertyExtractor(BaseDocumentTransformer):
    """Extracts properties from text documents using doctran.

    Arguments:
        properties: A list of the properties to extract.
        openai_api_key: OpenAI API key. Can also be specified via environment variable
            ``OPENAI_API_KEY``.

    Example:
        .. code-block:: python

            from langchain.document_transformers import DoctranPropertyExtractor

            properties = [
                {
                    "name": "category",
                    "description": "What type of email this is.",
                    "type": "string",
                    "enum": ["update", "action_item", "customer_feedback", "announcement", "other"],
                    "required": True,
                },
                {
                    "name": "mentions",
                    "description": "A list of all people mentioned in this email.",
                    "type": "array",
                    "items": {
                        "name": "full_name",
                        "description": "The full name of the person mentioned.",
                        "type": "string",
                    },
                    "required": True,
                },
                {
                    "name": "eli5",
                    "description": "Explain this email to me like I'm 5 years old.",
                    "type": "string",
                    "required": True,
                },
            ]

            # Pass in openai_api_key or set env var OPENAI_API_KEY
            property_extractor = DoctranPropertyExtractor(properties)
            transformed_document = await property_extractor.atransform_documents(documents)
    """  # noqa: E501

    def __init__(
        self,
        properties: List[dict],
        openai_api_key: Optional[str] = None,
    ) -> None:
        self.properties = properties
        self.openai_api_key = openai_api_key or get_from_env(
            "openai_api_key", "OPENAI_API_KEY"
        )

    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        raise NotImplementedError

    async def atransform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        """Extracts properties from text documents using doctran."""
        try:
            from doctran import Doctran, ExtractProperty

            doctran = Doctran(openai_api_key=self.openai_api_key)
        except ImportError:
            raise ImportError(
                "Install doctran to use this parser. (pip install doctran)"
            )
        properties = [ExtractProperty(**property) for property in self.properties]
        for d in documents:
            doctran_doc = (
                await doctran.parse(content=d.page_content)
                .extract(properties=properties)
                .execute()
            )
            # The input documents are updated in place with the extracted properties.
            d.metadata["extracted_properties"] = doctran_doc.extracted_properties
        return documents

View File

@ -0,0 +1,54 @@
from typing import Any, Optional, Sequence

from langchain.schema import BaseDocumentTransformer, Document
from langchain.utils import get_from_env


class DoctranQATransformer(BaseDocumentTransformer):
    """Extracts QA from text documents using doctran.

    Arguments:
        openai_api_key: OpenAI API key. Can also be specified via environment variable
            ``OPENAI_API_KEY``.

    Example:
        .. code-block:: python

            from langchain.document_transformers import DoctranQATransformer

            # Pass in openai_api_key or set env var OPENAI_API_KEY
            qa_transformer = DoctranQATransformer()
            transformed_document = await qa_transformer.atransform_documents(documents)
    """

    def __init__(self, openai_api_key: Optional[str] = None) -> None:
        self.openai_api_key = openai_api_key or get_from_env(
            "openai_api_key", "OPENAI_API_KEY"
        )

    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        raise NotImplementedError

    async def atransform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        """Extracts QA from text documents using doctran."""
        try:
            from doctran import Doctran

            doctran = Doctran(openai_api_key=self.openai_api_key)
        except ImportError:
            raise ImportError(
                "Install doctran to use this parser. (pip install doctran)"
            )
        for d in documents:
            doctran_doc = (
                await doctran.parse(content=d.page_content).interrogate().execute()
            )
            questions_and_answers = doctran_doc.extracted_properties.get(
                "questions_and_answers"
            )
            d.metadata["questions_and_answers"] = questions_and_answers
        return documents
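The transformer above stores its result under `metadata["questions_and_answers"]`, which may be `None` when nothing was extracted. One way to prepare those pairs for a vector index is sketched below. The helper name `qa_pairs_to_texts` is hypothetical, and the shape of each pair (a dict with `question` and `answer` keys) is an assumption that this diff does not confirm.

```python
from typing import Any, Dict, List


def qa_pairs_to_texts(metadata: Dict[str, Any]) -> List[str]:
    """Flatten Q&A pairs stored in document metadata into strings for embedding.

    Assumes each pair is a dict with "question" and "answer" keys;
    that shape is an assumption, not something this diff confirms.
    """
    # The metadata value may be missing or None, so fall back to an empty list.
    pairs = metadata.get("questions_and_answers") or []
    return [f"Q: {pair['question']}\nA: {pair['answer']}" for pair in pairs]


# Hypothetical metadata, shaped as the transformer is assumed to produce it.
example_metadata = {
    "questions_and_answers": [
        {"question": "When is the product launch event?", "answer": "July 15th."},
    ]
}
texts = qa_pairs_to_texts(example_metadata)
```

Each resulting string could then be embedded on its own, so that a user's question is matched against question-shaped text rather than the raw document.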

View File

@ -0,0 +1,59 @@
from typing import Any, Optional, Sequence

from langchain.schema import BaseDocumentTransformer, Document
from langchain.utils import get_from_env


class DoctranTextTranslator(BaseDocumentTransformer):
    """Translates text documents using doctran.

    Arguments:
        openai_api_key: OpenAI API key. Can also be specified via environment variable
            ``OPENAI_API_KEY``.
        language: The language to translate *to*.

    Example:
        .. code-block:: python

            from langchain.document_transformers import DoctranTextTranslator

            # Pass in openai_api_key or set env var OPENAI_API_KEY
            qa_translator = DoctranTextTranslator(language="spanish")
            translated_document = await qa_translator.atransform_documents(documents)
    """

    def __init__(
        self, openai_api_key: Optional[str] = None, language: str = "english"
    ) -> None:
        self.openai_api_key = openai_api_key or get_from_env(
            "openai_api_key", "OPENAI_API_KEY"
        )
        self.language = language

    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        raise NotImplementedError

    async def atransform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        """Translates text documents using doctran."""
        try:
            from doctran import Doctran

            doctran = Doctran(openai_api_key=self.openai_api_key)
        except ImportError:
            raise ImportError(
                "Install doctran to use this parser. (pip install doctran)"
            )
        doctran_docs = [
            doctran.parse(content=doc.page_content, metadata=doc.metadata)
            for doc in documents
        ]
        for i, doc in enumerate(doctran_docs):
            doctran_docs[i] = await doc.translate(language=self.language).execute()
        return [
            Document(page_content=doc.transformed_content, metadata=doc.metadata)
            for doc in doctran_docs
        ]
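All three doctran transformers in this commit raise `NotImplementedError` from `transform_documents`, so synchronous callers have to drive `atransform_documents` themselves. The sketch below shows one way to do that with `asyncio.run`. It uses a stand-in `Document` dataclass and a hypothetical `UppercaseTransformer` in place of the real classes, so that it runs without doctran or an API key.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence


@dataclass
class Document:
    """Minimal stand-in for `langchain.schema.Document`."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class UppercaseTransformer:
    """Hypothetical async-only transformer; uppercases text instead of calling an API."""

    async def atransform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> List[Document]:
        await asyncio.sleep(0)  # placeholder for a network round trip
        return [
            Document(page_content=d.page_content.upper(), metadata=d.metadata)
            for d in documents
        ]


def transform_sync(
    transformer: UppercaseTransformer, documents: Sequence[Document]
) -> List[Document]:
    """Run the async method to completion from synchronous code."""
    return asyncio.run(transformer.atransform_documents(documents))


docs = [Document(page_content="hola", metadata={"source": "demo"})]
result = transform_sync(UppercaseTransformer(), docs)
```

`asyncio.run` raises a `RuntimeError` when an event loop is already running, as in a Jupyter notebook; there, `await` the coroutine directly, as the notebook in this commit does.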

View File

@ -5,7 +5,7 @@ import numpy as np
 from pydantic import root_validator
 from langchain.callbacks.manager import Callbacks
-from langchain.document_transformers import (
+from langchain.document_transformers.embeddings_redundant_filter import (
     _get_embeddings_from_stateful_docs,
     get_stateful_documents,
 )

View File

@ -1,7 +1,9 @@
 """Integration test for embedding-based relevant doc filtering."""
 import numpy as np
-from langchain.document_transformers import _DocumentWithState
+from langchain.document_transformers.embeddings_redundant_filter import (
+    _DocumentWithState,
+)
 from langchain.embeddings import OpenAIEmbeddings
 from langchain.retrievers.document_compressors import EmbeddingsFilter
 from langchain.schema import Document

View File

@ -1,5 +1,5 @@
 """Integration test for embedding-based redundant doc filtering."""
-from langchain.document_transformers import (
+from langchain.document_transformers.embeddings_redundant_filter import (
     EmbeddingsClusteringFilter,
     EmbeddingsRedundantFilter,
     _DocumentWithState,

View File

@ -1,5 +1,7 @@
 """Unit tests for document transformers."""
-from langchain.document_transformers import _filter_similar_embeddings
+from langchain.document_transformers.embeddings_redundant_filter import (
+    _filter_similar_embeddings,
+)
 from langchain.math_utils import cosine_similarity