Merge pull request #12433

* feat: Add Google Cloud Translation document transformer
* Merge branch 'langchain-ai:master' into google-translate
* Add documentation for Google Translate Document Transformer
* Fix line length error
* Merge branch 'master' into google-translate
* Merge branch 'google-translate' of https://github.com/holtskinner/lan…
* Addressed code review comments
* Merge branch 'master' into google-translate
* Merge branch 'google-translate' of https://github.com/holtskinner/lan…
* Removed extra variable
* Merge branch 'google-translate' of https://github.com/holtskinner/lan…
* Merge branch 'master' into google-translate
* Merge branch 'google-translate' of https://github.com/holtskinner/lan…
* Removed extra import
11 months ago · e05bb938de
parent d1fdcd4fcb
commit e05bb938de
5 changed files with 353 additions and 10 deletions
--- a/docs/api_reference/guide_imports.json
+++ b/docs/api_reference/guide_imports.json
--- a/docs/docs/integrations/document_transformers/google_translate.ipynb
+++ b/docs/docs/integrations/document_transformers/google_translate.ipynb
@@ -0,0 +1,215 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Google Translate\n",
+    "\n",
+    "[Google Translate](https://translate.google.com/) is a multilingual neural machine translation service developed by Google to translate text, documents and websites from one language into another.\n",
+    "\n",
+    "The `GoogleTranslateTransformer` allows you to translate text and HTML with the [Google Cloud Translation API](https://cloud.google.com/translate).\n",
+    "\n",
+    "To use it, you need the `google-cloud-translate` Python package installed and a Google Cloud project with the [Translation API enabled](https://cloud.google.com/translate/docs/setup). This transformer uses the [Advanced edition (v3)](https://cloud.google.com/translate/docs/intro-to-v3).\n",
+    "\n",
+    "- [Google Neural Machine Translation](https://en.wikipedia.org/wiki/Google_Neural_Machine_Translation)\n",
+    "- [A Neural Network for Machine Translation, at Production Scale](https://blog.research.google/2016/09/a-neural-network-for-machine.html)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! pip install google-cloud-translate\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.schema import Document\n",
+    "from langchain.document_transformers import GoogleTranslateTransformer\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Input\n",
+    "\n",
+    "This is the document we'll translate."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sample_text = \"\"\"[Generated with Google Bard]\n",
+    "Subject: Key Business Process Updates\n",
+    "\n",
+    "Date: Friday, 27 October 2023\n",
+    "\n",
+    "Dear team,\n",
+    "\n",
+    "I am writing to provide an update on some of our key business processes.\n",
+    "\n",
+    "Sales process\n",
+    "\n",
+    "We have recently implemented a new sales process that is designed to help us close more deals and grow our revenue. The new process includes a more rigorous qualification process, a more streamlined proposal process, and a more effective customer relationship management (CRM) system.\n",
+    "\n",
+    "Marketing process\n",
+    "\n",
+    "We have also revamped our marketing process to focus on creating more targeted and engaging content. We are also using more social media and paid advertising to reach a wider audience.\n",
+    "\n",
+    "Customer service process\n",
+    "\n",
+    "We have also made some improvements to our customer service process. We have implemented a new customer support system that makes it easier for customers to get help with their problems. We have also hired more customer support representatives to reduce wait times.\n",
+    "\n",
+    "Overall, we are very pleased with the progress we have made on improving our key business processes. We believe that these changes will help us to achieve our goals of growing our business and providing our customers with the best possible experience.\n",
+    "\n",
+    "If you have any questions or feedback about any of these changes, please feel free to contact me directly.\n",
+    "\n",
+    "Thank you,\n",
+    "\n",
+    "Lewis Cymbal\n",
+    "CEO, Cymbal Bank\n",
+    "\"\"\"\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "When initializing the `GoogleTranslateTransformer`, you can include the following parameters to configure the requests.\n",
+    "\n",
+    "- `project_id`: Google Cloud Project ID.\n",
+    "- `location`: (Optional) Translate model location.\n",
+    "  - Default: `global` \n",
+    "- `model_id`: (Optional) Translate [model ID][models] to use.\n",
+    "- `glossary_id`: (Optional) Translate [glossary ID][glossaries] to use.\n",
+    "- `api_endpoint`: (Optional) [Regional endpoint][endpoints] to use.\n",
+    "\n",
+    "[models]: https://cloud.google.com/translate/docs/advanced/translating-text-v3#comparing-models\n",
+    "[glossaries]: https://cloud.google.com/translate/docs/advanced/glossary\n",
+    "[endpoints]: https://cloud.google.com/translate/docs/advanced/endpoints"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "documents = [Document(page_content=sample_text)]\n",
+    "translator = GoogleTranslateTransformer(project_id=\"<YOUR_PROJECT_ID>\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Output\n",
+    "\n",
+    "After translating a document, the result will be returned as a new document with the `page_content` translated into the target language.\n",
+    "\n",
+    "You can provide the following keyword parameters to the `transform_documents()` method:\n",
+    "\n",
+    "- `target_language_code`: [ISO 639][iso-639] language code of the output document.\n",
+    "    - For supported languages, refer to [Language support][supported-languages].\n",
+    "- `source_language_code`: (Optional) [ISO 639][iso-639] language code of the input document.\n",
+    "    - If not provided, language will be auto-detected.\n",
+    "- `mime_type`: (Optional) [Media Type][media-type] of the input text.\n",
+    "    - Options: `text/plain` (Default), `text/html`.\n",
+    "\n",
+    "[iso-639]: https://en.wikipedia.org/wiki/ISO_639\n",
+    "[supported-languages]: https://cloud.google.com/translate/docs/languages\n",
+    "[media-type]: https://en.wikipedia.org/wiki/Media_type"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "translated_documents = translator.transform_documents(documents, target_language_code=\"es\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'model': '', 'detected_language_code': 'en'}\n",
+      "[Generado con Google Bard]\n",
+      "Asunto: Actualizaciones clave de procesos comerciales\n",
+      "\n",
+      "Fecha: viernes 27 de octubre de 2023\n",
+      "\n",
+      "Estimado equipo,\n",
+      "\n",
+      "Le escribo para brindarle una actualización sobre algunos de nuestros procesos comerciales clave.\n",
+      "\n",
+      "Proceso de ventas\n",
+      "\n",
+      "Recientemente implementamos un nuevo proceso de ventas que está diseñado para ayudarnos a cerrar más acuerdos y aumentar nuestros ingresos. El nuevo proceso incluye un proceso de calificación más riguroso, un proceso de propuesta más simplificado y un sistema de gestión de relaciones con el cliente (CRM) más eficaz.\n",
+      "\n",
+      "Proceso de mercadeo\n",
+      "\n",
+      "También hemos renovado nuestro proceso de marketing para centrarnos en crear contenido más específico y atractivo. También estamos utilizando más redes sociales y publicidad paga para llegar a una audiencia más amplia.\n",
+      "\n",
+      "proceso de atención al cliente\n",
+      "\n",
+      "También hemos realizado algunas mejoras en nuestro proceso de atención al cliente. Hemos implementado un nuevo sistema de atención al cliente que facilita que los clientes obtengan ayuda con sus problemas. También hemos contratado más representantes de atención al cliente para reducir los tiempos de espera.\n",
+      "\n",
+      "En general, estamos muy satisfechos con el progreso que hemos logrado en la mejora de nuestros procesos comerciales clave. Creemos que estos cambios nos ayudarán a lograr nuestros objetivos de hacer crecer nuestro negocio y brindar a nuestros clientes la mejor experiencia posible.\n",
+      "\n",
+      "Si tiene alguna pregunta o comentario sobre cualquiera de estos cambios, no dude en ponerse en contacto conmigo directamente.\n",
+      "\n",
+      "Gracias,\n",
+      "\n",
+      "Platillo Lewis\n",
+      "Director ejecutivo, banco de platillos\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "for doc in translated_documents:\n",
+    "    print(doc.metadata)\n",
+    "    print(doc.page_content)\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
--- a/docs/docs/integrations/platforms/google.mdx
+++ b/docs/docs/integrations/platforms/google.mdx
@@ -83,7 +83,7 @@ First, we need to install several python packages.
 pip install google-api-python-client google-auth-httplib2 google-auth-oauthlib
 ```

-See a [usage example and authorizing instructions](/docs/integrations/document_loaders/google_drive).
+See a [usage example and authorization instructions](/docs/integrations/document_loaders/google_drive).

 ```python
 from langchain.document_loaders import GoogleDriveLoader
@@ -101,7 +101,7 @@ First, we need to install the python package.
 pip install google-cloud-speech
 ```

-See a [usage example and authorizing instructions](/docs/integrations/document_loaders/google_speech_to_text).
+See a [usage example and authorization instructions](/docs/integrations/document_loaders/google_speech_to_text).

 ```python
 from langchain.document_loaders import GoogleSpeechToTextLoader
@@ -221,15 +221,14 @@ pip install googlemaps
 from langchain.tools import GooglePlacesTool
 ```

-## Document Transformer
+## Document Transformers
+
 ### Google Document AI

 >[Document AI](https://cloud.google.com/document-ai/docs/overview) is a `Google Cloud Platform` 
 > service to transform unstructured data from documents into structured data, making it easier 
 > to understand, analyze, and consume.  

-
-
 We need to set up a [`GCS` bucket and create your own OCR processor](https://cloud.google.com/document-ai/docs/create-processor)  
 The `GCS_OUTPUT_PATH` should be a path to a folder on GCS (starting with `gs://`) 
 and a processor name should look like `projects/PROJECT_NUMBER/locations/LOCATION/processors/PROCESSOR_ID`.
@@ -241,7 +240,6 @@ pip install google-cloud-documentai
 pip install google-cloud-documentai-toolbox
 ```

-
 See a [usage example](/docs/integrations/document_transformers/docai).

 ```python
@@ -249,6 +247,28 @@ from langchain.document_loaders.blob_loaders import Blob
 from langchain.document_loaders.parsers import DocAIParser
 ```

+### Google Translate
+
+> [Google Translate](https://translate.google.com/) is a multilingual neural machine
+> translation service developed by Google to translate text, documents and websites
+> from one language into another.
+
+The `GoogleTranslateTransformer` allows you to translate text and HTML with the [Google Cloud Translation API](https://cloud.google.com/translate).
+
+To use it, you need the `google-cloud-translate` Python package and a Google Cloud project with the [Translation API enabled](https://cloud.google.com/translate/docs/setup). This transformer uses the [Advanced edition (v3)](https://cloud.google.com/translate/docs/intro-to-v3).
+
+First, install the Python package.
+
+```bash
+pip install google-cloud-translate
+```
+
+See a [usage example and authorization instructions](/docs/integrations/document_transformers/google_translate).
+
+```python
+from langchain.document_transformers import GoogleTranslateTransformer
+```
+
 ## Chat loaders
 ### Gmail

@@ -260,7 +280,7 @@ First, we need to install several python packages.
 pip install --upgrade google-auth google-auth-oauthlib google-auth-httplib2 google-api-python-client
 ```

-See a [usage example and authorizing instructions](/docs/integrations/chat_loaders/gmail).
+See a [usage example and authorization instructions](/docs/integrations/chat_loaders/gmail).

 ```python
 from langchain.chat_loaders.gmail import GMailLoader
@@ -269,7 +289,7 @@ from langchain.chat_loaders.gmail import GMailLoader
 ## Agents and Toolkits
 ### Gmail

-See a [usage example and authorizing instructions](/docs/integrations/toolkits/gmail).
+See a [usage example and authorization instructions](/docs/integrations/toolkits/gmail).

 ```python
 from langchain.agents.agent_toolkits import GmailToolkit
@@ -279,7 +299,7 @@ toolkit = GmailToolkit()

 ### Google Drive

-See a [usage example and authorizing instructions](/docs/integrations/toolkits/google_drive).
+See a [usage example and authorization instructions](/docs/integrations/toolkits/google_drive).

 ```python
 from langchain_googledrive.utilities.google_drive import GoogleDriveAPIWrapper
--- a/libs/langchain/langchain/document_transformers/__init__.py
+++ b/libs/langchain/langchain/document_transformers/__init__.py
@@ -28,6 +28,7 @@ from langchain.document_transformers.embeddings_redundant_filter import (
    EmbeddingsRedundantFilter,
    get_stateful_documents,
 )
+from langchain.document_transformers.google_translate import GoogleTranslateTransformer
 from langchain.document_transformers.html2text import Html2TextTransformer
 from langchain.document_transformers.long_context_reorder import LongContextReorder
 from langchain.document_transformers.nuclia_text_transform import NucliaTextTransformer
@@ -40,6 +41,7 @@ __all__ = [
    "DoctranPropertyExtractor",
    "EmbeddingsClusteringFilter",
    "EmbeddingsRedundantFilter",
+    "GoogleTranslateTransformer",
    "get_stateful_documents",
    "LongContextReorder",
    "NucliaTextTransformer",
--- a/libs/langchain/langchain/document_transformers/google_translate.py
+++ b/libs/langchain/langchain/document_transformers/google_translate.py
@@ -0,0 +1,106 @@
+from typing import Any, Optional, Sequence
+
+from langchain.schema import BaseDocumentTransformer, Document
+from langchain.utilities.vertexai import get_client_info
+
+
+class GoogleTranslateTransformer(BaseDocumentTransformer):
+    """Translate text documents using Google Cloud Translation."""
+
+    def __init__(
+        self,
+        project_id: str,
+        *,
+        location: str = "global",
+        model_id: Optional[str] = None,
+        glossary_id: Optional[str] = None,
+        api_endpoint: Optional[str] = None,
+    ) -> None:
+        """
+        Arguments:
+            project_id: Google Cloud Project ID.
+            location: (Optional) Translate model location.
+            model_id: (Optional) Translate model ID to use.
+            glossary_id: (Optional) Translate glossary ID to use.
+            api_endpoint: (Optional) Regional endpoint to use.
+        """
+        try:
+            from google.api_core.client_options import ClientOptions
+            from google.cloud import translate
+        except ImportError as exc:
+            raise ImportError(
+                "Install Google Cloud Translate to use this transformer. "
+                "(pip install google-cloud-translate)"
+            ) from exc
+
+        self.project_id = project_id
+        self.location = location
+        self.model_id = model_id
+        self.glossary_id = glossary_id
+
+        self._client = translate.TranslationServiceClient(
+            client_info=get_client_info("translate"),
+            client_options=(
+                ClientOptions(api_endpoint=api_endpoint) if api_endpoint else None
+            ),
+        )
+        self._parent_path = self._client.common_location_path(project_id, location)
+        # The client has no `model_path()` helper, so build the model path by hand.
+        self._model_path = (
+            f"{self._parent_path}/models/{model_id}" if model_id else None
+        )
+        self._glossary_path = (
+            self._client.glossary_path(project_id, location, glossary_id)
+            if glossary_id
+            else None
+        )
+
+    def transform_documents(
+        self, documents: Sequence[Document], **kwargs: Any
+    ) -> Sequence[Document]:
+        """Translate text documents using Google Translate.
+
+        Arguments:
+            source_language_code: (Optional) ISO 639 code of the input document.
+            target_language_code: ISO 639 language code of the output document.
+                For supported languages, refer to:
+                https://cloud.google.com/translate/docs/languages
+            mime_type: (Optional) Media Type of input text.
+                Options: `text/plain`, `text/html`
+        """
+        try:
+            from google.cloud import translate
+        except ImportError as exc:
+            raise ImportError(
+                "Install Google Cloud Translate to use this transformer. "
+                "(pip install google-cloud-translate)"
+            ) from exc
+
+        response = self._client.translate_text(
+            request=translate.TranslateTextRequest(
+                contents=[doc.page_content for doc in documents],
+                parent=self._parent_path,
+                model=self._model_path,
+                glossary_config=translate.TranslateTextGlossaryConfig(
+                    glossary=self._glossary_path
+                ),
+                source_language_code=kwargs.get("source_language_code", None),
+                target_language_code=kwargs.get("target_language_code"),
+                mime_type=kwargs.get("mime_type", "text/plain"),
+            )
+        )
+
+        # If using a glossary, the translations will be in `glossary_translations`.
+        translations = response.glossary_translations or response.translations
+
+        return [
+            Document(
+                page_content=translation.translated_text,
+                metadata={
+                    **doc.metadata,
+                    "model": translation.model,
+                    "detected_language_code": translation.detected_language_code,
+                },
+            )
+            for doc, translation in zip(documents, translations)
+        ]
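
Review note on `google_translate.py`: `transform_documents` pairs each input document with its translation via `zip` and layers `model` and `detected_language_code` on top of the existing metadata, while `__init__` builds the model resource path with an f-string. The sketch below reproduces only that logic, with no Google Cloud or LangChain dependency, so it can be run locally. `Doc`, `FakeTranslation`, `location_path`, `model_path`, `merge_translations`, and `require_target_language` are hypothetical names used for illustration and are not part of this PR. The `require_target_language` check is a suggestion only: the diff passes `kwargs.get("target_language_code")` straight into the request, so a missing value reaches the API as `None`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Sequence


@dataclass
class Doc:
    """Hypothetical stand-in for langchain.schema.Document."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class FakeTranslation:
    """Hypothetical stand-in for one translation entry in the API response."""

    translated_text: str
    model: str = ""
    detected_language_code: str = ""


def location_path(project_id: str, location: str = "global") -> str:
    # Same shape as the parent path the transformer stores in `_parent_path`.
    return f"projects/{project_id}/locations/{location}"


def model_path(
    project_id: str, location: str, model_id: Optional[str]
) -> Optional[str]:
    # Mirrors the manual f-string in __init__: None when no model is requested.
    if not model_id:
        return None
    return f"{location_path(project_id, location)}/models/{model_id}"


def require_target_language(kwargs: Dict[str, Any]) -> str:
    # Assumption: failing early is preferable to sending None to the API.
    code = kwargs.get("target_language_code")
    if not code:
        raise ValueError("target_language_code is required")
    return code


def merge_translations(
    documents: Sequence[Doc], translations: Sequence[FakeTranslation]
) -> List[Doc]:
    # New documents are created; the input metadata dicts are copied, not mutated.
    return [
        Doc(
            page_content=translation.translated_text,
            metadata={
                **doc.metadata,
                "model": translation.model,
                "detected_language_code": translation.detected_language_code,
            },
        )
        for doc, translation in zip(documents, translations)
    ]


if __name__ == "__main__":
    docs = [Doc("Hello", {"source": "a.txt"})]
    fake = [FakeTranslation("Hola", detected_language_code="en")]
    for result in merge_translations(docs, fake):
        print(result.metadata, result.page_content)
```

Because `zip` stops at the shorter sequence, a response with fewer translations than inputs would silently drop documents; a length check before the comprehension may be worth considering.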