mirror of
https://github.com/hwchase17/langchain
synced 2024-11-10 01:10:59 +00:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Nuclia Understanding\n",
"\n",
">[Nuclia](https://nuclia.com) automatically indexes your unstructured data from any internal and external source, providing optimized search results and generative answers. It can handle video and audio transcription, image content extraction, and document parsing.\n",
"\n",
"The `Nuclia Understanding API` supports the processing of unstructured data, including text, web pages, documents, and audio/video content. It extracts all text, wherever it is (using speech-to-text or OCR when needed), identifies entities, extracts metadata, embedded files (such as images in a PDF), and web links, and provides a summary of the content.\n",
"\n",
"To use the `Nuclia Understanding API`, you need to have a `Nuclia` account. You can create one for free at [https://nuclia.cloud](https://nuclia.cloud), and then [create a NUA key](https://docs.nuclia.dev/docs/docs/using/understanding/intro)."
]
},
|
|
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!pip install --upgrade protobuf\n",
"#!pip install nucliadb-protos"
]
},
|
|
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"NUCLIA_ZONE\"] = \"<YOUR_ZONE>\"  # e.g. europe-1\n",
"os.environ[\"NUCLIA_NUA_KEY\"] = \"<YOUR_API_KEY>\""
]
},
|
|
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.tools.nuclia import NucliaUnderstandingAPI\n",
"\n",
"nua = NucliaUnderstandingAPI(enable_ml=False)"
]
},
|
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can push files to the Nuclia Understanding API using the `push` action. As the processing is done asynchronously, the results might be returned in a different order than the files were pushed. That is why you need to provide an `id` to match each result with its corresponding file."
]
},
|
|
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nua.run({\"action\": \"push\", \"id\": \"1\", \"path\": \"./report.docx\"})\n",
"nua.run({\"action\": \"push\", \"id\": \"2\", \"path\": \"./interview.mp4\"})"
]
},
|
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can now call the `pull` action in a loop until you get the JSON-formatted result."
]
},
|
|
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"pending = True\n",
"data = None\n",
"while pending:\n",
"    time.sleep(15)\n",
"    data = nua.run({\"action\": \"pull\", \"id\": \"1\", \"path\": None})\n",
"    if data:\n",
"        print(data)\n",
"        pending = False\n",
"    else:\n",
"        print(\"waiting...\")"
]
},
|
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also do it in one step in `async` mode: you only need to push, and the call will wait until the results are pulled:"
]
},
|
|
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"\n",
"\n",
"async def process():\n",
"    data = await nua.arun(\n",
"        {\"action\": \"push\", \"id\": \"1\", \"path\": \"./talk.mp4\", \"text\": None}\n",
"    )\n",
"    print(data)\n",
"\n",
"\n",
"asyncio.run(process())"
]
},
|
|
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrieved information\n",
"\n",
"Nuclia returns the following information:\n",
"\n",
"- file metadata\n",
"- extracted text\n",
"- nested text (like text in an embedded image)\n",
"- a summary (only when `enable_ml` is set to `True`)\n",
"- paragraph and sentence splitting (defined by the position of their first and last characters, plus start time and end time for a video or audio file)\n",
"- named entities: people, dates, places, organizations, etc. (only when `enable_ml` is set to `True`)\n",
"- links\n",
"- a thumbnail\n",
"- embedded files\n",
"- the vector representations of the text (only when `enable_ml` is set to `True`)\n",
"\n",
"Note:\n",
"\n",
"  Generated files (thumbnail, extracted embedded files, etc.) are provided as a token. You can download them with the [`/processing/download` endpoint](https://docs.nuclia.dev/docs/api#operation/Download_binary_file_processing_download_get).\n",
"\n",
"  Also, at any level, if an attribute exceeds a certain size, it will be put in a downloadable file and replaced in the document by a file pointer, which consists of `{\"file\": {\"uri\": \"JWT_TOKEN\"}}`. The rule is that if the size of the message is greater than 1000000 characters, the biggest parts will be moved to downloadable files. First, the compression process will target vectors. If that is not enough, it will target large field metadata, and finally it will target extracted text.\n"
]
}
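,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of what the note above implies (the `find_file_pointers` helper below is hypothetical, not part of the Nuclia tooling), you can walk a pulled result to collect any `{\"file\": {\"uri\": ...}}` pointers; each collected token could then be passed to the `/processing/download` endpoint:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def find_file_pointers(node, path=\"\"):\n",
"    \"\"\"Recursively collect (path, token) pairs for every file pointer in a result.\"\"\"\n",
"    pointers = []\n",
"    if isinstance(node, dict):\n",
"        # A file pointer is a dict of the exact shape {\"file\": {\"uri\": ...}}\n",
"        if set(node) == {\"file\"} and isinstance(node[\"file\"], dict) and \"uri\" in node[\"file\"]:\n",
"            pointers.append((path, node[\"file\"][\"uri\"]))\n",
"        else:\n",
"            for key, value in node.items():\n",
"                pointers.extend(find_file_pointers(value, f\"{path}.{key}\" if path else key))\n",
"    elif isinstance(node, list):\n",
"        for i, value in enumerate(node):\n",
"            pointers.extend(find_file_pointers(value, f\"{path}[{i}]\"))\n",
"    return pointers\n",
"\n",
"\n",
"# Example with a made-up result fragment:\n",
"sample = {\"extracted_text\": {\"file\": {\"uri\": \"JWT_TOKEN\"}}, \"summary\": \"short\"}\n",
"print(find_file_pointers(sample))  # [('extracted_text', 'JWT_TOKEN')]"
]
}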
|
|
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}