langchain/cookbook/rag_with_quantized_embeddings.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "6195da33-34c3-4ca2-943a-050b6dcbacbc",
   "metadata": {},
   "source": [
    "# Embedding Documents using Optimized and Quantized Embedders\n",
    "\n",
    "In this tutorial, we will demo how to build a RAG pipeline, with the embedding for all documents done using Quantized Embedders.\n",
    "\n",
    "We will use a pipeline that will:\n",
    "\n",
    "* Create a document collection.\n",
    "* Embed all documents using Quantized Embedders.\n",
    "* Fetch relevant documents for our question.\n",
    "* Run an LLM answer the question.\n",
    "\n",
    "For more information about optimized models, we refer to [optimum-intel](https://github.com/huggingface/optimum-intel.git) and [IPEX](https://github.com/intel/intel-extension-for-pytorch).\n",
    "\n",
    "This tutorial is based on the [Langchain RAG tutorial here](https://towardsai.net/p/machine-learning/dense-x-retrieval-technique-in-langchain-and-llamaindex)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "26db2da5-3733-4a90-909e-6c11508ea140",
   "metadata": {},
   "outputs": [],
   "source": [
    "import uuid\n",
    "from pathlib import Path\n",
    "\n",
    "import langchain\n",
    "import torch\n",
    "from bs4 import BeautifulSoup as Soup\n",
    "from langchain.retrievers.multi_vector import MultiVectorRetriever\n",
    "from langchain.storage import InMemoryByteStore, LocalFileStore\n",
    "from langchain_community.document_loaders.recursive_url_loader import (\n",
    "    RecursiveUrlLoader,\n",
    ")\n",
    "from langchain_community.vectorstores import Chroma\n",
    "\n",
    "# For our example, we'll load docs from the web\n",
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
    "DOCSTORE_DIR = \".\"\n",
    "DOCSTORE_ID_KEY = \"doc_id\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5ccda4e-7af5-4355-b9c4-25547edf33f9",
   "metadata": {},
   "source": [
    "Lets first load up this paper, and split into text chunks of size 1000."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "5f4d8888-53a6-49f5-a198-da5c92419ca4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded 1 documents\n",
      "Split into 73 documents\n"
     ]
    }
   ],
   "source": [
    "# Could add more parsing here, as it's very raw.\n",
    "loader = RecursiveUrlLoader(\n",
    "    \"https://ar5iv.labs.arxiv.org/html/1706.03762\",\n",
    "    max_depth=2,\n",
    "    extractor=lambda x: Soup(x, \"html.parser\").text,\n",
    ")\n",
    "data = loader.load()\n",
    "print(f\"Loaded {len(data)} documents\")\n",
    "\n",
    "# Split\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
    "all_splits = text_splitter.split_documents(data)\n",
    "print(f\"Split into {len(all_splits)} documents\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73e90632-2ac2-49eb-80da-ffe9ac4a278d",
   "metadata": {},
   "source": [
    "In order to embed our documents, we can use the ```QuantizedBiEncoderEmbeddings```, for efficient and fast embedding. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "9a68a6f6-332d-481e-bbea-ad763155ea36",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "89af89b48c55409b9999b8e0387fab5b",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "config.json:   0%|          | 0.00/747 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "01ad1b6278194b53bf6a5a286a311864",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "pytorch_model.bin:   0%|          | 0.00/45.9M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "cb3bd1b88f7743c3b0322da3f021325c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "inc_config.json:   0%|          | 0.00/287 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "loading configuration file inc_config.json from cache at \n",
      "INCConfig {\n",
      "  \"distillation\": {},\n",
      "  \"neural_compressor_version\": \"2.4.1\",\n",
      "  \"optimum_version\": \"1.16.2\",\n",
      "  \"pruning\": {},\n",
      "  \"quantization\": {\n",
      "    \"dataset_num_samples\": 50,\n",
      "    \"is_static\": true\n",
      "  },\n",
      "  \"save_onnx_model\": false,\n",
      "  \"torch_version\": \"2.2.0\",\n",
      "  \"transformers_version\": \"4.37.2\"\n",
      "}\n",
      "\n",
      "Using `INCModel` to load a TorchScript model will be deprecated in v1.15.0, to load your model please use `IPEXModel` instead.\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "7439315ebcb746f5be11fe30bc7693f6",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer_config.json:   0%|          | 0.00/1.24k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "05265a3912254ce1ad43cc8086bcb0ca",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "a48f4245c60744f28f37cd3a7a24d198",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "584a63cace934033b4ab22d3a178582a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from langchain_community.embeddings import QuantizedBiEncoderEmbeddings\n",
    "from langchain_core.embeddings import Embeddings\n",
    "\n",
    "model_name = \"Intel/bge-small-en-v1.5-rag-int8-static\"\n",
    "encode_kwargs = {\"normalize_embeddings\": True}  # set True to compute cosine similarity\n",
    "\n",
    "model_inc = QuantizedBiEncoderEmbeddings(\n",
    "    model_name=model_name,\n",
    "    encode_kwargs=encode_kwargs,\n",
    "    query_instruction=\"Represent this sentence for searching relevant passages: \",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "360b2837-8024-47e0-a4ba-592505a9a5c8",
   "metadata": {},
   "source": [
    "With our embedder in place, lets define our retriever:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "18bc0a73-1a13-4b2f-96ac-05a5313343b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_multi_vector_retriever(\n",
    "    docstore_id_key: str, collection_name: str, embedding_function: Embeddings\n",
    "):\n",
    "    \"\"\"Create the composed retriever object.\"\"\"\n",
    "    vectorstore = Chroma(\n",
    "        collection_name=collection_name,\n",
    "        embedding_function=embedding_function,\n",
    "    )\n",
    "    store = InMemoryByteStore()\n",
    "\n",
    "    return MultiVectorRetriever(\n",
    "        vectorstore=vectorstore,\n",
    "        byte_store=store,\n",
    "        id_key=docstore_id_key,\n",
    "    )\n",
    "\n",
    "\n",
    "retriever = get_multi_vector_retriever(DOCSTORE_ID_KEY, \"multi_vec_store\", model_inc)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8484078e-1bf0-4080-a354-ef23823fd6dc",
   "metadata": {},
   "source": [
    "Next, we divide each chunk into sub-docs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "e12f48d4-6562-416b-8f28-342912e5756e",
   "metadata": {},
   "outputs": [],
   "source": [
    "child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)\n",
    "id_key = \"doc_id\"\n",
    "doc_ids = [str(uuid.uuid4()) for _ in all_splits]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "a268ef5f-91c2-4d8e-87f0-53db376e6a29",
   "metadata": {},
   "outputs": [],
   "source": [
    "sub_docs = []\n",
    "for i, doc in enumerate(all_splits):\n",
    "    _id = doc_ids[i]\n",
    "    _sub_docs = child_text_splitter.split_documents([doc])\n",
    "    for _doc in _sub_docs:\n",
    "        _doc.metadata[id_key] = _id\n",
    "    sub_docs.extend(_sub_docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d84ea8f4-a5de-4d76-b44d-85e56583f489",
   "metadata": {},
   "source": [
    "Lets write our documents into our new store. This will use our embedder on each document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "1af831ce-0eae-44bc-aca7-4d691063640b",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Batches: 100%|██████████| 8/8 [00:00<00:00,  9.05it/s]\n"
     ]
    }
   ],
   "source": [
    "retriever.vectorstore.add_documents(sub_docs)\n",
    "retriever.docstore.mset(list(zip(doc_ids, all_splits)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "580bc212-8ecd-4d28-8656-b96fcd0d7eb6",
   "metadata": {},
   "source": [
    "Great! Our retriever is good to go. Lets load up an LLM, that will reason over the retrieved documents:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "008c992f",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": []
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "cbe70583ad964ae19582b72dab396784",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import torch\n",
    "from langchain.llms.huggingface_pipeline import HuggingFacePipeline\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline\n",
    "\n",
    "model_id = \"Intel/neural-chat-7b-v3-3\"\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
    "model = AutoModelForCausalLM.from_pretrained(\n",
    "    model_id, device_map=\"auto\", torch_dtype=torch.bfloat16\n",
    ")\n",
    "\n",
    "pipe = pipeline(\"text-generation\", model=model, tokenizer=tokenizer, max_new_tokens=100)\n",
    "\n",
    "hf = HuggingFacePipeline(pipeline=pipe)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6dd21fb2-0442-477d-aae2-9e7ee1d1d778",
   "metadata": {},
   "source": [
    "Next, we will load up a prompt for answering questions using retrieved documents:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "5e582509-caaf-4920-932c-4ce16162c789",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain import hub\n",
    "\n",
    "prompt = hub.pull(\"rlm/rag-prompt\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5cdfcba5-7ec7-4d0a-820e-4e200643a882",
   "metadata": {},
   "source": [
    "We can now build our pipeline:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "b74d8dfb-72bb-46da-9df9-0dc47a3ac791",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.schema.runnable import RunnablePassthrough\n",
    "\n",
    "rag_chain = {\"context\": retriever, \"question\": RunnablePassthrough()} | prompt | hf"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3bc53602-86d6-420f-91b1-fc2effa7e986",
   "metadata": {},
   "source": [
    "Excellent! lets ask it a question.\n",
    "We will also use a verbose and debug, to check which documents were used by the model to produce the answer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "f0a92c07-53da-4e1f-b880-ee83a36ee17d",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[1:chain:RunnableSequence] Entering Chain run with input:\n",
      "\u001b[0m{\n",
      "  \"input\": \"What is the first transduction model relying entirely on self-attention?\"\n",
      "}\n",
      "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[1:chain:RunnableSequence > 2:chain:RunnableParallel<context,question>] Entering Chain run with input:\n",
      "\u001b[0m{\n",
      "  \"input\": \"What is the first transduction model relying entirely on self-attention?\"\n",
      "}\n",
      "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[1:chain:RunnableSequence > 2:chain:RunnableParallel<context,question> > 4:chain:RunnablePassthrough] Entering Chain run with input:\n",
      "\u001b[0m{\n",
      "  \"input\": \"What is the first transduction model relying entirely on self-attention?\"\n",
      "}\n",
      "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[1:chain:RunnableSequence > 2:chain:RunnableParallel<context,question> > 4:chain:RunnablePassthrough] [1ms] Exiting Chain run with output:\n",
      "\u001b[0m{\n",
      "  \"output\": \"What is the first transduction model relying entirely on self-attention?\"\n",
      "}\n",
      "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[1:chain:RunnableSequence > 2:chain:RunnableParallel<context,question>] [66ms] Exiting Chain run with output:\n",
      "\u001b[0m[outputs]\n",
      "\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[1:chain:RunnableSequence > 5:prompt:ChatPromptTemplate] Entering Prompt run with input:\n",
      "\u001b[0m[inputs]\n",
      "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[1:chain:RunnableSequence > 5:prompt:ChatPromptTemplate] [1ms] Exiting Prompt run with output:\n",
      "\u001b[0m{\n",
      "  \"lc\": 1,\n",
      "  \"type\": \"constructor\",\n",
      "  \"id\": [\n",
      "    \"langchain\",\n",
      "    \"prompts\",\n",
      "    \"chat\",\n",
      "    \"ChatPromptValue\"\n",
      "  ],\n",
      "  \"kwargs\": {\n",
      "    \"messages\": [\n",
      "      {\n",
      "        \"lc\": 1,\n",
      "        \"type\": \"constructor\",\n",
      "        \"id\": [\n",
      "          \"langchain\",\n",
      "          \"schema\",\n",
      "          \"messages\",\n",
      "          \"HumanMessage\"\n",
      "        ],\n",
      "        \"kwargs\": {\n",
      "          \"content\": \"You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\\nQuestion: What is the first transduction model relying entirely on self-attention? \\nContext: [Document(page_content='To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.\\\\nIn the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as (neural_gpu, ; NalBytenet2017, ) and (JonasFaceNet2017, ).\\\\n\\\\n\\\\n\\\\n\\\\n3 Model Architecture\\\\n\\\\nFigure 1: The Transformer - model architecture.', metadata={'source': 'https://ar5iv.labs.arxiv.org/html/1706.03762', 'title': '[1706.03762] Attention Is All You Need', 'language': 'en'}), Document(page_content='In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.\\\\n\\\\n\\\\nFor translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles. \\\\n\\\\n\\\\nWe are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video.\\\\nMaking generation less sequential is another research goals of ours.', metadata={'source': 'https://ar5iv.labs.arxiv.org/html/1706.03762', 'title': '[1706.03762] Attention Is All You Need', 'language': 'en'}), Document(page_content='Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences (bahdanau2014neural, ; structuredAttentionNetworks, ). In all but a few cases (decomposableAttnModel, ), however, such attention mechanisms are used in conjunction with a recurrent network.\\\\n\\\\n\\\\nIn this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n2 Background', metadata={'source': 'https://ar5iv.labs.arxiv.org/html/1706.03762', 'title': '[1706.03762] Attention Is All You Need', 'language': 'en'}), Document(page_content='The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-
      "          \"additional_kwargs\": {}\n",
      "        }\n",
      "      }\n",
      "    ]\n",
      "  }\n",
      "}\n",
      "\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[1:chain:RunnableSequence > 6:llm:HuggingFacePipeline] Entering LLM run with input:\n",
      "\u001b[0m{\n",
      "  \"prompts\": [\n",
      "    \"Human: You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\\nQuestion: What is the first transduction model relying entirely on self-attention? \\nContext: [Document(page_content='To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.\\\\nIn the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as (neural_gpu, ; NalBytenet2017, ) and (JonasFaceNet2017, ).\\\\n\\\\n\\\\n\\\\n\\\\n3 Model Architecture\\\\n\\\\nFigure 1: The Transformer - model architecture.', metadata={'source': 'https://ar5iv.labs.arxiv.org/html/1706.03762', 'title': '[1706.03762] Attention Is All You Need', 'language': 'en'}), Document(page_content='In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.\\\\n\\\\n\\\\nFor translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles. \\\\n\\\\n\\\\nWe are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video.\\\\nMaking generation less sequential is another research goals of ours.', metadata={'source': 'https://ar5iv.labs.arxiv.org/html/1706.03762', 'title': '[1706.03762] Attention Is All You Need', 'language': 'en'}), Document(page_content='Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences (bahdanau2014neural, ; structuredAttentionNetworks, ). In all but a few cases (decomposableAttnModel, ), however, such attention mechanisms are used in conjunction with a recurrent network.\\\\n\\\\n\\\\nIn this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n2 Background', metadata={'source': 'https://ar5iv.labs.arxiv.org/html/1706.03762', 'title': '[1706.03762] Attention Is All You Need', 'language': 'en'}), Document(page_content='The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU
      "  ]\n",
      "}\n",
      "\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[1:chain:RunnableSequence > 6:llm:HuggingFacePipeline] [4.34s] Exiting LLM run with output:\n",
      "\u001b[0m{\n",
      "  \"generations\": [\n",
      "    [\n",
      "      {\n",
      "        \"text\": \" The first transduction model relying entirely on self-attention is the Transformer.\",\n",
      "        \"generation_info\": null,\n",
      "        \"type\": \"Generation\"\n",
      "      }\n",
      "    ]\n",
      "  ],\n",
      "  \"llm_output\": null,\n",
      "  \"run\": null\n",
      "}\n",
      "\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[1:chain:RunnableSequence] [4.41s] Exiting Chain run with output:\n",
      "\u001b[0m{\n",
      "  \"output\": \" The first transduction model relying entirely on self-attention is the Transformer.\"\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "langchain.verbose = True\n",
    "langchain.debug = True\n",
    "\n",
    "llm_res = rag_chain.invoke(\n",
    "    \"What is the first transduction model relying entirely on self-attention?\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "023404a1-401a-46e1-8ab5-cafbc8593b04",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "' The first transduction model relying entirely on self-attention is the Transformer.'"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "llm_res"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0eaefd01-254a-445d-a95f-37889c126e0e",
   "metadata": {},
   "source": [
    "Based on the retrieved documents, the answer is indeed correct :)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}