community[minor]: Add QuantizedEmbedders (#17391)

* adding Quantized embedders using optimum-intel and
* added mdx documentation and example notebooks 
* added embedding import testing.

optimum = {extras = ["neural-compressor"], version = "^1.14.0", optional
= true}
intel_extension_for_pytorch = {version = "^2.2.0", optional = true}

Dependencies have been added to pyproject.toml for the community lib.  

**Twitter handle:** @peter_izsak


Co-authored-by: Bagatur <>
Moshe Berchansky 3 months ago committed by GitHub
parent bccc9241ea
commit 20a56fe0a2

@ -0,0 +1,591 @@
"cells": [
"cell_type": "markdown",
"id": "6195da33-34c3-4ca2-943a-050b6dcbacbc",
"metadata": {},
"source": [
"# Embedding Documents using Optimized and Quantized Embedders\n",
"In this tutorial, we will demo how to build a RAG pipeline, with the embedding for all documents done using Quantized Embedders.\n",
"We will use a pipeline that will:\n",
"* Create a document collection.\n",
"* Embed all documents using Quantized Embedders.\n",
"* Fetch relevant documents for our question.\n",
"* Run an LLM answer the question.\n",
"For more information about optimized models, we refer to [optimum-intel]( and [IPEX](\n",
"This tutorial is based on the [Langchain RAG tutorial here]("
"cell_type": "code",
"execution_count": 17,
"id": "26db2da5-3733-4a90-909e-6c11508ea140",
"metadata": {},
"outputs": [],
"source": [
"import uuid\n",
"from pathlib import Path\n",
"import langchain\n",
"import torch\n",
"from bs4 import BeautifulSoup as Soup\n",
"from langchain.retrievers.multi_vector import MultiVectorRetriever\n",
"from import InMemoryByteStore, LocalFileStore\n",
"# For our example, we'll load docs from the web\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter # noqa\n",
"from langchain_community.document_loaders.recursive_url_loader import (\n",
" RecursiveUrlLoader,\n",
"# noqa\n",
"from langchain_community.vectorstores import Chroma\n",
"DOCSTORE_DIR = \".\"\n",
"DOCSTORE_ID_KEY = \"doc_id\""
"cell_type": "markdown",
"id": "f5ccda4e-7af5-4355-b9c4-25547edf33f9",
"metadata": {},
"source": [
"Lets first load up this paper, and split into text chunks of size 1000."
"cell_type": "code",
"execution_count": 2,
"id": "5f4d8888-53a6-49f5-a198-da5c92419ca4",
"metadata": {},
"outputs": [
"name": "stdout",
"output_type": "stream",
"text": [
"Loaded 1 documents\n",
"Split into 73 documents\n"
"source": [
"# Could add more parsing here, as it's very raw.\n",
"loader = RecursiveUrlLoader(\n",
" \"\",\n",
" max_depth=2,\n",
" extractor=lambda x: Soup(x, \"html.parser\").text,\n",
"data = loader.load()\n",
"print(f\"Loaded {len(data)} documents\")\n",
"# Split\n",
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"all_splits = text_splitter.split_documents(data)\n",
"print(f\"Split into {len(all_splits)} documents\")"
"cell_type": "markdown",
"id": "73e90632-2ac2-49eb-80da-ffe9ac4a278d",
"metadata": {},
"source": [
"In order to embed our documents, we can use the ```QuantizedBiEncoderEmbeddings```, for efficient and fast embedding. "
"cell_type": "code",
"execution_count": 9,
"id": "9a68a6f6-332d-481e-bbea-ad763155ea36",
"metadata": {},
"outputs": [
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "89af89b48c55409b9999b8e0387fab5b",
"version_major": 2,
"version_minor": 0
"text/plain": [
"config.json: 0%| | 0.00/747 [00:00<?, ?B/s]"
"metadata": {},
"output_type": "display_data"
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "01ad1b6278194b53bf6a5a286a311864",
"version_major": 2,
"version_minor": 0
"text/plain": [
"pytorch_model.bin: 0%| | 0.00/45.9M [00:00<?, ?B/s]"
"metadata": {},
"output_type": "display_data"
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "cb3bd1b88f7743c3b0322da3f021325c",
"version_major": 2,
"version_minor": 0
"text/plain": [
"inc_config.json: 0%| | 0.00/287 [00:00<?, ?B/s]"
"metadata": {},
"output_type": "display_data"
"name": "stderr",
"output_type": "stream",
"text": [
"loading configuration file inc_config.json from cache at \n",
"INCConfig {\n",
" \"distillation\": {},\n",
" \"neural_compressor_version\": \"2.4.1\",\n",
" \"optimum_version\": \"1.16.2\",\n",
" \"pruning\": {},\n",
" \"quantization\": {\n",
" \"dataset_num_samples\": 50,\n",
" \"is_static\": true\n",
" },\n",
" \"save_onnx_model\": false,\n",
" \"torch_version\": \"2.2.0\",\n",
" \"transformers_version\": \"4.37.2\"\n",
"Using `INCModel` to load a TorchScript model will be deprecated in v1.15.0, to load your model please use `IPEXModel` instead.\n"
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "7439315ebcb746f5be11fe30bc7693f6",
"version_major": 2,
"version_minor": 0
"text/plain": [
"tokenizer_config.json: 0%| | 0.00/1.24k [00:00<?, ?B/s]"
"metadata": {},
"output_type": "display_data"
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "05265a3912254ce1ad43cc8086bcb0ca",
"version_major": 2,
"version_minor": 0
"text/plain": [
"vocab.txt: 0%| | 0.00/232k [00:00<?, ?B/s]"
"metadata": {},
"output_type": "display_data"
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a48f4245c60744f28f37cd3a7a24d198",
"version_major": 2,
"version_minor": 0
"text/plain": [
"tokenizer.json: 0%| | 0.00/711k [00:00<?, ?B/s]"
"metadata": {},
"output_type": "display_data"
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "584a63cace934033b4ab22d3a178582a",
"version_major": 2,
"version_minor": 0
"text/plain": [
"special_tokens_map.json: 0%| | 0.00/125 [00:00<?, ?B/s]"
"metadata": {},
"output_type": "display_data"
"source": [
"from langchain_community.embeddings import QuantizedBiEncoderEmbeddings\n",
"from langchain_core.embeddings import Embeddings\n",
"model_name = \"Intel/bge-small-en-v1.5-rag-int8-static\"\n",
"encode_kwargs = {\"normalize_embeddings\": True} # set True to compute cosine similarity\n",
"model_inc = QuantizedBiEncoderEmbeddings(\n",
" model_name=model_name,\n",
" encode_kwargs=encode_kwargs,\n",
" query_instruction=\"Represent this sentence for searching relevant passages: \",\n",
"cell_type": "markdown",
"id": "360b2837-8024-47e0-a4ba-592505a9a5c8",
"metadata": {},
"source": [
"With our embedder in place, lets define our retriever:"
"cell_type": "code",
"execution_count": 16,
"id": "18bc0a73-1a13-4b2f-96ac-05a5313343b7",
"metadata": {},
"outputs": [],
"source": [
"def get_multi_vector_retriever(\n",
" docstore_id_key: str, collection_name: str, embedding_function: Embeddings\n",
" \"\"\"Create the composed retriever object.\"\"\"\n",
" vectorstore = Chroma(\n",
" collection_name=collection_name,\n",
" embedding_function=embedding_function,\n",
" )\n",
" store = InMemoryByteStore()\n",
" return MultiVectorRetriever(\n",
" vectorstore=vectorstore,\n",
" byte_store=store,\n",
" id_key=docstore_id_key,\n",
" )\n",
"retriever = get_multi_vector_retriever(DOCSTORE_ID_KEY, \"multi_vec_store\", model_inc)"
"cell_type": "markdown",
"id": "8484078e-1bf0-4080-a354-ef23823fd6dc",
"metadata": {},
"source": [
"Next, we divide each chunk into sub-docs:"
"cell_type": "code",
"execution_count": 18,
"id": "e12f48d4-6562-416b-8f28-342912e5756e",
"metadata": {},
"outputs": [],
"source": [
"child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)\n",
"id_key = \"doc_id\"\n",
"doc_ids = [str(uuid.uuid4()) for _ in all_splits]"
"cell_type": "code",
"execution_count": 19,
"id": "a268ef5f-91c2-4d8e-87f0-53db376e6a29",
"metadata": {},
"outputs": [],
"source": [
"sub_docs = []\n",
"for i, doc in enumerate(all_splits):\n",
" _id = doc_ids[i]\n",
" _sub_docs = child_text_splitter.split_documents([doc])\n",
" for _doc in _sub_docs:\n",
" _doc.metadata[id_key] = _id\n",
" sub_docs.extend(_sub_docs)"
"cell_type": "markdown",
"id": "d84ea8f4-a5de-4d76-b44d-85e56583f489",
"metadata": {},
"source": [
"Lets write our documents into our new store. This will use our embedder on each document."
"cell_type": "code",
"execution_count": 20,
"id": "1af831ce-0eae-44bc-aca7-4d691063640b",
"metadata": {},
"outputs": [
"name": "stderr",
"output_type": "stream",
"text": [
"Batches: 100%|██████████| 8/8 [00:00<00:00, 9.05it/s]\n"
"source": [
"retriever.docstore.mset(list(zip(doc_ids, all_splits)))"
"cell_type": "markdown",
"id": "580bc212-8ecd-4d28-8656-b96fcd0d7eb6",
"metadata": {},
"source": [
"Great! Our retriever is good to go. Lets load up an LLM, that will reason over the retrieved documents:"
"cell_type": "code",
"execution_count": 21,
"id": "008c992f",
"metadata": {},
"outputs": [
"name": "stderr",
"output_type": "stream",
"text": []
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "cbe70583ad964ae19582b72dab396784",
"version_major": 2,
"version_minor": 0
"text/plain": [
"Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]"
"metadata": {},
"output_type": "display_data"
"source": [
"import torch\n",
"from langchain.llms.huggingface_pipeline import HuggingFacePipeline\n",
"from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline\n",
"model_id = \"Intel/neural-chat-7b-v3-3\"\n",
"tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
"model = AutoModelForCausalLM.from_pretrained(\n",
" model_id, device_map=\"auto\", torch_dtype=torch.bfloat16\n",
"pipe = pipeline(\"text-generation\", model=model, tokenizer=tokenizer, max_new_tokens=100)\n",
"hf = HuggingFacePipeline(pipeline=pipe)"
"cell_type": "markdown",
"id": "6dd21fb2-0442-477d-aae2-9e7ee1d1d778",
"metadata": {},
"source": [
"Next, we will load up a prompt for answering questions using retrieved documents:"
"cell_type": "code",
"execution_count": 22,
"id": "5e582509-caaf-4920-932c-4ce16162c789",
"metadata": {},
"outputs": [],
"source": [
"from langchain import hub\n",
"prompt = hub.pull(\"rlm/rag-prompt\")"
"cell_type": "markdown",
"id": "5cdfcba5-7ec7-4d0a-820e-4e200643a882",
"metadata": {},
"source": [
"We can now build our pipeline:"
"cell_type": "code",
"execution_count": 23,
"id": "b74d8dfb-72bb-46da-9df9-0dc47a3ac791",
"metadata": {},
"outputs": [],
"source": [
"from langchain.schema.runnable import RunnablePassthrough\n",
"rag_chain = {\"context\": retriever, \"question\": RunnablePassthrough()} | prompt | hf"
"cell_type": "markdown",
"id": "3bc53602-86d6-420f-91b1-fc2effa7e986",
"metadata": {},
"source": [
"Excellent! lets ask it a question.\n",
"We will also use a verbose and debug, to check which documents were used by the model to produce the answer."
"cell_type": "code",
"execution_count": 31,
"id": "f0a92c07-53da-4e1f-b880-ee83a36ee17d",
"metadata": {},
"outputs": [
"name": "stderr",
"output_type": "stream",
"text": [
"Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.\n"
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[1:chain:RunnableSequence] Entering Chain run with input:\n",
" \"input\": \"What is the first transduction model relying entirely on self-attention?\"\n",
"\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[1:chain:RunnableSequence > 2:chain:RunnableParallel<context,question>] Entering Chain run with input:\n",
" \"input\": \"What is the first transduction model relying entirely on self-attention?\"\n",
"\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[1:chain:RunnableSequence > 2:chain:RunnableParallel<context,question> > 4:chain:RunnablePassthrough] Entering Chain run with input:\n",
" \"input\": \"What is the first transduction model relying entirely on self-attention?\"\n",
"\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[1:chain:RunnableSequence > 2:chain:RunnableParallel<context,question> > 4:chain:RunnablePassthrough] [1ms] Exiting Chain run with output:\n",
" \"output\": \"What is the first transduction model relying entirely on self-attention?\"\n",
"\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[1:chain:RunnableSequence > 2:chain:RunnableParallel<context,question>] [66ms] Exiting Chain run with output:\n",
"\u001b[32;1m\u001b[1;3m[chain/start]\u001b[0m \u001b[1m[1:chain:RunnableSequence > 5:prompt:ChatPromptTemplate] Entering Prompt run with input:\n",
"\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[1:chain:RunnableSequence > 5:prompt:ChatPromptTemplate] [1ms] Exiting Prompt run with output:\n",
" \"lc\": 1,\n",
" \"type\": \"constructor\",\n",
" \"id\": [\n",
" \"langchain\",\n",
" \"prompts\",\n",
" \"chat\",\n",
" \"ChatPromptValue\"\n",
" ],\n",
" \"kwargs\": {\n",
" \"messages\": [\n",
" {\n",
" \"lc\": 1,\n",
" \"type\": \"constructor\",\n",
" \"id\": [\n",
" \"langchain\",\n",
" \"schema\",\n",
" \"messages\",\n",
" \"HumanMessage\"\n",
" ],\n",
" \"kwargs\": {\n",
" \"content\": \"You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\\nQuestion: What is the first transduction model relying entirely on self-attention? \\nContext: [Document(page_content='To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.\\\\nIn the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as (neural_gpu, ; NalBytenet2017, ) and (JonasFaceNet2017, ).\\\\n\\\\n\\\\n\\\\n\\\\n3 Model Architecture\\\\n\\\\nFigure 1: The Transformer - model architecture.', metadata={'source': '', 'title': '[1706.03762] Attention Is All You Need', 'language': 'en'}), Document(page_content='In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.\\\\n\\\\n\\\\nFor translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles. \\\\n\\\\n\\\\nWe are excited about the future of attention-based models and plan to apply them to other tasks. 
We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video.\\\\nMaking generation less sequential is another research goals of ours.', metadata={'source': '', 'title': '[1706.03762] Attention Is All You Need', 'language': 'en'}), Document(page_content='Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences (bahdanau2014neural, ; structuredAttentionNetworks, ). In all but a few cases (decomposableAttnModel, ), however, such attention mechanisms are used in conjunction with a recurrent network.\\\\n\\\\n\\\\nIn this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n2 Background', metadata={'source': '', 'title': '[1706.03762] Attention Is All You Need', 'language': 'en'}), Document(page_content='The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. 
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the', metadata={'source': '', 'title': '[1706.03762] Attention Is All You Need', 'language': 'en'})] \\nAnswer:\",\n",
" \"additional_kwargs\": {}\n",
" }\n",
" }\n",
" ]\n",
" }\n",
"\u001b[32;1m\u001b[1;3m[llm/start]\u001b[0m \u001b[1m[1:chain:RunnableSequence > 6:llm:HuggingFacePipeline] Entering LLM run with input:\n",
" \"prompts\": [\n",
" \"Human: You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\\nQuestion: What is the first transduction model relying entirely on self-attention? \\nContext: [Document(page_content='To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.\\\\nIn the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as (neural_gpu, ; NalBytenet2017, ) and (JonasFaceNet2017, ).\\\\n\\\\n\\\\n\\\\n\\\\n3 Model Architecture\\\\n\\\\nFigure 1: The Transformer - model architecture.', metadata={'source': '', 'title': '[1706.03762] Attention Is All You Need', 'language': 'en'}), Document(page_content='In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.\\\\n\\\\n\\\\nFor translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles. \\\\n\\\\n\\\\nWe are excited about the future of attention-based models and plan to apply them to other tasks. 
We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video.\\\\nMaking generation less sequential is another research goals of ours.', metadata={'source': '', 'title': '[1706.03762] Attention Is All You Need', 'language': 'en'}), Document(page_content='Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences (bahdanau2014neural, ; structuredAttentionNetworks, ). In all but a few cases (decomposableAttnModel, ), however, such attention mechanisms are used in conjunction with a recurrent network.\\\\n\\\\n\\\\nIn this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.\\\\n\\\\n\\\\n\\\\n\\\\n\\\\n2 Background', metadata={'source': '', 'title': '[1706.03762] Attention Is All You Need', 'language': 'en'}), Document(page_content='The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. 
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the', metadata={'source': '', 'title': '[1706.03762] Attention Is All You Need', 'language': 'en'})] \\nAnswer:\"\n",
" ]\n",
"\u001b[36;1m\u001b[1;3m[llm/end]\u001b[0m \u001b[1m[1:chain:RunnableSequence > 6:llm:HuggingFacePipeline] [4.34s] Exiting LLM run with output:\n",
" \"generations\": [\n",
" [\n",
" {\n",
" \"text\": \" The first transduction model relying entirely on self-attention is the Transformer.\",\n",
" \"generation_info\": null,\n",
" \"type\": \"Generation\"\n",
" }\n",
" ]\n",
" ],\n",
" \"llm_output\": null,\n",
" \"run\": null\n",
"\u001b[36;1m\u001b[1;3m[chain/end]\u001b[0m \u001b[1m[1:chain:RunnableSequence] [4.41s] Exiting Chain run with output:\n",
" \"output\": \" The first transduction model relying entirely on self-attention is the Transformer.\"\n",
"source": [
"langchain.verbose = True\n",
"langchain.debug = True\n",
"llm_res = rag_chain.invoke(\n",
" \"What is the first transduction model relying entirely on self-attention?\",\n",
"cell_type": "code",
"execution_count": 32,
"id": "023404a1-401a-46e1-8ab5-cafbc8593b04",
"metadata": {},
"outputs": [
"data": {
"text/plain": [
"' The first transduction model relying entirely on self-attention is the Transformer.'"
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
"source": [
"cell_type": "markdown",
"id": "0eaefd01-254a-445d-a95f-37889c126e0e",
"metadata": {},
"source": [
"Based on the retrieved documents, the answer is indeed correct :)"
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
"nbformat": 4,
"nbformat_minor": 5

@ -0,0 +1,26 @@
# Optimum-intel
All functionality related to the [optimum-intel]( and [IPEX](
## Installation
Install optimum-intel and IPEX using:
pip install optimum[neural-compressor]
pip install intel_extension_for_pytorch
Please follow the installation instructions as specified below:
* Install optimum-intel as shown [here](
* Install IPEX as shown [here](
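
Before constructing the embedder, it can help to confirm that both optional packages are visible to the interpreter. The snippet below is a small sketch using only the standard library. The top-level module names `optimum` and `intel_extension_for_pytorch` are assumed to match the pip packages listed above, and finding `optimum` does not by itself prove that the Intel extras are installed.

```python
import importlib.util
from typing import Dict, Iterable

# Top-level import names, assumed to correspond to the pip packages above.
REQUIRED_MODULES = ("optimum", "intel_extension_for_pytorch")


def check_optional_dependencies(
    modules: Iterable[str] = REQUIRED_MODULES,
) -> Dict[str, bool]:
    """Return {module_name: is_importable} without importing the modules."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


status = check_optional_dependencies()
for name, available in status.items():
    print(f"{name}: {'found' if available else 'missing'}")
```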
## Embedding Models
See a [usage example](/docs/integrations/text_embedding/optimum_intel).
We also offer a full tutorial notebook, `rag_with_quantized_embeddings.ipynb`, in the cookbook directory. It shows how to use the embedder in a RAG pipeline.
from langchain_community.embeddings import QuantizedBiEncoderEmbeddings

@ -0,0 +1,201 @@
"cells": [
"cell_type": "markdown",
"id": "ae6f9d9d-fe44-489c-9661-dac69683dcd2",
"metadata": {},
"source": [
"# Embedding Documents using Optimized and Quantized Embedders\n",
"Embedding all documents using Quantized Embedders.\n",
"The embedders are based on optimized models, created by using [optimum-intel]( and [IPEX](\n",
"Example text is based on [SBERT]("
"cell_type": "code",
"execution_count": 2,
"id": "b9d1a3bb-83b1-4029-ad8d-411db1fba034",
"metadata": {},
"outputs": [
"name": "stderr",
"output_type": "stream",
"text": [
"loading configuration file inc_config.json from cache at \n",
"INCConfig {\n",
" \"distillation\": {},\n",
" \"neural_compressor_version\": \"2.4.1\",\n",
" \"optimum_version\": \"1.16.2\",\n",
" \"pruning\": {},\n",
" \"quantization\": {\n",
" \"dataset_num_samples\": 50,\n",
" \"is_static\": true\n",
" },\n",
" \"save_onnx_model\": false,\n",
" \"torch_version\": \"2.2.0\",\n",
" \"transformers_version\": \"4.37.2\"\n",
"Using `INCModel` to load a TorchScript model will be deprecated in v1.15.0, to load your model please use `IPEXModel` instead.\n"
"source": [
"from langchain_community.embeddings import QuantizedBiEncoderEmbeddings\n",
"model_name = \"Intel/bge-small-en-v1.5-rag-int8-static\"\n",
"encode_kwargs = {\"normalize_embeddings\": True} # set True to compute cosine similarity\n",
"model = QuantizedBiEncoderEmbeddings(\n",
" model_name=model_name,\n",
" encode_kwargs=encode_kwargs,\n",
" query_instruction=\"Represent this sentence for searching relevant passages: \",\n",
"cell_type": "markdown",
"id": "34318164-7a6f-47b6-8690-3b1d71e1fcfc",
"metadata": {},
"source": [
"Lets ask a question, and compare to 2 documents. The first contains the answer to the question, and the second one does not. \n",
"We can check better suits our query."
"cell_type": "code",
"execution_count": 5,
"id": "55ff07ca-fb44-4dcf-b2d3-dde021a53983",
"metadata": {},
"outputs": [],
"source": [
"question = \"How many people live in Berlin?\""
"cell_type": "code",
"execution_count": 6,
"id": "aebef832-5534-440c-a4a8-4bf56ccd8ad4",
"metadata": {},
"outputs": [],
"source": [
"documents = [\n",
" \"Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.\",\n",
" \"Berlin is well known for its museums.\",\n",
"cell_type": "code",
"execution_count": 7,
"id": "4eec7eda-0d9b-4488-a0e8-3eedd28ab0b1",
"metadata": {},
"outputs": [
"name": "stderr",
"output_type": "stream",
"text": [
"Batches: 100%|██████████| 1/1 [00:00<00:00, 4.18it/s]\n"
"source": [
"doc_vecs = model.embed_documents(documents)"
"cell_type": "code",
"execution_count": 8,
"id": "8e6dac72-5a0b-4421-9454-aa0a49b20c66",
"metadata": {},
"outputs": [],
"source": [
"query_vec = model.embed_query(question)"
"cell_type": "code",
"execution_count": 10,
"id": "ec26eb7a-a259-4bb9-b9d8-9ff345a8c798",
"metadata": {},
"outputs": [],
"source": [
"import torch"
"cell_type": "code",
"execution_count": 11,
"id": "9ca1ee83-2a6a-4f65-bc2f-3942a0c068c6",
"metadata": {},
"outputs": [],
"source": [
"doc_vecs_torch = torch.tensor(doc_vecs)"
"cell_type": "code",
"execution_count": 12,
"id": "4f6a1986-339e-443a-a2f6-ae3f3ad4266c",
"metadata": {},
"outputs": [],
"source": [
"query_vec_torch = torch.tensor(query_vec)"
"cell_type": "code",
"execution_count": 15,
"id": "2b49446e-1336-46b3-b9ef-af56b4870876",
"metadata": {},
"outputs": [
"data": {
"text/plain": [
"tensor([0.7980, 0.6529])"
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
"source": [
"query_vec_torch @ doc_vecs_torch.T"
"cell_type": "markdown",
"id": "6cc1ac2a-9641-408e-a373-736d121fc3c7",
"metadata": {},
"source": [
"We can see that indeed the first one ranks higher."
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.18"
"nbformat": 4,
"nbformat_minor": 5

@ -71,6 +71,7 @@ from langchain_community.embeddings.oci_generative_ai import OCIGenAIEmbeddings
from langchain_community.embeddings.octoai_embeddings import OctoAIEmbeddings
from langchain_community.embeddings.ollama import OllamaEmbeddings
from langchain_community.embeddings.openai import OpenAIEmbeddings
from langchain_community.embeddings.optimum_intel import QuantizedBiEncoderEmbeddings
from langchain_community.embeddings.sagemaker_endpoint import (
@ -149,6 +150,7 @@ __all__ = [

@ -0,0 +1,208 @@
from typing import Any, Dict, List, Optional

from langchain_core.embeddings import Embeddings
from langchain_core.pydantic_v1 import BaseModel, Extra


class QuantizedBiEncoderEmbeddings(BaseModel, Embeddings):
    """Quantized bi-encoders embedding models.

    Please ensure that you have installed optimum-intel and ipex.

    Input:
        model_name: str = Model name.
        max_seq_len: int = The maximum sequence length for tokenization. (default 512)
        pooling_strategy: str =
            "mean" or "cls", pooling strategy for the final layer. (default "mean")
        query_instruction: Optional[str] =
            An instruction to add to the query before embedding. (default None)
        document_instruction: Optional[str] =
            An instruction to add to each document before embedding. (default None)
        padding: Optional[bool] =
            Whether to add padding during tokenization or not. (default True)
        model_kwargs: Optional[Dict] =
            Parameters to add to the model during initialization. (default {})
        encode_kwargs: Optional[Dict] =
            Parameters to add during the embedding forward pass. (default {})

    Example:

    from langchain_community.embeddings import QuantizedBiEncoderEmbeddings

    model_name = "Intel/bge-small-en-v1.5-rag-int8-static"
    encode_kwargs = {'normalize_embeddings': True}
    hf = QuantizedBiEncoderEmbeddings(
        model_name,
        encode_kwargs=encode_kwargs,
        query_instruction="Represent this sentence for searching relevant passages: "
    )
    """

    def __init__(
        self,
        model_name: str,
        max_seq_len: int = 512,
        pooling_strategy: str = "mean",  # "mean" or "cls"
        query_instruction: Optional[str] = None,
        document_instruction: Optional[str] = None,
        padding: bool = True,
        model_kwargs: Optional[Dict] = None,
        encode_kwargs: Optional[Dict] = None,
        **kwargs: Any,
    ) -> None:
        super().__init__(**kwargs)
        self.model_name_or_path = model_name
        self.max_seq_len = max_seq_len
        self.pooling = pooling_strategy
        self.padding = padding
        self.encode_kwargs = encode_kwargs or {}
        self.model_kwargs = model_kwargs or {}

        self.normalize = self.encode_kwargs.get("normalize_embeddings", False)
        self.batch_size = self.encode_kwargs.get("batch_size", 32)

        self.query_instruction = query_instruction
        self.document_instruction = document_instruction

        self.load_model()

    def load_model(self) -> None:
        try:
            from transformers import AutoTokenizer
        except ImportError as e:
            raise ImportError(
                "Unable to import transformers, please install with "
                "`pip install -U transformers`."
            ) from e
        try:
            from optimum.intel import IPEXModel

            self.transformer_model = IPEXModel.from_pretrained(
                self.model_name_or_path, **self.model_kwargs
            )
        except Exception as e:
            raise Exception(
                f"""
Failed to load model {self.model_name_or_path}, due to the following error:
{e}
Please ensure that you have installed optimum-intel and ipex correctly, using:

pip install optimum[neural-compressor]
pip install intel_extension_for_pytorch

For more information, please visit:
* Install optimum-intel as shown here:
* Install IPEX as shown here:
"""
            )
        self.transformer_tokenizer = AutoTokenizer.from_pretrained(
            pretrained_model_name_or_path=self.model_name_or_path,
        )

    class Config:
        """Configuration for this pydantic object."""

        extra = Extra.allow

    def _embed(self, inputs: Any) -> Any:
        try:
            import torch
        except ImportError as e:
            raise ImportError(
                "Unable to import torch, please install with `pip install -U torch`."
            ) from e
        with torch.inference_mode():
            outputs = self.transformer_model(**inputs)
            if self.pooling == "mean":
                emb = self._mean_pooling(outputs, inputs["attention_mask"])
            elif self.pooling == "cls":
                emb = self._cls_pooling(outputs)
            else:
                raise ValueError("pooling method not supported")

            if self.normalize:
                emb = torch.nn.functional.normalize(emb, p=2, dim=1)
            return emb

    @staticmethod
    def _cls_pooling(outputs: Any) -> Any:
        if isinstance(outputs, dict):
            token_embeddings = outputs["last_hidden_state"]
        else:
            token_embeddings = outputs[0]
        return token_embeddings[:, 0]

    @staticmethod
    def _mean_pooling(outputs: Any, attention_mask: Any) -> Any:
        try:
            import torch
        except ImportError as e:
            raise ImportError(
                "Unable to import torch, please install with `pip install -U torch`."
            ) from e
        if isinstance(outputs, dict):
            token_embeddings = outputs["last_hidden_state"]
        else:
            # First element of model_output contains all token embeddings
            token_embeddings = outputs[0]
        input_mask_expanded = (
            attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        )
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        return sum_embeddings / sum_mask

    def _embed_text(self, texts: List[str]) -> List[List[float]]:
        inputs = self.transformer_tokenizer(
            texts,
            max_length=self.max_seq_len,
            truncation=True,
            padding=self.padding,
            return_tensors="pt",
        )
        return self._embed(inputs).tolist()

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed a list of text documents using the Optimized Embedder model.

        Input:
            texts: List[str] = List of text documents to embed.
        Output:
            List[List[float]] = The embeddings of each text document.
        """
        try:
            import pandas as pd
        except ImportError as e:
            raise ImportError(
                "Unable to import pandas, please install with `pip install -U pandas`."
            ) from e
        try:
            from tqdm import tqdm
        except ImportError as e:
            raise ImportError(
                "Unable to import tqdm, please install with `pip install -U tqdm`."
            ) from e
        docs = [
            self.document_instruction + d if self.document_instruction else d
            for d in texts
        ]

        # group into batches
        text_list_df = pd.DataFrame(docs, columns=["texts"]).reset_index()

        # assign each example with its batch
        text_list_df["batch_index"] = text_list_df["index"] // self.batch_size

        # create groups
        batches = list(text_list_df.groupby(["batch_index"])["texts"].apply(list))

        vectors = []
        for batch in tqdm(batches, desc="Batches"):
            vectors += self._embed_text(batch)
        return vectors

    def embed_query(self, text: str) -> List[float]:
        if self.query_instruction:
            text = self.query_instruction + text
        return self._embed_text([text])[0]
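
The `_mean_pooling` helper above averages token embeddings while ignoring padded positions. As a rough illustration of that arithmetic, here is a pure-Python sketch that needs no torch. The function name `mean_pool` and the sample numbers are hypothetical, and the real method works on tensors rather than nested lists.

```python
from typing import List


def mean_pool(
    token_embeddings: List[List[List[float]]],
    attention_mask: List[List[int]],
) -> List[List[float]]:
    """Average token vectors per sequence, skipping positions whose mask is 0."""
    pooled = []
    for tokens, mask in zip(token_embeddings, attention_mask):
        summed = [0.0] * len(tokens[0])
        for vec, keep in zip(tokens, mask):
            if keep:
                for j, value in enumerate(vec):
                    summed[j] += value
        # Guard against division by zero, like the clamp in `_mean_pooling`.
        count = max(sum(mask), 1e-9)
        pooled.append([s / count for s in summed])
    return pooled


# Two sequences of three tokens each; the last token of the first is padding.
batch = [
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
]
mask = [[1, 1, 0], [1, 1, 1]]

pooled = mean_pool(batch, mask)
print(pooled)  # [[2.0, 3.0], [4.0, 2.0]]
```

The padded token `[100.0, 100.0]` does not affect the first result, which is the point of weighting by the attention mask.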

@ -58,6 +58,7 @@ EXPECTED_ALL = [
