langchain/docs/modules/indexes/examples/embeddings.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "249b4058",
   "metadata": {},
   "source": [
    "# Embeddings\n",
    "\n",
    "This notebook shows how to use the Embedding class in LangChain.\n",
    "\n",
    "The Embedding class provides a standard interface to the many embedding providers (OpenAI, Cohere, Hugging Face, etc.).\n",
    "\n",
    "Embeddings create a vector representation of a piece of text. This is useful because it lets us reason about text in vector space and do things like semantic search, where we look for the pieces of text that are most similar in that space.\n",
    "\n",
    "The base Embedding class in LangChain exposes two methods: `embed_documents` and `embed_query`. The main difference is in their interfaces: `embed_documents` takes a list of texts, while `embed_query` takes a single text. They are also kept separate because some embedding providers embed documents (the texts to be searched over) differently from queries (the search query itself)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "278b6c63",
   "metadata": {},
   "source": [
    "## OpenAI\n",
    "\n",
    "Let's load the OpenAI Embedding class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "0be1af71",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import OpenAIEmbeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "2c66e5da",
   "metadata": {},
   "outputs": [],
   "source": [
    "embeddings = OpenAIEmbeddings()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "01370375",
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"This is a test document.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "bfb6142c",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_result = embeddings.embed_query(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "0356c3b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "doc_result = embeddings.embed_documents([text])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42f76e43",
   "metadata": {},
   "source": [
    "## Cohere\n",
    "\n",
    "Let's load the Cohere Embedding class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6b82f59f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import CohereEmbeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "26895c60",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os  # assumes COHERE_API_KEY is set in the environment\n",
    "embeddings = CohereEmbeddings(cohere_api_key=os.environ[\"COHERE_API_KEY\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "eea52814",
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"This is a test document.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "fbe167bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_result = embeddings.embed_query(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "38ad3b20",
   "metadata": {},
   "outputs": [],
   "source": [
    "doc_result = embeddings.embed_documents([text])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed47bb62",
   "metadata": {},
   "source": [
    "## Hugging Face Hub\n",
    "\n",
    "Let's load the Hugging Face Embedding class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "861521a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import HuggingFaceEmbeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "ff9be586",
   "metadata": {},
   "outputs": [],
   "source": [
    "embeddings = HuggingFaceEmbeddings()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "d0a98ae9",
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"This is a test document.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "5d6c682b",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_result = embeddings.embed_query(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "bb5e74c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "doc_result = embeddings.embed_documents([text])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fff4734f",
   "metadata": {},
   "source": [
    "## TensorflowHub\n",
    "Let's load the TensorflowHub Embedding class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f822104b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import TensorflowHubEmbeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "bac84e46",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023-01-30 23:53:01.652176: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA\n",
      "To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
      "2023-01-30 23:53:34.362802: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA\n",
      "To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n"
     ]
    }
   ],
   "source": [
    "embeddings = TensorflowHubEmbeddings()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "4790d770",
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"This is a test document.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "f556dcdb",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_result = embeddings.embed_query(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59428e05",
   "metadata": {},
   "source": [
    "## InstructEmbeddings\n",
    "Let's load the Hugging Face Instruct Embedding class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "92c5b61e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import HuggingFaceInstructEmbeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "062547b9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "load INSTRUCTOR_Transformer\n",
      "max_seq_length  512\n"
     ]
    }
   ],
   "source": [
    "embeddings = HuggingFaceInstructEmbeddings(query_instruction=\"Represent the query for retrieval: \")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "e1dcc4bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"This is a test document.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "90f0db94",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_result = embeddings.embed_query(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eec4efda",
   "metadata": {},
   "source": [
    "## Self Hosted Embeddings\n",
    "Let's load the SelfHostedEmbeddings, SelfHostedHuggingFaceEmbeddings, and SelfHostedHuggingFaceInstructEmbeddings classes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d338722a",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from langchain.embeddings import (\n",
    "    SelfHostedEmbeddings,\n",
    "    SelfHostedHuggingFaceEmbeddings,\n",
    "    SelfHostedHuggingFaceInstructEmbeddings,\n",
    ")\n",
    "import runhouse as rh"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "146559e8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# For an on-demand A100 with GCP, Azure, or Lambda\n",
    "gpu = rh.cluster(name=\"rh-a10x\", instance_type=\"A100:1\", use_spot=False)\n",
    "\n",
    "# For an on-demand A10G with AWS (no single A100s on AWS)\n",
    "# gpu = rh.cluster(name='rh-a10x', instance_type='g5.2xlarge', provider='aws')\n",
    "\n",
    "# For an existing cluster\n",
    "# gpu = rh.cluster(ips=['<ip of the cluster>'], \n",
    "#                  ssh_creds={'ssh_user': '...', 'ssh_private_key':'<path_to_key>'},\n",
    "#                  name='my-cluster')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1230f7df",
   "metadata": {},
   "outputs": [],
   "source": [
    "embeddings = SelfHostedHuggingFaceEmbeddings(hardware=gpu)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "2684e928",
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"This is a test document.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1dc5e606",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "query_result = embeddings.embed_query(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cef9cc54",
   "metadata": {},
   "source": [
    "And similarly for SelfHostedHuggingFaceInstructEmbeddings:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "81a17ca3",
   "metadata": {},
   "outputs": [],
   "source": [
    "embeddings = SelfHostedHuggingFaceInstructEmbeddings(hardware=gpu)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a33d1c8",
   "metadata": {},
   "source": [
    "Now let's load an embedding model with a custom load function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "c4af5679",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_pipeline():\n",
    "    from transformers import AutoModel, AutoTokenizer, pipeline  # Must be inside the function in notebooks\n",
    "    model_id = \"facebook/bart-base\"\n",
    "    tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
    "    model = AutoModel.from_pretrained(model_id)\n",
    "    return pipeline(\"feature-extraction\", model=model, tokenizer=tokenizer)\n",
    "\n",
    "def inference_fn(pipeline, prompt):\n",
    "    # Use the final token's feature vector as the embedding\n",
    "    if isinstance(prompt, list):\n",
    "        return [emb[0][-1] for emb in pipeline(prompt)]\n",
    "    return pipeline(prompt)[0][-1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8654334b",
   "metadata": {},
   "outputs": [],
   "source": [
    "embeddings = SelfHostedEmbeddings(\n",
    "    model_load_fn=get_pipeline, \n",
    "    hardware=gpu,\n",
    "    model_reqs=[\"./\", \"torch\", \"transformers\"],\n",
    "    inference_fn=inference_fn\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc1bfd0f",
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "query_result = embeddings.embed_query(text)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  },
  "vscode": {
   "interpreter": {
    "hash": "ce6f9b0d7cdac41515b0e0c38d0e6e153a2edce81d579281cb1ab99da6e8ea6d"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}