{ "cells": [ { "cell_type": "markdown", "id": "249b4058", "metadata": {}, "source": [ "# Embeddings\n", "\n", "This notebook goes over how to use the Embedding class in LangChain.\n", "\n", "The Embedding class is a class designed for interfacing with embeddings. There are lots of Embedding providers (OpenAI, Cohere, Hugging Face, etc) - this class is designed to provide a standard interface for all of them.\n", "\n", "Embeddings create a vector representation of a piece of text. This is useful because it means we can think about text in the vector space, and do things like semantic search where we look for pieces of text that are most similar in the vector space.\n", "\n", "The base Embedding class in LangChain exposes two methods: `embed_documents` and `embed_query`. The largest difference is that these two methods have different interfaces: one works over multiple documents, while the other works over a single document. Besides this, another reason for having these as two separate methods is that some embedding providers have different embedding methods for documents (to be searched over) vs queries (the search query itself)." ] }, { "cell_type": "markdown", "id": "278b6c63", "metadata": {}, "source": [ "## OpenAI\n", "\n", "Let's load the OpenAI Embedding class." ] }, { "cell_type": "code", "execution_count": 1, "id": "0be1af71", "metadata": {}, "outputs": [], "source": [ "from langchain.embeddings import OpenAIEmbeddings" ] }, { "cell_type": "code", "execution_count": 2, "id": "2c66e5da", "metadata": {}, "outputs": [], "source": [ "embeddings = OpenAIEmbeddings()" ] }, { "cell_type": "code", "execution_count": 3, "id": "01370375", "metadata": {}, "outputs": [], "source": [ "text = \"This is a test document.\"" ] }, { "cell_type": "code", "execution_count": 4, "id": "bfb6142c", "metadata": {}, "outputs": [], "source": [ "query_result = embeddings.embed_query(text)" ] }, { "cell_type": "code", "execution_count": 5, "id": "0356c3b7", "metadata": {}, "outputs": [], "source": [ "doc_result = embeddings.embed_documents([text])" ] }, { "cell_type": "markdown", "id": "bb61bbeb", "metadata": {}, "source": [ "Let's load the OpenAI Embedding class with first generation models (e.g. text-search-ada-doc-001/text-search-ada-query-001). Note: These are not recommended models - see [here](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings)" ] }, { "cell_type": "code", "execution_count": null, "id": "c0b072cc", "metadata": {}, "outputs": [], "source": [ "from langchain.embeddings.openai import OpenAIEmbeddings" ] }, { "cell_type": "code", "execution_count": null, "id": "a56b70f5", "metadata": {}, "outputs": [], "source": [ "embeddings = OpenAIEmbeddings(model_name=\"ada\")" ] }, { "cell_type": "code", "execution_count": null, "id": "14aefb64", "metadata": {}, "outputs": [], "source": [ "text = \"This is a test document.\"" ] }, { "cell_type": "code", "execution_count": null, "id": "3c39ed33", "metadata": {}, "outputs": [], "source": [ "query_result = embeddings.embed_query(text)" ] }, { "cell_type": "code", "execution_count": null, "id": "e3221db6", "metadata": {}, "outputs": [], "source": [ "doc_result = embeddings.embed_documents([text])" ] }, { "cell_type": "markdown", "id": "c3852491", "metadata": {}, "source": [ "## AzureOpenAI\n", "\n", "Let's load the OpenAI Embedding class with environment variables set to indicate to use Azure endpoints." 
] }, { "cell_type": "code", "execution_count": null, "id": "1b40f827", "metadata": {}, "outputs": [], "source": [ "# set the environment variables needed for openai package to know to reach out to azure\n", "import os\n", "\n", "os.environ[\"OPENAI_API_TYPE\"] = \"azure\"\n", "os.environ[\"OPENAI_API_BASE\"] = \"https://'],\n", "# ssh_creds={'ssh_user': '...', 'ssh_private_key':''},\n", "# name='my-cluster')" ] }, { "cell_type": "code", "execution_count": null, "id": "1230f7df", "metadata": {}, "outputs": [], "source": [ "embeddings = SelfHostedHuggingFaceEmbeddings(hardware=gpu)" ] }, { "cell_type": "code", "execution_count": 6, "id": "2684e928", "metadata": {}, "outputs": [], "source": [ "text = \"This is a test document.\"" ] }, { "cell_type": "code", "execution_count": null, "id": "1dc5e606", "metadata": { "scrolled": true }, "outputs": [], "source": [ "query_result = embeddings.embed_query(text)" ] }, { "cell_type": "markdown", "id": "cef9cc54", "metadata": {}, "source": [ "And similarly for SelfHostedHuggingFaceInstructEmbeddings:" ] }, { "cell_type": "code", "execution_count": null, "id": "81a17ca3", "metadata": {}, "outputs": [], "source": [ "embeddings = SelfHostedHuggingFaceInstructEmbeddings(hardware=gpu)" ] }, { "cell_type": "markdown", "id": "5a33d1c8", "metadata": {}, "source": [ "Now let's load an embedding model with a custom load function:" ] }, { "cell_type": "code", "execution_count": 12, "id": "c4af5679", "metadata": {}, "outputs": [], "source": [ "def get_pipeline():\n", " from transformers import (\n", " AutoModelForCausalLM,\n", " AutoTokenizer,\n", " pipeline,\n", " ) # Must be inside the function in notebooks\n", "\n", " model_id = \"facebook/bart-base\"\n", " tokenizer = AutoTokenizer.from_pretrained(model_id)\n", " model = AutoModelForCausalLM.from_pretrained(model_id)\n", " return pipeline(\"feature-extraction\", model=model, tokenizer=tokenizer)\n", "\n", "\n", "def inference_fn(pipeline, prompt):\n", " # Return last hidden state of the model\n", " if isinstance(prompt, list):\n", " return [emb[0][-1] for emb in pipeline(prompt)]\n", " return pipeline(prompt)[0][-1]" ] }, { "cell_type": "code", "execution_count": null, "id": "8654334b", "metadata": {}, "outputs": [], "source": [ "embeddings = SelfHostedEmbeddings(\n", " model_load_fn=get_pipeline,\n", " hardware=gpu,\n", " model_reqs=[\"./\", \"torch\", \"transformers\"],\n", " inference_fn=inference_fn,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "fc1bfd0f", "metadata": { "scrolled": false }, "outputs": [], "source": [ "query_result = embeddings.embed_query(text)" ] }, { "cell_type": "markdown", "id": "f9c02c78", "metadata": {}, "source": [ "## Fake Embeddings\n", "\n", "LangChain also provides a fake embedding class. You can use this to test your pipelines." 
] }, { "cell_type": "code", "execution_count": 1, "id": "2ffc2e4b", "metadata": {}, "outputs": [], "source": [ "from langchain.embeddings import FakeEmbeddings" ] }, { "cell_type": "code", "execution_count": 3, "id": "80777571", "metadata": {}, "outputs": [], "source": [ "embeddings = FakeEmbeddings(size=1352)" ] }, { "cell_type": "code", "execution_count": 5, "id": "3ec9d8f0", "metadata": {}, "outputs": [], "source": [ "query_result = embeddings.embed_query(\"foo\")" ] }, { "cell_type": "code", "execution_count": 6, "id": "3b9ae9e1", "metadata": {}, "outputs": [], "source": [ "doc_results = embeddings.embed_documents([\"foo\"])" ] }, { "cell_type": "markdown", "id": "1f83f273", "metadata": {}, "source": [ "## SageMaker Endpoint Embeddings\n", "\n", "Let's load the SageMaker Endpoints Embeddings class. The class can be used if you host, e.g. your own Hugging Face model on SageMaker.\n", "\n", "For instrucstions on how to do this, please see [here](https://www.philschmid.de/custom-inference-huggingface-sagemaker)" ] }, { "cell_type": "code", "execution_count": null, "id": "88d366bd", "metadata": {}, "outputs": [], "source": [ "!pip3 install langchain boto3" ] }, { "cell_type": "code", "execution_count": 3, "id": "1e9b926a", "metadata": {}, "outputs": [], "source": [ "from typing import Dict\n", "from langchain.embeddings import SagemakerEndpointEmbeddings\n", "from langchain.llms.sagemaker_endpoint import ContentHandlerBase\n", "import json\n", "\n", "\n", "class ContentHandler(ContentHandlerBase):\n", " content_type = \"application/json\"\n", " accepts = \"application/json\"\n", "\n", " def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:\n", " input_str = json.dumps({\"inputs\": prompt, **model_kwargs})\n", " return input_str.encode('utf-8')\n", " \n", " def transform_output(self, output: bytes) -> str:\n", " response_json = json.loads(output.read().decode(\"utf-8\"))\n", " return response_json[\"embeddings\"]\n", "\n", "content_handler = ContentHandler()\n", "\n", "\n", "embeddings = SagemakerEndpointEmbeddings(\n", " # endpoint_name=\"endpoint-name\", \n", " # credentials_profile_name=\"credentials-profile-name\", \n", " endpoint_name=\"huggingface-pytorch-inference-2023-03-21-16-14-03-834\", \n", " region_name=\"us-east-1\", \n", " content_handler=content_handler\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "fe9797b8", "metadata": {}, "outputs": [], "source": [ "query_result = embeddings.embed_query(\"foo\")" ] }, { "cell_type": "code", "execution_count": 6, "id": "76f1b752", "metadata": {}, "outputs": [], "source": [ "doc_results = embeddings.embed_documents([\"foo\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "fff99b21", "metadata": {}, "outputs": [], "source": [ "doc_results" ] }, { "cell_type": "code", "execution_count": null, "id": "aaad49f8", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.1" }, "vscode": { "interpreter": { "hash": "7377c2ccc78bc62c2683122d48c8cd1fb85a53850a1b1fc29736ed39852c9885" } } }, "nbformat": 4, "nbformat_minor": 5 }