From 224ec0cfd3c05fc7b1d5010b117f7c78bae15bb4 Mon Sep 17 00:00:00 2001 From: Prakul Date: Fri, 27 Oct 2023 11:50:29 -0700 Subject: [PATCH] Mongo db $vector search doc update (#12404) **Description:** Updates the documentation for MongoDB Atlas Vector Search --- .../vectorstores/mongodb_atlas.ipynb | 285 ++++++++++++++---- 1 file changed, 231 insertions(+), 54 deletions(-) diff --git a/docs/docs/integrations/vectorstores/mongodb_atlas.ipynb b/docs/docs/integrations/vectorstores/mongodb_atlas.ipynb index ecd57f3e0b..eacbfcd94b 100644 --- a/docs/docs/integrations/vectorstores/mongodb_atlas.ipynb +++ b/docs/docs/integrations/vectorstores/mongodb_atlas.ipynb @@ -9,30 +9,74 @@ "\n", ">[MongoDB Atlas](https://www.mongodb.com/docs/atlas/) is a fully-managed cloud database available in AWS, Azure, and GCP. It now has support for native Vector Search on your MongoDB document data.\n", "\n", - "This notebook shows how to use `MongoDB Atlas Vector Search` to store your embeddings in MongoDB documents, create a vector search index, and perform KNN search with an approximate nearest neighbor algorithm.\n", + "This notebook shows how to use [MongoDB Atlas Vector Search](https://www.mongodb.com/products/platform/atlas-vector-search) to store your embeddings in MongoDB documents, create a vector search index, and perform KNN search with an approximate nearest neighbor algorithm (`Hierarchical Navigable Small Worlds`). It uses the [$vectorSearch MQL Stage](https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-overview/). \n", "\n", - "It uses the [knnBeta Operator](https://www.mongodb.com/docs/atlas/atlas-search/knn-beta) available in MongoDB Atlas Search. This feature is in Public Preview and available for evaluation purposes, to validate functionality, and to gather feedback from public preview users. It is not recommended for production deployments as we may introduce breaking changes.\n", "\n", - "To use MongoDB Atlas, you must first deploy a cluster. 
We have a Forever-Free tier of clusters available. \n", - "To get started head over to Atlas here: [quick start](https://www.mongodb.com/docs/atlas/getting-started/)." + "To use MongoDB Atlas, you must first deploy a cluster. We have a Forever-Free tier of clusters available. To get started head over to Atlas here: [quick start](https://www.mongodb.com/docs/atlas/getting-started/).\n", + "\n", + " " + ] + }, + { + "cell_type": "markdown", + "id": "5abfec15", + "metadata": {}, + "source": [ + "> Note: \n", + ">* This feature is in Public Preview and available for evaluation purposes, to validate functionality, and to gather feedback from public preview users. It is not recommended for production deployments as we may introduce breaking changes.\n", + ">* LangChain version 0.0.305 ([release notes](https://github.com/langchain-ai/langchain/releases/tag/v0.0.305)) introduces support for the $vectorSearch MQL stage, which is available with MongoDB Atlas 6.0.11 and 7.0.2. Users on earlier versions of MongoDB Atlas need to pin their LangChain version to <=0.0.304.\n", + "> \n", + "> " + ] + }, + { + "cell_type": "markdown", + "id": "1b5ce18d", + "metadata": {}, + "source": [ + "In this notebook we will demonstrate how to perform `Retrieval Augmented Generation` (RAG) using MongoDB Atlas, OpenAI and LangChain. We will perform Similarity Search and Question Answering over the PDF of the [GPT-4 technical report](https://arxiv.org/pdf/2303.08774.pdf), which came out in March 2023 and is therefore not part of the parametric memory of OpenAI's Large Language Models (LLMs), whose knowledge cutoff was September 2021." + ] + }, + { + "cell_type": "markdown", + "id": "457ace44-1d95-4001-9dd5-78811ab208ad", + "metadata": {}, + "source": [ + "We want to use `OpenAIEmbeddings` so we need to set up our OpenAI API Key. 
" ] }, { "cell_type": "code", "execution_count": null, - "id": "b4c41cad-08ef-4f72-a545-2151e4598efe", - "metadata": { - "tags": [] - }, + "id": "2d8f240d", + "metadata": {}, + "outputs": [], + "source": [ + "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")" + ] + }, + { + "cell_type": "markdown", + "id": "70482cd8", + "metadata": {}, + "source": [ + "Now we will setup the environment variables for the MongoDB Atlas cluster" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4d7788cf", + "metadata": {}, "outputs": [], "source": [ - "!pip install pymongo" + "!pip install langchain pypdf pymongo openai tiktoken" ] }, { "cell_type": "code", "execution_count": null, - "id": "c1e38361-c1fe-4ac6-86e9-c90ebaf7ae87", + "id": "7ef41b37", "metadata": {}, "outputs": [], "source": [ @@ -43,21 +87,32 @@ ] }, { - "cell_type": "markdown", - "id": "457ace44-1d95-4001-9dd5-78811ab208ad", + "cell_type": "code", + "execution_count": null, + "id": "00d78318", "metadata": {}, + "outputs": [], "source": [ - "We want to use `OpenAIEmbeddings` so we need to set up our OpenAI API Key. " + "from pymongo import MongoClient\n", + "\n", + "# initialize MongoDB python client\n", + "client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)\n", + "\n", + "DB_NAME = \"langchain_db\"\n", + "COLLECTION_NAME = \"test\"\n", + "ATLAS_VECTOR_SEARCH_INDEX_NAME = \"default\"\n", + "\n", + "MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME]" ] }, { "cell_type": "code", "execution_count": null, - "id": "2d8f240d", + "id": "cacb61e9", "metadata": {}, "outputs": [], "source": [ - "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")" + "# Create Vector Search Index" ] }, { @@ -65,8 +120,8 @@ "id": "1f3ecc42", "metadata": {}, "source": [ - "Now, let's create a vector search index on your cluster. In the below example, `embedding` is the name of the field that contains the embedding vector. 
Please refer to the [documentation](https://www.mongodb.com/docs/atlas/atlas-search/define-field-mappings-for-vector-search) to get more details on how to define an Atlas Vector Search index.\n", - "You can name the index `langchain_demo` and create the index on the namespace `lanchain_db.langchain_col`. Finally, write the following definition in the JSON editor on MongoDB Atlas:\n", + "Now, let's create a vector search index on your cluster. In the below example, `embedding` is the name of the field that contains the embedding vector. Please refer to the [documentation](https://www.mongodb.com/docs/atlas/atlas-search/field-types/knn-vector) to get more details on how to define an Atlas Vector Search index.\n", + "You can name the index `{ATLAS_VECTOR_SEARCH_INDEX_NAME}` and create the index on the namespace `{DB_NAME}.{COLLECTION_NAME}`. Finally, write the following definition in the JSON editor on MongoDB Atlas:\n", "\n", "```json\n", "{\n", @@ -84,26 +139,51 @@ "```" ] }, + { + "cell_type": "markdown", + "id": "42873e5a", + "metadata": {}, + "source": [ + "# Insert Data" + ] + }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "id": "aac9563e", "metadata": { "tags": [] }, "outputs": [], "source": [ - "from langchain.embeddings.openai import OpenAIEmbeddings\n", - "from langchain.text_splitter import CharacterTextSplitter\n", - "from langchain.vectorstores import MongoDBAtlasVectorSearch\n", - "from langchain.document_loaders import TextLoader\n", + "from langchain.document_loaders import PyPDFLoader\n", "\n", - "loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n", - "documents = loader.load()\n", - "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n", - "docs = text_splitter.split_documents(documents)\n", + "# Load the PDF \n", + "loader = PyPDFLoader(\"https://arxiv.org/pdf/2303.08774.pdf\")\n", + "data = loader.load()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"a5578113", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", "\n", - "embeddings = OpenAIEmbeddings()" + "text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 150)\n", + "docs = text_splitter.split_documents(data)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d378168f", + "metadata": {}, + "outputs": [], + "source": [ + "print(docs[0])" ] }, { @@ -113,34 +193,35 @@ "metadata": {}, "outputs": [], "source": [ - "from pymongo import MongoClient\n", - "\n", - "# initialize MongoDB python client\n", - "client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)\n", - "\n", - "db_name = \"langchain_db\"\n", - "collection_name = \"langchain_col\"\n", - "collection = client[db_name][collection_name]\n", - "index_name = \"langchain_demo\"\n", + "from langchain.embeddings import OpenAIEmbeddings\n", + "from langchain.vectorstores import MongoDBAtlasVectorSearch\n", "\n", "# insert the documents in MongoDB Atlas with their embedding\n", - "docsearch = MongoDBAtlasVectorSearch.from_documents(\n", - " docs, embeddings, collection=collection, index_name=index_name\n", - ")\n", - "\n", - "# perform a similarity search between the embedding of the query and the embeddings of the documents\n", - "query = \"What did the president say about Ketanji Brown Jackson\"\n", - "docs = docsearch.similarity_search(query)" + "vector_search = MongoDBAtlasVectorSearch.from_documents(\n", + " documents=docs, embedding=OpenAIEmbeddings(disallowed_special=()), collection=MONGODB_COLLECTION, index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME\n", + ")" ] }, { "cell_type": "code", "execution_count": null, - "id": "9c608226", + "id": "7bf6841e", "metadata": {}, "outputs": [], "source": [ - "print(docs[0].page_content)" + "# Perform a similarity search between the embedding of the query and the embeddings of the documents\n", + "query = \"What were the compute requirements for training GPT 
4\"\n", + "results = vector_search.similarity_search(query)\n", + "\n", + "print(results[0].page_content)" + ] + }, + { + "cell_type": "markdown", + "id": "9e58c2d8", + "metadata": {}, + "source": [ + "# Querying data" ] }, { @@ -148,26 +229,122 @@ "id": "851a2ec9-9390-49a4-8412-3e132c9f789d", "metadata": {}, "source": [ - "You can also instantiate the vector store directly and execute a query as follows:" + "We can also instantiate the vector store directly and execute a query as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "985d28fe", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.embeddings import OpenAIEmbeddings\n", + "from langchain.vectorstores import MongoDBAtlasVectorSearch\n", + "\n", + "vector_search = MongoDBAtlasVectorSearch.from_connection_string(\n", + " MONGODB_ATLAS_CLUSTER_URI,\n", + " DB_NAME + \".\" + COLLECTION_NAME,\n", + " OpenAIEmbeddings(disallowed_special=()),\n", + " index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "6d9a2dbe", + "metadata": {}, + "source": [ + "## Similarity Search with Score" ] }, { "cell_type": "code", "execution_count": null, - "id": "6336fe79-3e73-48be-b20a-0ff1bb6a4399", + "id": "497baffa", "metadata": {}, "outputs": [], "source": [ - "# initialize vector store\n", - "vectorstore = MongoDBAtlasVectorSearch(\n", - " collection, OpenAIEmbeddings(), index_name=index_name\n", + "query = \"What were the compute requirements for training GPT 4\"\n", + "\n", + "results = vector_search.similarity_search_with_score(\n", + " query=query,\n", + " k=5,\n", ")\n", "\n", - "# perform a similarity search between a query and the ingested documents\n", - "query = \"What did the president say about Ketanji Brown Jackson\"\n", - "docs = vectorstore.similarity_search(query)\n", + "# Display results\n", + "for result in results:\n", + " print(result)" + ] + }, + { + "cell_type": "markdown", + "id": "cbade5f0", + "metadata": {}, + 
"source": [ + "## Question Answering " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bc6475f9", + "metadata": {}, + "outputs": [], + "source": [ + "qa_retriever = vector_search.as_retriever(\n", + " search_type=\"similarity\",\n", + " search_kwargs={\n", + " \"k\": 100,\n", + " \"post_filter_pipeline\": [{\"$limit\": 25}]\n", + " }\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8e13e96c", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.prompts import PromptTemplate\n", + "prompt_template = \"\"\"Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n", + "\n", + "{context}\n", "\n", - "print(docs[0].page_content)" + "Question: {question}\n", + "\"\"\"\n", + "PROMPT = PromptTemplate(\n", + " template=prompt_template, input_variables=[\"context\", \"question\"]\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ff0edb02", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.chains import RetrievalQA\n", + "from langchain.chat_models import ChatOpenAI\n", + "from langchain.llms import OpenAI\n", + "\n", + "qa = RetrievalQA.from_chain_type(llm=OpenAI(),chain_type=\"stuff\", retriever=qa_retriever, return_source_documents=True, chain_type_kwargs={\"prompt\": PROMPT})\n", + "\n", + "docs = qa({\"query\": \"gpt-4 compute requirements\"})\n", + "\n", + "print(docs[\"result\"])\n", + "print(docs['source_documents'])" + ] + }, + { + "cell_type": "markdown", + "id": "61636bb2", + "metadata": {}, + "source": [ + "GPT-4 requires significantly more compute than earlier GPT models. On a dataset derived from OpenAI's internal codebase, GPT-4 requires 100p (petaflops) of compute to reach the lowest loss, while the smaller models require 1-10n (nanoflops)." ] } ],