langchain/docs/modules/indexes/retrievers/examples/vectorstore.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fc0db1bc",
   "metadata": {},
   "source": [
    "# VectorStore\n",
    "\n",
    "The index - and therefore the retriever - that LangChain has the most support for is the `VectorStoreRetriever`. As the name suggests, this retriever is backed heavily by a VectorStore.\n",
    "\n",
    "Once you construct a VectorStore, its very easy to construct a retriever. Let's walk through an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "5831703b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import TextLoader\n",
    "loader = TextLoader('../../../state_of_the_union.txt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "9fbcc58f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Exiting: Cleaning up .chroma directory\n"
     ]
    }
   ],
   "source": [
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "from langchain.vectorstores import FAISS\n",
    "from langchain.embeddings import OpenAIEmbeddings\n",
    "\n",
    "documents = loader.load()\n",
    "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
    "texts = text_splitter.split_documents(documents)\n",
    "embeddings = OpenAIEmbeddings()\n",
    "db = FAISS.from_documents(texts, embeddings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "0cbfb1af",
   "metadata": {},
   "outputs": [],
   "source": [
    "retriever = db.as_retriever()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "fc12700b",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = retriever.get_relevant_documents(\"what did he say about ketanji brown jackson\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79b783de",
   "metadata": {},
   "source": [
    "## Maximum Marginal Relevance Retrieval\n",
    "By default, the vectorstore retriever uses similarity search. If the underlying vectorstore support maximum marginal relevance search, you can specify that as the search type."
   ]
  },
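  {
   "cell_type": "markdown",
   "id": "a1c3e5f7",
   "metadata": {},
   "source": [
    "To make the idea concrete before switching the retriever over, here is a minimal NumPy sketch of greedy maximum marginal relevance selection. It is an illustration under stated assumptions, not the implementation used by LangChain or by any particular vectorstore: `cosine_rows`, `mmr_select`, the `lambda_mult` trade-off weight, and the toy vectors are hypothetical names and data chosen for this example. Each step picks the candidate that best balances similarity to the query against similarity to the documents already selected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2d4f6a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def cosine_rows(a, b):\n",
    "    # Cosine similarity between every row of a and every row of b.\n",
    "    a = a / np.linalg.norm(a, axis=1, keepdims=True)\n",
    "    b = b / np.linalg.norm(b, axis=1, keepdims=True)\n",
    "    return a @ b.T\n",
    "\n",
    "def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):\n",
    "    # Greedy MMR: lambda_mult=1.0 ranks purely by relevance, lower values favour diversity.\n",
    "    query_sim = cosine_rows(query_vec[None, :], doc_vecs)[0]\n",
    "    doc_sim = cosine_rows(doc_vecs, doc_vecs)\n",
    "    selected = [int(np.argmax(query_sim))]  # start with the most relevant document\n",
    "    while len(selected) < min(k, len(doc_vecs)):\n",
    "        best_idx, best_score = -1, -np.inf\n",
    "        for i in range(len(doc_vecs)):\n",
    "            if i in selected:\n",
    "                continue\n",
    "            # Penalise candidates that resemble something already selected.\n",
    "            redundancy = max(doc_sim[i, j] for j in selected)\n",
    "            score = lambda_mult * query_sim[i] - (1 - lambda_mult) * redundancy\n",
    "            if score > best_score:\n",
    "                best_idx, best_score = i, score\n",
    "        selected.append(best_idx)\n",
    "    return selected\n",
    "\n",
    "# Toy 2-D vectors (made up for illustration): documents 0 and 1 are near-duplicates.\n",
    "toy_query = np.array([1.0, 0.3])\n",
    "toy_docs = np.array([[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]])\n",
    "\n",
    "mmr_picks = mmr_select(toy_query, toy_docs, k=2, lambda_mult=0.5)\n",
    "relevance_only_picks = mmr_select(toy_query, toy_docs, k=2, lambda_mult=1.0)\n",
    "mmr_picks, relevance_only_picks"
   ]
  },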
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "44c7303e",
   "metadata": {},
   "outputs": [],
   "source": [
    "retriever = db.as_retriever(search_type=\"mmr\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "d16ceec6",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = retriever.get_relevant_documents(\"what did he say abotu ketanji brown jackson\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d958271",
   "metadata": {},
   "source": [
    "## Similarity Score Threshold Retrieval\n",
    "\n",
    "You can also a retrieval method that sets a similarity score threshold and only returns documents with a score above that threshold"
   ]
  },
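  {
   "cell_type": "markdown",
   "id": "c3e5a7b9",
   "metadata": {},
   "source": [
    "Conceptually, thresholding is a filter over scored results. The cell below is a toy sketch of that filter only: the `scored_chunks` data and the `filter_by_threshold` helper are hypothetical, and treating a score equal to the threshold as passing is an assumption of this sketch. How scores are computed and scaled depends on the underlying vectorstore."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4f6b8c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical (text, relevance score) pairs; in practice the vectorstore computes the scores.\n",
    "scored_chunks = [\n",
    "    ('chunk about the Supreme Court nominee', 0.82),\n",
    "    ('chunk about infrastructure', 0.41),\n",
    "    ('chunk about judicial appointments', 0.56),\n",
    "]\n",
    "\n",
    "def filter_by_threshold(scored, score_threshold):\n",
    "    # Keep only entries whose score meets the threshold, preserving their order.\n",
    "    return [(text, score) for text, score in scored if score >= score_threshold]\n",
    "\n",
    "kept = filter_by_threshold(scored_chunks, score_threshold=0.5)\n",
    "kept"
   ]
  },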
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "d4272ad8",
   "metadata": {},
   "outputs": [],
   "source": [
    "retriever = db.as_retriever(search_type=\"similarity_score_threshold\", search_kwargs={\"score_threshold\": .5})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "438e761d",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = retriever.get_relevant_documents(\"what did he say abotu ketanji brown jackson\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c23b7698",
   "metadata": {},
   "source": [
    "## Specifying top k\n",
    "You can also specify search kwargs like `k` to use when doing retrieval."
   ]
  },
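  {
   "cell_type": "markdown",
   "id": "e5a7c9d1",
   "metadata": {},
   "source": [
    "As a rough picture of what `k` controls, the toy sketch below ranks a few made-up vectors against a query and keeps the `k` best. The `top_k_indices` helper and the example vectors are hypothetical, and cosine similarity is an assumption made for illustration; the distance metric a real vectorstore uses may differ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f6b8d0e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def top_k_indices(query_vec, doc_vecs, k):\n",
    "    # Cosine similarity between the query and each document vector.\n",
    "    sims = (doc_vecs @ query_vec) / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))\n",
    "    # Indices of the k most similar documents, best first.\n",
    "    return [int(i) for i in np.argsort(-sims)[:k]]\n",
    "\n",
    "# Made-up 2-D vectors for illustration only.\n",
    "example_query = np.array([1.0, 0.1])\n",
    "example_docs = np.array([[0.6, 0.8], [1.0, 0.0], [0.0, 1.0]])\n",
    "\n",
    "top_1 = top_k_indices(example_query, example_docs, k=1)\n",
    "top_2 = top_k_indices(example_query, example_docs, k=2)\n",
    "top_1, top_2"
   ]
  },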
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "b5f44cdf",
   "metadata": {},
   "outputs": [],
   "source": [
    "retriever = db.as_retriever(search_kwargs={\"k\": 1})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "56b6a545",
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = retriever.get_relevant_documents(\"what did he say abotu ketanji brown jackson\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "b5416858",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(docs)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}