mirror of https://github.com/hwchase17/langchain
You cannot select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
792 lines
24 KiB
Plaintext
792 lines
24 KiB
Plaintext
1 year ago
|
{
|
||
|
"cells": [
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "a0eb506a-f52e-4a92-9204-63233c3eb5bd",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"# DocArray Retriever\n",
|
||
|
"\n",
|
||
|
"[DocArray](https://github.com/docarray/docarray) is a versatile, open-source tool for managing your multi-modal data. It lets you shape your data however you want, and offers the flexibility to store and search it using various document index backends. Plus, it gets even better - you can utilize your DocArray document index to create a DocArrayRetriever, and build awesome Langchain apps!\n",
|
||
|
"\n",
|
||
|
"This notebook is split into two sections. The first section offers an introduction to all five supported document index backends. It provides guidance on setting up and indexing each backend, and also instructs you on how to build a DocArrayRetriever for finding relevant documents. In the second section, we'll select one of these backends and illustrate how to use it through a basic example.\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"[Document Index Backends](#Document-Index-Backends)\n",
|
||
|
"1. [InMemoryExactNNIndex](#inmemoryexactnnindex)\n",
|
||
|
"2. [HnswDocumentIndex](#hnswdocumentindex)\n",
|
||
|
"3. [WeaviateDocumentIndex](#weaviatedocumentindex)\n",
|
||
|
"4. [ElasticDocIndex](#elasticdocindex)\n",
|
||
|
"5. [QdrantDocumentIndex](#qdrantdocumentindex)\n",
|
||
|
"\n",
|
||
|
"[Movie Retrieval using HnswDocumentIndex](#Movie-Retrieval-using-HnswDocumentIndex)\n",
|
||
|
"\n",
|
||
|
"- [Normal Retriever](#normal-retriever)\n",
|
||
|
"- [Retriever with Filters](#retriever-with-filters)\n",
|
||
|
"- [Retriever with MMR Search](#Retriever-with-MMR-search)\n"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "51db6285-58db-481d-8d24-b13d1888056b",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"# Document Index Backends"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 1,
|
||
|
"id": "b72a4512-6318-4572-adf2-12b06b2d2e72",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"from langchain.retrievers import DocArrayRetriever\n",
|
||
|
"from docarray import BaseDoc\n",
|
||
|
"from docarray.typing import NdArray\n",
|
||
|
"import numpy as np\n",
|
||
|
"from langchain.embeddings import FakeEmbeddings\n",
|
||
|
"import random\n",
|
||
|
"\n",
|
||
|
"embeddings = FakeEmbeddings(size=32)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "bdac41b4-67a1-483f-b3d6-fe662b7bdacd",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"Before you start building the index, it's important to define your document schema. This determines what fields your documents will have and what type of data each field will hold.\n",
|
||
|
"\n",
|
||
|
"For this demonstration, we'll create a somewhat random schema containing 'title' (str), 'title_embedding' (numpy array), 'year' (int), and 'color' (str)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 2,
|
||
|
"id": "8a97c56a-63a0-405c-929f-35e1ded79489",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"class MyDoc(BaseDoc):\n",
|
||
|
" title: str\n",
|
||
|
" title_embedding: NdArray[32]\n",
|
||
|
" year: int\n",
|
||
|
" color: str"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "297bfdb5-6bfe-47ce-90e7-feefc4c160b7",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"source": [
|
||
|
"## InMemoryExactNNIndex\n",
|
||
|
"\n",
|
||
|
"InMemoryExactNNIndex stores all Documentsin memory. It is a great starting point for small datasets, where you may not want to launch a database server.\n",
|
||
|
"\n",
|
||
|
"Learn more here: https://docs.docarray.org/user_guide/storing/index_in_memory/"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 3,
|
||
|
"id": "8b6e6343-88c2-4206-92fd-5a634d39da09",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"from docarray.index import InMemoryExactNNIndex\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"# initialize the index\n",
|
||
|
"db = InMemoryExactNNIndex[MyDoc]()\n",
|
||
|
"# index data\n",
|
||
|
"db.index(\n",
|
||
|
" [\n",
|
||
|
" MyDoc(\n",
|
||
1 year ago
|
" title=f\"My document {i}\",\n",
|
||
|
" title_embedding=embeddings.embed_query(f\"query {i}\"),\n",
|
||
1 year ago
|
" year=i,\n",
|
||
1 year ago
|
" color=random.choice([\"red\", \"green\", \"blue\"]),\n",
|
||
1 year ago
|
" )\n",
|
||
|
" for i in range(100)\n",
|
||
|
" ]\n",
|
||
|
")\n",
|
||
|
"# optionally, you can create a filter query\n",
|
||
|
"filter_query = {\"year\": {\"$lte\": 90}}"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 4,
|
||
|
"id": "142060e5-3e0c-4fa2-9f69-8c91f53617f4",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"name": "stdout",
|
||
|
"output_type": "stream",
|
||
|
"text": [
|
||
|
"[Document(page_content='My document 56', metadata={'id': '1f33e58b6468ab722f3786b96b20afe6', 'year': 56, 'color': 'red'})]\n"
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"# create a retriever\n",
|
||
|
"retriever = DocArrayRetriever(\n",
|
||
1 year ago
|
" index=db,\n",
|
||
|
" embeddings=embeddings,\n",
|
||
|
" search_field=\"title_embedding\",\n",
|
||
|
" content_field=\"title\",\n",
|
||
1 year ago
|
" filters=filter_query,\n",
|
||
|
")\n",
|
||
|
"\n",
|
||
|
"# find the relevant document\n",
|
||
1 year ago
|
"doc = retriever.get_relevant_documents(\"some query\")\n",
|
||
1 year ago
|
"print(doc)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "a9daf2c4-6568-4a49-ba6e-21687962d2c1",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## HnswDocumentIndex\n",
|
||
|
"\n",
|
||
|
"HnswDocumentIndex is a lightweight Document Index implementation that runs fully locally and is best suited for small- to medium-sized datasets. It stores vectors on disk in [hnswlib](https://github.com/nmslib/hnswlib), and stores all other data in [SQLite](https://www.sqlite.org/index.html).\n",
|
||
|
"\n",
|
||
|
"Learn more here: https://docs.docarray.org/user_guide/storing/index_hnswlib/"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 5,
|
||
|
"id": "e0be3c00-470f-4448-92cc-3985f5b05809",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"from docarray.index import HnswDocumentIndex\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"# initialize the index\n",
|
||
1 year ago
|
"db = HnswDocumentIndex[MyDoc](work_dir=\"hnsw_index\")\n",
|
||
1 year ago
|
"\n",
|
||
|
"# index data\n",
|
||
|
"db.index(\n",
|
||
|
" [\n",
|
||
|
" MyDoc(\n",
|
||
1 year ago
|
" title=f\"My document {i}\",\n",
|
||
|
" title_embedding=embeddings.embed_query(f\"query {i}\"),\n",
|
||
1 year ago
|
" year=i,\n",
|
||
1 year ago
|
" color=random.choice([\"red\", \"green\", \"blue\"]),\n",
|
||
1 year ago
|
" )\n",
|
||
|
" for i in range(100)\n",
|
||
|
" ]\n",
|
||
|
")\n",
|
||
|
"# optionally, you can create a filter query\n",
|
||
|
"filter_query = {\"year\": {\"$lte\": 90}}"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 6,
|
||
|
"id": "ea9eb5a0-a8f2-465b-81e2-52fb773466cf",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"name": "stdout",
|
||
|
"output_type": "stream",
|
||
|
"text": [
|
||
|
"[Document(page_content='My document 28', metadata={'id': 'ca9f3f4268eec7c97a7d6e77f541cb82', 'year': 28, 'color': 'red'})]\n"
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"# create a retriever\n",
|
||
|
"retriever = DocArrayRetriever(\n",
|
||
1 year ago
|
" index=db,\n",
|
||
|
" embeddings=embeddings,\n",
|
||
|
" search_field=\"title_embedding\",\n",
|
||
|
" content_field=\"title\",\n",
|
||
1 year ago
|
" filters=filter_query,\n",
|
||
|
")\n",
|
||
|
"\n",
|
||
|
"# find the relevant document\n",
|
||
1 year ago
|
"doc = retriever.get_relevant_documents(\"some query\")\n",
|
||
1 year ago
|
"print(doc)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "7177442e-3fd3-4f3d-ab22-cd8265b35112",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## WeaviateDocumentIndex\n",
|
||
|
"\n",
|
||
|
"WeaviateDocumentIndex is a document index that is built upon [Weaviate](https://weaviate.io/) vector database.\n",
|
||
|
"\n",
|
||
|
"Learn more here: https://docs.docarray.org/user_guide/storing/index_weaviate/"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 7,
|
||
|
"id": "8bcf17ba-8dce-4413-ab4e-61d9baee50e7",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
1 year ago
|
"# There's a small difference with the Weaviate backend compared to the others.\n",
|
||
|
"# Here, you need to 'mark' the field used for vector search with 'is_embedding=True'.\n",
|
||
1 year ago
|
"# So, let's create a new schema for Weaviate that takes care of this requirement.\n",
|
||
|
"\n",
|
||
1 year ago
|
"from pydantic import Field\n",
|
||
|
"\n",
|
||
1 year ago
|
"\n",
|
||
|
"class WeaviateDoc(BaseDoc):\n",
|
||
|
" title: str\n",
|
||
|
" title_embedding: NdArray[32] = Field(is_embedding=True)\n",
|
||
|
" year: int\n",
|
||
|
" color: str"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 8,
|
||
|
"id": "4065dced-3e7e-43d3-8518-b31df1e74383",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"from docarray.index import WeaviateDocumentIndex\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"# initialize the index\n",
|
||
1 year ago
|
"dbconfig = WeaviateDocumentIndex.DBConfig(host=\"http://localhost:8080\")\n",
|
||
1 year ago
|
"db = WeaviateDocumentIndex[WeaviateDoc](db_config=dbconfig)\n",
|
||
|
"\n",
|
||
|
"# index data\n",
|
||
|
"db.index(\n",
|
||
|
" [\n",
|
||
|
" MyDoc(\n",
|
||
1 year ago
|
" title=f\"My document {i}\",\n",
|
||
|
" title_embedding=embeddings.embed_query(f\"query {i}\"),\n",
|
||
1 year ago
|
" year=i,\n",
|
||
1 year ago
|
" color=random.choice([\"red\", \"green\", \"blue\"]),\n",
|
||
1 year ago
|
" )\n",
|
||
|
" for i in range(100)\n",
|
||
|
" ]\n",
|
||
|
")\n",
|
||
|
"# optionally, you can create a filter query\n",
|
||
|
"filter_query = {\"path\": [\"year\"], \"operator\": \"LessThanEqual\", \"valueInt\": \"90\"}"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 9,
|
||
|
"id": "4e21d124-0f3c-445b-b9fc-dc7c8d6b3d2b",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"name": "stdout",
|
||
|
"output_type": "stream",
|
||
|
"text": [
|
||
|
"[Document(page_content='My document 17', metadata={'id': '3a5b76e85f0d0a01785dc8f9d965ce40', 'year': 17, 'color': 'red'})]\n"
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"# create a retriever\n",
|
||
|
"retriever = DocArrayRetriever(\n",
|
||
1 year ago
|
" index=db,\n",
|
||
|
" embeddings=embeddings,\n",
|
||
|
" search_field=\"title_embedding\",\n",
|
||
|
" content_field=\"title\",\n",
|
||
1 year ago
|
" filters=filter_query,\n",
|
||
|
")\n",
|
||
|
"\n",
|
||
|
"# find the relevant document\n",
|
||
1 year ago
|
"doc = retriever.get_relevant_documents(\"some query\")\n",
|
||
1 year ago
|
"print(doc)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "6ee8f920-9297-4b0a-a353-053a86947d10",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## ElasticDocIndex\n",
|
||
|
"\n",
|
||
|
"ElasticDocIndex is a document index that is built upon [ElasticSearch](https://github.com/elastic/elasticsearch)\n",
|
||
|
"\n",
|
||
|
"Learn more here: https://docs.docarray.org/user_guide/storing/index_elastic/"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 10,
|
||
|
"id": "92980ead-e4dc-4eef-8618-1c0583f76d7a",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"from docarray.index import ElasticDocIndex\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"# initialize the index\n",
|
||
|
"db = ElasticDocIndex[MyDoc](\n",
|
||
1 year ago
|
" hosts=\"http://localhost:9200\", index_name=\"docarray_retriever\"\n",
|
||
1 year ago
|
")\n",
|
||
|
"\n",
|
||
|
"# index data\n",
|
||
|
"db.index(\n",
|
||
|
" [\n",
|
||
|
" MyDoc(\n",
|
||
1 year ago
|
" title=f\"My document {i}\",\n",
|
||
|
" title_embedding=embeddings.embed_query(f\"query {i}\"),\n",
|
||
1 year ago
|
" year=i,\n",
|
||
1 year ago
|
" color=random.choice([\"red\", \"green\", \"blue\"]),\n",
|
||
1 year ago
|
" )\n",
|
||
|
" for i in range(100)\n",
|
||
|
" ]\n",
|
||
|
")\n",
|
||
|
"# optionally, you can create a filter query\n",
|
||
|
"filter_query = {\"range\": {\"year\": {\"lte\": 90}}}"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 11,
|
||
|
"id": "8a8e97f3-c3a1-4c7f-b776-363c5e7dd69d",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"name": "stdout",
|
||
|
"output_type": "stream",
|
||
|
"text": [
|
||
|
"[Document(page_content='My document 46', metadata={'id': 'edbc721bac1c2ad323414ad1301528a4', 'year': 46, 'color': 'green'})]\n"
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"# create a retriever\n",
|
||
|
"retriever = DocArrayRetriever(\n",
|
||
1 year ago
|
" index=db,\n",
|
||
|
" embeddings=embeddings,\n",
|
||
|
" search_field=\"title_embedding\",\n",
|
||
|
" content_field=\"title\",\n",
|
||
1 year ago
|
" filters=filter_query,\n",
|
||
|
")\n",
|
||
|
"\n",
|
||
|
"# find the relevant document\n",
|
||
1 year ago
|
"doc = retriever.get_relevant_documents(\"some query\")\n",
|
||
1 year ago
|
"print(doc)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "281432f8-87a5-4f22-a582-9d5dac33d158",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## QdrantDocumentIndex\n",
|
||
|
"\n",
|
||
|
"QdrantDocumentIndex is a document index that is build upon [Qdrant](https://qdrant.tech/) vector database\n",
|
||
|
"\n",
|
||
|
"Learn more here: https://docs.docarray.org/user_guide/storing/index_qdrant/"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 12,
|
||
|
"id": "b6fd91d0-630a-4974-bdf1-6dfa4d1a68f5",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"name": "stderr",
|
||
|
"output_type": "stream",
|
||
|
"text": [
|
||
|
"WARNING:root:Payload indexes have no effect in the local Qdrant. Please use server Qdrant if you need payload indexes.\n"
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"from docarray.index import QdrantDocumentIndex\n",
|
||
|
"from qdrant_client.http import models as rest\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"# initialize the index\n",
|
||
|
"qdrant_config = QdrantDocumentIndex.DBConfig(path=\":memory:\")\n",
|
||
|
"db = QdrantDocumentIndex[MyDoc](qdrant_config)\n",
|
||
|
"\n",
|
||
|
"# index data\n",
|
||
|
"db.index(\n",
|
||
|
" [\n",
|
||
|
" MyDoc(\n",
|
||
1 year ago
|
" title=f\"My document {i}\",\n",
|
||
|
" title_embedding=embeddings.embed_query(f\"query {i}\"),\n",
|
||
1 year ago
|
" year=i,\n",
|
||
1 year ago
|
" color=random.choice([\"red\", \"green\", \"blue\"]),\n",
|
||
1 year ago
|
" )\n",
|
||
|
" for i in range(100)\n",
|
||
|
" ]\n",
|
||
|
")\n",
|
||
|
"# optionally, you can create a filter query\n",
|
||
|
"filter_query = rest.Filter(\n",
|
||
|
" must=[\n",
|
||
|
" rest.FieldCondition(\n",
|
||
|
" key=\"year\",\n",
|
||
|
" range=rest.Range(\n",
|
||
|
" gte=10,\n",
|
||
|
" lt=90,\n",
|
||
|
" ),\n",
|
||
|
" )\n",
|
||
|
" ]\n",
|
||
|
")"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 13,
|
||
|
"id": "a6dd6460-7175-48ee-8cfb-9a0abf35ec13",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"name": "stdout",
|
||
|
"output_type": "stream",
|
||
|
"text": [
|
||
|
"[Document(page_content='My document 80', metadata={'id': '97465f98d0810f1f330e4ecc29b13d20', 'year': 80, 'color': 'blue'})]\n"
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"# create a retriever\n",
|
||
|
"retriever = DocArrayRetriever(\n",
|
||
1 year ago
|
" index=db,\n",
|
||
|
" embeddings=embeddings,\n",
|
||
|
" search_field=\"title_embedding\",\n",
|
||
|
" content_field=\"title\",\n",
|
||
1 year ago
|
" filters=filter_query,\n",
|
||
|
")\n",
|
||
|
"\n",
|
||
|
"# find the relevant document\n",
|
||
1 year ago
|
"doc = retriever.get_relevant_documents(\"some query\")\n",
|
||
1 year ago
|
"print(doc)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "3afb65b0-c620-411a-855f-1aa81481bdbb",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"# Movie Retrieval using HnswDocumentIndex"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 14,
|
||
|
"id": "07b71d96-381e-4965-b525-af9f7cc5f86c",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"movies = [\n",
|
||
|
" {\n",
|
||
|
" \"title\": \"Inception\",\n",
|
||
|
" \"description\": \"A thief who steals corporate secrets through the use of dream-sharing technology is given the task of planting an idea into the mind of a CEO.\",\n",
|
||
|
" \"director\": \"Christopher Nolan\",\n",
|
||
|
" \"rating\": 8.8,\n",
|
||
|
" },\n",
|
||
|
" {\n",
|
||
|
" \"title\": \"The Dark Knight\",\n",
|
||
|
" \"description\": \"When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.\",\n",
|
||
|
" \"director\": \"Christopher Nolan\",\n",
|
||
|
" \"rating\": 9.0,\n",
|
||
|
" },\n",
|
||
|
" {\n",
|
||
|
" \"title\": \"Interstellar\",\n",
|
||
|
" \"description\": \"Interstellar explores the boundaries of human exploration as a group of astronauts venture through a wormhole in space. In their quest to ensure the survival of humanity, they confront the vastness of space-time and grapple with love and sacrifice.\",\n",
|
||
|
" \"director\": \"Christopher Nolan\",\n",
|
||
|
" \"rating\": 8.6,\n",
|
||
|
" },\n",
|
||
|
" {\n",
|
||
|
" \"title\": \"Pulp Fiction\",\n",
|
||
|
" \"description\": \"The lives of two mob hitmen, a boxer, a gangster's wife, and a pair of diner bandits intertwine in four tales of violence and redemption.\",\n",
|
||
|
" \"director\": \"Quentin Tarantino\",\n",
|
||
|
" \"rating\": 8.9,\n",
|
||
|
" },\n",
|
||
|
" {\n",
|
||
|
" \"title\": \"Reservoir Dogs\",\n",
|
||
|
" \"description\": \"When a simple jewelry heist goes horribly wrong, the surviving criminals begin to suspect that one of them is a police informant.\",\n",
|
||
|
" \"director\": \"Quentin Tarantino\",\n",
|
||
|
" \"rating\": 8.3,\n",
|
||
|
" },\n",
|
||
|
" {\n",
|
||
|
" \"title\": \"The Godfather\",\n",
|
||
|
" \"description\": \"An aging patriarch of an organized crime dynasty transfers control of his empire to his reluctant son.\",\n",
|
||
|
" \"director\": \"Francis Ford Coppola\",\n",
|
||
|
" \"rating\": 9.2,\n",
|
||
|
" },\n",
|
||
1 year ago
|
"]"
|
||
1 year ago
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 15,
|
||
|
"id": "1860edfb-936d-4cd8-a167-e8f9c4617709",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"name": "stdin",
|
||
|
"output_type": "stream",
|
||
|
"text": [
|
||
|
"OpenAI API Key: ········\n"
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"import getpass\n",
|
||
1 year ago
|
"import os\n",
|
||
1 year ago
|
"\n",
|
||
1 year ago
|
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
|
||
1 year ago
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 16,
|
||
|
"id": "0538541d-26ea-4323-96b9-47768c75dcd8",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"from docarray import BaseDoc, DocList\n",
|
||
|
"from docarray.typing import NdArray\n",
|
||
|
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
|
||
|
"\n",
|
||
1 year ago
|
"\n",
|
||
1 year ago
|
"# define schema for your movie documents\n",
|
||
|
"class MyDoc(BaseDoc):\n",
|
||
|
" title: str\n",
|
||
|
" description: str\n",
|
||
|
" description_embedding: NdArray[1536]\n",
|
||
|
" rating: float\n",
|
||
|
" director: str\n",
|
||
1 year ago
|
"\n",
|
||
1 year ago
|
"\n",
|
||
|
"embeddings = OpenAIEmbeddings()\n",
|
||
|
"\n",
|
||
|
"\n",
|
||
|
"# get \"description\" embeddings, and create documents\n",
|
||
|
"docs = DocList[MyDoc](\n",
|
||
|
" [\n",
|
||
|
" MyDoc(\n",
|
||
|
" description_embedding=embeddings.embed_query(movie[\"description\"]), **movie\n",
|
||
|
" )\n",
|
||
|
" for movie in movies\n",
|
||
|
" ]\n",
|
||
|
")"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 17,
|
||
|
"id": "f5ae1b41-0372-47ea-89bb-c6ad968a2919",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [],
|
||
|
"source": [
|
||
|
"from docarray.index import HnswDocumentIndex\n",
|
||
|
"\n",
|
||
|
"# initialize the index\n",
|
||
1 year ago
|
"db = HnswDocumentIndex[MyDoc](work_dir=\"movie_search\")\n",
|
||
1 year ago
|
"\n",
|
||
|
"# add data\n",
|
||
|
"db.index(docs)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "9ca3f91b-ed11-490b-b60a-0d1d9b50a5b2",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"source": [
|
||
|
"## Normal Retriever"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 18,
|
||
|
"id": "efdb5cbf-218e-48a6-af0f-25b7a510e343",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"name": "stdout",
|
||
|
"output_type": "stream",
|
||
|
"text": [
|
||
|
"[Document(page_content='A thief who steals corporate secrets through the use of dream-sharing technology is given the task of planting an idea into the mind of a CEO.', metadata={'id': 'f1649d5b6776db04fec9a116bbb6bbe5', 'title': 'Inception', 'rating': 8.8, 'director': 'Christopher Nolan'})]\n"
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"from langchain.retrievers import DocArrayRetriever\n",
|
||
|
"\n",
|
||
|
"# create a retriever\n",
|
||
|
"retriever = DocArrayRetriever(\n",
|
||
1 year ago
|
" index=db,\n",
|
||
|
" embeddings=embeddings,\n",
|
||
|
" search_field=\"description_embedding\",\n",
|
||
|
" content_field=\"description\",\n",
|
||
1 year ago
|
")\n",
|
||
|
"\n",
|
||
|
"# find the relevant document\n",
|
||
1 year ago
|
"doc = retriever.get_relevant_documents(\"movie about dreams\")\n",
|
||
1 year ago
|
"print(doc)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "3defa711-51df-4b48-b02a-306706cfacd0",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## Retriever with Filters"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 19,
|
||
|
"id": "205a9fe8-13bb-4280-9485-f6973bbc6943",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"name": "stdout",
|
||
|
"output_type": "stream",
|
||
|
"text": [
|
||
|
"[Document(page_content='Interstellar explores the boundaries of human exploration as a group of astronauts venture through a wormhole in space. In their quest to ensure the survival of humanity, they confront the vastness of space-time and grapple with love and sacrifice.', metadata={'id': 'ab704cc7ae8573dc617f9a5e25df022a', 'title': 'Interstellar', 'rating': 8.6, 'director': 'Christopher Nolan'}), Document(page_content='A thief who steals corporate secrets through the use of dream-sharing technology is given the task of planting an idea into the mind of a CEO.', metadata={'id': 'f1649d5b6776db04fec9a116bbb6bbe5', 'title': 'Inception', 'rating': 8.8, 'director': 'Christopher Nolan'})]\n"
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"from langchain.retrievers import DocArrayRetriever\n",
|
||
|
"\n",
|
||
|
"# create a retriever\n",
|
||
|
"retriever = DocArrayRetriever(\n",
|
||
1 year ago
|
" index=db,\n",
|
||
|
" embeddings=embeddings,\n",
|
||
|
" search_field=\"description_embedding\",\n",
|
||
|
" content_field=\"description\",\n",
|
||
|
" filters={\"director\": {\"$eq\": \"Christopher Nolan\"}},\n",
|
||
1 year ago
|
" top_k=2,\n",
|
||
|
")\n",
|
||
|
"\n",
|
||
|
"# find relevant documents\n",
|
||
1 year ago
|
"docs = retriever.get_relevant_documents(\"space travel\")\n",
|
||
1 year ago
|
"print(docs)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "markdown",
|
||
|
"id": "fa10afa6-1554-4c2b-8afc-cff44e32d2f8",
|
||
|
"metadata": {},
|
||
|
"source": [
|
||
|
"## Retriever with MMR search"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": 20,
|
||
|
"id": "b7305599-b166-419c-8e1e-8ff7c247cce6",
|
||
|
"metadata": {
|
||
|
"tags": []
|
||
|
},
|
||
|
"outputs": [
|
||
|
{
|
||
|
"name": "stdout",
|
||
|
"output_type": "stream",
|
||
|
"text": [
|
||
|
"[Document(page_content=\"The lives of two mob hitmen, a boxer, a gangster's wife, and a pair of diner bandits intertwine in four tales of violence and redemption.\", metadata={'id': 'e6aa313bbde514e23fbc80ab34511afd', 'title': 'Pulp Fiction', 'rating': 8.9, 'director': 'Quentin Tarantino'}), Document(page_content='A thief who steals corporate secrets through the use of dream-sharing technology is given the task of planting an idea into the mind of a CEO.', metadata={'id': 'f1649d5b6776db04fec9a116bbb6bbe5', 'title': 'Inception', 'rating': 8.8, 'director': 'Christopher Nolan'}), Document(page_content='When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.', metadata={'id': '91dec17d4272041b669fd113333a65f7', 'title': 'The Dark Knight', 'rating': 9.0, 'director': 'Christopher Nolan'})]\n"
|
||
|
]
|
||
|
}
|
||
|
],
|
||
|
"source": [
|
||
|
"from langchain.retrievers import DocArrayRetriever\n",
|
||
|
"\n",
|
||
|
"# create a retriever\n",
|
||
|
"retriever = DocArrayRetriever(\n",
|
||
1 year ago
|
" index=db,\n",
|
||
|
" embeddings=embeddings,\n",
|
||
|
" search_field=\"description_embedding\",\n",
|
||
|
" content_field=\"description\",\n",
|
||
|
" filters={\"rating\": {\"$gte\": 8.7}},\n",
|
||
|
" search_type=\"mmr\",\n",
|
||
1 year ago
|
" top_k=3,\n",
|
||
|
")\n",
|
||
|
"\n",
|
||
|
"# find relevant documents\n",
|
||
1 year ago
|
"docs = retriever.get_relevant_documents(\"action movies\")\n",
|
||
1 year ago
|
"print(docs)"
|
||
|
]
|
||
|
},
|
||
|
{
|
||
|
"cell_type": "code",
|
||
|
"execution_count": null,
|
||
|
"id": "4865cf25-48af-4d60-9337-9528b9b30f28",
|
||
|
"metadata": {},
|
||
|
"outputs": [],
|
||
|
"source": []
|
||
|
}
|
||
|
],
|
||
|
"metadata": {
|
||
|
"kernelspec": {
|
||
|
"display_name": "Python 3 (ipykernel)",
|
||
|
"language": "python",
|
||
|
"name": "python3"
|
||
|
},
|
||
|
"language_info": {
|
||
|
"codemirror_mode": {
|
||
|
"name": "ipython",
|
||
|
"version": 3
|
||
|
},
|
||
|
"file_extension": ".py",
|
||
|
"mimetype": "text/x-python",
|
||
|
"name": "python",
|
||
|
"nbconvert_exporter": "python",
|
||
|
"pygments_lexer": "ipython3",
|
||
|
"version": "3.9.17"
|
||
|
}
|
||
|
},
|
||
|
"nbformat": 4,
|
||
|
"nbformat_minor": 5
|
||
|
}
|