langchain/docs/modules/indexes/retrievers/examples/pinecone_hybrid_search.ipynb
2023-04-04 14:09:57 -07:00

255 lines
5.2 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"id": "ab66dd43",
"metadata": {},
"source": [
"# Pinecone Hybrid Search\n",
"\n",
"This notebook goes over how to use a retriever that under the hood uses Pinecone and Hybrid Search.\n",
"\n",
"The logic of this retriever is largely taken from [this blog post](https://www.pinecone.io/learn/hybrid-search-intro/)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "393ac030",
"metadata": {},
"outputs": [],
"source": [
"from langchain.retrievers import PineconeHybridSearchRetriever"
]
},
{
"cell_type": "markdown",
"id": "aaf80e7f",
"metadata": {},
"source": [
"## Setup Pinecone"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "15390796",
"metadata": {},
"outputs": [],
"source": [
"import pinecone # !pip install pinecone-client\n",
"\n",
"pinecone.init(\n",
" api_key=\"...\", # API key here\n",
" environment=\"...\" # find next to api key in console\n",
")\n",
"# choose a name for your index\n",
"index_name = \"...\""
]
},
{
"cell_type": "markdown",
"id": "95d5d7f9",
"metadata": {},
"source": [
"You should only have to do this part once."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cfa3a8d8",
"metadata": {},
"outputs": [],
"source": [
"# create the index\n",
"pinecone.create_index(\n",
" name = index_name,\n",
" dimension = 1536, # dimensionality of dense model\n",
" metric = \"dotproduct\",\n",
" pod_type = \"s1\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "e01549af",
"metadata": {},
"source": [
"Now that its created, we can use it"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "bcb3c8c2",
"metadata": {},
"outputs": [],
"source": [
"index = pinecone.Index(index_name)"
]
},
{
"cell_type": "markdown",
"id": "dbc025d6",
"metadata": {},
"source": [
"## Get embeddings and tokenizers\n",
"\n",
"Embeddings are used for the dense vectors, tokenizer is used for the sparse vector"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "2f63c911",
"metadata": {},
"outputs": [],
"source": [
"from langchain.embeddings import OpenAIEmbeddings\n",
"embeddings = OpenAIEmbeddings()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "c3f030e5",
"metadata": {},
"outputs": [],
"source": [
"from transformers import BertTokenizerFast # !pip install transformers\n",
"\n",
"# load bert tokenizer from huggingface\n",
"tokenizer = BertTokenizerFast.from_pretrained(\n",
" 'bert-base-uncased'\n",
")"
]
},
{
"cell_type": "markdown",
"id": "5462801e",
"metadata": {},
"source": [
"## Load Retriever\n",
"\n",
"We can now construct the retriever!"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "ac77d835",
"metadata": {},
"outputs": [],
"source": [
"retriever = PineconeHybridSearchRetriever(embeddings=embeddings, index=index, tokenizer=tokenizer)"
]
},
{
"cell_type": "markdown",
"id": "1c518c42",
"metadata": {},
"source": [
"## Add texts (if necessary)\n",
"\n",
"We can optionally add texts to the retriever (if they aren't already in there)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "98b1c017",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "4d6f3ee7ca754d07a1a18d100d99e0cd",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"retriever.add_texts([\"foo\", \"bar\", \"world\", \"hello\"])"
]
},
{
"cell_type": "markdown",
"id": "08437fa2",
"metadata": {},
"source": [
"## Use Retriever\n",
"\n",
"We can now use the retriever!"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "c0455218",
"metadata": {},
"outputs": [],
"source": [
"result = retriever.get_relevant_documents(\"foo\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "7dfa5c29",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='foo', metadata={})"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "74bd9256",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}