langchain/docs/extras/integrations/retrievers/elastic_search_bm25.ipynb

2023-04-05 04:29:06 +00:00
{
"cells": [
{
"cell_type": "markdown",
"id": "ab66dd43",
"metadata": {},
"source": [
"# Elasticsearch BM25\n",
"\n",
">[Elasticsearch](https://www.elastic.co/elasticsearch/) is a distributed, RESTful search and analytics engine. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.\n",
"\n",
">In information retrieval, [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25) (BM is an abbreviation of best matching) is a ranking function used by search engines to estimate the relevance of documents to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others.\n",
"\n",
">The name of the actual ranking function is BM25. The fuller name, Okapi BM25, includes the name of the first system to use it, which was the Okapi information retrieval system, implemented at London's City University in the 1980s and 1990s. BM25 and its newer variants, e.g. BM25F (a version of BM25 that can take document structure and anchor text into account), represent TF-IDF-like retrieval functions used in document retrieval.\n",
"\n",
"This notebook shows how to use `ElasticSearchBM25Retriever`, a retriever that stores texts in an `Elasticsearch` index and ranks them with `BM25`.\n",
"\n",
"For details on how BM25 works, see [this blog post](https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "51b49135-a61a-49e8-869d-7c1d76794cd7",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"#!pip install elasticsearch"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "393ac030",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain.retrievers import ElasticSearchBM25Retriever"
]
},
{
"cell_type": "markdown",
"id": "aaf80e7f",
"metadata": {},
"source": [
"## Create New Retriever"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bcb3c8c2",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"elasticsearch_url = \"http://localhost:9200\"\n",
"retriever = ElasticSearchBM25Retriever.create(elasticsearch_url, \"langchain-index-4\")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "b605284d",
"metadata": {},
"outputs": [],
"source": [
"# Alternatively, you can load an existing index\n",
"# import elasticsearch\n",
"# elasticsearch_url = \"http://localhost:9200\"\n",
"# retriever = ElasticSearchBM25Retriever(elasticsearch.Elasticsearch(elasticsearch_url), \"langchain-index\")"
]
},
{
"cell_type": "markdown",
"id": "1c518c42",
"metadata": {},
"source": [
"## Add texts (if necessary)\n",
"\n",
"Optionally, add texts to the retriever if they are not already in the index. `add_texts` returns the IDs of the added documents."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "98b1c017",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['cbd4cb47-8d9f-4f34-b80e-ea871bc49856',\n",
" 'f3bd2e24-76d1-4f9b-826b-ec4c0e8c7365',\n",
" '8631bfc8-7c12-48ee-ab56-8ad5f373676e',\n",
" '8be8374c-3253-4d87-928d-d73550a2ecf0',\n",
" 'd79f457b-2842-4eab-ae10-77aa420b53d7']"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever.add_texts([\"foo\", \"bar\", \"world\", \"hello\", \"foo bar\"])"
]
},
{
"cell_type": "markdown",
"id": "08437fa2",
"metadata": {},
"source": [
"## Use Retriever\n",
"\n",
"We can now query the retriever. `get_relevant_documents` returns the documents that match the query."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "c0455218",
"metadata": {},
"outputs": [],
"source": [
"result = retriever.get_relevant_documents(\"foo\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "7dfa5c29",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='foo', metadata={}),\n",
" Document(page_content='foo bar', metadata={})]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "74bd9256",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
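
For intuition about the ranking the notebook's retriever produces, here is a minimal, self-contained sketch of a textbook BM25 scorer in plain Python. It is not Elasticsearch's implementation: the function name `bm25_scores`, the whitespace tokenizer, and the Lucene-style IDF are assumptions made for illustration, and `k1=1.2`, `b=0.75` are chosen to mirror Elasticsearch's documented defaults.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with a textbook BM25 formula.

    Hypothetical helper for illustration only; it does not call Elasticsearch.
    """
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Lucene-style IDF, which stays positive even for common terms.
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: longer-than-average documents score lower.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = ["foo", "bar", "world", "hello", "foo bar"]
scores = bm25_scores("foo", corpus)

# Keep only matching documents, best score first.
ranked = sorted(zip(corpus, scores), key=lambda pair: pair[1], reverse=True)
hits = [doc for doc, score in ranked if score > 0]
print(hits)  # ['foo', 'foo bar']
```

On the toy corpus from this notebook, the sketch ranks `foo` ahead of `foo bar` because the shorter document is penalized less by length normalization. This agrees with the order of the retriever output shown in the notebook, though absolute scores from a real cluster may differ.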