{
"cells": [
{
"cell_type": "markdown",
"id": "f96b815e",
"metadata": {},
"source": [
"# Retrieval augmented generation using Elasticsearch and OpenAI"
]
},
{
"cell_type": "markdown",
"id": "e0f537af",
"metadata": {},
"source": [
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openai/openai-cookbook/blob/main/examples/vector_databases/elasticsearch/elasticsearch-retrieval-augmented-generation.ipynb)\n"
]
},
{
"cell_type": "markdown",
"id": "349e0e74",
"metadata": {},
"source": [
"This notebook demonstrates how to: \n",
"- Index the OpenAI Wikipedia vector dataset into Elasticsearch \n",
"- Embed a question with the OpenAI [`embeddings`](https://platform.openai.com/docs/api-reference/embeddings) endpoint\n",
"- Perform semantic search on the Elasticsearch index using the encoded question\n",
"- Send the top search results to the OpenAI [Chat Completions](https://platform.openai.com/docs/guides/gpt/chat-completions-api) API endpoint for retrieval augmented generation (RAG)\n",
"\n",
" If you've already worked through our semantic search notebook, you can skip ahead to the final step!"
]
},
{
"cell_type": "markdown",
"id": "aa9576ca",
"metadata": {},
"source": [
"## Install packages and import modules "
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "8c304b93",
"metadata": {},
"outputs": [],
"source": [
"# install packages\n",
"\n",
"!python3 -m pip install -qU openai pandas wget elasticsearch\n",
"\n",
"# import modules\n",
"\n",
"from getpass import getpass\n",
"from elasticsearch import Elasticsearch, helpers\n",
"import wget\n",
"import zipfile\n",
"import pandas as pd\n",
"import json\n",
"import openai"
]
},
{
"cell_type": "markdown",
"id": "de32a789",
"metadata": {},
"source": [
"## Connect to Elasticsearch\n",
"\n",
" We're using an Elastic Cloud deployment of Elasticsearch for this notebook.\n",
"If you don't already have an Elastic deployment, you can sign up for a free [Elastic Cloud trial](https://cloud.elastic.co/registration?utm_source=github&utm_content=openai-cookbook).\n",
"\n",
"To connect to Elasticsearch, you need to create a client instance with the Cloud ID and password for your deployment.\n",
"\n",
"Find the Cloud ID for your deployment by going to https://cloud.elastic.co/deployments and selecting your deployment."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "3a57b6a8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'name': 'instance-0000000001', 'cluster_name': '29ef9817e13142f5ba0ea7b29c2a86e2', 'cluster_uuid': 'absjWgQvRw63IlwWKisN8w', 'version': {'number': '8.9.1', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': 'a813d015ef1826148d9d389bd1c0d781c6e349f0', 'build_date': '2023-08-10T05:02:32.517455352Z', 'build_snapshot': False, 'lucene_version': '9.7.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'}\n"
]
}
],
"source": [
"CLOUD_ID = getpass(\"Elastic deployment Cloud ID\")\n",
"CLOUD_PASSWORD = getpass(\"Elastic deployment Password\")\n",
"client = Elasticsearch(\n",
"    cloud_id=CLOUD_ID,\n",
"    basic_auth=(\"elastic\", CLOUD_PASSWORD),  # Alternatively, use `api_key` instead of `basic_auth`\n",
")\n",
"\n",
"# Test connection to Elasticsearch\n",
"print(client.info())"
]
},
{
"cell_type": "markdown",
"id": "80b55952",
"metadata": {},
"source": [
"## Download the dataset\n",
"\n",
"In this step we download the OpenAI Wikipedia embeddings dataset and extract the zip file into a `data` directory."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c584f15c",
"metadata": {},
"outputs": [],
"source": [
"embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'\n",
"wget.download(embeddings_url)\n",
"\n",
"with zipfile.ZipFile(\"vector_database_wikipedia_articles_embedded.zip\", \"r\") as zip_ref:\n",
"    zip_ref.extractall(\"data\")"
]
},
{
"cell_type": "markdown",
"id": "9654ac08",
"metadata": {},
"source": [
"## Read CSV file into a Pandas DataFrame\n",
"\n",
"Next we use the Pandas library to read the unzipped CSV file into a DataFrame. This step makes it easier to index the data into Elasticsearch in bulk."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "76347d10",
"metadata": {},
"outputs": [],
"source": [
"\n",
"wikipedia_dataframe = pd.read_csv(\"data/vector_database_wikipedia_articles_embedded.csv\")"
]
},
{
"cell_type": "markdown",
"id": "6af9f5ad",
"metadata": {},
"source": [
"## Create index with mapping\n",
"\n",
"Now we need to create an Elasticsearch index with the necessary mappings. This will enable us to index the data into Elasticsearch.\n",
"\n",
"We use the `dense_vector` field type for the `title_vector` and `content_vector` fields. This is a special field type that allows us to store dense vectors in Elasticsearch.\n",
"\n",
"Later, we'll need to target the `dense_vector` field for kNN search.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "681989b3",
"metadata": {},
"outputs": [],
"source": [
"index_mapping = {\n",
"    \"properties\": {\n",
"        \"title_vector\": {\n",
"            \"type\": \"dense_vector\",\n",
"            \"dims\": 1536,\n",
"            \"index\": True,\n",
"            \"similarity\": \"cosine\"\n",
"        },\n",
"        \"content_vector\": {\n",
"            \"type\": \"dense_vector\",\n",
"            \"dims\": 1536,\n",
"            \"index\": True,\n",
"            \"similarity\": \"cosine\"\n",
"        },\n",
"        \"text\": {\"type\": \"text\"},\n",
"        \"title\": {\"type\": \"text\"},\n",
"        \"url\": {\"type\": \"keyword\"},\n",
"        \"vector_id\": {\"type\": \"long\"}\n",
"    }\n",
"}\n",
"\n",
"client.indices.create(index=\"wikipedia_vector_index\", mappings=index_mapping)"
]
},
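{
"cell_type": "markdown",
"id": "cosine-sketch-note",
"metadata": {},
"source": [
"The next cell is an illustration only and is not part of the indexing flow. It computes cosine similarity in plain Python on toy 3-dimensional vectors (the real fields have 1536 dimensions). The helper names `cosine_similarity` and `knn_score` and the sample vectors are made up for this example. The `(1 + cosine) / 2` conversion reflects how the Elasticsearch documentation describes `_score` for `cosine` similarity; treat it as an assumption and check it against your Elasticsearch version."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cosine-sketch-code",
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def cosine_similarity(a, b):\n",
"    # Dot product divided by the product of the vector lengths\n",
"    dot = sum(x * y for x, y in zip(a, b))\n",
"    norm_a = math.sqrt(sum(x * x for x in a))\n",
"    norm_b = math.sqrt(sum(y * y for y in b))\n",
"    return dot / (norm_a * norm_b)\n",
"\n",
"def knn_score(a, b):\n",
"    # Assumed mapping from cosine similarity to a kNN `_score` in [0, 1]\n",
"    return (1 + cosine_similarity(a, b)) / 2\n",
"\n",
"# Toy vectors, made up for illustration\n",
"query_vector = [1.0, 0.0, 0.0]\n",
"document_vector = [1.0, 1.0, 0.0]\n",
"\n",
"print(cosine_similarity(query_vector, document_vector))\n",
"print(knn_score(query_vector, document_vector))"
]
},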
{
"cell_type": "markdown",
"id": "c2fb582e",
"metadata": {},
"source": [
"## Index data into Elasticsearch \n",
"\n",
"The following function generates the required bulk actions that can be passed to Elasticsearch's Bulk API, so we can index multiple documents efficiently in a single request.\n",
"\n",
"For each row in the DataFrame, the function yields a dictionary representing a single document to be indexed. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "efee9b97",
"metadata": {},
"outputs": [],
"source": [
"def dataframe_to_bulk_actions(df):\n",
"    # Yield one bulk action (index name, document ID, and source fields) per row\n",
"    for _, row in df.iterrows():\n",
"        yield {\n",
"            \"_index\": \"wikipedia_vector_index\",\n",
"            \"_id\": row[\"id\"],\n",
"            \"_source\": {\n",
"                \"url\": row[\"url\"],\n",
"                \"title\": row[\"title\"],\n",
"                \"text\": row[\"text\"],\n",
"                # Vectors are stored as JSON strings in the CSV, so parse them into lists\n",
"                \"title_vector\": json.loads(row[\"title_vector\"]),\n",
"                \"content_vector\": json.loads(row[\"content_vector\"]),\n",
"                \"vector_id\": row[\"vector_id\"]\n",
"            }\n",
"        }"
]
},
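{
"cell_type": "markdown",
"id": "vector-parse-note",
"metadata": {},
"source": [
"The function above calls `json.loads` because the CSV stores each embedding as a JSON-formatted string. The toy row in the next cell is hypothetical (made-up values, 3 dimensions instead of 1536) and only mimics that layout."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "vector-parse-code",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Hypothetical row mimicking the CSV layout: the vector arrives as a string\n",
"row = {\"id\": 1, \"title\": \"Example\", \"title_vector\": \"[0.1, 0.2, 0.3]\"}\n",
"\n",
"# json.loads turns the string into a list of floats\n",
"title_vector = json.loads(row[\"title_vector\"])\n",
"print(type(title_vector).__name__, len(title_vector))"
]
},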
{
"cell_type": "markdown",
"id": "b8164b38",
"metadata": {},
"source": [
"As the dataframe is large, we will index data in batches of `100`. We index the data into Elasticsearch using the Python client's [helpers](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/client-helpers.html#bulk-helpers) for the bulk API."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aacb5e9c",
"metadata": {},
"outputs": [],
"source": [
"start = 0\n",
"end = len(wikipedia_dataframe)\n",
"batch_size = 100\n",
"for batch_start in range(start, end, batch_size):\n",
"    batch_end = min(batch_start + batch_size, end)\n",
"    batch_dataframe = wikipedia_dataframe.iloc[batch_start:batch_end]\n",
"    actions = dataframe_to_bulk_actions(batch_dataframe)\n",
"    helpers.bulk(client, actions)"
]
},
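{
"cell_type": "markdown",
"id": "batch-ranges-note",
"metadata": {},
"source": [
"The loop above walks the DataFrame in fixed-size slices. As a small standalone sketch (the helper `batch_ranges` is hypothetical and not part of the Elasticsearch client), the next cell prints the slice boundaries for a toy dataset of 250 rows, showing that the last batch is simply shorter."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "batch-ranges-code",
"metadata": {},
"outputs": [],
"source": [
"def batch_ranges(total, batch_size):\n",
"    # Yield (start, end) pairs that cover range(total) in order\n",
"    for batch_start in range(0, total, batch_size):\n",
"        yield batch_start, min(batch_start + batch_size, total)\n",
"\n",
"print(list(batch_ranges(250, 100)))"
]
},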
{
"cell_type": "markdown",
"id": "091ffc51",
"metadata": {},
"source": [
"Let's test the index with a simple match query."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "2ccc8955",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'took': 10, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 4, 'relation': 'eq'}, 'max_score': 14.917897, 'hits': [{'_index': 'wikipedia_vector_index', '_id': '34227', '_score': 14.917897, '_source': {'url': 'https://simple.wikipedia.org/wiki/Hummingbird', 'title': 'Hummingbird', 'text': \"Hummingbirds are small birds of the family Trochilidae.\\n\\nThey are among the smallest of birds: most species measure 7.513\\xa0cm (35\\xa0in). The smallest living bird species is the 25\\xa0cm Bee Hummingbird. They can hover in mid-air by rapidly flapping their wings 1280 times per second (depending on the species). They are also the only group of birds able to fly backwards. Their rapid wing beats do actually hum. They can fly at speeds over 15\\xa0m/s (54\\xa0km/h, 34\\xa0mi/h).\\n\\nEating habits and pollination \\nHummingbirds help flowers to pollinate, though most insects are best known for doing so. The hummingbird enjoys nectar, like the butterfly and other flower-loving insects, such as bees.\\n\\nHummingbirds do not have a good sense of smell; instead, they are attracted to color, especially the color red. Unlike the butterfly, the hummingbird hovers over the flower as it drinks nectar from it, like a moth. When it does so, it flaps its wings very quickly to stay in one place, which makes it look like a blur and also beats so fast it makes a humming sound. A hummingbird sometimes puts its whole head into the flower to drink the nectar properly. When it takes its head back out, its head is covered with yellow pollen, so that when it moves to another flower, it can pollinate. Or sometimes it may pollinate with its beak.\\n\\nLike bees, hummingbirds can assess the amount of sugar in the nectar they eat. They reject flowers whose nectar has less than 10% sugar. Nectar is a poor source of nutrients, so hummingbirds meet their needs for protein, amino acids, vitamins, minerals, etc. 
by preying on insects and spiders.\\n\\nFeeding apparatus \\nMost hummingbirds have bills that are long and straight or nearly so, but in some species the bill shape is adapted for specialized feeding. Thornbills have short, sharp bills adapted for feeding from flowers with short corollas and piercing the bases of longer ones. The Sicklebills' extremely decurved bills are adapted to extracting nectar from the curved corollas of flowers in the family Gesneriaceae. The bill of the Fiery-tailed Awlbill has an upturned tip, as in the Avocets. The male Tooth-billed Hummingbird has barracuda-like spikes at the tip of its long, straight bill.\\n\\nThe two halves of a hummingbird's bill have a pronounced overlap, with the lower half (mandible) fitting tightly inside the upper half (maxilla). When hummingbirds feed on nectar, the bill is usually only opened slightly, allowing the tongue to dart out into the nectar.\\n\\nLike the similar nectar-feeding sunbirds and unlike other birds, hummingbirds drink by using grooved or trough-like tongues which they can stick out a long way.\\nHummingbirds do not spend all day flying, as the energy cost would be prohibitive; the majority of their activity consists simply of sitting or perching. Hummingbirds feed in many small meals, consuming many small invertebrates and up to twelve times their own body weight in nectar each day. They spend an average of 1015% of their time feeding and 7580% sitting and digesting.\\n\\nCo-evolution with flowers\\n\\nSince hummingbirds are specialized nectar-eaters, they are tied to the bird-flowers they feed upon. Some species, especially those with unusual bill shapes such as the Sword-billed Hummingbird and the sicklebills, are co-evolved with a small number of flower species.\\n\\nMany plants pollinated by hummingbirds produce flowers in shades of red, orange, and bright pink, though the birds will take nectar from flowers of many colors. Hummingbirds can see wavelengths into the near-ultraviolet. 
However, their flowers do not reflect these wavelengths as many insect-pollinated flowers do. Th
]
}
],
"source": [
"print(client.search(\n",
"    index=\"wikipedia_vector_index\",\n",
"    source_excludes=[\"title_vector\", \"content_vector\"],\n",
"    query={\"match\": {\"text\": {\"query\": \"Hummingbird\"}}},\n",
"))"
]
},
{
"cell_type": "markdown",
"id": "992b6804",
"metadata": {},
"source": [
"## Encode a question with an OpenAI embedding model\n",
"\n",
"To perform kNN search, we need to encode queries with the same embedding model used to encode the documents at index time.\n",
"The Wikipedia dataset in this example was embedded with `text-embedding-ada-002`, so we use that model here.\n",
"\n",
"You'll need your OpenAI [API key](https://platform.openai.com/account/api-keys) to generate the embeddings."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "57385c69",
"metadata": {},
"outputs": [],
"source": [
"# Get OpenAI API key\n",
"OPENAI_API_KEY = getpass(\"Enter OpenAI API key\")\n",
"\n",
"# Set API key\n",
"openai.api_key = OPENAI_API_KEY\n",
"\n",
"# Define model\n",
"EMBEDDING_MODEL = \"text-embedding-ada-002\"\n",
"\n",
"# Define question\n",
"question = 'Is the Atlantic the biggest ocean in the world?'\n",
"\n",
"# Create embedding\n",
"question_embedding = openai.Embedding.create(input=question, model=EMBEDDING_MODEL)\n"
]
},
{
"cell_type": "markdown",
"id": "c7e6bf5d",
"metadata": {},
"source": [
"## Run semantic search queries\n",
"\n",
"Now we're ready to run queries against our Elasticsearch index using our encoded question. We'll be doing a k-nearest neighbors search, using the Elasticsearch [kNN query](https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html) option.\n",
"\n",
"First, we define a small function to pretty print the results."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b1291434",
"metadata": {},
"outputs": [],
"source": [
"# Function to pretty print Elasticsearch results\n",
"\n",
"def pretty_response(response):\n",
"    for hit in response['hits']['hits']:\n",
"        id = hit['_id']\n",
"        score = hit['_score']\n",
"        title = hit['_source']['title']\n",
"        text = hit['_source']['text']\n",
"        pretty_output = (f\"\\nID: {id}\\nTitle: {title}\\nSummary: {text}\\nScore: {score}\")\n",
"        print(pretty_output)"
]
},
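{
"cell_type": "markdown",
"id": "pretty-truncated-note",
"metadata": {},
"source": [
"`pretty_response` prints the full article text for every hit, which can get long. If shorter output is preferable, one option is a variant that truncates the text. The function `pretty_response_truncated`, its `max_chars` parameter, and the mock response below are hypothetical and only mirror the response fields used above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "pretty-truncated-code",
"metadata": {},
"outputs": [],
"source": [
"def pretty_response_truncated(response, max_chars=200):\n",
"    # Hypothetical variant of pretty_response that shortens long article text\n",
"    for hit in response['hits']['hits']:\n",
"        text = hit['_source']['text']\n",
"        summary = text if len(text) <= max_chars else text[:max_chars] + '...'\n",
"        print(f\"\\nID: {hit['_id']}\\nTitle: {hit['_source']['title']}\\nSummary: {summary}\\nScore: {hit['_score']}\")\n",
"\n",
"# Mock response with the same shape as an Elasticsearch search result\n",
"mock_response = {'hits': {'hits': [\n",
"    {'_id': '1', '_score': 0.9, '_source': {'title': 'Example', 'text': 'x' * 500}}\n",
"]}}\n",
"pretty_response_truncated(mock_response, max_chars=50)"
]
},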
{
"cell_type": "markdown",
"id": "ed8a497f",
"metadata": {},
"source": [
"Now let's run our `kNN` query."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "fc834fdd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"ID: 1936\n",
"Title: Atlantic Ocean\n",
"Summary: The Atlantic Ocean is the world's second largest ocean. It covers a total area of about . It covers about 20 percent of the Earth's surface. It is named after the god Atlas from Greek mythology.\n",
"\n",
"Geologic history \n",
"The Atlantic formed when the Americas moved west from Eurasia and Africa. This began sometime in the Cretaceous period, roughly 135 million years ago. It was part of the break-up of the supercontinent Pangaea.\n",
"\n",
"The east coast of South America is shaped somewhat like the west coast of Africa, and this gave a clue that continents moved over long periods of time (continental drift). The Atlantic Ocean is still growing now, because of sea-floor spreading from the mid-Atlantic Ridge, while the Pacific Ocean is said to be shrinking because the sea floor is folding under itself or subducting into the mantle.\n",
"\n",
"Geography\n",
"The Atlantic Ocean is bounded on the west by North and South America. It connects to the Arctic Ocean through the Denmark Strait, Greenland Sea, Norwegian Sea and Barents Sea. It connects with the Mediterranean Sea through the Strait of Gibraltar.\n",
"\n",
"In the southeast, the Atlantic merges into the Indian Ocean. The 20° East meridian defines its border.\n",
"\n",
"In the southwest, the Drake Passage connects it to the Pacific Ocean. The Panama Canal links the Atlantic and Pacific.\n",
"\n",
"The Atlantic Ocean is second in size to the Pacific. It occupies an area of about . The volume of the Atlantic, along with its adjacent seas (the seas next to it), is 354,700,000 cubic kilometres.\n",
"\n",
"The average depth of the Atlantic, along with its adjacent seas, is . The greatest depth is Milwaukee Deep near Puerto Rico, where the Ocean is deep.\n",
"\n",
"Gulf Stream \n",
"The Atlantic Ocean has important ocean currents. One of these, called the Gulf Stream, flows across the North Atlantic. Water gets heated by the sun in the Caribbean Sea and then moves northwest toward the North Pole. This makes France, the British Isles, Iceland, and Norway in Europe much warmer in winter than Newfoundland and Nova Scotia in Canada. Without the Gulf Stream, the climates of northeast Canada and northwest Europe might be the same, because these places are about the same distance from the North Pole.\n",
"\n",
"There are currents in the South Atlantic too, but the shape of this sea means that it has less effect on South Africa.\n",
"\n",
"Geology \n",
"The main feature of the Atlantic Ocean's seabed is a large underwater mountain chain called the Mid-Atlantic Ridge. It runs from north to south under the Ocean. This is at the boundary of four tectonic plates: Eurasian, North American, South American and African. The ridge extends from Iceland in the north to about 58° south.\n",
"\n",
"The salinity of the surface waters of the open ocean ranges from 3337 parts per thousand and varies with latitude and season.\n",
"\n",
"References\n",
"\n",
"Other websites \n",
"LA Times special Altered Oceans \n",
"Oceanography Image of the Day, from the Woods Hole Oceanographic Institution\n",
"National Oceanic and Atmospheric Administration \n",
"NOAA In-situ Ocean Data Viewer Plot and download ocean observations\n",
"\n",
"www.cartage.org.lb \n",
"www.mnsu.edu\n",
"Score: 0.93641126\n",
"\n",
"ID: 1975\n",
"Title: Pacific Ocean\n",
"Summary: The Pacific Ocean is the body of water between Asia and Australia in the west, the Americas in the east, the Southern Ocean to the south, and the Arctic Ocean to the north. It is the largest named ocean and it covers one-third of the surface of the entire world. It joins the Atlantic Ocean at a line drawn south from Cape Horn, Chile/Argentina to Antarctica, and joins the Indian Ocean at a line drawn south from Tasmania, Australia to Antarctica.\n",
"\n",
"As the Atlantic slowly gets wider, the Pacific is slowly shrinking. It does this by folding the sea floor in towards the centre of the Earth - this is called subduction. This bumping and grinding is hard so there are many earthquakes and volcanoes when the pressure builds up and is quickly released as large explosions of hot rocks and dust. When an earthquake happens under the sea, the quick jerk causes a tsunami. This is why tsunamis are more common around the edge of the Pacific than anywhere else. Many of the Earth's volcanoes are either islands in the Pacific, or are on continents within a few hundred kilometers of the ocean's edge. Plate tectonics are another reason which makes Pacific Ocean smaller.\n",
"\n",
"Other websites \n",
"\n",
" EPIC Pacific Ocean Data Collection Viewable on-line collection of observational data\n",
" NOAA In-situ Ocean Data Viewer plot and download ocean observations\n",
" NOAA PMEL Argo profiling floats Realtime Pacific Ocean data\n",
" NOAA TAO El Niño data Realtime Pacific Ocean El Niño buoy data\n",
" NOAA Ocean Surface Current Analyses Realtime (OSCAR) Near-realtime Pacific Ocean Surface Currents derived from satellite altimeter and scatterometer data\n",
"Score: 0.9177895\n",
"\n",
"ID: 11124\n",
"Title: List of seas\n",
"Summary: The sea is the interconnected system of all the Earth's oceanic waters, including the Atlantic, Pacific, Indian, Southern and Arctic Oceans. However, the word \"sea\" can also be used for many specific, much smaller bodies of seawater, such as the North Sea or the Red Sea.There are 78 seas in the world\n",
"\n",
"List of seas, by ocean\n",
"\n",
"Pacific Ocean \n",
" Bering Sea\n",
" Gulf of Alaska\n",
" Seck Sea (Gulf of California)\n",
" Sea of Okhotsk\n",
" Sea of Japan\n",
" Seto Inland Sea\n",
" East China Sea\n",
" South China Sea\n",
" Beibu Gulf\n",
" Sulu Sea\n",
" Celebes Sea\n",
" Bohol Sea (Mindanao Sea)\n",
" Philippine Sea\n",
" Flores Sea\n",
" Banda Sea\n",
" Arafura Sea\n",
" Tasman Sea\n",
" Yellow Sea\n",
" Bohai Sea\n",
" Coral Sea\n",
" Gulf of Carpentaria\n",
"\n",
"Atlantic Ocean \n",
" Hudson Bay\n",
" James Bay\n",
" Baffin Bay init fam\n",
" Gulf of St. Lawrence\n",
" Gulf of Guinea\n",
" Caribbean Sea\n",
" Gulf of Mexico\n",
" Sargasso Sea\n",
" North Sea\n",
" Baltic Sea\n",
" Gulf of Bothnia\n",
" Irish Sea\n",
" Celtic Sea\n",
" English Channel\n",
" Mediterranean Sea\n",
" Adriatic Sea\n",
" Aegean Sea\n",
" Black Sea\n",
" Sea of Azov\n",
" Ionian Sea\n",
" Ligurian Sea\n",
" Mirtoon Sea\n",
" Tyrrhenian Sea\n",
" Gulf of Sidra\n",
" Sea of Marmara\n",
" Sea of Crete\n",
"\n",
"Indian Ocean \n",
" Red Sea\n",
" Gulf of Aden\n",
" Persian Gulf\n",
" Gulf of Oman\n",
" Arabian Sea\n",
" Bay of Bengal\n",
" Gulf of Thailand\n",
" Java Sea\n",
" Timor Sea\n",
" Gulf of Kutch\n",
" Gulf of Khambhat\n",
"\n",
"Arctic Ocean \n",
" Barents Sea\n",
" Kara Sea\n",
" Beaufort Sea\n",
" Amundsen Gulf\n",
" Greenland Sea\n",
" Chukchi Sea\n",
" Laptev Sea\n",
" East Siberian Sea\n",
"\n",
"Southern Ocean \n",
" Amundsen Sea\n",
" Weddell Sea\n",
" Ross Sea\n",
" Great Australian Bight\n",
" Gulf St. Vincent\n",
" Spencer Gulf\n",
"\n",
"Seas which have land around them (these are landlocked) \n",
" Aral Sea\n",
" Caspian Sea\n",
" Dead Sea\n",
" Sea of Galilee (we call this a sea, but it is really a small freshwater lake)\n",
" Salton Sea\n",
"\n",
"Seas which are not on Earth \n",
"Lunar maria are very big areas on the Moon. In the past, people thought they were water and called them \"seas\". \n",
"\n",
"Scientists think that there is liquid water under the ground on some moons, for example Europa.\n",
"\n",
"Scientists also think that there are liquid hydrocarbons on Titan. \n",
"\n",
"Basic English 850 words\n",
"\n",
"Geography-related lists\n",
"Score: 0.9160284\n",
"\n",
"ID: 2033\n",
"Title: Southern Ocean\n",
"Summary: The Southern Ocean is the ocean around Antarctica. It means the waters of the Atlantic, Pacific, and Indian Oceans around the continent of Antarctica. Since the 1770s geographers have discussed its limits. Nowadays, sixty degrees south latitude is often accepted. Some people call this ocean the Antarctic Ocean.\n",
"\n",
"The total area is 20,327,000 km², and the coastline length is 17,968 km.\n",
"\n",
"Other websites \n",
"\n",
" Oceanography Image of the Day, from the Woods Hole Oceanographic Institution\n",
" The CIA World Factbook's entry on the Southern Ocean\n",
" The Fifth Ocean from Geography.About.com\n",
" NOAA In-situ Ocean Data Viewer Plot and download ocean observations\n",
" NOAA FAQ about the number of oceans \n",
"\n",
" \n",
"Geography of Antarctica\n",
"Score: 0.9083342\n",
"\n",
"ID: 1978\n",
"Title: Indian Ocean\n",
"Summary: The Indian Ocean is the ocean surrounded by Asia to the north, Australia and the Pacific Ocean to the east, the Southern Ocean to the south, and Africa and the Atlantic Ocean to the west. It is named for the river Indus and Ancient India on its north shore. The Bay of Bengal, the Arabian Sea, the Persian Gulf and the Red Sea are all parts of this ocean.\n",
"\n",
"The deepest point in the Indian Ocean is in the Java Trench near the Sunda Islands in the east, 7500 m (25,344 feet) deep. The average depth is 3,890 m (12,762 ft). The Indian Ocean is the third largest ocean, 28,350,000 square miles in size. The majority is in the southern hemisphere.\n",
"\n",
"Other websites \n",
"\n",
" Maps of the indian Ocean\n",
" Océan Indien in easy French\n",
" NOAA In-situ Ocean Data Viewer Plot and download ocean observations\n",
" The Indian Ocean in World History: Educational Website Interactive resource from the Sultan Qaboos Cultural Center\n",
" The Regional Tuna Tagging Project-Indian Ocean with details of the importance of Tuna in the Indian Ocean.. \n",
" Detailed maps of the Indian Ocean\n",
" The Indian Ocean Trade: A Classroom Simulation \n",
" CIA - The World Factbook, Oceans: Indian Ocean\n",
"Score: 0.90738976\n",
"\n",
"ID: 1980\n",
"Title: Arctic Ocean\n",
"Summary: The Arctic Ocean is the ocean around the North Pole. The most northern parts of Eurasia and North America are around the Arctic Ocean. Thick pack ice and snow cover almost all of this ocean in winter, and most of it in summer. An icebreaker or a nuclear-powered submarine can use the Northwest Passage through the Arctic Ocean to go between the Pacific and Atlantic oceans.\n",
"\n",
"The ocean's area is about 14.056 million km2, which is the smallest of the world's 5 oceans, and it has of coastline. The central surface covered by ice about thick. The biology there is quite special. Endangered species there include walruses, whales and polar bears. Year by year the Arctic Ocean is becoming less icy, as a result of global warming.\n",
"\n",
"The average depth of the Arctic Ocean is . The deepest point is in the Eurasian Basin, at .\n",
"\n",
"Geography \n",
"The Arctic Ocean covers an area of about 14,056,000 km2. The coastline is 45,390 km (28,200 mi) long It is surrounded by Eurasia, North America, Greenland, and by several islands.\n",
"\n",
"It is generally taken to include Baffin Bay, Barents Sea, Beaufort Sea, Chukchi Sea, East Siberian Sea, Greenland Sea, Hudson Bay, Hudson Strait, Kara Sea, Laptev Sea, White Sea and other bodies of water. It is connected to the Pacific Ocean by the Bering Strait and to the Atlantic Ocean through the Greenland Sea and Labrador Sea.\n",
"\n",
"Countries bordering the Arctic Ocean are: Russia, Norway, Iceland, Greenland, Canada and the United States.\n",
"\n",
"Climate \n",
"The Arctic Ocean is in a polar climate. Winters are characterized by the polar night, cold and stable weather conditions, and clear skies.\n",
"\n",
"The temperature of the surface of the Arctic Ocean is fairly constant, near the freezing point of seawater. Arctic Ocean consists of saltwater but its salinity is less than other oceans. The temperature must reach 1.8 °C (28.8 °F) before freezing occurs.\n",
"\n",
"Ice covers most of the Arctic Ocean. It covers almost the whole ocean in late winter and the majority of the ocean in late summer. Much of the Arctic ice pack is covered in snow for about 10 months of the year. The maximum snow cover is in March or April — about 20 to 50 cm (7.9 to 19.7 in).\n",
"\n",
"The climate of the Arctic region has varied significantly in the past. As recently as 55 million years ago, during the eocene epoch, the region reached an average annual temperature of 1020 °C (5068 °F). The surface waters of the Arctic Ocean warmed enough to support tropical lifeforms.\n",
"\n",
"Animal and plant life \n",
"Endangered marine species in the Arctic Ocean include walruses and whales. The area has a fragile ecosystem. The Arctic Ocean has relatively little plant life except for phytoplankton. Phytoplankton are a crucial part of the ocean. They feed on nutrients from rivers and the currents of the Atlantic and Pacific oceans.\n",
"\n",
"References\n",
"\n",
"Other websites \n",
"\n",
" The Hidden Ocean Arctic 2005 Daily logs, photos and video from exploration mission.\n",
" Oceanography Image of the Day, from the Woods Hole Oceanographic Institution\n",
" Arctic Council\n",
" The Northern Forum\n",
" Arctic Environmental Atlas Interactive map\n",
" NOAA Arctic Theme Page\n",
" \n",
" Daily Arctic Ocean Rawinsonde Data from Soviet Drifting Ice Stations (19541990) at NSIDC\n",
" Arctic time series: The Unaami Data collection \n",
" NOAA North Pole Web Cam Images from Web Cams deployed in spring on an ice floe\n",
" NOAA Near-realtime North Pole Weather Data Data from instruments deployed on an ice floe\n",
" Search for Arctic Life Heats Up by Stephen Leahy\n",
" International Polar Foundation\n",
" National Snow and Ice Data Center Daily report of Arctic ice cover based on satellite data\n",
" Marine Biodiversity Wiki \n",
"\n",
"Oceans\n",
"Arctic\n",
"Score: 0.9073119\n",
"\n",
"ID: 15220\n",
"Title: Caribbean Sea\n",
"Summary: The Caribbean Sea is a tropical sea in the center of the Caribbean area. The body of water is part of the Atlantic Ocean. The sea is southeast of the Gulf of Mexico. The Caribbean Sea has many islands, which are popular among North American tourists because of their tropical climate. The Caribbean Sea is famous around the world as a tourist destination.\n",
"\n",
"History \n",
"Christopher Columbus came across a group of islands in the Caribbean region. When he did so, he thought he had reached another part of the world. Because of this, he named the islands the West Indies. However, later it was realized that he found an entire region. It still had its natural resources. The name Caribbean was later given to it by the Amerindian tribe, the Caribs. That is how it got its name: the Caribbean Sea.\n",
"\n",
"This entire region covers an area of 1,063,000 sq. miles. It covers from Mexico to the boundaries of South America.\n",
"\n",
"This sea is just as deep as it is wide. Its deepest point is believed to be even lower than 25,220 ft, 7,686 m. That makes this point one of the lowest points on the surface of the earth, and the Caribbean Sea one of the deepest seas in the world.\n",
"\n",
"Other websites \n",
"\n",
"Seas of the Atlantic Ocean\n",
"Score: 0.9067033\n",
"\n",
"ID: 21206\n",
"Title: Irish Sea\n",
"Summary: The Irish Sea (sometimes called the Manx Sea) is a body of water that separates Ireland and Great Britain. It is known to be one of the most polluted seas in the world including the North Sea and the Mediterranean Sea. The sea is important to regional trade, shipping and fishing. It is a source of power generation in the form of wind power and nuclear plants. Annual traffic between Great Britain and Ireland amounts to over 12 million passengers and of traded goods.\n",
"\n",
"Economics \n",
"It covers and at its deepest point is deep. In 2008, about of fish were caught. Shell fish made up three quarters of this amount.\n",
"\n",
"The Irish Sea has 17 active oil and gas drilling platforms. It is estimated there are about 1.6 billion barrels of oil in the Barryroe oil field alone.\n",
"\n",
"Sealife \n",
"At least thirty species of shark can be found in the Irish Sea at different times. These include the basking, thresher, blue, mako and porbeagle sharks. There are about 12 species of Dolphin, porpoise and whales in the Irish Sea. These include the common dolphin, bottlenose dolphin and the harbor porpoise.\n",
"\n",
"References \n",
"\n",
"Seas of the Atlantic Ocean\n",
"Ireland\n",
"Geography of the United Kingdom\n",
"Score: 0.90408546\n",
"\n",
"ID: 6308\n",
"Title: North Sea\n",
"Summary: The North Sea is a sea that is part of the Atlantic Ocean in northern Europe. The North Sea is between Norway and Denmark in the east, Scotland and England in the west, Germany, the Netherlands, Belgium and France in the south.\n",
"\n",
"Borders \n",
"The Skagerrak connects the North Sea to the Baltic Sea. In the south, the North Sea becomes the English Channel, a sea between England and France. This is called the Dover Straits and is very busy with ships.\n",
"\n",
"The border between the North Sea and the Skagerrak is at an imagined line between Lindesnes in Norway, and Hanstholm in Denmark. In the North, the North sea is open towards the Atlantic. The border between the two is an imagined line from Northern Scotland, to Shetland, and then to Ålesund in Norway. According to the Oslo-Paris Treaty of 1962 it is a bit more to the west and the north though. The treaty puts it at 5° East longitude, and 62° North latitude. That is at the parallel of the Geirangerfjord in Norway.\n",
"\n",
"Various statistical data \n",
"On average, the North Sea has a depth of only 94 meters. About 80 million people live near the North Sea, at most 150 km away from the coast. Together with the English Channel in the south, the southern North Sea is the busiest body of water in the world.\n",
"\n",
"Rivers that drain into it \n",
"Well-known rivers that drain into the North Sea include the Tay (at Dundee), the Forth (at Edinburgh), the Tyne (South Shields), the Wear (at Sunderland), the Tees (near Middlesbrough), the Elbe (at Cuxhaven), the Weser (at Bremerhaven), the Rhine and Meuse or Maas (at Rotterdam), the Scheldt (at Flushing or Vlissingen), the Thames, and the Humber (at Hull), and the river Nairn (at Nairn)\n",
"\n",
"The Kiel Canal, one of the world's busiest artificial waterways, connects the North Sea with the Baltic.\n",
"\n",
"Name \n",
"Its name comes from its relationship to the land of the Frisians (see Frisia). They live directly to the south of the North Sea, and to the west of the East Sea (Oostzee, the Baltic Sea), the former South Sea (Zuiderzee, today's IJsselmeer) and the today reclaimed Middle Sea (Middelzee). But the spread of the name could also be from the view of the cities of the Hanseatic League. Some of its main cities, like Lübeck, Bremen or Hamburg had the same view.\n",
"\n",
"In classical times this body of water was also called the Oceanum Germanicum or Mare Germanicum, meaning German Ocean or Sea. This name was commonly used in English and other languages along with the name North Sea, until the early eighteenth century. By the late nineteenth century, German Sea was a rare, scholarly usage even in Germany. In Danish the North Sea is also named Vesterhavet (besides Nordsøen), meaning Western Ocean because it is west of Denmark.\n",
"\n",
"Geographic divisions \n",
"\n",
"Most of the North sea is on the European Continental shelf. On average, the depth is about 93 to 94 meters only. In the south it is very shallow, only 25 to 35 meters. In the north in the bathyal zone north of Shetland, this depth increases to between 100 and 200 metres. In the south, the depth is at most 50 metres. An exception to this is the Norwegian Trench. It is deepest there, with a depth of 725 metres. The most shallow part of it is a sand bank called Dogger Bank. In the southern part, there are many sand banks.\n",
"\n",
"Looking at the satellite picture it is easy to see the geographic divisions of the North Sea:\n",
"a generally shallow southern North Sea\n",
"the central North Sea\n",
" the northern North Sea, with the Norwegian Trench, near the Skagerrak.\n",
"\n",
"The southern north sea is composed of the Southern Bight, before the coast of Belgium and the Netherlands and the German Bight before the coastline of Germany. The Dogger Bank is the limit between the southern and central parts. The Waddenzee runs all the way from Den Helder in the Netherlands to Esbjerg in Denmark.\n",
"\n",
"The Dogger Bank covers an area about half the size of the Netherlands. There, the North Sea has a depth of between 13 and 20 metres only. The area is very famous for fishing. With some storms there are even waves breaking there.\n",
"\n",
"The Norwegian Trench has an average depth of around 250 to 300 metres; at the entrance to the Skagerrak, the depth increases up to 725 meters. Along the trench is the Norwegian Current, which brings most of the waters of the North Sea into the Atlantic Ocean. Also, most of the waters of the Baltic Sea flow northwards here.\n",
"\n",
"About 200 km east of the Scottish city of Dundee there are more trenches, known collectively as the Devil's hole. Generally, the water is about 90 meters deep there. The trenches very often are only a few kilometers in length. In these trenches, the depth increases to up to 230 meters.\n",
"\n",
"In the Dover Strait the water is about 30 meters deep. At the end of the English Channel, this depth increases to about 100 meters.\n",
"\n",
"History \n",
"In the last ice age the North Sea was covered by large areas of ice called glaciers. About 20,000 years ago the ice melted and the North Sea was formed (made).\n",
"\n",
"North Sea oil \n",
"In the 1960s, geologists found large areas of oil and natural gas under the North Sea. Most of the oil fields are owned by the United Kingdom and Norway but some belong to Denmark, the Netherlands and Germany. Drilling began in the 1960s and led to a famous argument between England and Scotland about how the revenue (money) from the oil should be spent.\n",
"\n",
"Animal life \n",
"\n",
"People have been fishing in the North Sea for thousands of years. However, so many fish are now caught there that new ones may not be able to grow fast enough to keep the fishery going.\n",
"\n",
"Terns, Atlantic puffins, razorbills, kittiwakes and other seabirds live on the North Sea coast. Many coastal areas are protected nature reserves.\n",
"\n",
"Other websites\n",
"\n",
"Seas of the Atlantic Ocean\n",
"Bodies of water of Europe\n",
"Score: 0.9021919\n",
"\n",
"ID: 6278\n",
"Title: Atlantis\n",
"Summary: Atlantis is a name for a fictional large island or small continent that was (in the legend) in the Atlantic Ocean many years before it sank into the depth of the sea .\n",
"\n",
"The name Atlantis first appears in the writings of Herodotus - he describes the western ocean as \"Sea of Atlantis.\" Then, one generation later, Atlantis is described in detail in the stories Timaeus and Critias by the Greek philosopher Plato. He used this story to help explain his ideas about government and philosophy. Plato was the only ancient writer who wrote specific things about Atlantis.\n",
"\n",
"According to Plato, the Atlanteans lived 9000 years before his own time and were half human and half god. They created a very good human society. When they stopped being good people and did bad things, the gods sent earthquakes and fire to destroy Atlantis.\n",
"\n",
"Many scholars think Plato could have been thinking of a real place when he wrote about Atlantis. Many, many people have thought of many, many places where the real place that inspired Atlantis could have been. For example, there was a Minoan kingdom on the island of Santorini. The Minoan kingdom was very powerful thousands of years before Plato, and their society was damaged when a volcano erupted on their island. According to Plato, Atlantis was very large, as big as North Africa, so it should not have been hard to find.\n",
"\n",
"After the discovery of the Americas, some people in Europe thought they might be Atlantis. However, after Plato, the idea of Atlantis was mostly forgotten until 1882, when a writer named Ignatius Donnelly wrote a book saying that Atlantis was real and that the culture of Atlantis had started many other ancient cultures, such as the Egyptian and Mayan. Then other people became interested in Atlantis. \n",
"\n",
"Atlantis has appeared in many works of fiction. In Marvel Comics, Atlantis is at the bottom of the ocean and exists in modern times, with people who breathe water. Other works of fiction use Atlantis as background. For example, Robert E. Howard set his Conan the Barbarian stories in a fictional time called the Hyborian Age, which began with the destruction of Atlantis and ended when real written history started.\n",
"\n",
"References\n",
"\n",
"Greek mythology\n",
"Ancient history\n",
"Score: 0.9008117\n"
]
}
],
"source": [
"response = client.search(\n",
"    index=\"wikipedia_vector_index\",\n",
"    knn={\n",
"        \"field\": \"content_vector\",\n",
"        \"query_vector\": question_embedding[\"data\"][0][\"embedding\"],\n",
"        \"k\": 10,\n",
"        \"num_candidates\": 100\n",
"    }\n",
")\n",
"pretty_response(response)\n",
"\n",
"# Store the text of the top hit for the final step\n",
"top_hit_summary = response['hits']['hits'][0]['_source']['text']"
]
},
{
"cell_type": "markdown",
"id": "276c1147",
"metadata": {},
"source": [
"Success! We've used kNN to perform semantic search over our dataset and found the top results.\n",
"\n",
"Now we can use the Chat Completions API to work some generative AI magic using the top search result as additional context."
]
},
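{
"cell_type": "markdown",
"id": "b7d41c9e",
"metadata": {},
"source": [
"The search response is a nested structure, and the cell above reads only the first hit. As a minimal sketch, the helper below gathers the text of several top hits instead, which could be useful when a single article does not contain the answer. `collect_hit_texts`, `top_n` and `sample_response` are hypothetical names for illustration, and the sample data only mimics the shape of the response used above.\n",
"\n",
"```python\n",
"def collect_hit_texts(response, top_n=3):\n",
"    # Hits are returned in descending score order by default,\n",
"    # so slicing keeps the best matches\n",
"    hits = response['hits']['hits'][:top_n]\n",
"    return [hit['_source']['text'] for hit in hits]\n",
"\n",
"\n",
"# Hypothetical stand-in with the same shape as the search response\n",
"sample_response = {\n",
"    'hits': {\n",
"        'hits': [\n",
"            {'_score': 0.9, '_source': {'title': 'First', 'text': 'First article text'}},\n",
"            {'_score': 0.8, '_source': {'title': 'Second', 'text': 'Second article text'}},\n",
"            {'_score': 0.7, '_source': {'title': 'Third', 'text': 'Third article text'}},\n",
"        ]\n",
"    }\n",
"}\n",
"\n",
"print(collect_hit_texts(sample_response, top_n=2))\n",
"```\n",
"\n",
"In the notebook, the same call could be made on `response` itself, and the returned texts joined into one context string before prompting the model."
]
},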
{
"cell_type": "markdown",
"id": "8abac103",
"metadata": {},
"source": [
"## Use Chat Completions API for retrieval augmented generation\n",
"\n",
"Now we can send the question and the retrieved text to OpenAI's Chat Completions API.\n",
"\n",
"Using an LLM together with a retrieval system is known as retrieval augmented generation (RAG). We use Elasticsearch for what it does best: retrieving relevant documents. Then we use the LLM for what it does best: tasks such as generating summaries and answering questions, with the retrieved documents as context.\n",
"\n",
"The model will generate a response to the question, using the top kNN hit as context. Use the `messages` list to shape your prompt to the model. In this example, we're using the `gpt-3.5-turbo` model."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "5cfb3153",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"------------------------------------------------------------\n",
"No, the Atlantic Ocean is not the biggest ocean in the world. It is the second largest ocean, covering about 20 percent of the Earth's surface. The Pacific Ocean is the largest ocean in the world.\n",
"------------------------------------------------------------\n"
]
}
],
"source": [
"# Ask the model to answer the question using the top kNN hit as context\n",
"summary = openai.ChatCompletion.create(\n",
"    model=\"gpt-3.5-turbo\",\n",
"    messages=[\n",
"        {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
"        {\"role\": \"user\", \"content\": \"Answer the following question: \"\n",
"         + question\n",
"         + \" by using the following text: \"\n",
"         + top_hit_summary},\n",
"    ]\n",
")\n",
"\n",
"choices = summary.choices\n",
"\n",
"# Print each generated answer between separator lines\n",
"for choice in choices:\n",
"    print(\"------------------------------------------------------------\")\n",
"    print(choice.message.content)\n",
"    print(\"------------------------------------------------------------\")"
]
},
{
"cell_type": "markdown",
"id": "868ab150",
"metadata": {},
"source": [
"### Code explanation\n",
"\n",
"Here's what that code does:\n",
"\n",
"- Calls OpenAI's `gpt-3.5-turbo` model to generate a response\n",
"- Sends a conversation containing a system message and a user message to the model\n",
"- Sets the assistant's role to \"helpful assistant\" in the system message\n",
"- Puts the same question used for the kNN query, followed by the text of the top search hit, in the user message\n",
"- Stores the model's response in the `summary` variable, then prints each generated answer from `summary.choices`"
]
},
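{
"cell_type": "markdown",
"id": "d3e85f1a",
"metadata": {},
"source": [
"The prompt assembly described above can be sketched as a small helper. This is an illustrative sketch, not part of the original notebook: `build_rag_messages` and its parameters are hypothetical names, and the sample question and context are placeholder strings. It makes no API call and only builds the `messages` list.\n",
"\n",
"```python\n",
"def build_rag_messages(question, context, system_prompt='You are a helpful assistant.'):\n",
"    # Keep explicit spaces so the question and context do not run together\n",
"    user_content = (\n",
"        'Answer the following question: ' + question\n",
"        + ' by using the following text: ' + context\n",
"    )\n",
"    return [\n",
"        {'role': 'system', 'content': system_prompt},\n",
"        {'role': 'user', 'content': user_content},\n",
"    ]\n",
"\n",
"\n",
"# Placeholder inputs; in the notebook these would be `question` and `top_hit_summary`\n",
"messages = build_rag_messages(\n",
"    'Is the Atlantic the biggest ocean in the world?',\n",
"    'The Atlantic Ocean is the second largest ocean.',\n",
")\n",
"\n",
"for message in messages:\n",
"    print(message['role'], '->', message['content'])\n",
"```\n",
"\n",
"The returned list could be passed as the `messages` argument of the chat completion call above. Passing a different `system_prompt` is one way to change the assistant's persona."
]
},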
{
"cell_type": "markdown",
"id": "aa0eec27",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"That was just one example of how to combine Elasticsearch with OpenAI's models to enable retrieval augmented generation. RAG lets you avoid the costly and complex process of training or fine-tuning models by using out-of-the-box models enhanced with additional context.\n",
"\n",
"Use this as a blueprint for your own experiments.\n",
"\n",
"To adapt the conversation for different use cases, customize the system message to define the assistant's behavior or persona. Adjust the user message to specify the task, such as summarization or question answering, along with the desired format of the response."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.11.3 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
"vscode": {
"interpreter": {
"hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}