mirror of
https://github.com/hwchase17/langchain
synced 2024-10-29 17:07:25 +00:00
387 lines
12 KiB
Plaintext
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "683953b3",
"metadata": {},
"source": [
"# Vectara\n",
"\n",
">[Vectara](https://vectara.com/) is an API platform for building LLM-powered applications. It provides an easy-to-use API for document indexing and querying that is managed by Vectara and optimized for performance and accuracy.\n",
"\n",
"\n",
"This notebook shows how to use functionality related to the `Vectara` vector database or the `Vectara` retriever.\n",
"\n",
"See the [Vectara API documentation](https://docs.vectara.com/docs/) for more information on how to use the API."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "aac9563e",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T10:51:22.282884Z",
"start_time": "2023-04-04T10:51:21.408077Z"
},
"tags": []
},
"outputs": [],
"source": [
"import os\n",
"from langchain.embeddings import FakeEmbeddings\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"from langchain.vectorstores import Vectara\n",
"from langchain.document_loaders import TextLoader"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "eeead681",
"metadata": {},
"source": [
"## Connecting to Vectara from LangChain\n",
"\n",
"The Vectara API provides simple API endpoints for indexing and querying, which are encapsulated in the Vectara integration.\n",
"First, let's ingest the documents using the `from_documents()` method:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "be0a4973",
"metadata": {},
"outputs": [],
"source": [
"loader = TextLoader(\"../../../state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8429667e",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T10:51:22.525091Z",
"start_time": "2023-04-04T10:51:22.522015Z"
},
"tags": []
},
"outputs": [],
"source": [
"vectara = Vectara.from_documents(\n",
"    docs,\n",
"    embedding=FakeEmbeddings(size=768),\n",
"    doc_metadata={\"speech\": \"state-of-the-union\"},\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "90dbf3e7",
"metadata": {},
"source": [
"Vectara's indexing API also provides a file upload endpoint, where Vectara handles the file directly: it is pre-processed, chunked optimally, and added to the Vectara vector store.\n",
"To use this, we added the `add_files()` method (and `from_files()`).\n",
"\n",
"Let's see this in action. We pick two PDF documents to upload:\n",
"1. The \"I have a dream\" speech by Dr. King\n",
"2. Churchill's \"We Shall Fight on the Beaches\" speech"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "85ef3468",
"metadata": {},
"outputs": [],
"source": [
"import tempfile\n",
"import urllib.request\n",
"\n",
"urls = [\n",
"    [\n",
"        \"https://www.gilderlehrman.org/sites/default/files/inline-pdfs/king.dreamspeech.excerpts.pdf\",\n",
"        \"I-have-a-dream\",\n",
"    ],\n",
"    [\n",
"        \"https://www.parkwayschools.net/cms/lib/MO01931486/Centricity/Domain/1578/Churchill_Beaches_Speech.pdf\",\n",
"        \"we shall fight on the beaches\",\n",
"    ],\n",
"]\n",
"files_list = []\n",
"for url, _ in urls:\n",
"    name = tempfile.NamedTemporaryFile().name\n",
"    urllib.request.urlretrieve(url, name)\n",
"    files_list.append(name)\n",
"\n",
"docsearch: Vectara = Vectara.from_files(\n",
"    files=files_list,\n",
"    embedding=FakeEmbeddings(size=768),\n",
"    metadatas=[{\"url\": url, \"speech\": title} for url, title in urls],\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1f9215c8",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T09:27:29.920258Z",
"start_time": "2023-04-04T09:27:29.913714Z"
}
},
"source": [
"## Similarity search\n",
"\n",
"The simplest scenario for using Vectara is to perform a similarity search."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "a8c513ab",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T10:51:25.204469Z",
"start_time": "2023-04-04T10:51:24.855618Z"
},
"tags": []
},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"found_docs = vectara.similarity_search(\n",
"    query, n_sentence_context=0, filter=\"doc.speech = 'state-of-the-union'\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "fc516993",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T10:51:25.220984Z",
"start_time": "2023-04-04T10:51:25.213943Z"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n"
]
}
],
"source": [
"print(found_docs[0].page_content)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1bda9bf5",
"metadata": {},
"source": [
"## Similarity search with score\n",
"\n",
"Sometimes we might want to perform the search but also obtain a relevancy score to know how good a particular result is."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "8804a21d",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T10:51:25.631585Z",
"start_time": "2023-04-04T10:51:25.227384Z"
}
},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"found_docs = vectara.similarity_search_with_score(\n",
"    query, filter=\"doc.speech = 'state-of-the-union'\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "756a6887",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T10:51:25.642282Z",
"start_time": "2023-04-04T10:51:25.635947Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.\n",
"\n",
"Score: 0.4917977\n"
]
}
],
"source": [
"document, score = found_docs[0]\n",
"print(document.page_content)\n",
"print(f\"\\nScore: {score}\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1f9876a8",
"metadata": {},
"source": [
"Now let's run a similar search over the content of the files we uploaded:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "47784de5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(Document(page_content='We must forever conduct our struggle on the high plane of dignity and discipline.', metadata={'section': '1'}), 0.7962591)\n",
"(Document(page_content='We must not allow our\\ncreative protests to degenerate into physical violence. . . .', metadata={'section': '1'}), 0.25983918)\n"
]
}
],
"source": [
"query = \"We must forever conduct our struggle\"\n",
"found_docs = vectara.similarity_search_with_score(\n",
"    query, filter=\"doc.speech = 'I-have-a-dream'\"\n",
")\n",
"print(found_docs[0])\n",
"print(found_docs[1])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "691a82d6",
"metadata": {},
"source": [
"## Vectara as a Retriever\n",
"\n",
"Vectara, like all the other vector stores, can also be used as a LangChain Retriever:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "9427195f",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T10:51:26.031451Z",
"start_time": "2023-04-04T10:51:26.018763Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"VectaraRetriever(vectorstore=<langchain.vectorstores.vectara.Vectara object at 0x12772caf0>, search_type='similarity', search_kwargs={'lambda_val': 0.025, 'k': 5, 'filter': '', 'n_sentence_context': '0'})"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retriever = vectara.as_retriever()\n",
"retriever"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "f3c70c31",
"metadata": {
"ExecuteTime": {
"end_time": "2023-04-04T10:51:26.495652Z",
"start_time": "2023-04-04T10:51:26.046407Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \\n\\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \\n\\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \\n\\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'})"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"retriever.get_relevant_documents(query)[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2300e785",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}