Various miscellaneous fixes to most pages in the 'Retrievers' section of
the documentation:
- "VectorStore" and "vectorstore" changed to "vector store" for
consistency
- Various spelling, grammar, and formatting improvements for readability
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
@ -59,8 +59,8 @@ LangChain provides several objects to easily distinguish between different roles
If none of those roles sound right, there is also a `ChatMessage` class where you can specify the role manually.
For more information on how to use these different messages most effectively, see our prompting guide.
LangChain exposes a standard interface for both, but it's useful to understand this difference in order to construct prompts for a given language model.
The standard interface that LangChain exposes has two methods:
LangChain provides a standard interface for both, but it's useful to understand this difference in order to construct prompts for a given language model.
The standard interface that LangChain provides has two methods:
- `predict`: Takes in a string, returns a string
- `predict_messages`: Takes in a list of messages, returns a message.
This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.
1. How the text is split: by list of characters
2. How the chunk size is measured: by number of characters
1. How the text is split: by list of characters.
2. How the chunk size is measured: by number of characters.
import Example from "@snippets/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter.mdx"
@ -55,7 +55,7 @@ However, we have also added a collection of algorithms on top of this to increas
These include:
- [Parent Document Retriever](/docs/modules/data_connection/retrievers/parent_document_retriever): This allows you to create multiple embeddings per parent document, allowing you to look up smaller chunks but return larger context.
- [Self Query Retriever](/docs/modules/data_connection/retrievers/self_query): User questions often contain a reference to something that isn't just semantic but rather expresses some logic that can best be represented as a metadata filter. Self-query allows you to parse out the *semantic* part of a query from other *metadata filters* present in the query
- [Self Query Retriever](/docs/modules/data_connection/retrievers/self_query): User questions often contain a reference to something that isn't just semantic but rather expresses some logic that can best be represented as a metadata filter. Self-query allows you to parse out the *semantic* part of a query from other *metadata filters* present in the query.
- [Ensemble Retriever](/docs/modules/data_connection/retrievers/ensemble): Sometimes you may want to retrieve documents from multiple different sources, or using multiple different algorithms. The ensemble retriever allows you to easily do this.
@ -5,10 +5,10 @@ One challenge with retrieval is that usually you don't know the specific queries
Contextual compression is meant to fix this. The idea is simple: instead of immediately returning retrieved documents as-is, you can compress them using the context of the given query, so that only the relevant information is returned. “Compressing” here refers to both compressing the contents of an individual document and filtering out documents wholesale.
To use the Contextual Compression Retriever, you'll need:
- a base Retriever
- a base retriever
- a Document Compressor
The Contextual Compression Retriever passes queries to the base Retriever, takes the initial documents and passes them through the Document Compressor. The Document Compressor takes a list of Documents and shortens it by reducing the contents of Documents or dropping Documents altogether.
The Contextual Compression Retriever passes queries to the base retriever, takes the initial documents and passes them through the Document Compressor. The Document Compressor takes a list of documents and shortens it by reducing the contents of documents or dropping documents altogether.
A vector store retriever is a retriever that uses a vector store to retrieve documents. It is a lightweight wrapper around the Vector Store class to make it conform to the Retriever interface.
A vector store retriever is a retriever that uses a vector store to retrieve documents. It is a lightweight wrapper around the vector store class to make it conform to the retriever interface.
It uses the search methods implemented by a vector store, like similarity search and MMR, to query the texts in the vector store.
Once you construct a Vector store, it's very easy to construct a retriever. Let's walk through an example.
Once you construct a vector store, it's very easy to construct a retriever. Let's walk through an example.
import Example from "@snippets/modules/data_connection/retrievers/how_to/vectorstore.mdx"
@ -11,7 +11,7 @@ The Embeddings class is a class designed for interfacing with text embedding mod
Embeddings create a vector representation of a piece of text. This is useful because it means we can think about text in the vector space, and do things like semantic search where we look for pieces of text that are most similar in the vector space.
The base Embeddings class in LangChain exposes two methods: one for embedding documents and one for embedding a query. The former takes as input multiple texts, while the latter takes a single text. The reason for having these as two separate methods is that some embedding providers have different embedding methods for documents (to be searched over) vs queries (the search query itself).
The base Embeddings class in LangChain provides two methods: one for embedding documents and one for embedding a query. The former takes as input multiple texts, while the latter takes a single text. The reason for having these as two separate methods is that some embedding providers have different embedding methods for documents (to be searched over) vs queries (the search query itself).
This walkthrough showcases basic functionality related to VectorStores. A key part of working with vector stores is creating the vector to put in them, which is usually created via embeddings. Therefore, it is recommended that you familiarize yourself with the [text embedding model](/docs/modules/data_connection/text_embedding/) interfaces before diving into this.
This walkthrough showcases basic functionality related to vector stores. A key part of working with vector stores is creating the vector to put in them, which is usually created via embeddings. Therefore, it is recommended that you familiarize yourself with the [text embedding model](/docs/modules/data_connection/text_embedding/) interfaces before diving into this.
import GetStarted from "@snippets/modules/data_connection/vectorstores/get_started.mdx"
"When a full paragraph or document is embedded, the embedding process considers both the overall context and the relationships between the sentences and phrases within the text. This can result in a more comprehensive vector representation that captures the broader meaning and themes of the text.\n",
"```\n",
" \n",
"As mentioned, chunking often aims to keep text with common context together.\n",
"\n",
"With this in mind, we might want to specifically honor the structure of the document itself.\n",
"\n",
"For example, a markdown file is organized by headers.\n",
"\n",
"Creating chunks within specific header groups is an intuitive idea.\n",
"\n",
"To address this challenge, we can use `MarkdownHeaderTextSplitter`.\n",
"\n",
"This will split a markdown file by a specified set of headers. \n",
"As mentioned, chunking often aims to keep text with common context together. With this in mind, we might want to specifically honor the structure of the document itself. For example, a markdown file is organized by headers. Creating chunks within specific header groups is an intuitive idea. To address this challenge, we can use `MarkdownHeaderTextSplitter`. This will split a markdown file by a specified set of headers. \n",
"\n",
"For example, if we want to split this markdown:\n",
"`incremental` and `full` offer the following automated clean up:\n",
"\n",
"* If the content of source document or derived documents has **changed**, both `incremental` or `full` modes will clean up (delete) previous versions of the content.\n",
"* If the content of the source document or derived documents has **changed**, both `incremental` or `full` modes will clean up (delete) previous versions of the content.\n",
"* If the source document has been **deleted** (meaning it is not included in the documents currently being indexed), the `full` cleanup mode will delete it from the vector store correctly, but the `incremental` mode will not.\n",
"\n",
"When content is mutated (e.g., the source PDF file was revised) there will be a period of time during indexing when both the new and old versions may be returned to the user. This happens after the new content was written, but before the old version was deleted.\n",
@ -56,7 +56,7 @@
"## Requirements\n",
"\n",
"1. Do not use with a store that has been pre-populated with content independently of the indexing API, as the record manager will not know that records have been inserted previously.\n",
"2. Only works with LangChain ``VectorStore``'s that support:\n",
"2. Only works with LangChain `vectorstore`'s that support:\n",
" * document addition by id (`add_documents` method with `ids` argument)\n",
" * delete by id (`delete` method with)\n",
" \n",
@ -64,11 +64,11 @@
"\n",
"The record manager relies on a time-based mechanism to determine what content can be cleaned up (when using `full` or `incremental` cleanup modes).\n",
"\n",
"If two tasks run back to back, and the first task finishes before the the clock time changes, then the second task may not be able to clean up content.\n",
"If two tasks run back-to-back, and the first task finishes before the the clock time changes, then the second task may not be able to clean up content.\n",
"\n",
"This is unlikely to be an issue in actual settings for the following reasons:\n",
"\n",
"1. The RecordManager uses higher resolutino timestamps.\n",
"1. The RecordManager uses higher resolution timestamps.\n",
"2. The data would need to change between the first and the second tasks runs, which becomes unlikely if the time interval between the tasks is small.\n",
"3. Indexing tasks typically take more than a few ms."
]
@ -99,7 +99,7 @@
"id": "f81201ab-d997-433c-9f18-ceea70e61cbd",
"metadata": {},
"source": [
"Initialize a vector store and set up the embeddings"
"Initialize a vector store and set up the embeddings:"
]
},
{
@ -125,7 +125,7 @@
"source": [
"Initialize a record manager with an appropriate namespace.\n",
"\n",
"**Suggestion** Use a namespace that takes into account both the vectostore and the collection name in the vectorstore; e.g., 'redis/my_docs', 'chromadb/my_docs' or 'postgres/my_docs'"
"**Suggestion:** Use a namespace that takes into account both the vector store and the collection name in the vectorstore; e.g., 'redis/my_docs', 'chromadb/my_docs' or 'postgres/my_docs'."
]
},
{
@ -148,7 +148,7 @@
"id": "835c2c19-68ec-4086-9066-f7ba40877fd5",
"metadata": {},
"source": [
"Create a schema before using the record manager"
"Create a schema before using the record manager."
]
},
{
@ -166,7 +166,7 @@
"id": "7f07c6bd-6ada-4b17-a8c5-fe5e4a5278fd",
"metadata": {},
"source": [
"Let's index some test documents"
"Let's index some test documents:"
]
},
{
@ -185,7 +185,7 @@
"id": "c7d572be-a913-4511-ab64-2864a252458a",
"metadata": {},
"source": [
"Indexing into an empty vectorstore"
"Indexing into an empty vectorstore:"
]
},
{
@ -285,7 +285,7 @@
"id": "7be3e55a-5fe9-4f40-beff-577c2aa5e76a",
"metadata": {},
"source": [
"Second time around all content will be skipped"
"Second time around all content will be skipped:"
]
},
{
@ -396,7 +396,7 @@
"id": "b205c1ba-f069-4a4e-af93-dc98afd5c9e6",
"metadata": {},
"source": [
"If we provide no documents with incremental indexing mode, nothing will change"
"If we provide no documents with incremental indexing mode, nothing will change."
]
},
{
@ -476,7 +476,7 @@
"\n",
"In `full` mode the user should pass the `full` universe of content that should be indexed into the indexing function.\n",
"\n",
"Any documents that are not passed into the indexing functino and are present in the vectorstore will be deleted!\n",
"Any documents that are not passed into the indexing function and are present in the vectorstore will be deleted!\n",
"\n",
"This behavior is useful to handle deletions of source documents."
]
@ -527,7 +527,7 @@
"id": "887c45c6-4363-4389-ac56-9cdad682b4c8",
"metadata": {},
"source": [
"Say someone deleted the first doc"
"Say someone deleted the first doc:"
]
},
{
@ -566,7 +566,7 @@
"id": "d940bcb4-cf6d-4c21-a565-e7f53f6dacf1",
"metadata": {},
"source": [
"Using full mode will clean up the deleted content as well"
"Using full mode will clean up the deleted content as well."
]
},
{
@ -716,7 +716,7 @@
"id": "ab1c0915-3f9e-42ac-bdb5-3017935c6e7f",
"metadata": {},
"source": [
"This should delete the old versions of documents associated with `doggy.txt` source and replace them with the new versions"
"This should delete the old versions of documents associated with `doggy.txt` source and replace them with the new versions."
]
},
{
@ -774,11 +774,11 @@
"id": "c0af4d24-d735-4e5d-ad9b-a2e8b281f9f1",
"metadata": {},
"source": [
"## Using with Loaders\n",
"## Using with loaders\n",
"\n",
"Indexing can accept either an iterable of documents or else any loader.\n",
"\n",
"**Attention** The loader **MUST** set source keys correctly."
"**Attention:** The loader **must** set source keys correctly."
"Distance-based vector database retrieval embeds (represents) queries in high-dimensional space and finds similar embedded documents based on \"distance\". But, retrieval may produce difference results with subtle changes in query wording or if the embeddings do not capture the semantics of the data well. Prompt engineering / tuning is sometimes done to manually address these problems, but can be tedious.\n",
"Distance-based vector database retrieval embeds (represents) queries in high-dimensional space and finds similar embedded documents based on \"distance\". But, retrieval may produce different results with subtle changes in query wording or if the embeddings do not capture the semantics of the data well. Prompt engineering / tuning is sometimes done to manually address these problems, but can be tedious.\n",
"\n",
"The `MultiQueryRetriever` automates the process of prompt tuning by using an LLM to generate multiple queries from different perspectives for a given user input query. For each query, it retrieves a set of relevant documents and takes the unique union across all queries to get a larger set of potentially relevant documents. By generating multiple perspectives on the same question, the `MultiQueryRetriever` might be able to overcome some of the limitations of the distance-based retrieval and get a richer set of results."
]
@ -43,7 +43,7 @@
"id": "cca8f56c",
"metadata": {},
"source": [
"`Simple usage`\n",
"#### Simple usage\n",
"\n",
"Specify the LLM to use for query generation, and the retriver will do the rest."
]
@ -113,7 +113,7 @@
"id": "c54a282f",
"metadata": {},
"source": [
"`Supplying your own prompt`\n",
"#### Supplying your own prompt\n",
"\n",
"You can also supply a prompt along with an output parser to split the results into a list of queries."
"The `EnsembleRetriever` takes a list of retrievers as input and ensemble the results of their get_relevant_documents() methods and rerank the results based on the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.\n",
"The `EnsembleRetriever` takes a list of retrievers as input and ensemble the results of their `get_relevant_documents()` methods and rerank the results based on the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.\n",
"\n",
"By leveraging the strengths of different algorithms, the `EnsembleRetriever` can achieve better performance than any single algorithm. \n",
"\n",
"The most common pattern is to combine a sparse retriever(like BM25) with a dense retriever(like Embedding similarity), because their strengths are complementary. It is also known as \"hybrid search\".The sparse retriever is good at finding relevant documents based on keywords, while the dense retriever is good at finding relevant documents based on semantic similarity."
"The most common pattern is to combine a sparse retriever (like BM25) with a dense retriever (like embedding similarity), because their strengths are complementary. It is also known as \"hybrid search\".The sparse retriever is good at finding relevant documents based on keywords, while the dense retriever is good at finding relevant documents based on semantic similarity."
"The methods to create multiple vectors per document include:\n",
"\n",
"- smaller chunks: split a document into smaller chunks, and embed those (this is ParentDocumentRetriever)\n",
"- summary: create a summary for each document, embed that along with (or instead of) the document\n",
"- hypothetical questions: create hypothetical questions that each document would be appropriate to answer, embed those along with (or instead of) the document\n",
"- Smaller chunks: split a document into smaller chunks, and embed those (this is ParentDocumentRetriever).\n",
"- Summary: create a summary for each document, embed that along with (or instead of) the document.\n",
"- Hypothetical questions: create hypothetical questions that each document would be appropriate to answer, embed those along with (or instead of) the document.\n",
"\n",
"\n",
"Note that this also enables another method of adding embeddings - manually. This is great because you can explicitly add questions or queries that should lead to a document being recovered, giving you more control"
"Note that this also enables another method of adding embeddings - manually. This is great because you can explicitly add questions or queries that should lead to a document being recovered, giving you more control."
]
},
{
@ -68,7 +68,7 @@
"source": [
"## Smaller chunks\n",
"\n",
"Often times it can be useful to retrieve larger chunks of information, but embed smaller chunks. This allows for embeddings to capture the semantic meaning as closely as possible, but for as much context as possible to be passed downstream. NOTE: this is what the ParentDocumentRetriever does. Here we show what is going on under the hood."
"Often times it can be useful to retrieve larger chunks of information, but embed smaller chunks. This allows for embeddings to capture the semantic meaning as closely as possible, but for as much context as possible to be passed downstream. Note that this is what the `ParentDocumentRetriever` does. Here we show what is going on under the hood."
"2. You want to have long enough documents that the context of each chunk is\n",
" retained.\n",
"\n",
"The ParentDocumentRetriever strikes that balance by splitting and storing\n",
"The `ParentDocumentRetriever` strikes that balance by splitting and storing\n",
"small chunks of data. During retrieval, it first fetches the small chunks\n",
"but then looks up the parent ids for those chunks and returns those larger\n",
"documents.\n",
@ -70,7 +70,7 @@
"id": "d3943f72",
"metadata": {},
"source": [
"## Retrieving Full Documents\n",
"## Retrieving full documents\n",
"\n",
"In this mode, we want to retrieve the full documents. Therefore, we only specify a child splitter."
]
@ -220,7 +220,7 @@
"id": "14f813a5",
"metadata": {},
"source": [
"## Retrieving Larger Chunks\n",
"## Retrieving larger chunks\n",
"\n",
"Sometimes, the full documents can be too big to want to retrieve them as is. In that case, what we really want to do is to first split the raw documents into larger chunks, and then split it into smaller chunks. We then index the smaller chunks, but on retrieval we retrieve the larger chunks (but still not the full documents)."
]
@ -273,7 +273,7 @@
"id": "64ad3c8c",
"metadata": {},
"source": [
"We can see that there are much more than two documents now - these are the larger chunks"
"We can see that there are much more than two documents now - these are the larger chunks."
"First we'll want to create a DeepLake VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"First we'll want to create a Deep Lake vector store and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"\n",
"NOTE: The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `deeplake` package."
"**Note:** The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `deeplake` package."
"First we'll want to create a Chroma VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"First we'll want to create a Chroma vector store and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"\n",
"NOTE: The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `chromadb` package."
"**Note:** The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `chromadb` package."
"First we'll want to create a Elasticsearch VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"First we'll want to create a Elasticsearch vector store and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"\n",
"NOTE: The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `elasticsearch` package."
"**Note:** The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `elasticsearch` package."
">[MyScale](https://docs.myscale.com/en/) is an integrated vector database. You can access your database in SQL and also from here, LangChain. MyScale can make a use of [various data types and functions for filters](https://blog.myscale.com/2023/06/06/why-integrated-database-solution-can-boost-your-llm-apps/#filter-on-anything-without-constraints). It will boost up your LLM app no matter if you are scaling up your data or expand your system to broader application.\n",
"\n",
"In the notebook we'll demo the `SelfQueryRetriever` wrapped around a MyScale vector store with some extra piece we contributed to LangChain. In short, it can be concluded into 4 points:\n",
"In the notebook we'll demo the `SelfQueryRetriever` wrapped around a MyScale vector store with some extra pieces we contributed to LangChain. In short, it can be condensed into 4 points:\n",
"1. Add `contain` comparator to match list of any if there is more than one element matched\n",
"2. Add `timestamp` data type for datetime match (ISO-format, or YYYY-MM-DD)\n",
"3. Add `like` comparator for string pattern search\n",
@ -22,9 +22,9 @@
"metadata": {},
"source": [
"## Creating a MyScale vector store\n",
"MyScale has already been integrated to LangChain for a while. So you can follow [this notebook](/docs/integrations/vectorstores/myscale.ipynb) to create your own vectorstore for a self-query retriever.\n",
"MyScale has already been integrated to LangChain for a while. So you can follow [this notebook](/docs/integrations/vectorstores/myscale) to create your own vectorstore for a self-query retriever.\n",
"\n",
"NOTE: All self-query retrievers requires you to have `lark` installed (`pip install lark`). We use `lark` for grammar definition. Before you proceed to the next step, we also want to remind you that `clickhouse-connect` is also needed to interact with your MyScale backend."
"**Note:** All self-query retrievers requires you to have `lark` installed (`pip install lark`). We use `lark` for grammar definition. Before you proceed to the next step, we also want to remind you that `clickhouse-connect` is also needed to interact with your MyScale backend."
]
},
{
@ -44,7 +44,7 @@
"id": "83811610-7df3-4ede-b268-68a6a83ba9e2",
"metadata": {},
"source": [
"In this tutorial we follow other example's setting and use `OpenAIEmbeddings`. Remember to get a OpenAI API Key for valid accesss to LLMs."
"In this tutorial we follow other example's setting and use `OpenAIEmbeddings`. Remember to get an OpenAI API Key for valid accesss to LLMs."
]
},
{
@ -88,7 +88,7 @@
"metadata": {},
"source": [
"## Create some sample data\n",
"As you can see, the data we created has some difference to other self-query retrievers. We replaced keyword `year` to `date` which gives you a finer control on timestamps. We also altered the type of keyword `gerne` to list of strings, where LLM can use a new `contain` comparator to construct filters. We also provides comparator `like` and arbitrary function support to filters, which will be introduced in next few cells.\n",
"As you can see, the data we created has some differences compared to other self-query retrievers. We replaced the keyword `year` with `date` which gives you finer control on timestamps. We also changed the type of the keyword `gerne` to a list of strings, where an LLM can use a new `contain` comparator to construct filters. We also provide the `like` comparator and arbitrary function support to filters, which will be introduced in next few cells.\n",
"\n",
"Now let's look at the data first."
]
@ -146,7 +146,7 @@
"metadata": {},
"source": [
"## Creating our self-querying retriever\n",
"Just like other retrievers... Simple and nice."
"Just like other retrievers... simple and nice."
]
},
{
@ -272,7 +272,7 @@
"id": "86371ac8",
"metadata": {},
"source": [
"# Wait a second... What else?\n",
"# Wait a second... what else?\n",
"\n",
"Self-query retriever with MyScale can do more! Let's find out."
"First we'll want to create a `Pinecone` VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"First we'll want to create a `Pinecone` vector store and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"\n",
"To use Pinecone, you have to have `pinecone` package installed and you must have an API key and an Environment. Here are the [installation instructions](https://docs.pinecone.io/docs/quickstart).\n",
"To use Pinecone, you have to have `pinecone` package installed and you must have an API key and an environment. Here are the [installation instructions](https://docs.pinecone.io/docs/quickstart).\n",
"\n",
"NOTE: The self-query retriever requires you to have `lark` package installed."
"**Note:** The self-query retriever requires you to have `lark` package installed."
">[Qdrant](https://qdrant.tech/documentation/) (read: quadrant) is a vector similarity search engine. It provides a production-ready service with a convenient API to store, search, and manage points - vectors with an additional payload. `Qdrant` is tailored to extended filtering support. It makes it useful \n",
">[Qdrant](https://qdrant.tech/documentation/) (read: quadrant) is a vector similarity search engine. It provides a production-ready service with a convenient API to store, search, and manage points - vectors with an additional payload. `Qdrant` is tailored to extended filtering support.\n",
"\n",
"In the notebook we'll demo the `SelfQueryRetriever` wrapped around a Qdrant vector store. "
]
@ -20,9 +20,9 @@
"metadata": {},
"source": [
"## Creating a Qdrant vector store\n",
"First we'll want to create a Qdrant VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"First we'll want to create a Qdrant vector store and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"\n",
"NOTE: The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `qdrant-client` package."
"**Note:** The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `qdrant-client` package."
"First we'll want to create a Weaviate VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"First we'll want to create a Weaviate vector store and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"\n",
"NOTE: The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `weaviate-client` package."
"**Note:** The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `weaviate-client` package."
Under the hood, by default this uses the [UnstructuredLoader](/docs/integrations/document_loaders/unstructured_file.html)
Under the hood, by default this uses the [UnstructuredLoader](/docs/integrations/document_loaders/unstructured_file.html).
```python
from langchain.document_loaders import DirectoryLoader
@ -121,7 +121,7 @@ len(docs)
</CodeOutputBlock>
## Autodetect file encodings with TextLoader
## Auto-detect file encodings with TextLoader
In this example we will see some strategies that can be useful when loading a big list of arbitrary files from a directory using the `TextLoader` class.
@ -212,7 +212,7 @@ loader.load()
</HTMLOutputBlock>
The file `example-non-utf8.txt` uses a different encoding the `load()` function fails with a helpful message indicating which file failed decoding.
The file `example-non-utf8.txt` uses a different encoding, so the `load()` function fails with a helpful message indicating which file failed decoding.
With the default behavior of `TextLoader` any failure to load any of the documents will fail the whole loading process and no documents are loaded.
This covers how to load online pdfs into a document format that we can use downstream. This can be used for various online pdf sites such as https://open.umn.edu/opentextbooks/textbooks/ and https://arxiv.org/archive/
This covers how to load online PDFs into a document format that we can use downstream. This can be used for various online PDF sites such as https://open.umn.edu/opentextbooks/textbooks/ and https://arxiv.org/archive/
Note: all other pdf loaders can also be used to fetch remote PDFs, but `OnlinePDFLoader` is a legacy function, and works specifically with `UnstructuredPDFLoader`.
Note: all other PDF loaders can also be used to fetch remote PDFs, but `OnlinePDFLoader` is a legacy function, and works specifically with `UnstructuredPDFLoader`.
@ -208,7 +208,7 @@ data = loader.load()
### Using PDFMiner to generate HTML text
This can be helpful for chunking texts semantically into sections as the output html content can be parsed via `BeautifulSoup` to get more structured and rich information about font size, page numbers, pdf headers/footers, etc.
This can be helpful for chunking texts semantically into sections as the output html content can be parsed via `BeautifulSoup` to get more structured and rich information about font size, page numbers, PDF headers/footers, etc.
It's that simple! You can call `get_relevant_documents` or the async `aget_relevant_documents` methods to retrieve documents relevant to a query, where "relevance" is defined by
the specific retriever object you are calling.
Of course, we also help construct what we think useful Retrievers are. The main type of Retriever that we focus on is a Vectorstore retriever. We will focus on that for the rest of this guide.
Of course, we also help construct what we think useful retrievers are. The main type of retriever that we focus on is a vector store retriever. We will focus on that for the rest of this guide.
In order to understand what a vectorstore retriever is, it's important to understand what a Vectorstore is. So let's look at that.
In order to understand what a vectorstore retriever is, it's important to understand what a vector store is. So let's look at that.
By default, LangChain uses [Chroma](/docs/ecosystem/integrations/chroma.html) as the vector store to index and search embeddings. To walk through this tutorial, we'll first need to install `chromadb`.
@ -52,7 +52,7 @@ We have chosen this as the example for getting started because it nicely combine
Question answering over documents consists of four steps:
1. Create an index
2. Create a Retriever from that index
2. Create a retriever from that index
3. Create a question answering chain
4. Ask questions!
@ -66,7 +66,7 @@ from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
```
Next in the generic setup, let's specify the document loader we want to use. You can download the `state_of_the_union.txt` file [here](https://github.com/hwchase17/langchain/blob/master/docs/extras/modules/state_of_the_union.txt)
Next in the generic setup, let's specify the document loader we want to use. You can download the `state_of_the_union.txt` file [here](https://github.com/hwchase17/langchain/blob/master/docs/extras/modules/state_of_the_union.txt).
What is returned from the `VectorstoreIndexCreator` is `VectorStoreIndexWrapper`, which provides these nice `query` and `query_with_sources` functionality. If we just wanted to access the vectorstore directly, we can also do that.
What is returned from the `VectorstoreIndexCreator` is a `VectorStoreIndexWrapper`, which provides these nice `query` and `query_with_sources` functionalities. If we just want to access the vector store directly, we can also do that.
```python
@ -144,7 +144,7 @@ index.vectorstore
</CodeOutputBlock>
If we then want to access the VectorstoreRetriever, we can do that with:
If we then want to access the `VectorStoreRetriever`, we can do that with:
@ -9,9 +9,9 @@ from langchain.schema import Document
from langchain.vectorstores import FAISS
```
## Low Decay Rate
## Low decay rate
A low `decay rate` (in this, to be extreme, we will set close to 0) means memories will be "remembered" for longer. A `decay rate` of 0 means memories never be forgotten, making this retriever equivalent to the vector lookup.
A low `decay rate` (in this, to be extreme, we will set it close to 0) means memories will be "remembered" for longer. A `decay rate` of 0 means memories never be forgotten, making this retriever equivalent to the vector lookup.
With a high `decay rate` (e.g., several 9's), the `recency score` quickly goes to 0! If you set this all the way to 1, `recency` is 0 for all objects, once again making this equivalent to a vector lookup.
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
```
## Maximum Marginal Relevance Retrieval
By default, the vectorstore retriever uses similarity search. If the underlying vectorstore support maximum marginal relevance search, you can specify that as the search type.
## Maximum marginal relevance retrieval
By default, the vectorstore retriever uses similarity search. If the underlying vectorstore supports maximum marginal relevance search, you can specify that as the search type.
We'll use a Pinecone vector store in this example.
First we'll want to create a `Pinecone` VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.
First we'll want to create a `Pinecone` vector store and seed it with some data. We've created a small demo set of documents that contain summaries of movies.
To use Pinecone, you need to have `pinecone` package installed and you must have an API key and an Environment. Here are the [installation instructions](https://docs.pinecone.io/docs/quickstart).
To use Pinecone, you need to have `pinecone` package installed and you must have an API key and an environment. Here are the [installation instructions](https://docs.pinecone.io/docs/quickstart).
NOTE: The self-query retriever requires you to have `lark` package installed.
**Note:** The self-query retriever requires you to have `lark` package installed.
Langchain supports async operation on vector stores. All the methods might be called using their async counterparts, with the prefix `a`, meaning `async`.
LangChain supports async operation on vector stores. All the methods might be called using their async counterparts, with the prefix `a`, meaning `async`.
`Qdrant` is a vector store, which supports all the async operations, thus it will be used in this walkthrough.