docs: misc retrievers fixes (#9791)

Various miscellaneous fixes to most pages in the 'Retrievers' section of
the documentation:
- "VectorStore" and "vectorstore" changed to "vector store" for
consistency
- Various spelling, grammar, and formatting improvements for readability

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
seamusp 1 year ago committed by GitHub
parent 8bc452a466
commit 16945c9922

@ -59,8 +59,8 @@ LangChain provides several objects to easily distinguish between different roles
If none of those roles sound right, there is also a `ChatMessage` class where you can specify the role manually.
For more information on how to use these different messages most effectively, see our prompting guide.
LangChain exposes a standard interface for both, but it's useful to understand this difference in order to construct prompts for a given language model.
The standard interface that LangChain exposes has two methods:
LangChain provides a standard interface for both, but it's useful to understand this difference in order to construct prompts for a given language model.
The standard interface that LangChain provides has two methods:
- `predict`: Takes in a string, returns a string
- `predict_messages`: Takes in a list of messages, returns a message.
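A minimal sketch of that two-method interface, assuming an OpenAI API key is configured in the environment; both the LLM and chat model wrappers accept the same calls:

```python
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
from langchain.schema import HumanMessage

llm = OpenAI()             # text-in, text-out model
chat_model = ChatOpenAI()  # messages-in, message-out model

# `predict`: takes a string, returns a string
llm.predict("Say hi!")
chat_model.predict("Say hi!")

# `predict_messages`: takes a list of messages, returns a message
chat_model.predict_messages([HumanMessage(content="Say hi!")])
```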

@ -11,7 +11,7 @@ Use document loaders to load data from a source as `Document`'s. A `Document` is
and associated metadata. For example, there are document loaders for loading a simple `.txt` file, for loading the text
contents of any web page, or even for loading a transcript of a YouTube video.
Document loaders expose a "load" method for loading data as documents from a configured source. They optionally
Document loaders provide a "load" method for loading data as documents from a configured source. They optionally
implement a "lazy load" as well for lazily loading data into memory.
## Get started

@ -2,8 +2,8 @@
This is the simplest method. This splits based on characters (by default "\n\n") and measures chunk length by number of characters.
1. How the text is split: by single character
2. How the chunk size is measured: by number of characters
1. How the text is split: by single character.
2. How the chunk size is measured: by number of characters.
import Example from "@snippets/modules/data_connection/document_transformers/text_splitters/character_text_splitter.mdx"
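A minimal sketch of those two knobs, with a placeholder variable standing in for your own document text:

```python
from langchain.text_splitter import CharacterTextSplitter

some_text = "..."  # placeholder for a long document loaded elsewhere

text_splitter = CharacterTextSplitter(
    separator="\n\n",   # split on the default character
    chunk_size=1000,    # chunk size measured in characters
    chunk_overlap=200,
)
docs = text_splitter.create_documents([some_text])
```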

@ -1,6 +1,6 @@
# Split code
CodeTextSplitter allows you to split your code with multiple language support. Import enum `Language` and specify the language.
CodeTextSplitter allows you to split your code with multiple languages supported. Import enum `Language` and specify the language.
import Example from "@snippets/modules/data_connection/document_transformers/text_splitters/code_splitter.mdx"
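A sketch of splitting Python source with the `Language` enum; the code snippet itself is a stand-in:

```python
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

hello_world()
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
```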

@ -2,8 +2,8 @@
This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.
1. How the text is split: by list of characters
2. How the chunk size is measured: by number of characters
1. How the text is split: by list of characters.
2. How the chunk size is measured: by number of characters.
import Example from "@snippets/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter.mdx"
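A sketch of the recommended splitter with the character-based length function spelled out; the input text is a placeholder:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

some_text = "..."  # placeholder for a long document

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,  # chunk size measured in characters
)
texts = text_splitter.create_documents([some_text])
```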

@ -37,7 +37,7 @@ efficiently find other pieces of text that are similar.
LangChain provides integrations with over 25 different embedding providers and methods,
from open-source to proprietary API,
allowing you to choose the one best suited for your needs.
LangChain exposes a standard interface, allowing you to easily swap between models.
LangChain provides a standard interface, allowing you to easily swap between models.
**[Vector stores](/docs/modules/data_connection/vectorstores/)**
@ -55,7 +55,7 @@ However, we have also added a collection of algorithms on top of this to increas
These include:
- [Parent Document Retriever](/docs/modules/data_connection/retrievers/parent_document_retriever): This allows you to create multiple embeddings per parent document, allowing you to look up smaller chunks but return larger context.
- [Self Query Retriever](/docs/modules/data_connection/retrievers/self_query): User questions often contain a reference to something that isn't just semantic but rather expresses some logic that can best be represented as a metadata filter. Self-query allows you to parse out the *semantic* part of a query from other *metadata filters* present in the query
- [Self Query Retriever](/docs/modules/data_connection/retrievers/self_query): User questions often contain a reference to something that isn't just semantic but rather expresses some logic that can best be represented as a metadata filter. Self-query allows you to parse out the *semantic* part of a query from other *metadata filters* present in the query.
- [Ensemble Retriever](/docs/modules/data_connection/retrievers/ensemble): Sometimes you may want to retrieve documents from multiple different sources, or using multiple different algorithms. The ensemble retriever allows you to easily do this.
- And more!

@ -5,10 +5,10 @@ One challenge with retrieval is that usually you don't know the specific queries
Contextual compression is meant to fix this. The idea is simple: instead of immediately returning retrieved documents as-is, you can compress them using the context of the given query, so that only the relevant information is returned. “Compressing” here refers to both compressing the contents of an individual document and filtering out documents wholesale.
To use the Contextual Compression Retriever, you'll need:
- a base Retriever
- a base retriever
- a Document Compressor
The Contextual Compression Retriever passes queries to the base Retriever, takes the initial documents and passes them through the Document Compressor. The Document Compressor takes a list of Documents and shortens it by reducing the contents of Documents or dropping Documents altogether.
The Contextual Compression Retriever passes queries to the base retriever, takes the initial documents and passes them through the Document Compressor. The Document Compressor takes a list of documents and shortens it by reducing the contents of documents or dropping documents altogether.
![](https://drive.google.com/uc?id=1CtNgWODXZudxAWSRiWgSGEoTNrUFT98v)
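A minimal sketch of wiring the two pieces together; `retriever` is assumed to be an existing base retriever (for example, one returned by `db.as_retriever()` on a vector store):

```python
from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# `retriever` is assumed to be an existing base retriever, e.g. db.as_retriever()
compressor = LLMChainExtractor.from_llm(OpenAI(temperature=0))
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)
compressed_docs = compression_retriever.get_relevant_documents("your query")
```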

@ -8,7 +8,7 @@ Head to [Integrations](/docs/integrations/retrievers/) for documentation on buil
:::
A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store.
A retriever does not need to be able to store documents, only to return (or retrieve) it. Vector stores can be used
A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used
as the backbone of a retriever, but there are other types of retrievers as well.
## Get started

@ -1,9 +1,9 @@
# Vector store-backed retriever
A vector store retriever is a retriever that uses a vector store to retrieve documents. It is a lightweight wrapper around the Vector Store class to make it conform to the Retriever interface.
A vector store retriever is a retriever that uses a vector store to retrieve documents. It is a lightweight wrapper around the vector store class to make it conform to the retriever interface.
It uses the search methods implemented by a vector store, like similarity search and MMR, to query the texts in the vector store.
Once you construct a Vector store, it's very easy to construct a retriever. Let's walk through an example.
Once you construct a vector store, it's very easy to construct a retriever. Let's walk through an example.
import Example from "@snippets/modules/data_connection/retrievers/how_to/vectorstore.mdx"
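A compact sketch of the pattern, assuming the `state_of_the_union.txt` example file is available locally and an OpenAI key is configured:

```python
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

documents = TextLoader("state_of_the_union.txt").load()
texts = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0).split_documents(documents)
db = FAISS.from_documents(texts, OpenAIEmbeddings())

retriever = db.as_retriever()
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
```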

@ -11,7 +11,7 @@ The Embeddings class is a class designed for interfacing with text embedding mod
Embeddings create a vector representation of a piece of text. This is useful because it means we can think about text in the vector space, and do things like semantic search where we look for pieces of text that are most similar in the vector space.
The base Embeddings class in LangChain exposes two methods: one for embedding documents and one for embedding a query. The former takes as input multiple texts, while the latter takes a single text. The reason for having these as two separate methods is that some embedding providers have different embedding methods for documents (to be searched over) vs queries (the search query itself).
The base Embeddings class in LangChain provides two methods: one for embedding documents and one for embedding a query. The former takes as input multiple texts, while the latter takes a single text. The reason for having these as two separate methods is that some embedding providers have different embedding methods for documents (to be searched over) vs queries (the search query itself).
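A sketch of the two methods, using `OpenAIEmbeddings` as one possible provider:

```python
from langchain.embeddings import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()

# embed_documents: multiple texts (the ones to be searched over)
doc_vectors = embeddings_model.embed_documents(["Hi there!", "Oh, hello!"])

# embed_query: a single text (the search query itself)
query_vector = embeddings_model.embed_query("What was said in the conversation?")
```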
## Get started

@ -16,7 +16,7 @@ for you.
## Get started
This walkthrough showcases basic functionality related to VectorStores. A key part of working with vector stores is creating the vector to put in them, which is usually created via embeddings. Therefore, it is recommended that you familiarize yourself with the [text embedding model](/docs/modules/data_connection/text_embedding/) interfaces before diving into this.
This walkthrough showcases basic functionality related to vector stores. A key part of working with vector stores is creating the vector to put in them, which is usually created via embeddings. Therefore, it is recommended that you familiarize yourself with the [text embedding model](/docs/modules/data_connection/text_embedding/) interfaces before diving into this.
import GetStarted from "@snippets/modules/data_connection/vectorstores/get_started.mdx"

@ -9,7 +9,7 @@
"# Lost in the middle: The problem with long contexts\n",
"\n",
"No matter the architecture of your model, there is a substantial performance degradation when you include 10+ retrieved documents.\n",
"In brief: When models must access relevant information in the middle of long contexts, then tend to ignore the provided documents.\n",
"In brief: When models must access relevant information in the middle of long contexts, they tend to ignore the provided documents.\n",
"See: https://arxiv.org/abs/2307.03172\n",
"\n",
"To avoid this issue you can re-order documents after retrieval to avoid performance degradation."

@ -17,17 +17,7 @@
"When a full paragraph or document is embedded, the embedding process considers both the overall context and the relationships between the sentences and phrases within the text. This can result in a more comprehensive vector representation that captures the broader meaning and themes of the text.\n",
"```\n",
" \n",
"As mentioned, chunking often aims to keep text with common context together.\n",
"\n",
"With this in mind, we might want to specifically honor the structure of the document itself.\n",
"\n",
"For example, a markdown file is organized by headers.\n",
"\n",
"Creating chunks within specific header groups is an intuitive idea.\n",
"\n",
"To address this challenge, we can use `MarkdownHeaderTextSplitter`.\n",
"\n",
"This will split a markdown file by a specified set of headers. \n",
"As mentioned, chunking often aims to keep text with common context together. With this in mind, we might want to specifically honor the structure of the document itself. For example, a markdown file is organized by headers. Creating chunks within specific header groups is an intuitive idea. To address this challenge, we can use `MarkdownHeaderTextSplitter`. This will split a markdown file by a specified set of headers. \n",
"\n",
"For example, if we want to split this markdown:\n",
"```\n",

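A hedged sketch of the header-based splitting described above, with a small stand-in markdown string:

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_document = "# Foo\n\n## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n## Baz\n\nHi this is Molly"

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
# each split is a Document whose metadata records the headers it falls under
```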
@ -22,8 +22,8 @@
"\n",
"We can use it to estimate tokens used. It will probably be more accurate for the OpenAI models.\n",
"\n",
"1. How the text is split: by character passed in\n",
"2. How the chunk size is measured: by `tiktoken` tokenizer"
"1. How the text is split: by character passed in.\n",
"2. How the chunk size is measured: by `tiktoken` tokenizer."
]
},
{
@ -122,8 +122,8 @@
"\n",
"Another alternative to `NLTK` is to use [spaCy tokenizer](https://spacy.io/api/tokenizer).\n",
"\n",
"1. How the text is split: by `spaCy` tokenizer\n",
"2. How the chunk size is measured: by number of characters"
"1. How the text is split: by `spaCy` tokenizer.\n",
"2. How the chunk size is measured: by number of characters."
]
},
{
@ -331,7 +331,7 @@
"Rather than just splitting on \"\\n\\n\", we can use `NLTK` to split based on [NLTK tokenizers](https://www.nltk.org/api/nltk.tokenize.html).\n",
"\n",
"1. How the text is split: by `NLTK` tokenizer.\n",
"2. How the chunk size is measured:by number of characters"
"2. How the chunk size is measured: by number of characters."
]
},
{
@ -430,8 +430,8 @@
"\n",
"We use Hugging Face tokenizer, the [GPT2TokenizerFast](https://huggingface.co/Ransaka/gpt2-tokenizer-fast) to count the text length in tokens.\n",
"\n",
"1. How the text is split: by character passed in\n",
"2. How the chunk size is measured: by number of tokens calculated by the `Hugging Face` tokenizer\n"
"1. How the text is split: by character passed in.\n",
"2. How the chunk size is measured: by number of tokens calculated by the `Hugging Face` tokenizer.\n"
]
},
{

@ -45,7 +45,7 @@
"\n",
"`incremental` and `full` offer the following automated clean up:\n",
"\n",
"* If the content of source document or derived documents has **changed**, both `incremental` or `full` modes will clean up (delete) previous versions of the content.\n",
"* If the content of the source document or derived documents has **changed**, both `incremental` or `full` modes will clean up (delete) previous versions of the content.\n",
"* If the source document has been **deleted** (meaning it is not included in the documents currently being indexed), the `full` cleanup mode will delete it from the vector store correctly, but the `incremental` mode will not.\n",
"\n",
"When content is mutated (e.g., the source PDF file was revised) there will be a period of time during indexing when both the new and old versions may be returned to the user. This happens after the new content was written, but before the old version was deleted.\n",
@ -56,7 +56,7 @@
"## Requirements\n",
"\n",
"1. Do not use with a store that has been pre-populated with content independently of the indexing API, as the record manager will not know that records have been inserted previously.\n",
"2. Only works with LangChain ``VectorStore``'s that support:\n",
"2. Only works with LangChain `vectorstore`'s that support:\n",
" * document addition by id (`add_documents` method with `ids` argument)\n",
" * delete by id (`delete` method with)\n",
" \n",
@ -64,11 +64,11 @@
"\n",
"The record manager relies on a time-based mechanism to determine what content can be cleaned up (when using `full` or `incremental` cleanup modes).\n",
"\n",
"If two tasks run back to back, and the first task finishes before the the clock time changes, then the second task may not be able to clean up content.\n",
"If two tasks run back-to-back, and the first task finishes before the the clock time changes, then the second task may not be able to clean up content.\n",
"\n",
"This is unlikely to be an issue in actual settings for the following reasons:\n",
"\n",
"1. The RecordManager uses higher resolutino timestamps.\n",
"1. The RecordManager uses higher resolution timestamps.\n",
"2. The data would need to change between the first and the second tasks runs, which becomes unlikely if the time interval between the tasks is small.\n",
"3. Indexing tasks typically take more than a few ms."
]
@ -99,7 +99,7 @@
"id": "f81201ab-d997-433c-9f18-ceea70e61cbd",
"metadata": {},
"source": [
"Initialize a vector store and set up the embeddings"
"Initialize a vector store and set up the embeddings:"
]
},
{
@ -125,7 +125,7 @@
"source": [
"Initialize a record manager with an appropriate namespace.\n",
"\n",
"**Suggestion** Use a namespace that takes into account both the vectostore and the collection name in the vectorstore; e.g., 'redis/my_docs', 'chromadb/my_docs' or 'postgres/my_docs'"
"**Suggestion:** Use a namespace that takes into account both the vector store and the collection name in the vector store; e.g., 'redis/my_docs', 'chromadb/my_docs' or 'postgres/my_docs'."
]
},
{
@ -148,7 +148,7 @@
"id": "835c2c19-68ec-4086-9066-f7ba40877fd5",
"metadata": {},
"source": [
"Create a schema before using the record manager"
"Create a schema before using the record manager."
]
},
{
@ -166,7 +166,7 @@
"id": "7f07c6bd-6ada-4b17-a8c5-fe5e4a5278fd",
"metadata": {},
"source": [
"Let's index some test documents"
"Let's index some test documents:"
]
},
{
@ -185,7 +185,7 @@
"id": "c7d572be-a913-4511-ab64-2864a252458a",
"metadata": {},
"source": [
"Indexing into an empty vectorstore"
"Indexing into an empty vector store:"
]
},
{
@ -285,7 +285,7 @@
"id": "7be3e55a-5fe9-4f40-beff-577c2aa5e76a",
"metadata": {},
"source": [
"Second time around all content will be skipped"
"Second time around all content will be skipped:"
]
},
{
@ -396,7 +396,7 @@
"id": "b205c1ba-f069-4a4e-af93-dc98afd5c9e6",
"metadata": {},
"source": [
"If we provide no documents with incremental indexing mode, nothing will change"
"If we provide no documents with incremental indexing mode, nothing will change."
]
},
{
@ -476,7 +476,7 @@
"\n",
"In `full` mode the user should pass the `full` universe of content that should be indexed into the indexing function.\n",
"\n",
"Any documents that are not passed into the indexing functino and are present in the vectorstore will be deleted!\n",
"Any documents that are not passed into the indexing function and are present in the vectorstore will be deleted!\n",
"\n",
"This behavior is useful to handle deletions of source documents."
]
@ -527,7 +527,7 @@
"id": "887c45c6-4363-4389-ac56-9cdad682b4c8",
"metadata": {},
"source": [
"Say someone deleted the first doc"
"Say someone deleted the first doc:"
]
},
{
@ -566,7 +566,7 @@
"id": "d940bcb4-cf6d-4c21-a565-e7f53f6dacf1",
"metadata": {},
"source": [
"Using full mode will clean up the deleted content as well"
"Using full mode will clean up the deleted content as well."
]
},
{
@ -716,7 +716,7 @@
"id": "ab1c0915-3f9e-42ac-bdb5-3017935c6e7f",
"metadata": {},
"source": [
"This should delete the old versions of documents associated with `doggy.txt` source and replace them with the new versions"
"This should delete the old versions of documents associated with `doggy.txt` source and replace them with the new versions."
]
},
{
@ -774,11 +774,11 @@
"id": "c0af4d24-d735-4e5d-ad9b-a2e8b281f9f1",
"metadata": {},
"source": [
"## Using with Loaders\n",
"## Using with loaders\n",
"\n",
"Indexing can accept either an iterable of documents or else any loader.\n",
"\n",
"**Attention** The loader **MUST** set source keys correctly."
"**Attention:** The loader **must** set source keys correctly."
]
},
{

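Tying the steps above together, a minimal sketch of the indexing flow; `vectorstore` is assumed to be an already-constructed vector store from the compatible list (it must support `add_documents` with `ids` and `delete` by `ids`), and the namespace is a placeholder:

```python
from langchain.indexes import SQLRecordManager, index
from langchain.schema import Document

# `vectorstore` is assumed to be a compatible vector store instance
namespace = "my_store/my_docs"  # placeholder: <vector store>/<collection>
record_manager = SQLRecordManager(namespace, db_url="sqlite:///record_manager_cache.sql")
record_manager.create_schema()

doc1 = Document(page_content="kitty", metadata={"source": "kitty.txt"})
doc2 = Document(page_content="doggy", metadata={"source": "doggy.txt"})

# `incremental` cleanup deletes old versions of changed documents;
# `full` additionally deletes documents missing from the indexed batch
index([doc1, doc2], record_manager, vectorstore, cleanup="incremental", source_id_key="source")
```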
@ -7,7 +7,7 @@
"source": [
"# MultiQueryRetriever\n",
"\n",
"Distance-based vector database retrieval embeds (represents) queries in high-dimensional space and finds similar embedded documents based on \"distance\". But, retrieval may produce difference results with subtle changes in query wording or if the embeddings do not capture the semantics of the data well. Prompt engineering / tuning is sometimes done to manually address these problems, but can be tedious.\n",
"Distance-based vector database retrieval embeds (represents) queries in high-dimensional space and finds similar embedded documents based on \"distance\". But, retrieval may produce different results with subtle changes in query wording or if the embeddings do not capture the semantics of the data well. Prompt engineering / tuning is sometimes done to manually address these problems, but can be tedious.\n",
"\n",
"The `MultiQueryRetriever` automates the process of prompt tuning by using an LLM to generate multiple queries from different perspectives for a given user input query. For each query, it retrieves a set of relevant documents and takes the unique union across all queries to get a larger set of potentially relevant documents. By generating multiple perspectives on the same question, the `MultiQueryRetriever` might be able to overcome some of the limitations of the distance-based retrieval and get a richer set of results."
]
@ -43,7 +43,7 @@
"id": "cca8f56c",
"metadata": {},
"source": [
"`Simple usage`\n",
"#### Simple usage\n",
"\n",
"Specify the LLM to use for query generation, and the retriver will do the rest."
]
@ -113,7 +113,7 @@
"id": "c54a282f",
"metadata": {},
"source": [
"`Supplying your own prompt`\n",
"#### Supplying your own prompt\n",
"\n",
"You can also supply a prompt along with an output parser to split the results into a list of queries."
]

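A minimal sketch of the simple usage path, assuming `vectordb` is an existing vector store built over your documents:

```python
from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever

# `vectordb` is assumed to be an existing vector store
retriever = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(),
    llm=ChatOpenAI(temperature=0),
)
docs = retriever.get_relevant_documents(query="What are the approaches to Task Decomposition?")
```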
@ -6,11 +6,11 @@
"source": [
"# Ensemble Retriever\n",
"\n",
"The `EnsembleRetriever` takes a list of retrievers as input and ensemble the results of their get_relevant_documents() methods and rerank the results based on the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.\n",
"The `EnsembleRetriever` takes a list of retrievers as input and ensemble the results of their `get_relevant_documents()` methods and rerank the results based on the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.\n",
"\n",
"By leveraging the strengths of different algorithms, the `EnsembleRetriever` can achieve better performance than any single algorithm. \n",
"\n",
"The most common pattern is to combine a sparse retriever(like BM25) with a dense retriever(like Embedding similarity), because their strengths are complementary. It is also known as \"hybrid search\".The sparse retriever is good at finding relevant documents based on keywords, while the dense retriever is good at finding relevant documents based on semantic similarity."
"The most common pattern is to combine a sparse retriever (like BM25) with a dense retriever (like embedding similarity), because their strengths are complementary. It is also known as \"hybrid search\". The sparse retriever is good at finding relevant documents based on keywords, while the dense retriever is good at finding relevant documents based on semantic similarity."
]
},
{

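A sketch of the hybrid-search pattern described above; `BM25Retriever` assumes the `rank_bm25` package is installed, and the toy texts are placeholders:

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.vectorstores import FAISS

doc_list = ["I like apples", "I like oranges", "Apples and oranges are fruits"]

# sparse, keyword-based retriever
bm25_retriever = BM25Retriever.from_texts(doc_list)
bm25_retriever.k = 2

# dense, embedding-similarity retriever
faiss_retriever = FAISS.from_texts(doc_list, OpenAIEmbeddings()).as_retriever(search_kwargs={"k": 2})

ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
)
docs = ensemble_retriever.get_relevant_documents("apples")
```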
@ -11,12 +11,12 @@
"\n",
"The methods to create multiple vectors per document include:\n",
"\n",
"- smaller chunks: split a document into smaller chunks, and embed those (this is ParentDocumentRetriever)\n",
"- summary: create a summary for each document, embed that along with (or instead of) the document\n",
"- hypothetical questions: create hypothetical questions that each document would be appropriate to answer, embed those along with (or instead of) the document\n",
"- Smaller chunks: split a document into smaller chunks, and embed those (this is ParentDocumentRetriever).\n",
"- Summary: create a summary for each document, embed that along with (or instead of) the document.\n",
"- Hypothetical questions: create hypothetical questions that each document would be appropriate to answer, embed those along with (or instead of) the document.\n",
"\n",
"\n",
"Note that this also enables another method of adding embeddings - manually. This is great because you can explicitly add questions or queries that should lead to a document being recovered, giving you more control"
"Note that this also enables another method of adding embeddings - manually. This is great because you can explicitly add questions or queries that should lead to a document being recovered, giving you more control."
]
},
{
@ -68,7 +68,7 @@
"source": [
"## Smaller chunks\n",
"\n",
"Often times it can be useful to retrieve larger chunks of information, but embed smaller chunks. This allows for embeddings to capture the semantic meaning as closely as possible, but for as much context as possible to be passed downstream. NOTE: this is what the ParentDocumentRetriever does. Here we show what is going on under the hood."
"Often times it can be useful to retrieve larger chunks of information, but embed smaller chunks. This allows for embeddings to capture the semantic meaning as closely as possible, but for as much context as possible to be passed downstream. Note that this is what the `ParentDocumentRetriever` does. Here we show what is going on under the hood."
]
},
{

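A sketch of the "smaller chunks" method under the hood, assuming `docs` is a list of already-loaded parent `Document`s and an OpenAI key is configured:

```python
import uuid

from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# `docs` is assumed to be a list of loaded parent Documents
vectorstore = Chroma(collection_name="full_documents", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=store, id_key=id_key)

doc_ids = [str(uuid.uuid4()) for _ in docs]
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

sub_docs = []
for i, doc in enumerate(docs):
    for chunk in child_splitter.split_documents([doc]):
        chunk.metadata[id_key] = doc_ids[i]  # tag each small chunk with its parent's id
        sub_docs.append(chunk)

retriever.vectorstore.add_documents(sub_docs)      # the small chunks are what gets embedded
retriever.docstore.mset(list(zip(doc_ids, docs)))  # the full parents are stored by id
```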
@ -15,7 +15,7 @@
"2. You want to have long enough documents that the context of each chunk is\n",
" retained.\n",
"\n",
"The ParentDocumentRetriever strikes that balance by splitting and storing\n",
"The `ParentDocumentRetriever` strikes that balance by splitting and storing\n",
"small chunks of data. During retrieval, it first fetches the small chunks\n",
"but then looks up the parent ids for those chunks and returns those larger\n",
"documents.\n",
@ -70,7 +70,7 @@
"id": "d3943f72",
"metadata": {},
"source": [
"## Retrieving Full Documents\n",
"## Retrieving full documents\n",
"\n",
"In this mode, we want to retrieve the full documents. Therefore, we only specify a child splitter."
]
@ -143,7 +143,7 @@
"id": "f895d62b",
"metadata": {},
"source": [
"Let's now call the vectorstore search functionality - we should see that it returns small chunks (since we're storing the small chunks)."
"Let's now call the vector store search functionality - we should see that it returns small chunks (since we're storing the small chunks)."
]
},
{
@ -220,7 +220,7 @@
"id": "14f813a5",
"metadata": {},
"source": [
"## Retrieving Larger Chunks\n",
"## Retrieving larger chunks\n",
"\n",
"Sometimes, the full documents can be too big to want to retrieve them as is. In that case, what we really want to do is to first split the raw documents into larger chunks, and then split it into smaller chunks. We then index the smaller chunks, but on retrieval we retrieve the larger chunks (but still not the full documents)."
]
@ -273,7 +273,7 @@
"id": "64ad3c8c",
"metadata": {},
"source": [
"We can see that there are much more than two documents now - these are the larger chunks"
"We can see that there are much more than two documents now - these are the larger chunks."
]
},
{
@ -302,7 +302,7 @@
"id": "baaef673",
"metadata": {},
"source": [
"Let's make sure the underlying vectorstore still retrieves the small chunks."
"Let's make sure the underlying vector store still retrieves the small chunks."
]
},
{

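A sketch of the full-document mode described above (only a child splitter is specified); `docs` is assumed to be a list of loaded `Document`s:

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# `docs` is assumed to be a list of loaded Documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
vectorstore = Chroma(collection_name="full_documents", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)
retriever.add_documents(docs)

# the vector store holds (and searches) the small chunks...
sub_docs = vectorstore.similarity_search("justice breyer")
# ...while the retriever returns the full parent documents
retrieved_docs = retriever.get_relevant_documents("justice breyer")
```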
@ -8,9 +8,9 @@
"source": [
"# Deep Lake self-querying \n",
"\n",
">[DeepLake](https://www.activeloop.ai) is a multimodal database for building AI applications.\n",
">[Deep Lake](https://www.activeloop.ai) is a multimodal database for building AI applications.\n",
"\n",
"In the notebook we'll demo the `SelfQueryRetriever` wrapped around a DeepLake vector store. "
"In the notebook we'll demo the `SelfQueryRetriever` wrapped around a Deep Lake vector store. "
]
},
{
@ -19,10 +19,10 @@
"id": "68e75fb9",
"metadata": {},
"source": [
"## Creating a Deep Lake vectorstore\n",
"First we'll want to create a DeepLake VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"## Creating a Deep Lake vector store\n",
"First we'll want to create a Deep Lake vector store and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"\n",
"NOTE: The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `deeplake` package."
"**Note:** The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `deeplake` package."
]
},
{

@ -17,10 +17,10 @@
"id": "68e75fb9",
"metadata": {},
"source": [
"## Creating a Chroma vectorstore\n",
"First we'll want to create a Chroma VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"## Creating a Chroma vector store\n",
"First we'll want to create a Chroma vector store and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"\n",
"NOTE: The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `chromadb` package."
"**Note:** The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `chromadb` package."
]
},
{
@ -64,7 +64,7 @@
},
"outputs": [
{
"name": "stdin",
"name": "stdout",
"output_type": "stream",
"text": [
"OpenAI API Key: ········\n"

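The same self-query construction applies across all of these vector stores; a hedged sketch with Chroma, assuming `vectorstore` has already been seeded with the movie documents:

```python
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever

# `vectorstore` is assumed to be a seeded Chroma (or other supported) store
metadata_field_info = [
    AttributeInfo(name="genre", description="The genre of the movie", type="string or list[string]"),
    AttributeInfo(name="year", description="The year the movie was released", type="integer"),
    AttributeInfo(name="rating", description="A 1-10 rating for the movie", type="float"),
]
document_content_description = "Brief summary of a movie"

retriever = SelfQueryRetriever.from_llm(
    OpenAI(temperature=0),
    vectorstore,
    document_content_description,
    metadata_field_info,
    verbose=True,
)
retriever.get_relevant_documents("What are some movies about dinosaurs rated higher than 8.5?")
```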
@ -13,10 +13,10 @@
"id": "68e75fb9",
"metadata": {},
"source": [
"## Creating a Elasticsearch vectorstore\n",
"First we'll want to create a Elasticsearch VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"## Creating a Elasticsearch vector store\n",
"First we'll want to create a Elasticsearch vector store and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"\n",
"NOTE: The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `elasticsearch` package."
"**Note:** The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `elasticsearch` package."
]
},
{

@ -9,7 +9,7 @@
"\n",
">[MyScale](https://docs.myscale.com/en/) is an integrated vector database. You can access your database in SQL and also from here, LangChain. MyScale can make a use of [various data types and functions for filters](https://blog.myscale.com/2023/06/06/why-integrated-database-solution-can-boost-your-llm-apps/#filter-on-anything-without-constraints). It will boost up your LLM app no matter if you are scaling up your data or expand your system to broader application.\n",
"\n",
"In the notebook we'll demo the `SelfQueryRetriever` wrapped around a MyScale vector store with some extra piece we contributed to LangChain. In short, it can be concluded into 4 points:\n",
"In the notebook we'll demo the `SelfQueryRetriever` wrapped around a MyScale vector store with some extra pieces we contributed to LangChain. In short, it can be condensed into 4 points:\n",
"1. Add `contain` comparator to match list of any if there is more than one element matched\n",
"2. Add `timestamp` data type for datetime match (ISO-format, or YYYY-MM-DD)\n",
"3. Add `like` comparator for string pattern search\n",
@ -21,10 +21,10 @@
"id": "68e75fb9",
"metadata": {},
"source": [
"## Creating a MyScale vectorstore\n",
"MyScale has already been integrated to LangChain for a while. So you can follow [this notebook](/docs/integrations/vectorstores/myscale.ipynb) to create your own vectorstore for a self-query retriever.\n",
"## Creating a MyScale vector store\n",
"MyScale has already been integrated to LangChain for a while. So you can follow [this notebook](/docs/integrations/vectorstores/myscale) to create your own vectorstore for a self-query retriever.\n",
"\n",
"NOTE: All self-query retrievers requires you to have `lark` installed (`pip install lark`). We use `lark` for grammar definition. Before you proceed to the next step, we also want to remind you that `clickhouse-connect` is also needed to interact with your MyScale backend."
"**Note:** All self-query retrievers requires you to have `lark` installed (`pip install lark`). We use `lark` for grammar definition. Before you proceed to the next step, we also want to remind you that `clickhouse-connect` is also needed to interact with your MyScale backend."
]
},
{
@ -44,7 +44,7 @@
"id": "83811610-7df3-4ede-b268-68a6a83ba9e2",
"metadata": {},
"source": [
"In this tutorial we follow other example's setting and use `OpenAIEmbeddings`. Remember to get a OpenAI API Key for valid accesss to LLMs."
"In this tutorial we follow other example's setting and use `OpenAIEmbeddings`. Remember to get an OpenAI API Key for valid accesss to LLMs."
]
},
{
@ -88,7 +88,7 @@
"metadata": {},
"source": [
"## Create some sample data\n",
"As you can see, the data we created has some difference to other self-query retrievers. We replaced keyword `year` to `date` which gives you a finer control on timestamps. We also altered the type of keyword `gerne` to list of strings, where LLM can use a new `contain` comparator to construct filters. We also provides comparator `like` and arbitrary function support to filters, which will be introduced in next few cells.\n",
"As you can see, the data we created has some differences compared to other self-query retrievers. We replaced the keyword `year` with `date` which gives you finer control on timestamps. We also changed the type of the keyword `gerne` to a list of strings, where an LLM can use a new `contain` comparator to construct filters. We also provide the `like` comparator and arbitrary function support to filters, which will be introduced in next few cells.\n",
"\n",
"Now let's look at the data first."
]
@ -146,7 +146,7 @@
"metadata": {},
"source": [
"## Creating our self-querying retriever\n",
"Just like other retrievers... Simple and nice."
"Just like other retrievers... simple and nice."
]
},
{
@ -272,7 +272,7 @@
"id": "86371ac8",
"metadata": {},
"source": [
"# Wait a second... What else?\n",
"# Wait a second... what else?\n",
"\n",
"Self-query retriever with MyScale can do more! Let's find out."
]

@ -16,11 +16,11 @@
"metadata": {},
"source": [
"## Creating a Pinecone index\n",
"First we'll want to create a `Pinecone` VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"First we'll want to create a `Pinecone` vector store and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"\n",
"To use Pinecone, you have to have `pinecone` package installed and you must have an API key and an Environment. Here are the [installation instructions](https://docs.pinecone.io/docs/quickstart).\n",
"To use Pinecone, you have to have `pinecone` package installed and you must have an API key and an environment. Here are the [installation instructions](https://docs.pinecone.io/docs/quickstart).\n",
"\n",
"NOTE: The self-query retriever requires you to have `lark` package installed."
"**Note:** The self-query retriever requires you to have `lark` package installed."
]
},
{

@ -8,7 +8,7 @@
"source": [
"# Qdrant self-querying \n",
"\n",
">[Qdrant](https://qdrant.tech/documentation/) (read: quadrant ) is a vector similarity search engine. It provides a production-ready service with a convenient API to store, search, and manage points - vectors with an additional payload. `Qdrant` is tailored to extended filtering support. It makes it useful \n",
">[Qdrant](https://qdrant.tech/documentation/) (read: quadrant) is a vector similarity search engine. It provides a production-ready service with a convenient API to store, search, and manage points - vectors with an additional payload. `Qdrant` is tailored to extended filtering support.\n",
"\n",
"In the notebook we'll demo the `SelfQueryRetriever` wrapped around a Qdrant vector store. "
]
@ -19,10 +19,10 @@
"id": "68e75fb9",
"metadata": {},
"source": [
"## Creating a Qdrant vectorstore\n",
"First we'll want to create a Qdrant VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"## Creating a Qdrant vector store\n",
"First we'll want to create a Qdrant vector store and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"\n",
"NOTE: The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `qdrant-client` package."
"**Note:** The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `qdrant-client` package."
]
},
{

@ -13,10 +13,10 @@
"id": "68e75fb9",
"metadata": {},
"source": [
"## Creating a Weaviate vectorstore\n",
"First we'll want to create a Weaviate VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"## Creating a Weaviate vector store\n",
"First we'll want to create a Weaviate vector store and seed it with some data. We've created a small demo set of documents that contain summaries of movies.\n",
"\n",
"NOTE: The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `weaviate-client` package."
"**Note:** The self-query retriever requires you to have `lark` installed (`pip install lark`). We also need the `weaviate-client` package."
]
},
{

@ -80,9 +80,9 @@
"id": "39114da4",
"metadata": {},
"source": [
"`Run with citations`\n",
"#### Run with citations\n",
"\n",
"We can use `RetrievalQAWithSourcesChain` to retrieve docs and provide citations"
"We can use `RetrievalQAWithSourcesChain` to retrieve docs and provide citations."
]
},
{
@ -126,7 +126,7 @@
"id": "357559fd",
"metadata": {},
"source": [
"`Run with logging`\n",
"#### Run with logging\n",
"\n",
"Here, we use `get_relevant_documents` method to return docs."
]
@ -171,9 +171,9 @@
"id": "b681a846",
"metadata": {},
"source": [
"`Generate answer using retrieved docs`\n",
"#### Generate answer using retrieved docs\n",
"\n",
"We can use `load_qa_chain` for QA using the retrieved docs"
"We can use `load_qa_chain` for QA using the retrieved docs."
]
},
{
@ -207,7 +207,7 @@
"source": [
"### More flexibility\n",
"\n",
"Pass an LLM chain with custom prompt and output parsing"
"Pass an LLM chain with custom prompt and output parsing."
]
},
{
@ -326,7 +326,7 @@
"source": [
"### Run locally\n",
"\n",
"Specify LLM and embeddings that will run locally (e.g., on your laptop)"
"Specify LLM and embeddings that will run locally (e.g., on your laptop)."
]
},
{

@ -9,13 +9,8 @@
"\n",
"Embeddings can be stored or temporarily cached to avoid needing to recompute them.\n",
"\n",
"Caching embeddings can be done using a `CacheBackedEmbeddings`.\n",
"\n",
"The cache backed embedder is a wrapper around an embedder that caches\n",
"embeddings in a key-value store. \n",
"\n",
"The text is hashed and the hash is used as the key in the cache.\n",
"\n",
"Caching embeddings can be done using a `CacheBackedEmbeddings`. The cache backed embedder is a wrapper around an embedder that caches\n",
"embeddings in a key-value store. The text is hashed and the hash is used as the key in the cache.\n",
"\n",
"The main supported way to initialized a `CacheBackedEmbeddings` is `from_bytes_store`. This takes in the following parameters:\n",
"\n",
@ -44,9 +39,9 @@
"id": "9ddf07dd-3e72-41de-99d4-78e9521e272f",
"metadata": {},
"source": [
"## Using with a vectorstore\n",
"## Using with a vector store\n",
"\n",
"First, let's see an example that uses the local file system for storing embeddings and uses FAISS vectorstore for retrieval."
"First, let's see an example that uses the local file system for storing embeddings and uses FAISS vector store for retrieval."
]
},
{
@ -91,7 +86,7 @@
"id": "f8cdf33c-321d-4d2c-b76b-d6f5f8b42a92",
"metadata": {},
"source": [
"The cache is empty prior to embedding"
"The cache is empty prior to embedding:"
]
},
{
@ -140,7 +135,7 @@
"id": "f526444b-93f8-423f-b6d1-dab539450921",
"metadata": {},
"source": [
"create the vectorstore"
"Create the vector store:"
]
},
{
@ -168,7 +163,7 @@
"id": "64fc53f5-d559-467f-bf62-5daef32ffbc0",
"metadata": {},
"source": [
"If we try to create the vectostore again, it'll be much faster since it does not need to re-compute any embeddings."
"If we try to create the vector store again, it'll be much faster since it does not need to re-compute any embeddings."
]
},
{

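A sketch of the file-system cache plus FAISS setup described above, assuming the `state_of_the_union.txt` example file exists locally and an OpenAI key is configured:

```python
from langchain.document_loaders import TextLoader
from langchain.embeddings import CacheBackedEmbeddings, OpenAIEmbeddings
from langchain.storage import LocalFileStore
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

underlying_embeddings = OpenAIEmbeddings()
fs = LocalFileStore("./cache/")

cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings, fs, namespace=underlying_embeddings.model
)

raw_documents = TextLoader("state_of_the_union.txt").load()
documents = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0).split_documents(raw_documents)

# the first build computes and caches the embeddings; rebuilding is much faster
db = FAISS.from_documents(documents, cached_embedder)
```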
@ -1,4 +1,4 @@
The simplest loader reads in a file as text and places it all into one Document.
The simplest loader reads in a file as text and places it all into one document.
```python
from langchain.document_loaders import TextLoader

@ -19,7 +19,7 @@ print(data)
</CodeOutputBlock>
## Customizing the csv parsing and loading
## Customizing the CSV parsing and loading
See the [csv module](https://docs.python.org/3/library/csv.html) documentation for more information on what csv args are supported.

@ -1,4 +1,4 @@
Under the hood, by default this uses the [UnstructuredLoader](/docs/integrations/document_loaders/unstructured_file.html)
Under the hood, by default this uses the [UnstructuredLoader](/docs/integrations/document_loaders/unstructured_file.html).
```python
from langchain.document_loaders import DirectoryLoader
@ -121,7 +121,7 @@ len(docs)
</CodeOutputBlock>
## Auto detect file encodings with TextLoader
## Auto-detect file encodings with TextLoader
In this example we will see some strategies that can be useful when loading a big list of arbitrary files from a directory using the `TextLoader` class.
@ -212,7 +212,7 @@ loader.load()
</HTMLOutputBlock>
The file `example-non-utf8.txt` uses a different encoding the `load()` function fails with a helpful message indicating which file failed decoding.
The file `example-non-utf8.txt` uses a different encoding, so the `load()` function fails with a helpful message indicating which file failed decoding.
With the default behavior of `TextLoader`, any failure to load any of the documents will fail the whole loading process, and no documents are loaded.
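Two hedged ways around that, as sketched in this walkthrough: skip the failing files, or let `TextLoader` try to detect each file's encoding (the directory path is a placeholder):

```python
from langchain.document_loaders import DirectoryLoader, TextLoader

path = "./example_data/"  # placeholder directory with mixed encodings

# option 1: skip the files that fail to decode instead of aborting the whole load
loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader, silent_errors=True)
docs = loader.load()

# option 2: let TextLoader auto-detect each file's encoding before failing
text_loader_kwargs = {"autodetect_encoding": True}
loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
docs = loader.load()
```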

@ -139,9 +139,9 @@ data[0]
### Fetching remote PDFs using Unstructured
This covers how to load online pdfs into a document format that we can use downstream. This can be used for various online pdf sites such as https://open.umn.edu/opentextbooks/textbooks/ and https://arxiv.org/archive/
This covers how to load online PDFs into a document format that we can use downstream. This can be used for various online PDF sites such as https://open.umn.edu/opentextbooks/textbooks/ and https://arxiv.org/archive/
Note: all other pdf loaders can also be used to fetch remote PDFs, but `OnlinePDFLoader` is a legacy function, and works specifically with `UnstructuredPDFLoader`.
Note: all other PDF loaders can also be used to fetch remote PDFs, but `OnlinePDFLoader` is a legacy function, and works specifically with `UnstructuredPDFLoader`.
@ -208,7 +208,7 @@ data = loader.load()
### Using PDFMiner to generate HTML text
This can be helpful for chunking texts semantically into sections as the output html content can be parsed via `BeautifulSoup` to get more structured and rich information about font size, page numbers, pdf headers/footers, etc.
This can be helpful for chunking texts semantically into sections as the output HTML content can be parsed via `BeautifulSoup` to get more structured and rich information about font size, page numbers, PDF headers/footers, etc.
```python
@ -222,7 +222,7 @@ loader = PDFMinerPDFasHTMLLoader("example_data/layout-parser-paper.pdf")
```python
data = loader.load()[0] # entire pdf is loaded as a single Document
data = loader.load()[0] # entire PDF is loaded as a single Document
```
@ -259,7 +259,7 @@ for c in content:
cur_text = c.text
snippets.append((cur_text,cur_fs))
# Note: The above logic is very straightforward. One can also add more strategies such as removing duplicate snippets (as
# headers/footers in a PDF appear on multiple pages so if we find duplicatess safe to assume that it is redundant info)
# headers/footers in a PDF appear on multiple pages so if we find duplicates it's safe to assume that it is redundant info)
```
@ -285,7 +285,7 @@ for s in snippets:
continue
# if current snippet's font size > previous section's content but less than previous section's heading than also make a new
# section (e.g. title of a pdf will have the highest font size but we don't want it to subsume all sections)
# section (e.g. title of a PDF will have the highest font size but we don't want it to subsume all sections)
metadata={'heading':s[0], 'content_font': 0, 'heading_font': s[1]}
metadata.update(data.metadata)
semantic_snippets.append(Document(page_content='',metadata=metadata))
@ -358,7 +358,7 @@ loader = PyPDFDirectoryLoader("example_data/")
docs = loader.load()
```
## Using pdfplumber
## Using PDFPlumber
Like PyMuPDF, the output documents contain detailed metadata about the PDF and its pages, and the loader returns one document per page.

@ -50,7 +50,7 @@ RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)
## Python
Here's an example using the PythonTextSplitter
Here's an example using the PythonTextSplitter:
```python
@ -78,7 +78,7 @@ python_docs
</CodeOutputBlock>
## JS
Here's an example using the JS text splitter
Here's an example using the JS text splitter:
```python
@ -109,7 +109,7 @@ js_docs
## Markdown
Here's an example using the Markdown text splitter.
Here's an example using the Markdown text splitter:
````python
@ -155,7 +155,7 @@ md_docs
## Latex
Here's an example on Latex text
Here's an example on Latex text:
```python
@ -219,7 +219,7 @@ latex_docs
## HTML
Here's an example using an HTML text splitter
Here's an example using an HTML text splitter:
```python
@ -281,7 +281,7 @@ html_docs
## Solidity
Here's an example using the Solidity text splitter
Here's an example using the Solidity text splitter:
```python
SOL_CODE = """

@ -36,27 +36,27 @@ class BaseRetriever(ABC):
It's that simple! You can call `get_relevant_documents` or the async `aget_relevant_documents` methods to retrieve documents relevant to a query, where "relevance" is defined by
the specific retriever object you are calling.
Of course, we also help construct what we think useful Retrievers are. The main type of Retriever that we focus on is a Vectorstore retriever. We will focus on that for the rest of this guide.
Of course, we also help construct what we think are useful retrievers. The main type of retriever that we focus on is a vector store retriever. We will focus on that for the rest of this guide.
In order to understand what a vectorstore retriever is, it's important to understand what a Vectorstore is. So let's look at that.
In order to understand what a vector store retriever is, it's important to understand what a vector store is. So let's look at that.
By default, LangChain uses [Chroma](/docs/ecosystem/integrations/chroma.html) as the vectorstore to index and search embeddings. To walk through this tutorial, we'll first need to install `chromadb`.
By default, LangChain uses [Chroma](/docs/ecosystem/integrations/chroma.html) as the vector store to index and search embeddings. To walk through this tutorial, we'll first need to install `chromadb`.
```
pip install chromadb
```
This example showcases question answering over documents.
We have chosen this as the example for getting started because it nicely combines a lot of different elements (Text splitters, embeddings, vectorstores) and then also shows how to use them in a chain.
We have chosen this as the example for getting started because it nicely combines a lot of different elements (text splitters, embeddings, vector stores) and then also shows how to use them in a chain.
Question answering over documents consists of four steps:
1. Create an index
2. Create a Retriever from that index
2. Create a retriever from that index
3. Create a question answering chain
4. Ask questions!
Each of the steps has multiple sub steps and potential configurations. In this notebook we will primarily focus on (1). We will start by showing the one-liner for doing so, but then break down what is actually going on.
Each of the steps has multiple substeps and potential configurations. In this notebook we will primarily focus on (1). We will start by showing the one-liner for doing so, but then break down what is actually going on.
First, let's import some common classes we'll use no matter what.
@ -66,7 +66,7 @@ from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
```
Next in the generic setup, let's specify the document loader we want to use. You can download the `state_of_the_union.txt` file [here](https://github.com/hwchase17/langchain/blob/master/docs/extras/modules/state_of_the_union.txt)
Next in the generic setup, let's specify the document loader we want to use. You can download the `state_of_the_union.txt` file [here](https://github.com/hwchase17/langchain/blob/master/docs/extras/modules/state_of_the_union.txt).
```python
@ -129,7 +129,7 @@ index.query_with_sources(query)
</CodeOutputBlock>
What is returned from the `VectorstoreIndexCreator` is `VectorStoreIndexWrapper`, which provides these nice `query` and `query_with_sources` functionality. If we just wanted to access the vectorstore directly, we can also do that.
What is returned from the `VectorstoreIndexCreator` is a `VectorStoreIndexWrapper`, which provides these nice `query` and `query_with_sources` functionalities. If we just want to access the vector store directly, we can also do that.
```python
@ -144,7 +144,7 @@ index.vectorstore
</CodeOutputBlock>
If we then want to access the VectorstoreRetriever, we can do that with:
If we then want to access the `VectorStoreRetriever`, we can do that with:
```python
@ -159,7 +159,7 @@ index.vectorstore.as_retriever()
</CodeOutputBlock>
It can also be convenient to filter the vectorstore by the metadata associated with documents, particularly when your vectorstore has multiple sources. This can be done using the `query` method like so:
It can also be convenient to filter the vector store by the metadata associated with documents, particularly when your vector store has multiple sources. This can be done using the `query` method like so:
```python
@ -185,7 +185,7 @@ There are three main steps going on after the documents are loaded:
1. Splitting documents into chunks
2. Creating embeddings for each document
3. Storing documents and embeddings in a vectorstore
3. Storing documents and embeddings in a vector store
Let's walk through this in code.
@ -211,7 +211,7 @@ from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
```
We now create the vectorstore to use as the index.
We now create the vector store to use as the index.
```python

@ -9,9 +9,9 @@ from langchain.schema import Document
from langchain.vectorstores import FAISS
```
## Low Decay Rate
## Low decay rate
A low `decay rate` (in this, to be extreme, we will set close to 0) means memories will be "remembered" for longer. A `decay rate` of 0 means memories never be forgotten, making this retriever equivalent to the vector lookup.
A low `decay rate` (in this, to be extreme, we will set it close to 0) means memories will be "remembered" for longer. A `decay rate` of 0 means memories will never be forgotten, making this retriever equivalent to the vector lookup.
```python
@ -53,7 +53,7 @@ retriever.get_relevant_documents("hello world")
</CodeOutputBlock>
## High Decay Rate
## High decay rate
With a high `decay rate` (e.g., several 9's), the `recency score` quickly goes to 0! If you set this all the way to 1, `recency` is 0 for all objects, once again making this equivalent to a vector lookup.
@ -98,9 +98,9 @@ retriever.get_relevant_documents("hello world")
</CodeOutputBlock>
## Virtual Time
## Virtual time
Using some utils in LangChain, you can mock out the time component
Using some utils in LangChain, you can mock out the time component.
```python

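A sketch of the low decay rate setup, assuming the `faiss-cpu` package is installed and an OpenAI key is configured:

```python
import faiss
from langchain.docstore import InMemoryDocstore
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain.schema import Document
from langchain.vectorstores import FAISS

embeddings_model = OpenAIEmbeddings()
index = faiss.IndexFlatL2(1536)  # dimensionality of the OpenAI embeddings
vectorstore = FAISS(embeddings_model.embed_query, index, InMemoryDocstore({}), {})

# decay_rate close to 0: memories are "remembered" almost indefinitely
retriever = TimeWeightedVectorStoreRetriever(vectorstore=vectorstore, decay_rate=1e-25, k=1)
retriever.add_documents([Document(page_content="hello world")])
retriever.get_relevant_documents("hello world")
```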
@ -34,8 +34,8 @@ retriever = db.as_retriever()
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
```
## Maximum Marginal Relevance Retrieval
By default, the vectorstore retriever uses similarity search. If the underlying vectorstore support maximum marginal relevance search, you can specify that as the search type.
## Maximum marginal relevance retrieval
By default, the vector store retriever uses similarity search. If the underlying vector store supports maximum marginal relevance search, you can specify that as the search type.
```python
@ -47,9 +47,9 @@ retriever = db.as_retriever(search_type="mmr")
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")
```
## Similarity Score Threshold Retrieval
## Similarity score threshold retrieval
You can also a retrieval method that sets a similarity score threshold and only returns documents with a score above that threshold
You can also use a retrieval method that sets a similarity score threshold and only returns documents with a score above that threshold.
```python

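Both search types above are selected through `as_retriever`; a sketch assuming `db` is an existing vector store whose backend supports them:

```python
# `db` is assumed to be an existing vector store
retriever = db.as_retriever(search_type="mmr")

# only return documents whose similarity score clears a threshold
retriever = db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.5},
)
```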
@ -1,11 +1,11 @@
## Get started
We'll use a Pinecone vector store in this example.
First we'll want to create a `Pinecone` VectorStore and seed it with some data. We've created a small demo set of documents that contain summaries of movies.
First we'll want to create a `Pinecone` vector store and seed it with some data. We've created a small demo set of documents that contain summaries of movies.
To use Pinecone, you need to have `pinecone` package installed and you must have an API key and an Environment. Here are the [installation instructions](https://docs.pinecone.io/docs/quickstart).
To use Pinecone, you need to have the `pinecone` package installed and you must have an API key and an environment. Here are the [installation instructions](https://docs.pinecone.io/docs/quickstart).
NOTE: The self-query retriever requires you to have `lark` package installed.
**Note:** The self-query retriever requires you to have the `lark` package installed.
```python

@ -20,7 +20,7 @@ from langchain.embeddings import OpenAIEmbeddings
embeddings_model = OpenAIEmbeddings(openai_api_key="...")
```
otherwise you can initialize without any params:
Otherwise you can initialize without any params:
```python
from langchain.embeddings import OpenAIEmbeddings

@ -1,4 +1,4 @@
Langchain supports async operation on vector stores. All the methods might be called using their async counterparts, with the prefix `a`, meaning `async`.
LangChain supports async operation on vector stores. All the methods might be called using their async counterparts, with the prefix `a`, meaning `async`.
`Qdrant` is a vector store that supports all the async operations, so it will be used in this walkthrough.
@ -47,7 +47,7 @@ docs = await db.asimilarity_search_by_vector(embedding_vector)
## Maximum marginal relevance search (MMR)
Maximal marginal relevance optimizes for similarity to query AND diversity among selected documents. It is also supported in async API.
Maximal marginal relevance optimizes for similarity to query **and** diversity among selected documents. It is also supported in async API.
```python
query = "What did the president say about Ketanji Brown Jackson"

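A sketch of the async calls mentioned above, assuming `db` is a Qdrant vector store that has already been populated and that the code runs inside an async function:

```python
# inside an async function; `db` is assumed to be a populated Qdrant vector store
query = "What did the president say about Ketanji Brown Jackson"

docs = await db.asimilarity_search(query)
docs = await db.amax_marginal_relevance_search(query, k=2, fetch_k=10)
```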