Improve vector store onboarding exp (#6698)

This PR
- fixes the `similarity_search_by_vector` example, makes the code run
and adds the example to mirror `similarity_search`
- reverts back to chroma from faiss to remove sharp edges / create a
happy path for new developers. (1) real metadata filtering, (2) expected
functionality like `update`, `delete`, etc to serve beyond the most
trivial use cases

@hwchase17

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
This commit is contained in:
Jeff Huber 2023-07-18 13:48:42 -07:00 committed by GitHub
parent 25a2bdfb70
commit dc8b790214
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 114 additions and 6 deletions

View File

@ -8,6 +8,8 @@ vectors, and then at query time to embed the unstructured query and retrieve the
'most similar' to the embedded query. A vector store takes care of storing embedded data and performing vector search
for you.
![vector store diagram](/img/vector_stores.jpg)
## Get started
This walkthrough showcases basic functionality related to VectorStores. A key part of working with vector stores is creating the vector to put in them, which is usually created via embeddings. Therefore, it is recommended that you familiarize yourself with the [text embedding model](/docs/modules/data_connection/text_embedding/) interfaces before diving into this.

Binary file not shown.

After

Width:  |  Height:  |  Size: 858 KiB

View File

@ -1,3 +1,43 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
There are many great vector store options, here are a few that are free, open-source, and run entirely on your local machine. Review all integrations for many great hosted offerings.
<Tabs>
<TabItem value="chroma" label="Chroma" default>
This walkthrough uses the `chroma` vector database, which runs on your local machine as a library.
```bash
pip install chromadb
```
We want to use OpenAIEmbeddings so we have to get the OpenAI API Key.
```python
import os
import getpass
os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
```
```python
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
# Load the document, split it into chunks, embed each chunk and load it into the vector store.
raw_documents = TextLoader('../../../state_of_the_union.txt').load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
db = Chroma.from_documents(documents, OpenAIEmbeddings())
```
</TabItem>
<TabItem value="faiss" label="FAISS">
This walkthrough uses the `FAISS` vector database, which makes use of the Facebook AI Similarity Search (FAISS) library.
```bash
@ -14,22 +54,71 @@ import getpass
os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
```
```python
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
# Load the document, split it into chunks, embed each chunk and load it into the vector store.
raw_documents = TextLoader('../../../state_of_the_union.txt').load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(documents, embeddings)
db = FAISS.from_documents(documents, OpenAIEmbeddings())
```
</TabItem>
<TabItem value="lance" label="Lance">
This notebook shows how to use functionality related to the LanceDB vector database based on the Lance data format.
```bash
pip install lancedb
```
We want to use OpenAIEmbeddings so we have to get the OpenAI API Key.
```python
import os
import getpass
os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
```
```python
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import LanceDB
import lancedb
db = lancedb.connect("/tmp/lancedb")
table = db.create_table(
"my_table",
data=[
{
"vector": embeddings.embed_query("Hello World"),
"text": "Hello World",
"id": "1",
}
],
mode="overwrite",
)
# Load the document, split it into chunks, embed each chunk and load it into the vector store.
raw_documents = TextLoader('../../../state_of_the_union.txt').load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
db = LanceDB.from_documents(documents, OpenAIEmbeddings(), connection=table)
```
</TabItem>
</Tabs>
### Similarity search
```python
@ -57,6 +146,23 @@ print(docs[0].page_content)
It is also possible to do a search for documents similar to a given embedding vector using `similarity_search_by_vector` which accepts an embedding vector as a parameter instead of a string.
```python
embedding_vector = embeddings.embed_query(query)
embedding_vector = OpenAIEmbeddings().embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)
print(docs[0].page_content)
```
The query is the same, and so the result is also the same.
<CodeOutputBlock lang="python">
```
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections.
Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence.
```
</CodeOutputBlock>