mirror of
https://github.com/hwchase17/langchain
synced 2024-11-06 03:20:49 +00:00
Improve vector store onboarding exp (#6698)
This PR
- fixes the `similarity_search_by_vector` example, makes the code run and adds the example to mirror `similarity_search`
- reverts back to chroma from faiss to remove sharp edges / create a happy path for new developers. (1) real metadata filtering, (2) expected functionality like `update`, `delete`, etc to serve beyond the most trivial use cases

@hwchase17

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
parent 25a2bdfb70
commit dc8b790214
@@ -8,6 +8,8 @@ vectors, and then at query time to embed the unstructured query and retrieve the
'most similar' to the embedded query. A vector store takes care of storing embedded data and performing vector search
for you.
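
The store-then-search flow described above can be sketched in plain Python. This is a toy illustration and not LangChain's implementation: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `ToyVectorStore` is an assumed name whose methods only mirror the `similarity_search` and `similarity_search_by_vector` calls used in the walkthrough.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Hypothetical stand-in embedding: a sparse bag-of-words count vector."""
    return {word: float(count) for word, count in Counter(text.lower().split()).items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Keeps (text, vector) pairs in memory and ranks them by cosine similarity."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, Dict[str, float]]] = []

    def add_texts(self, texts: List[str]) -> None:
        # Embed each text once at insertion time and store the vector next to it.
        for text in texts:
            self._rows.append((text, embed(text)))

    def similarity_search_by_vector(self, vector: Dict[str, float], k: int = 1) -> List[str]:
        # Brute-force search: score every stored vector and keep the top k.
        ranked = sorted(self._rows, key=lambda row: cosine(vector, row[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

    def similarity_search(self, query: str, k: int = 1) -> List[str]:
        # Searching by text is embedding the query, then searching by vector.
        return self.similarity_search_by_vector(embed(query), k)


store = ToyVectorStore()
store.add_texts([
    "the cat sat on the mat",
    "stock prices fell sharply today",
    "a dog chased the cat",
])
print(store.similarity_search("where is the cat", k=1))
```

Real vector stores replace the brute-force scan with an index and the toy embedding with a learned model, but the two steps (embed at write time, embed and compare at query time) are the same.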

![vector store diagram](/img/vector_stores.jpg)

## Get started

This walkthrough showcases basic functionality related to vector stores. A key part of working with vector stores is creating the vectors to put in them, which are usually created via embeddings. Therefore, it is recommended that you familiarize yourself with the [text embedding model](/docs/modules/data_connection/text_embedding/) interfaces before diving into this.
BIN
docs/docs_skeleton/static/img/vector_stores.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 858 KiB |
@@ -1,3 +1,43 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

There are many great vector store options. Here are a few that are free, open-source, and run entirely on your local machine. Review all integrations for the many hosted offerings.

<Tabs>
<TabItem value="chroma" label="Chroma" default>

This walkthrough uses the `chroma` vector database, which runs on your local machine as a library.

```bash
pip install chromadb
```

We want to use OpenAIEmbeddings, so we have to get the OpenAI API key.

```python
import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
```
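
If the key is already exported in your shell, prompting for it again is unnecessary. A minimal sketch of that check follows; `ensure_api_key` is a hypothetical helper name, not part of LangChain or the OpenAI client.

```python
import getpass
import os


def ensure_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, prompting only when it is missing."""
    if not os.environ.get(name):
        # getpass hides the typed value so the key does not end up in the terminal scrollback.
        os.environ[name] = getpass.getpass(f"{name}: ")
    return os.environ[name]
```

Calling `ensure_api_key()` once at the top of a script or notebook has the same effect as the snippet above when the variable is unset, and is a no-op otherwise.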

```python
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
raw_documents = TextLoader('../../../state_of_the_union.txt').load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
db = Chroma.from_documents(documents, OpenAIEmbeddings())
```
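
To make the "split it into chunks" step concrete, here is a simplified sketch of separator-based chunking with `chunk_size=1000` and no overlap. It is an approximation for illustration only and not `CharacterTextSplitter`'s actual implementation; the function name `split_text` and the paragraph separator are assumptions.

```python
from typing import List


def split_text(text: str, chunk_size: int = 1000, separator: str = "\n\n") -> List[str]:
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it is a single oversized piece
            # that is kept whole rather than cut mid-paragraph.
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


paragraphs = ["a" * 400, "b" * 400, "c" * 400]
chunks = split_text("\n\n".join(paragraphs), chunk_size=1000)
print([len(chunk) for chunk in chunks])
```

Each resulting chunk is what gets embedded and stored as one entry, so `chunk_size` directly controls how much text a single search hit returns.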

</TabItem>
<TabItem value="faiss" label="FAISS">

This walkthrough uses the `FAISS` vector database, which makes use of the Facebook AI Similarity Search (FAISS) library.

```bash
@@ -14,22 +54,71 @@ import getpass
os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
```

```python
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
raw_documents = TextLoader('../../../state_of_the_union.txt').load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
db = FAISS.from_documents(documents, OpenAIEmbeddings())
```

</TabItem>
<TabItem value="lance" label="Lance">

This walkthrough uses the LanceDB vector database, which is based on the Lance data format.

```bash
pip install lancedb
```

We want to use OpenAIEmbeddings, so we have to get the OpenAI API key.

```python
import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
```

```python
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import LanceDB

import lancedb

# Define the embedding model once so it can seed the table and embed the documents.
embeddings = OpenAIEmbeddings()

db = lancedb.connect("/tmp/lancedb")
table = db.create_table(
    "my_table",
    data=[
        {
            "vector": embeddings.embed_query("Hello World"),
            "text": "Hello World",
            "id": "1",
        }
    ],
    mode="overwrite",
)

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
raw_documents = TextLoader('../../../state_of_the_union.txt').load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
db = LanceDB.from_documents(documents, embeddings, connection=table)
```

</TabItem>
</Tabs>

### Similarity search

```python
@@ -57,6 +146,23 @@ print(docs[0].page_content)
You can also search for documents similar to a given embedding vector using `similarity_search_by_vector`, which accepts an embedding vector as a parameter instead of a string.

```python
embedding_vector = OpenAIEmbeddings().embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)
print(docs[0].page_content)
```

The query is the same, so the result is also the same.

<CodeOutputBlock lang="python">

```
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
```

</CodeOutputBlock>