Harrison/lancedb (#3634)

Co-authored-by: Minh Le <minhle@canva.com>
fix_agent_callbacks
Harrison Chase 1 year ago committed by GitHub
parent 52b5290810
commit a35bbbfa9e
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -0,0 +1,23 @@
# LanceDB
This page covers how to use [LanceDB](https://github.com/lancedb/lancedb) within LangChain.
It is broken into two parts: installation and setup, and then references to specific LanceDB wrappers.
## Installation and Setup
- Install the Python SDK with `pip install lancedb`
## Wrappers
### VectorStore
There exists a wrapper around LanceDB databases, allowing you to use it as a vectorstore,
whether for semantic search or example selection.
To import this vectorstore:
```python
from langchain.vectorstores import LanceDB
```
For a more detailed walkthrough of the LanceDB wrapper, see [this notebook](../modules/indexes/vectorstores/examples/lancedb.ipynb)

@ -0,0 +1,179 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "683953b3",
"metadata": {},
"source": [
"# LanceDB\n",
"\n",
"This notebook shows how to use functionality related to the LanceDB vector database based on the Lance data format."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "bfcf346a",
"metadata": {},
"outputs": [],
"source": [
"#!pip install lancedb"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "aac9563e",
"metadata": {},
"outputs": [],
"source": [
"from langchain.embeddings import OpenAIEmbeddings\n",
"from langchain.vectorstores import LanceDB"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "a3c3999a",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import TextLoader\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"loader = TextLoader('../../../state_of_the_union.txt')\n",
"documents = loader.load()\n",
"\n",
"documents = CharacterTextSplitter().split_documents(documents)\n",
"\n",
"embeddings = OpenAIEmbeddings()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "6e104aee",
"metadata": {},
"outputs": [],
"source": [
"import lancedb\n",
"\n",
"db = lancedb.connect('/tmp/lancedb')\n",
"table = db.create_table(\"my_table\", data=[\n",
" {\"vector\": embeddings.embed_query(\"Hello World\"), \"text\": \"Hello World\", \"id\": \"1\"}\n",
"], mode=\"overwrite\")\n",
"\n",
"docsearch = LanceDB.from_documents(documents, embeddings, connection=table)\n",
"\n",
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"docs = docsearch.similarity_search(query)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "9c608226",
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"They were responding to a 9-1-1 call when a man shot and killed them with a stolen gun. \n",
"\n",
"Officer Mora was 27 years old. \n",
"\n",
"Officer Rivera was 22. \n",
"\n",
"Both Dominican Americans whod grown up on the same streets they later chose to patrol as police officers. \n",
"\n",
"I spoke with their families and told them that we are forever in debt for their sacrifice, and we will carry on their mission to restore the trust and safety every community deserves. \n",
"\n",
"Ive worked on these issues a long time. \n",
"\n",
"I know what works: Investing in crime preventionand community police officers wholl walk the beat, wholl know the neighborhood, and who can restore trust and safety. \n",
"\n",
"So lets not abandon our streets. Or choose between safety and equal justice. \n",
"\n",
"Lets come together to protect our communities, restore trust, and hold law enforcement accountable. \n",
"\n",
"Thats why the Justice Department required body cameras, banned chokeholds, and restricted no-knock warrants for its officers. \n",
"\n",
"Thats why the American Rescue Plan provided $350 Billion that cities, states, and counties can use to hire more police and invest in proven strategies like community violence interruption—trusted messengers breaking the cycle of violence and trauma and giving young people hope. \n",
"\n",
"We should all agree: The answer is not to Defund the police. The answer is to FUND the police with the resources and training they need to protect our communities. \n",
"\n",
"I ask Democrats and Republicans alike: Pass my budget and keep our neighborhoods safe. \n",
"\n",
"And I will keep doing everything in my power to crack down on gun trafficking and ghost guns you can buy online and make at home—they have no serial numbers and cant be traced. \n",
"\n",
"And I ask Congress to pass proven measures to reduce gun violence. Pass universal background checks. Why should anyone on a terrorist list be able to purchase a weapon? \n",
"\n",
"Ban assault weapons and high-capacity magazines. \n",
"\n",
"Repeal the liability shield that makes gun manufacturers the only industry in America that cant be sued. \n",
"\n",
"These laws dont infringe on the Second Amendment. They save lives. \n",
"\n",
"The most fundamental right in America is the right to vote and to have it counted. And its under assault. \n",
"\n",
"In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. \n",
"\n",
"We cannot let this happen. \n",
"\n",
"Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youre at it, pass the Disclose Act so Americans can know who is funding our elections. \n",
"\n",
"Tonight, Id like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n",
"\n",
"One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n",
"\n",
"And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nations top legal minds, who will continue Justice Breyers legacy of excellence. \n",
"\n",
"A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since shes been nominated, shes received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n",
"\n",
"And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n",
"\n",
"We can do both. At our border, weve installed new technology like cutting-edge scanners to better detect drug smuggling. \n",
"\n",
"Weve set up joint patrols with Mexico and Guatemala to catch more human traffickers. \n",
"\n",
"Were putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster.\n"
]
}
],
"source": [
"print(docs[0].page_content)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a359ed74",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -7,6 +7,7 @@ from langchain.vectorstores.chroma import Chroma
from langchain.vectorstores.deeplake import DeepLake
from langchain.vectorstores.elastic_vector_search import ElasticVectorSearch
from langchain.vectorstores.faiss import FAISS
from langchain.vectorstores.lancedb import LanceDB
from langchain.vectorstores.milvus import Milvus
from langchain.vectorstores.myscale import MyScale, MyScaleSettings
from langchain.vectorstores.opensearch_vector_search import OpenSearchVectorSearch
@ -34,4 +35,5 @@ __all__ = [
"MyScaleSettings",
"SupabaseVectorStore",
"AnalyticDB",
"LanceDB",
]

@ -0,0 +1,133 @@
"""Wrapper around LanceDB vector database"""
from __future__ import annotations
import uuid
from typing import Any, Iterable, List, Optional
from langchain.docstore.document import Document
from langchain.embeddings.base import Embeddings
from langchain.vectorstores.base import VectorStore
class LanceDB(VectorStore):
"""Wrapper around LanceDB vector database.
To use, you should have ``lancedb`` python package installed.
Example:
.. code-block:: python
db = lancedb.connect('./lancedb')
table = db.open_table('my_table')
vectorstore = LanceDB(table, embedding_function)
vectorstore.add_texts(['text1', 'text2'])
result = vectorstore.similarity_search('text1')
"""
def __init__(
self,
connection: Any,
embedding: Embeddings,
vector_key: Optional[str] = "vector",
id_key: Optional[str] = "id",
text_key: Optional[str] = "text",
):
"""Initialize with Lance DB connection"""
try:
import lancedb
except ImportError:
raise ValueError(
"Could not import lancedb python package. "
"Please install it with `pip install lancedb`."
)
if not isinstance(connection, lancedb.db.LanceTable):
raise ValueError(
"connection should be an instance of lancedb.db.LanceTable, ",
f"got {type(connection)}",
)
self._connection = connection
self._embedding = embedding
self._vector_key = vector_key
self._id_key = id_key
self._text_key = text_key
def add_texts(
self,
texts: Iterable[str],
metadatas: Optional[List[dict]] = None,
ids: Optional[List[str]] = None,
**kwargs: Any,
) -> List[str]:
"""Turn texts into embedding and add it to the database
Args:
texts: Iterable of strings to add to the vectorstore.
metadatas: Optional list of metadatas associated with the texts.
ids: Optional list of ids to associate with the texts.
Returns:
List of ids of the added texts.
"""
# Embed texts and create documents
docs = []
ids = ids or [str(uuid.uuid4()) for _ in texts]
embeddings = self._embedding.embed_documents(list(texts))
for idx, text in enumerate(texts):
embedding = embeddings[idx]
metadata = metadatas[idx] if metadatas else {}
docs.append(
{
self._vector_key: embedding,
self._id_key: ids[idx],
self._text_key: text,
**metadata,
}
)
self._connection.add(docs)
return ids
def similarity_search(
self, query: str, k: int = 4, **kwargs: Any
) -> List[Document]:
"""Return documents most similar to the query
Args:
query: String to query the vectorstore with.
k: Number of documents to return.
Returns:
List of documents most similar to the query.
"""
embedding = self._embedding.embed_query(query)
docs = self._connection.search(embedding).limit(k).to_df()
return [
Document(
page_content=row[self._text_key],
metadata=row[docs.columns != self._text_key],
)
for _, row in docs.iterrows()
]
@classmethod
def from_texts(
cls,
texts: List[str],
embedding: Embeddings,
metadatas: Optional[List[dict]] = None,
connection: Any = None,
vector_key: Optional[str] = "vector",
id_key: Optional[str] = "id",
text_key: Optional[str] = "text",
**kwargs: Any,
) -> LanceDB:
instance = LanceDB(
connection,
embedding,
vector_key,
id_key,
text_key,
)
instance.add_texts(texts, metadatas=metadatas, **kwargs)
return instance

99
poetry.lock generated

@ -1,4 +1,4 @@
# This file is automatically @generated by Poetry and should not be changed by hand.
# This file is automatically @generated by Poetry 1.4.2 and should not be changed by hand.
[[package]]
name = "absl-py"
@ -1595,7 +1595,7 @@ files = [
name = "duckdb"
version = "0.7.1"
description = "DuckDB embedded database"
category = "dev"
category = "main"
optional = false
python-versions = "*"
files = [
@ -3445,6 +3445,29 @@ files = [
{file = "keras-2.11.0-py2.py3-none-any.whl", hash = "sha256:38c6fff0ea9a8b06a2717736565c92a73c8cd9b1c239e7125ccb188b7848f65e"},
]
[[package]]
name = "lancedb"
version = "0.1"
description = "lancedb"
category = "main"
optional = true
python-versions = ">=3.8"
files = [
{file = "lancedb-0.1-py3-none-any.whl", hash = "sha256:b4180c08298324f36df1128aaae6b7f1fbef46c959a1b2d430f559ffe3454bf3"},
{file = "lancedb-0.1.tar.gz", hash = "sha256:443c26c392de409243a3758a1bb9535d9d057d8c6aebf4d3290915de1a7aea99"},
]
[package.dependencies]
pylance = ">=0.4.3"
ratelimiter = "*"
retry = "*"
tqdm = "*"
[package.extras]
dev = ["black", "pre-commit", "ruff"]
docs = ["mkdocs", "mkdocs-jupyter", "mkdocs-material", "mkdocstrings[python]"]
tests = ["pytest"]
[[package]]
name = "langcodes"
version = "3.3.0"
@ -5024,7 +5047,7 @@ files = [
name = "pandas"
version = "2.0.0"
description = "Powerful data structures for data analysis, time series, and statistics"
category = "dev"
category = "main"
optional = false
python-versions = ">=3.8"
files = [
@ -5722,6 +5745,18 @@ files = [
[package.extras]
tests = ["pytest"]
[[package]]
name = "py"
version = "1.11.0"
description = "library with cross-python path, ini-parsing, io, code, log facilities"
category = "main"
optional = true
python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*"
files = [
{file = "py-1.11.0-py2.py3-none-any.whl", hash = "sha256:607c53218732647dff4acdfcd50cb62615cedf612e72d1724fb1a0cc6405b378"},
{file = "py-1.11.0.tar.gz", hash = "sha256:51c75c4126074b472f746a24399ad32f6053d1b34b68d2fa41e558e6f4a98719"},
]
[[package]]
name = "pyarrow"
version = "11.0.0"
@ -5995,6 +6030,29 @@ dev = ["coverage[toml] (==5.0.4)", "cryptography (>=3.4.0)", "pre-commit", "pyte
docs = ["sphinx (>=4.5.0,<5.0.0)", "sphinx-rtd-theme", "zope.interface"]
tests = ["coverage[toml] (==5.0.4)", "pytest (>=6.0.0,<7.0.0)"]
[[package]]
name = "pylance"
version = "0.4.3"
description = "python wrapper for lance-rs"
category = "main"
optional = true
python-versions = ">=3.8"
files = [
{file = "pylance-0.4.3-cp38-abi3-macosx_10_15_x86_64.whl", hash = "sha256:94ea9489daa802cea726f56dbdb29f402e38e99e7472e48a56cba1d2cf53a83f"},
{file = "pylance-0.4.3-cp38-abi3-macosx_11_0_arm64.whl", hash = "sha256:567654e35aae255f4dd48fb9c05e6b25299239bb277921195f7b242525363605"},
{file = "pylance-0.4.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3c63a9067adb18c96459dee289b3d3b8508e131d1c9ed737656ad7d222efda25"},
{file = "pylance-0.4.3-cp38-abi3-win_amd64.whl", hash = "sha256:7c7f6dd032b13d3532d200cb92a74b596cdcd59c7f21add1d99607a7f427e624"},
]
[package.dependencies]
duckdb = ">=0.7"
numpy = "*"
pandas = ">=1.5"
pyarrow = ">=10"
[package.extras]
tests = ["duckdb", "polars[pandas,pyarrow]", "pytest"]
[[package]]
name = "pyowm"
version = "3.3.0"
@ -6563,6 +6621,21 @@ packaging = "*"
[package.extras]
test = ["pytest (>=6,!=7.0.0,!=7.0.1)", "pytest-cov (>=3.0.0)", "pytest-qt"]
[[package]]
name = "ratelimiter"
version = "1.2.0.post0"
description = "Simple python rate limiting object"
category = "main"
optional = true
python-versions = "*"
files = [
{file = "ratelimiter-1.2.0.post0-py3-none-any.whl", hash = "sha256:a52be07bc0bb0b3674b4b304550f10c769bbb00fead3072e035904474259809f"},
{file = "ratelimiter-1.2.0.post0.tar.gz", hash = "sha256:5c395dcabdbbde2e5178ef3f89b568a3066454a6ddc223b76473dac22f89b4f7"},
]
[package.extras]
test = ["pytest (>=3.0)", "pytest-asyncio"]
[[package]]
name = "redis"
version = "4.5.4"
@ -6715,6 +6788,22 @@ urllib3 = ">=1.25.10"
[package.extras]
tests = ["coverage (>=6.0.0)", "flake8", "mypy", "pytest (>=7.0.0)", "pytest-asyncio", "pytest-cov", "pytest-httpserver", "types-requests"]
[[package]]
name = "retry"
version = "0.9.2"
description = "Easy to use retry decorator."
category = "main"
optional = true
python-versions = "*"
files = [
{file = "retry-0.9.2-py2.py3-none-any.whl", hash = "sha256:ccddf89761fa2c726ab29391837d4327f819ea14d244c232a1d24c67a2f98606"},
{file = "retry-0.9.2.tar.gz", hash = "sha256:f8bfa8b99b69c4506d6f5bd3b0aabf77f98cdb17f3c9fc3f5ca820033336fba4"},
]
[package.dependencies]
decorator = ">=3.4.2"
py = ">=1.4.26,<2.0.0"
[[package]]
name = "rfc3339-validator"
version = "0.1.4"
@ -7524,7 +7613,7 @@ files = [
]
[package.dependencies]
greenlet = {version = "!=0.4.17", markers = "python_version >= \"3\" and (platform_machine == \"aarch64\" or platform_machine == \"ppc64le\" or platform_machine == \"x86_64\" or platform_machine == \"amd64\" or platform_machine == \"AMD64\" or platform_machine == \"win32\" or platform_machine == \"WIN32\")"}
greenlet = {version = "!=0.4.17", markers = "python_version >= \"3\" and platform_machine == \"aarch64\" or python_version >= \"3\" and platform_machine == \"ppc64le\" or python_version >= \"3\" and platform_machine == \"x86_64\" or python_version >= \"3\" and platform_machine == \"amd64\" or python_version >= \"3\" and platform_machine == \"AMD64\" or python_version >= \"3\" and platform_machine == \"win32\" or python_version >= \"3\" and platform_machine == \"WIN32\""}
[package.extras]
aiomysql = ["aiomysql", "greenlet (!=0.4.17)"]
@ -8498,7 +8587,7 @@ typing-extensions = ">=3.7.4"
name = "tzdata"
version = "2023.3"
description = "Provider of IANA time zone data"
category = "dev"
category = "main"
optional = false
python-versions = ">=2"
files = [

@ -71,6 +71,7 @@ html2text = {version="^2020.1.16", optional=true}
numexpr = "^2.8.4"
duckduckgo-search = {version="^2.8.6", optional=true}
azure-cosmos = {version="^4.4.0b1", optional=true}
lancedb = {version = "^0.1", optional = true}
[tool.poetry.group.docs.dependencies]
autodoc_pydantic = "^1.8.0"

@ -0,0 +1,43 @@
import lancedb
from langchain.vectorstores import LanceDB
from tests.integration_tests.vectorstores.fake_embeddings import FakeEmbeddings
def test_lancedb() -> None:
embeddings = FakeEmbeddings()
db = lancedb.connect("/tmp/lancedb")
texts = ["text 1", "text 2", "item 3"]
vectors = embeddings.embed_documents(texts)
table = db.create_table(
"my_table",
data=[
{"vector": vectors[idx], "id": text, "text": text}
for idx, text in enumerate(texts)
],
mode="overwrite",
)
store = LanceDB(table, embeddings)
result = store.similarity_search("text 1")
result_texts = [doc.page_content for doc in result]
assert "text 1" in result_texts
def test_lancedb_add_texts() -> None:
embeddings = FakeEmbeddings()
db = lancedb.connect("/tmp/lancedb")
texts = ["text 1"]
vectors = embeddings.embed_documents(texts)
table = db.create_table(
"my_table",
data=[
{"vector": vectors[idx], "id": text, "text": text}
for idx, text in enumerate(texts)
],
mode="overwrite",
)
store = LanceDB(table, embeddings)
store.add_texts(["text 2"])
result = store.similarity_search("text 2")
result_texts = [doc.page_content for doc in result]
assert "text 2" in result_texts
Loading…
Cancel
Save