diff --git a/docs/ecosystem/lancedb.md b/docs/ecosystem/lancedb.md new file mode 100644 index 00000000..22ea15fd --- /dev/null +++ b/docs/ecosystem/lancedb.md @@ -0,0 +1,23 @@ +# LanceDB + +This page covers how to use [LanceDB](https://github.com/lancedb/lancedb) within LangChain. +It is broken into two parts: installation and setup, and then references to specific LanceDB wrappers. + +## Installation and Setup + +- Install the Python SDK with `pip install lancedb` + +## Wrappers + +### VectorStore + +There exists a wrapper around LanceDB databases, allowing you to use them as a vectorstore, +whether for semantic search or example selection. + +To import this vectorstore: + +```python +from langchain.vectorstores import LanceDB +``` + +For a more detailed walkthrough of the LanceDB wrapper, see [this notebook](../modules/indexes/vectorstores/examples/lancedb.ipynb) diff --git a/docs/modules/indexes/vectorstores/examples/lancedb.ipynb b/docs/modules/indexes/vectorstores/examples/lancedb.ipynb new file mode 100644 index 00000000..794bbaa0 --- /dev/null +++ b/docs/modules/indexes/vectorstores/examples/lancedb.ipynb @@ -0,0 +1,179 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "683953b3", + "metadata": {}, + "source": [ + "# LanceDB\n", + "\n", + "This notebook shows how to use functionality related to the LanceDB vector database, based on the Lance data format."
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "bfcf346a", + "metadata": {}, + "outputs": [], + "source": [ + "#!pip install lancedb" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "aac9563e", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.embeddings import OpenAIEmbeddings\n", + "from langchain.vectorstores import LanceDB" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "a3c3999a", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.document_loaders import TextLoader\n", + "from langchain.text_splitter import CharacterTextSplitter\n", + "loader = TextLoader('../../../state_of_the_union.txt')\n", + "documents = loader.load()\n", + "\n", + "documents = CharacterTextSplitter().split_documents(documents)\n", + "\n", + "embeddings = OpenAIEmbeddings()" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "6e104aee", + "metadata": {}, + "outputs": [], + "source": [ + "import lancedb\n", + "\n", + "db = lancedb.connect('/tmp/lancedb')\n", + "table = db.create_table(\"my_table\", data=[\n", + " {\"vector\": embeddings.embed_query(\"Hello World\"), \"text\": \"Hello World\", \"id\": \"1\"}\n", + "], mode=\"overwrite\")\n", + "\n", + "docsearch = LanceDB.from_documents(documents, embeddings, connection=table)\n", + "\n", + "query = \"What did the president say about Ketanji Brown Jackson\"\n", + "docs = docsearch.similarity_search(query)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "9c608226", + "metadata": { + "scrolled": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "They were responding to a 9-1-1 call when a man shot and killed them with a stolen gun. \n", + "\n", + "Officer Mora was 27 years old. \n", + "\n", + "Officer Rivera was 22. \n", + "\n", + "Both Dominican Americans who’d grown up on the same streets they later chose to patrol as police officers. 
\n", + "\n", + "I spoke with their families and told them that we are forever in debt for their sacrifice, and we will carry on their mission to restore the trust and safety every community deserves. \n", + "\n", + "I’ve worked on these issues a long time. \n", + "\n", + "I know what works: Investing in crime preventionand community police officers who’ll walk the beat, who’ll know the neighborhood, and who can restore trust and safety. \n", + "\n", + "So let’s not abandon our streets. Or choose between safety and equal justice. \n", + "\n", + "Let’s come together to protect our communities, restore trust, and hold law enforcement accountable. \n", + "\n", + "That’s why the Justice Department required body cameras, banned chokeholds, and restricted no-knock warrants for its officers. \n", + "\n", + "That’s why the American Rescue Plan provided $350 Billion that cities, states, and counties can use to hire more police and invest in proven strategies like community violence interruption—trusted messengers breaking the cycle of violence and trauma and giving young people hope. \n", + "\n", + "We should all agree: The answer is not to Defund the police. The answer is to FUND the police with the resources and training they need to protect our communities. \n", + "\n", + "I ask Democrats and Republicans alike: Pass my budget and keep our neighborhoods safe. \n", + "\n", + "And I will keep doing everything in my power to crack down on gun trafficking and ghost guns you can buy online and make at home—they have no serial numbers and can’t be traced. \n", + "\n", + "And I ask Congress to pass proven measures to reduce gun violence. Pass universal background checks. Why should anyone on a terrorist list be able to purchase a weapon? \n", + "\n", + "Ban assault weapons and high-capacity magazines. \n", + "\n", + "Repeal the liability shield that makes gun manufacturers the only industry in America that can’t be sued. 
\n", + "\n", + "These laws don’t infringe on the Second Amendment. They save lives. \n", + "\n", + "The most fundamental right in America is the right to vote – and to have it counted. And it’s under assault. \n", + "\n", + "In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. \n", + "\n", + "We cannot let this happen. \n", + "\n", + "Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n", + "\n", + "Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n", + "\n", + "One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n", + "\n", + "And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. \n", + "\n", + "A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n", + "\n", + "And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n", + "\n", + "We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling. \n", + "\n", + "We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers. 
\n", + "\n", + "We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster.\n" + ] + } + ], + "source": [ + "print(docs[0].page_content)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a359ed74", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.1" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/langchain/vectorstores/__init__.py b/langchain/vectorstores/__init__.py index 30d1ca7e..51ac88b5 100644 --- a/langchain/vectorstores/__init__.py +++ b/langchain/vectorstores/__init__.py @@ -7,6 +7,7 @@ from langchain.vectorstores.chroma import Chroma from langchain.vectorstores.deeplake import DeepLake from langchain.vectorstores.elastic_vector_search import ElasticVectorSearch from langchain.vectorstores.faiss import FAISS +from langchain.vectorstores.lancedb import LanceDB from langchain.vectorstores.milvus import Milvus from langchain.vectorstores.myscale import MyScale, MyScaleSettings from langchain.vectorstores.opensearch_vector_search import OpenSearchVectorSearch @@ -34,4 +35,5 @@ __all__ = [ "MyScaleSettings", "SupabaseVectorStore", "AnalyticDB", + "LanceDB", ] diff --git a/langchain/vectorstores/lancedb.py b/langchain/vectorstores/lancedb.py new file mode 100644 index 00000000..eec6d4e0 --- /dev/null +++ b/langchain/vectorstores/lancedb.py @@ -0,0 +1,133 @@ +"""Wrapper around LanceDB vector database""" +from __future__ import annotations + +import uuid +from typing import Any, Iterable, List, Optional + +from langchain.docstore.document import Document +from 
langchain.embeddings.base import Embeddings +from langchain.vectorstores.base import VectorStore + + +class LanceDB(VectorStore): + """Wrapper around LanceDB vector database. + + To use, you should have the ``lancedb`` python package installed. + + Example: + .. code-block:: python + + db = lancedb.connect('./lancedb') + table = db.open_table('my_table') + vectorstore = LanceDB(table, embedding_function) + vectorstore.add_texts(['text1', 'text2']) + result = vectorstore.similarity_search('text1') + """ + + def __init__( + self, + connection: Any, + embedding: Embeddings, + vector_key: Optional[str] = "vector", + id_key: Optional[str] = "id", + text_key: Optional[str] = "text", + ): + """Initialize with LanceDB connection.""" + try: + import lancedb + except ImportError: + raise ValueError( + "Could not import lancedb python package. " + "Please install it with `pip install lancedb`." + ) + if not isinstance(connection, lancedb.db.LanceTable): + raise ValueError( + "connection should be an instance of lancedb.db.LanceTable, " + f"got {type(connection)}" + ) + self._connection = connection + self._embedding = embedding + self._vector_key = vector_key + self._id_key = id_key + self._text_key = text_key + + def add_texts( + self, + texts: Iterable[str], + metadatas: Optional[List[dict]] = None, + ids: Optional[List[str]] = None, + **kwargs: Any, + ) -> List[str]: + """Turn texts into embeddings and add them to the database. + + Args: + texts: Iterable of strings to add to the vectorstore. + metadatas: Optional list of metadatas associated with the texts. + ids: Optional list of ids to associate with the texts. + + Returns: + List of ids of the added texts.
+ """ + # Embed texts and create documents + docs = [] + ids = ids or [str(uuid.uuid4()) for _ in texts] + embeddings = self._embedding.embed_documents(list(texts)) + for idx, text in enumerate(texts): + embedding = embeddings[idx] + metadata = metadatas[idx] if metadatas else {} + docs.append( + { + self._vector_key: embedding, + self._id_key: ids[idx], + self._text_key: text, + **metadata, + } + ) + + self._connection.add(docs) + return ids + + def similarity_search( + self, query: str, k: int = 4, **kwargs: Any + ) -> List[Document]: + """Return documents most similar to the query + + Args: + query: String to query the vectorstore with. + k: Number of documents to return. + + Returns: + List of documents most similar to the query. + """ + embedding = self._embedding.embed_query(query) + docs = self._connection.search(embedding).limit(k).to_df() + return [ + Document( + page_content=row[self._text_key], + metadata=row[docs.columns != self._text_key], + ) + for _, row in docs.iterrows() + ] + + @classmethod + def from_texts( + cls, + texts: List[str], + embedding: Embeddings, + metadatas: Optional[List[dict]] = None, + connection: Any = None, + vector_key: Optional[str] = "vector", + id_key: Optional[str] = "id", + text_key: Optional[str] = "text", + **kwargs: Any, + ) -> LanceDB: + instance = LanceDB( + connection, + embedding, + vector_key, + id_key, + text_key, + ) + instance.add_texts(texts, metadatas=metadatas, **kwargs) + + return instance diff --git a/poetry.lock b/poetry.lock index 0e54edc0..5e814181 100644 --- a/poetry.lock +++ b/poetry.lock @@ -1,4 +1,4 @@ -# This file is automatically @generated by Poetry and should not be changed by hand. +# This file is automatically @generated by Poetry 1.4.2 and should not be changed by hand. 
[[package]] name = "absl-py" @@ -1595,7 +1595,7 @@ files = [ name = "duckdb" version = "0.7.1" description = "DuckDB embedded database" -category = "dev" +category = "main" optional = false python-versions = "*" files = [ @@ -3445,6 +3445,29 @@ files = [ {file = "keras-2.11.0-py2.py3-none-any.whl", hash = "sha256:38c6fff0ea9a8b06a2717736565c92a73c8cd9b1c239e7125ccb188b7848f65e"}, ] +[[package]] +name = "lancedb" +version = "0.1" +description = "lancedb" +category = "main" +optional = true +python-versions = ">=3.8" +files = [ + {file = "lancedb-0.1-py3-none-any.whl", hash = "sha256:b4180c08298324f36df1128aaae6b7f1fbef46c959a1b2d430f559ffe3454bf3"}, + {file = "lancedb-0.1.tar.gz", hash = "sha256:443c26c392de409243a3758a1bb9535d9d057d8c6aebf4d3290915de1a7aea99"}, +] + +[package.dependencies] +pylance = ">=0.4.3" +ratelimiter = "*" +retry = "*" +tqdm = "*" + +[package.extras] +dev = ["black", "pre-commit", "ruff"] +docs = ["mkdocs", "mkdocs-jupyter", "mkdocs-material", "mkdocstrings[python]"] +tests = ["pytest"] + [[package]] name = "langcodes" version = "3.3.0" @@ -5024,7 +5047,7 @@ files = [ name = "pandas" version = "2.0.0" description = "Powerful data structures for data analysis, time series, and statistics" -category = "dev" +category = "main" optional = false python-versions = ">=3.8" files = [ @@ -5722,6 +5745,18 @@ files = [ [package.extras] tests = ["pytest"] +[[package]] +name = "py" +version = "1.11.0" +description = "library with cross-python path, ini-parsing, io, code, log facilities" +category = "main" +optional = true +python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*" +files = [ + {file = "py-1.11.0-py2.py3-none-any.whl", hash = "sha256:607c53218732647dff4acdfcd50cb62615cedf612e72d1724fb1a0cc6405b378"}, + {file = "py-1.11.0.tar.gz", hash = "sha256:51c75c4126074b472f746a24399ad32f6053d1b34b68d2fa41e558e6f4a98719"}, +] + [[package]] name = "pyarrow" version = "11.0.0" @@ -5995,6 +6030,29 @@ dev = ["coverage[toml] (==5.0.4)", 
"cryptography (>=3.4.0)", "pre-commit", "pyte docs = ["sphinx (>=4.5.0,<5.0.0)", "sphinx-rtd-theme", "zope.interface"] tests = ["coverage[toml] (==5.0.4)", "pytest (>=6.0.0,<7.0.0)"] +[[package]] +name = "pylance" +version = "0.4.3" +description = "python wrapper for lance-rs" +category = "main" +optional = true +python-versions = ">=3.8" +files = [ + {file = "pylance-0.4.3-cp38-abi3-macosx_10_15_x86_64.whl", hash = "sha256:94ea9489daa802cea726f56dbdb29f402e38e99e7472e48a56cba1d2cf53a83f"}, + {file = "pylance-0.4.3-cp38-abi3-macosx_11_0_arm64.whl", hash = "sha256:567654e35aae255f4dd48fb9c05e6b25299239bb277921195f7b242525363605"}, + {file = "pylance-0.4.3-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3c63a9067adb18c96459dee289b3d3b8508e131d1c9ed737656ad7d222efda25"}, + {file = "pylance-0.4.3-cp38-abi3-win_amd64.whl", hash = "sha256:7c7f6dd032b13d3532d200cb92a74b596cdcd59c7f21add1d99607a7f427e624"}, +] + +[package.dependencies] +duckdb = ">=0.7" +numpy = "*" +pandas = ">=1.5" +pyarrow = ">=10" + +[package.extras] +tests = ["duckdb", "polars[pandas,pyarrow]", "pytest"] + [[package]] name = "pyowm" version = "3.3.0" @@ -6563,6 +6621,21 @@ packaging = "*" [package.extras] test = ["pytest (>=6,!=7.0.0,!=7.0.1)", "pytest-cov (>=3.0.0)", "pytest-qt"] +[[package]] +name = "ratelimiter" +version = "1.2.0.post0" +description = "Simple python rate limiting object" +category = "main" +optional = true +python-versions = "*" +files = [ + {file = "ratelimiter-1.2.0.post0-py3-none-any.whl", hash = "sha256:a52be07bc0bb0b3674b4b304550f10c769bbb00fead3072e035904474259809f"}, + {file = "ratelimiter-1.2.0.post0.tar.gz", hash = "sha256:5c395dcabdbbde2e5178ef3f89b568a3066454a6ddc223b76473dac22f89b4f7"}, +] + +[package.extras] +test = ["pytest (>=3.0)", "pytest-asyncio"] + [[package]] name = "redis" version = "4.5.4" @@ -6715,6 +6788,22 @@ urllib3 = ">=1.25.10" [package.extras] tests = ["coverage (>=6.0.0)", "flake8", "mypy", "pytest (>=7.0.0)", 
"pytest-asyncio", "pytest-cov", "pytest-httpserver", "types-requests"] +[[package]] +name = "retry" +version = "0.9.2" +description = "Easy to use retry decorator." +category = "main" +optional = true +python-versions = "*" +files = [ + {file = "retry-0.9.2-py2.py3-none-any.whl", hash = "sha256:ccddf89761fa2c726ab29391837d4327f819ea14d244c232a1d24c67a2f98606"}, + {file = "retry-0.9.2.tar.gz", hash = "sha256:f8bfa8b99b69c4506d6f5bd3b0aabf77f98cdb17f3c9fc3f5ca820033336fba4"}, +] + +[package.dependencies] +decorator = ">=3.4.2" +py = ">=1.4.26,<2.0.0" + [[package]] name = "rfc3339-validator" version = "0.1.4" @@ -7524,7 +7613,7 @@ files = [ ] [package.dependencies] -greenlet = {version = "!=0.4.17", markers = "python_version >= \"3\" and (platform_machine == \"aarch64\" or platform_machine == \"ppc64le\" or platform_machine == \"x86_64\" or platform_machine == \"amd64\" or platform_machine == \"AMD64\" or platform_machine == \"win32\" or platform_machine == \"WIN32\")"} +greenlet = {version = "!=0.4.17", markers = "python_version >= \"3\" and platform_machine == \"aarch64\" or python_version >= \"3\" and platform_machine == \"ppc64le\" or python_version >= \"3\" and platform_machine == \"x86_64\" or python_version >= \"3\" and platform_machine == \"amd64\" or python_version >= \"3\" and platform_machine == \"AMD64\" or python_version >= \"3\" and platform_machine == \"win32\" or python_version >= \"3\" and platform_machine == \"WIN32\""} [package.extras] aiomysql = ["aiomysql", "greenlet (!=0.4.17)"] @@ -8498,7 +8587,7 @@ typing-extensions = ">=3.7.4" name = "tzdata" version = "2023.3" description = "Provider of IANA time zone data" -category = "dev" +category = "main" optional = false python-versions = ">=2" files = [ diff --git a/pyproject.toml b/pyproject.toml index 3d554d26..a1635d3d 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -71,6 +71,7 @@ html2text = {version="^2020.1.16", optional=true} numexpr = "^2.8.4" duckduckgo-search = {version="^2.8.6", 
optional=true} azure-cosmos = {version="^4.4.0b1", optional=true} +lancedb = {version = "^0.1", optional = true} [tool.poetry.group.docs.dependencies] autodoc_pydantic = "^1.8.0" diff --git a/tests/integration_tests/vectorstores/test_lancedb.py b/tests/integration_tests/vectorstores/test_lancedb.py new file mode 100644 index 00000000..b2f7e4cc --- /dev/null +++ b/tests/integration_tests/vectorstores/test_lancedb.py @@ -0,0 +1,43 @@ +import lancedb + +from langchain.vectorstores import LanceDB +from tests.integration_tests.vectorstores.fake_embeddings import FakeEmbeddings + + +def test_lancedb() -> None: + embeddings = FakeEmbeddings() + db = lancedb.connect("/tmp/lancedb") + texts = ["text 1", "text 2", "item 3"] + vectors = embeddings.embed_documents(texts) + table = db.create_table( + "my_table", + data=[ + {"vector": vectors[idx], "id": text, "text": text} + for idx, text in enumerate(texts) + ], + mode="overwrite", + ) + store = LanceDB(table, embeddings) + result = store.similarity_search("text 1") + result_texts = [doc.page_content for doc in result] + assert "text 1" in result_texts + + +def test_lancedb_add_texts() -> None: + embeddings = FakeEmbeddings() + db = lancedb.connect("/tmp/lancedb") + texts = ["text 1"] + vectors = embeddings.embed_documents(texts) + table = db.create_table( + "my_table", + data=[ + {"vector": vectors[idx], "id": text, "text": text} + for idx, text in enumerate(texts) + ], + mode="overwrite", + ) + store = LanceDB(table, embeddings) + store.add_texts(["text 2"]) + result = store.similarity_search("text 2") + result_texts = [doc.page_content for doc in result] + assert "text 2" in result_texts
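For reference, the row-assembly step inside the new `add_texts` method (embed every text, then merge the vector, id, text, and any metadata keys into one flat record per row, since LanceDB tables use a flat schema) can be sketched without LanceDB installed. The `build_rows` helper below is a hypothetical stand-in written for illustration; it is not part of this PR:

```python
import uuid
from typing import Iterable, List, Optional


def build_rows(
    texts: Iterable[str],
    embeddings: List[List[float]],
    metadatas: Optional[List[dict]] = None,
    ids: Optional[List[str]] = None,
    vector_key: str = "vector",
    id_key: str = "id",
    text_key: str = "text",
) -> List[dict]:
    """Assemble one flat dict per text, mirroring LanceDB.add_texts:
    vector, id, and text sit alongside the flattened metadata keys."""
    texts = list(texts)
    # Generate random UUIDs when the caller does not supply ids.
    ids = ids or [str(uuid.uuid4()) for _ in texts]
    rows = []
    for idx, text in enumerate(texts):
        metadata = metadatas[idx] if metadatas else {}
        rows.append(
            {vector_key: embeddings[idx], id_key: ids[idx], text_key: text, **metadata}
        )
    return rows


rows = build_rows(
    ["hello world", "goodbye world"],
    [[0.0, 1.0], [1.0, 0.0]],
    metadatas=[{"source": "a.txt"}, {"source": "b.txt"}],
    ids=["1", "2"],
)
print(rows[0])  # {'vector': [0.0, 1.0], 'id': '1', 'text': 'hello world', 'source': 'a.txt'}
```

Because metadata keys are spread into the top level of each record, they become ordinary columns in the LanceDB table, which is why `similarity_search` can later rebuild `Document.metadata` from every column other than the text column.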