# Annoy

> [Annoy](https://github.com/spotify/annoy) (`Approximate Nearest Neighbors Oh Yeah`) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.

This notebook shows how to use functionality related to the `Annoy` vector store.

```{note}
NOTE: Annoy is read-only: once the index is built, you cannot add any more embeddings.
If you need to add new entries to your VectorStore incrementally, choose a different vector store.
```
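Because a built index cannot grow, one possible workaround is to keep the full list of texts and rebuild the store whenever new entries arrive. The sketch below assumes a hypothetical `build_store` callable (in this notebook it could be `lambda t: Annoy.from_texts(t, embeddings_func)`); the `RebuildableStore` class is illustrative and not part of LangChain or Annoy.

```python
from typing import Any, Callable, List


class RebuildableStore:
    """Keeps the source texts and rebuilds a read-only store on every addition."""

    def __init__(self, build_store: Callable[[List[str]], Any], texts: List[str]):
        # build_store is a hypothetical factory, e.g. a wrapper around Annoy.from_texts
        self._build_store = build_store
        self._texts = list(texts)
        self.store = build_store(self._texts)

    def add_texts(self, new_texts: List[str]) -> None:
        # The old store is discarded; a fresh one is built over all texts.
        self._texts.extend(new_texts)
        self.store = self._build_store(self._texts)


# Stand-in factory so the sketch runs without Annoy installed:
# it only records which texts the "index" was built from.
demo = RebuildableStore(lambda texts: tuple(texts), ["pizza is great", "I love salad"])
demo.add_texts(["my car"])
print(demo.store)
```

Every addition triggers a full re-index, so this pattern is likely only practical for small or rarely updated collections.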

In [None]:
#!pip install annoy

## Create VectorStore from texts

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Annoy

embeddings_func = HuggingFaceEmbeddings()

In [4]:
texts = ["pizza is great", "I love salad", "my car", "a dog"]

# default metric is angular
vector_store = Annoy.from_texts(texts, embeddings_func)

In [4]:
# allows for custom annoy parameters, defaults are n_trees=100, n_jobs=-1, metric="angular"
vector_store_v2 = Annoy.from_texts(
    texts, embeddings_func, metric="dot", n_trees=100, n_jobs=1
)

In [5]:
vector_store.similarity_search("food", k=3)

[Document(page_content='pizza is great', metadata={}),
 Document(page_content='I love salad', metadata={}),
 Document(page_content='my car', metadata={})]

In [6]:
# the score is a distance metric, so lower is better
vector_store.similarity_search_with_score("food", k=3)

[(Document(page_content='pizza is great', metadata={}), 1.0944390296936035),
 (Document(page_content='I love salad', metadata={}), 1.1273186206817627),
 (Document(page_content='my car', metadata={}), 1.1580758094787598)]
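For context on these scores: Annoy's README describes the `angular` metric as the Euclidean distance between normalized vectors, i.e. `sqrt(2 * (1 - cos(u, v)))`, so a score ranges from 0 (same direction) to 2 (opposite direction). The sketch below, which assumes the default `angular` metric, shows the formula and how a score could be converted back to cosine similarity. Both helper functions are hypothetical and not part of LangChain or Annoy.

```python
import math


def angular_distance(u, v):
    """Angular distance as described by Annoy: sqrt(2 * (1 - cos(u, v)))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))


def angular_to_cosine(distance):
    """Invert the formula: cos = 1 - d^2 / 2."""
    return 1.0 - distance ** 2 / 2.0


# Score reported above for 'pizza is great' against the query "food".
print(round(angular_to_cosine(1.0944390296936035), 3))  # 0.401
```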

## Create VectorStore from docs

In [7]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

loader = TextLoader("../../../state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

In [8]:
docs[:5]

[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.', metadata={'source'

In [9]:
vector_store_from_docs = Annoy.from_documents(docs, embeddings_func)

In [10]:
query = "What did the president say about Ketanji Brown Jackson"
docs = vector_store_from_docs.similarity_search(query)

In [11]:
print(docs[0].page_content[:100])

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Ac


## Create VectorStore via existing embeddings

In [12]:
embs = embeddings_func.embed_documents(texts)

In [13]:
data = list(zip(texts, embs))

vector_store_from_embeddings = Annoy.from_embeddings(data, embeddings_func)

In [14]:
vector_store_from_embeddings.similarity_search_with_score("food", k=3)

[(Document(page_content='pizza is great', metadata={}), 1.0944390296936035),
 (Document(page_content='I love salad', metadata={}), 1.1273186206817627),
 (Document(page_content='my car', metadata={}), 1.1580758094787598)]

## Search via embeddings

In [15]:
motorbike_emb = embeddings_func.embed_query("motorbike")

In [16]:
vector_store.similarity_search_by_vector(motorbike_emb, k=3)

[Document(page_content='my car', metadata={}),
 Document(page_content='a dog', metadata={}),
 Document(page_content='pizza is great', metadata={})]

In [17]:
vector_store.similarity_search_with_score_by_vector(motorbike_emb, k=3)

[(Document(page_content='my car', metadata={}), 1.0870471000671387),
 (Document(page_content='a dog', metadata={}), 1.2095637321472168),
 (Document(page_content='pizza is great', metadata={}), 1.3254905939102173)]

## Search via docstore id

In [18]:
vector_store.index_to_docstore_id

{0: '2d1498a8-a37c-4798-acb9-0016504ed798',
 1: '2d30aecc-88e0-4469-9d51-0ef7e9858e6d',
 2: '927f1120-985b-4691-b577-ad5cb42e011c',
 3: '3056ddcf-a62f-48c8-bd98-b9e57a3dfcae'}

In [19]:
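some_docstore_id = 0  # integer Annoy index of texts[0], not the UUID string

# look the document up through the docstore's public API instead of its private dict
vector_store.docstore.search(vector_store.index_to_docstore_id[some_docstore_id])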

Document(page_content='pizza is great', metadata={})

In [20]:
# same document has distance 0
vector_store.similarity_search_with_score_by_index(some_docstore_id, k=3)

[(Document(page_content='pizza is great', metadata={}), 0.0),
 (Document(page_content='I love salad', metadata={}), 1.0734446048736572),
 (Document(page_content='my car', metadata={}), 1.2895267009735107)]
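`similarity_search_with_score_by_index` is called here with the integer Annoy index rather than the UUID string, so starting from a docstore id requires inverting `index_to_docstore_id` first. A minimal sketch, with the mapping copied from the output above and `docstore_id_to_index` as a hypothetical helper name:

```python
# Mapping copied from the notebook output (integer Annoy index -> docstore id).
index_to_docstore_id = {
    0: "2d1498a8-a37c-4798-acb9-0016504ed798",
    1: "2d30aecc-88e0-4469-9d51-0ef7e9858e6d",
    2: "927f1120-985b-4691-b577-ad5cb42e011c",
    3: "3056ddcf-a62f-48c8-bd98-b9e57a3dfcae",
}

# Invert it so a docstore id can be turned back into an Annoy index.
docstore_id_to_index = {doc_id: idx for idx, doc_id in index_to_docstore_id.items()}

print(docstore_id_to_index["927f1120-985b-4691-b577-ad5cb42e011c"])  # 2
```

The resulting integer can then be passed to `similarity_search_with_score_by_index`.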

## Save and load

In [21]:
vector_store.save_local("my_annoy_index_and_docstore")

saving config


In [22]:
loaded_vector_store = Annoy.load_local(
    "my_annoy_index_and_docstore", embeddings=embeddings_func
)

In [23]:
# same document has distance 0
loaded_vector_store.similarity_search_with_score_by_index(some_docstore_id, k=3)

[(Document(page_content='pizza is great', metadata={}), 0.0),
 (Document(page_content='I love salad', metadata={}), 1.0734446048736572),
 (Document(page_content='my car', metadata={}), 1.2895267009735107)]

## Construct from scratch

In [25]:
import uuid
from annoy import AnnoyIndex
from langchain.docstore.document import Document
from langchain.docstore.in_memory import InMemoryDocstore

metadatas = [{"x": "food"}, {"x": "food"}, {"x": "stuff"}, {"x": "animal"}]

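# embed the texts once; each vector becomes one item in the index
embeddings = embeddings_func.embed_documents(texts)

# embedding dimensionality, required by AnnoyIndex
f = len(embeddings[0])

# build the Annoy index: add every vector, then build a forest of 10 trees
metric = "angular"
index = AnnoyIndex(f, metric=metric)
for i, emb in enumerate(embeddings):
    index.add_item(i, emb)
index.build(10)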

# docstore
documents = []
for i, text in enumerate(texts):
    metadata = metadatas[i] if metadatas else {}
    documents.append(Document(page_content=text, metadata=metadata))
index_to_docstore_id = {i: str(uuid.uuid4()) for i in range(len(documents))}
docstore = InMemoryDocstore(
    {index_to_docstore_id[i]: doc for i, doc in enumerate(documents)}
)

db_manually = Annoy(
    embeddings_func.embed_query, index, metric, docstore, index_to_docstore_id
)

In [26]:
db_manually.similarity_search_with_score("eating!", k=3)

[(Document(page_content='pizza is great', metadata={'x': 'food'}),
  1.1314140558242798),
 (Document(page_content='I love salad', metadata={'x': 'food'}),
  1.1668788194656372),
 (Document(page_content='my car', metadata={'x': 'stuff'}), 1.226445198059082)]
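The metadata attached above comes back with every result, so one simple way to narrow results is to filter the `(document, score)` pairs after the search. The sketch below is an assumption-laden illustration: `Doc` is a stand-in for LangChain's `Document` so the code runs on its own, and `filter_by_metadata` is a hypothetical helper, not a LangChain API.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for langchain's Document so the sketch is self-contained."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(results, key, value):
    """Keep (document, score) pairs whose metadata[key] equals value."""
    return [(doc, score) for doc, score in results if doc.metadata.get(key) == value]


# Results copied from the output above.
results = [
    (Doc("pizza is great", {"x": "food"}), 1.1314140558242798),
    (Doc("I love salad", {"x": "food"}), 1.1668788194656372),
    (Doc("my car", {"x": "stuff"}), 1.226445198059082),
]

food_only = filter_by_metadata(results, "x", "food")
print([doc.page_content for doc, _ in food_only])  # ['pizza is great', 'I love salad']
```

Filtering after the search can return fewer than `k` results, so it may help to request a larger `k` up front.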