experimental[minor]: Add semantic chunker (#15799)

Harrison Chase 6 months ago committed by GitHub
parent a1d7f2b3e1
commit 20abe24819

@@ -44,6 +44,7 @@ LangChain offers many different types of text splitters. Below is a table listin
| Code | Code (Python, JS) specific characters | | Splits text based on characters specific to coding languages. 15 different languages are available to choose from. |
| Token | Tokens | | Splits text on tokens. There exist a few different ways to measure tokens. |
| Character | A user defined character | | Splits text based on a user defined character. One of the simpler methods. |
| [Experimental] Semantic Chunker | Sentences | | First splits on sentences, then combines adjacent sentences if they are semantically similar enough. Taken from [Greg Kamradt](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb) |
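
For example, a minimal usage sketch of the new splitter (this assumes `langchain_experimental` and `langchain_openai` are installed and an OpenAI API key is configured; the sample text is illustrative):

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

text_splitter = SemanticChunker(OpenAIEmbeddings())
docs = text_splitter.create_documents(
    ["A first sentence. A related sentence. Now a completely different topic!"]
)
print(docs[0].page_content)
```
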
## Evaluate text splitters

@@ -0,0 +1,145 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "c3ee8d00",
"metadata": {},
"source": [
"# Semantic Chunking\n",
"\n",
"Splits the text based on semantic similarity.\n",
"\n",
"Taken from Greg Kamradt's wonderful notebook:\n",
"https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb\n",
"\n",
"All credit to him.\n",
"\n",
"At a high level, this splits into sentences, then groups into groups of 3\n",
"sentences, and then merges one that are similar in the embedding space."
]
},
{
"cell_type": "markdown",
"id": "542f4427",
"metadata": {},
"source": [
"## Install Dependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8c58769",
"metadata": {},
"outputs": [],
"source": [
"!pip install --quiet langchain_experimental langchain_openai"
]
},
{
"cell_type": "markdown",
"id": "c20cdf54",
"metadata": {},
"source": [
"## Load Example Data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "313fb032",
"metadata": {},
"outputs": [],
"source": [
"# This is a long document we can split up.\n",
"with open(\"../../state_of_the_union.txt\") as f:\n",
" state_of_the_union = f.read()"
]
},
{
"cell_type": "markdown",
"id": "f7436e15",
"metadata": {},
"source": [
"## Create Text Splitter"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a88ff70c",
"metadata": {},
"outputs": [],
"source": [
"from langchain_experimental.text_splitter import SemanticChunker\n",
"from langchain_openai.embeddings import OpenAIEmbeddings"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "613d4a3b",
"metadata": {},
"outputs": [],
"source": [
"text_splitter = SemanticChunker(OpenAIEmbeddings())"
]
},
{
"cell_type": "markdown",
"id": "91b14834",
"metadata": {},
"source": [
"## Split Text"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "295ec095",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. Last year COVID-19 kept us apart. This year we are finally together again. Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. With a duty to one another to the American people to the Constitution. And with an unwavering resolve that freedom will always triumph over tyranny. Six days ago, Russias Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world. Please rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people. Throughout our history weve learned this lesson when dictators do not pay a price for their aggression they cause more chaos. They keep moving.\n"
]
}
],
"source": [
"docs = text_splitter.create_documents([state_of_the_union])\n",
"print(docs[0].page_content)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a9a3b9cd",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
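
The sentence segmentation that seeds the whole pipeline is a plain regex on terminal punctuation. A quick standalone illustration (standard library only):

```python
import re

text = "Hello there! How are you? All good."
print(re.split(r"(?<=[.?!])\s+", text))
# -> ['Hello there!', 'How are you?', 'All good.']
```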

@@ -0,0 +1,159 @@
import copy
import re
from typing import Any, Iterable, List, Optional, Sequence, Tuple

import numpy as np
from langchain_community.utils.math import (
    cosine_similarity,
)
from langchain_core.documents import BaseDocumentTransformer, Document
from langchain_core.embeddings import Embeddings


def combine_sentences(sentences: List[dict], buffer_size: int = 1) -> List[dict]:
    # Go through each sentence dict
    for i in range(len(sentences)):
        # Create a string that will hold the sentences which are joined
        combined_sentence = ""

        # Add sentences before the current one, based on the buffer size.
        for j in range(i - buffer_size, i):
            # Check if the index j is not negative
            # (to avoid index out of range like on the first one)
            if j >= 0:
                # Add the sentence at index j to the combined_sentence string
                combined_sentence += sentences[j]["sentence"] + " "

        # Add the current sentence
        combined_sentence += sentences[i]["sentence"]

        # Add sentences after the current one, based on the buffer size
        for j in range(i + 1, i + 1 + buffer_size):
            # Check if the index j is within the range of the sentences list
            if j < len(sentences):
                # Add the sentence at index j to the combined_sentence string
                combined_sentence += " " + sentences[j]["sentence"]

        # Store the combined sentence in the current sentence dict
        sentences[i]["combined_sentence"] = combined_sentence

    return sentences
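

# Illustrative sketch, not part of the original file: with the default
# buffer_size=1, calling combine_sentences() on three dicts whose "sentence"
# values are "A.", "B.", "C." fills "combined_sentence" with "A. B.",
# "A. B. C.", and "B. C." respectively, so each sentence is embedded
# together with its immediate neighbors.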


def calculate_cosine_distances(
    sentences: List[dict],
) -> Tuple[List[float], List[dict]]:
    distances = []
    for i in range(len(sentences) - 1):
        embedding_current = sentences[i]["combined_sentence_embedding"]
        embedding_next = sentences[i + 1]["combined_sentence_embedding"]

        # Calculate cosine similarity
        similarity = cosine_similarity([embedding_current], [embedding_next])[0][0]

        # Convert to cosine distance
        distance = 1 - similarity

        # Append cosine distance to the list
        distances.append(distance)

        # Store distance in the dictionary
        sentences[i]["distance_to_next"] = distance

    # Optionally handle the last sentence
    # sentences[-1]["distance_to_next"] = None  # or a default value

    return distances, sentences


class SemanticChunker(BaseDocumentTransformer):
    """Splits the text based on semantic similarity.

    Taken from Greg Kamradt's wonderful notebook:
    https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb

    All credit to him.

    At a high level, this splits the text into sentences, then groups them into
    groups of 3 sentences, and then merges ones that are similar in the
    embedding space.
    """

    def __init__(self, embeddings: Embeddings, add_start_index: bool = False):
        self._add_start_index = add_start_index
        self.embeddings = embeddings

    def split_text(self, text: str) -> List[str]:
        """Split text into multiple components."""
        # Splitting the essay on '.', '?', and '!'
        single_sentences_list = re.split(r"(?<=[.?!])\s+", text)
        sentences = [
            {"sentence": x, "index": i} for i, x in enumerate(single_sentences_list)
        ]
        sentences = combine_sentences(sentences)
        embeddings = self.embeddings.embed_documents(
            [x["combined_sentence"] for x in sentences]
        )
        for i, sentence in enumerate(sentences):
            sentence["combined_sentence_embedding"] = embeddings[i]
        distances, sentences = calculate_cosine_distances(sentences)
        start_index = 0

        # Create a list to hold the grouped sentences
        chunks = []
        breakpoint_percentile_threshold = 95
        breakpoint_distance_threshold = np.percentile(
            distances, breakpoint_percentile_threshold
        )  # If you want more chunks, lower the percentile cutoff

        indices_above_thresh = [
            i for i, x in enumerate(distances) if x > breakpoint_distance_threshold
        ]  # The indices of those breakpoints on your list
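
        # Illustrative sketch, not part of the original file: with distances
        # [0.1, 0.2, 0.9, 0.15], np.percentile(distances, 95) interpolates to
        # ~0.795, so only index 2 exceeds the threshold and a chunk boundary
        # falls after the third sentence.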

        # Iterate through the breakpoints to slice the sentences
        for index in indices_above_thresh:
            # The end index is the current breakpoint
            end_index = index

            # Slice the sentence_dicts from the current start index to the end index
            group = sentences[start_index : end_index + 1]
            combined_text = " ".join([d["sentence"] for d in group])
            chunks.append(combined_text)

            # Update the start index for the next group
            start_index = index + 1

        # The last group, if any sentences remain
        if start_index < len(sentences):
            combined_text = " ".join([d["sentence"] for d in sentences[start_index:]])
            chunks.append(combined_text)

        return chunks

    def create_documents(
        self, texts: List[str], metadatas: Optional[List[dict]] = None
    ) -> List[Document]:
        """Create documents from a list of texts."""
        _metadatas = metadatas or [{}] * len(texts)
        documents = []
        for i, text in enumerate(texts):
            index = -1
            for chunk in self.split_text(text):
                metadata = copy.deepcopy(_metadatas[i])
                if self._add_start_index:
                    index = text.find(chunk, index + 1)
                    metadata["start_index"] = index
                new_doc = Document(page_content=chunk, metadata=metadata)
                documents.append(new_doc)
        return documents

    def split_documents(self, documents: Iterable[Document]) -> List[Document]:
        """Split documents."""
        texts, metadatas = [], []
        for doc in documents:
            texts.append(doc.page_content)
            metadatas.append(doc.metadata)
        return self.create_documents(texts, metadatas=metadatas)

    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        """Transform sequence of documents by splitting them."""
        return self.split_documents(list(documents))
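
To close the loop, here is a minimal end-to-end sketch of the new splitter that runs without network access. `ToyEmbeddings` is a stand-in written for this example, not part of the PR; any `Embeddings` implementation can be swapped in the same way:

```python
from typing import List

from langchain_core.embeddings import Embeddings
from langchain_experimental.text_splitter import SemanticChunker


class ToyEmbeddings(Embeddings):
    """Deterministic stand-in: embeds text as [character count, vowel count]."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.embed_query(t) for t in texts]

    def embed_query(self, text: str) -> List[float]:
        vowels = sum(ch in "aeiouAEIOU" for ch in text)
        return [float(len(text)), float(vowels)]


splitter = SemanticChunker(ToyEmbeddings(), add_start_index=True)
docs = splitter.create_documents(
    ["The sky is blue. The ocean is blue too. Tax law changed this year!"]
)
for doc in docs:
    print(doc.metadata, doc.page_content)
```

One caveat visible in the code above: with fewer than two sentences, `distances` is empty and `np.percentile` will raise, so inputs need at least two sentences.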