langchain[patch]: Add konlpy based text splitting for Korean (#16003)

- **Description:** Adds a text splitter based on [KoNLPy](https://konlpy.org/en/latest/#start), a Python package for natural language processing (NLP) of the Korean language (comparable to spaCy or NLTK, but for Korean).
- **Dependencies:** KoNLPy must be installed before this splitter is used.
- **Twitter handle:** @untilhamza
5 months ago · 39b3c6d94c
parent 9b0a531aa2
commit 39b3c6d94c
2 changed files with 131 additions and 1 deletions
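
The splitter added in this commit works in two steps: `Kkma` segments the text into sentences, then the base `TextSplitter` merges those sentences into chunks joined by the separator. The sketch below illustrates that split-then-merge idea without requiring KoNLPy. It is a simplification under stated assumptions, not LangChain's `_merge_splits`: `naive_sentences` is a hypothetical regex stand-in for `Kkma().sentences`, `split_by_sentences` is an illustrative name, and chunk overlap is ignored.

```python
import re
from typing import Callable, List


def naive_sentences(text: str) -> List[str]:
    """Hypothetical stand-in for Kkma().sentences: split after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_by_sentences(
    text: str,
    sentence_fn: Callable[[str], List[str]] = naive_sentences,
    separator: str = "\n\n",
    chunk_size: int = 80,
) -> List[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size characters.

    A single sentence longer than chunk_size is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentence_fn(text):
        # Cost of appending this sentence, including the separator if needed.
        extra = len(sentence) + (len(separator) if current else 0)
        if current and current_len + extra > chunk_size:
            # The chunk is full: emit it and start a new one with this sentence.
            chunks.append(separator.join(current))
            current, current_len = [], 0
            extra = len(sentence)
        current.append(sentence)
        current_len += extra
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "옛날에 남원에 이 도령이 있었다. 그는 춘향을 만났다. 두 사람은 사랑에 빠졌다."
chunks = split_by_sentences(sample, chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

With the real splitter, the sentence step would be `Kkma().sentences(text)`, and the chunk size would come from the `chunk_size` keyword accepted by the `TextSplitter` base class.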
--- a/docs/docs/modules/data_connection/document_transformers/split_by_token.ipynb
+++ b/docs/docs/modules/data_connection/document_transformers/split_by_token.ipynb
@@ -419,6 +419,105 @@
    "print(texts[0])"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "98a3f975",
+   "metadata": {},
+   "source": [
+    "## KoNLPy\n",
+    "> [KoNLPy: Korean NLP in Python](https://konlpy.org/en/latest/) is a Python package for natural language processing (NLP) of the Korean language.\n",
+    "\n",
+    "Token splitting involves the segmentation of text into smaller, more manageable units called tokens. These tokens are often words, phrases, symbols, or other meaningful elements crucial for further processing and analysis. In languages like English, token splitting typically involves separating words by spaces and punctuation marks. The effectiveness of token splitting largely depends on the tokenizer's understanding of the language structure, ensuring the generation of meaningful tokens. Since tokenizers designed for the English language are not equipped to understand the unique semantic structures of other languages, such as Korean, they cannot be effectively used for Korean language processing.\n",
+    "\n",
+    "### Token splitting for Korean with KoNLPy's Kkma Analyzer\n",
+    "For Korean text, KoNLPy includes a morphological analyzer called `Kkma` (Korean Knowledge Morpheme Analyzer). `Kkma` provides detailed morphological analysis of Korean text: it breaks sentences down into words and words into their respective morphemes, identifying the part of speech of each token. It can also segment a block of text into individual sentences, which is particularly useful for processing long texts.\n",
+    "\n",
+    "### Usage Considerations\n",
+    "While `Kkma` is renowned for its detailed analysis, it is important to note that this precision may impact processing speed. Thus, `Kkma` is best suited for applications where analytical depth is prioritized over rapid text processing."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
+   "id": "88ec8f2f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# pip install konlpy"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "id": "ddfba6cf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# This is a long Korean document that we want to split up into its component sentences.\n",
+    "with open(\"./your_korean_doc.txt\") as f:\n",
+    "    korean_document = f.read()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "id": "225dfc5c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.text_splitter import KonlpyTextSplitter\n",
+    "\n",
+    "text_splitter = KonlpyTextSplitter()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 37,
+   "id": "cf156711",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "춘향전 옛날에 남원에 이 도령이라는 벼슬아치 아들이 있었다.\n",
+      "\n",
+      "그의 외모는 빛나는 달처럼 잘생겼고, 그의 학식과 기예는 남보다 뛰어났다.\n",
+      "\n",
+      "한편, 이 마을에는 춘향이라는 절세 가인이 살고 있었다.\n",
+      "\n",
+      "춘 향의 아름다움은 꽃과 같아 마을 사람들 로부터 많은 사랑을 받았다.\n",
+      "\n",
+      "어느 봄날, 도령은 친구들과 놀러 나갔다가 춘 향을 만 나 첫 눈에 반하고 말았다.\n",
+      "\n",
+      "두 사람은 서로 사랑하게 되었고, 이내 비밀스러운 사랑의 맹세를 나누었다.\n",
+      "\n",
+      "하지만 좋은 날들은 오래가지 않았다.\n",
+      "\n",
+      "도령의 아버지가 다른 곳으로 전근을 가게 되어 도령도 떠나 야만 했다.\n",
+      "\n",
+      "이별의 아픔 속에서도, 두 사람은 재회를 기약하며 서로를 믿고 기다리기로 했다.\n",
+      "\n",
+      "그러나 새로 부임한 관아의 사또가 춘 향의 아름다움에 욕심을 내 어 그녀에게 강요를 시작했다.\n",
+      "\n",
+      "춘 향 은 도령에 대한 자신의 사랑을 지키기 위해, 사또의 요구를 단호히 거절했다.\n",
+      "\n",
+      "이에 분노한 사또는 춘 향을 감옥에 가두고 혹독한 형벌을 내렸다.\n",
+      "\n",
+      "이야기는 이 도령이 고위 관직에 오른 후, 춘 향을 구해 내는 것으로 끝난다.\n",
+      "\n",
+      "두 사람은 오랜 시련 끝에 다시 만나게 되고, 그들의 사랑은 온 세상에 전해 지며 후세에까지 이어진다.\n",
+      "\n",
+      "- 춘향전 (The Tale of Chunhyang)\n"
+     ]
+    }
+   ],
+   "source": [
+    "texts = text_splitter.split_text(korean_document)\n",
+    "# Sentences within a chunk are separated by \"\\n\\n\".\n",
+    "print(texts[0])"
+   ]
+  },
  {
   "cell_type": "markdown",
   "id": "13dc0983",
@@ -521,7 +620,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.10.1"
+   "version": "3.10.12"
  },
  "vscode": {
   "interpreter": {
--- a/libs/langchain/langchain/text_splitter.py
+++ b/libs/langchain/langchain/text_splitter.py
@@ -1427,6 +1427,37 @@ class SpacyTextSplitter(TextSplitter):
        return self._merge_splits(splits, self._separator)


+class KonlpyTextSplitter(TextSplitter):
+    """Split text into sentences using the Konlpy package.
+
+    It is well suited to splitting Korean text.
+    """
+
+    def __init__(
+        self,
+        separator: str = "\n\n",
+        **kwargs: Any,
+    ) -> None:
+        """Initialize the Konlpy text splitter."""
+        super().__init__(**kwargs)
+        self._separator = separator
+        try:
+            from konlpy.tag import Kkma
+        except ImportError:
+            raise ImportError(
+                """
+                Konlpy is not installed, please install it with 
+                `pip install konlpy`
+                """
+            )
+        self.kkma = Kkma()
+
+    def split_text(self, text: str) -> List[str]:
+        """Split incoming text and return chunks."""
+        splits = self.kkma.sentences(text)
+        return self._merge_splits(splits, self._separator)
+
+
 # For backwards compatibility
 class PythonCodeTextSplitter(RecursiveCharacterTextSplitter):
    """Attempts to split the text along Python syntax."""