langchain/tests/integration_tests/test_nlp_text_splitters.py

"""Test text splitting functionality using NLTK and Spacy based sentence splitters."""
import pytest

from langchain.text_splitter import NLTKTextSplitter, SpacyTextSplitter


def test_nltk_text_splitting_args() -> None:
    """Test invalid arguments."""
    with pytest.raises(ValueError):
        NLTKTextSplitter(chunk_size=2, chunk_overlap=4)


def test_spacy_text_splitting_args() -> None:
    """Test invalid arguments."""
    with pytest.raises(ValueError):
        SpacyTextSplitter(chunk_size=2, chunk_overlap=4)


def test_nltk_text_splitter() -> None:
    """Test splitting by sentence using NLTK."""
    text = "This is sentence one. And this is sentence two."
    separator = "|||"
    splitter = NLTKTextSplitter(separator=separator)
    output = splitter.split_text(text)
    expected_output = [f"This is sentence one.{separator}And this is sentence two."]
    assert output == expected_output


@pytest.mark.parametrize("pipeline", ["sentencizer", "en_core_web_sm"])
def test_spacy_text_splitter(pipeline: str) -> None:
    """Test splitting by sentence using Spacy."""
    text = "This is sentence one. And this is sentence two."
    separator = "|||"
    splitter = SpacyTextSplitter(separator=separator, pipeline=pipeline)
    output = splitter.split_text(text)
    expected_output = [f"This is sentence one.{separator}And this is sentence two."]
    assert output == expected_output