"""Test text splitting functionality."""
langchain: adds recursive json splitter (#17144)
- **Description:** This adds a recursive json splitter class to the
existing text_splitters, along with unit tests.
- **Issue:** Splitting text that comes from structured data can cause
problems: if you have a large nested json object and split it as regular
text, you may lose the structure of the json. To mitigate this you can
split the nested json into large chunks and overlap them, but this
causes unnecessary text processing, and there will still be times when
the nested json is so big that the chunks get separated from their
parent keys.
As an example, you wouldn't want the following to be split in half:
```shell
{'val0': 'DFWeNdWhapbR',
'val1': {'val10': 'QdJo',
'val11': 'FWSDVFHClW',
'val12': 'bkVnXMMlTiQh',
'val13': 'tdDMKRrOY',
'val14': 'zybPALvL',
'val15': 'JMzGMNH',
'val16': {'val160': 'qLuLKusFw',
'val161': 'DGuotLh',
'val162': 'KztlcSBropT',
-----------------------------------------------------------------------split-----
'val163': 'YlHHDrN',
'val164': 'CtzsxlGBZKf',
'val165': 'bXzhcrWLmBFp',
'val166': 'zZAqC',
'val167': 'ZtyWno',
'val168': 'nQQZRsLnaBhb',
'val169': 'gSpMbJwA'},
'val17': 'JhgiyF',
'val18': 'aJaqjUSFFrI',
'val19': 'glqNSvoyxdg'}}
```
Any LLM processing the second chunk of text may not have the context of
`val1` and `val16`, reducing accuracy. Embeddings will also lack this
context, making retrieval less accurate.
Instead, you want it to be split into chunks that retain the json
structure:
```shell
{'val0': 'DFWeNdWhapbR',
'val1': {'val10': 'QdJo',
'val11': 'FWSDVFHClW',
'val12': 'bkVnXMMlTiQh',
'val13': 'tdDMKRrOY',
'val14': 'zybPALvL',
'val15': 'JMzGMNH',
'val16': {'val160': 'qLuLKusFw',
'val161': 'DGuotLh',
'val162': 'KztlcSBropT',
'val163': 'YlHHDrN',
'val164': 'CtzsxlGBZKf'}}}
```
and
```shell
{'val1':{'val16':{
'val165': 'bXzhcrWLmBFp',
'val166': 'zZAqC',
'val167': 'ZtyWno',
'val168': 'nQQZRsLnaBhb',
'val169': 'gSpMbJwA'},
'val17': 'JhgiyF',
'val18': 'aJaqjUSFFrI',
'val19': 'glqNSvoyxdg'}}
```
This recursive json text splitter does exactly that. Values that contain
a list can be converted to dicts first by using
`split(..., convert_lists=True)`; otherwise long lists will not be
split, and you may end up with chunks larger than the max chunk size.
In my testing, large json objects could be split into small chunks with:
✅ Increased question-answering accuracy
✅ The ability to split into smaller chunks, meaning retrieval queries
can use fewer tokens
- **Dependencies:** `json` import added to text_splitter.py, and
`random` added to the unit test
- **Twitter handle:** @joelsprunger
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2024-02-08 21:45:34 +00:00
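The chunking idea above can be sketched in plain Python. This is an illustrative toy under assumptions, not the splitter class this PR adds: the name `split_json_like`, the `max_chunk_size` parameter, and the serialized-length budget are all invented here.

```python
import json
from typing import Any, Dict, List


def split_json_like(
    data: Dict[str, Any], max_chunk_size: int = 120
) -> List[Dict[str, Any]]:
    """Recursively split a nested dict into chunks whose serialized size
    stays under max_chunk_size, re-nesting each value under its parent
    keys so no chunk loses its path back to the root."""

    def set_nested(chunk: Dict[str, Any], path: List[str], value: Any) -> None:
        # Re-create the chain of parent keys inside this chunk.
        for key in path[:-1]:
            chunk = chunk.setdefault(key, {})
        chunk[path[-1]] = value

    chunks: List[Dict[str, Any]] = [{}]

    def walk(node: Any, path: List[str]) -> None:
        if isinstance(node, dict) and node:
            for key, value in node.items():
                walk(value, path + [key])
        else:
            # Would adding this value push the current chunk over budget?
            trial = json.loads(json.dumps(chunks[-1]))  # cheap deep copy
            set_nested(trial, path, node)
            if len(json.dumps(trial)) > max_chunk_size and chunks[-1]:
                chunks.append({})
            # A single oversized value still lands in its own chunk,
            # mirroring the caveat above about long list/string values.
            set_nested(chunks[-1], path, node)

    walk(data, [])
    return chunks
```

Packing greedily like this keeps every leaf reachable from the root of its chunk, which is exactly the property the example above loses at the `split` marker.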
import random
Add regex control over separators in character text splitter (#7933)
#7854
Added the ability to use the `separator` as a regex or as a simple
character.
Fixed a bug where `start_index` was incorrectly counting from -1.
Who can review?
@eyurtsev
@hwchase17
@mmz-001
2023-08-04 03:25:23 +00:00
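The separator handling this PR describes can be sketched as follows. `split_on_separator` is a hypothetical stand-alone helper (the real flags live on `CharacterTextSplitter`), and the attach-to-the-following-split behavior is inferred from the tests further down.

```python
import re
from typing import List


def split_on_separator(
    text: str, separator: str, is_separator_regex: bool, keep_separator: bool
) -> List[str]:
    """Split text on a separator that is either a regex pattern or a
    plain string, optionally keeping each separator attached to the
    split that follows it."""
    pattern = separator if is_separator_regex else re.escape(separator)
    if keep_separator:
        # Capturing the separator makes re.split return it as its own
        # element; glue each separator onto the following piece.
        pieces = re.split(f"({pattern})", text)
        splits = [pieces[i] + pieces[i + 1] for i in range(1, len(pieces), 2)]
        splits = ([pieces[0]] if pieces[0] else []) + splits
    else:
        splits = re.split(pattern, text)
    return [s for s in splits if s]
```

With `keep_separator=True`, `'foo.bar.baz.123'` splits into `['foo', '.bar', '.baz', '.123']`; wrapping the pattern in a capture group is what makes `re.split` hand the separators back.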
import re
import string
from pathlib import Path
from typing import Any, List
import pytest
from langchain_core.documents import Document
from langchain_text_splitters import (
Add more code splitters (go, rst, js, java, cpp, scala, ruby, php, swift, rust) (#5171)
As the title says, I added more code splitters.
The implementation is trivial, so I didn't add separate tests for each
splitter.
Let me know if there are any concerns.
Fixes:
https://github.com/hwchase17/langchain/issues/5170
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
@eyurtsev @hwchase17
---------
Signed-off-by: byhsu <byhsu@linkedin.com>
Co-authored-by: byhsu <byhsu@linkedin.com>
2023-05-30 15:04:05 +00:00
    Language,
    RecursiveCharacterTextSplitter,
    TextSplitter,
    Tokenizer,
)
from langchain_text_splitters.base import split_text_on_tokens
from langchain_text_splitters.character import CharacterTextSplitter
from langchain_text_splitters.html import HTMLHeaderTextSplitter, HTMLSectionSplitter
from langchain_text_splitters.json import RecursiveJsonSplitter
from langchain_text_splitters.markdown import MarkdownHeaderTextSplitter
from langchain_text_splitters.python import PythonCodeTextSplitter
FAKE_PYTHON_TEXT = """
class Foo:

    def bar():


def foo():

def testing_func():

def bar():
"""
def test_character_text_splitter() -> None:
    """Test splitting by character count."""
    text = "foo bar baz 123"
    splitter = CharacterTextSplitter(separator=" ", chunk_size=7, chunk_overlap=3)
    output = splitter.split_text(text)
    expected_output = ["foo bar", "bar baz", "baz 123"]
    assert output == expected_output
def test_character_text_splitter_empty_doc() -> None:
    """Test splitting by character count doesn't create empty documents."""
    text = "foo bar"
    splitter = CharacterTextSplitter(separator=" ", chunk_size=2, chunk_overlap=0)
    output = splitter.split_text(text)
    expected_output = ["foo", "bar"]
    assert output == expected_output
def test_character_text_splitter_separator_empty_doc() -> None:
    """Test edge cases where the splits are separators."""
    text = "f b"
    splitter = CharacterTextSplitter(separator=" ", chunk_size=2, chunk_overlap=0)
    output = splitter.split_text(text)
    expected_output = ["f", "b"]
    assert output == expected_output
def test_character_text_splitter_long() -> None:
    """Test splitting by character count on long words."""
    text = "foo bar baz a a"
    splitter = CharacterTextSplitter(separator=" ", chunk_size=3, chunk_overlap=1)
    output = splitter.split_text(text)
    expected_output = ["foo", "bar", "baz", "a a"]
    assert output == expected_output
def test_character_text_splitter_short_words_first() -> None:
    """Test splitting by character count when shorter words are first."""
    text = "a a foo bar baz"
    splitter = CharacterTextSplitter(separator=" ", chunk_size=3, chunk_overlap=1)
    output = splitter.split_text(text)
    expected_output = ["a a", "foo", "bar", "baz"]
    assert output == expected_output
def test_character_text_splitter_longer_words() -> None:
    """Test splitting by characters when splits not found easily."""
    text = "foo bar baz 123"
    splitter = CharacterTextSplitter(separator=" ", chunk_size=1, chunk_overlap=1)
    output = splitter.split_text(text)
    expected_output = ["foo", "bar", "baz", "123"]
    assert output == expected_output
@pytest.mark.parametrize(
    "separator, is_separator_regex", [(re.escape("."), True), (".", False)]
)
def test_character_text_splitter_keep_separator_regex(
    separator: str, is_separator_regex: bool
) -> None:
    """Test splitting by characters while keeping the separator
    that is a regex special character.
    """
    text = "foo.bar.baz.123"
    splitter = CharacterTextSplitter(
        separator=separator,
        chunk_size=1,
        chunk_overlap=0,
        keep_separator=True,
        is_separator_regex=is_separator_regex,
    )
    output = splitter.split_text(text)
    expected_output = ["foo", ".bar", ".baz", ".123"]
    assert output == expected_output
@pytest.mark.parametrize(
    "separator, is_separator_regex", [(re.escape("."), True), (".", False)]
)
def test_character_text_splitter_discard_separator_regex(
    separator: str, is_separator_regex: bool
) -> None:
    """Test splitting by characters discarding the separator
    that is a regex special character."""
    text = "foo.bar.baz.123"
    splitter = CharacterTextSplitter(
        separator=separator,
        chunk_size=1,
        chunk_overlap=0,
        keep_separator=False,
        is_separator_regex=is_separator_regex,
    )
    output = splitter.split_text(text)
    expected_output = ["foo", "bar", "baz", "123"]
    assert output == expected_output
def test_character_text_splitting_args() -> None:
    """Test invalid arguments."""
    with pytest.raises(ValueError):
        CharacterTextSplitter(chunk_size=2, chunk_overlap=4)
|
2022-12-21 03:24:08 +00:00
|
|
|
|
|
|
|
|
2023-04-25 17:02:59 +00:00
|
|
|
def test_merge_splits() -> None:
    """Test merging splits with a given separator."""
    splitter = CharacterTextSplitter(separator=" ", chunk_size=9, chunk_overlap=2)
    splits = ["foo", "bar", "baz"]
    expected_output = ["foo bar", "baz"]
    output = splitter._merge_splits(splits, separator=" ")
    assert output == expected_output
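A rough sketch of the merge-with-overlap behavior these tests exercise, assuming the greedy windowing the expected outputs imply; this `merge_splits` is a self-contained toy, not the library's private `_merge_splits`.

```python
from typing import List


def merge_splits(
    splits: List[str], chunk_size: int, chunk_overlap: int, separator: str = " "
) -> List[str]:
    """Greedily pack splits into chunks of at most chunk_size characters,
    keeping up to chunk_overlap characters of trailing context."""
    sep_len = len(separator)
    docs: List[str] = []
    current: List[str] = []
    total = 0
    for split in splits:
        extra = len(split) + (sep_len if current else 0)
        if total + extra > chunk_size and current:
            docs.append(separator.join(current))
            # Drop leading splits until the kept tail fits the overlap
            # budget and leaves room for the incoming split.
            while total > chunk_overlap or (
                current and total + len(split) + sep_len > chunk_size
            ):
                total -= len(current[0]) + (sep_len if len(current) > 1 else 0)
                current.pop(0)
        current.append(split)
        total += len(split) + (sep_len if len(current) > 1 else 0)
    if current:
        docs.append(separator.join(current))
    return docs
```

On the inputs above it reproduces both expected outputs: `chunk_size=9, chunk_overlap=2` gives `["foo bar", "baz"]`, and `chunk_size=7, chunk_overlap=3` carries `"bar"` and `"baz"` forward as overlap.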
def test_create_documents() -> None:
    """Test create documents method."""
    texts = ["foo bar", "baz"]
    splitter = CharacterTextSplitter(separator=" ", chunk_size=3, chunk_overlap=0)
    docs = splitter.create_documents(texts)
    expected_docs = [
        Document(page_content="foo"),
        Document(page_content="bar"),
        Document(page_content="baz"),
    ]
    assert docs == expected_docs
def test_create_documents_with_metadata() -> None:
    """Test create documents with metadata method."""
    texts = ["foo bar", "baz"]
    splitter = CharacterTextSplitter(separator=" ", chunk_size=3, chunk_overlap=0)
    docs = splitter.create_documents(texts, [{"source": "1"}, {"source": "2"}])
    expected_docs = [
        Document(page_content="foo", metadata={"source": "1"}),
        Document(page_content="bar", metadata={"source": "1"}),
        Document(page_content="baz", metadata={"source": "2"}),
    ]
    assert docs == expected_docs
@pytest.mark.parametrize(
    "splitter, text, expected_docs",
    [
        (
            CharacterTextSplitter(
                separator=" ", chunk_size=7, chunk_overlap=3, add_start_index=True
            ),
            "foo bar baz 123",
            [
                Document(page_content="foo bar", metadata={"start_index": 0}),
                Document(page_content="bar baz", metadata={"start_index": 4}),
                Document(page_content="baz 123", metadata={"start_index": 8}),
            ],
        ),
        (
            RecursiveCharacterTextSplitter(
                chunk_size=6,
                chunk_overlap=0,
                separators=["\n\n", "\n", " ", ""],
                add_start_index=True,
            ),
            "w1 w1 w1 w1 w1 w1 w1 w1 w1",
            [
                Document(page_content="w1 w1", metadata={"start_index": 0}),
                Document(page_content="w1 w1", metadata={"start_index": 6}),
                Document(page_content="w1 w1", metadata={"start_index": 12}),
                Document(page_content="w1 w1", metadata={"start_index": 18}),
                Document(page_content="w1", metadata={"start_index": 24}),
            ],
        ),
    ],
)
def test_create_documents_with_start_index(
    splitter: TextSplitter, text: str, expected_docs: List[Document]
) -> None:
Add start index to metadata in TextSplitter (#5912)
#### Add start index to metadata in TextSplitter
- Modified the `create_documents` method to track the start position of
each chunk
- The `start_index` is included in the metadata if the `add_start_index`
parameter in the class constructor is set to `True`
This enables referencing back to the original document, which is
particularly useful when a specific chunk is retrieved.
#### Who can review?
Tag maintainers/contributors who might be interested:
@eyurtsev @agola11
2023-06-09 06:09:32 +00:00
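The start-index bookkeeping this message describes can be sketched independently of the library. `chunks_with_start_index` is a hypothetical helper; searching forward from just past the previous match is an assumption that keeps repeated, overlapping chunks at increasing offsets.

```python
from typing import Dict, List


def chunks_with_start_index(text: str, chunks: List[str]) -> List[Dict[str, int]]:
    """Find each chunk's offset in the original text, searching forward
    from just past the previous chunk's start so that repeated chunks
    (e.g. produced by overlap) get strictly increasing offsets."""
    metadata: List[Dict[str, int]] = []
    offset = 0
    for chunk in chunks:
        start = text.find(chunk, offset)
        metadata.append({"start_index": start})
        offset = start + 1
    return metadata
```

The `start_index` lets a retriever map any chunk back to `text[start : start + len(chunk)]` in the source document.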
    """Test create documents method."""
    docs = splitter.create_documents([text])
    assert docs == expected_docs
    for doc in docs:
        s_i = doc.metadata["start_index"]
        assert text[s_i : s_i + len(doc.page_content)] == doc.page_content
def test_metadata_not_shallow() -> None:
    """Test that metadatas are not shallow."""
    texts = ["foo bar"]
    splitter = CharacterTextSplitter(separator=" ", chunk_size=3, chunk_overlap=0)
    docs = splitter.create_documents(texts, [{"source": "1"}])
    expected_docs = [
        Document(page_content="foo", metadata={"source": "1"}),
        Document(page_content="bar", metadata={"source": "1"}),
    ]
    assert docs == expected_docs
    docs[0].metadata["foo"] = 1
    assert docs[0].metadata == {"source": "1", "foo": 1}
    assert docs[1].metadata == {"source": "1"}
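The behavior under test — mutating one chunk's metadata must not affect its siblings — comes down to giving each chunk its own copy of the metadata dict rather than a shared reference. A minimal standalone sketch of that idea (`make_chunks` is a hypothetical stand-in, not the library API):

```python
from copy import deepcopy


def make_chunks(text: str, metadata: dict) -> list:
    # Each chunk gets its own deep copy of the metadata dict; reusing the
    # same dict object would make the metadatas "shallow" and let a
    # mutation on one chunk show up on every other chunk.
    return [
        {"page_content": word, "metadata": deepcopy(metadata)}
        for word in text.split(" ")
    ]


chunks = make_chunks("foo bar", {"source": "1"})
chunks[0]["metadata"]["foo"] = 1
assert chunks[0]["metadata"] == {"source": "1", "foo": 1}
assert chunks[1]["metadata"] == {"source": "1"}  # sibling is unaffected
```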


def test_iterative_text_splitter_keep_separator() -> None:
    chunk_size = 5
    output = __test_iterative_text_splitter(chunk_size=chunk_size, keep_separator=True)

    assert output == [
        "....5",
        "X..3",
        "Y...4",
        "X....5",
        "Y...",
    ]


def test_iterative_text_splitter_discard_separator() -> None:
    chunk_size = 5
    output = __test_iterative_text_splitter(chunk_size=chunk_size, keep_separator=False)

    assert output == [
        "....5",
        "..3",
        "...4",
        "....5",
        "...",
    ]


def __test_iterative_text_splitter(chunk_size: int, keep_separator: bool) -> List[str]:
    chunk_size += 1 if keep_separator else 0

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=0,
        separators=["X", "Y"],
        keep_separator=keep_separator,
    )
    text = "....5X..3Y...4X....5Y..."
    output = splitter.split_text(text)
    for chunk in output:
        assert len(chunk) <= chunk_size, f"Chunk is larger than {chunk_size}"
    return output
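The recursive behavior the helper above exercises can be sketched in plain Python. This is a simplified, illustrative model, not the library implementation (the real `RecursiveCharacterTextSplitter` also merges small pieces and falls back to character-level splitting, which this sketch omits): it tries each separator in order, and with `keep_separator` it splits on a zero-width lookahead so the separator stays glued to the start of the following chunk — which is why the keep-separator test expects `"X..3"` rather than `"..3"`.

```python
import re


def split_text(text, separators, chunk_size, keep_separator=True):
    # Find the first separator present in the text; with keep_separator
    # split on a lookahead so the separator starts the next chunk, then
    # recurse with the remaining separators into any oversized piece.
    sep = next((s for s in separators if s in text), None)
    if sep is None:
        return [text] if text else []
    pattern = f"(?={re.escape(sep)})" if keep_separator else re.escape(sep)
    pieces = [p for p in re.split(pattern, text) if p]
    out = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            out.append(piece)
        else:
            out.extend(split_text(piece, separators[1:], chunk_size, keep_separator))
    return out


text = "....5X..3Y...4X....5Y..."
# chunk_size 6 mirrors the helper's "chunk_size + 1" adjustment above.
assert split_text(text, ["X", "Y"], 6, keep_separator=True) == [
    "....5", "X..3", "Y...4", "X....5", "Y...",
]
assert split_text(text, ["X", "Y"], 5, keep_separator=False) == [
    "....5", "..3", "...4", "....5", "...",
]
```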


def test_iterative_text_splitter() -> None:
    """Test iterative text splitter."""
    text = """Hi.\n\nI'm Harrison.\n\nHow? Are? You?\nOkay then f f f f.

This is a weird text to write, but gotta test the splittingggg some how.

Bye!\n\n-H."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=10, chunk_overlap=1)
    output = splitter.split_text(text)
    expected_output = [
        "Hi.",
        "I'm",
        "Harrison.",
        "How? Are?",
        "You?",
        "Okay then",
        "f f f f.",
        "This is a",
        "weird",
        "text to",
        "write,",
        "but gotta",
        "test the",
        "splitting",
        "gggg",
        "some how.",
        "Bye!",
        "-H.",
    ]
    assert output == expected_output


def test_split_documents() -> None:
    """Test split_documents."""
    splitter = CharacterTextSplitter(separator="", chunk_size=1, chunk_overlap=0)
    docs = [
        Document(page_content="foo", metadata={"source": "1"}),
        Document(page_content="bar", metadata={"source": "2"}),
        Document(page_content="baz", metadata={"source": "1"}),
    ]
    expected_output = [
        Document(page_content="f", metadata={"source": "1"}),
        Document(page_content="o", metadata={"source": "1"}),
        Document(page_content="o", metadata={"source": "1"}),
        Document(page_content="b", metadata={"source": "2"}),
        Document(page_content="a", metadata={"source": "2"}),
        Document(page_content="r", metadata={"source": "2"}),
        Document(page_content="b", metadata={"source": "1"}),
        Document(page_content="a", metadata={"source": "1"}),
        Document(page_content="z", metadata={"source": "1"}),
    ]
    assert splitter.split_documents(docs) == expected_output
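Conceptually, `split_documents` flattens every input document into chunks and stamps the parent document's metadata onto each resulting chunk, which is exactly what the expected output above encodes. A minimal sketch under that assumption, with plain dicts standing in for `Document` (illustrative only, not the library code):

```python
def split_documents(documents: list, chunk_size: int) -> list:
    # Split each document's text into chunk_size pieces and copy the
    # parent document's metadata onto every resulting chunk.
    out = []
    for doc in documents:
        text = doc["page_content"]
        for i in range(0, len(text), chunk_size):
            out.append(
                {
                    "page_content": text[i : i + chunk_size],
                    "metadata": dict(doc["metadata"]),
                }
            )
    return out


result = split_documents([{"page_content": "foo", "metadata": {"source": "1"}}], 1)
assert [d["page_content"] for d in result] == ["f", "o", "o"]
assert all(d["metadata"] == {"source": "1"} for d in result)
```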


def test_python_text_splitter() -> None:
    splitter = PythonCodeTextSplitter(chunk_size=30, chunk_overlap=0)
    splits = splitter.split_text(FAKE_PYTHON_TEXT)
    split_0 = """class Foo:\n\n    def bar():"""
    split_1 = """def foo():"""
    split_2 = """def testing_func():"""
    split_3 = """def bar():"""
    expected_splits = [split_0, split_1, split_2, split_3]
    assert splits == expected_splits
|
Add more code splitters (go, rst, js, java, cpp, scala, ruby, php, swift, rust) (#5171)
As the title says, I added more code splitters.
The implementation is trivial, so i don't add separate tests for each
splitter.
Let me know if any concerns.
Fixes # (issue)
https://github.com/hwchase17/langchain/issues/5170
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
@eyurtsev @hwchase17
---------
Signed-off-by: byhsu <byhsu@linkedin.com>
Co-authored-by: byhsu <byhsu@linkedin.com>
2023-05-30 15:04:05 +00:00
|
|
|
|
|
|
|
|
|
|
|
CHUNK_SIZE = 16


def test_python_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.PYTHON, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""
    chunks = splitter.split_text(code)
    assert chunks == [
        "def",
        "hello_world():",
        'print("Hello,',
        'World!")',
        "# Call the",
        "function",
        "hello_world()",
    ]


def test_golang_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.GO, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """
package main

import "fmt"

func helloWorld() {
    fmt.Println("Hello, World!")
}

func main() {
    helloWorld()
}
"""
    chunks = splitter.split_text(code)
    assert chunks == [
        "package main",
        'import "fmt"',
        "func",
        "helloWorld() {",
        'fmt.Println("He',
        "llo,",
        'World!")',
        "}",
        "func main() {",
        "helloWorld()",
        "}",
    ]


def test_rst_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.RST, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """
Sample Document
===============

Section
-------

This is the content of the section.

Lists
-----

- Item 1
- Item 2
- Item 3

Comment
*******
Not a comment

.. This is a comment
"""
    chunks = splitter.split_text(code)
    assert chunks == [
        "Sample Document",
        "===============",
        "Section",
        "-------",
        "This is the",
        "content of the",
        "section.",
        "Lists",
        "-----",
        "- Item 1",
        "- Item 2",
        "- Item 3",
        "Comment",
        "*******",
        "Not a comment",
        ".. This is a",
        "comment",
    ]
    # Special test for special characters
    code = "harry\n***\nbabylon is"
    chunks = splitter.split_text(code)
    assert chunks == ["harry", "***\nbabylon is"]


def test_proto_file_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.PROTO, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """
syntax = "proto3";

package example;

message Person {
    string name = 1;
    int32 age = 2;
    repeated string hobbies = 3;
}
"""
    chunks = splitter.split_text(code)
    assert chunks == [
        "syntax =",
        '"proto3";',
        "package",
        "example;",
        "message Person",
        "{",
        "string name",
        "= 1;",
        "int32 age =",
        "2;",
        "repeated",
        "string hobbies",
        "= 3;",
        "}",
    ]


def test_javascript_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.JS, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """
function helloWorld() {
    console.log("Hello, World!");
}

// Call the function
helloWorld();
"""
    chunks = splitter.split_text(code)
    assert chunks == [
        "function",
        "helloWorld() {",
        'console.log("He',
        "llo,",
        'World!");',
        "}",
        "// Call the",
        "function",
        "helloWorld();",
    ]


def test_cobol_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.COBOL, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """
IDENTIFICATION DIVISION.
PROGRAM-ID. HelloWorld.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 GREETING PIC X(12) VALUE 'Hello, World!'.
PROCEDURE DIVISION.
DISPLAY GREETING.
STOP RUN.
"""
    chunks = splitter.split_text(code)
    assert chunks == [
        "IDENTIFICATION",
        "DIVISION.",
        "PROGRAM-ID.",
        "HelloWorld.",
        "DATA DIVISION.",
        "WORKING-STORAGE",
        "SECTION.",
        "01 GREETING",
        "PIC X(12)",
        "VALUE 'Hello,",
        "World!'.",
        "PROCEDURE",
        "DIVISION.",
        "DISPLAY",
        "GREETING.",
        "STOP RUN.",
    ]


def test_typescript_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.TS, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """
function helloWorld(): void {
    console.log("Hello, World!");
}

// Call the function
helloWorld();
"""
    chunks = splitter.split_text(code)
    assert chunks == [
        "function",
        "helloWorld():",
        "void {",
        'console.log("He',
        "llo,",
        'World!");',
        "}",
        "// Call the",
        "function",
        "helloWorld();",
    ]


def test_java_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.JAVA, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!");
    }
}
"""
    chunks = splitter.split_text(code)
    assert chunks == [
        "public class",
        "HelloWorld {",
        "public",
        "static void",
        "main(String[]",
        "args) {",
        "System.out.prin",
        'tln("Hello,',
        'World!");',
        "}\n}",
    ]


def test_kotlin_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.KOTLIN, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """
class HelloWorld {
    companion object {
        @JvmStatic
        fun main(args: Array<String>) {
            println("Hello, World!")
        }
    }
}
"""
    chunks = splitter.split_text(code)
    assert chunks == [
        "class",
        "HelloWorld {",
        "companion",
        "object {",
        "@JvmStatic",
        "fun",
        "main(args:",
        "Array<String>)",
        "{",
        'println("Hello,',
        'World!")',
        "}\n    }",
        "}",
    ]


def test_csharp_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.CSHARP, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """
using System;
class Program
{
    static void Main()
    {
        int age = 30; // Change the age value as needed

        // Categorize the age without any console output
        if (age < 18)
        {
            // Age is under 18
        }
        else if (age >= 18 && age < 65)
        {
            // Age is an adult
        }
        else
        {
            // Age is a senior citizen
        }
    }
}
"""

    chunks = splitter.split_text(code)
    assert chunks == [
        "using System;",
        "class Program\n{",
        "static void",
        "Main()",
        "{",
        "int age",
        "= 30; // Change",
        "the age value",
        "as needed",
        "//",
        "Categorize the",
        "age without any",
        "console output",
        "if (age",
        "< 18)",
        "{",
        "//",
        "Age is under 18",
        "}",
        "else if",
        "(age >= 18 &&",
        "age < 65)",
        "{",
        "//",
        "Age is an adult",
        "}",
        "else",
        "{",
        "//",
        "Age is a senior",
        "citizen",
        "}\n    }",
        "}",
    ]


def test_cpp_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.CPP, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """
#include <iostream>

int main() {
    std::cout << "Hello, World!" << std::endl;
    return 0;
}
"""
    chunks = splitter.split_text(code)
    assert chunks == [
        "#include",
        "<iostream>",
        "int main() {",
        "std::cout",
        '<< "Hello,',
        'World!" <<',
        "std::endl;",
        "return 0;\n}",
    ]


def test_scala_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.SCALA, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """
object HelloWorld {
    def main(args: Array[String]): Unit = {
        println("Hello, World!")
    }
}
"""
    chunks = splitter.split_text(code)
    assert chunks == [
        "object",
        "HelloWorld {",
        "def",
        "main(args:",
        "Array[String]):",
        "Unit = {",
        'println("Hello,',
        'World!")',
        "}\n}",
    ]


def test_ruby_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.RUBY, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """
def hello_world
  puts "Hello, World!"
end

hello_world
"""
    chunks = splitter.split_text(code)
    assert chunks == [
        "def hello_world",
        'puts "Hello,',
        'World!"',
        "end",
        "hello_world",
    ]


def test_php_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.PHP, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """
<?php
function hello_world() {
    echo "Hello, World!";
}

hello_world();
?>
"""
    chunks = splitter.split_text(code)
    assert chunks == [
        "<?php",
        "function",
        "hello_world() {",
        "echo",
        '"Hello,',
        'World!";',
        "}",
        "hello_world();",
        "?>",
    ]


def test_swift_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.SWIFT, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """
func helloWorld() {
    print("Hello, World!")
}

helloWorld()
"""
    chunks = splitter.split_text(code)
    assert chunks == [
        "func",
        "helloWorld() {",
        'print("Hello,',
        'World!")',
        "}",
        "helloWorld()",
    ]


def test_rust_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.RUST, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """
fn main() {
    println!("Hello, World!");
}
"""
    chunks = splitter.split_text(code)
    assert chunks == ["fn main() {", 'println!("Hello', ",", 'World!");', "}"]


def test_markdown_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.MARKDOWN, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """
# Sample Document

## Section

This is the content of the section.

## Lists

- Item 1
- Item 2
- Item 3

### Horizontal lines

***********
____________
-------------------

#### Code blocks
```
This is a code block

# sample code
a = 1
b = 2
```
"""
    chunks = splitter.split_text(code)
    assert chunks == [
        "# Sample",
        "Document",
        "## Section",
        "This is the",
        "content of the",
        "section.",
        "## Lists",
        "- Item 1",
        "- Item 2",
        "- Item 3",
        "### Horizontal",
        "lines",
        "***********",
        "____________",
        "---------------",
        "----",
        "#### Code",
        "blocks",
        "```",
        "This is a code",
        "block",
        "# sample code",
        "a = 1\nb = 2",
        "```",
    ]
|
Fix invalid escape sequence warnings (#8771)
<!-- Thank you for contributing to LangChain!
Replace this comment with:
- Description: a description of the change,
- Issue: the issue # it fixes (if applicable),
- Dependencies: any dependencies required for this change,
- Tag maintainer: for a quicker response, tag the relevant maintainer
(see below),
- Twitter handle: we announce bigger features on Twitter. If your PR
gets announced and you'd like a mention, we'll gladly shout you out!
Please make sure you're PR is passing linting and testing before
submitting. Run `make format`, `make lint` and `make test` to check this
locally.
If you're adding a new integration, please include:
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use.
Maintainer responsibilities:
- General / Misc / if you don't know who to tag: @baskaryan
- DataLoaders / VectorStores / Retrievers: @rlancemartin, @eyurtsev
- Models / Prompts: @hwchase17, @baskaryan
- Memory: @hwchase17
- Agents / Tools / Toolkits: @hinthornw
- Tracing / Callbacks: @agola11
- Async: @agola11
If no one reviews your PR within a few days, feel free to @-mention the
same people again.
See contribution guidelines for more information on how to write/run
tests, lint, etc:
https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md
-->
Description: The lines I have changed looks like incorrectly escaped for
regex. In python 3.11, I receive DeprecationWarning for these lines.
You don't see any warnings unless you explicitly run python with `-W
always::DeprecationWarning` flag. So, this is my attempt to fix it.
Here are the warnings from log files:
```
/usr/local/lib/python3.11/site-packages/langchain/text_splitter.py:919: DeprecationWarning: invalid escape sequence '\s'
/usr/local/lib/python3.11/site-packages/langchain/text_splitter.py:918: DeprecationWarning: invalid escape sequence '\s'
/usr/local/lib/python3.11/site-packages/langchain/text_splitter.py:917: DeprecationWarning: invalid escape sequence '\s'
/usr/local/lib/python3.11/site-packages/langchain/text_splitter.py:916: DeprecationWarning: invalid escape sequence '\c'
/usr/local/lib/python3.11/site-packages/langchain/text_splitter.py:903: DeprecationWarning: invalid escape sequence '\*'
/usr/local/lib/python3.11/site-packages/langchain/text_splitter.py:804: DeprecationWarning: invalid escape sequence '\*'
/usr/local/lib/python3.11/site-packages/langchain/text_splitter.py:804: DeprecationWarning: invalid escape sequence '\*'
```
cc @baskaryan
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2023-08-07 00:01:18 +00:00
|
|
|
# Special test for special characters
|
|
|
|
code = "harry\n***\nbabylon is"
|
|
|
|
chunks = splitter.split_text(code)
|
|
|
|
assert chunks == ["harry", "***\nbabylon is"]
|
|
|
|
|
|
|
|
|
|
|
|
def test_latex_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.LATEX, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """
Hi Harrison!
\\chapter{1}
    """
    chunks = splitter.split_text(code)
    assert chunks == ["Hi Harrison!", "\\chapter{1}"]


def test_html_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.HTML, chunk_size=60, chunk_overlap=0
    )
    code = """
<h1>Sample Document</h1>
    <h2>Section</h2>
        <p id="1234">Reference content.</p>

    <h2>Lists</h2>
        <ul>
            <li>Item 1</li>
            <li>Item 2</li>
            <li>Item 3</li>
        </ul>

        <h3>A block</h3>
            <div class="amazing">
                <p>Some text</p>
                <p>Some more text</p>
            </div>
    """
    chunks = splitter.split_text(code)
    assert chunks == [
        "<h1>Sample Document</h1>\n    <h2>Section</h2>",
        '<p id="1234">Reference content.</p>',
        "<h2>Lists</h2>\n        <ul>",
        "<li>Item 1</li>\n            <li>Item 2</li>",
        "<li>Item 3</li>\n        </ul>",
        "<h3>A block</h3>",
        '<div class="amazing">',
        "<p>Some text</p>",
        "<p>Some more text</p>\n            </div>",
    ]


def test_md_header_text_splitter_1() -> None:
    """Test markdown splitter by header: Case 1."""

    markdown_document = (
        "# Foo\n\n"
        " ## Bar\n\n"
        "Hi this is Jim\n\n"
        "Hi this is Joe\n\n"
        " ## Baz\n\n"
        " Hi this is Molly"
    )
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
    ]
    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
    )
    output = markdown_splitter.split_text(markdown_document)
    expected_output = [
        Document(
            page_content="Hi this is Jim  \nHi this is Joe",
            metadata={"Header 1": "Foo", "Header 2": "Bar"},
        ),
        Document(
            page_content="Hi this is Molly",
            metadata={"Header 1": "Foo", "Header 2": "Baz"},
        ),
    ]
    assert output == expected_output


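The header-tracking behavior exercised by the test above can be pictured with a small stdlib-only sketch. This is not the library implementation: `split_by_headers` and its plain-dict output shape are hypothetical, and it ignores fenced code blocks and markdown line-break formatting. The idea is to walk the lines, keep a metadata dict of the currently open header at each depth, and flush a chunk whenever a new header opens.

```python
def split_by_headers(markdown: str, headers_to_split_on: list) -> list:
    """Simplified header-based splitter: returns [{"content", "metadata"}, ...]."""
    depth = {key: len(prefix) for prefix, key in headers_to_split_on}
    # match the longest prefix first so "##" is not mistaken for "#"
    by_length = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    chunks, lines, metadata = [], [], {}
    for raw in markdown.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = next(((p, k) for p, k in by_length if line.startswith(p + " ")), None)
        if header is None:
            lines.append(line)
            continue
        if lines:  # a new header closes the current chunk
            chunks.append({"content": "\n".join(lines), "metadata": dict(metadata)})
            lines = []
        prefix, key = header
        # drop headers at the same or deeper level; keep shallower ancestors
        metadata = {k: v for k, v in metadata.items() if depth[k] < len(prefix)}
        metadata[key] = line[len(prefix):].strip()
    if lines:
        chunks.append({"content": "\n".join(lines), "metadata": dict(metadata)})
    return chunks


doc = "# Foo\n\n## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n## Baz\n\nHi this is Molly"
chunks = split_by_headers(doc, [("#", "Header 1"), ("##", "Header 2")])
```

Note how the `## Baz` line both flushes the `Bar` chunk and replaces `Header 2` in the metadata while `Header 1` survives, mirroring the expected output in Case 1 above.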
def test_md_header_text_splitter_2() -> None:
    """Test markdown splitter by header: Case 2."""
    markdown_document = (
        "# Foo\n\n"
        " ## Bar\n\n"
        "Hi this is Jim\n\n"
        "Hi this is Joe\n\n"
        " ### Boo \n\n"
        " Hi this is Lance \n\n"
        " ## Baz\n\n"
        " Hi this is Molly"
    )

    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]
    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
    )
    output = markdown_splitter.split_text(markdown_document)
    expected_output = [
        Document(
            page_content="Hi this is Jim  \nHi this is Joe",
            metadata={"Header 1": "Foo", "Header 2": "Bar"},
        ),
        Document(
            page_content="Hi this is Lance",
            metadata={"Header 1": "Foo", "Header 2": "Bar", "Header 3": "Boo"},
        ),
        Document(
            page_content="Hi this is Molly",
            metadata={"Header 1": "Foo", "Header 2": "Baz"},
        ),
    ]
    assert output == expected_output


def test_md_header_text_splitter_3() -> None:
    """Test markdown splitter by header: Case 3."""

    markdown_document = (
        "# Foo\n\n"
        " ## Bar\n\n"
        "Hi this is Jim\n\n"
        "Hi this is Joe\n\n"
        " ### Boo \n\n"
        " Hi this is Lance \n\n"
        " #### Bim \n\n"
        " Hi this is John \n\n"
        " ## Baz\n\n"
        " Hi this is Molly"
    )

    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
        ("####", "Header 4"),
    ]

    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
    )
    output = markdown_splitter.split_text(markdown_document)

    expected_output = [
        Document(
            page_content="Hi this is Jim  \nHi this is Joe",
            metadata={"Header 1": "Foo", "Header 2": "Bar"},
        ),
        Document(
            page_content="Hi this is Lance",
            metadata={"Header 1": "Foo", "Header 2": "Bar", "Header 3": "Boo"},
        ),
        Document(
            page_content="Hi this is John",
            metadata={
                "Header 1": "Foo",
                "Header 2": "Bar",
                "Header 3": "Boo",
                "Header 4": "Bim",
            },
        ),
        Document(
            page_content="Hi this is Molly",
            metadata={"Header 1": "Foo", "Header 2": "Baz"},
        ),
    ]

    assert output == expected_output


def test_md_header_text_splitter_preserve_headers_1() -> None:
    """Test markdown splitter by header: Preserve Headers."""

    markdown_document = (
        "# Foo\n\n"
        " ## Bat\n\n"
        "Hi this is Jim\n\n"
        "Hi Joe\n\n"
        "## Baz\n\n"
        "# Bar\n\n"
        "This is Alice\n\n"
        "This is Bob"
    )
    headers_to_split_on = [
        ("#", "Header 1"),
    ]
    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
        strip_headers=False,
    )
    output = markdown_splitter.split_text(markdown_document)
    expected_output = [
        Document(
            page_content="# Foo  \n## Bat  \nHi this is Jim  \nHi Joe  \n## Baz",
            metadata={"Header 1": "Foo"},
        ),
        Document(
            page_content="# Bar  \nThis is Alice  \nThis is Bob",
            metadata={"Header 1": "Bar"},
        ),
    ]
    assert output == expected_output


def test_md_header_text_splitter_preserve_headers_2() -> None:
    """Test markdown splitter by header: Preserve Headers."""

    markdown_document = (
        "# Foo\n\n"
        " ## Bar\n\n"
        "Hi this is Jim\n\n"
        "Hi this is Joe\n\n"
        "### Boo \n\n"
        "Hi this is Lance\n\n"
        "## Baz\n\n"
        "Hi this is Molly\n"
        " ## Buz\n"
        "# Bop"
    )
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]
    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
        strip_headers=False,
    )
    output = markdown_splitter.split_text(markdown_document)
    expected_output = [
        Document(
            page_content="# Foo  \n## Bar  \nHi this is Jim  \nHi this is Joe",
            metadata={"Header 1": "Foo", "Header 2": "Bar"},
        ),
        Document(
            page_content="### Boo  \nHi this is Lance",
            metadata={"Header 1": "Foo", "Header 2": "Bar", "Header 3": "Boo"},
        ),
        Document(
            page_content="## Baz  \nHi this is Molly",
            metadata={"Header 1": "Foo", "Header 2": "Baz"},
        ),
        Document(
            page_content="## Buz",
            metadata={"Header 1": "Foo", "Header 2": "Buz"},
        ),
        Document(page_content="# Bop", metadata={"Header 1": "Bop"}),
    ]
    assert output == expected_output


@pytest.mark.parametrize("fence", [("```"), ("~~~")])
def test_md_header_text_splitter_fenced_code_block(fence: str) -> None:
    """Test markdown splitter by header: Fenced code block."""

    markdown_document = (
        "# This is a Header\n\n"
        f"{fence}\n"
        "foo()\n"
        "# Not a header\n"
        "bar()\n"
        f"{fence}"
    )

    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
    ]

    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
    )
    output = markdown_splitter.split_text(markdown_document)

    expected_output = [
        Document(
            page_content=f"{fence}\nfoo()\n# Not a header\nbar()\n{fence}",
            metadata={"Header 1": "This is a Header"},
        ),
    ]

    assert output == expected_output


@pytest.mark.parametrize(["fence", "other_fence"], [("```", "~~~"), ("~~~", "```")])
def test_md_header_text_splitter_fenced_code_block_interleaved(
    fence: str, other_fence: str
) -> None:
    """Test markdown splitter by header: Interleaved fenced code block."""

    markdown_document = (
        "# This is a Header\n\n"
        f"{fence}\n"
        "foo\n"
        "# Not a header\n"
        f"{other_fence}\n"
        "# Not a header\n"
        f"{fence}"
    )

    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
    ]

    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
    )
    output = markdown_splitter.split_text(markdown_document)

    expected_output = [
        Document(
            page_content=(
                f"{fence}\nfoo\n# Not a header\n{other_fence}\n# Not a header\n{fence}"
            ),
            metadata={"Header 1": "This is a Header"},
        ),
    ]

    assert output == expected_output


def test_solidity_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.SOL, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """pragma solidity ^0.8.20;
  contract HelloWorld {
    function add(uint a, uint b) pure public returns(uint) {
      return a + b;
    }
  }
"""
    chunks = splitter.split_text(code)
    assert chunks == [
        "pragma solidity",
        "^0.8.20;",
        "contract",
        "HelloWorld {",
        "function",
        "add(uint a,",
        "uint b) pure",
        "public",
        "returns(uint) {",
        "return a",
        "+ b;",
        "}\n  }",
    ]


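All of the `from_language` tests above exercise the same recursive strategy with a language-specific separator list. A simplified stand-in (hypothetical `recursive_split`; it ignores chunk overlap, whitespace stripping, and separator retention, so it will not reproduce the exact chunks asserted above) looks like this: try the coarsest separator that occurs in the text, greedily merge pieces up to the chunk size, and recurse with finer separators on any piece that is still too large.

```python
def recursive_split(text: str, separators: list, chunk_size: int) -> list:
    # pick the first separator that actually appears; "" means split per character
    sep = next((s for s in separators if s and s in text), separators[-1])
    pieces = text.split(sep) if sep else list(text)
    chunks, buf = [], ""
    for piece in pieces:
        if len(piece) > chunk_size:
            # piece is still too big: flush the buffer, recurse with finer separators
            if buf:
                chunks.append(buf)
                buf = ""
            rest = separators[separators.index(sep) + 1:] or [""]
            chunks.extend(recursive_split(piece, rest, chunk_size))
        elif len(buf) + len(piece) + len(sep) <= chunk_size:
            buf = (buf + sep + piece) if buf else piece
        else:
            chunks.append(buf)
            buf = piece
    if buf:
        chunks.append(buf)
    return chunks


pieces = recursive_split("aaa bbb ccc ddd", ["\n\n", "\n", " "], chunk_size=7)
```

Here `pieces` comes out as `["aaa bbb", "ccc ddd"]`: the space separator is the first one present, and words merge greedily under the 7-character budget.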
def test_lua_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.LUA, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """
local variable = 10

function add(a, b)
    return a + b
end

if variable > 5 then
    for i=1, variable do
        while i < variable do
            repeat
                print(i)
                i = i + 1
            until i >= variable
        end
    end
end
    """
    chunks = splitter.split_text(code)
    assert chunks == [
        "local variable",
        "= 10",
        "function add(a,",
        "b)",
        "return a +",
        "b",
        "end",
        "if variable > 5",
        "then",
        "for i=1,",
        "variable do",
        "while i",
        "< variable do",
        "repeat",
        "print(i)",
        "i = i + 1",
        "until i >=",
        "variable",
        "end",
        "end\nend",
    ]


def test_haskell_code_splitter() -> None:
    splitter = RecursiveCharacterTextSplitter.from_language(
        Language.HASKELL, chunk_size=CHUNK_SIZE, chunk_overlap=0
    )
    code = """
        main :: IO ()
        main = do
            putStrLn "Hello, World!"

        -- Some sample functions
        add :: Int -> Int -> Int
        add x y = x + y
        """
    # Adjusted expected chunks to account for indentation and newlines
    expected_chunks = [
        "main ::",
        "IO ()",
        "main = do",
        "putStrLn",
        '"Hello, World!"',
        "--",
        "Some sample",
        "functions",
        "add :: Int ->",
        "Int -> Int",
        "add x y = x",
        "+ y",
    ]
    chunks = splitter.split_text(code)
    assert chunks == expected_chunks


@pytest.mark.requires("lxml")
def test_html_header_text_splitter(tmp_path: Path) -> None:
    splitter = HTMLHeaderTextSplitter(
        headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")]
    )

    content = """
<h1>Sample Document</h1>
<h2>Section</h2>
<p id="1234">Reference content.</p>

<h2>Lists</h2>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>

<h3>A block</h3>
<div class="amazing">
<p>Some text</p>
<p>Some more text</p>
</div>
    """

    docs = splitter.split_text(content)
    expected = [
        Document(
            page_content="Reference content.",
            metadata={"Header 1": "Sample Document", "Header 2": "Section"},
        ),
        Document(
            page_content="Item 1 Item 2 Item 3  \nSome text  \nSome more text",
            metadata={"Header 1": "Sample Document", "Header 2": "Lists"},
        ),
    ]
    assert docs == expected

    with open(tmp_path / "doc.html", "w") as tmp:
        tmp.write(content)
    docs_from_file = splitter.split_text_from_file(tmp_path / "doc.html")

    assert docs_from_file == expected


def test_split_text_on_tokens() -> None:
    """Test splitting by tokens per chunk."""
    text = "foo bar baz 123"

    tokenizer = Tokenizer(
        chunk_overlap=3,
        tokens_per_chunk=7,
        decode=(lambda it: "".join(chr(i) for i in it)),
        encode=(lambda it: [ord(c) for c in it]),
    )
    output = split_text_on_tokens(text=text, tokenizer=tokenizer)
    expected_output = ["foo bar", "bar baz", "baz 123"]
    assert output == expected_output


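The sliding-window arithmetic that produces `["foo bar", "bar baz", "baz 123"]` above can be sketched standalone (hypothetical `window_split`; the real `split_text_on_tokens` works on a `Tokenizer` object, but the stepping is the same): take `tokens_per_chunk` tokens, then advance by `tokens_per_chunk - chunk_overlap` so consecutive chunks share `chunk_overlap` tokens.

```python
def window_split(text, encode, decode, tokens_per_chunk, chunk_overlap):
    """Split encoded tokens into overlapping fixed-size windows."""
    ids = encode(text)
    splits, start = [], 0
    while start < len(ids):
        end = min(start + tokens_per_chunk, len(ids))
        splits.append(decode(ids[start:end]))
        if end == len(ids):
            break
        # step forward so the next window re-reads `chunk_overlap` tokens
        start += tokens_per_chunk - chunk_overlap
    return splits


out = window_split(
    "foo bar baz 123",
    encode=lambda t: [ord(c) for c in t],
    decode=lambda ids: "".join(chr(i) for i in ids),
    tokens_per_chunk=7,
    chunk_overlap=3,
)
```

With the per-character "tokenizer", the windows are characters 0-6, 4-10, and 8-14, reproducing the three overlapping chunks the test expects.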
@pytest.mark.requires("lxml")
@pytest.mark.requires("bs4")
def test_section_aware_happy_path_splitting_based_on_header_1_2() -> None:
    # arrange
    html_string = """<!DOCTYPE html>
    <html>
    <body>
        <div>
            <h1>Foo</h1>
            <p>Some intro text about Foo.</p>
            <div>
                <h2>Bar main section</h2>
                <p>Some intro text about Bar.</p>
                <h3>Bar subsection 1</h3>
                <p>Some text about the first subtopic of Bar.</p>
                <h3>Bar subsection 2</h3>
                <p>Some text about the second subtopic of Bar.</p>
            </div>
            <div>
                <h2>Baz</h2>
                <p>Some text about Baz</p>
            </div>
            <br>
            <p>Some concluding text about Foo</p>
        </div>
    </body>
    </html>"""

    sec_splitter = HTMLSectionSplitter(
        headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")]
    )

    docs = sec_splitter.split_text(html_string)

    assert len(docs) == 3
    assert docs[0].metadata["Header 1"] == "Foo"
    assert docs[0].page_content == "Foo \n Some intro text about Foo."

    assert docs[1].page_content == (
        "Bar main section \n Some intro text about Bar. \n "
        "Bar subsection 1 \n Some text about the first subtopic of Bar. \n "
        "Bar subsection 2 \n Some text about the second subtopic of Bar."
    )
    assert docs[1].metadata["Header 2"] == "Bar main section"

    assert (
        docs[2].page_content
        == "Baz \n Some text about Baz \n \n \n Some concluding text about Foo"
    )
    assert docs[2].metadata["Header 2"] == "Baz"


@pytest.mark.requires("lxml")
@pytest.mark.requires("bs4")
def test_happy_path_splitting_based_on_header_with_font_size() -> None:
    # arrange
    html_string = """<!DOCTYPE html>
    <html>
    <body>
        <div>
            <span style="font-size: 22px">Foo</span>
            <p>Some intro text about Foo.</p>
            <div>
                <h2>Bar main section</h2>
                <p>Some intro text about Bar.</p>
                <h3>Bar subsection 1</h3>
                <p>Some text about the first subtopic of Bar.</p>
                <h3>Bar subsection 2</h3>
                <p>Some text about the second subtopic of Bar.</p>
            </div>
            <div>
                <h2>Baz</h2>
                <p>Some text about Baz</p>
            </div>
            <br>
            <p>Some concluding text about Foo</p>
        </div>
    </body>
    </html>"""

    sec_splitter = HTMLSectionSplitter(
        headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")]
    )

    docs = sec_splitter.split_text(html_string)

    assert len(docs) == 3
    assert docs[0].page_content == "Foo \n Some intro text about Foo."
    assert docs[0].metadata["Header 1"] == "Foo"

    assert docs[1].page_content == (
        "Bar main section \n Some intro text about Bar. \n "
        "Bar subsection 1 \n Some text about the first subtopic of Bar. \n "
        "Bar subsection 2 \n Some text about the second subtopic of Bar."
    )
    assert docs[1].metadata["Header 2"] == "Bar main section"

    assert docs[2].page_content == (
        "Baz \n Some text about Baz \n \n \n Some concluding text about Foo"
    )
    assert docs[2].metadata["Header 2"] == "Baz"


@pytest.mark.requires("lxml")
@pytest.mark.requires("bs4")
def test_happy_path_splitting_based_on_header_with_whitespace_chars() -> None:
    # arrange
    html_string = """<!DOCTYPE html>
    <html>
    <body>
        <div>
            <span style="font-size: 22px">\nFoo </span>
            <p>Some intro text about Foo.</p>
            <div>
                <h2>Bar main section</h2>
                <p>Some intro text about Bar.</p>
                <h3>Bar subsection 1</h3>
                <p>Some text about the first subtopic of Bar.</p>
                <h3>Bar subsection 2</h3>
                <p>Some text about the second subtopic of Bar.</p>
            </div>
            <div>
                <h2>Baz</h2>
                <p>Some text about Baz</p>
            </div>
            <br>
            <p>Some concluding text about Foo</p>
        </div>
    </body>
    </html>"""

    sec_splitter = HTMLSectionSplitter(
        headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")]
    )

    docs = sec_splitter.split_text(html_string)

    assert len(docs) == 3
    assert docs[0].page_content == "Foo \n Some intro text about Foo."
    assert docs[0].metadata["Header 1"] == "Foo"

    assert docs[1].page_content == (
        "Bar main section \n Some intro text about Bar. \n "
        "Bar subsection 1 \n Some text about the first subtopic of Bar. \n "
        "Bar subsection 2 \n Some text about the second subtopic of Bar."
    )
    assert docs[1].metadata["Header 2"] == "Bar main section"

    assert docs[2].page_content == (
        "Baz \n Some text about Baz \n \n \n Some concluding text about Foo"
    )
    assert docs[2].metadata["Header 2"] == "Baz"


def test_split_json() -> None:
    """Test json text splitter"""
    max_chunk = 800
    splitter = RecursiveJsonSplitter(max_chunk_size=max_chunk)

    def random_val() -> str:
        return "".join(random.choices(string.ascii_letters, k=random.randint(4, 12)))

    test_data: Any = {
        "val0": random_val(),
        "val1": {f"val1{i}": random_val() for i in range(100)},
    }
    test_data["val1"]["val16"] = {f"val16{i}": random_val() for i in range(100)}

    # uses create_docs and split_text
    docs = splitter.create_documents(texts=[test_data])

    output = [len(doc.page_content) < max_chunk * 1.05 for doc in docs]
    expected_output = [True for doc in docs]
    assert output == expected_output


def test_split_json_with_lists() -> None:
    """Test json text splitter with list conversion"""
    max_chunk = 800
    splitter = RecursiveJsonSplitter(max_chunk_size=max_chunk)

    def random_val() -> str:
        return "".join(random.choices(string.ascii_letters, k=random.randint(4, 12)))

    test_data: Any = {
        "val0": random_val(),
        "val1": {f"val1{i}": random_val() for i in range(100)},
    }
    test_data["val1"]["val16"] = {f"val16{i}": random_val() for i in range(100)}

    test_data_list: Any = {"testPreprocessing": [test_data]}

    # test text splitter
    texts = splitter.split_text(json_data=test_data)
    texts_list = splitter.split_text(json_data=test_data_list, convert_lists=True)

    assert len(texts_list) >= len(texts)
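The property the two JSON tests check (every chunk stays under the size budget while keeping its ancestor keys) can be illustrated with a minimal sketch. This is not `RecursiveJsonSplitter`; `split_json` and `nest` are hypothetical helpers, sizes are measured with `json.dumps`, and oversized scalar leaves simply pass through. The key move is that when a nested dict is recursed into, each emitted piece is re-wrapped under its path so no chunk loses its parent keys.

```python
import json


def nest(prefix: tuple, d: dict) -> dict:
    """Re-wrap a dict under its ancestor key path, innermost last."""
    for key in reversed(prefix):
        d = {key: d}
    return d


def split_json(data: dict, max_chunk: int, prefix: tuple = ()) -> list:
    """Recursively break a nested dict into sub-dicts under max_chunk chars."""
    chunks, current = [], {}
    for key, value in data.items():
        if isinstance(value, dict) and len(json.dumps({key: value})) > max_chunk:
            # too big as one unit: recurse, extending the key path
            chunks.extend(split_json(value, max_chunk, prefix + (key,)))
        else:
            candidate = {**current, key: value}
            if current and len(json.dumps(candidate)) > max_chunk:
                chunks.append(nest(prefix, current))
                current = {key: value}
            else:
                current = candidate
    if current:
        chunks.append(nest(prefix, current))
    return chunks


data = {"a": "x" * 10, "b": {"c": "y" * 500, "d": "z" * 500}}
chunks = split_json(data, max_chunk=600)
```

Here `"b"` is too large to keep whole, so it is split into `{"b": {"c": ...}}` and `{"b": {"d": ...}}`: both fragments retain the `"b"` parent key, which is exactly the structure-preserving behavior the tests above assert on size.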