langchain/docs
joelsprunger 3984f6604f
langchain: adds recursive json splitter (#17144)
- **Description:** This adds a recursive JSON splitter class to the
existing text_splitters, along with unit tests.
- **Issue:** Splitting text from structured data can cause problems. If you
split a large nested JSON object as regular text, you may end up losing
the structure of the JSON. To mitigate this, you can split the nested JSON
into large chunks and overlap them, but this causes unnecessary text
processing, and there will still be times when the nested JSON is so big
that the chunks get separated from their parent keys.

For example, you wouldn't want the following to be split in half:
```shell
{'val0': 'DFWeNdWhapbR',
 'val1': {'val10': 'QdJo',
          'val11': 'FWSDVFHClW',
          'val12': 'bkVnXMMlTiQh',
          'val13': 'tdDMKRrOY',
          'val14': 'zybPALvL',
          'val15': 'JMzGMNH',
          'val16': {'val160': 'qLuLKusFw',
                    'val161': 'DGuotLh',
                    'val162': 'KztlcSBropT',
-----------------------------------------------------------------------split-----
                    'val163': 'YlHHDrN',
                    'val164': 'CtzsxlGBZKf',
                    'val165': 'bXzhcrWLmBFp',
                    'val166': 'zZAqC',
                    'val167': 'ZtyWno',
                    'val168': 'nQQZRsLnaBhb',
                    'val169': 'gSpMbJwA'},
          'val17': 'JhgiyF',
          'val18': 'aJaqjUSFFrI',
          'val19': 'glqNSvoyxdg'}}
```
Any LLM processing the second chunk of text may not have the context of
val1 and val16, which reduces accuracy. Embeddings will also lack this
context, which makes retrieval less accurate.
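
The loss of context can be reproduced with a few lines of Python. This is only a sketch: the `naive_split` and `is_valid_json` helpers and the 60-character chunk size are assumptions made for illustration, not part of this PR.

```python
import json
from typing import List

# Hypothetical data: a small slice of the nested example above.
data = {"val1": {"val16": {"val160": "qLuLKusFw", "val161": "DGuotLh",
                           "val162": "KztlcSBropT", "val163": "YlHHDrN"}}}
text = json.dumps(data)


def naive_split(text: str, chunk_size: int) -> List[str]:
    """Cut the text into fixed-width pieces, ignoring the JSON structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def is_valid_json(piece: str) -> bool:
    """Return True if the piece parses as JSON on its own."""
    try:
        json.loads(piece)
        return True
    except json.JSONDecodeError:
        return False


chunks = naive_split(text, 60)
for piece in chunks:
    # Later pieces no longer mention the parent keys "val1" and "val16".
    print(is_valid_json(piece), repr(piece))
```

No information is lost overall, since joining the pieces gives back the original text, but each piece on its own is neither valid JSON nor tied to its parent keys.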

Instead, you want it to be split into chunks that retain the JSON
structure:
```shell
{'val0': 'DFWeNdWhapbR',
 'val1': {'val10': 'QdJo',
          'val11': 'FWSDVFHClW',
          'val12': 'bkVnXMMlTiQh',
          'val13': 'tdDMKRrOY',
          'val14': 'zybPALvL',
          'val15': 'JMzGMNH',
          'val16': {'val160': 'qLuLKusFw',
                    'val161': 'DGuotLh',
                    'val162': 'KztlcSBropT',
                    'val163': 'YlHHDrN',
                    'val164': 'CtzsxlGBZKf'}}}
```
and
```shell
{'val1':{'val16':{
                    'val165': 'bXzhcrWLmBFp',
                    'val166': 'zZAqC',
                    'val167': 'ZtyWno',
                    'val168': 'nQQZRsLnaBhb',
                    'val169': 'gSpMbJwA'},
          'val17': 'JhgiyF',
          'val18': 'aJaqjUSFFrI',
          'val19': 'glqNSvoyxdg'}}
```
This recursive JSON text splitter does this. Values that contain a list
can be converted to a dict first by using split(... convert_lists=True);
otherwise long lists will not be split, and you may end up with chunks
larger than the max chunk size.
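
The idea can be sketched in plain Python. This is a simplified greedy version written for illustration, not the code added in this PR: the names `convert_lists_to_dicts`, `set_nested`, `json_size`, and `split_json`, the `max_chunk_size` parameter, and the `example` data are all assumptions. Each leaf value is written into the current chunk under its full path of parent keys, and a new chunk is started when the estimated size would exceed the limit.

```python
import json
from typing import Any, Dict, List, Optional


def convert_lists_to_dicts(data: Any) -> Any:
    """Turn every list into a dict keyed by index so it can be split like a dict."""
    if isinstance(data, dict):
        return {k: convert_lists_to_dicts(v) for k, v in data.items()}
    if isinstance(data, list):
        return {str(i): convert_lists_to_dicts(v) for i, v in enumerate(data)}
    return data


def set_nested(target: Dict[str, Any], path: List[str], value: Any) -> None:
    """Store value under the given key path, creating parent dicts as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def json_size(data: Any) -> int:
    """Size of the data when serialized as JSON text."""
    return len(json.dumps(data))


def split_json(
    data: Dict[str, Any],
    max_chunk_size: int,
    path: Optional[List[str]] = None,
    chunks: Optional[List[Dict[str, Any]]] = None,
) -> List[Dict[str, Any]]:
    """Greedily pack leaf values into chunks, keeping each leaf's parent keys."""
    path = path or []
    if chunks is None:
        chunks = [{}]
    for key, value in data.items():
        new_path = path + [key]
        if isinstance(value, dict) and value:
            # Recurse into non-empty dicts; only leaves are placed in chunks.
            split_json(value, max_chunk_size, new_path, chunks)
            continue
        leaf: Dict[str, Any] = {}
        set_nested(leaf, new_path, value)
        # Conservative estimate: the merged chunk is never larger than this sum.
        if chunks[-1] and json_size(chunks[-1]) + json_size(leaf) > max_chunk_size:
            chunks.append({})
        set_nested(chunks[-1], new_path, value)
    return chunks


# Hypothetical input, loosely based on the example above, plus one list value.
example = {
    "val0": "DFWeNdWhapbR",
    "val1": {
        "val10": "QdJo",
        "val16": {"val160": "qLuLKusFw", "val161": "DGuotLh", "tags": ["a", "b"]},
    },
}
chunks = split_json(convert_lists_to_dicts(example), max_chunk_size=60)
for chunk in chunks:
    print(json.dumps(chunk))
```

In this sketch a single leaf that is larger than `max_chunk_size` together with its parent keys still gets a chunk of its own, so the limit is a target rather than a hard guarantee. Without the list conversion, a long list is treated as one leaf and cannot be split.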

In my testing, large JSON objects could be split into small chunks with:
  - Increased question answering accuracy
  - Retrieval queries that use fewer tokens, thanks to the ability to
    split into smaller chunks


- **Dependencies:** json import added to text_splitter.py, and random
added to the unit test
- **Twitter handle:** @joelsprunger

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2024-02-08 13:45:34 -08:00
api_reference API References sorted Partner libs menu (#17130) 2024-02-06 16:49:23 -05:00
docs langchain: adds recursive json splitter (#17144) 2024-02-08 13:45:34 -08:00
scripts docs: fix llm/chat_model tables (#15716) 2024-01-08 11:40:35 -08:00
src docs[patch]: fix zoom (#14786) 2023-12-15 17:46:12 -08:00
static docs: Update with LCEL examples to Ollama & ChatOllama Integration notebook (#16194) 2024-01-22 22:05:59 -08:00
.local_build.sh docs: partner packages (#16960) 2024-02-02 15:12:21 -08:00
babel.config.js Restructure docs (#11620) 2023-10-10 12:55:19 -07:00
code-block-loader.js Restructure docs (#11620) 2023-10-10 12:55:19 -07:00
docusaurus.config.js docs: add youtube link (#17065) 2024-02-05 16:12:56 -08:00
package-lock.json docs[patch]: search experiment (#14254) 2023-12-04 16:58:26 -08:00
package.json docs[patch]: search experiment (#14254) 2023-12-04 16:58:26 -08:00
README.md docs: developer docs (#14776) 2023-12-17 12:55:49 -08:00
settings.ini Restructure docs (#11620) 2023-10-10 12:55:19 -07:00
sidebars.js docs Integraions/Components menu reordered (#17151) 2024-02-06 20:33:41 -08:00
vercel_build.sh docs: add LangGraph (#15682) 2024-01-08 08:38:14 -08:00
vercel_requirements.txt infra: docs build install community editable (#14739) 2023-12-14 16:13:09 -08:00
vercel.json docs: titles fix (#17206) 2024-02-07 22:09:34 -05:00

LangChain Documentation

For more information on contributing to our documentation, see the Documentation Contributing Guide.