Archives/langchain

mirror of https://github.com/hwchase17/langchain synced 2024-10-31 15:20:26 +00:00

joelsprunger 3984f6604f langchain: adds recursive json splitter (#17144)

- Description: This adds a recursive JSON splitter class to the existing text_splitters, along with unit tests.
- Issue: Splitting text from structured data can cause problems. If you split a large nested JSON object as regular text, you may lose the structure of the JSON. To mitigate this you can split the nested JSON into large, overlapping chunks, but this causes unnecessary text processing, and there will still be times when the nested JSON is so big that chunks get separated from their parent keys.

As an example, you wouldn't want the following to be split in half:

```shell
{'val0': 'DFWeNdWhapbR',
 'val1': {'val10': 'QdJo',
          'val11': 'FWSDVFHClW',
          'val12': 'bkVnXMMlTiQh',
          'val13': 'tdDMKRrOY',
          'val14': 'zybPALvL',
          'val15': 'JMzGMNH',
          'val16': {'val160': 'qLuLKusFw',
                    'val161': 'DGuotLh',
                    'val162': 'KztlcSBropT',
-----------------------------------------------------------------------split-----
                    'val163': 'YlHHDrN',
                    'val164': 'CtzsxlGBZKf',
                    'val165': 'bXzhcrWLmBFp',
                    'val166': 'zZAqC',
                    'val167': 'ZtyWno',
                    'val168': 'nQQZRsLnaBhb',
                    'val169': 'gSpMbJwA'},
          'val17': 'JhgiyF',
          'val18': 'aJaqjUSFFrI',
          'val19': 'glqNSvoyxdg'}}
```

Any LLM processing the second chunk of text may not have the context of val1 and val16, reducing accuracy. Embeddings will also lack this context, which makes retrieval less accurate.

Instead you want it to be split into chunks that retain the JSON structure:

```shell
{'val0': 'DFWeNdWhapbR',
 'val1': {'val10': 'QdJo',
          'val11': 'FWSDVFHClW',
          'val12': 'bkVnXMMlTiQh',
          'val13': 'tdDMKRrOY',
          'val14': 'zybPALvL',
          'val15': 'JMzGMNH',
          'val16': {'val160': 'qLuLKusFw',
                    'val161': 'DGuotLh',
                    'val162': 'KztlcSBropT',
                    'val163': 'YlHHDrN',
                    'val164': 'CtzsxlGBZKf'}}}
```

and

```shell
{'val1':{'val16':{
                    'val165': 'bXzhcrWLmBFp',
                    'val166': 'zZAqC',
                    'val167': 'ZtyWno',
                    'val168': 'nQQZRsLnaBhb',
                    'val169': 'gSpMbJwA'},
          'val17': 'JhgiyF',
          'val18': 'aJaqjUSFFrI',
          'val19': 'glqNSvoyxdg'}}
```

This recursive JSON text splitter does this. Values that contain a list can be converted to a dict first by using split(... convert_lists=True); otherwise long lists will not be split and you may end up with chunks larger than the max chunk size.

In my testing, large JSON objects could be split into small chunks with:

✅ Increased question answering accuracy
✅ The ability to split into smaller chunks meant retrieval queries can use fewer tokens

- Dependencies: json import added to text_splitter.py, and random added to the unit test
- Twitter handle: @joelsprunger

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>

2024-02-08 13:45:34 -08:00
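The idea described in the commit message can be illustrated with a minimal sketch. This is not the code added in #17144; the names `iter_leaves`, `wrap`, `insert` and `split_json_sketch` are hypothetical, and the greedy packing strategy is an assumption chosen for brevity. It walks the nested dict, and each time a chunk would exceed the size limit it starts a new chunk that repeats the parent keys of the next value.

```python
import json


def iter_leaves(node, path=()):
    """Yield (path, value) for every non-dict value; a list counts as one leaf."""
    if isinstance(node, dict) and node:
        for key, value in node.items():
            yield from iter_leaves(value, path + (key,))
    else:
        yield path, node


def wrap(path, value):
    """Rebuild the nesting around one leaf: ('a', 'b'), 1 -> {'a': {'b': 1}}."""
    for key in reversed(path):
        value = {key: value}
    return value


def insert(chunk, path, value):
    """Place value at path inside chunk, creating parent dicts as needed."""
    for key in path[:-1]:
        chunk = chunk.setdefault(key, {})
    chunk[path[-1]] = value


def split_json_sketch(data, max_chunk_size):
    """Greedily pack leaves into chunks that keep their parent keys (hypothetical helper)."""
    if not data:
        return []
    chunks, current, current_size = [], {}, 0
    for path, value in iter_leaves(data):
        # Size of the leaf serialized together with all of its parent keys.
        # Summing these is a conservative estimate, since a chunk shares parents.
        leaf_size = len(json.dumps(wrap(path, value)))
        if current and current_size + leaf_size > max_chunk_size:
            chunks.append(current)
            current, current_size = {}, 0
        insert(current, path, value)
        current_size += leaf_size
    if current:
        chunks.append(current)
    return chunks


data = {
    "val0": "aaaa",
    "val1": {
        "val10": "bbbb",
        "val16": {"val160": "cccc", "val161": "dddd", "val162": "eeee"},
        "val17": "ffff",
    },
}
for chunk in split_json_sketch(data, max_chunk_size=60):
    print(json.dumps(chunk))
```

In this sketch a list is treated as a single leaf, so a long list can still produce a chunk larger than the limit, which matches the caveat in the commit message about lists that are not converted to dicts first.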
api_reference	API References sorted `Partner libs` menu (#17130)	2024-02-06 16:49:23 -05:00
docs	langchain: adds recursive json splitter (#17144)	2024-02-08 13:45:34 -08:00
scripts	docs: fix llm/chat_model tables (#15716)	2024-01-08 11:40:35 -08:00
src	docs[patch]: fix zoom (#14786)	2023-12-15 17:46:12 -08:00
static	docs: Update with LCEL examples to Ollama & ChatOllama Integration notebook (#16194)	2024-01-22 22:05:59 -08:00
.local_build.sh	docs: partner packages (#16960)	2024-02-02 15:12:21 -08:00
babel.config.js	Restructure docs (#11620)	2023-10-10 12:55:19 -07:00
code-block-loader.js	Restructure docs (#11620)	2023-10-10 12:55:19 -07:00
docusaurus.config.js	docs: add youtube link (#17065)	2024-02-05 16:12:56 -08:00
package-lock.json	docs[patch]: search experiment (#14254)	2023-12-04 16:58:26 -08:00
package.json	docs[patch]: search experiment (#14254)	2023-12-04 16:58:26 -08:00
README.md	docs: developer docs (#14776)	2023-12-17 12:55:49 -08:00
settings.ini	Restructure docs (#11620)	2023-10-10 12:55:19 -07:00
sidebars.js	docs `Integraions/Components` menu reordered (#17151)	2024-02-06 20:33:41 -08:00
vercel_build.sh	docs: add LangGraph (#15682)	2024-01-08 08:38:14 -08:00
vercel_requirements.txt	infra: docs build install community editable (#14739)	2023-12-14 16:13:09 -08:00
vercel.json	docs: titles fix (#17206)	2024-02-07 22:09:34 -05:00

README.md

LangChain Documentation

For more information on contributing to our documentation, see the Documentation Contributing Guide.