mirror of https://github.com/hwchase17/langchain
Bug: incorrect start_index if the chunk is substring of another chunk
Sample code:

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.docstore.document import Document

    splitter = RecursiveCharacterTextSplitter(chunk_size=5, chunk_overlap=5, separators=[" ", ""], add_start_index=True)
    splitter.split_documents([Document(page_content="chunk chunk")])

Before this commit:

    [Document(page_content='chunk', metadata={'start_index': 0}),
     Document(page_content='chun', metadata={'start_index': 0}),
     Document(page_content='chunk', metadata={'start_index': 0})]

After this commit:

    [Document(page_content='chunk', metadata={'start_index': 0}),
     Document(page_content='chun', metadata={'start_index': 6}),
     Document(page_content='chunk', metadata={'start_index': 6})]

This resolves https://github.com/langchain-ai/langchain/issues/21475

pull/21477/head
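For illustration only, here is a minimal sketch of how a substring lookup can yield the all-zero "before" result. It assumes the start index is located with `str.find` from an offset of previous start + previous chunk length - overlap. The helper `locate_chunks` is hypothetical and is not the library's code or this commit's change. The chunk list is copied from the output above.

```python
def locate_chunks(text, chunks, chunk_overlap):
    """Hypothetical helper: locate each chunk with str.find from a sliding offset."""
    starts = []
    index = 0
    previous_chunk_len = 0
    for chunk in chunks:
        # Assumed search start: previous start + previous length - overlap.
        offset = index + previous_chunk_len - chunk_overlap
        # str.find returns the first match at or after the search start.
        index = text.find(chunk, max(0, offset))
        starts.append(index)
        previous_chunk_len = len(chunk)
    return starts


text = "chunk chunk"
chunks = ["chunk", "chun", "chunk"]  # chunks as listed in the output above
starts = locate_chunks(text, chunks, chunk_overlap=5)
print(starts)  # [0, 0, 0]
```

Because the overlap equals the chunk size, the search start never moves past 0. "chun" is a prefix of the first "chunk", so it matches at index 0 even though the intended occurrence starts at index 6.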
parent
f178c67ad0
commit
41c034a96f