mirror of https://github.com/hwchase17/langchain
allow tokentextsplitters to use model name to select encoder (#2963)
Fixes a bug I was seeing where the `TokenTextSplitter` correctly split text under the gpt-3.5-turbo token limit, but sending the prompt to OpenAI still came back with an error saying we were over the context limit. gpt-3.5-turbo and gpt-4 use the `cl100k_base` tokenizer, so the token counts are always off with the default `gpt2` encoder. It's possible to pass the encoding to the `TokenTextSplitter`, but it's much simpler to pass the model name of the LLM. No more concern about keeping the tokenizer and LLM model in sync :)
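To make the idea concrete, here is a minimal sketch of selecting the encoder from a model name, with the encoding name as a fallback. It is not the code from this commit: `MODEL_TO_ENCODING`, `resolve_encoding_name`, and `split_by_tokens` are hypothetical names, the lookup table contains only the two models named above, and a toy whitespace tokenizer stands in for a real one so the example runs without extra dependencies. In practice the lookup would presumably be delegated to the tokenizer library (tiktoken provides `encoding_for_model` for this purpose).

```python
from typing import Callable, Dict, List, Optional

# Hypothetical lookup table, for illustration only. It lists just the two
# models named in the commit message.
MODEL_TO_ENCODING: Dict[str, str] = {
    "gpt-3.5-turbo": "cl100k_base",
    "gpt-4": "cl100k_base",
}


def resolve_encoding_name(
    model_name: Optional[str] = None, encoding_name: str = "gpt2"
) -> str:
    """Prefer the encoding implied by model_name; otherwise use encoding_name."""
    if model_name is not None:
        if model_name not in MODEL_TO_ENCODING:
            raise ValueError(f"unknown model: {model_name!r}")
        return MODEL_TO_ENCODING[model_name]
    return encoding_name


def split_by_tokens(
    text: str,
    encode: Callable[[str], List[str]],
    decode: Callable[[List[str]], str],
    chunk_size: int,
    chunk_overlap: int = 0,
) -> List[str]:
    """Split text into chunks of at most chunk_size tokens."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    tokens = encode(text)
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Toy tokenizer (an assumption for the demo): one token per whitespace-separated word.
toy_encode = str.split
toy_decode = " ".join

selected = resolve_encoding_name(model_name="gpt-4")
fallback = resolve_encoding_name()
chunks = split_by_tokens("a b c d e", toy_encode, toy_decode, chunk_size=2)
```

The point of the design is that the caller names the model once and the matching encoder follows from it, so the splitter's token counts and the model's context limit are measured with the same tokenizer.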
parent 706ebd8f9c
commit 51894ddd98