This does not involve a separator, and will naively chunk input text at
the appropriate boundaries in token space.
This is helpful if we have strict token length limits that we need to
strictly follow the specified chunk size, and we can't use aggressive
separators like spaces to guarantee the absence of long strings.
CharacterTextSplitter will let these strings through without splitting
them, which could cause overflow errors downstream.
Splitting at arbitrary token boundaries is not ideal but is hopefully
mitigated by having a decent overlap quantity. Also this results in
chunks which has exact number of tokens desired, instead of sometimes
overcounting if we concatenate shorter strings.
Potentially also helps with #528.
- This uses the faiss built-in `write_index` and `load_index` to save
and load faiss indexes locally
- Also fixes#674
- The save/load functions also use the faiss library, so I refactored
the dependency into a function
https://github.com/hwchase17/langchain/issues/354
Add support for running your own HF pipeline locally. This would allow
you to get a lot more dynamic with what HF features and models you
support since you wouldn't be beholden to what is hosted in HF hub. You
could also do stuff with HF Optimum to quantize your models and stuff to
get pretty fast inference even running on a laptop.
Add support for calling HuggingFace embedding models
using the HuggingFaceHub Inference API. New class mirrors
the existing HuggingFaceHub LLM implementation. Currently
only supports 'sentence-transformers' models.
Closes#86
this will break atm but wanted to get thoughts on implementation.
1. should add() be on docstore interface?
2. should InMemoryDocstore change to take a list of documents as init?
(makes this slightly easier to implement in FAISS -- if we think it is
less clean then could expose a method to get the number of documents
currently in the dict, and perform the logic of creating the necessary
dictionary in the FAISS.add_texts method.
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
This fixes Issue #104
The tests for HF Embeddings is skipped because of the segfault issue
mentioned there. Perhaps, a new issue should be created for that?
lots of kwargs! generation docs here:
https://docs.nlpcloud.com/#generation
This somewhat breaks the paradigm introduced in LLM base class as the
stop sequence isn't a list, and should rightfully be introduced at the
time of initialization of the class, along with the other kwargs that
depend on its presence (e.g. remove_end_sequence, etc.) curious if you'd
want to refactor LLM base class to take out stop as a specific named
kwarg?
Add support for huggingface hub
I could not find a good way to enforce stop tokens over the huggingface
hub api - that needs to hopefully be cleaned up in the future