https://github.com/hwchase17/langchain/issues/1100
When FAISS data and doc.index were created with past versions, an error
occurs saying an attribute does not exist, so I added a hasattr check
as a simple fix.
However, accumulating such checks is not good for maintainability, so I
think there is a better solution.
Also, the batch-processing code had been left out, so I put it back in.
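A minimal sketch of the kind of guard added (the attribute name here is
illustrative, not necessarily the one the PR checks):

```python
def check_index_compat(faiss_store) -> None:
    # Hypothetical backward-compatibility guard: indexes saved by older
    # versions may lack attributes that newer code expects.
    if not hasattr(faiss_store, "index_to_docstore_id"):
        raise ValueError(
            "This index was saved with an older version; "
            "please re-create and re-save it."
        )
```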
Currently, the 'truncate' parameter of the cohere API is not supported.
This means that by default, if trying to generate an embedding that is
too big, the call will just fail with an error (which is frustrating
when using this embedding source, e.g. with GPT-Index, because it's
hard to handle properly when generating a lot of embeddings).
With the parameter, one can decide to either truncate the START or END
of the text to fit the max token length and still generate an embedding
without throwing the error.
In this PR, I added this parameter to the class.
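A minimal sketch of the parameter in use (values follow the Cohere
embed API: "NONE", "START", or "END"):

```python
from langchain.embeddings import CohereEmbeddings

# Truncate over-long inputs from the end instead of failing the call.
embeddings = CohereEmbeddings(
    cohere_api_key="...",
    truncate="END",  # or "START"
)
vector = embeddings.embed_query("a very long document " * 1000)
```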
_Arguably, there should be a better way to handle this error, e.g. by
optionally calling a function that gets triggered when the token limit
is reached and can split the document or some such. Especially in the
GPT-Index use case, it's often hard to estimate the token counts for
each document, and I'd rather sort out the troublemakers or simply
split them than interrupt the whole execution.
Thoughts?_
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Currently, the CohereEmbeddings class does not support the parameter
'model_name', only 'model'. The class documentation is inconsistent
with this, though, so I propose to either fix the documentation (what
this PR does) or fix the parameter.
Passing 'model_name' produces the following error:
```
ValidationError: 1 validation error for CohereEmbeddings
model_name
extra fields not permitted (type=value_error.extra)
```
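For illustration, the failing call versus the accepted one (model name
per the Cohere docs):

```python
from langchain.embeddings import CohereEmbeddings

# Raises the ValidationError above; extra fields are not permitted:
# embeddings = CohereEmbeddings(model_name="large", cohere_api_key="...")

# Works: the accepted parameter is `model`.
embeddings = CohereEmbeddings(model="large", cohere_api_key="...")
```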
`SentenceTransformer` returns a NumPy array, not a `List[List[float]]`
or `List[float]` as specified in the interface of `Embeddings`. This PR
makes it consistent with the interface.
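A minimal sketch of the mismatch and the conversion (not the PR's exact
code; the model name is illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
array = model.encode(["This is a test document."])  # NumPy ndarray
as_lists = array.tolist()  # List[List[float]], per the Embeddings interface
```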
Now that OpenAI has deprecated all embeddings models except
text-embedding-ada-002, we should stop specifying a legacy embedding
model in the example. This will also avoid confusion from people (like
me) trying to specify model="text-embedding-ada-002" and having that
erroneously expanded to text-search-text-embedding-ada-002-query-001.
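A sketch of the updated example (assuming the class accepts a `model`
keyword, as current versions do):

```python
from langchain.embeddings import OpenAIEmbeddings

# Use the one non-deprecated model directly, with no expansion into
# the legacy text-search-*-query-001 names.
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
query_result = embeddings.embed_query("This is a test document.")
```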
Running the Cohere embeddings example from the docs:
```python
from langchain.embeddings import CohereEmbeddings
embeddings = CohereEmbeddings(cohere_api_key=cohere_api_key)
text = "This is a test document."
query_result = embeddings.embed_query(text)
doc_result = embeddings.embed_documents([text])
```
I get the error:
```bash
CohereError(message=res['message'], http_status=response.status_code, headers=response.headers)
cohere.error.CohereError: embed is not an available endpoint on this model
```
This is because the `model` string is set to `medium`, which is not
currently available.
From the Cohere docs:
> Currently available models are small and large (default)
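A working variant of the example, with an available model set
explicitly:

```python
from langchain.embeddings import CohereEmbeddings

# "small" and "large" are the currently available models per the docs.
embeddings = CohereEmbeddings(model="large", cohere_api_key=cohere_api_key)
query_result = embeddings.embed_query("This is a test document.")
```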
Add support for calling HuggingFace embedding models
using the HuggingFaceHub Inference API. New class mirrors
the existing HuggingFaceHub LLM implementation. Currently
only supports 'sentence-transformers' models.
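A minimal sketch of the new class in use (the repo_id is illustrative;
any 'sentence-transformers' model should work):

```python
from langchain.embeddings import HuggingFaceHubEmbeddings

embeddings = HuggingFaceHubEmbeddings(
    repo_id="sentence-transformers/all-mpnet-base-v2",
    huggingfacehub_api_token="...",
)
query_result = embeddings.embed_query("This is a test document.")
```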
Closes #86
Addresses the issue in #76 by using the relevant environment variable
if set, or a string passed in the constructor. The constructor string
is preferred over the environment variable, which seemed like the
natural choice to me.
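A minimal sketch of that precedence (the helper name and environment
variable are illustrative):

```python
import os
from typing import Optional

def resolve_api_key(constructor_key: Optional[str] = None) -> str:
    # An explicitly passed key wins; otherwise fall back to the
    # environment variable.
    key = constructor_key or os.environ.get("OPENAI_API_KEY")
    if key is None:
        raise ValueError("No API key passed and OPENAI_API_KEY is not set.")
    return key
```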