Commit Graph

252 Commits (5cfa72a130f675c8da5963a11d416f553f692e72)

Author SHA1 Message Date
Luke Harris b4de839ed8
Several confluence loader improvements (#3300)
This PR addresses several improvements:

- Previously it was not possible to load spaces of more than 100 pages.
The `limit` was being used both as an overall page limit *and* as a per
request pagination limit. This, in combination with the fact that
atlassian seem to use a server-side hard limit of 100 when page content
is expanded, meant it wasn't possible to download >100 pages. Now
`limit` is used *only* as a per-request pagination limit and `max_pages`
is introduced as the way to limit the total number of pages returned by
the paginator.
- Document metadata now includes `source` (the source url), making it
compatible with `RetrievalQAWithSourcesChain`.
 - It is now possible to include inline and footer comments.
- It is now possible to pass `verify_ssl=False` and other parameters to
the confluence object for use cases that require it.
1 year ago
Harrison Chase a6664be79c
Harrison/myscale (#3352)
Co-authored-by: Fangrui Liu <fangruil@moqi.ai>
Co-authored-by: 刘 方瑞 <fangrui.liu@outlook.com>
Co-authored-by: Fangrui.Liu <fangrui.liu@ubc.ca>
1 year ago
Filip Haltmayer 215dcc2d26
Refactor Milvus/Zilliz (#3047)
Refactoring milvus/zilliz to clean up and have a more consistent
experience.

Signed-off-by: Filip Haltmayer <filip.haltmayer@zilliz.com>
1 year ago
Richy Wang 88a8f59aa7
Add a full PostgresSQL syntax database 'AnalyticDB' as vector store. (#3135)
Hi there!
I'm excited to open this PR to add support for using a fully Postgres
syntax compatible database 'AnalyticDB' as a vector.
As AnalyticDB has been proved can be used with AutoGPT,
ChatGPT-Retrieve-Plugin, and LLama-Index, I think it is also good for
you.
AnalyticDB is a distributed Alibaba Cloud-Native vector database. It
works better when data comes to large scale. The PR includes:

- [x]  A new memory: AnalyticDBVector
- [x]  A suite of integration tests verifies the AnalyticDB integration

I have read your [contributing
guidelines](72b7d76d79/.github/CONTRIBUTING.md).
And I have passed the tests below
- [x]  make format
- [x]  make lint
- [x]  make coverage
- [x]  make test
1 year ago
Zander Chase 05a8aa5447
Fix linting on master (#3327) 1 year ago
Varun Srinivas d2f922f525
Change in method name for creating an issue on JIRA (#3307)
The awesome JIRA tool created by @zywilliamli calls the `create_issue()`
method to create issues, however, the actual method is `issue_create()`.

Details in the Documentation here:
https://atlassian-python-api.readthedocs.io/jira.html#manage-issues
1 year ago
Paul Garner aa9d5707e0
Add PythonLoader which auto-detects encoding of Python files (#3311)
This PR contributes a `PythonLoader`, which inherits from
`TextLoader` but detects and sets the encoding automatically.
1 year ago
Davis Chase 2fd24d31a4
Cleanup integration test dir (#3308) 1 year ago
Naveen Tatikonda bb6c459f7a
OpenSearch: Add Support for Lucene Filter (#3201)
### Description
Add Support for Lucene Filter. When you specify a Lucene filter for a
k-NN search, the Lucene algorithm decides whether to perform an exact
k-NN search with pre-filtering or an approximate search with modified
post-filtering. This filter is supported only for approximate search
with the indexes that are created using `lucene` engine.

OpenSearch Documentation -
https://opensearch.org/docs/latest/search-plugins/knn/filter-search-knn/#lucene-k-nn-filter-implementation

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
1 year ago
Davis Chase 46542dc774
Contextual compression retriever (#2915)
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
1 year ago
Gabriel Altay 34fb56b633
fix copy/pasta typos wikipedia->arxiv (#3222)
just updates a few module level docstrings from Wikipedia -> Arxiv
1 year ago
Harrison Chase d2520a5f1e
Harrison/ddg (#3206)
Co-authored-by: itai <itai.marks@gmail.com>
Co-authored-by: Itai Marks <itaim@users.noreply.github.com>
Co-authored-by: Tianyi Pan <60060750+tipani86@users.noreply.github.com>
Co-authored-by: Tianyi Pan <tianyi.pan@clobotics.com>
Co-authored-by: Adilzhan Ismailov <13088690+aismlv@users.noreply.github.com>
Co-authored-by: Justin Flick <Justinjayflick@gmail.com>
Co-authored-by: Justin Flick <jflick@homesite.com>
1 year ago
Harrison Chase 9181cd9b22
Harrison/playwright selector (#3185)
Co-authored-by: zhyuri <4649294+zhyuri@users.noreply.github.com>
1 year ago
Harrison Chase 68cd37175e
Harrison/arxiv tool (#3186)
Co-authored-by: leo-gan <leo.gan.57@gmail.com>
1 year ago
Naveen Tatikonda 3453b7457c
OpenSearch: Add Support for Boolean Filter with ANN search (#3038)
### Description
Add Support for Boolean Filter with ANN search
Documentation -
https://opensearch.org/docs/latest/search-plugins/knn/filter-search-knn/#boolean-filter-with-ann-search

### Issues Resolved
https://github.com/hwchase17/langchain/issues/2924

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
1 year ago
Harrison Chase afd3e70ae5
Harrison/confluent loader (#2994)
Co-authored-by: Justin Flick <Justinjayflick@gmail.com>
1 year ago
Jan Backes a9310a3e8b
Add Annoy as VectorStore (#2939)
Adds Annoy (https://github.com/spotify/annoy) as vector Store. 

RESOLVES hwchase17/langchain#2842

discord ref:
https://discord.com/channels/1038097195422978059/1051632794427723827/1096089994168377354

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Co-authored-by: vowelparrot <130414180+vowelparrot@users.noreply.github.com>
1 year ago
cs0lar 8b9e02da9d
Fix/issue 1213 (#2932)
### Background

Continuing to implement all the interface methods defined by the
`VectorStore` class. This PR pertains to implementation of the
`max_marginal_relevance_search` method.

### Changes

- a `max_marginal_relevance_search` method implementation has been added
in `weaviate.py`
- tests have been added to the the new method
- vcr cassettes have been added for the weaviate tests

### Test Plan

Added tests for the `max_marginal_relevance_search` implementation

### Change Safety

- [x] I have added tests to cover my changes
1 year ago
vowelparrot 4ffc58e07b
Add similarity_search_with_normalized_similarities (#2916)
Add a method that exposes a similarity search with corresponding
normalized similarity scores. Implement only for FAISS now.

### Motivation:

Some memory definitions combine `relevance` with other scores, like
recency , importance, etc.

While many (but not all) of the `VectorStore`'s expose a
`similarity_search_with_score` method, they don't all interpret the
units of that score (depends on the distance metric and whether or not
the the embeddings are normalized).

This PR proposes a `similarity_search_with_normalized_similarities`
method that lets consumers of the vector store not have to worry about
the metric and embedding scale.

*Most providers default to euclidean distance, with Pinecone being one
exception (defaults to cosine _similarity_).*

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
1 year ago
Davit Buniatyan b3a5b51728
[minor] Deep Lake auth improvements in docs, kwargs pass, faster tests (#2927)
Minor cosmetic changes 
- Activeloop environment cred authentication in notebooks with
`getpass.getpass` (instead of CLI which not always works)
- much faster tests with Deep Lake pytest mode on 
- Deep Lake kwargs pass

Notes
- I put pytest environment creds inside `vectorstores/conftest.py`, but
feel free to suggest a better location. For context, if I put in
`test_deeplake.py`, `ruff` doesn't let me to set them before import
deeplake

---------

Co-authored-by: Davit Buniatyan <d@activeloop.ai>
1 year ago
Ankush Gola ec59e9d886
Fix ChatAnthropic stop_sequences error (#2919) (#2920)
Note to self: Always run integration tests, even on "that last minute
change you thought would be safe" :)

---------

Co-authored-by: Mike Lambert <mike.lambert@anthropic.com>
1 year ago
Mike Lambert 392f1b3218
Add Anthropic ChatModel to langchain (#2293)
* Adds an Anthropic ChatModel
* Factors out common code in our LLMModel and ChatModel
* Supports streaming llm-tokens to the callbacks on a delta basis (until
a future V2 API does that for us)
* Some fixes
1 year ago
Harrison Chase 1e9378d0a8
Harrison/weaviate fixes (#2872)
Co-authored-by: cs0lar <cristiano.solarino@gmail.com>
Co-authored-by: cs0lar <cristiano.solarino@brightminded.com>
1 year ago
sergerdn 04c458a270
feat: improve pinecone tests (#2806)
Improve the integration tests for Pinecone by adding an `.env.example`
file for local testing. Additionally, add some dev dependencies
specifically for integration tests.

This change also helps me understand how Pinecone deals with certain
things, see related issues
https://github.com/hwchase17/langchain/issues/2484
https://github.com/hwchase17/langchain/issues/2816
1 year ago
vowelparrot bf0887c486
Add Slack Directory Loader (#2841)
Fixes linting issue from #2835 

Adds a loader for Slack Exports which can be a very valuable source of
knowledge to use for internal QA bots and other use cases.

```py
# Export data from your Slack Workspace first.
from langchain.document_loaders import SLackDirectoryLoader

SLACK_WORKSPACE_URL = "https://awesome.slack.com"

loader = ("Slack_Exports", SLACK_WORKSPACE_URL)
docs = loader.load()
```
1 year ago
Tim Asp be4fb24b32
OpenAI LLM: update `modelname_to_contextsize` with new models (#2843)
Token counts pulled from https://openai.com/pricing
1 year ago
Harrison Chase e49f1e628c
Harrison/gpt cache (#2744)
Co-authored-by: SimFG <bang.fu@zilliz.com>
1 year ago
Johnny Lee 0ab364404e
add continue to fix 'continue_on_failure' parameter for URL doc loader (#2735)
Currently, the function still fails if `continue_on_failure` is set to
True, because `elements` is not set.

---------

Co-authored-by: leecjohnny <johnny-lee1255@users.noreply.github.com>
1 year ago
sergerdn 4bdcedab54
fix: some imports for integration tests (#2612)
Add more missed imports for integration tests. Bump `pytest` to the
current latest version.
Fix `tests/integration_tests/vectorstores/test_elasticsearch.py` to
update its cassette(easy fix).

Related PR: https://github.com/hwchase17/langchain/pull/2560
1 year ago
Ankush Gola c1521ddbdb
Add workaround for not having async vector store methods (#2733)
This allows us to use the async API for the Retrieval chains, though it is not guaranteed to be thread safe.
1 year ago
vowelparrot 709f26b69e
Added bilibili loader (#2673) (#2724)
I've added a bilibili loader, bilibili is a very active video site in
China and I think we need this loader.

Example:
```python
from langchain.document_loaders.bilibili import BiliBiliLoader

loader = BiliBiliLoader(
       ["https://www.bilibili.com/video/BV1xt411o7Xu/",
       "https://www.bilibili.com/video/av330407025/"]
)
docs = loader.load()
```

Co-authored-by: 了空 <568250549@qq.com>
1 year ago
Naveen Tatikonda 4364d3316e
Add custom vector fields and text fields for OpenSearch (#2652)
**Description**
Add custom vector field name and text field name while indexing and
querying for OpenSearch

**Issues**
https://github.com/hwchase17/langchain/issues/2500

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
1 year ago
Chetanya Rastogi 50c511d75f
Add new loader to load pdf as html content (#2607)
Adds a new pdf loader using the existing dependency on PDFMiner. 

The new loader can be helpful for chunking texts semantically into
sections as the output html content can be parsed via `BeautifulSoup` to
get more structured and rich information about font size, page numbers,
pdf headers/footers, etc. which may not be available otherwise with
other pdf loaders
1 year ago
sergerdn 6dc86ad48f
feat: add pytest-vcr for recording HTTP interactions in integration tests (#2445)
Using `pytest-vcr` in integration tests has several benefits. Firstly,
it removes the need to mock external services, as VCR records and
replays HTTP interactions on the fly. Secondly, it simplifies the
integration test setup by eliminating the need to set up and tear down
external services in some cases. Finally, it allows for more reliable
and deterministic integration tests by ensuring that HTTP interactions
are always replayed with the same response.
Overall, `pytest-vcr` is a valuable tool for simplifying integration
test setup and improving their reliability

This commit adds the `pytest-vcr` package as a dependency for
integration tests in the `pyproject.toml` file. It also introduces two
new fixtures in `tests/integration_tests/conftest.py` files for managing
cassette directories and VCR configurations.

In addition, the
`tests/integration_tests/vectorstores/test_elasticsearch.py` file has
been updated to use the `@pytest.mark.vcr` decorator for recording and
replaying HTTP interactions.

Finally, this commit removes the `documents` fixture from the
`test_elasticsearch.py` file and replaces it with a new fixture defined
in `tests/integration_tests/vectorstores/conftest.py` that yields a list
of documents to use in any other tests.

This also includes my second attempt to fix issue :
https://github.com/hwchase17/langchain/issues/2386

Maybe related https://github.com/hwchase17/langchain/issues/2484
1 year ago
Alex Rad bd780a8223
Add support for rwkv (#2422)
This adds support for running RWKV with pytorch. 

https://github.com/hwchase17/langchain/issues/2398

This does not yet support  rwkv.cpp
1 year ago
Davit Buniatyan b4914888a7
Deep Lake upgrade to include attribute search, distance metrics, returning scores and MMR (#2455)
### Features include

- Metadata based embedding search
- Choice of distance metric function (`L2` for Euclidean, `L1` for
Nuclear, `max` L-infinity distance, `cos` for cosine similarity, 'dot'
for dot product. Defaults to `L2`
- Returning scores
- Max Marginal Relevance Search
- Deleting samples from the dataset

### Notes
- Added numerous tests, let me know if you would like to shorten them or
make smarter

---------

Co-authored-by: Davit Buniatyan <d@activeloop.ai>
1 year ago
leo-gan fd69cc7e42
Removed duplicate BaseModel dependencies (#2471)
Removed duplicate BaseModel dependencies in class inheritances.
Also, sorted imports by `isort`.
1 year ago
sergerdn b410dc76aa
fix: elasticsearch (#2402)
- Create a new docker-compose file to start an Elasticsearch instance
for integration tests.
- Add new tests to `test_elasticsearch.py` to verify Elasticsearch
functionality.
- Include an optional group `test_integration` in the `pyproject.toml`
file. This group should contain dependencies for integration tests and
can be installed using the command `poetry install --with
test_integration`. Any new dependencies should be added by running
`poetry add some_new_deps --group "test_integration" `

Note:
New tests running in live mode, which involve end-to-end testing of the
OpenAI API. In the future, adding `pytest-vcr` to record and replay all
API requests would be a nice feature for testing process.More info:
https://pytest-vcr.readthedocs.io/en/latest/

Fixes https://github.com/hwchase17/langchain/issues/2386
1 year ago
Harrison Chase 0a9f04bad9
Harrison/gpt4all (#2366)
Co-authored-by: William FH <13333726+hinthornw@users.noreply.github.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
1 year ago
Harrison Chase e90d007db3
Harrison/msg files (#2375)
Co-authored-by: Sahil Masand <masand.sahil@gmail.com>
Co-authored-by: Sahil Masand <masands@cbh.com.au>
1 year ago
Kacper Łukawski 585f60a5aa
Qdrant update to 1.1.1 & docs polishing (#2388)
This PR updates Qdrant to 1.1.1 and introduces local mode, so there is
no need to spin up the Qdrant server. By that occasion, the Qdrant
example notebooks also got updated, covering more cases and answering
some commonly asked questions. All the Qdrant's integration tests were
switched to local mode, so no Docker container is required to launch
them.
1 year ago
Harrison Chase d85f57ef9c
Harrison/llama (#2314)
Co-authored-by: RJ Adriaansen <adriaansen@eshcc.eur.nl>
1 year ago
akmhmgc 94b2f536f3
Modify output for wikipedia api wrapper (#2287)
## Description
Thanks for the quick maintenance for great repository!!
I modified wikipedia api wrapper

## Details
- Add output for missing search results
- Add tests
1 year ago
Sam Cordner-Matthews 1ddd6dbf0b
Add ability to pass kwargs to loader classes in `DirectoryLoader`, add ability to modify encoding and BeautifulSoup behaviour in `BSHTMLLoader` (#2275)
Solves #2247. Noted that the only test I added checks for the
BeautifulSoup behaviour change. Happy to add a test for
`DirectoryLoader` if deemed necessary.
1 year ago
Harrison Chase b35260ed47
Harrison/memory base (#2122)
@3coins + @zoltan-fedor.... heres the pr + some minor changes i made.
thoguhts? can try to get it into tmrws release

---------

Co-authored-by: Zoltan Fedor <zoltan.0.fedor@gmail.com>
Co-authored-by: Piyush Jain <piyushjain@duck.com>
1 year ago
Ankush Gola ccee1aedd2
add async support for anthropic (#2114)
should not be merged in before
https://github.com/anthropics/anthropic-sdk-python/pull/11 gets released
1 year ago
Harrison Chase 3e879b47c1
Harrison/gitbook (#2044)
Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>
1 year ago
Honkware aff33d52c5
Add OpenWeatherMap API Tool (#2083)
Added tool for OpenWeatherMap API
1 year ago
Charlie Holtz f16c1fb6df
Add replicate take 2 (#2077)
This PR adds a replicate integration to langchain. 

It's an updated version of
https://github.com/hwchase17/langchain/pull/1993, but with updates to
match latest replicate-python code.
https://github.com/replicate/replicate-python.

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Co-authored-by: Zeke Sikelianos <zeke@sikelianos.com>
1 year ago
Harrison Chase f281033362
rm pandas dependency (#2102) 1 year ago
Harrison Chase 410bf37fb8
Harrison/big query (#2100)
Co-authored-by: lu-cashmoney <lucas.corley@gmail.com>
1 year ago
Harrison Chase eff5eed719
Harrison/jina (#2043)
Co-authored-by: numb3r3 <wangfelix87@gmail.com>
Co-authored-by: felix-wang <35718120+numb3r3@users.noreply.github.com>
1 year ago
Harrison Chase f74a1bebf5
Harrison/duckdb (#2064)
Co-authored-by: Trent Hauck <trent@trenthauck.com>
1 year ago
Ankush Gola b7ebb8fe30
enable streaming in anthropic llm wrapper (#2065) 1 year ago
Harrison Chase a0cd6672aa
Harrison/site map (#2061)
Co-authored-by: Tim Asp <707699+timothyasp@users.noreply.github.com>
1 year ago
Mario Kostelac e7d6de6b1c
(ChatOpenAI) Add model_name to LLMResult.llm_output (#1960)
This makes sure OpenAI and ChatOpenAI have the same llm_output, and
allow tracking usage per model. Same work for OpenAI was done in
https://github.com/hwchase17/langchain/pull/1713.
1 year ago
Harrison Chase 6e1b5b8f7e
Harrison/figma doc loader (#1908)
Co-authored-by: Ismail Pelaseyed <homanp@gmail.com>
1 year ago
Eli 12f868b292
Propagate "filter" arg in Chroma similarity_search (#1869)
Technically a duplicate fix to #1619 but with unit tests and a small
documentation update
- Propagate `filter` arg in Chroma `similarity_search` to delegated call
to `similarity_search_with_score`
- Add `filter` arg to `similarity_search_by_vector`
- Clarify doc strings on FakeEmbeddings
1 year ago
Maurício Maia f155d9d3ec
Add metadata filter to PGVector search (#1872)
Add ability to filter pgvector documents by metadata.
1 year ago
Maurício Maia 2212520a6c
Add PGVector collection metadata (#1887)
The `CollectionStore` for `PGVector` has a `cmetadata` field but it's
never used. This PR add the ability to save metadata information to the
collection.
1 year ago
LeoGrin 3701b2901e
use namespace argument in Pinecone constructor (#1757)
Fix #1756

Use the `namespace` argument of `Pinecone.from_exisiting_index` to set
the default value of `namespace` for other methods. Leads to more
expected behavior and easier integration in chains.

For the test, I've added a line to delete and rebuild the
`langchain-demo` index at the beginning of the test. I'm not 100% sure
if it's a good idea but it makes the test reproducible.
1 year ago
Mario Kostelac aff44d0a98
(OpenAI) Add model_name to LLMResult.llm_output (#1713)
Given that different models have very different latencies and pricings,
it's benefitial to pass the information about the model that generated
the response. Such information allows implementing custom callback
managers and track usage and price per model.

Addresses https://github.com/hwchase17/langchain/issues/1557.
1 year ago
Daniel Chalef b157e0c1c3
Add HTML document_loader that includes page title metadata (#1720)
This `BSHTMLLoader` document_loader loads an HTML document, extracts
text and adds the page title to the returned Document's metadata. The
loader uses the already installed bs4 package to extract both text
content and the page title.

Included in this PR is an example HTML file and an integration test that
tests against this file.

---------

Co-authored-by: Daniel Chalef <daniel.chalef@private.org>
1 year ago
Kacper Łukawski 4a327dd1d6
Implement basic metadata filtering in Qdrant (#1689)
This PR implements a basic metadata filtering mechanism similar to the
ones in Chroma and Pinecone. It still cannot express complex conditions,
as there are no operators, but some users requested to have that feature
available.
1 year ago
Harrison Chase 0b29e68c17
Harrison/pgvector (#1679)
Co-authored-by: Aman Kumar <krsingh.aman@gmail.com>
1 year ago
Xin Qiu 4e13cef05a
feat: add redisearch vectorstore (#1307)
# Description

Add `RediSearch` vectorstore for LangChain

RediSearch: [RediSearch quick
start](https://redis.io/docs/stack/search/quick_start/)

# How to use

```
from langchain.vectorstores.redisearch import RediSearch

rds = RediSearch.from_documents(docs, embeddings,redisearch_url="redis://localhost:6379")
```
1 year ago
Tim Asp b3234bf3b0
cleanup: unify 3 different pdf loaders, rename PagedPDFSplitter (#1615)
`OnlinePDFLoader` and `PagedPDFSplitter` lived separate from the rest of
the pdf loaders.

Because they're all similar, I propose moving all to `pdy.py` and the
same docs/examples page.

Additionally, `PagedPDFSplitter` naming doesn't match the pattern the
rest of the loaders follow, so I renamed to `PyPDFLoader` and had it
inherit from `BasePDFLoader` so it can now load from remote file
sources.
1 year ago
Harrison Chase cb04ba0136
Add support for intermediate steps to SQLDatabaseSequentialChain (#1583) (#1601)
for https://github.com/hwchase17/langchain/issues/1582

I simply added the `return_intermediate_steps` and changed the
`output_keys` function.

I added 2 simple tests, 1 for SQLDatabaseSequentialChain without the
intermediate steps and 1 with

Co-authored-by: brad-nemetski <115185478+brad-nemetski@users.noreply.github.com>
1 year ago
Harrison Chase 3ee32a01ea
Harrison/prompt layer (#1547)
Co-authored-by: Jonathan Pedoeem <jonathanped@gmail.com>
Co-authored-by: AbuBakar <abubakarsohail123@gmail.com>
1 year ago
Harrison Chase c844d1fd46
Harrison/chunk size (#1549)
Co-authored-by: Florian Leuerer <31259070+floleuerer@users.noreply.github.com>
1 year ago
Harrison Chase 357d808484
Harrison/remote paths pdf (#1544)
Co-authored-by: Tim Asp <707699+timothyasp@users.noreply.github.com>
1 year ago
Ankush Gola 27104d4921
fix `ChatOpenAI.agenerate` (#1504) 1 year ago
Harrison Chase 7bec461782
Harrison/memory refactor (#1478)
moves memory to own module, factors out common stuff
1 year ago
Harrison Chase 0e21463f07
(rfc) chat models (#1424)
Co-authored-by: Ankush Gola <ankush.gola@gmail.com>
1 year ago
Harrison Chase a1b9dfc099
Harrison/similarity search chroma (#1434)
Co-authored-by: shibuiwilliam <shibuiyusuke@gmail.com>
2 years ago
Tim Asp 23231d65a9
Add PyMuPDF PDF loader (#1426)
Different PDF libraries have different strengths and weaknesses. PyMuPDF
does a good job at extracting the most amount of content from the doc,
regardless of the source quality, extremely fast (especially compared to
Unstructured).

https://pymupdf.readthedocs.io/en/latest/index.html
2 years ago
Nuno Campos 499e76b199
Allow the regular openai class to be used for ChatGPT models (#1393)
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2 years ago
Kacper Łukawski 9ac442624c
Add Qdrant named arguments (#1386)
This PR:
- Increases `qdrant-client` version to 1.0.4
- Introduces custom content and metadata keys (as requested in #1087)
- Moves all the `QdrantClient` parameters into the method parameters to
simplify code completion
2 years ago
Ankush Gola fe30be6fba
add async and streaming support to `OpenAIChat` (#1378)
title says it all
2 years ago
Tim Asp 72ef69d1ba
Add new iFixit document loader (#1333)
iFixit is a wikipedia-like site that has a huge amount of open content
on how to fix things, questions/answers for common troubleshooting and
"things" related content that is more technical in nature. All content
is licensed under CC-BY-SA-NC 3.0

Adding docs from iFixit as context for user questions like "I dropped my
phone in water, what do I do?" or "My macbook pro is making a whining
noise, what's wrong with it?" can yield significantly better responses
than context free response from LLMs.
2 years ago
Harrison Chase 166cda2cc6
Harrison/deeplake (#1316)
Co-authored-by: Davit Buniatyan <d@activeloop.ai>
2 years ago
Harrison Chase aaad6cc954
Harrison/atlas db (#1315)
Co-authored-by: Brandon Duderstadt <brandonduderstadt@gmail.com>
2 years ago
Enrico Shippole 9becdeaadf
Add Writer, Banana, Modal, StochasticAI (#1270)
Add LLM wrappers and examples for Banana, Writer, Modal, Stochastic AI

Added rigid json format for Banana and Modal
2 years ago
Dennis Antela Martinez 53c67e04d4
add aleph alpha llm (#1207)
Integrate Aleph Alpha's client into Langchain to provide access to the
luminous models - more info on latest benchmarks here:
https://www.aleph-alpha.com/luminous-performance-benchmarks
2 years ago
Harrison Chase 44c8d8a9ac
move serpapi wrapper (#1199)
Co-authored-by: Tim Asp <707699+timothyasp@users.noreply.github.com>
2 years ago
Naveen Tatikonda 0118706fd6
Add Support for OpenSearch Vector database (#1191)
### Description
This PR adds a wrapper which adds support for the OpenSearch vector
database. Using opensearch-py client we are ingesting the embeddings of
given text into opensearch cluster using Bulk API. We can perform the
`similarity_search` on the index using the 3 popular searching methods
of OpenSearch k-NN plugin:

- `Approximate k-NN Search` use approximate nearest neighbor (ANN)
algorithms from the [nmslib](https://github.com/nmslib/nmslib),
[faiss](https://github.com/facebookresearch/faiss), and
[Lucene](https://lucene.apache.org/) libraries to power k-NN search.
- `Script Scoring` extends OpenSearch’s script scoring functionality to
execute a brute force, exact k-NN search.
- `Painless Scripting` adds the distance functions as painless
extensions that can be used in more complex combinations. Also, supports
brute force, exact k-NN search like Script Scoring.

### Issues Resolved 
https://github.com/hwchase17/langchain/issues/1054

---------

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
2 years ago
Andrew White c5015d77e2
Allow k to be higher than doc size in max_marginal_relevance_search (#1187)
Fixes issue #1186. For some reason, #1117 didn't seem to fix it.
2 years ago
Harrison Chase 9d6d8f85da
Harrison/self hosted runhouse (#1154)
Co-authored-by: Donny Greenberg <dongreenberg2@gmail.com>
Co-authored-by: John Dagdelen <jdagdelen@users.noreply.github.com>
Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MBP.attlocal.net>
Co-authored-by: Andrew White <white.d.andrew@gmail.com>
Co-authored-by: Peng Qu <82029664+pengqu123@users.noreply.github.com>
Co-authored-by: Matt Robinson <mthw.wm.robinson@gmail.com>
Co-authored-by: jeff <tangj1122@gmail.com>
Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MacBook-Pro.local>
Co-authored-by: zanderchase <zander@unfold.ag>
Co-authored-by: Charles Frye <cfrye59@gmail.com>
Co-authored-by: zanderchase <zanderchase@gmail.com>
Co-authored-by: Shahriar Tajbakhsh <sh.tajbakhsh@gmail.com>
Co-authored-by: Stefan Keselj <skeselj@princeton.edu>
Co-authored-by: Francisco Ingham <fpingham@gmail.com>
Co-authored-by: Dhruv Anand <105786647+dhruv-anand-aintech@users.noreply.github.com>
Co-authored-by: cragwolfe <cragcw@gmail.com>
Co-authored-by: Anton Troynikov <atroyn@users.noreply.github.com>
Co-authored-by: William FH <13333726+hinthornw@users.noreply.github.com>
Co-authored-by: Oliver Klingefjord <oliver@klingefjord.com>
Co-authored-by: blob42 <contact@blob42.xyz>
Co-authored-by: blob42 <spike@w530>
Co-authored-by: Enrico Shippole <henryshippole@gmail.com>
Co-authored-by: Ibis Prevedello <ibiscp@gmail.com>
Co-authored-by: jped <jonathanped@gmail.com>
Co-authored-by: Justin Torre <justintorre75@gmail.com>
Co-authored-by: Ivan Vendrov <ivan@anthropic.com>
Co-authored-by: Sasmitha Manathunga <70096033+mmz-001@users.noreply.github.com>
Co-authored-by: Ankush Gola <9536492+agola11@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Jeff Huber <jeffchuber@gmail.com>
Co-authored-by: Akshay <64036106+akshayvkt@users.noreply.github.com>
Co-authored-by: Andrew Huang <jhuang16888@gmail.com>
Co-authored-by: rogerserper <124558887+rogerserper@users.noreply.github.com>
Co-authored-by: seanaedmiston <seane999@gmail.com>
Co-authored-by: Hasegawa Yuya <52068175+Hase-U@users.noreply.github.com>
Co-authored-by: Ivan Vendrov <ivendrov@gmail.com>
Co-authored-by: Chen Wu (吴尘) <henrychenwu@cmu.edu>
Co-authored-by: Dennis Antela Martinez <dennis.antela@gmail.com>
Co-authored-by: Maxime Vidal <max.vidal@hotmail.fr>
Co-authored-by: Rishabh Raizada <110235735+rishabh-ti@users.noreply.github.com>
2 years ago
Noah Gundotra 8c5fbab72d
[Integration Tests] Cast fake embeddings to ALL float values (#1102)
Pydantic validation breaks tests for example (`test_qdrant.py`) because
fake embeddings contain an integer.

This PR casts the embeddings array to all floats.

Now the `qdrant` test passes, `poetry run pytest
tests/integration_tests/vectorstores/test_qdrant.py`
2 years ago
yakigac 1ed708391e
Fix a bug that shows "KeyError 'items'" (#1118)
Fix KeyError 'items' when no result found.

## Problem

When no result found for a query, google search crashed with `KeyError
'items'`.

## Solution

I added a check for an empty response before accessing the 'items' key.
It will handle the case correctly.

## Other

my twitter: yakigac
(I don't mind even if you don't mention me for this PR. But just because
last time my real name was shout out :) )
2 years ago
Hasegawa Yuya e08961ab25
Fixed openai embeddings to be safe by batching them based on token size calculation. (#991)
I modified the logic of the batch calculation for embedding according to
this cookbook

https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
2 years ago
seanaedmiston f0a258555b
Support similarity search by vector (in FAISS) (#961)
Alternate implementation to PR #960 Again - only FAISS is implemented.
If accepted can add this to other vectorstores or leave as
NotImplemented? Suggestions welcome...
2 years ago
rogerserper e46cd3b7db
Google Search API integration with serper.dev (wrapper, tests, docs, … (#909)
Adds Google Search integration with [Serper](https://serper.dev) a
low-cost alternative to SerpAPI (10x cheaper + generous free tier).
Includes documentation, tests and examples. Hopefully I am not missing
anything.

Developers can sign up for a free account at
[serper.dev](https://serper.dev) and obtain an api key.

## Usage

```python
from langchain.utilities import GoogleSerperAPIWrapper
from langchain.llms.openai import OpenAI
from langchain.agents import initialize_agent, Tool

import os
os.environ["SERPER_API_KEY"] = ""
os.environ['OPENAI_API_KEY'] = ""

llm = OpenAI(temperature=0)
search = GoogleSerperAPIWrapper()
tools = [
    Tool(
        name="Intermediate Answer",
        func=search.run
    )
]

self_ask_with_search = initialize_agent(tools, llm, agent="self-ask-with-search", verbose=True)
self_ask_with_search.run("What is the hometown of the reigning men's U.S. Open champion?")
```

### Output
```
Entering new AgentExecutor chain...
 Yes.
Follow up: Who is the reigning men's U.S. Open champion?
Intermediate answer: Current champions Carlos Alcaraz, 2022 men's singles champion.
Follow up: Where is Carlos Alcaraz from?
Intermediate answer: El Palmar, Spain
So the final answer is: El Palmar, Spain

> Finished chain.

'El Palmar, Spain'
```
2 years ago
Ankush Gola caa8e4742e
Enable streaming for OpenAI LLM (#986)
* Support a callback `on_llm_new_token` that users can implement when
`OpenAI.streaming` is set to `True`
2 years ago
Harrison Chase 88bebb4caa
Harrison/llm integrations (#1039)
Co-authored-by: jped <jonathanped@gmail.com>
Co-authored-by: Justin Torre <justintorre75@gmail.com>
Co-authored-by: Ivan Vendrov <ivan@anthropic.com>
2 years ago
Enrico Shippole f30dcc6359
Add GooseAI, CerebriumAI, Petals, ForefrontAI (#981)
Add GooseAI, CerebriumAI, Petals, ForefrontAI
2 years ago
Anton Troynikov d43d430d86
Chroma persistence (#1028)
This PR adds persistence to the Chroma vector store.

Users can supply a `persist_directory` with any of the `Chroma` creation
methods. If supplied, the store will be automatically persisted at that
directory.

If a user creates a new `Chroma` instance with the same persistence
directory, it will get loaded up automatically. If they use `from_texts`
or `from_documents` in this way, the documents will be loaded into the
existing store.

There is the chance of some funky behavior if the user passes a
different embedding function from the one used to create the collection
- we will make this easier in future updates. For now, we log a warning.
2 years ago
Anton Troynikov 78abd277ff
Chroma in LangChain (#1010)
Chroma is a simple to use, open-source, zero-config, zero setup
vectorstore.

Simply `pip install chromadb`, and you're good to go. 

Out-of-the-box Chroma is suitable for most LangChain workloads, but is
highly flexible. I tested to 1M embs on my M1 mac, with out issues and
reasonably fast query times.

Look out for future releases as we integrate more Chroma features with
LangChain!
2 years ago
Harrison Chase c64f98e2bb
Harrison/format agent instructions (#973)
Co-authored-by: Andrew White <white.d.andrew@gmail.com>
Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MBP.attlocal.net>
Co-authored-by: Peng Qu <82029664+pengqu123@users.noreply.github.com>
2 years ago
Harrison Chase 91c6cea227
Harrison/batch embeds (#972)
Co-authored-by: John Dagdelen <jdagdelen@users.noreply.github.com>
Co-authored-by: Harrison Chase <harrisonchase@Harrisons-MBP.attlocal.net>
2 years ago
Ankush Gola bc7e56e8df
Add asyncio support for LLM (OpenAI), Chain (LLMChain, LLMMathChain), and Agent (#841)
Supporting asyncio in langchain primitives allows for users to run them
concurrently and creates more seamless integration with
asyncio-supported frameworks (FastAPI, etc.)

Summary of changes:

**LLM**
* Add `agenerate` and `_agenerate`
* Implement in OpenAI by leveraging `client.Completions.acreate`

**Chain**
* Add `arun`, `acall`, `_acall`
* Implement them in `LLMChain` and `LLMMathChain` for now

**Agent**
* Refactor and leverage async chain and llm methods
* Add ability for `Tools` to contain async coroutine
* Implement async SerpaPI `arun`

Create demo notebook.

Open questions:
* Should all the async stuff go in separate classes? I've seen both
patterns (keeping the same class and having async and sync methods vs.
having class separation)
2 years ago
Harrison Chase bc53c928fc
Harrison/athropic (#921)
Co-authored-by: Mike Lambert <mlambert@gmail.com>
Co-authored-by: mrbean <sam@you.com>
Co-authored-by: mrbean <43734688+sam-h-bean@users.noreply.github.com>
Co-authored-by: Ivan Vendrov <ivendrov@gmail.com>
2 years ago
Harrison Chase 1e56879d38
Harrison/save faiss (#916)
Co-authored-by: Shrey Joshi <shreyjoshi2004@gmail.com>
2 years ago
Harrison Chase ba5a2f06b9
Harrison/inference endpoint (#861)
Co-authored-by: Eno Reyes <enoreyes@gmail.com>
2 years ago
Kevin Huo 31b054f69d
Add pinecone integration test (#911)
Basic integration test for pinecone
2 years ago
Harrison Chase 3f48eed5bd
Harrison/milvus (#856)
Signed-off-by: Filip Haltmayer <filip.haltmayer@zilliz.com>
Signed-off-by: Frank Liu <frank.liu@zilliz.com>
Co-authored-by: Filip Haltmayer <81822489+filip-halt@users.noreply.github.com>
Co-authored-by: Frank Liu <frank@frankzliu.com>
2 years ago
kahkeng 4a8f5cdf4b
Add alternative token-based text splitter (#816)
This does not involve a separator, and will naively chunk input text at
the appropriate boundaries in token space.

This is helpful if we have strict token length limits that we need to
strictly follow the specified chunk size, and we can't use aggressive
separators like spaces to guarantee the absence of long strings.

CharacterTextSplitter will let these strings through without splitting
them, which could cause overflow errors downstream.

Splitting at arbitrary token boundaries is not ideal but is hopefully
mitigated by having a decent overlap quantity. Also this results in
chunks which has exact number of tokens desired, instead of sometimes
overcounting if we concatenate shorter strings.

Potentially also helps with #528.
2 years ago
Harrison Chase 23d5f64bda
Harrison/ngram example (#846)
Co-authored-by: Sean Spriggens <ssprigge@syr.edu>
2 years ago
Harrison Chase d564308e0f
rfc: instruct embeddings (#811)
Co-authored-by: seanaedmiston <seane999@gmail.com>
2 years ago
Harrison Chase 7b4882a2f4
Harrison/tf embeddings (#817)
Co-authored-by: Ryohei Kuroki <10434946+yakigac@users.noreply.github.com>
2 years ago
dham e04b063ff4
add faiss local saving/loading (#676)
- This uses the faiss built-in `write_index` and `load_index` to save
and load faiss indexes locally
- Also fixes #674
- The save/load functions also use the faiss library, so I refactored
the dependency into a function
2 years ago
Harrison Chase 0b204d8c21
Harrison/quadrant (#665)
Co-authored-by: Kacper Łukawski <kacperlukawski@users.noreply.github.com>
2 years ago
Harrison Chase 4d4cff0530
Harrison/cohere experimental (#638)
Co-authored-by: inyourhead <44607279+xettrisomeman@users.noreply.github.com>
2 years ago
Harrison Chase ffc7e04d44
Harrison/wolfram alpha (#579)
Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2 years ago
Harrison Chase 0072686aab
Harrison/new search engine (#477)
Co-authored-by: Nicolas <nicolascamara29@gmail.com>
2 years ago
Harrison Chase f8b605293f
Harrison/improve memory (#432)
add AI prefix

add new type of memory

Co-authored-by: Jason <chisanch@usc.edu>
2 years ago
Harrison Chase cf98f219f9
Harrison/tools exp (#372) 2 years ago
Harrison Chase 3474f39e21
Harrison/improve cache (#368)
make it so everything goes through generate, which removes the need for
two types of caches
2 years ago
Harrison Chase a7084ad6e4
Harrison/version 0040 (#366) 2 years ago
mrbean 50257fce59
Support Streaming Tokens from OpenAI (#364)
https://github.com/hwchase17/langchain/issues/363

@hwchase17 how much does this make you want to cry?
2 years ago
mrbean fe6695b9e7
Add HuggingFacePipeline LLM (#353)
https://github.com/hwchase17/langchain/issues/354

Add support for running your own HF pipeline locally. This would allow
you to get a lot more dynamic with what HF features and models you
support since you wouldn't be beholden to what is hosted in HF hub. You
could also do stuff with HF Optimum to quantize your models and stuff to
get pretty fast inference even running on a laptop.
2 years ago
Harrison Chase 9bb7195085
Harrison/llm saving (#331)
Co-authored-by: Akash Samant <70665700+asamant21@users.noreply.github.com>
2 years ago
Harrison Chase 3ca2c8d6c5
allow passing of stop params into openai (#232) 2 years ago
Harrison Chase ca2394028f
move search to not be a chain (#226) 2 years ago
Andrew Gleave ea67c049f0
Support SQL statements that return no results (#222)
Adds support for statements such as insert, update etc which do not
return any rows.

`engine.execute` is deprecated and so execution has been updated to use
`connection.exec_driver_sql` as-per:


https://docs.sqlalchemy.org/en/14/core/connections.html#sqlalchemy.engine.Engine.execute
2 years ago
Harrison Chase 1b9b8efbc9
pal chain (#207)
from https://arxiv.org/pdf/2211.10435.pdf
2 years ago
Harrison Chase b94244eb12
nits (#210)
use json.dump

move test to integration tests (since it requires huggingface_hub)
2 years ago
Bagatur b90e25f786
Add HuggingFace Hub Embeddings (#125)
Add support for calling HuggingFace embedding models
using the HuggingFaceHub Inference API. New class mirrors
the existing HuggingFaceHub LLM implementation. Currently
only supports 'sentence-transformers' models.

Closes #86
2 years ago
Harrison Chase ae9c6257fe
Harrison/arbitrary params (#186) 2 years ago
Harrison Chase d3a7429f61
(WIP) agents (#171) 2 years ago
Samantha Whitmore 315b0c09c6
wip: add method for both docstore and embeddings (#119)
this will break atm but wanted to get thoughts on implementation.

1. should add() be on docstore interface?
2. should InMemoryDocstore change to take a list of documents as init?
(makes this slightly easier to implement in FAISS -- if we think it is
less clean then could expose a method to get the number of documents
currently in the dict, and perform the logic of creating the necessary
dictionary in the FAISS.add_texts method.

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2 years ago
Harrison Chase c02eb199b6
add few shot example (#148) 2 years ago
Harrison Chase 9f223e6ccc
Harrison/fix lint (#138) 2 years ago
Delip Rao 76cecf8165
A fix for Jupyter environment variable issue (#135)
- fixes the Jupyter environment variable issues mentioned in issue #134 
- fixes format/lint issues in some unrelated files (from make
format/lint)


![image](https://user-images.githubusercontent.com/347398/201599322-090af858-362d-4d69-bf59-208aea65419a.png)
2 years ago
Harrison Chase f23b3ceb49
consolidate run functions (#126)
consolidating logic for when a chain is able to run with single input
text, single output text

open to feedback on naming, logic, usefulness
2 years ago
Harrison Chase d87e73ddb1
huggingface tokenizer (#75) 2 years ago
Harrison Chase e43534d41c
add integration with manifest (#62) 2 years ago
tomeras91 d8734ce5ad
Add AI21 LLMs (#99)
Integrate AI21 /complete API into langchain, to allow access to Jurassic
models.
2 years ago
Samantha Whitmore a0780cc930
OptimizedPrompt -- k-shot example choice backed by semantic search (#91) 2 years ago
Delip Rao 3ee6e332dd
Implements NLTK and Spacy-based TextSplitters (#103)
This PR is for Issue #88 

- [x] `make format`
- [x] `make lint`
- [x] `make tests`
2 years ago
issam9 28282ad099
Issam9/cohere embeddings (#105)
Add support for cohere embeddings
2 years ago
Delip Rao 95dd2f140e
Make Integration Tests "work" again (#106)
This fixes Issue #104 

The tests for HF Embeddings is skipped because of the segfault issue
mentioned there. Perhaps, a new issue should be created for that?
2 years ago
Harrison Chase b9f61390e9
add text2text generation (#93)
fixes issue #90
2 years ago
Samantha Whitmore efbc03bda8
NLPCloud client integration (#81)
lots of kwargs! generation docs here:
https://docs.nlpcloud.com/#generation

This somewhat breaks the paradigm introduced in LLM base class as the
stop sequence isn't a list, and should rightfully be introduced at the
time of initialization of the class, along with the other kwargs that
depend on its presence (e.g. remove_end_sequence, etc.) curious if you'd
want to refactor LLM base class to take out stop as a specific named
kwarg?
2 years ago
issam9 990cd821cc
Issam/hf embeddings (#68)
Add support of HuggingFace embedding models
2 years ago
Harrison Chase 76aff023d7
FAISS and embedding support (#48)
also adds embeddings and an in memory docstore
2 years ago
Harrison Chase e982cf4b2e
Harrison/update docstore (#47)
change docstore interface
2 years ago
Harrison Chase af81e9ca9c
add sql database (#35) 2 years ago
Harrison Chase ce7b14b843
Harrison/add react chain (#24)
from https://arxiv.org/abs/2210.03629

still need to think if docstore abstraction makes sense
2 years ago
Harrison Chase 020c42dcae
Harrison/add huggingface hub (#23)
Add support for huggingface hub

I could not find a good way to enforce stop tokens over the huggingface
hub api - that needs to hopefully be cleaned up in the future
2 years ago