langchain

mirror of https://github.com/hwchase17/langchain synced 2024-11-10 01:10:59 +00:00

Author	SHA1	Message	Date
Brendan Collins	9aef79c2e3	Add Geopandas.GeoDataFrame Document Loader (#3817 ) Work in Progress. WIP Not ready... Adds Document Loader support for [Geopandas.GeoDataFrames](https://geopandas.org/) Example: - [x] stub out `GeoDataFrameLoader` class - [x] stub out integration tests - [ ] Experiment with different geometry text representations - [ ] Verify CRS is successfully added in metadata - [ ] Test effectiveness of searches on geometries - [ ] Test with different geometry types (point, line, polygon with multi-variants). - [ ] Add documentation --------- Co-authored-by: Lance Martin <lance@langchain.dev> Co-authored-by: Bagatur <baskaryan@gmail.com> Co-authored-by: Lance Martin <122662504+rlancemartin@users.noreply.github.com>	2023-07-19 12:14:41 -07:00
Adilkhan Sarsen	7bb843477f	Removed kwargs from add_texts (#7595 ) Removing the **kwargs argument from the add_texts method in the DeepLake vectorstore, as it confuses users and doesn't fail when the user types incorrect parameters. Also added a small test to ensure the change is applied correctly. Guys could pls take a look: @rlancemartin, @eyurtsev, this is a small PR. Thx so much!	2023-07-19 09:23:49 -07:00
Jarek Kazmierczak	f2ef3ff54a	Google Cloud Enterprise Search retriever (#7857 ) Added a retriever that encapsulated Google Cloud Enterprise Search. --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-07-18 18:24:08 -07:00
Hanit	0d23c0c82a	Allowing additional params for OpenAIEmbeddings. (#7752 ) (#7654) --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-07-18 12:14:51 -07:00
shibuiwilliam	177baef3a1	Add test for svm retriever (#7768 ) # What - This is to add unit test for svm retriever. --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-07-18 09:57:24 -07:00
shibuiwilliam	f29a5d4bcc	add test for knn retriever (#7769 ) # What - This is to add test for knn retriever. --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-07-18 09:52:11 -07:00
shibuiwilliam	235264a246	Add/test faiss (#7809 ) # What - Add missing test cases to faiss vector stores	2023-07-18 08:30:35 -07:00
Bill Zhang	dda11d2a05	WeaviateHybridSearchRetriever option to enable scores. (#7861 ) Description: This PR adds the option to retrieve scores and explanations in the WeaviateHybridSearchRetriever. This feature improves the usability of the retriever by allowing users to understand the scoring logic behind the search results and further refine their search queries. Issue: This PR is a solution to the issue #7855 Dependencies: This PR does not introduce any new dependencies. Tag maintainer: @rlancemartin, @eyurtsev I have included a unit test for the added feature, ensuring that it retrieves scores and explanations correctly. I have also included an example notebook demonstrating its use.	2023-07-18 07:57:17 -07:00
German Martin	f1eaa9b626	Lost in the middle: We have been ordering documents the WRONG way. (for long context) (#7520 ) Motivation: it seems that when dealing with a long context and a "big" number of relevant documents, we must avoid using the out-of-the-box score ordering from vector stores. See: https://arxiv.org/pdf/2306.01150.pdf So, I added an additional parameter that allows you to reorder the retrieved documents so we can work around this performance degradation. The reordering respects the original search score but places the least relevant documents in the middle of the context. Extract from the paper (one image speaks 1000 tokens): ![image](https://github.com/hwchase17/langchain/assets/1821407/fafe4843-6e18-4fa6-9416-50cc1d32e811) This seems to be common to all the different architectures, so I think we need a good generic way to implement this reordering and run some tests on our already running retrievers. It could be that my approach is not the best one from the architecture point of view; happy to have a discussion about that. For me this was the best place to introduce the change and start retesting different implementations. @rlancemartin, @eyurtsev --------- Co-authored-by: Lance Martin <lance@langchain.dev>	2023-07-18 07:45:15 -07:00
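The reordering described in the commit above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `reorder_for_long_context` is hypothetical, the input is assumed to be sorted most relevant first, and it is not necessarily how the PR implements the parameter.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def reorder_for_long_context(docs: Sequence[T]) -> List[T]:
    """Place the most relevant items at both ends and the least relevant in the middle.

    `docs` is assumed to be sorted by relevance, most relevant first
    (hypothetical helper, not the PR's actual API).
    """
    reordered: List[T] = []
    # Walk from least to most relevant, alternately pushing items to the
    # front and the back so that later (more relevant) items end up outermost.
    for i, doc in enumerate(reversed(docs)):
        if i % 2 == 1:
            reordered.append(doc)
        else:
            reordered.insert(0, doc)
    return reordered


# Ranks 1..5, where 1 is the most relevant document.
ranked = [1, 2, 3, 4, 5]
print(reorder_for_long_context(ranked))
```

With five ranked documents, the two best land at the two ends and the weakest one sits in the centre, which is the layout the commit message argues for.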
William FH	e294ba475a	Some mitigations for RCE in PAL chain (#7870 ) Some docstring / small nits to #6003 --------- Co-authored-by: BoazWasserman <49598618+boazwasserman@users.noreply.github.com> Co-authored-by: HippoTerrific <49598618+HippoTerrific@users.noreply.github.com> Co-authored-by: Or Raz <orraz1994@gmail.com>	2023-07-17 22:58:47 -07:00
Matt Robinson	3c489be773	feat: optional post-processing for Unstructured loaders (#7850 ) ### Summary Adds a post-processing method for Unstructured loaders that allows users to optionally modify or clean extracted elements. ### Testing ```python from langchain.document_loaders import UnstructuredFileLoader from unstructured.cleaners.core import clean_extra_whitespace loader = UnstructuredFileLoader( "./example_data/layout-parser-paper.pdf", mode="elements", post_processors=[clean_extra_whitespace], ) docs = loader.load() docs[:5] ``` ### Reviewers - @rlancemartin - @eyurtsev - @hwchase17	2023-07-17 12:13:05 -07:00
Dayuan Jiang	ee40d37098	add bm25 module (#7779 ) - Description: Add a BM25 Retriever that does not need Elasticsearch - Dependencies: rank_bm25 (if it is not installed, it will be installed using pip, just like TFIDFRetriever does) - Tag maintainer: @rlancemartin, @eyurtsev - Twitter handle: DayuanJian21687 --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-07-17 07:30:17 -07:00
Liu Ming	fa0a9e502a	Add LLM for ChatGLM(2)-6B API (#7774 ) Description: Add LLM for ChatGLM-6B & ChatGLM2-6B API Related Issue: Will the langchain support ChatGLM? #4766 Add support for selfhost models like ChatGLM or transformer models #1780 Dependencies: No extra library install required. It wraps API calls to a ChatGLM(2)-6B server (started with api.py), so an API endpoint is required to run. Tag maintainer: @mlot Any comments on this PR would be appreciated. --------- Co-authored-by: mlot <limpo2000@gmail.com> Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-07-17 07:27:17 -07:00
Yifei Song	2e47412073	Add Xorbits agent (#7647 ) - [Xorbits](https://doc.xorbits.io/en/latest/) is an open-source computing framework that makes it easy to scale data science and machine learning workloads in parallel. Xorbits can leverage multi cores or GPUs to accelerate computation on a single machine, or scale out up to thousands of machines to support processing terabytes of data. - This PR added support for the Xorbits agent, which allows langchain to interact with Xorbits Pandas dataframe and Xorbits Numpy array. - Dependencies: This change requires the Xorbits library to be installed in order to be used. `pip install xorbits` - Request for review: @hinthornw - Twitter handle: https://twitter.com/Xorbitsio	2023-07-17 07:09:51 -07:00
William FH	c58d35765d	Add examples to docstrings (#7796 ) and: - remove dataset name from autogenerated project name - print out project name to view	2023-07-16 12:05:56 -07:00
Gordon Clark	96f3dff050	MediaWiki docloader improvements + unit tests (#5879 ) Starting over from #5654 because I utterly borked the poetry.lock file. Adds new parameters to the MWDumpLoader class: * skip_redirects (bool) Tells the loader to skip articles that redirect to other articles. False by default. * stop_on_error (bool) Tells the parser to skip any page that causes a parse error. True by default. * namespaces (List[int]) Tells the parser which namespaces to parse. Contains namespaces from -2 to 15 by default. Default values are chosen to preserve backwards compatibility. Sample dump XML and full unit test coverage (with extended tests that pass!) also included! --------- Co-authored-by: Harrison Chase <hw.chase.17@gmail.com> Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-07-15 10:49:36 -04:00
Kacper Łukawski	1ff5b67025	Implement async API for Qdrant vector store (#7704 ) Inspired by #5550, I implemented full async API support in Qdrant. The docs were extended to mention the existence of asynchronous operations in Langchain. I also used that chance to restructure the tests of Qdrant and provided a suite of tests for the async version. Async API requires the GRPC protocol to be enabled. Thus, it doesn't work on local mode yet, but we're considering including the support to be consistent.	2023-07-15 09:33:26 -04:00
Aarav Borthakur	210296a71f	Integrate Rockset as a document loader (#7681 ) Integrate [Rockset](https://rockset.com/docs/) as a document loader. Issue: None Dependencies: Nothing new (rockset's dependency was already added [here](https://github.com/hwchase17/langchain/pull/6216)) Tag maintainer: @rlancemartin I have added a test for the integration and an example notebook showing its use. I ran `make lint` and everything looks good. --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-07-14 07:58:13 -07:00
Leonid Kuligin	85e1c9b348	Added support for examples for VertexAI chat models. (#7636 ) #5278 Co-authored-by: Leonid Kuligin <kuligin@google.com>	2023-07-14 02:03:04 -04:00
Richy Wang	45bb414be2	Add LLM for Alibaba's Damo Academy's Tongyi Qwen API (#7477 ) - Add langchain.llms.Tongyi for text completion via the Tongyi Text API, with examples, - Add system tests. Note async completion for the Text API is not yet supported and will be included in a future PR. Dependencies: dashscope. It must be installed manually because it is not needed by everyone. Happy for feedback on any aspect of this PR @hwchase17 @baskaryan.	2023-07-14 01:58:22 -04:00
Lance Martin	6325a3517c	Make recursive loader yield while crawling (#7568 ) Support actual lazy_load since it can take a while to crawl larger directories.	2023-07-13 21:55:20 -07:00
UmerHA	82f3e32d8d	[Small upgrade] Allow document limit in AzureCognitiveSearchRetriever (#7690 ) Multiple people have asked in #5081 for a way to limit the documents returned from an AzureCognitiveSearchRetriever. This PR adds the `top_n` parameter to allow that. Twitter handle: [@UmerHAdil](twitter.com/umerHAdil)	2023-07-13 23:04:40 -04:00
Kenton Parton	9124221d31	Fixed handling of absolute URLs in `RecursiveUrlLoader` (#7677 ) ## Description This PR addresses a bug in the RecursiveUrlLoader class where absolute URLs were being treated as relative URLs, causing malformed URLs to be produced. The fix involves using the urljoin function from the urllib.parse module to correctly handle both absolute and relative URLs. @rlancemartin @eyurtsev --------- Co-authored-by: Lance Martin <lance@langchain.dev>	2023-07-13 15:34:00 -07:00
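To make the bug above concrete, here is a small sketch contrasting naive string concatenation with `urllib.parse.urljoin`. The helper names and the example URLs are illustrative assumptions and are not taken from the `RecursiveUrlLoader` source.

```python
from urllib.parse import urljoin


def naive_resolve(base_url: str, href: str) -> str:
    """Treats every link as relative (the kind of bug described above)."""
    return base_url.rstrip("/") + "/" + href.lstrip("/")


def resolve_link(base_url: str, href: str) -> str:
    """Resolves both relative and absolute links against the page URL."""
    return urljoin(base_url, href)


base = "https://example.com/docs/"

# An absolute link is mangled by concatenation but left alone by urljoin.
print(naive_resolve(base, "https://other.example.org/page"))
print(resolve_link(base, "https://other.example.org/page"))

# Relative and root-relative links are resolved against the base.
print(resolve_link(base, "guide/intro.html"))
print(resolve_link(base, "/about"))
```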
EllieRoseS	c087ce74f7	Added matching async load func to PlaywrightURLLoader (#5938 ) Fixes # (issue) The existing PlaywrightURLLoader load() function uses a synchronous browser, which is not compatible with Jupyter. This PR adds a sister function aload() which can be run inside a notebook. --------- Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>	2023-07-13 17:51:38 -04:00
William FH	aab2a7cd4b	Normalize Trajectory Eval Score (#7668 )	2023-07-13 09:58:28 -07:00
Tamas Molnar	24c1654208	Fix SQLAlchemy LLM cache clear (#7653 ) Fixes #7652 Description: This is a fix for clearing the cache for SQL Alchemy based LLM caches. The langchain.llm_cache.clear() did not take effect for SQLite cache. Reason: it didn't commit the deletion database change. See SQLAlchemy documentation for proper usage: https://docs.sqlalchemy.org/en/20/orm/session_basics.html#opening-and-closing-a-session https://docs.sqlalchemy.org/en/20/orm/session_basics.html#deleting @hwchase17 @baskaryan --------- Co-authored-by: Tamas Molnar <tamas.molnar@nagarro.com>	2023-07-13 09:39:04 -04:00
Bagatur	c17a80f11c	fix chroma updated upsert interface (#7643 ) new chroma release seems to not support empty dicts for metadata. related to #7633	2023-07-13 09:27:14 -04:00
William FH	a673a51efa	[Breaking] Update Evaluation Functionality (#7388 ) - Migrate from deprecated langchainplus_sdk to `langsmith` package - Update the `run_on_dataset()` API to use an eval config - Update a number of evaluators, as well as the loading logic - Update docstrings / reference docs - Update tracer to share single HTTP session	2023-07-13 02:13:06 -07:00
Ma Donghao	6f62e5461c	Update the parser regex of map_rerank (#6419 ) Sometimes the score returned by ChatGPT looks like 'Response example\nScore: 90 (fully answers the question, but could provide more detail on the specific error message)'. Because the score then contains more than just digits, parsing it raises a ValueError. Updating the RegexParser pattern from `.` to `\d` lets us ignore the text after the number. Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-07-13 03:01:42 -04:00
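The effect of switching from `.` to `\d` can be reproduced with the standard `re` module. The two patterns below are assumptions modeled on the description above, not copied from the map_rerank prompt file.

```python
import re

response = (
    "Response example\n"
    "Score: 90 (fully answers the question, but could provide more detail)"
)

# Assumed "before" and "after" patterns, differing only in the score group.
loose_pattern = r"(.*?)\nScore: (.*)"
strict_pattern = r"(.*?)\nScore: (\d*)"

loose_score = re.search(loose_pattern, response).group(2)
strict_score = re.search(strict_pattern, response).group(2)

try:
    int(loose_score)  # the trailing explanation makes this fail
    loose_ok = True
except ValueError:
    loose_ok = False

print(repr(loose_score), loose_ok)
print(repr(strict_score), int(strict_score))
```

The `\d*` group stops at the first non-digit character, so the parenthesised explanation after the number never reaches the integer conversion.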
Bagatur	b08f903755	fix chroma init bug (#7639 )	2023-07-13 03:00:33 -04:00
Nir Gazit	f307ca094b	fix(memory): allow internal chains to use memory (#6769 ) Fixed #6768. This is a workaround only. I think a better longer-term solution is for chains to declare how many input variables they actually need (as opposed to ones that are in the prompt, where some may be satisfied by the memory). Then, a wrapping chain can check the input match against the actual input variables. @hwchase17	2023-07-13 02:47:44 -04:00
Jason Fan	8effd90be0	Add new types of document transformers (#7379 ) - Description: Add two new document transformers that translate documents into different languages and convert documents into Q&A format to improve vector search results. Uses OpenAI function calling via the [doctran](https://github.com/psychic-api/doctran/tree/main) library. - Issue: N/A - Dependencies: `doctran = "^0.0.5"` - Tag maintainer: @rlancemartin @eyurtsev @hwchase17 - Twitter handle: @psychicapi or @jfan001 Notes - Adheres to the `DocumentTransformer` abstraction set by @dev2049 in #3182 - Refactored `EmbeddingsRedundantFilter` to put it in a file under a new `document_transformers` module - Added basic docs for `DocumentInterrogator`, `DocumentTransformer` as well as the existing `EmbeddingsRedundantFilter` --------- Co-authored-by: Lance Martin <lance@langchain.dev> Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-07-12 23:53:30 -04:00
Yaroslav Halchenko	0d92a7f357	codespell: workflow, config + some (quite a few) typos fixed (#6785 ) Probably the most boring PR to review ;) Individual commits might be easier to digest --------- Co-authored-by: Bagatur <baskaryan@gmail.com> Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>	2023-07-12 16:20:08 -04:00
Sam	931e68692e	Adds a chain around sympy for symbolic math (#6834 ) - Description: Adds a new chain that acts as a wrapper around Sympy to give LLMs the ability to do some symbolic math. - Dependencies: SymPy --------- Co-authored-by: sreiswig <sreiswig@github.com> Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-07-12 15:17:32 -04:00
Alec Flett	6cdd4b5edc	only add handlers if they are new (#7504 ) When using callbacks, there are times when callbacks can be added redundantly: for instance, sometimes you might need to create an llm with specific callbacks, but then also create an agent that uses a chain that has those callbacks already set. This means that "callbacks" might get passed down again to the llm at predict() time, resulting in duplicate calls to the `on_llm_start` callback. For the sake of simplicity, I made it so that langchain never adds an exact handler/callbacks object twice in `add_handler`, thus avoiding the duplicate handler issue. Tagging @hwchase17 for callback review --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-07-12 03:48:29 -04:00
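A minimal sketch of the idea in the commit above: skip a handler when the exact same object is already registered. `SimpleCallbackManager` and `CountingHandler` are hypothetical stand-ins, not LangChain's callback classes.

```python
from typing import Any, List


class CountingHandler:
    """Hypothetical handler that counts how often it is notified."""

    def __init__(self) -> None:
        self.starts = 0

    def on_llm_start(self, prompt: str) -> None:
        self.starts += 1


class SimpleCallbackManager:
    """Hypothetical manager illustrating duplicate-free registration."""

    def __init__(self) -> None:
        self.handlers: List[Any] = []

    def add_handler(self, handler: Any) -> None:
        # Only register the handler if this exact object is not present yet.
        if not any(existing is handler for existing in self.handlers):
            self.handlers.append(handler)

    def on_llm_start(self, prompt: str) -> None:
        for handler in self.handlers:
            handler.on_llm_start(prompt)


manager = SimpleCallbackManager()
handler = CountingHandler()
manager.add_handler(handler)  # set when the llm is created
manager.add_handler(handler)  # passed down again by the agent or chain
manager.on_llm_start("hello")
print(len(manager.handlers), handler.starts)
```

The identity check (`is`) mirrors the "exact handler object" wording; an equality-based check would be a different design choice.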
Junlin Zhou	5f17c57174	Update chat agents' output parser to extract action by regex (#7511 ) Currently `ChatOutputParser` extracts actions by splitting the text on "```" and then loading the second part as a JSON string. But sometimes the LLM will wrap the action in a markdown code block like: ````markdown ```json { "action": "foo", "action_input": "bar" } ``` ```` Splitting the text on "```" will cause an `OutputParserException` in such a case. This PR changes the behaviour to extract the `$JSON_BLOB` by regex, so that it can handle both ` ``` ``` ` and ` ```json ``` ` @hinthornw --------- Co-authored-by: Junlin Zhou <jlzhou@zjuici.com>	2023-07-12 03:12:02 -04:00
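The regex-based extraction described above might look roughly like this. The pattern and the `extract_action` helper are assumptions for illustration; the PR's actual expression may differ.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Match an opening fence with an optional "json" tag, capture the JSON object,
# then require the closing fence.
ACTION_PATTERN = re.compile(
    FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, re.DOTALL
)


def extract_action(text: str) -> dict:
    """Hypothetical helper: pull the action blob out of an LLM reply."""
    match = ACTION_PATTERN.search(text)
    if match is None:
        raise ValueError("could not find a JSON action block")
    return json.loads(match.group(1))


blob = '{"action": "foo", "action_input": "bar"}'
plain_reply = "Thought: use a tool\n" + FENCE + "\n" + blob + "\n" + FENCE
tagged_reply = "Thought: use a tool\n" + FENCE + "json\n" + blob + "\n" + FENCE

print(extract_action(plain_reply))
print(extract_action(tagged_reply))
```

Both the untagged fence and the `json`-tagged fence yield the same dictionary, which is the case that plain splitting on the fence could not handle.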
Bagatur	ebcb144342	unit test sqlalchemy (#7582 )	2023-07-12 03:03:16 -04:00
Bagatur	2babe3069f	Revert pinecone v4 support (#7566 ) Revert `9d13dcd`	2023-07-11 20:58:59 -04:00
Kacper Łukawski	1f83b5f47e	Reuse the existing collection if configured properly in Qdrant.from_texts (#7530 ) This PR changes the behavior of `Qdrant.from_texts` so the collection is reused if not requested to recreate it. Previously, calling `Qdrant.from_texts` or `Qdrant.from_documents` resulted in removing the old data which was confusing for many.	2023-07-11 16:24:35 -04:00
Leonid Kuligin	6674b33cf5	Added support for chat_history (#7555 ) #7469 Co-authored-by: Leonid Kuligin <kuligin@google.com>	2023-07-11 15:27:26 -04:00
Boris	9129318466	CPAL (#6255 ) # Causal program-aided language (CPAL) chain ## Motivation This builds on the recent [PAL](https://arxiv.org/abs/2211.10435) to stop LLM hallucination. The problem with the [PAL](https://arxiv.org/abs/2211.10435) approach is that it hallucinates on a math problem with a nested chain of dependence. The innovation here is that this new CPAL approach includes causal structure to fix hallucination. For example, using the below word problem, PAL answers with 5, and CPAL answers with 13. "Tim buys the same number of pets as Cindy and Boris." "Cindy buys the same number of pets as Bill plus Bob." "Boris buys the same number of pets as Ben plus Beth." "Bill buys the same number of pets as Obama." "Bob buys the same number of pets as Obama." "Ben buys the same number of pets as Obama." "Beth buys the same number of pets as Obama." "If Obama buys one pet, how many pets total does everyone buy?" The CPAL chain represents the causal structure of the above narrative as a causal graph or DAG, which it can also plot, as shown below. ![complex-graph](https://github.com/hwchase17/langchain/assets/367522/d938db15-f941-493d-8605-536ad530f576) . The two major sections below are: 1. Technical overview 2. Future application Also see [this jupyter notebook](https://github.com/borisdev/langchain/blob/master/docs/extras/modules/chains/additional/cpal.ipynb) doc. ## 1. Technical overview ### CPAL versus PAL Like [PAL](https://arxiv.org/abs/2211.10435), CPAL intends to reduce large language model (LLM) hallucination. The CPAL chain is different from the PAL chain for a couple of reasons. * CPAL adds a causal structure (or DAG) to link entity actions (or math expressions). * The CPAL math expressions are modeling a chain of cause and effect relations, which can be intervened upon, whereas for the PAL chain math expressions are projected math identities. PAL's generated python code is wrong. It hallucinates when complexity increases. 
```python def solution(): """Tim buys the same number of pets as Cindy and Boris.Cindy buys the same number of pets as Bill plus Bob.Boris buys the same number of pets as Ben plus Beth.Bill buys the same number of pets as Obama.Bob buys the same number of pets as Obama.Ben buys the same number of pets as Obama.Beth buys the same number of pets as Obama.If Obama buys one pet, how many pets total does everyone buy?""" obama_pets = 1 tim_pets = obama_pets cindy_pets = obama_pets + obama_pets boris_pets = obama_pets + obama_pets total_pets = tim_pets + cindy_pets + boris_pets result = total_pets return result # math result is 5 ``` CPAL's generated python code is correct. ```python story outcome data name code value depends_on 0 obama pass 1.0 [] 1 bill bill.value = obama.value 1.0 [obama] 2 bob bob.value = obama.value 1.0 [obama] 3 ben ben.value = obama.value 1.0 [obama] 4 beth beth.value = obama.value 1.0 [obama] 5 cindy cindy.value = bill.value + bob.value 2.0 [bill, bob] 6 boris boris.value = ben.value + beth.value 2.0 [ben, beth] 7 tim tim.value = cindy.value + boris.value 4.0 [cindy, boris] query data { "question": "how many pets total does everyone buy?", "expression": "SELECT SUM(value) FROM df", "llm_error_msg": "" } # query result is 13 ``` Based on the comments below, CPAL's intended location in the library is `experimental/chains/cpal` and PAL's location is `chains/pal`. ### CPAL vs Graph QA Both the CPAL chain and the Graph QA chain extract entity-action-entity relations into a DAG. The CPAL chain is different from the Graph QA chain for a few reasons. * Graph QA does not connect entities to math expressions * Graph QA does not associate actions in a sequence of dependence. * Graph QA does not decompose the narrative into these three parts: 1. Story plot or causal model 2. Hypothetical question 3. Hypothetical condition 
### Evaluation Preliminary evaluation on simple math word problems shows that this CPAL chain hallucinates less than the PAL chain when answering questions about a causal narrative. Two examples are in [this jupyter notebook](https://github.com/borisdev/langchain/blob/master/docs/extras/modules/chains/additional/cpal.ipynb) doc. ## 2. Future application ### "Describe as Narrative, Test as Code" The thesis here is that the Describe as Narrative, Test as Code approach allows you to represent a causal mental model both as code and as a narrative, giving you the best of both worlds. #### Why describe a causal mental model as a narrative? The narrative form is quick. At a consensus building meeting, people use narratives to persuade others of their causal mental model, a.k.a. plan. You can share, version control and index a narrative. #### Why test a causal mental model as code? Code is testable; complex narratives are not. Though fast, narratives are problematic as their complexity increases. The problem is that LLMs and humans are prone to hallucination when predicting the outcomes of a narrative. The cost of building a consensus around the validity of a narrative outcome grows as its narrative complexity increases. Code does not require tribal knowledge or social power to validate. Code is composable; complex narratives are not. The answer of one CPAL chain can be the hypothetical conditions of another CPAL Chain. For stochastic simulations, a composable plan can be integrated with the [DoWhy library](https://github.com/py-why/dowhy). Lastly, for the futuristic folk, a composable plan as code allows ordinary community folk to design a plan that can be integrated with a blockchain for funding. An explanation of a dependency planning application is [here.](https://github.com/borisdev/cpal-llm-chain-demo) --- Twitter handle: @boris_dev --------- Co-authored-by: Boris Dev <borisdev@Boriss-MacBook-Air.local>	2023-07-11 10:11:21 -04:00
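The causal table in the CPAL commit above can be checked by hand with a small dependency-graph evaluation. This sketch only mirrors the `depends_on` column from the message and assumes each non-root entity is the sum of its dependencies; the `evaluate` helper is hypothetical and is not the CPAL chain's code.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List

# Each entity maps to the entities it depends on (from the table above).
depends_on: Dict[str, List[str]] = {
    "obama": [],
    "bill": ["obama"],
    "bob": ["obama"],
    "ben": ["obama"],
    "beth": ["obama"],
    "cindy": ["bill", "bob"],
    "boris": ["ben", "beth"],
    "tim": ["cindy", "boris"],
}


def evaluate(graph: Dict[str, List[str]], roots: Dict[str, float]) -> Dict[str, float]:
    """Evaluate every entity in dependency order (hypothetical helper)."""
    values: Dict[str, float] = {}
    for name in TopologicalSorter(graph).static_order():
        deps = graph[name]
        # Root entities take their given value; others sum their dependencies.
        values[name] = roots[name] if not deps else sum(values[d] for d in deps)
    return values


values = evaluate(depends_on, {"obama": 1})
total = sum(values.values())
print(values["tim"], total)

# Intervening on the root condition ("what if Obama buys two pets?").
total_if_two = sum(evaluate(depends_on, {"obama": 2}).values())
print(total_if_two)
```

Because every value is derived from the graph rather than hard-coded, changing the hypothetical condition at the root propagates through the whole narrative.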
Hashem Alsaket	1dd4236177	Fix HF endpoint returns blank for text-generation (#7386 ) Description: Current `_call` function in the `langchain.llms.HuggingFaceEndpoint` class truncates response when `task=text-generation`. Same error discussed a few days ago on Hugging Face: https://huggingface.co/tiiuae/falcon-40b-instruct/discussions/51 Issue: Fixes #7353 Tag maintainer: @hwchase17 @baskaryan @hinthornw --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-07-11 03:06:05 -04:00
Raymond Yuan	5171c3bcca	Refactor vector storage to correctly handle relevancy scores (#6570 ) Description: This pull request aims to support generating the correct generic relevancy scores for different vector stores by refactoring the relevance score functions and their selection in the base class and subclasses of VectorStore. This is especially relevant with VectorStores that require a distance metric upon initialization. Note that many of the current implementations of `_similarity_search_with_relevance_scores` are not technically correct, as they just return `self.similarity_search_with_score(query, k, **kwargs)` without applying the relevant score function. Also includes changes associated with: https://github.com/hwchase17/langchain/pull/6564 and https://github.com/hwchase17/langchain/pull/6494 See the more in-depth discussion in the thread in #6494 Issue: https://github.com/hwchase17/langchain/issues/6526 https://github.com/hwchase17/langchain/issues/6481 https://github.com/hwchase17/langchain/issues/6346 Dependencies: None The changes include: - Properly handling score thresholding in FAISS `similarity_search_with_score_by_vector` for the corresponding distance metric. - Refactoring the `_similarity_search_with_relevance_scores` method in the base class and removing it from the subclasses for incorrectly implemented subclasses. - Adding a `_select_relevance_score_fn` method in the base class and implementing it in the subclasses to select the appropriate relevance score function based on the distance strategy. - Updating the `__init__` methods of the subclasses to set the `relevance_score_fn` attribute. - Removing the `_default_relevance_score_fn` function from the FAISS class and using the base class's `_euclidean_relevance_score_fn` instead. - Adding the `DistanceStrategy` enum to the `utils.py` file and updating the imports in the vector store classes. - Updating the tests to import the `DistanceStrategy` enum from the `utils.py` file. 
--------- Co-authored-by: Hanit <37485638+hanit-com@users.noreply.github.com>	2023-07-10 20:37:03 -07:00
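The idea of selecting a relevance score function from the distance strategy can be sketched as follows. Everything here is an assumption for illustration: the enum members, the function names, and the formulas (which assume unit-normalized embeddings) are stand-ins and may not match the refactored `VectorStore` code.

```python
import math
from enum import Enum
from typing import Callable


class Strategy(Enum):
    """Hypothetical stand-in for a distance strategy enum."""

    EUCLIDEAN = "euclidean"
    COSINE = "cosine"


def euclidean_relevance(distance: float) -> float:
    # Assumes unit-length embeddings; this sketch treats sqrt(2) as the
    # largest distance of interest and maps [0, sqrt(2)] onto [1, 0].
    return 1.0 - distance / math.sqrt(2)


def cosine_relevance(distance: float) -> float:
    # Cosine distance of 0 means identical direction, so relevance is 1.
    return 1.0 - distance


def select_relevance_score_fn(strategy: Strategy) -> Callable[[float], float]:
    """Pick the conversion that matches how the store measures distance."""
    if strategy is Strategy.EUCLIDEAN:
        return euclidean_relevance
    if strategy is Strategy.COSINE:
        return cosine_relevance
    raise ValueError(f"no relevance score function for {strategy}")


score_fn = select_relevance_score_fn(Strategy.COSINE)
print(score_fn(0.25))
```

Returning raw distances as if they were relevance scores, which the commit calls out as incorrect, would skip this conversion step entirely.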
Stanko Kuveljic	9d13dcd17c	Pinecone: Add V4 support (#7473 )	2023-07-10 08:39:47 -07:00
Adilkhan Sarsen	5debd5043e	Added deeplake use case examples of the new features (#6528 ) 1. Added use cases of the new features 2. Done some code refactoring --------- Co-authored-by: Ivo Stranic <istranic@gmail.com>	2023-07-10 07:04:29 -07:00
Yifei Song	7d29bb2c02	Add Xorbits Dataframe as a Document Loader (#7319 ) - [Xorbits](https://doc.xorbits.io/en/latest/) is an open-source computing framework that makes it easy to scale data science and machine learning workloads in parallel. Xorbits can leverage multi cores or GPUs to accelerate computation on a single machine, or scale out up to thousands of machines to support processing terabytes of data. - This PR added support for the Xorbits document loader, which allows langchain to leverage Xorbits to parallelize and distribute the loading of data. - Dependencies: This change requires the Xorbits library to be installed in order to be used. `pip install xorbits` - Request for review: @rlancemartin, @eyurtsev - Twitter handle: https://twitter.com/Xorbitsio Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-07-10 04:24:47 -04:00
Sergio Moreno	21a353e9c2	feat: ctransformers support async chain (#6859 ) - Description: Adding an async method for CTransformers - Issue: Without this code, I found it impossible to run WebSockets inside a FastAPI microservice with a CTransformers model. - Tag maintainer: Not necessary yet, I don't like to mention directly - Twitter handle: @_semoal	2023-07-10 04:23:41 -04:00
Paul-Emile Brotons	d2cf0d16b3	adding max_marginal_relevance_search method to MongoDBAtlasVectorSearch (#7310 ) Adding a max_marginal_relevance_search method to the MongoDBAtlasVectorSearch vectorstore enhances the user experience by providing more diverse search results. Issue: #7304	2023-07-10 04:04:19 -04:00
Matt Robinson	bcab894f4e	feat: Add `UnstructuredTSVLoader` (#7367 ) ### Summary Adds an `UnstructuredTSVLoader` for TSV files. Also updates the doc strings for `UnstructuredCSV` and `UnstructuredExcel` loaders. ### Testing ```python from langchain.document_loaders.tsv import UnstructuredTSVLoader loader = UnstructuredTSVLoader( file_path="example_data/mlb_teams_2012.csv", mode="elements" ) docs = loader.load() ```	2023-07-10 03:07:10 -04:00
Jona Sassenhagen	7ffc431b3a	Add spacy sentencizer (#7442 ) `SpacyTextSplitter` currently uses spacy's statistics-based `en_core_web_sm` model for sentence splitting. This is a good splitter, but it's also pretty slow, and in this case it's doing a lot of work that's not needed given that the spacy parse is then just thrown away. However, there is also a simple rules-based spacy sentencizer. Using this is at least an order of magnitude faster than using `en_core_web_sm` according to my local tests. Also, spacy sentence tokenization based on `en_core_web_sm` can be sped up in this case by not doing the NER stage. This shaves some cycles too, both when loading the model and when parsing the text. Consequently, this PR adds the option to use the basic spacy sentencizer, and it disables the NER stage for the current approach, which is kept as the default. Lastly, when extracting the tokenized sentences, the `text` attribute is called directly instead of doing the string conversion, which is IMO a bit more idiomatic.	2023-07-10 02:52:05 -04:00


736 Commits