langchain

mirror of https://github.com/hwchase17/langchain synced 2024-11-04 06:00:26 +00:00

Author	SHA1	Message	Date
cs0lar	8b9e02da9d	Fix/issue 1213 (#2932 ) ### Background Continuing to implement all the interface methods defined by the `VectorStore` class. This PR pertains to implementation of the `max_marginal_relevance_search` method. ### Changes - a `max_marginal_relevance_search` method implementation has been added in `weaviate.py` - tests have been added for the new method - vcr cassettes have been added for the weaviate tests ### Test Plan Added tests for the `max_marginal_relevance_search` implementation ### Change Safety - [x] I have added tests to cover my changes	2023-04-16 13:11:30 -07:00
Harrison Chase	4c02f4bc30	Fix bug in svm.LinearSVC, add support for a relevancy_threshold (#2959 ) (#2981 ) - Modify SVMRetriever class to add an optional relevancy_threshold - Modify SVMRetriever.get_relevant_documents method to filter out documents with similarity scores below the relevancy threshold - Normalized the similarities to be between 0 and 1 so the relevancy_threshold makes more sense - The number of results are limited to the top k documents or the maximum number of relevant documents above the threshold, whichever is smaller This code will now return the top self.k results (or less, if there are not enough results that meet the self.relevancy_threshold criteria). The svm.LinearSVC implementation in scikit-learn is non-deterministic, which means SVMRetriever.from_texts(["bar", "world", "foo", "hello", "foo bar"]) could return [3 0 5 4 2 1] instead of [0 3 5 4 2 1] with a query of "foo". If you pass in multiple "foo" texts, the order could be different each time. Here, we only care if the 0 is the first element, otherwise it will offset the text and similarities. Example: ```python retriever = SVMRetriever.from_texts( ["foo", "bar", "world", "hello", "foo bar"], OpenAIEmbeddings(), k=4, relevancy_threshold=.25 ) result = retriever.get_relevant_documents("foo") ``` yields ```python [Document(page_content='foo', metadata={}), Document(page_content='foo bar', metadata={})] ``` --------- Co-authored-by: Brandon Sandoval <52767641+account00001@users.noreply.github.com>	2023-04-16 12:57:18 -07:00
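The filtering described in #2959 can be sketched without scikit-learn. This is a minimal illustration, not the commit's code: the helper name `filter_by_relevancy`, the min-max normalization, and the small epsilon are assumptions chosen to match the description (scores normalized to between 0 and 1, then cut by threshold and by `k`).

```python
from typing import List, Optional, Tuple


def filter_by_relevancy(
    scores: List[float], k: int, relevancy_threshold: Optional[float] = None
) -> List[Tuple[int, float]]:
    """Return up to k (index, normalized_score) pairs, best first."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    # Min-max normalize to [0, 1]; the epsilon avoids division by zero
    # when every score is identical (assumed detail, not from the commit).
    normalized = [(s - lo) / (hi - lo + 1e-6) for s in scores]
    ranked = sorted(enumerate(normalized), key=lambda pair: pair[1], reverse=True)
    if relevancy_threshold is not None:
        # Drop everything below the threshold before applying the top-k cut.
        ranked = [pair for pair in ranked if pair[1] >= relevancy_threshold]
    return ranked[:k]
```

With a threshold set, the result can be shorter than `k`, which matches the "top k or fewer" behavior the message describes.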
Mauricio Scheffer	7302787a7b	Fix docs for parse_with_prompt (#2986 )	2023-04-16 12:57:04 -07:00
Paul Garner	69698be3e6	consistently use getLogger(__name__), no root logger (#2989 ) re https://github.com/hwchase17/langchain/issues/439#issuecomment-1510442791 I think it's not polite for a library to use the root logger both of these forms are also used: ``` logger = logging.getLogger(__name__) logger = logging.getLogger(__file__) ``` I am not sure if there is any reason behind one vs the other? (...I am guessing maybe just contributed by different people) it seems to me it'd be better to consistently use `logging.getLogger(__name__)` this makes it easier for consumers of the library to set up log handlers, e.g. for everything with `langchain.` prefix	2023-04-16 12:49:35 -07:00
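The benefit argued for in #2989 can be shown with the standard `logging` module alone. In this sketch the module path `langchain.chains.base` stands in for `__name__`, and `ListHandler` is a hypothetical handler used only to capture output.

```python
import logging

# Inside a library module, logging.getLogger(__name__) yields a logger named
# after the import path. A literal name stands in for __name__ here.
library_logger = logging.getLogger("langchain.chains.base")

records = []


class ListHandler(logging.Handler):
    """Hypothetical handler that collects formatted records in a list."""

    def emit(self, record: logging.LogRecord) -> None:
        records.append(f"{record.name}: {record.getMessage()}")


# In the consuming application: a single handler on the "langchain" parent
# logger receives records from every "langchain.*" child logger.
parent = logging.getLogger("langchain")
parent.setLevel(logging.INFO)
parent.addHandler(ListHandler())

library_logger.info("chain started")
```

Loggers named with `__file__` would not share the dotted `langchain.` prefix, so this kind of single-point configuration would not reach them.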
Harrison Chase	32db2a2c2f	fix lint	2023-04-16 10:56:19 -07:00
Azam Iftikhar	1e655d5ffd	Fixed Regular expression (#2933 ) ### https://github.com/hwchase17/langchain/issues/2898 Instead of `"Action" and "Action Input"` keywords, we are getting `"Action 1" and "Action 1 Input" or "Action Input 1" ` from gpt-3.5-turbo Updated the Regular expression to handle all these cases Attaching the screenshot of the result from the updated Regular expression. <img width="1036" alt="Screenshot 2023-04-16 at 1 39 00 AM" src="https://user-images.githubusercontent.com/55012400/232251184-23ca6cc2-7229-411a-b6e1-53b2f5ec18a5.png">	2023-04-16 09:16:50 -07:00
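A pattern tolerant of the numbered variants listed in #2933 might look like the following. It is an illustrative sketch, not necessarily the expression merged in the commit; `ACTION_PATTERN` and `parse_action` are hypothetical names.

```python
import re
from typing import Optional, Tuple

# Optional digits are allowed after "Action" and on either side of "Input",
# covering "Action 1", "Action 1 Input" and "Action Input 1".
ACTION_PATTERN = re.compile(
    r"Action\s*\d*\s*:\s*(.*?)\s*\n\s*Action\s*\d*\s*Input\s*\d*\s*:\s*(.*)",
    re.DOTALL,
)


def parse_action(text: str) -> Optional[Tuple[str, str]]:
    """Return (action, action_input), or None when the text does not match."""
    match = ACTION_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()
```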
Harrison Chase	88d3ce12b8	Harrison/diffbot (#2984 ) Co-authored-by: Manuel Saelices <msaelices@gmail.com>	2023-04-16 09:11:24 -07:00
vowelparrot	5ca7ce77cd	Remove pythonrepl from LLM-MathChain (#2943 ) Use numexpr evaluate instead of the python REPL to avoid malicious code injection. Tested against the (limited) math dataset and got the same score as before. For more permissive tools (like the REPL tool itself), other approaches ought to be provided (some combination of Sanitizer + Restricted python + unprivileged-docker + ...), but for a calculator tool, only mathematical expressions should be permitted. See https://github.com/hwchase17/langchain/issues/814	2023-04-16 08:50:32 -07:00
Daniel Nouri	2a0f65f7af	tiktoken: Relax Python version check (#2966 ) tiktoken supports Python >= 3.8, see here: `e1c661edf3/pyproject.toml (L10)` Also works fine when trying locally!	2023-04-16 08:44:21 -07:00
Chetanya Rastogi	aead062a70	Add an example tutorial for using PDFMinerPDFasHTMLLoader (#2960 ) Last week I added the `PDFMinerPDFasHTMLLoader`. I am adding some example code in the notebook to serve as a tutorial for how that loader can be used to create snippets of a pdf that are structured within sections. All the other loaders only provide the `Document` objects segmented by pages but that's pretty loose given the amount of other metadata that can be extracted. With the new loader, one can leverage font-size of the text to decide when a new sections starts and can segment the text more semantically as shown in the tutorial notebook. The cell shows that we are able to find the content of entire section under Related Work for the example pdf which is spread across 2 pages and hence is stored as two separate documents by other loaders	2023-04-16 08:34:39 -07:00
Tim Asp	51894ddd98	allow tokentextsplitters to use model name to select encoder (#2963 ) Fixes a bug I was seeing when the `TokenTextSplitter` was correctly splitting text under the gpt3.5-turbo token limit, but when firing the prompt off to openai, it'd come back with an error that we were over the context limit. gpt3.5-turbo and gpt-4 use `cl100k_base` tokenizer, and so the counts are just always off with the default `gpt-2` encoder. It's possible to pass along the encoding to the `TokenTextSplitter`, but it's much simpler to pass the model name of the LLM. No more concern about keeping the tokenizer and llm model in sync :)	2023-04-16 08:33:47 -07:00
Alex Iribarren	706ebd8f9c	Enforce maximum Wikipedia query length (#2969 ) I got the following stacktrace when the agent was trying to search Wikipedia with a huge query: ``` Thought:{ "action": "Wikipedia", "action_input": "Outstanding is a song originally performed by the Gap Band and written by member Raymond Calhoun. The song originally appeared on the group's platinum-selling 1982 album Gap Band IV. It is one of their signature songs and biggest hits, reaching the number one spot on the U.S. R&B Singles Chart in February 1983. \"Outstanding\" peaked at number 51 on the Billboard Hot 100." } Traceback (most recent call last): File "/usr/src/app/tests/chat.py", line 121, in <module> answer = agent_chain.run(input=question) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/langchain/chains/base.py", line 216, in run return self(kwargs)[self.output_keys[0]] ^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/langchain/chains/base.py", line 116, in __call__ raise e File "/usr/local/lib/python3.11/site-packages/langchain/chains/base.py", line 113, in __call__ outputs = self._call(inputs) ^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/langchain/agents/agent.py", line 828, in _call next_step_output = self._take_next_step( ^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/langchain/agents/agent.py", line 725, in _take_next_step observation = tool.run( ^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/langchain/tools/base.py", line 73, in run raise e File "/usr/local/lib/python3.11/site-packages/langchain/tools/base.py", line 70, in run observation = self._run(tool_input) ^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/langchain/agents/tools.py", line 17, in _run return self.func(tool_input) ^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/langchain/utilities/wikipedia.py", line 40, in run search_results = self.wiki_client.search(query) 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/wikipedia/util.py", line 28, in __call__ ret = self._cache[key] = self.fn(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/wikipedia/wikipedia.py", line 109, in search raise WikipediaException(raw_results['error']['info']) wikipedia.exceptions.WikipediaException: An unknown error occured: "Search request is longer than the maximum allowed length. (Actual: 373; allowed: 300)". Please report it on GitHub! ``` This commit limits the maximum size of the query passed to Wikipedia to avoid this issue.	2023-04-16 08:30:57 -07:00
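The fix in #2969 amounts to clamping the query before it reaches the Wikipedia client. A minimal sketch, assuming simple truncation; the constant and function names are hypothetical, and the 300-character limit is taken from the error message above.

```python
# Limit reported by the Wikipedia API error in the traceback above.
WIKIPEDIA_MAX_QUERY_LENGTH = 300


def clamp_query(query: str, max_length: int = WIKIPEDIA_MAX_QUERY_LENGTH) -> str:
    """Truncate the query so the search request stays within the limit."""
    return query[:max_length]
```

Truncation keeps the start of the query, which for agent-generated text like the example above is usually where the subject appears.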
Nahin Khan	9a03f00e6c	Fix typos (#2977 )	2023-04-16 08:28:36 -07:00
Altay Sansal	9d8ab28837	Add `top_k` and `filter` fields to `ChatGPTPluginRetriever` (#2852 ) This allows to adjust the number of results to retrieve and filter documents based on metadata. --------- Co-authored-by: Altay Sansal <altay.sansal@tgs.com>	2023-04-15 21:07:53 -07:00
vowelparrot	4ffc58e07b	Add similarity_search_with_normalized_similarities (#2916 ) Add a method that exposes a similarity search with corresponding normalized similarity scores. Implement only for FAISS now. ### Motivation: Some memory definitions combine `relevance` with other scores, like recency, importance, etc. While many (but not all) of the `VectorStore`'s expose a `similarity_search_with_score` method, they don't all interpret the units of that score (depends on the distance metric and whether or not the embeddings are normalized). This PR proposes a `similarity_search_with_normalized_similarities` method that lets consumers of the vector store not have to worry about the metric and embedding scale. Most providers default to euclidean distance, with Pinecone being one exception (defaults to cosine _similarity_). --------- Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>	2023-04-15 21:06:08 -07:00
Tim Asp	b9db20481f	Fix wrong token counts from `get_num_tokens` from openai llms (#2952 ) The encoding fetch was out of date. Luckily OpenAI has a nice [`encoding_for_model`](`46287bfa49/tiktoken/model.py`) function in `tiktoken` we can use now.	2023-04-15 16:09:17 -07:00
Tim Asp	fea5619ce9	Add title, lang, description to Web loader document metadata (#2955 ) Title, lang and description are on almost every web page, and are incredibly useful pieces of information that aren't currently captured by the web base loader. I thought about adding the title and description to the content of the document, as that content could be useful in search, but I left it out for right now. If you think it'd be worth adding, happy to add it. I've found it's nice to have the title/description in the metadata to have some structured data when retrieving rows from vectordbs for use with summary and source citation, so if we do want to add it to the `page_content`, I'd advocate for it to also be included in metadata.	2023-04-15 16:07:08 -07:00
Maciej Pióro	f7bf917baf	Fix missing docker-compose (#2899 ) Fix missing `docker-compose` command if only `docker compose` (note space) is available.	2023-04-15 16:05:11 -07:00
Harrison Chase	b634489b2e	bump version to 141 (#2950 )	2023-04-15 12:56:39 -07:00
Harrison Chase	274b25c010	SVM retriever (#2947 ) (#2949 ) Add SVM retriever class, based on https://github.com/karpathy/randomfun/blob/master/knn_vs_svm.ipynb. Testing still WIP, but the logic is correct (I have a local implementation outside of Langchain working). --------- Co-authored-by: Lance Martin <122662504+PineappleExpress808@users.noreply.github.com> Co-authored-by: rlm <31treehaus@31s-MacBook-Pro.local>	2023-04-15 12:49:59 -07:00
Harrison Chase	baf350e32b	parametrize redis (#2946 )	2023-04-15 12:47:36 -07:00
dev2049	36aa7f30e4	Move PythonRepl -> langchain.utilities (#2917 )	2023-04-15 10:50:25 -07:00
dev2049	7c73e9df5d	Add kwargs to VectorStore.maximum_marginal_relevance (#2921 ) Same as similarity_search, allows child classes to add vector store-specific args (this was technically already happening in couple places but now typing is correct).	2023-04-15 10:49:49 -07:00
Davit Buniatyan	b3a5b51728	[minor] Deep Lake auth improvements in docs, kwargs pass, faster tests (#2927 ) Minor cosmetic changes - Activeloop environment cred authentication in notebooks with `getpass.getpass` (instead of the CLI, which does not always work) - much faster tests with Deep Lake pytest mode on - Deep Lake kwargs pass Notes - I put pytest environment creds inside `vectorstores/conftest.py`, but feel free to suggest a better location. For context, if I put them in `test_deeplake.py`, `ruff` doesn't let me set them before importing deeplake --------- Co-authored-by: Davit Buniatyan <d@activeloop.ai>	2023-04-15 10:49:16 -07:00
Harrison Chase	c4ae8c1d24	bump ver to 140 (#2895 )	2023-04-15 09:23:19 -07:00
Nahin Khan	ad3973a3b8	Fix typo (#2942 )	2023-04-15 08:53:25 -07:00
Harrison Chase	cf2789d86d	delete antropic chat notebook (#2945 )	2023-04-15 08:48:51 -07:00
Hai Nguyen Mau	0aa828b1dc	typo fix (#2937 ) missing w in link	2023-04-15 08:31:43 -07:00
Ankush Gola	ec59e9d886	Fix ChatAnthropic stop_sequences error (#2919 ) (#2920 ) Note to self: Always run integration tests, even on "that last minute change you thought would be safe" :) --------- Co-authored-by: Mike Lambert <mike.lambert@anthropic.com>	2023-04-14 17:22:01 -07:00
Akash NP	13a0ed064b	add encoding to avoid UnicodeDecodeError (#2908 ) About Specify encoding to avoid UnicodeDecodeError when reading .txt for users who are following the tutorial. Reference ``` return codecs.charmap_decode(input,self.errors,decoding_table)[0] UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1205: character maps to <undefined> ``` Environment OS: Win 11 Python: 3.8	2023-04-14 16:36:03 -07:00
Mike Lambert	392f1b3218	Add Anthropic ChatModel to langchain (#2293 ) * Adds an Anthropic ChatModel * Factors out common code in our LLMModel and ChatModel * Supports streaming llm-tokens to the callbacks on a delta basis (until a future V2 API does that for us) * Some fixes	2023-04-14 15:09:07 -07:00
Kwuang Tang	66bef1d7ed	Ignore files from .gitignore in Git loader (#2909 ) fixes #2905 extends #2851	2023-04-14 15:02:21 -07:00
Boris Feld	7ee87eb0c8	Comet callback updates (#2889 ) I'm working with @DN6 and I made some small fixes and improvements after playing with the integration.	2023-04-14 13:19:58 -07:00
dev2049	634358db5e	Fix OpenAI LLM docstring (#2910 )	2023-04-14 11:09:36 -07:00
pranjaldoshi96	30573b2e30	Correct instruction to use openweathermap utility in docstring (#2906 ) Co-authored-by: Pranjal Doshi <pranjald@nvidia.com>	2023-04-14 10:46:20 -07:00
Kwuang Tang	a508afa91c	Add file filter param to Git loader (#2904 ) Allows users to specify what files should be loaded instead of indiscriminately loading the entire repo. extends #2851 NOTE: for reviewers, `hide whitespace` option recommended since I changed the indentation of an if-block to use `continue` instead so it looks less like a Christmas tree :)	2023-04-14 10:45:54 -07:00
Ismail Pelaseyed	7e525a3b91	Add link to repo for deploying LangChain to Digitalocean App Platform (#2894 ) This PR adds a link to a minimal example of deploying `LangChain` to `Digitalocean App Platform`.	2023-04-14 08:55:21 -07:00
Peter Stolz	ccacf804a8	Fix format string in pinecone error handling (#2897 )	2023-04-14 08:53:02 -07:00
Francis Felici	86189cdcf9	Update load_qa_chain() docstring (#2900 ) Seems to be missing `map_rerank` as a potential argument of `chain_type`	2023-04-14 08:51:30 -07:00
Harrison Chase	8fef69296d	nits (#2873 )	2023-04-14 07:55:12 -07:00
Harrison Chase	0a38bbc750	updates to vectorstore memory (#2875 )	2023-04-14 07:54:57 -07:00
Ikko Eltociear Ashimine	203c0eb2ae	docs: update getting_started.ipynb (#2883 ) HuggingFace -> Hugging Face	2023-04-14 07:40:26 -07:00
ecneladis	1a44b71ddf	Fix Baby AGI notebooks (#2882 ) - fix broken notebook cell in `ae485b623d` - Python Black formatting	2023-04-14 07:40:04 -07:00
Nicolas	3c7204d604	docs: Quick fix to Mendable Search (#2876 ) Fixed a small issue on the icon UI when using in Safari.	2023-04-13 23:15:57 -07:00
Harrison Chase	1e9378d0a8	Harrison/weaviate fixes (#2872 ) Co-authored-by: cs0lar <cristiano.solarino@gmail.com> Co-authored-by: cs0lar <cristiano.solarino@brightminded.com>	2023-04-13 22:37:34 -07:00
Harrison Chase	07d7096de6	Harrison/playwright (#2871 ) Co-authored-by: Manuel Saelices <msaelices@gmail.com>	2023-04-13 22:15:03 -07:00
Jon Luo	5565f56273	Use SQL dialect-specific prompts for SQLDatabaseChain (#2748 ) Mentioned the idea here initially: https://github.com/hwchase17/langchain/pull/2106#issuecomment-1487509106 Since there have been dialect-specific issues, we should use dialect-specific prompts. This way, each prompt can be separately modified to best suit each dialect as needed. This adds a prompt for each dialect supported in sqlalchemy (mssql, mysql, mariadb, postgres, oracle, sqlite). For this initial implementation, the only differences between the prompts are the instruction for the clause to use to limit the number of rows queried for, and the instruction for wrapping column names using each dialect's identifier quote character.	2023-04-13 22:10:49 -07:00
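The two per-dialect differences named in #2748 (row-limit clause and identifier quoting) can be sketched as lookup tables. The table names, the helper, and the instruction wording are hypothetical; the commit's actual prompts are separate templates whose text is not reproduced here.

```python
# Hypothetical lookup tables illustrating the two per-dialect differences.
ROW_LIMIT_CLAUSES = {
    "mssql": "TOP {k}",
    "oracle": "FETCH FIRST {k} ROWS ONLY",
}
DEFAULT_ROW_LIMIT_CLAUSE = "LIMIT {k}"  # mysql, mariadb, postgres, sqlite

IDENTIFIER_QUOTES = {
    "mysql": ("`", "`"),
    "mariadb": ("`", "`"),
    "mssql": ("[", "]"),
}
DEFAULT_IDENTIFIER_QUOTES = ('"', '"')  # postgres, oracle, sqlite


def dialect_instructions(dialect: str, k: int) -> str:
    """Build the dialect-specific part of a prompt (illustrative wording)."""
    limit = ROW_LIMIT_CLAUSES.get(dialect, DEFAULT_ROW_LIMIT_CLAUSE).format(k=k)
    left, right = IDENTIFIER_QUOTES.get(dialect, DEFAULT_IDENTIFIER_QUOTES)
    return (
        f"Query for at most {k} rows using the {limit} clause. "
        f"Wrap each column name as {left}column{right}."
    )
```

Keeping whole prompts separate per dialect, as the commit does, leaves room for dialect-specific changes beyond these two instructions.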
drod	9907cb0485	Refactor similarity_search function in elastic_vector_search.py (#2761 ) Optimization: limit search results when k < 10. Fix issue when k > 10: Elasticsearch will return only 10 docs. [default-search-result](https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html) By default, searches return the top 10 matching hits. Add size parameter to the search request to limit the number of returned results from Elasticsearch. Remove slicing of the hits list, since the response will already contain the desired number of results.	2023-04-13 22:09:00 -07:00
rafael	1cc7ea333c	chat_models.openai: Set tenacity timeout to openai's recommendation (#2768 ) [OpenAI's cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_handle_rate_limits.ipynb) suggests a tenacity backoff between 1 and 60 seconds. Currently langchain's backoff is between 4 and 10 seconds, which causes frequent timeout errors on my end. This PR changes the timeout to the suggested values.	2023-04-13 22:08:46 -07:00
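The effect of widening the backoff bounds in #2768 can be illustrated with a plain-Python clamped exponential wait. This is a sketch of the general technique, not tenacity's implementation; the function name, the doubling base, and the `multiplier` default are assumptions.

```python
def exponential_wait(
    attempt: int,
    multiplier: float = 1.0,
    min_wait: float = 1.0,
    max_wait: float = 60.0,
) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    raw = multiplier * (2 ** (attempt - 1))
    # Clamp the doubling sequence into [min_wait, max_wait].
    return max(min_wait, min(raw, max_wait))


# Bounds of 1 and 60 seconds let waits keep growing for several retries,
# whereas bounds of 4 and 10 seconds hit the ceiling almost immediately.
wide = [exponential_wait(n) for n in range(1, 9)]
narrow = [exponential_wait(n, min_wait=4.0, max_wait=10.0) for n in range(1, 6)]
```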
Harrison Chase	705596b46a	Harrison/fix create sql agent (#2870 ) Co-authored-by: Timothé Pearce <timothe.pearce@gmail.com>	2023-04-13 22:07:58 -07:00


1488 Commits