langchain

mirror of https://github.com/hwchase17/langchain synced 2024-10-29 17:07:25 +00:00

Author	SHA1	Message	Date
Nuno Campos	79bb5c4f95	Port format instructions fix from js (#3015 )	2023-04-17 20:29:17 -07:00
Harrison Chase	e3cf00b88b	redis from url (#3024 )	2023-04-17 20:28:12 -07:00
Davis Chase	19c85aa990	Factor out doc formatting and add validation (#3026 ) @cnhhoang850 slightly more generic fix for #2944, works for whatever the expected metadata keys are not just `source`	2023-04-17 20:28:01 -07:00
Naveen Tatikonda	3453b7457c	OpenSearch: Add Support for Boolean Filter with ANN search (#3038 ) ### Description Add Support for Boolean Filter with ANN search Documentation - https://opensearch.org/docs/latest/search-plugins/knn/filter-search-knn/#boolean-filter-with-ann-search ### Issues Resolved https://github.com/hwchase17/langchain/issues/2924 Signed-off-by: Naveen Tatikonda <navtat@amazon.com>	2023-04-17 20:26:26 -07:00
leo-gan	5420a0e404	updated langchain/docs/modules/models/llms/integrations/ notebooks (#3041 ) - Updated `langchain/docs/modules/models/llms/integrations/` notebooks: added links to the original sites, the install information, etc. - Added the `nlpcloud` notebook. - Removed "Example" from Titles of some notebooks, so all notebook titles are consistent.	2023-04-17 20:25:32 -07:00
Azam Iftikhar	471ef84835	Examples fixed (#3042 ) ### https://github.com/hwchase17/langchain/issues/2997 Replaced `conversation.memory.store` to `conversation.memory.entity_store.store` As conversation.memory.store doesn't exist and re-ran the whole file.	2023-04-17 20:25:01 -07:00
Tim Asp	dcdcd3f636	bugfix: throw exception if structured output parser doesn't get what it wants (#3044 ) allows the user to catch the issue and handle it rather than failing hard. This happens more than you'd expect when using output parsers with chatgpt, especially if the temp is anything but 0. Sometimes it doesn't want to listen and just does its own thing.	2023-04-17 20:24:40 -07:00
Harrison Chase	afd3e70ae5	Harrison/confluent loader (#2994 ) Co-authored-by: Justin Flick <Justinjayflick@gmail.com>	2023-04-17 20:23:45 -07:00
Altay Sansal	95d578d246	Fix type hint regression (#3033 ) Not sure what happened here but some of the file got overwritten by #2859 which broke filtering logic. Here is it fixed back to normal. @hwchase17 can we expedite this if possible :-) --------- Co-authored-by: Altay Sansal <altay.sansal@tgs.com>	2023-04-17 15:49:18 -07:00
Noah Gundotra	577ec92f16	Include testing instructions for getting setup in CONTRIBUTING.md (#3020 ) Running tests is good sanity check for new users to ensure their development environment is setup correctly.	2023-04-17 08:34:07 -07:00
Harrison Chase	98c70bc190	bump version to 142 (#3021 )	2023-04-17 08:00:00 -07:00
vowelparrot	2356447323	Update Characters notebook (#3019 ) - Most important - fixes the relevance_fn name in the notebook to align with the docs - Updates comments for the summary: <img width="787" alt="image" src="https://user-images.githubusercontent.com/130414180/232520616-2a99e8c3-a821-40c2-a0d5-3f3ea196c9bb.png"> - The new conversation is a bit better, still unfortunate they try to schedule a followup. - Rm the max dialogue turns argument to the conversation function	2023-04-17 07:48:48 -07:00
Harrison Chase	f1d15b4a75	update nb	2023-04-16 22:09:31 -07:00
Harrison Chase	e54f1b69ca	add notebook	2023-04-16 21:54:15 -07:00
vowelparrot	99c0382209	Generative Characters (#2859 ) Add a time-weighted memory retriever and a notebook that approximates a Generative Agent from https://arxiv.org/pdf/2304.03442.pdf The "daily plan" components are removed for now since they are less useful without a virtual world, but the memory is an interesting component to build off. --------- Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>	2023-04-16 21:41:00 -07:00
Jan Backes	a9310a3e8b	Add Annoy as VectorStore (#2939 ) Adds Annoy (https://github.com/spotify/annoy) as vector Store. RESOLVES hwchase17/langchain#2842 discord ref: https://discord.com/channels/1038097195422978059/1051632794427723827/1096089994168377354 --------- Co-authored-by: Harrison Chase <hw.chase.17@gmail.com> Co-authored-by: vowelparrot <130414180+vowelparrot@users.noreply.github.com>	2023-04-16 13:44:04 -07:00
Harrison Chase	e12e00df12	use output parsers in agents (#2987 )	2023-04-16 13:15:21 -07:00
cs0lar	8b9e02da9d	Fix/issue 1213 (#2932 ) ### Background Continuing to implement all the interface methods defined by the `VectorStore` class. This PR pertains to implementation of the `max_marginal_relevance_search` method. ### Changes - a `max_marginal_relevance_search` method implementation has been added in `weaviate.py` - tests have been added to the the new method - vcr cassettes have been added for the weaviate tests ### Test Plan Added tests for the `max_marginal_relevance_search` implementation ### Change Safety - [x] I have added tests to cover my changes	2023-04-16 13:11:30 -07:00
Harrison Chase	4c02f4bc30	Fix bug in svm.LinearSVC, add support for a relevancy_threshold (#2959 ) (#2981 ) - Modify SVMRetriever class to add an optional relevancy_threshold - Modify SVMRetriever.get_relevant_documents method to filter out documents with similarity scores below the relevancy threshold - Normalized the similarities to be between 0 and 1 so the relevancy_threshold makes more sense - The number of results are limited to the top k documents or the maximum number of relevant documents above the threshold, whichever is smaller This code will now return the top self.k results (or less, if there are not enough results that meet the self.relevancy_threshold criteria). The svm.LinearSVC implementation in scikit-learn is non-deterministic, which means SVMRetriever.from_texts(["bar", "world", "foo", "hello", "foo bar"]) could return [3 0 5 4 2 1] instead of [0 3 5 4 2 1] with a query of "foo". If you pass in multiple "foo" texts, the order could be different each time. Here, we only care if the 0 is the first element, otherwise it will offset the text and similarities. Example: ```python retriever = SVMRetriever.from_texts( ["foo", "bar", "world", "hello", "foo bar"], OpenAIEmbeddings(), k=4, relevancy_threshold=.25 ) result = retriever.get_relevant_documents("foo") ``` yields ```python [Document(page_content='foo', metadata={}), Document(page_content='foo bar', metadata={})] ``` --------- Co-authored-by: Brandon Sandoval <52767641+account00001@users.noreply.github.com>	2023-04-16 12:57:18 -07:00
Mauricio Scheffer	7302787a7b	Fix docs for parse_with_prompt (#2986 )	2023-04-16 12:57:04 -07:00
Paul Garner	69698be3e6	consistently use getLogger(__name__), no root logger (#2989 ) re https://github.com/hwchase17/langchain/issues/439#issuecomment-1510442791 I think it's not polite for a library to use the root logger both of these forms are also used: ``` logger = logging.getLogger(__name__) logger = logging.getLogger(__file__) ``` I am not sure if there is any reason behind one vs the other? (...I am guessing maybe just contributed by different people) it seems to me it'd be better to consistently use `logging.getLogger(__name__)` this makes it easier for consumers of the library to set up log handlers, e.g. for everything with `langchain.` prefix	2023-04-16 12:49:35 -07:00
Harrison Chase	32db2a2c2f	fix lint	2023-04-16 10:56:19 -07:00
Azam Iftikhar	1e655d5ffd	Fixed Regular expression (#2933 ) ### https://github.com/hwchase17/langchain/issues/2898 Instead of `"Action" and "Action Input"` keywords, we are getting `"Action 1" and "Action 1 Input" or "Action Input 1" ` from gpt-3.5-turbo Updated the Regular expression to handle all these cases Attaching the screenshot of the result from the updated Regular expression. <img width="1036" alt="Screenshot 2023-04-16 at 1 39 00 AM" src="https://user-images.githubusercontent.com/55012400/232251184-23ca6cc2-7229-411a-b6e1-53b2f5ec18a5.png">	2023-04-16 09:16:50 -07:00
Harrison Chase	88d3ce12b8	Harrison/diffbot (#2984 ) Co-authored-by: Manuel Saelices <msaelices@gmail.com>	2023-04-16 09:11:24 -07:00
vowelparrot	5ca7ce77cd	Remove pythonrepl from LLM-MathChain (#2943 ) Use numexpr evaluate instead of the python REPL to avoid malicious code injection. Tested against the (limited) math dataset and got the same score as before. For more permissive tools (like the REPL tool itself), other approaches ought to be provided (some combination of Sanitizer + Restricted python + unprivileged-docker + ...), but for a calculator tool, only mathematical expressions should be permitted. See https://github.com/hwchase17/langchain/issues/814	2023-04-16 08:50:32 -07:00
Daniel Nouri	2a0f65f7af	tiktoken: Relax Python version check (#2966 ) tiktoken supports Python >= 3.8, see here: `e1c661edf3/pyproject.toml (L10)` Also works fine when trying locally!	2023-04-16 08:44:21 -07:00
Chetanya Rastogi	aead062a70	Add an example tutorial for using PDFMinerPDFasHTMLLoader (#2960 ) Last week I added the `PDFMinerPDFasHTMLLoader`. I am adding some example code in the notebook to serve as a tutorial for how that loader can be used to create snippets of a pdf that are structured within sections. All the other loaders only provide the `Document` objects segmented by pages but that's pretty loose given the amount of other metadata that can be extracted. With the new loader, one can leverage font-size of the text to decide when a new sections starts and can segment the text more semantically as shown in the tutorial notebook. The cell shows that we are able to find the content of entire section under Related Work for the example pdf which is spread across 2 pages and hence is stored as two separate documents by other loaders	2023-04-16 08:34:39 -07:00
Tim Asp	51894ddd98	allow tokentextsplitters to use model name to select encoder (#2963 ) Fixes a bug I was seeing when the `TokenTextSplitter` was correctly splitting text under the gpt3.5-turbo token limit, but when firing the prompt off too openai, it'd come back with an error that we were over the context limit. gpt3.5-turbo and gpt-4 use `cl100k_base` tokenizer, and so the counts are just always off with the default `gpt-2` encoder. It's possible to pass along the encoding to the `TokenTextSplitter`, but it's much simpler to pass the model name of the LLM. No more concern about keeping the tokenizer and llm model in sync :)	2023-04-16 08:33:47 -07:00
Alex Iribarren	706ebd8f9c	Enforce maximum Wikipedia query length (#2969 ) I got the following stacktrace when the agent was trying to search Wikipedia with a huge query: ``` Thought:{ "action": "Wikipedia", "action_input": "Outstanding is a song originally performed by the Gap Band and written by member Raymond Calhoun. The song originally appeared on the group's platinum-selling 1982 album Gap Band IV. It is one of their signature songs and biggest hits, reaching the number one spot on the U.S. R&B Singles Chart in February 1983. \"Outstanding\" peaked at number 51 on the Billboard Hot 100." } Traceback (most recent call last): File "/usr/src/app/tests/chat.py", line 121, in <module> answer = agent_chain.run(input=question) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/langchain/chains/base.py", line 216, in run return self(kwargs)[self.output_keys[0]] ^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/langchain/chains/base.py", line 116, in __call__ raise e File "/usr/local/lib/python3.11/site-packages/langchain/chains/base.py", line 113, in __call__ outputs = self._call(inputs) ^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/langchain/agents/agent.py", line 828, in _call next_step_output = self._take_next_step( ^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/langchain/agents/agent.py", line 725, in _take_next_step observation = tool.run( ^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/langchain/tools/base.py", line 73, in run raise e File "/usr/local/lib/python3.11/site-packages/langchain/tools/base.py", line 70, in run observation = self._run(tool_input) ^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/langchain/agents/tools.py", line 17, in _run return self.func(tool_input) ^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/langchain/utilities/wikipedia.py", line 40, in run search_results = self.wiki_client.search(query) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/wikipedia/util.py", line 28, in __call__ ret = self._cache[key] = self.fn(args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/wikipedia/wikipedia.py", line 109, in search raise WikipediaException(raw_results['error']['info']) wikipedia.exceptions.WikipediaException: An unknown error occured: "Search request is longer than the maximum allowed length. (Actual: 373; allowed: 300)". Please report it on GitHub! ``` This commit limits the maximum size of the query passed to Wikipedia to avoid this issue.	2023-04-16 08:30:57 -07:00
Nahin Khan	9a03f00e6c	Fix typos (#2977 )	2023-04-16 08:28:36 -07:00
Altay Sansal	9d8ab28837	Add `top_k` and `filter` fields to `ChatGPTPluginRetriever` (#2852 ) This allows to adjust the number of results to retrieve and filter documents based on metadata. --------- Co-authored-by: Altay Sansal <altay.sansal@tgs.com>	2023-04-15 21:07:53 -07:00
vowelparrot	4ffc58e07b	Add similarity_search_with_normalized_similarities (#2916 ) Add a method that exposes a similarity search with corresponding normalized similarity scores. Implement only for FAISS now. ### Motivation: Some memory definitions combine `relevance` with other scores, like recency , importance, etc. While many (but not all) of the `VectorStore`'s expose a `similarity_search_with_score` method, they don't all interpret the units of that score (depends on the distance metric and whether or not the the embeddings are normalized). This PR proposes a `similarity_search_with_normalized_similarities` method that lets consumers of the vector store not have to worry about the metric and embedding scale. Most providers default to euclidean distance, with Pinecone being one exception (defaults to cosine _similarity_). --------- Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>	2023-04-15 21:06:08 -07:00
Tim Asp	b9db20481f	Fix wrong token counts from `get_num_tokens` from openai llms (#2952 ) The encoding fetch was out of date. Luckily OpenAI has a nice[ `encoding_for_model`](`46287bfa49/tiktoken/model.py`) function in `tiktoken` we can use now.	2023-04-15 16:09:17 -07:00
Tim Asp	fea5619ce9	Add title, lang, description to Web loader document metadata (#2955 ) Title, lang and description are on almost every web page, and are incredibly useful pieces of information that currently isn't captured with the current web base loader I thought about adding the title and description to the content of the document, as that content could be useful in search, but I left it out for right now. If you think it'd be worth adding, happy to add it. I've found it's nice to have the title/description in the metadata to have some structured data when retrieving rows from vectordbs for use with summary and source citation, so if we do want to add it to the `page_content`, i'd advocate for it to also be included in metadata.	2023-04-15 16:07:08 -07:00
Maciej Pióro	f7bf917baf	Fix missing docker-compose (#2899 ) Fix missing `docker-compose` command if only `docker compose` (note space) is available.	2023-04-15 16:05:11 -07:00
Harrison Chase	b634489b2e	bump version to 141 (#2950 )	2023-04-15 12:56:39 -07:00
Harrison Chase	274b25c010	SVM retriever (#2947 ) (#2949 ) Add SVM retriever class, based on https://github.com/karpathy/randomfun/blob/master/knn_vs_svm.ipynb. Testing still WIP, but the logic is correct (I have a local implementation outside of Langchain working). --------- Co-authored-by: Lance Martin <122662504+PineappleExpress808@users.noreply.github.com> Co-authored-by: rlm <31treehaus@31s-MacBook-Pro.local>	2023-04-15 12:49:59 -07:00
Harrison Chase	baf350e32b	parametrize redis (#2946 )	2023-04-15 12:47:36 -07:00
dev2049	36aa7f30e4	Move PythonRepl -> langchain.utilities (#2917 )	2023-04-15 10:50:25 -07:00
dev2049	7c73e9df5d	Add kwargs to VectorStore.maximum_marginal_relevance (#2921 ) Same as similarity_search, allows child classes to add vector store-specific args (this was technically already happening in couple places but now typing is correct).	2023-04-15 10:49:49 -07:00
Davit Buniatyan	b3a5b51728	[minor] Deep Lake auth improvements in docs, kwargs pass, faster tests (#2927 ) Minor cosmetic changes - Activeloop environment cred authentication in notebooks with `getpass.getpass` (instead of CLI which not always works) - much faster tests with Deep Lake pytest mode on - Deep Lake kwargs pass Notes - I put pytest environment creds inside `vectorstores/conftest.py`, but feel free to suggest a better location. For context, if I put in `test_deeplake.py`, `ruff` doesn't let me to set them before import deeplake --------- Co-authored-by: Davit Buniatyan <d@activeloop.ai>	2023-04-15 10:49:16 -07:00
Harrison Chase	c4ae8c1d24	bump ver to 140 (#2895 )	2023-04-15 09:23:19 -07:00
Nahin Khan	ad3973a3b8	Fix typo (#2942 )	2023-04-15 08:53:25 -07:00
Harrison Chase	cf2789d86d	delete antropic chat notebook (#2945 )	2023-04-15 08:48:51 -07:00
Hai Nguyen Mau	0aa828b1dc	typo fix (#2937 ) missing w in link	2023-04-15 08:31:43 -07:00
Ankush Gola	ec59e9d886	Fix ChatAnthropic stop_sequences error (#2919 ) (#2920 ) Note to self: Always run integration tests, even on "that last minute change you thought would be safe" :) --------- Co-authored-by: Mike Lambert <mike.lambert@anthropic.com>	2023-04-14 17:22:01 -07:00
Akash NP	13a0ed064b	add encoding to avoid UnicodeDecodeError (#2908 ) About Specify encoding to avoid UnicodeDecodeError when reading .txt for users who are following the tutorial. Reference ``` return codecs.charmap_decode(input,self.errors,decoding_table)[0] UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1205: character maps to <undefined> ``` Environment OS: Win 11 Python: 3.8	2023-04-14 16:36:03 -07:00
Mike Lambert	392f1b3218	Add Anthropic ChatModel to langchain (#2293 ) * Adds an Anthropic ChatModel * Factors out common code in our LLMModel and ChatModel * Supports streaming llm-tokens to the callbacks on a delta basis (until a future V2 API does that for us) * Some fixes	2023-04-14 15:09:07 -07:00
Kwuang Tang	66bef1d7ed	Ignore files from .gitignore in Git loader (#2909 ) fixes #2905 extends #2851	2023-04-14 15:02:21 -07:00
Boris Feld	7ee87eb0c8	Comet callback updates (#2889 ) I'm working with @DN6 and I made some small fixes and improvements after playing with the integration.	2023-04-14 13:19:58 -07:00

1 2 3 4 5 ...

1455 Commits