langchain

mirror of https://github.com/hwchase17/langchain synced 2024-10-29 17:07:25 +00:00

Author	SHA1	Message	Date
Harrison Chase	8f06085b24	make tools conditional (#11647 )	2023-10-11 06:11:05 +01:00
Bassem Yacoube	5451b724fc	Adds support for llama2 and fixes MPT-7b url (#11465 ) - Description: This is an update to OctoAI LLM provider that adds support for llama2 endpoints hosted on OctoAI and updates MPT-7b url with the current one. @baskaryan Thanks! --------- Co-authored-by: ML Wiz <bassemgeorgi@gmail.com>	2023-10-10 20:34:35 -07:00
Todd Kerpelman	0bff399af1	Make metadata from the url_selenium loader match that of the web_base loader (#11617 ) Description: I noticed the metadata returned by the url_selenium loader was missing several values included by the web_base loader. (The former returned `{source: ...}`, the latter returned `{source: ..., title: ..., description: ..., language: ...}`.) This change fixes it so both loaders return all 4 key value pairs. Files have been properly formatted and all tests are passing. Note, however, that I am not much of a python expert, so that whole "Adding the imports inside the code so that tests pass" thing seems weird to me. Please LMK if I did anything wrong.	2023-10-10 20:32:45 -07:00
Tarun Thotakura	c9d4d53545	Fixed the assignment of custom_llm_provider argument (#11628 ) - Description: Assigning the custom_llm_provider to the default params function so that it will be passed to the litellm - Issue: Even though the custom_llm_provider argument is being defined it's not being assigned anywhere in the code and hence its not being passed to litellm, therefore any litellm call which uses the custom_llm_provider as required parameter is being failed. This parameter is mainly used by litellm when we are doing inference via Custom API server. https://docs.litellm.ai/docs/providers/custom_openai_proxy - Dependencies: No dependencies are required @krrishdholakia , @baskaryan --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-10-10 20:29:24 -07:00
Leonid Ganeline	db67ccb0bb	docstrings cleanup (#11640 ) Added missed docstrings. Some reformatting.	2023-10-10 19:56:47 -07:00
Yang, Bo	3a82bd7bdb	Use raise from statement so that users can find detailed error message (#11461 ) - Description: Use `raise from` statement so that users can find detailed error message - Tag maintainer: @baskaryan, @eyurtsev, @hwchase17	2023-10-10 17:25:23 -07:00
Nuno Campos	9a0ed75a95	Add configurable fields with options (#11601 ) <!-- Thank you for contributing to LangChain! Replace this entire comment with: - Description: a description of the change, - Issue: the issue # it fixes (if applicable), - Dependencies: any dependencies required for this change, - Tag maintainer: for a quicker response, tag the relevant maintainer (see below), - Twitter handle: we announce bigger features on Twitter. If your PR gets announced, and you'd like a mention, we'll gladly shout you out! Please make sure your PR is passing linting and testing before submitting. Run `make format`, `make lint` and `make test` to check this locally. See contribution guidelines for more information on how to write/run tests, lint, etc: https://github.com/langchain-ai/langchain/blob/master/.github/CONTRIBUTING.md If you're adding a new integration, please include: 1. a test for the integration, preferably unit tests that do not rely on network access, 2. an example notebook showing its use. It lives in `docs/extras` directory. If no one reviews your PR within a few days, please @-mention one of @baskaryan, @eyurtsev, @hwchase17. -->	2023-10-10 22:17:22 +01:00
Bagatur	7232e082de	bump 312 (#11621 )	2023-10-10 12:34:49 -07:00
Eugene Yurtsev	58220cda72	Remove LLM Bash and related bash utilities (#11619 ) Deprecate LLMBash and related bash utilities	2023-10-10 14:54:09 -04:00
Shubham Kushwaha	49de862076	Arcee.ai LLM & Retriever integration (#11579 ) - Description: This PR introduces a new LLM and Retriever API to https://arcee.ai for the python client - Issue: implements the integrations as requested in #11578 , - Dependencies: no dependencies are required, - Tag maintainer: @hwchase17 - Twitter handle: shwooobham ✅ `make format`, `make lint` and `make test` runs locally. ```shell =========== 1245 passed, 277 skipped, 20 warnings in 16.26s =========== ./scripts/check_pydantic.sh . ./scripts/check_imports.sh poetry run ruff . [ "." = "" ] \|\| poetry run black . --check All done! ✨ 🍰 ✨ 1818 files would be left unchanged. [ "." = "" ] \|\| poetry run mypy . Success: no issues found in 1815 source files [ "." = "" ] \|\| poetry run black . All done! ✨ 🍰 ✨ 1818 files left unchanged. [ "." = "" ] \|\| poetry run ruff --select I --fix . poetry run codespell --toml pyproject.toml poetry run codespell --toml pyproject.toml -w ``` Contributions 1. Arcee (langchain/llms), ArceeRetriever (langchain/retrievers), ArceeWrapper (langchain/utilities) 2. docs for Arcee (llms/arcee.py) and ArceeRetriever(retrievers/arcee.py) 3. cc: @jacobsolawetz @ben-epstein --------- Co-authored-by: Shubham <shubham@sORo.local>	2023-10-10 10:20:45 -07:00
Eugene Yurtsev	b56ca0c2a4	Deprecate LLMSymbolicMath from langchain core (#11615 ) Deprecate LLMSymbolicMath from langchain core package.	2023-10-10 12:33:51 -04:00
Eugene Yurtsev	c9bce5bbfb	Add version to langchain_experimental (#11613 ) Add version to langchain experimental	2023-10-10 11:17:41 -04:00
Predrag Gruevski	22abeb9f6c	Disable loading jinja2 `PromptTemplate` from file. (#10252 ) jinja2 templates are not sandboxed and are at risk for arbitrary code execution. To mitigate this risk: - We no longer support loading jinja2-formatted prompt template files. - `PromptTemplate` with jinja2 may still be constructed manually, but the class carries a security warning reminding the user to not pass untrusted input into it. Resolves #4394.	2023-10-10 11:15:42 -04:00
Nuno Campos	c7c03d4709	Fix mutation bugs in callback manager configure (#11603 ) <!-- Thank you for contributing to LangChain! Replace this entire comment with: - Description: a description of the change, - Issue: the issue # it fixes (if applicable), - Dependencies: any dependencies required for this change, - Tag maintainer: for a quicker response, tag the relevant maintainer (see below), - Twitter handle: we announce bigger features on Twitter. If your PR gets announced, and you'd like a mention, we'll gladly shout you out! Please make sure your PR is passing linting and testing before submitting. Run `make format`, `make lint` and `make test` to check this locally. See contribution guidelines for more information on how to write/run tests, lint, etc: https://github.com/langchain-ai/langchain/blob/master/.github/CONTRIBUTING.md If you're adding a new integration, please include: 1. a test for the integration, preferably unit tests that do not rely on network access, 2. an example notebook showing its use. It lives in `docs/extras` directory. If no one reviews your PR within a few days, please @-mention one of @baskaryan, @eyurtsev, @hwchase17. -->	2023-10-10 14:50:18 +01:00
cccs-eric	e2a9072b80	Fix CohereRerank configuration (#11583 ) Description: CohereRerank is missing `cohere_api_key` as a field and since extras are forbidden, it is not possible to pass-in the key. The only way is to use an env variable named `COHERE_API_KEY`. For example, if trying to create a compressor like this: ```python cohere_api_key = "......Cohere api key......" compressor = CohereRerank(cohere_api_key=cohere_api_key) ``` you will get the following error: ``` File "/langchain/.venv/lib/python3.10/site-packages/pydantic/v1/main.py", line 341, in __init__ raise validation_error pydantic.v1.error_wrappers.ValidationError: 1 validation error for CohereRerank cohere_api_key extra fields not permitted (type=value_error.extra) ```	2023-10-09 23:26:34 -07:00
Anar	55fef4b64b	implemented add files method in LLMRails (#11518 ) This PR provides add files method with LLMRails. Implemented here are: docs/extras/integrations/vectorstores/llm-rails.ipynb --------- Co-authored-by: Anar Aliyev <aaliyev@mgmt.cloudnet.services>	2023-10-09 16:29:43 -07:00
Stephen Hankinson	316dddc7cd	fix wording of query_sql_database_tool_description (#11530 ) - Description: Fixes minor typo for the query_sql_database_tool_description in the db toolkit - Issue: N/A - Dependencies: N/A - Tag maintainer: @nfcampos - Twitter handle: N/A	2023-10-09 15:32:45 -07:00
Ash Vardanian	1acfe86353	Accelerating Math Utils with SimSIMD (#11566 ) LangChain relies on NumPy to compute cosine distances, which becomes a bottleneck with the growing dimensionality and number of embeddings. To avoid this bottleneck, in our libraries at [Unum](https://github.com/unum-cloud), we have created a specialized package - [SimSIMD](https://github.com/ashvardanian/simsimd), that knows how to use newer hardware capabilities. Compared to SciPy and NumPy, it reaches 3x-200x performance for various data types. Since publication, several LangChain users have asked me if I can integrate it into LangChain to accelerate their workflows, so here I am 🤗 ## Benchmarking To conduct benchmarks locally, run this in your Jupyter: ```py import numpy as np import scipy as sp import simsimd as simd import timeit as tt def cosine_similarity_np(X: np.ndarray, Y: np.ndarray) -> np.ndarray: X_norm = np.linalg.norm(X, axis=1) Y_norm = np.linalg.norm(Y, axis=1) with np.errstate(divide="ignore", invalid="ignore"): similarity = np.dot(X, Y.T) / np.outer(X_norm, Y_norm) similarity[np.isnan(similarity) \| np.isinf(similarity)] = 0.0 return similarity def cosine_similarity_sp(X: np.ndarray, Y: np.ndarray) -> np.ndarray: return 1 - sp.spatial.distance.cdist(X, Y, metric='cosine') def cosine_similarity_simd(X: np.ndarray, Y: np.ndarray) -> np.ndarray: return 1 - simd.cdist(X, Y, metric='cosine') X = np.random.randn(1, 1536).astype(np.float32) Y = np.random.randn(1, 1536).astype(np.float32) repeat = 1000 print("NumPy: {:,.0f} ops/s, SciPy: {:,.0f} ops/s, SimSIMD: {:,.0f} ops/s".format( repeat / tt.timeit(lambda: cosine_similarity_np(X, Y), number=repeat), repeat / tt.timeit(lambda: cosine_similarity_sp(X, Y), number=repeat), repeat / tt.timeit(lambda: cosine_similarity_simd(X, Y), number=repeat), )) ``` ## Results I ran this on an M2 Pro Macbook for various data types and different number of rows in `X` and reformatted the results as a table for readability: \| Data Type \| NumPy \| SciPy \| SimSIMD \| \| :--- \| ---: \| ---: \| ---: \| \| `f32, 1` \| 59,114 ops/s \| 80,330 ops/s \| 475,351 ops/s \| \| `f16, 1` \| 32,880 ops/s \| 82,420 ops/s \| 650,177 ops/s \| \| `i8, 1` \| 47,916 ops/s \| 115,084 ops/s \| 866,958 ops/s \| \| `f32, 10` \| 40,135 ops/s \| 24,305 ops/s \| 185,373 ops/s \| \| `f16, 10` \| 7,041 ops/s \| 17,596 ops/s \| 192,058 ops/s \| \| `f16, 10` \| 21,989 ops/s \| 25,064 ops/s \| 619,131 ops/s \| \| `f32, 100` \| 3,536 ops/s \| 3,094 ops/s \| 24,206 ops/s \| \| `f16, 100` \| 900 ops/s \| 2,014 ops/s \| 23,364 ops/s \| \| `i8, 100` \| 5,510 ops/s \| 3,214 ops/s \| 143,922 ops/s \| It's important to note that SimSIMD will underperform if both matrices are huge. That, however, seems to be an uncommon usage pattern for LangChain users. You can find a much more detailed performance report for different hardware models here: - [Apple M2 Pro](https://ashvardanian.com/posts/simsimd-faster-scipy/#appendix-1-performance-on-apple-m2-pro). - [4th Gen Intel Xeon Platinum](https://ashvardanian.com/posts/simsimd-faster-scipy/#appendix-2-performance-on-4th-gen-intel-xeon-platinum-8480). - [AWS Graviton 3](https://ashvardanian.com/posts/simsimd-faster-scipy/#appendix-3-performance-on-aws-graviton-3). ## Additional Notes 1. Previous version used `X = np.array(X)`, to repackage lists of lists. It's an anti-pattern, as it will use double-precision floating-point numbers, which are slow on both CPUs and GPUs. I have replaced it with `X = np.array(X, dtype=np.float32)`, but a more selective approach should be discussed. 2. In numerical computations, it's recommended to explicitly define tolerance levels, which were previously avoided in `np.allclose(expected, actual)` calls. For now, I've set absolute tolerance to distance computation errors as 0.01: `np.allclose(expected, actual, atol=1e-2)`. --- - Dependencies: adds `simsimd` dependency - Tag maintainer: @hwchase17 - Twitter handle: @ashvardanian --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-10-09 14:56:55 -07:00
benchello	5de64e6d60	Add option to specify metadata columns in CSV loader (#11576 ) #### Description This PR adds the option to specify additional metadata columns in the CSVLoader beyond just `Source`. The current CSV loader includes all columns in `page_content` and if we want to have columns specified for `page_content` and `metadata` we have to do something like the below.: ``` csv = pd.read_csv( "path_to_csv" ).to_dict("records") documents = [ Document( page_content=doc["content"], metadata={ "last_modified_by": doc["last_modified_by"], "point_of_contact": doc["point_of_contact"], } ) for doc in csv ] ``` #### Usage Example Usage: ``` csv_test = CSVLoader( file_path="path_to_csv", metadata_columns=["last_modified_by", "point_of_contact"] ) ``` Example CSV: ``` content, last_modified_by, point_of_contact "hello world", "Person A", "Person B" ``` Example Result: ``` Document { page_content: "hello world" metadata: { row: '0', source: 'path_to_csv', last_modified_by: 'Person A', point_of_contact: 'Person B', } ``` --------- Co-authored-by: Ben Chello <bchello@dropbox.com> Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-10-09 14:56:45 -07:00
Stephen Hankinson	447a523662	fix comments in output format (#11536 ) - Description: Fixes the comments in the ConvoOutputParser. Because the \\\\ is escaping a single \\, they render something like: `"action_input": string \ The input to the action` in the prompt. Changing this to \\\\\\\\ lets it escape two slashes so that it renders a proper comment: `"action_input": string \\ The input to the action` - Issue: N/A - Dependencies: - Tag maintainer: @hwchase17 - Twitter handle:	2023-10-09 14:55:44 -07:00
Michael Landis	8e45f720a8	feat: add momento vector index as a vector store provider (#11567 ) Description: - Added Momento Vector Index (MVI) as a vector store provider. This includes an implementation with docstrings, integration tests, a notebook, and documentation on the docs pages. - Updated the Momento dependency in pyproject.toml and the lock file to enable access to MVI. - Refactored the Momento cache and chat history session store to prefer using "MOMENTO_API_KEY" over "MOMENTO_AUTH_TOKEN" for consistency with MVI. This change is backwards compatible with the previous "auth_token" variable usage. Updated the code and tests accordingly. Dependencies: - Updated Momento dependency in pyproject.toml. Testing: - Run the integration tests with a Momento API key. Get one at the [Momento Console](https://console.gomomento.com) for free. MVI is available in AWS us-west-2 with a superuser key. - `MOMENTO_API_KEY=<your key> poetry run pytest tests/integration_tests/vectorstores/test_momento_vector_index.py` Tag maintainer: @eyurtsev Twitter handle: Please mention @momentohq for this addition to langchain. With the integration of Momento Vector Index, Momento caching, and session store, Momento provides serverless support for the core langchain data needs. Also mention @mlonml for the integration.	2023-10-09 14:02:59 -07:00
Eugene Yurtsev	ca2eed36b7	LangChain cli fix a few bugs (#11573 ) Code was assuming that `git` and `poetry` exist. In addition, it was not ignoring pycache files that get generated during run time	2023-10-09 13:30:16 -07:00
Hugues Chocart	258ae1ba5f	[LLMonitor Callback Handler]: Add error handling (#11563 ) Wraps every callback handler method in error handlers to avoid breaking users' programs when an error occurs inside the handler. Thanks @valdo99 for the suggestion 🙂	2023-10-09 13:26:35 -07:00
Eugene Yurtsev	2aabfafe1e	Module documentation for langchain runnables (#11550 ) Add in code documentation for langchain runnables module.	2023-10-09 16:02:29 -04:00
Eugene Yurtsev	d8fa94e6fa	RunnablePassthrough: In code documentation (#11552 ) Add in code documentation for a runnable passthrough	2023-10-09 16:02:16 -04:00
Eugene Yurtsev	b42f218cfc	RunnableLambda: Add in code docs (#11521 ) Add in code docs for Runnable Lambda	2023-10-09 14:37:46 -04:00
maks-operlejn-ds	f64522fbaf	Reset deanonymizer mapping (#11559 ) @hwchase17 @baskaryan	2023-10-09 11:11:05 -07:00
maks-operlejn-ds	b14b65d62a	Support all presidio entities (#11558 ) https://microsoft.github.io/presidio/supported_entities/ @baskaryan @hwchase17	2023-10-09 11:10:46 -07:00
maks-operlejn-ds	4d62def9ff	Better deanonymizer matching strategy (#11557 ) @baskaryan, @hwchase17	2023-10-09 11:10:29 -07:00
Ash Vardanian	a992b9670d	Fix: Missing DuckDuckGo package version (#11535 ) [The `duckduckgo-search` v3.9.2 was removed from PyPi](https://pypi.org/project/duckduckgo-search/#history). That breaks the build. - Description: refreshes the Poetry dependency to v3.9.3 - Tag maintainer: @baskaryan - Twitter handle: @ashvardanian	2023-10-09 10:55:46 -07:00
Bagatur	8932ed3f07	bump 311 (#11555 )	2023-10-09 08:17:07 -07:00
Bagatur	e7a0def1bc	QoL improvements to query constructor (#11504 ) updating query constructor and self query retriever to - make it easier to pass in examples - validate attributes used in query - remove invalid parts of query - make it easier to get + edit prompt - make query constructor a runnable - make self query retriever use as runnable	2023-10-09 08:10:52 -07:00
Taikono-Himazin	eec53fa294	Added autodetect_encoding option to csvLoader (#11327 )	2023-10-09 08:06:43 -07:00
Holt Skinner	09c66fe04f	feat: Update Google Document AI Parser (#11413 ) - Description: Code Refactoring, Documentation Improvements for Google Document AI PDF Parser - Adds Online (synchronous) processing option. - Adds default field mask to limit payload size. - Skips Human review by default. - Issue: Fixes #10589 --------- Co-authored-by: Erick Friis <erick@langchain.dev>	2023-10-09 08:04:25 -07:00
Nuno Campos	628cc4cce8	Rename RunnableMap to RunnableParallel (#11487 ) - keep alias for RunnableMap - update docs to use RunnableParallel and RunnablePassthrough.assign <!-- Thank you for contributing to LangChain! Replace this entire comment with: - Description: a description of the change, - Issue: the issue # it fixes (if applicable), - Dependencies: any dependencies required for this change, - Tag maintainer: for a quicker response, tag the relevant maintainer (see below), - Twitter handle: we announce bigger features on Twitter. If your PR gets announced, and you'd like a mention, we'll gladly shout you out! Please make sure your PR is passing linting and testing before submitting. Run `make format`, `make lint` and `make test` to check this locally. See contribution guidelines for more information on how to write/run tests, lint, etc: https://github.com/langchain-ai/langchain/blob/master/.github/CONTRIBUTING.md If you're adding a new integration, please include: 1. a test for the integration, preferably unit tests that do not rely on network access, 2. an example notebook showing its use. It lives in `docs/extras` directory. If no one reviews your PR within a few days, please @-mention one of @baskaryan, @eyurtsev, @hwchase17. -->	2023-10-09 11:22:03 +01:00
Eugene Yurtsev	6a10e8ef31	Add documentation to Runnable (#11516 )	2023-10-08 08:09:04 +01:00
William FH	eb572f41a6	Add LangSmith Run Chat Loader (#11458 )	2023-10-06 17:02:18 -07:00
David Duong	484947c492	Fetch up-to-date attributes for env-pulled kwargs during serialisation of OpenAI classes (#11499 )	2023-10-06 22:43:29 +01:00
Bagatur	5470e730d2	raise openapi import error (#11495 )	2023-10-06 12:57:24 -07:00
Erick Friis	29f5f70415	Rename some last hwchase17/langchain links (#11494 )	2023-10-06 12:34:30 -07:00
Fabrice Pont	872836c541	feat: add markdown list parser (#11411 ) Description: add `MarkdownListOutputParser` as a new `ListOutputParser` Issue: #11410	2023-10-06 12:25:45 -07:00
Erick Friis	8f50b616c5	Remove optional from vectara source (#11493 ) fyi @ofermend --------- Co-authored-by: Ofer Mendelevitch <ofer@vectara.com> Co-authored-by: Ofer Mendelevitch <ofermend@gmail.com>	2023-10-06 12:12:44 -07:00
Bagatur	53887242a1	bump 310 (#11486 )	2023-10-06 09:49:10 -07:00
Jesús Vélez Santiago	a1c7532298	Add async sql record manager and async indexing API (#10726 ) - Description: Add support for a SQLRecordManager in async environments. It includes the creation of `RecorManagerAsync` abstract class. - Issue: None - Dependencies: Optional `aiosqlite`. - Tag maintainer: @nfcampos - Twitter handle: @jvelezmagic --------- Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>	2023-10-06 09:38:44 -04:00
Qihui Xie	57ade13b2b	fix llm_inputs duplication problem in intermediate_steps in SQLDatabaseChain (#10279 ) Use `.copy()` to fix the bug that the first `llm_inputs` element is overwritten by the second `llm_inputs` element in `intermediate_steps`. *Problem description:* In [line 127]( `c732d8fffd/libs/experimental/langchain_experimental/sql/base.py (L127C17-L127C17)`), the `llm_inputs` of the sql generation step is appended as the first element of `intermediate_steps`: ``` intermediate_steps.append(llm_inputs) # input: sql generation ``` However, `llm_inputs` is a mutable dict, it is updated in [line 179](https://github.com/langchain-ai/langchain/blob/master/libs/experimental/langchain_experimental/sql/base.py#L179) for the final answer step: ``` llm_inputs["input"] = input_text ``` Then, the updated `llm_inputs` is appended as another element of `intermediate_steps` in [line 180](`c732d8fffd/libs/experimental/langchain_experimental/sql/base.py (L180)`): ``` intermediate_steps.append(llm_inputs) # input: final answer ``` As a result, the final `intermediate_steps` returned in [line 189](`c732d8fffd/libs/experimental/langchain_experimental/sql/base.py (L189C43-L189C43)`) actually contains two same `llm_inputs` elements, i.e., the `llm_inputs` for the sql generation step overwritten by the one for final answer step by mistake. Users are not able to get the actual `llm_inputs` for the sql generation step from `intermediate_steps` Simply calling `.copy()` when appending `llm_inputs` to `intermediate_steps` can solve this problem.	2023-10-05 21:32:08 -07:00
Florian	d78f418c0d	Extract abstracts from Pubmed articles, even if they have no extra label (#10245 ) ### Description This pull request involves modifications to the extraction method for abstracts/summaries within the PubMed utility. A condition has been added to verify the presence of unlabeled abstracts. Now an abstract will be extracted even if it does not have a subtitle. In addition, the extraction of the abstract was extended to books. ### Issue The PubMed utility occasionally returns an empty result when extracting abstracts from articles, despite the presence of an abstract for the paper on PubMed. This issue arises due to the varying structure of articles; some articles follow a "subtitle/label: text" format, while others do not include subtitles in their abstracts. An example of the latter case can be found at: [https://pubmed.ncbi.nlm.nih.gov/37666905/](url) --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-10-05 18:56:46 -07:00
Viktor Zhemchuzhnikov	fd9da60aea	Add async support to SelfQueryRetriever (#10175 ) ### Description SelfQueryRetriever is missing async support, so I am adding it. I also removed deprecated predict_and_parse method usage here, and added some tests. ### Issue N/A ### Tag maintainer Not yet ### Twitter handle N/A	2023-10-05 18:54:21 -07:00
Theron Tau	35297ca0d3	Add feature for extracting images from pdf and recognizing text from images. (#10653 ) Description It is for #10423 that it will be a useful feature if we can extract images from pdf and recognize text on them. I have implemented it with `PyPDFLoader`, `PyPDFium2Loader`, `PyPDFDirectoryLoader`, `PyMuPDFLoader`, `PDFMinerLoader`, and `PDFPlumberLoader`. [RapidOCR](https://github.com/RapidAI/RapidOCR.git) is used to recognize text on extracted images. It is time-consuming for ocr so a boolen parameter `extract_images` is set to control whether to extract and recognize. I have tested the time usage for each parser on my own laptop thinkbook 14+ with AMD R7-6800H by unit test and the result is: \| extract_images \| PyPDFParser \| PDFMinerParser \| PyMuPDFParser \| PyPDFium2Parser \| PDFPlumberParser \| \| ------------- \| ------------- \| ------------- \| ------------- \| ------------- \| ------------- \| \| False \| 0.27s \| 0.39s \| 0.06s \| 0.08s \| 1.01s \| \| True \| 17.01s \| 20.67s \| 20.32s \| 19,75s \| 20.55s \| Issue #10423 Dependencies rapidocr_onnxruntime in [RapidOCR](https://github.com/RapidAI/RapidOCR/tree/main) --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-10-05 18:51:59 -07:00
Bagatur	8e3fbc97ca	Add vowpal_wabbit RL chain (#11462 )	2023-10-05 18:39:45 -07:00
Haris Wang	f1269830a0	Fix bug in MarkdownHeaderTextSplitter for codeblock (#10262 ) - Description: The previous version of the MarkdownHeaderTextSplitter did not take into account the possibility of '#' appearing within code blocks, which caused segmentation anomalies in these situations. This PR has fixed this issue. - Issue: - Dependencies: No - Tag maintainer: - Twitter handle: cc @baskaryan @eyurtsev @rlancemartin --------- Co-authored-by: Bagatur <baskaryan@gmail.com>	2023-10-05 18:34:42 -07:00

1 2 3 4 5 ...

1334 Commits