langchain

mirror of https://github.com/hwchase17/langchain synced 2024-10-29 17:07:25 +00:00

Author	SHA1	Message	Date
Cristóbal Carnero Liñán	e494b0a09f	feat (documents): add a source code loader based on AST manipulation (#6486 ) #### Summary A new approach to loading source code is implemented: Each top-level function and class in the code is loaded into separate documents. Then, an additional document is created with the top-level code, but without the already loaded functions and classes. This could improve the accuracy of QA chains over source code. For instance, having this script: ``` class MyClass: def __init__(self, name): self.name = name def greet(self): print(f"Hello, {self.name}!") def main(): name = input("Enter your name: ") obj = MyClass(name) obj.greet() if __name__ == '__main__': main() ``` The loader will create three documents with this content: First document: ``` class MyClass: def __init__(self, name): self.name = name def greet(self): print(f"Hello, {self.name}!") ``` Second document: ``` def main(): name = input("Enter your name: ") obj = MyClass(name) obj.greet() ``` Third document: ``` # Code for: class MyClass: # Code for: def main(): if __name__ == '__main__': main() ``` A threshold parameter is added to control whether small scripts are split in this way or not. At this moment, only Python and JavaScript are supported. The appropriate parser is determined by examining the file extension. #### Tests This PR adds: - Unit tests - Integration tests #### Dependencies Only one dependency was added as optional (needed for the JavaScript parser). #### Documentation A notebook is added showing how the loader can be used. #### Who can review? @eyurtsev @hwchase17 --------- Co-authored-by: rlm <pexpresss31@gmail.com>	2023-06-27 15:58:47 -07:00
Eugene Yurtsev	0dc304ca80	Add html parsers (#4874 ) # Add bs4 html parser * Some minor refactors * Extract the bs4 html parsing code from the bs html loader * Move some tests from integration tests to unit tests	2023-05-17 22:39:11 -04:00
Lester Yang	cd3f9865f3	Feature: pdfplumber PDF loader with BaseBlobParser (#4552 ) # Feature: pdfplumber PDF loader with BaseBlobParser * Adds pdfplumber as a PDF loader * Adds pdfplumber as a blob parser.	2023-05-15 09:47:02 -04:00
Eugene Yurtsev	2ceb807da2	Add PDF parser implementations (#4356 ) # Add PDF parser implementations This PR separates the data loading from the parsing for a number of existing PDF loaders. Parser tests have been designed to help encourage developers to create a consistent interface for parsing PDFs. This interface can be made more consistent in the future by adding information into the initializer on desired behavior with respect to splitting by page etc. This code is expected to be backwards compatible -- with the exception of a bug fix with pymupdf parser which was returning `bytes` in the page content rather than strings. Also changing the lazy parser method of document loader to return an Iterator rather than Iterable over documents. ## Before submitting <!-- If you're adding a new integration, include an integration test and an example notebook showing its use! --> ## Who can review? Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested: @ <!-- For a quicker response, figure out the right person to tag with @ @hwchase17 - project lead Tracing / Callbacks - @agola11 Async - @agola11 DataLoader Abstractions - @eyurtsev LLM/Chat Wrappers - @hwchase17 - @agola11 Tools / Toolkits - @vowelparrot -->	2023-05-09 10:24:17 -04:00

4 Commits