langchain/docs/extras/modules/data_connection/document_loaders/integrations
Cristóbal Carnero Liñán e494b0a09f
feat (documents): add a source code loader based on AST manipulation (#6486)
#### Summary

A new approach to loading source code is implemented:

Each top-level function and class in the code is loaded into its own
document. An additional document is then created with the remaining
top-level code, in which each extracted function or class is replaced
by a placeholder comment.

This could improve the accuracy of QA chains over source code.

For instance, given this script:

```
class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")

def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()

if __name__ == '__main__':
    main()
```

The loader will create three documents with this content:

First document:
```
class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")
```

Second document:
```
def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()
```

Third document:
```
# Code for: class MyClass:

# Code for: def main():

if __name__ == '__main__':
    main()
```
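
To illustrate the splitting described above, here is a minimal sketch built only on Python's standard `ast` module. It is not the code added in this PR: the function name `split_top_level` and the variable `SAMPLE` are hypothetical, and the sketch handles Python sources only.

```python
import ast
from typing import List, Optional


def split_top_level(source: str) -> List[str]:
    """Return one string per top-level function/class, then the simplified rest.

    Hypothetical helper for illustration; not the loader's actual API.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    # Copy of the source lines; extracted lines are replaced or set to None.
    simplified: List[Optional[str]] = list(lines)
    segments: List[str] = []

    for node in tree.body:  # tree.body holds only the top-level statements
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so it stays with its definition.
            first = min([d.lineno for d in node.decorator_list] + [node.lineno])
            start, end = first - 1, node.end_lineno
            segments.append("\n".join(lines[start:end]))
            # Leave a one-line placeholder where the definition used to be.
            simplified[start] = "# Code for: " + lines[node.lineno - 1]
            for i in range(start + 1, end):
                simplified[i] = None

    remainder = "\n".join(line for line in simplified if line is not None)
    return segments + [remainder]


SAMPLE = '''class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")

def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()

if __name__ == '__main__':
    main()
'''

documents = split_top_level(SAMPLE)
```

With the sample script, `documents` holds three strings: the class, the `main` function, and the remaining top-level code with placeholder comments. The sketch relies on `end_lineno`, which requires Python 3.8 or later.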

A threshold parameter is added to control whether small scripts are
split in this way.

At the moment, only Python and JavaScript are supported. The
appropriate parser is selected based on the file extension.
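
The two decisions described above (whether to split, and which parser to use) could look roughly like the following sketch. All names here (`LANGUAGE_BY_SUFFIX`, `pick_language`, `should_split`) are hypothetical, and the sketch assumes the threshold is measured in source lines, which this description does not state.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from file extension to a language key.
LANGUAGE_BY_SUFFIX = {".py": "python", ".js": "js"}


def pick_language(path: str) -> Optional[str]:
    """Return the language key for a file, or None if the extension is unsupported."""
    return LANGUAGE_BY_SUFFIX.get(Path(path).suffix.lower())


def should_split(source: str, threshold: int) -> bool:
    """Split only when the source has at least `threshold` lines (assumed unit)."""
    return len(source.splitlines()) >= threshold
```

Under these assumptions, a file with an unsupported extension, or one shorter than the threshold, would be loaded as a single document instead of being split.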

#### Tests

This PR adds:

- Unit tests
- Integration tests

#### Dependencies

Only one dependency was added, and it is optional (it is needed only
for the JavaScript parser).

#### Documentation

A notebook is added showing how the loader can be used.

#### Who can review?

@eyurtsev @hwchase17

---------

Co-authored-by: rlm <pexpresss31@gmail.com>
2023-06-27 15:58:47 -07:00
example_data feat (documents): add a source code loader based on AST manipulation (#6486) 2023-06-27 15:58:47 -07:00
acreom.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
airbyte_json.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
airtable.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
alibaba_cloud_maxcompute.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
apify_dataset.ipynb docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
arxiv.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
aws_s3_directory.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
aws_s3_file.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
azlyrics.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
azure_blob_storage_container.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
azure_blob_storage_file.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
bibtex.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
bilibili.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
blackboard.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
blockchain.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
chatgpt_loader.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
college_confidential.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
confluence.ipynb fix titles in documentation 2023-06-17 11:09:11 -07:00
conll-u.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
copypaste.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
csv.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
diffbot.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
discord.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
docugami.ipynb docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
duckdb.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
email.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
embaas.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
epub.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
evernote.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
excel.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
facebook_chat.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
fauna.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
figma.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
git.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
gitbook.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
github.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
google_bigquery.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
google_cloud_storage_directory.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
google_cloud_storage_file.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
google_drive.ipynb Harrison/gdrive enhancements (#6375) 2023-06-18 11:07:23 -07:00
gutenberg.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
hacker_news.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
hugging_face_dataset.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
ifixit.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
image_captions.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
image.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
imsdb.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
iugu.ipynb Minor Grammar Fixes in Docs and Comments (#6536) 2023-06-21 09:53:31 -07:00
joplin.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
jupyter_notebook.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
mastodon.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
mediawikidump.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
merge_doc_loader.ipynb Create merge loader that combines documents from a set of loaders (#6659) 2023-06-23 13:02:48 -07:00
mhtml.ipynb Added a MHTML document loader (#6311) 2023-06-25 13:12:08 -07:00
microsoft_onedrive.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
microsoft_powerpoint.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
microsoft_word.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
modern_treasury.ipynb Minor Grammar Fixes in Docs and Comments (#6536) 2023-06-21 09:53:31 -07:00
notion.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
notiondb.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
obsidian.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
odt.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
open_city_data.ipynb Loader for OpenCityData and minor cleanups to Pandas, Airtable loaders (#6301) 2023-06-22 22:20:42 -07:00
pandas_dataframe.ipynb Loader for OpenCityData and minor cleanups to Pandas, Airtable loaders (#6301) 2023-06-22 22:20:42 -07:00
psychic.ipynb docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
pyspark_dataframe.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
readthedocs_documentation.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
recursive_url_loader.ipynb RecusiveUrlLoader to RecursiveUrlLoader (#6787) 2023-06-26 23:12:14 -07:00
reddit.ipynb docs/fix links (#6498) 2023-06-20 14:06:50 -07:00
roam.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
rst.ipynb feat: Add UnstructuredRSTLoader (#6594) 2023-06-25 12:41:57 -07:00
sitemap.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
slack.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
snowflake.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
source_code.ipynb feat (documents): add a source code loader based on AST manipulation (#6486) 2023-06-27 15:58:47 -07:00
spreedly.ipynb Minor Grammar Fixes in Docs and Comments (#6536) 2023-06-21 09:53:31 -07:00
stripe.ipynb Minor Grammar Fixes in Docs and Comments (#6536) 2023-06-21 09:53:31 -07:00
subtitle.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
telegram.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
tomarkdown.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
toml.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
trello.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
twitter.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
unstructured_file.ipynb Harrison/unstructured page number (#6464) 2023-06-19 22:31:43 -07:00
url.ipynb Add markdown to specify important arguments (#6246) 2023-06-18 17:47:00 -07:00
weather.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
web_base.ipynb Update web_base.ipynb (#6430) 2023-06-19 21:43:35 -07:00
whatsapp_chat.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
wikipedia.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
xml.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
youtube_audio.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00
youtube_transcript.ipynb Doc refactor (#6300) 2023-06-16 11:52:56 -07:00