langchain/libs/community/langchain_community/document_loaders
Ian Gregory e5472b5eb8
Framework for supporting more languages in LanguageParser (#13318)
## Description

I am submitting this for a school project as part of a team of 5. Other
team members are @LeilaChr, @maazh10, @Megabear137, @jelalalamy. This PR
also has contributions from community members @Harrolee and @Mario928.

Initial context is in the issue we opened (#11229).

This pull request adds:

- Generic framework for expanding the languages that `LanguageParser`
can handle, using the
[tree-sitter](https://github.com/tree-sitter/py-tree-sitter#py-tree-sitter)
parsing library and existing language-specific parsers written for it
- Support for the following additional languages in `LanguageParser`:
  - C
  - C++
  - C#
  - Go
  - Java (contributed by @Mario928,
    https://github.com/ThatsJustCheesy/langchain/pull/2)
  - Kotlin
  - Lua
  - Perl
  - Ruby
  - Rust
  - Scala
  - TypeScript (contributed by @Harrolee,
    https://github.com/ThatsJustCheesy/langchain/pull/1)
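
One way to picture a framework like this is a registry that maps a language name to a segmenter class. The following is a minimal sketch under stated assumptions, not the code in this PR: `Segmenter`, `GoLikeSegmenter`, `SEGMENTERS`, and `split_code` are hypothetical names, and the toy segmenter uses a regular expression instead of tree-sitter so that it runs with only the standard library.

```python
import re
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class Segmenter(ABC):
    """Hypothetical base class: one subclass per supported language."""

    def __init__(self, code: str) -> None:
        self.code = code

    @abstractmethod
    def extract_definitions(self) -> List[str]:
        """Return the source text of each top-level definition."""


class GoLikeSegmenter(Segmenter):
    """Toy stand-in for a tree-sitter-backed segmenter (regex only)."""

    def extract_definitions(self) -> List[str]:
        # Split the source just before every line that starts with "func ".
        chunks = re.split(r"(?m)^(?=func )", self.code)
        # Keep only the chunks that are function definitions.
        return [c.strip() for c in chunks if c.startswith("func ")]


# Hypothetical registry: supporting a new language means adding one entry.
SEGMENTERS: Dict[str, Type[Segmenter]] = {"go": GoLikeSegmenter}


def split_code(code: str, language: str) -> List[str]:
    """Look up the segmenter for `language` and split `code` with it."""
    try:
        segmenter_cls = SEGMENTERS[language]
    except KeyError:
        raise ValueError(f"Unsupported language: {language!r}") from None
    return segmenter_cls(code).extract_definitions()
```

With a design like this, the dispatch code stays unchanged as languages are added; each language only contributes its own subclass and a registry entry.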

Here is the [design
document](https://docs.google.com/document/d/17dB14cKCWAaiTeSeBtxHpoVPGKrsPye8W0o_WClz2kk)
if curious, but no need to read it.

## Issues

- Closes #11229
- Closes #10996
- Closes #8405

## Dependencies

`tree_sitter` and `tree_sitter_languages` on PyPI. We have tried to add
these as optional dependencies.
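
A common way to keep such packages optional is to import them lazily and raise a helpful error when they are missing. The helper below is a sketch of that pattern only; `import_optional` is a hypothetical name and is not taken from this PR.

```python
import importlib
from types import ModuleType


def import_optional(module_name: str, pip_name: str) -> ModuleType:
    """Import an optional dependency, or fail with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a message that tells the user what to install.
        raise ImportError(
            f"Could not import `{module_name}`. "
            f"Install it with `pip install {pip_name}`."
        ) from exc
```

A parser would then call, for example, `import_optional("tree_sitter", "tree_sitter")` inside the function that needs it, so users who never parse those languages do not need the packages installed.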

## Documentation

We have updated the list of supported languages, and also added a
section to `source_code.ipynb` detailing how to add support for
additional languages using our framework.

## Maintainer

- @hwchase17 (previously reviewed
https://github.com/langchain-ai/langchain/pull/6486)

Thanks!!

## Git commits

We will gladly squash any/all of our commits (esp merge commits) if
necessary. Let us know if this is desirable, or if you will be
squash-merging anyway.


---------

Co-authored-by: Maaz Hashmi <mhashmi373@gmail.com>
Co-authored-by: LeilaChr <87657694+LeilaChr@users.noreply.github.com>
Co-authored-by: Jeremy La <jeremylai511@gmail.com>
Co-authored-by: Megabear137 <zubair.alnoor27@gmail.com>
Co-authored-by: Lee Harrold <lhharrold@sep.com>
Co-authored-by: Mario928 <88029051+Mario928@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
8 months ago
blob_loaders community[patch]: doc loaders mypy fixes (#17368) 8 months ago
parsers Framework for supporting more languages in LanguageParser (#13318) 8 months ago
__init__.py community[minor]: Add pebblo safe document loader (#16862) 8 months ago
acreom.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
airbyte.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
airbyte_json.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
airtable.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
apify_dataset.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
arcgis_loader.py community[patch]: doc loaders mypy fixes (#17368) 8 months ago
arxiv.py Update arxiv.py with get_summaries_as_docs inside of Arxivloader (#14953) 9 months ago
assemblyai.py community[patch]: doc loaders mypy fixes (#17368) 8 months ago
astradb.py community[patch]: doc loaders mypy fixes (#17368) 8 months ago
async_html.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
athena.py community[patch]: remove print (#17435) 8 months ago
azlyrics.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
azure_ai_data.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
azure_blob_storage_container.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
azure_blob_storage_file.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
baiducloud_bos_directory.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
baiducloud_bos_file.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
base.py infra: add -p to mkdir in lint steps (#17013) 8 months ago
base_o365.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
bibtex.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
bigquery.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
bilibili.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
blackboard.py infra: add print rule to ruff (#16221) 8 months ago
blockchain.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
brave_search.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
browserless.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
cassandra.py infra: add -p to mkdir in lint steps (#17013) 8 months ago
chatgpt.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
chm.py community[patch]: docstrings (#16810) 8 months ago
chromium.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
college_confidential.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
concurrent.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
confluence.py infra: add print rule to ruff (#16221) 8 months ago
conllu.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
couchbase.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
csv_loader.py community[patch]: doc loaders mypy fixes (#17368) 8 months ago
cube_semantic.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
datadog_logs.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
dataframe.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
diffbot.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
directory.py community[patch]: doc loaders mypy fixes (#17368) 8 months ago
discord.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
doc_intelligence.py infra: add -p to mkdir in lint steps (#17013) 8 months ago
docugami.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
docusaurus.py docs: docstrings `langchain_community` update (#14889) 10 months ago
dropbox.py infra: add print rule to ruff (#16221) 8 months ago
duckdb_loader.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
email.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
epub.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
etherscan.py infra: add print rule to ruff (#16221) 8 months ago
evernote.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
excel.py Docs: fix excel document loader typo (#15470) 9 months ago
facebook_chat.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
fauna.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
figma.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
gcs_directory.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
gcs_file.py fix: correct spelling mistakes of "seperate, intialise, pre-defined" (#14647) 9 months ago
generic.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
geodataframe.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
git.py community[patch]: doc loaders mypy fixes (#17368) 8 months ago
gitbook.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
github.py community[patch]: Add Pagination to GitHubIssuesLoader for Efficient GitHub Issues Retrieval (#16934) 8 months ago
google_speech_to_text.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
googledrive.py infra: add print rule to ruff (#16221) 8 months ago
gutenberg.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
helpers.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
hn.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
html.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
html_bs.py fix: correct spelling mistakes of "seperate, intialise, pre-defined" (#14647) 9 months ago
hugging_face_dataset.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
ifixit.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
image.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
image_captions.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
imsdb.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
iugu.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
joplin.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
json_loader.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
lakefs.py docs: docstrings `langchain_community` update (#14889) 10 months ago
larksuite.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
markdown.py corrected outdated link (#15053) 9 months ago
mastodon.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
max_compute.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
mediawikidump.py infra: add -p to mkdir in lint steps (#17013) 8 months ago
merge.py langchain[minor],community[minor]: Add async methods in BaseLoader (#16634) 8 months ago
mhtml.py fix: correct spelling mistakes of "seperate, intialise, pre-defined" (#14647) 9 months ago
modern_treasury.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
mongodb.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
news.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
notebook.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
notion.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
notiondb.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
nuclia.py infra: add print rule to ruff (#16221) 8 months ago
obs_directory.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
obs_file.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
obsidian.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
odt.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
onedrive.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
onedrive_file.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
onenote.py infra: add print rule to ruff (#16221) 8 months ago
open_city_data.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
org_mode.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
pdf.py community[patch]: doc loaders mypy fixes (#17368) 8 months ago
pebblo.py community[minor]: Add pebblo safe document loader (#16862) 8 months ago
polars_dataframe.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
powerpoint.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
psychic.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
pubmed.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
pyspark_dataframe.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
python.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
quip.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
readthedocs.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
recursive_url_loader.py community[patch]: doc loaders mypy fixes (#17368) 8 months ago
reddit.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
roam.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
rocksetdb.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
rspace.py fix: correct spelling mistakes of "seperate, intialise, pre-defined" (#14647) 9 months ago
rss.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
rst.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
rtf.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
s3_directory.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
s3_file.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
sharepoint.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
sitemap.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
slack_directory.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
snowflake_loader.py infra: add print rule to ruff (#16221) 8 months ago
spreedly.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
srt.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
stripe.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
surrealdb.py community[patch]: SurrealDB fix for asyncio (#16092) 8 months ago
telegram.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
tencent_cos_directory.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
tencent_cos_file.py fix: correct spelling mistakes of "seperate, intialise, pre-defined" (#14647) 9 months ago
tensorflow_datasets.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
text.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
tomarkdown.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
toml.py infra: add print rule to ruff (#16221) 8 months ago
trello.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
tsv.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
twitter.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
unstructured.py community[patch]: Load list of files using UnstructuredFileLoader (#16216) 8 months ago
url.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
url_playwright.py community[proxy]: Enhancement/add proxy support playwrighturlloader 16751 (#16822) 8 months ago
url_selenium.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
vsdx.py community[minor]: New documents loader for visio files (with extension .vsdx) (#16171) 8 months ago
weather.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
web_base.py community[patch]: Add Cookie Support to Fetch Method (#16673) 8 months ago
whatsapp_chat.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
wikipedia.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
word_document.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
xml.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
xorbits.py community[major], core[patch], langchain[patch], experimental[patch]: Create langchain-community (#14463) 10 months ago
youtube.py community[patch]: docstrings (#16810) 8 months ago