Mirror of https://github.com/hwchase17/langchain
Synced 2024-11-10 01:10:59 +00:00
Commit e5472b5eb8
## Description

I am submitting this for a school project as part of a team of 5. Other team members are @LeilaChr, @maazh10, @Megabear137, @jelalalamy. This PR also has contributions from community members @Harrolee and @Mario928.

Initial context is in the issue we opened (#11229).

This pull request adds:

- Generic framework for expanding the languages that `LanguageParser` can handle, using the [tree-sitter](https://github.com/tree-sitter/py-tree-sitter#py-tree-sitter) parsing library and existing language-specific parsers written for it
- Support for the following additional languages in `LanguageParser`:
  - C
  - C++
  - C#
  - Go
  - Java (contributed by @Mario928 https://github.com/ThatsJustCheesy/langchain/pull/2)
  - Kotlin
  - Lua
  - Perl
  - Ruby
  - Rust
  - Scala
  - TypeScript (contributed by @Harrolee https://github.com/ThatsJustCheesy/langchain/pull/1)

Here is the [design document](https://docs.google.com/document/d/17dB14cKCWAaiTeSeBtxHpoVPGKrsPye8W0o_WClz2kk) if curious, but no need to read it.

## Issues

- Closes #11229
- Closes #10996
- Closes #8405

## Dependencies

`tree_sitter` and `tree_sitter_languages` on PyPI. We have tried to add these as optional dependencies.

## Documentation

We have updated the list of supported languages, and also added a section to `source_code.ipynb` detailing how to add support for additional languages using our framework.

## Maintainer

- @hwchase17 (previously reviewed https://github.com/langchain-ai/langchain/pull/6486)

Thanks!!

## Git commits

We will gladly squash any/all of our commits (esp merge commits) if necessary. Let us know if this is desirable, or if you will be squash-merging anyway.

---------

Co-authored-by: Maaz Hashmi <mhashmi373@gmail.com>
Co-authored-by: LeilaChr <87657694+LeilaChr@users.noreply.github.com>
Co-authored-by: Jeremy La <jeremylai511@gmail.com>
Co-authored-by: Megabear137 <zubair.alnoor27@gmail.com>
Co-authored-by: Lee Harrold <lhharrold@sep.com>
Co-authored-by: Mario928 <88029051+Mario928@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
- blob_loaders
- parsers
- __init__.py
- acreom.py
- airbyte_json.py
- airbyte.py
- airtable.py
- apify_dataset.py
- arcgis_loader.py
- arxiv.py
- assemblyai.py
- astradb.py
- async_html.py
- athena.py
- azlyrics.py
- azure_ai_data.py
- azure_blob_storage_container.py
- azure_blob_storage_file.py
- baiducloud_bos_directory.py
- baiducloud_bos_file.py
- base_o365.py
- base.py
- bibtex.py
- bigquery.py
- bilibili.py
- blackboard.py
- blockchain.py
- brave_search.py
- browserless.py
- cassandra.py
- chatgpt.py
- chm.py
- chromium.py
- college_confidential.py
- concurrent.py
- confluence.py
- conllu.py
- couchbase.py
- csv_loader.py
- cube_semantic.py
- datadog_logs.py
- dataframe.py
- diffbot.py
- directory.py
- discord.py
- doc_intelligence.py
- docugami.py
- docusaurus.py
- dropbox.py
- duckdb_loader.py
- email.py
- epub.py
- etherscan.py
- evernote.py
- excel.py
- facebook_chat.py
- fauna.py
- figma.py
- gcs_directory.py
- gcs_file.py
- generic.py
- geodataframe.py
- git.py
- gitbook.py
- github.py
- google_speech_to_text.py
- googledrive.py
- gutenberg.py
- helpers.py
- hn.py
- html_bs.py
- html.py
- hugging_face_dataset.py
- ifixit.py
- image_captions.py
- image.py
- imsdb.py
- iugu.py
- joplin.py
- json_loader.py
- lakefs.py
- larksuite.py
- markdown.py
- mastodon.py
- max_compute.py
- mediawikidump.py
- merge.py
- mhtml.py
- modern_treasury.py
- mongodb.py
- news.py
- notebook.py
- notion.py
- notiondb.py
- nuclia.py
- obs_directory.py
- obs_file.py
- obsidian.py
- odt.py
- onedrive_file.py
- onedrive.py
- onenote.py
- open_city_data.py
- org_mode.py
- pdf.py
- pebblo.py
- polars_dataframe.py
- powerpoint.py
- psychic.py
- pubmed.py
- pyspark_dataframe.py
- python.py
- quip.py
- readthedocs.py
- recursive_url_loader.py
- reddit.py
- roam.py
- rocksetdb.py
- rspace.py
- rss.py
- rst.py
- rtf.py
- s3_directory.py
- s3_file.py
- sharepoint.py
- sitemap.py
- slack_directory.py
- snowflake_loader.py
- spreedly.py
- srt.py
- stripe.py
- surrealdb.py
- telegram.py
- tencent_cos_directory.py
- tencent_cos_file.py
- tensorflow_datasets.py
- text.py
- tomarkdown.py
- toml.py
- trello.py
- tsv.py
- twitter.py
- unstructured.py
- url_playwright.py
- url_selenium.py
- url.py
- vsdx.py
- weather.py
- web_base.py
- whatsapp_chat.py
- wikipedia.py
- word_document.py
- xml.py
- xorbits.py
- youtube.py