Commit Graph

63 Commits (eeb7c96e0cd4ab8dc1c658f8a052e585e0f33732)

Author SHA1 Message Date
Aivin V. Solatorio 6567b73e1a
JSON loader (#4067)
This implements a loader of text passages in JSON format. The `jq`
syntax is used to define a schema for accessing the relevant contents
from the JSON file. This requires dependency on the `jq` package:
https://pypi.org/project/jq/.

---------

Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>
1 year ago
Harrison Chase a9c2450330
Harrison/toml loader (#4090)
Co-authored-by: Mika Ayenson <Mikaayenson@users.noreply.github.com>
1 year ago
Harrison Chase fba6921b50
Harrison/one drive loader (#4081)
Co-authored-by: José Ferraz Neto <netoferraz@gmail.com>
1 year ago
Harrison Chase 5a269d3175
Harrison/media wiki xml (#4072)
Co-authored-by: Géraud de Drouas <gdedrouas@users.noreply.github.com>
1 year ago
Steve Kim 9b830f437c
Deleted importing Document from document_loaders.base because Documen… (#4068)
Hi,

- Modification:
https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/arxiv.html
- Reason: In this example, the first line is unnecessary because the
Document class does not exist in the base.
- Resolves: Issue #4052

--------
P.S: This pull-request is my first time, so please let me know if I need
to correct or write more explanation.
1 year ago
Zander Chase aa38355999
Vwp/docs improved document loaders (#4006)
Huge thanks to @leo-gan for improving the document loaders notebooks

---------

Co-authored-by: Leonid Ganeline <leo.gan.57@gmail.com>
1 year ago
Harrison Chase f04faf8496
Harrison/spreedly (#3937)
Co-authored-by: Esmit Pérez <esmitperez@users.noreply.github.com>
1 year ago
Davis Chase e7e29f9937
Dev2049/add modern treasury (#3924)
Modified Modern Treasury and Strip slightly so credentials don't have to
be passed in explicitly. Thanks @mattgmarcus for adding Modern Treasury!

---------

Co-authored-by: Matt Marcus <matt.g.marcus@gmail.com>
1 year ago
Nikolas Garske c4d3d74148
Fix typos in arxiv.ipynb (#3887)
Several minor typos in the doc for the arxiv document loaders were
fixed.
1 year ago
Harrison Chase 20aad0bed1 stripe docs 1 year ago
Harrison Chase c494ca3ad2
Harrison/doc2txt (#3772)
Co-authored-by: rishni ratnam <rishniratnam@gmail.com>
1 year ago
Harrison Chase b7ae9f715d
Langchain with reddit (#3661) (#3768)
I have added a reddit document loader which fetches the text from the
Posts of Subreddits or Reddit users, using the `praw` Python package. I
have also added an example notebook reddit.ipynb in order to guide users
to use this dataloader.
This code was made in format similar to twiiter document loader. I have
run code formating, linting and also checked the code myself for
different scenarios.

This is my first contribution to an open source project and I am really
excited about this. If you want to suggest some improvements in my code,
I will be happy to do it. :)

Co-authored-by: Taaha Bajwa <taaha.s.bajwa@gmail.com>
1 year ago
Jon Saginaw f8d69e4e52
Enhancement: Blockchain Document Loader with better Metadata support (#3710)
This PR includes some minor alignment updates, including:

- metadata object extended to support contractAddress, blockchainType,
and tokenId
- notebook doc better aligned to standard langchain format
- startToken changed from int to str to support multiple hex value types
on the Alchemy API

The updated metadata will look like the below. It's possible for a
single contractAddress to exist across multiple blockchains (e.g.
Ethereum, Polygon, etc.) so it's important to include the
blockchainType.

```
 metadata = {"source": self.contract_address, 
                      "blockchain": self.blockchainType,
                      "tokenId": tokenId}
```
1 year ago
Davis Chase 220a7076ac
Add Mathpix pdf loader (#3727)
Inspo
https://twitter.com/danielgross/status/1651695062307274754?s=46&t=1zHLap5WG4I_kQPPjfW9fA

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
1 year ago
Harrison Chase 40f6e60e68
Harrison/stripe (#3762)
Co-authored-by: Ismail Pelaseyed <homanp@gmail.com>
1 year ago
Harrison Chase 7a129ac043
Harrison/pypdf loader (#3764)
Co-authored-by: Felipe Meres <felipe@felipemeres.com>
1 year ago
leo-gan 36c59e0c25
`Arxiv` document loader (#3627)
It makes sense to use `arxiv` as another source of the documents for
downloading.
- Added the `arxiv` document_loader, based on the
`utilities/arxiv.py:ArxivAPIWrapper`
- added tests
- added an example notebook
- sorted `__all__` in `__init__.py` (otherwise it is hard to find a
class in the very long list)
1 year ago
Eric Peter 603ea75bcd
Fix docs error for google drive loader (#3574) 1 year ago
apurvsibal af7906f100
Update Alchemy Key URL (#3559)
Update Alchemy Key URL in Blockchain Document Loader. I want to say
thank you for the incredible work the LangChain library creators have
done.

I am amazed at how seamlessly the Loader integrates with Ethereum
Mainnet, Ethereum Testnet, Polygon Mainnet, and Polygon Testnet, and I
am excited to see how this technology can be extended in the future.

@hwchase17 - Please let me know if I can improve or if I have missed any
community guidelines in making the edit? Thank you again for your hard
work and dedication to the open source community.
1 year ago
Harrison Chase 0fc0aa62f2
Harrison/blockchain docloader (#3491)
Co-authored-by: Jon Saginaw <saginawj@users.noreply.github.com>
1 year ago
Maxwell Mullin 696f840426
GuessedAtParserWarning from RTD document loader documentation example (#3397)
Addresses #3396 by adding 

`features='html.parser'` in example
1 year ago
jrhe 980cc41709
Adds progress bar using tqdm to directory_loader (#3349)
Approach copied from `WebBaseLoader`. Assumes the user doesn't have
`tqdm` installed.
1 year ago
Haste171 93d53e417a
Update unstructured_file.ipynb (#3377)
Fix typo in docs
1 year ago
Harrison Chase e5ffbee5eb
Harrison/hf document loader (#3394)
Co-authored-by: Azam Iftikhar <azamiftikhar1000@gmail.com>
1 year ago
Honkware a5ad1c270f
Add ChatGPT Data Loader (#3336)
This pull request adds a ChatGPT document loader to the document loaders
module in `langchain/document_loaders/chatgpt.py`. Additionally, it
includes an example Jupyter notebook in
`docs/modules/indexes/document_loaders/examples/chatgpt_loader.ipynb`
which uses fake sample data based on the original structure of the
`conversations.json` file.

The following files were added/modified:
- `langchain/document_loaders/__init__.py`
- `langchain/document_loaders/chatgpt.py`
- `docs/modules/indexes/document_loaders/examples/chatgpt_loader.ipynb`
-
`docs/modules/indexes/document_loaders/examples/example_data/fake_conversations.json`

This pull request was made in response to the recent release of ChatGPT
data exports by email:
https://help.openai.com/en/articles/7260999-how-do-i-export-my-chatgpt-history
1 year ago
Paul Garner aa9d5707e0
Add PythonLoader which auto-detects encoding of Python files (#3311)
This PR contributes a `PythonLoader`, which inherits from
`TextLoader` but detects and sets the encoding automatically.
1 year ago
Harrison Chase 8f22949dc4 update nnotebook title 1 year ago
Harrison Chase 96809b5794
Harrison/discord loader (#3200)
Co-authored-by: Rajtilak Bhattacharjee <rajtilak.blog@gmail.com>
1 year ago
Harrison Chase aad0a498ac
Harrison/output error (#3094)
Co-authored-by: yummydum <sumita@nowcast.co.jp>
1 year ago
Harrison Chase 1c1b77bbfe
Harrison/discord (#3092)
Co-authored-by: Rajtilak Bhattacharjee <rajtilak.blog@gmail.com>
1 year ago
Harrison Chase 1920536d99
Harrison/obsidian (#3060)
Co-authored-by: Ben Hofferber <hofferber.ben@gmail.com>
1 year ago
Zander Chase 93c0514105
Add Twitter Tweet Loader (#3050)
Reformatted version of #3022

---------

Co-authored-by: LiaoKong <568250549@qq.com>
1 year ago
Harrison Chase 5107fac656
Harrison/rec gd (#3054)
Co-authored-by: Benjamin Scholtz <BenSchZA@users.noreply.github.com>
1 year ago
Harrison Chase db7106cb79
Harrison/image caption loader (#3051)
Co-authored-by: Sean Saito <saitosean@ymail.com>
1 year ago
Harrison Chase afd3e70ae5
Harrison/confluent loader (#2994)
Co-authored-by: Justin Flick <Justinjayflick@gmail.com>
1 year ago
Harrison Chase 88d3ce12b8
Harrison/diffbot (#2984)
Co-authored-by: Manuel Saelices <msaelices@gmail.com>
1 year ago
Chetanya Rastogi aead062a70
Add an example tutorial for using PDFMinerPDFasHTMLLoader (#2960)
Last week I added the `PDFMinerPDFasHTMLLoader`. I am adding some
example code in the notebook to serve as a tutorial for how that loader
can be used to create snippets of a pdf that are structured within
sections. All the other loaders only provide the `Document` objects
segmented by pages but that's pretty loose given the amount of other
metadata that can be extracted.

With the new loader, one can leverage font-size of the text to decide
when a new sections starts and can segment the text more semantically as
shown in the tutorial notebook. The cell shows that we are able to find
the content of entire section under **Related Work** for the example pdf
which is spread across 2 pages and hence is stored as two separate
documents by other loaders
1 year ago
Kwuang Tang a508afa91c
Add file filter param to Git loader (#2904)
Allows users to specify what files should be loaded instead of
indiscriminately loading the entire repo.

extends #2851 

NOTE: for reviewers, `hide whitespace` option recommended since I
changed the indentation of an if-block to use `continue` instead so it
looks less like a Christmas tree :)
1 year ago
Harrison Chase 8fef69296d
nits (#2873) 1 year ago
Harrison Chase 07d7096de6
Harrison/playwright (#2871)
Co-authored-by: Manuel Saelices <msaelices@gmail.com>
1 year ago
ecneladis 74abeb8c53
Update output in Git notebook (#2868)
Supplemental to https://github.com/hwchase17/langchain/pull/2851.
Updates one notebook cell that I forgot to commit before.
1 year ago
ecneladis 016738e676
Add GitLoader (#2851) 1 year ago
vowelparrot bf0887c486
Add Slack Directory Loader (#2841)
Fixes linting issue from #2835 

Adds a loader for Slack Exports which can be a very valuable source of
knowledge to use for internal QA bots and other use cases.

```py
# Export data from your Slack Workspace first.
from langchain.document_loaders import SLackDirectoryLoader

SLACK_WORKSPACE_URL = "https://awesome.slack.com"

loader = ("Slack_Exports", SLACK_WORKSPACE_URL)
docs = loader.load()
```
1 year ago
Tim Asp 53dc157145
[Docs] minor fixes to loaders links and rst warnings (#2846)
The doc loaders index was picking up a bunch of subheadings because I
mistakenly made the MD titles H1s. Fixed that.

also the easy minor warnings from docs_build
2 years ago
Rounak Datta 7688bf9182
WhatsApp document loader - update regex (#2776)
I was testing out the WhatsApp Document loader, and noticed that
sometimes the date is of the following format (notice the additional
underscore):
```
3/24/23, 1:54_PM - +91 99999 99999 joined using this group's invite link
3/24/23, 6:29_PM - +91 99999 99999: When are we starting then?
```

Wierdly, the underscore is visible in Vim, but not on editors like
VSCode. I presume it is some unusual character/line terminator.
Nevertheless, I think handling this edge case will make the document
loader more robust.
2 years ago
vowelparrot 2db9b7a45d
Revert "Add Slack Directory Loader (#2835)" (#2839)
This reverts commit a6f767ae7a.

To fix the linting error.
2 years ago
vowelparrot a6f767ae7a
Add Slack Directory Loader (#2835)
Adds a loader for Slack Exports which can be a very valuable source of
    knowledge to use for internal QA bots and other use cases.

    ```py
    # Export data from your Slack Workspace first.
    from langchain.document_loaders import SLackDirectoryLoader

    SLACK_WORKSPACE_URL = "https://awesome.slack.com"

    loader = ("Slack_Exports", SLACK_WORKSPACE_URL)
    docs = loader.load()
```

---------

Co-authored-by: Mikhail Dubov <mikhail@chattermill.io>
2 years ago
vowelparrot 709f26b69e
Added bilibili loader (#2673) (#2724)
I've added a bilibili loader, bilibili is a very active video site in
China and I think we need this loader.

Example:
```python
from langchain.document_loaders.bilibili import BiliBiliLoader

loader = BiliBiliLoader(
       ["https://www.bilibili.com/video/BV1xt411o7Xu/",
       "https://www.bilibili.com/video/av330407025/"]
)
docs = loader.load()
```

Co-authored-by: 了空 <568250549@qq.com>
2 years ago
Chetanya Rastogi 50c511d75f
Add new loader to load pdf as html content (#2607)
Adds a new pdf loader using the existing dependency on PDFMiner. 

The new loader can be helpful for chunking texts semantically into
sections as the output html content can be parsed via `BeautifulSoup` to
get more structured and rich information about font size, page numbers,
pdf headers/footers, etc. which may not be available otherwise with
other pdf loaders
2 years ago
Harrison Chase e90d007db3
Harrison/msg files (#2375)
Co-authored-by: Sahil Masand <masand.sahil@gmail.com>
Co-authored-by: Sahil Masand <masands@cbh.com.au>
2 years ago