This PR addresses several improvements:
- Previously it was not possible to load spaces of more than 100 pages.
The `limit` was being used both as an overall page limit *and* as a per
request pagination limit. This, in combination with the fact that
atlassian seem to use a server-side hard limit of 100 when page content
is expanded, meant it wasn't possible to download >100 pages. Now
`limit` is used *only* as a per-request pagination limit and `max_pages`
is introduced as the way to limit the total number of pages returned by
the paginator.
- Document metadata now includes `source` (the source url), making it
compatible with `RetrievalQAWithSourcesChain`.
- It is now possible to include inline and footer comments.
- It is now possible to pass `verify_ssl=False` and other parameters to
the confluence object for use cases that require it.
Fixes linting issue from #2835
Adds a loader for Slack Exports which can be a very valuable source of
knowledge to use for internal QA bots and other use cases.
```py
# Export data from your Slack Workspace first.
from langchain.document_loaders import SLackDirectoryLoader
SLACK_WORKSPACE_URL = "https://awesome.slack.com"
loader = ("Slack_Exports", SLACK_WORKSPACE_URL)
docs = loader.load()
```
Currently, the function still fails if `continue_on_failure` is set to
True, because `elements` is not set.
---------
Co-authored-by: leecjohnny <johnny-lee1255@users.noreply.github.com>
I've added a bilibili loader, bilibili is a very active video site in
China and I think we need this loader.
Example:
```python
from langchain.document_loaders.bilibili import BiliBiliLoader
loader = BiliBiliLoader(
["https://www.bilibili.com/video/BV1xt411o7Xu/",
"https://www.bilibili.com/video/av330407025/"]
)
docs = loader.load()
```
Co-authored-by: 了空 <568250549@qq.com>
Adds a new pdf loader using the existing dependency on PDFMiner.
The new loader can be helpful for chunking texts semantically into
sections as the output html content can be parsed via `BeautifulSoup` to
get more structured and rich information about font size, page numbers,
pdf headers/footers, etc. which may not be available otherwise with
other pdf loaders
Solves #2247. Noted that the only test I added checks for the
BeautifulSoup behaviour change. Happy to add a test for
`DirectoryLoader` if deemed necessary.
This `BSHTMLLoader` document_loader loads an HTML document, extracts
text and adds the page title to the returned Document's metadata. The
loader uses the already installed bs4 package to extract both text
content and the page title.
Included in this PR is an example HTML file and an integration test that
tests against this file.
---------
Co-authored-by: Daniel Chalef <daniel.chalef@private.org>
Different PDF libraries have different strengths and weaknesses. PyMuPDF
does a good job at extracting the most amount of content from the doc,
regardless of the source quality, extremely fast (especially compared to
Unstructured).
https://pymupdf.readthedocs.io/en/latest/index.html
iFixit is a wikipedia-like site that has a huge amount of open content
on how to fix things, questions/answers for common troubleshooting and
"things" related content that is more technical in nature. All content
is licensed under CC-BY-SA-NC 3.0
Adding docs from iFixit as context for user questions like "I dropped my
phone in water, what do I do?" or "My macbook pro is making a whining
noise, what's wrong with it?" can yield significantly better responses
than context free response from LLMs.