Update confluence.py to return spaces between elements (#5383)

# Update confluence.py to return spaces between elements like headers and links.

Please see
https://stackoverflow.com/questions/48913975/how-to-return-nicely-formatted-text-in-beautifulsoup4-when-html-text-is-across-m

Given:

```html
<address>
        183 Main St<br>East Copper<br>Massachusetts<br>U S A<br>
        MA 01516-113
    </address>
```

The document loader currently returns:

```
'183 Main StEast CopperMassachusettsU S A        MA 01516-113'
```

After this change, the document loader will return:

```
183 Main St East Copper Massachusetts U S A MA 01516-113
```
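
The effect of the separator and `strip=True` can be reproduced without BeautifulSoup. The following is a minimal stdlib-only sketch, not the loader's code: `TextCollector` and `collect_text` are hypothetical names that only approximate how `get_text(" ", strip=True)` trims each text node and joins the non-empty ones with a space.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called once per run of text found between tags.
        self.chunks.append(data)


def collect_text(html, separator="", strip=False):
    """Hypothetical helper approximating get_text(separator, strip=...)."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    chunks = parser.chunks
    if strip:
        # Trim every text node and drop the ones that end up empty.
        chunks = [c.strip() for c in chunks if c.strip()]
    return separator.join(chunks)


html = """<address>
        183 Main St<br>East Copper<br>Massachusetts<br>U S A<br>
        MA 01516-113
    </address>"""

before = collect_text(html)                 # no separator, no stripping
after = collect_text(html, " ", strip=True)  # the behaviour this change adopts

print(repr(before))
print(repr(after))
```

Without a separator, text nodes on either side of a `<br>` are glued together ("StEast"); with a single-space separator and stripping, each node becomes one space-delimited token in the result.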


@eyurtsev would you prefer this to be an option that can be passed in?
Gardner Bickford 12 months ago committed by GitHub
parent b72401b47b
commit b81f98b8a6

@@ -347,15 +347,17 @@ class ConfluenceLoader(BaseLoader):
             attachment_texts = self.process_attachment(page["id"])
         else:
             attachment_texts = []
-        text = BeautifulSoup(
-            page["body"]["storage"]["value"], "lxml"
-        ).get_text() + "".join(attachment_texts)
+        text = BeautifulSoup(page["body"]["storage"]["value"], "lxml").get_text(
+            " ", strip=True
+        ) + "".join(attachment_texts)
         if include_comments:
             comments = self.confluence.get_page_comments(
                 page["id"], expand="body.view.value", depth="all"
             )["results"]
             comment_texts = [
-                BeautifulSoup(comment["body"]["view"]["value"], "lxml").get_text()
+                BeautifulSoup(comment["body"]["view"]["value"], "lxml").get_text(
+                    " ", strip=True
+                )
                 for comment in comments
             ]
             text = text + "".join(comment_texts)
