Commit Graph

318 Commits

Author SHA1 Message Date
Gijs Kruitbosch
8fec62d246 Strip XML namespaces from tag names to deal with broken serializations 2018-06-09 09:51:16 +01:00
Gijs Kruitbosch
8e92a1fa19 Reuse textNode variable for CDATA blocks, too 2018-06-09 09:49:25 +01:00
David A Roberts
ea4165721f Remove single-cell tables 2018-06-09 09:49:01 +01:00
David A Roberts
bf64b58d90 Update tests 2018-06-08 11:06:01 +01:00
David A Roberts
72bd1a8532 Don't nest paragraphs 2018-06-08 11:06:01 +01:00
David A Roberts
68c9af4ffa Use numeric encoding for non-XML entities
JSDOMParser can't handle HTML named entities like `&nbsp;`
2018-06-08 11:06:01 +01:00
David A Roberts
611e9e3a6f JSDOMParser: handle CDATA sections 2018-06-08 11:06:01 +01:00
David A Roberts
afcc4b8e49 Fix titles not being trimmed sometimes 2018-06-08 11:06:01 +01:00
Gijs Kruitbosch
d4b842c82a Match headings on trimmed strings to avoid whitespace causing mismatches 2018-05-24 16:32:16 +01:00
Gijs Kruitbosch
8c02a0d34c Fix #283 and remove hidden nodes 2018-05-16 21:54:12 +01:00
David A Roberts
656a6673d9 Don't put non-phrasing content into paragraphs 2018-05-16 10:43:30 +01:00
David A Roberts
5ae90930cd Don't convert DIVs to Ps when more than 25% links 2018-05-15 13:29:55 +01:00
David A Roberts
9f2c5cb42e Put phrasing content into paragraphs
This removes the need for `p.readability-styled` elements.
2018-05-15 13:29:55 +01:00
David A Roberts
c823a6efb2 Fix generate-testcase.js 2018-05-14 09:44:33 +01:00
Gijs Kruitbosch
f4ab856992 Check for a document being passed
This provides a descriptive error message if no document is passed, and
ignores the first argument if the second argument looks like
a reasonable DOM document instance.
2018-05-01 12:29:22 +01:00
David A Roberts
7a24801958 Don't include root html node in candidates
Fixes #435
2018-04-29 09:22:00 +01:00
David A Roberts
acfd3759a1 Generate XHTML-compatible input for test cases
Fixes the bug noted in the README
2018-04-28 22:28:16 +01:00
David A Roberts
d60184966c Remove unused URI parameter from constructor 2018-04-26 10:30:15 +01:00
David A Roberts
5ee03bc960 Stop Readability depending on Node.* constants 2018-04-22 16:23:46 +01:00
Andres Rey
3c76104adb Fix engadget test case 2018-03-23 00:27:50 +00:00
Andres Rey
4b99f41ec9 Add engadget test case 2018-03-23 00:27:50 +00:00
Andres Rey
6c5bc62959 Remove aside tags on test cases 2018-03-23 00:27:50 +00:00
Andres Rey
6fd816496c Clean <aside> tags on _prepArticle 2018-03-23 00:27:50 +00:00
David A Roberts
f8d9b1c224 Update test expectations 2018-03-19 13:20:29 +00:00
David A Roberts
8414158fa9 Fix _replaceBrs
Previously, `nextElem` was not actually proceeding to the next element, and therefore aborting the paragraph at the first `<br>` (rather than the first `<br><br>` as the comment indicates).
2018-03-19 13:20:29 +00:00
Joan Espasa Arxer
3ff9a166fb Changed wordThreshold to charThreshold to better reflect the semantics. 2018-03-13 14:37:27 +00:00
Brad Philips
8525c6af36 Fix relative URIs given <base> tags (#422) 2018-03-02 11:38:14 +00:00
Gijs Kruitbosch
d598baf02b Improve URL handling in JSDOMParser and Readability.js
This change ups the required node version to 7.0 because it relies on the builtin url module.

We now pass a url when constructing a jsdom document or JSDOMParser document.
Because this is an API change, I'm increasing the package version.

Ultimately, I would like to remove the  argument from the readability constructor. It should
use the documentURI from the document it is passed.
2018-02-28 11:29:29 +00:00
Andres Rey
834672ef86 Return longest text after failing to detect text longer than the configured value (#423)
Save extracted text across attempts and return the longest one when all attempts fail, and add a test case from hukumusume
2018-02-27 14:26:54 +00:00
Tom Z?hner
264b8e8968 Remove link elements when preparing article for display 2018-01-30 15:11:59 +00:00
Thomas Jaggi
fd1557560a [Docs] Fixed JSDOM usage note 2018-01-02 22:12:53 +00:00
Andres Rey
fa9d8bda48 Add la-nacion test case 2017-12-11 14:00:48 +00:00
Andres Rey
01ffd0c617 Remove "modal" from strings to remove 2017-12-11 14:00:48 +00:00
Gijs
8da91b9eed Fix omitted semicolon 2017-12-05 13:22:32 +00:00
Gijs
0a30527c85 Explicitly mention lack of Node in node.js environments 2017-12-05 13:22:01 +00:00
Gijs Kruitbosch
807bf05aa3 Fix className usage so it deals correctly with SVG nodes (fixes #412). 2017-12-05 11:06:39 +00:00
Gijs Kruitbosch
c586aeb404 Fix generate-testcase.js script so it keeps caption classes 2017-11-30 10:41:15 +00:00
Gijs Kruitbosch
ad4dd26448 Update test expectations 2017-11-30 10:41:15 +00:00
Gijs Kruitbosch
092a8aeaff Revert removing ids from elements 2017-11-30 10:41:15 +00:00
Andres Rey
eb895b97a2 Add test case for title and h1 discrepancy 2017-11-27 15:53:38 +00:00
Andres Rey
9ce4d87232 Fall back to the original title if after trimming the text we have too many words before the colon. 2017-11-27 15:53:38 +00:00
Andres Rey
c2e370c2c7 Add telegraph test case 2017-11-22 21:14:14 +00:00
Andres Rey
5a5c8ba1a2 Add node to elementsToScore when _hasSinglePInsideElement is true 2017-11-22 21:14:14 +00:00
Cameron McCormack
5ad448f831 Update test expectations. 2017-11-21 10:04:59 +00:00
Cameron McCormack
d88c9afc63 Use a hard coded classesToPreserve in tests. 2017-11-21 10:04:59 +00:00
Cameron McCormack
6729538c77 Clean IDs and classes from output. 2017-11-21 10:04:59 +00:00
Tomas Dvorak
19b9f9de14 added npmignore for test and benchmarks resources 2017-11-13 13:32:23 +00:00
Björgvin Ragnarsson
c3ff1a2d2c remove dead code 2017-11-02 23:15:13 +00:00
Iqbal Ahmed
b3fde168cb Allow the word threshold parameter to be configurable 2017-09-19 15:38:10 +01:00
Taylor Hunt
b7c32feb25 Remove presentational HTML attributes (#385)
* Remove presentational HTML attributes

Fixes #383

This patch loops through a list of known-presentational attributes in HTML, attempting to remove each from each cleaned element. (Checking for the attribute's existence first seems to just add needless overhead.)

The extra check for the HTML namespace is to avoid removing attributes that inline SVG needs.

* Only remove `width`/`height` for certain elements

Embedded media elements are allowed to have them, but not others.

* Address PR feedback

* Fix loop index formatting
* Only remove `width`/`height` from certain elements
* Combine logic into a single check/remove

* Attempt fixing my recursion

* One weird trick to get your loops to run

* Add inline SVG bailout

Try not to touch any styles for `<svg>`, because it's inherently presentational.

* Update tests to match newly-removed attributes

* Oh those wacky SVGs

The `position:absolute` is a trick to import clipping paths into the document without putting a big 300×150 empty space in it. (`display:none` and such disable the clipPath in some browsers.)

* Whoops, missed some `width`s

* Normalize SVG tagName

JSDOMParser differs from the official DOM here
2017-08-09 20:26:34 +01:00
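
The attribute-stripping approach described in commit b7c32feb25 above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the commit: the contents of `PRESENTATIONAL_ATTRIBUTES` and `SIZE_ATTRIBUTE_ELEMS`, the function name `cleanPresentation`, and the `FakeElement` stand-in for a DOM element are all hypothetical and chosen only so the sketch runs on its own.

```javascript
// Minimal stand-in for a DOM element so the sketch runs without a browser or jsdom.
// (Hypothetical helper; real code would operate on actual DOM nodes.)
class FakeElement {
  constructor(tagName, attributes = {}, children = []) {
    this.tagName = tagName;
    this.attributes = { ...attributes };
    this.children = children;
  }
  removeAttribute(name) {
    // Removing an absent attribute is a no-op, so no existence check is needed.
    delete this.attributes[name];
  }
}

// Assumed lists for illustration; the commit message does not enumerate them.
const PRESENTATIONAL_ATTRIBUTES = ["align", "bgcolor", "border", "style", "valign"];
const SIZE_ATTRIBUTE_ELEMS = new Set(["TABLE", "TH", "TD", "HR", "PRE"]);

function cleanPresentation(elem) {
  // Inline SVG bailout: SVG is inherently presentational, so leave the subtree alone.
  // Lowercasing normalizes tagName, which can differ in case between parsers.
  if (elem.tagName.toLowerCase() === "svg") {
    return;
  }

  // Remove every known presentational attribute from this element.
  for (const attr of PRESENTATIONAL_ATTRIBUTES) {
    elem.removeAttribute(attr);
  }

  // Only certain elements lose width/height; embedded media keep theirs.
  if (SIZE_ATTRIBUTE_ELEMS.has(elem.tagName.toUpperCase())) {
    elem.removeAttribute("width");
    elem.removeAttribute("height");
  }

  // Recurse into child elements.
  for (const child of elem.children) {
    cleanPresentation(child);
  }
}

// Usage: a table loses all listed attributes, the image keeps its width,
// and the inline SVG is left untouched.
const img = new FakeElement("IMG", { width: "300", align: "left" });
const svg = new FakeElement("svg", { style: "position:absolute" });
const table = new FakeElement("TABLE", { width: "100%", border: "1" }, [img, svg]);
cleanPresentation(table);
console.log(table.attributes, img.attributes, svg.attributes);
```

Removing each attribute unconditionally mirrors the commit's note that checking for existence first only adds overhead; the early return for `svg` mirrors the bailout added later in the same patch series.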