377 Commits

Author SHA1 Message Date
PalmerAL
b551f1cf6e Fix missing content on Wikipedia articles (#560) 2019-09-30 19:25:29 +01:00
Joe Winett
60f470c4bb Remove aria-hidden="true" nodes (fixes #541) (#555)
2019-08-29 08:33:28 +01:00
Jordy van den Aardweg
2982216913 Added "keepClasses" option to prevent cleaning of classes (#552) 2019-08-04 08:56:27 +01:00
Gijs
f33a6c2a23 Switch to a newer node.js to fix build issues (#551) 2019-07-15 14:53:42 +01:00
Gijs
234f420279 Clarify security implications of using readability 2019-07-15 14:40:34 +01:00
PalmerAL
9092b2a29c Remove sharing elements in fewer situations (#545)
* remove fewer share elements

* simplify and fix social-buttons testcase
2019-05-22 23:53:51 +01:00
PalmerAL
814f0a3884 Add support for detecting lazy-loaded images (#542)
Add support for detecting lazy-loaded images using `src` or `srcset` attributes.
2019-05-08 23:48:37 +01:00
Mozilla-GitHub-Standards
26379fe62e Add Mozilla Code of Conduct file
Fixes #537.

_(Message COC002)_
2019-03-29 12:24:48 +00:00
Gijs Kruitbosch
cb5771fd4a Add nested font tags to test _setNodeTag on those (see #59) 2019-03-15 12:02:21 +00:00
Radhi
9009f64f9c Fix table header missing (#530) 2019-03-07 13:09:21 +00:00
Radhi
6761a7e412 Fix embedded videos getting removed (#526)
2019-03-07 13:02:15 +00:00
PalmerAL
f5c46a7b14 fix formatting 2019-03-05 01:33:00 +00:00
PalmerAL
681bf0c47b use default threshold for share elements 2019-03-05 01:33:00 +00:00
PalmerAL
b9cece3e58 add test 2019-03-05 01:33:00 +00:00
PalmerAL
e76aba3485 only remove sharing elements if they contain <500 characters 2019-03-05 01:33:00 +00:00
PalmerAL
27ee1e947e update regexes in readerable.js 2019-03-01 11:04:58 +00:00
PalmerAL
a014e0c9c8 exclude graphs from nytimes articles 2019-03-01 11:04:58 +00:00
Radhi Fadlillah
c942b32945 Revert source files and fix expected results 2019-03-01 11:02:48 +00:00
Radhi Fadlillah
bd5087d2f1 fix error in testing "wikipedia" 2019-03-01 11:02:48 +00:00
Radhi Fadlillah
3e025d58e5 fix error in testing "lwn-01" 2019-03-01 11:02:48 +00:00
Radhi Fadlillah
df95c9d717 fix error in testing "keep-tabulard-data" 2019-03-01 11:02:48 +00:00
Radhi Fadlillah
6a5066abe2 Fix tabular data getting removed 2019-03-01 11:02:48 +00:00
PalmerAL
f70d36852b check itemprop when determining whether a node is a byline 2019-02-23 18:26:15 +00:00
EvsChen
b9f47bcc8d fix(test-util): fix generate testcase tool 2019-02-15 13:14:30 +00:00
Andres Rey
d41de78c26 Close img tag 2019-02-11 22:06:35 +00:00
Andres Rey
1187b2dae1 Update test expectations 2019-02-11 22:06:35 +00:00
Andres Rey
3ca8c12d87 Update test expectations 2019-02-11 22:06:35 +00:00
Andres Rey
f836a8f291 Add "gdpr" to the list of negative tags 2019-02-11 22:06:35 +00:00
Andres Rey
4ffd482004 Add medicalnewstoday test case with incorrect results 2019-02-11 22:06:35 +00:00
Taylor Buley
c0c097c930 update JSDOM example for node 2019-01-29 12:06:26 +00:00
Gijs Kruitbosch
60ef565b67 Don't choke on <meta> tags that do not have a content attribute 2019-01-28 15:55:07 +00:00
Gijs
878545f64d Make usage sections in README more discoverable
This just reorders some of the content and reduces duplication.
2019-01-07 18:56:27 +00:00
Gijs Kruitbosch
30f9670a5f Avoid setAttribute errors from invalid attributes, fixes #392 2019-01-07 18:53:24 +00:00
Gijs
15d411a865 Add comment to indicate duplicate regexes
This comment was added in mozilla-central and seems useful, adding it to keep m-c and github in sync.
2019-01-03 14:27:18 +00:00
Gijs Kruitbosch
d8c837012b Fix benchmark script for script split and new JSDOM version 2018-12-29 18:22:14 +00:00
Gijs Kruitbosch
512e1c18a7 Update to latest JSDOM 2018-12-29 18:22:14 +00:00
Gijs Kruitbosch
977be42d1f Fix link normalization for live HTMLCollections
Newer versions of JSDOM implement getElementsByTagName correctly.
This means it returns a live node list. When calling
`Element.replaceChild` for links inside the loop over that
collection, elements disappear from the list, meaning we miss
every other item. Without this fix, the `clean-links` testcase
breaks.
2018-12-29 18:22:14 +00:00
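
The skipping behaviour described in the commit message above can be reproduced without a DOM. The following is only a sketch: a plain array stands in for the live collection, `splice` stands in for the node leaving the list after `Element.replaceChild`, and the function names are hypothetical, not taken from Readability.js.

```javascript
// Simulation only: `live` is a plain array standing in for a live HTMLCollection.
// Removing an entry mimics a link dropping out of the list once it is replaced.

function visitWhileMutating(live) {
  const visited = [];
  for (let i = 0; i < live.length; i++) {
    visited.push(live[i]);
    live.splice(i, 1); // the list shrinks, so the next item slides into index i
  }
  return visited;
}

function visitSnapshot(live) {
  const visited = [];
  // Copy the list first, so removals cannot shift the items still to be visited.
  for (const item of Array.from(live)) {
    visited.push(item);
    live.splice(live.indexOf(item), 1);
  }
  return visited;
}

console.log(visitWhileMutating(["a", "b", "c", "d"])); // [ 'a', 'c' ]
console.log(visitSnapshot(["a", "b", "c", "d"]));      // [ 'a', 'b', 'c', 'd' ]
```

Iterating over a snapshot is one common way to avoid the problem; the commit itself may use a different approach.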
Gijs Kruitbosch
e8bb7f722f Fix whitespace normalization in title metadata
When switching to a newer version of JSDOM, it is more literal
about listing whitespace as part of textContent, including
newlines and not normalizing multiple spaces.

It seems prudent to just always normalize whitespace for titles,
which are guaranteed to be pretty short anyway.
2018-12-29 18:22:14 +00:00
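
A minimal sketch of the kind of whitespace normalization this commit message describes. The helper name `normalizeWhitespace` and the exact regular expression are assumptions for illustration, not the code from the commit.

```javascript
// Hypothetical helper: collapse every run of whitespace (spaces, tabs, newlines)
// into a single space and drop leading/trailing whitespace.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, " ").trim();
}

// A title as a literal textContent might return it, newlines and all.
const rawTitle = "  A  title\n with\tgaps ";
console.log(normalizeWhitespace(rawTitle)); // "A title with gaps"
```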
Gijs Kruitbosch
3610476663 Remove CSS that jsdom struggles to parse 2018-12-29 18:22:14 +00:00
Gijs Kruitbosch
2620542dd1 Split off isProbablyReaderable implementation 2018-12-29 18:22:14 +00:00
Maria Luiza Soares
8c41d92560 Assert on siteName in all test cases 2018-12-21 18:28:28 +00:00
Maria Luiza Soares
1bac47c70d Add newly generated test case 2018-12-21 18:28:28 +00:00
Maria Luiza Soares
262fffd703 Retrieve site name on parse, based on meta og:site_name 2018-12-21 18:28:28 +00:00
Gijs
876c81f710 Update sorting function in Readability.js
Simplify sorting function also considering case where arguments are equal

Co-Authored-By: jemrobinson <james.em.robinson@gmail.com>
2018-11-20 12:08:07 +00:00
James Robinson
ee18c21fc2 Switched sort function from boolean to explicit -1 and 1 thus avoiding failures to sort when false is evaluated as 0 2018-11-20 12:08:07 +00:00
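
The two sorting commits above concern the return value of a comparator passed to `Array.prototype.sort`. A sketch of the difference, using hypothetical data and function names that are not taken from Readability.js:

```javascript
// Hypothetical scores; not taken from Readability.js.
const scores = [10, 40, 20, 30];

// A boolean comparator returns true (coerced to 1) or false (coerced to 0).
// It never yields a negative number, so "a sorts first" looks the same as "equal".
const booleanComparator = (a, b) => a < b;

// An explicit comparator returns a negative number, zero, or a positive number,
// and handles the case where both arguments are equal.
function byScoreDescending(a, b) {
  if (a === b) return 0;
  return a < b ? 1 : -1;
}

const sorted = [...scores].sort(byScoreDescending);
console.log(sorted); // [ 40, 30, 20, 10 ]
console.log(Number(booleanComparator(1, 2)), Number(booleanComparator(2, 1))); // 1 0
```

How an engine orders an array under the boolean comparator depends on its sort implementation, which is why the sketch does not show a result for it.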
Dan Burzo
44e90de00b Elements that have no .style (e.g. mathml) are probably visible; fixes #493 2018-11-07 13:29:41 +00:00
Hugo Locurcio
9fbe42683a Add .gitattributes file
This ignores HTML (test data) so the repository is considered
to use JavaScript instead of HTML on GitHub.
2018-11-01 15:41:20 +00:00
Daniel Aleksandersen
3be1aaa01c Recognize Sina Weibo meta tags
http://open.weibo.com/wiki/Weibo_meta_tag
2018-08-28 11:04:29 +01:00
Daniel Aleksandersen
5a69d4a8eb Improve metadata extraction (#478)
* Improve metadata extraction

* Recognize meta[property] as a space-separated list
* Recognize Dublin Core (dc|dcterm): metadata.
* Prefer Dublin Core, Open Graph, Twitter, and HTML in that order.
* _getArticleTitle() is now only used as fallback if document
 doesn't provide good metadata.
2018-08-25 00:28:00 +01:00
Daniel Aleksandersen
0449dbf186 Recognize more iframe video embed services
* TenCent QQ Video, Alexa Rank 8
* Twitch clips and streams, Alexa Rank 33
* Internet Archive, Alexa Rank 265
* Wikimedia, Alexa Rank 347
2018-08-22 16:08:46 +01:00