Commit Graph

362 Commits

Author SHA1 Message Date
PalmerAL
27ee1e947e update regexes in readerable.js 2019-03-01 11:04:58 +00:00
PalmerAL
a014e0c9c8 exclude graphs from nytimes articles 2019-03-01 11:04:58 +00:00
Radhi Fadlillah
c942b32945 Revert source files and fix expected results 2019-03-01 11:02:48 +00:00
Radhi Fadlillah
bd5087d2f1 fix error in testing "wikipedia" 2019-03-01 11:02:48 +00:00
Radhi Fadlillah
3e025d58e5 fix error in testing "lwn-01" 2019-03-01 11:02:48 +00:00
Radhi Fadlillah
df95c9d717 fix error in testing "keep-tabular-data" 2019-03-01 11:02:48 +00:00
Radhi Fadlillah
6a5066abe2 Fix tabular data got removed 2019-03-01 11:02:48 +00:00
PalmerAL
f70d36852b check itemprop when determining whether a node is a byline 2019-02-23 18:26:15 +00:00
EvsChen
b9f47bcc8d fix(test-util): fix generate testcase tool 2019-02-15 13:14:30 +00:00
Andres Rey
d41de78c26 Close img tag 2019-02-11 22:06:35 +00:00
Andres Rey
1187b2dae1 Update test expectations 2019-02-11 22:06:35 +00:00
Andres Rey
3ca8c12d87 Update test expectations 2019-02-11 22:06:35 +00:00
Andres Rey
f836a8f291 Add "gdpr" to the list of negative tags 2019-02-11 22:06:35 +00:00
Andres Rey
4ffd482004 Add medicalnewstoday test case with incorrect results 2019-02-11 22:06:35 +00:00
Taylor Buley
c0c097c930 update JSDOM example for node 2019-01-29 12:06:26 +00:00
Gijs Kruitbosch
60ef565b67 Don't choke on <meta> tags that do not have a content attribute 2019-01-28 15:55:07 +00:00
Gijs
878545f64d
Make usage sections in README more discoverable
This just reorders some of the content and reduces duplication.
2019-01-07 18:56:27 +00:00
Gijs Kruitbosch
30f9670a5f Avoid setAttribute errors from invalid attributes, fixes #392 2019-01-07 18:53:24 +00:00
Gijs
15d411a865
Add comment to indicate duplicate regexes
This comment was added in mozilla-central and seems useful, adding it to keep m-c and github in sync.
2019-01-03 14:27:18 +00:00
Gijs Kruitbosch
d8c837012b Fix benchmark script for script split and new JSDOM version 2018-12-29 18:22:14 +00:00
Gijs Kruitbosch
512e1c18a7 Update to latest JSDOM 2018-12-29 18:22:14 +00:00
Gijs Kruitbosch
977be42d1f Fix link normalization for live HTMLCollections
Newer versions of JSDOM implement getElementsByTagName correctly.
This means it returns a live node list. When calling
`Element.replaceChild` for links inside the loop over that
collection, elements disappear from the list, meaning we miss
every other item. Without this fix, the `clean-links` testcase
breaks.
2018-12-29 18:22:14 +00:00
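
The live-collection pitfall described in commit 977be42d1f can be reproduced without a DOM. The sketch below is only an illustration: `FakeParent` and `liveByTag` are hypothetical stand-ins for a parent element and a live `getElementsByTagName` result, and copying the list into an array first is one common remedy, not necessarily the fix the commit applied.

```javascript
// Hypothetical stand-in for a DOM parent whose tag lookup is "live":
// the list is recomputed from the current children on every access.
class FakeParent {
  constructor(tags) {
    this.children = tags.map((tag) => ({ tag }));
  }

  liveByTag(tag) {
    const parent = this;
    return {
      get length() {
        return parent.children.filter((c) => c.tag === tag).length;
      },
      item(i) {
        return parent.children.filter((c) => c.tag === tag)[i];
      },
    };
  }

  replaceChild(newChild, oldChild) {
    this.children[this.children.indexOf(oldChild)] = newChild;
  }
}

// Replaces links while indexing into the live list: each replacement
// shrinks the list, so the loop skips every other link.
function replaceLinksNaively(parent) {
  const links = parent.liveByTag("a");
  let replaced = 0;
  for (let i = 0; i < links.length; i++) {
    parent.replaceChild({ tag: "span" }, links.item(i));
    replaced++;
  }
  return replaced;
}

// Copies the live list into a plain array before mutating the parent.
function replaceLinksFromSnapshot(parent) {
  const links = parent.liveByTag("a");
  const snapshot = [];
  for (let i = 0; i < links.length; i++) {
    snapshot.push(links.item(i));
  }
  for (const link of snapshot) {
    parent.replaceChild({ tag: "span" }, link);
  }
  return snapshot.length;
}

console.log(replaceLinksNaively(new FakeParent(["a", "a", "a", "a"]))); // 2
console.log(replaceLinksFromSnapshot(new FakeParent(["a", "a", "a", "a"]))); // 4
```

With four links, the naive loop replaces only two of them, which matches the "miss every other item" symptom in the commit message.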
Gijs Kruitbosch
e8bb7f722f Fix whitespace normalization in title metadata
When switching to a newer version of JSDOM, it is more literal
about listing whitespace as part of textContent, including
newlines and not normalizing multiple spaces.

It seems prudent to just always normalize whitespace for titles,
which are guaranteed to be pretty short anyway.
2018-12-29 18:22:14 +00:00
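
The title normalization that commit e8bb7f722f describes might look roughly like the sketch below. The function name `normalizeTitle` and the exact regular expression are assumptions made for illustration; the repository's own implementation may differ.

```javascript
// Hypothetical helper: collapse every run of whitespace (spaces, tabs,
// newlines) into a single space and drop leading/trailing whitespace.
function normalizeTitle(rawTitle) {
  return rawTitle.trim().replace(/\s+/g, " ");
}

console.log(normalizeTitle("  Hello\n   world  ")); // "Hello world"
```

Because titles are short, running this unconditionally costs very little.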
Gijs Kruitbosch
3610476663 Remove CSS that jsdom struggles to parse 2018-12-29 18:22:14 +00:00
Gijs Kruitbosch
2620542dd1 Split off isProbablyReaderable implementation 2018-12-29 18:22:14 +00:00
Maria Luiza Soares
8c41d92560 Assert on siteName in all test cases 2018-12-21 18:28:28 +00:00
Maria Luiza Soares
1bac47c70d Add newly generated test case 2018-12-21 18:28:28 +00:00
Maria Luiza Soares
262fffd703 Retrieve site name on parse, based on meta og:site_name 2018-12-21 18:28:28 +00:00
Gijs
876c81f710 Update sorting function in Readability.js
Simplify sorting function also considering case where arguments are equal

Co-Authored-By: jemrobinson <james.em.robinson@gmail.com>
2018-11-20 12:08:07 +00:00
James Robinson
ee18c21fc2 Switched sort function from boolean to explicit -1 and 1 thus avoiding failures to sort when false is evaluated as 0 2018-11-20 12:08:07 +00:00
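
The comparator problem behind commits ee18c21fc2 and 876c81f710 can be sketched as follows. The values and function names are hypothetical; the point is only that `Array.prototype.sort` expects a negative, zero, or positive number, while a boolean comparator can only ever yield `true` or `false` (coerced to 1 or 0).

```javascript
// Hypothetical values, for illustration only.
const scores = [3, 10, 1, 7, 5];

// Boolean comparator: never returns a negative number, so the engine is
// never told that `a` should come first. The resulting order is
// implementation-dependent.
const booleanComparator = (a, b) => a < b;

// Numeric comparator: negative, zero (equal arguments), or positive.
// Sorts in descending order.
const numericComparator = (a, b) => b - a;

function sortDescending(values) {
  // Copy first so the caller's array is left untouched.
  return [...values].sort(numericComparator);
}

console.log(sortDescending(scores)); // [10, 7, 5, 3, 1]
```

Returning a difference also covers the equal-arguments case, which a comparator limited to -1 and 1 does not.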
Dan Burzo
44e90de00b Elements that have no .style (e.g. mathml) are probably visible; fixes #493 2018-11-07 13:29:41 +00:00
Hugo Locurcio
9fbe42683a Add .gitattributes file
This ignores HTML (test data) so the repository is considered
to use JavaScript instead of HTML on GitHub.
2018-11-01 15:41:20 +00:00
Daniel Aleksandersen
3be1aaa01c Recognize Sina Weibo meta tags
http://open.weibo.com/wiki/Weibo_meta_tag
2018-08-28 11:04:29 +01:00
Daniel Aleksandersen
5a69d4a8eb Improve metadata extraction (#478)
* Improve metadata extraction

* Recognize meta[property] as a space-separated list
* Recognize Dublin Core (dc|dcterm): metadata.
* Prefer Dublin Core, Open Graph, Twitter, and HTML in that order.
* _getArticleTitle() is now only used as fallback if document
 doesn't provide good metadata.
2018-08-25 00:28:00 +01:00
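
Two of the points in commit 5a69d4a8eb, treating `meta[property]` as a space-separated list and preferring Dublin Core over Open Graph over Twitter, can be sketched as below. The input objects, the key names in `TITLE_PRIORITY`, and both function names are assumptions for illustration, not the repository's actual code; the final HTML fallback is left out.

```javascript
// Hypothetical stand-ins for <meta> elements. A single `property`
// attribute may carry several space-separated names.
const metaTags = [
  { property: "twitter:title", content: "Twitter title" },
  { property: "og:title dc:title", content: "Shared title" },
];

// Assumed key names, listed in the preference order the commit describes.
const TITLE_PRIORITY = ["dc:title", "dcterm:title", "og:title", "twitter:title"];

// Registers the content under every name found in the property attribute.
function collectMetadata(tags) {
  const values = {};
  for (const { property, content } of tags) {
    for (const name of property.trim().split(/\s+/)) {
      if (name) {
        values[name] = content;
      }
    }
  }
  return values;
}

// Returns the first title found in priority order, or null if none exists.
function pickTitle(values) {
  for (const key of TITLE_PRIORITY) {
    if (key in values) {
      return values[key];
    }
  }
  return null;
}

console.log(pickTitle(collectMetadata(metaTags))); // "Shared title"
```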
Daniel Aleksandersen
0449dbf186 Recognize more iframe video embed video services
* TenCent QQ Video, Alexa Rank 8
* Twitch clips and streams, Alexa Rank 33
* Internet Archive, Alexa Rank 265
* Wikimedia, Alexa Rank 347
2018-08-22 16:08:46 +01:00
Gijs Kruitbosch
f782bc5f06 Avoid global flag when looking for metadata using regexes 2018-08-21 17:56:25 +02:00
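
One well-known hazard of the global flag, and plausibly the motivation for commit f782bc5f06 (an assumption, since the message does not say), is that a `/g` regex remembers `lastIndex` between `test()` calls. The pattern and names below are hypothetical.

```javascript
// Hypothetical patterns: identical except for the global flag.
const globalPattern = /title/g;
const plainPattern = /title/;

// Counts how many names the pattern matches, reusing one regex object.
function countMatches(pattern, names) {
  let count = 0;
  for (const name of names) {
    if (pattern.test(name)) {
      count++;
    }
  }
  return count;
}

const names = ["title", "title", "title"];

// The global regex resumes from lastIndex after each successful match,
// so the second identical string fails and resets the index.
console.log(countMatches(globalPattern, names)); // 2
console.log(countMatches(plainPattern, names)); // 3
```

Dropping the flag makes each `test()` call independent of the previous one.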
Johann Hofmann
93a2f1b026
Merge pull request #471 from gijsk/moar-eslint
Add more eslint rules (fixes #457)
2018-07-16 07:15:08 +02:00
Gijs Kruitbosch
30611cc57f Fix quotes issues in test and benchmark files 2018-07-15 15:43:50 +01:00
Gijs Kruitbosch
f511d1aa2b Enable eslint checks for quotes and single-line loops/conditionals 2018-07-14 22:09:14 +01:00
Gijs Kruitbosch
7cf95bd427 Fix same-line loops and if statements 2018-07-14 22:09:00 +01:00
Gijs Kruitbosch
d9f7bb2965 Fix quotes 2018-07-14 14:28:41 +01:00
Gijs Kruitbosch
7d03bec52d Fix issues with finding nytimes content caused by in-article ads 2018-07-10 14:13:01 +01:00
tmm2018
076bf2017b [docs] - mozilla/readability - README.md - fixing tiny little issues (grammar, rhetoric, spelling, etc.) (#462)
* [docs] - mozilla/readability - README.md - add articles to the description of the properties of the Readability output
2018-06-13 08:14:36 -07:00
Gijs
4b193ccd6a
Include URI information for jsdom in the README.
See #453 for an example of where this led to confusion.
2018-06-12 10:12:55 -07:00
Gijs Kruitbosch
8fec62d246 Strip XML namespaces from tag names to deal with broken serializations 2018-06-09 09:51:16 +01:00
Gijs Kruitbosch
8e92a1fa19 Reuse textNode variable for CDATA blocks, too 2018-06-09 09:49:25 +01:00
David A Roberts
ea4165721f Remove single-cell tables 2018-06-09 09:49:01 +01:00
David A Roberts
bf64b58d90 Update tests 2018-06-08 11:06:01 +01:00
David A Roberts
72bd1a8532 Don't nest paragraphs 2018-06-08 11:06:01 +01:00
David A Roberts
68c9af4ffa Use numeric encoding for non-XML entities
JSDOMParser can't handle HTML named entities like `&nbsp;`
2018-06-08 11:06:01 +01:00