Commit Graph

380 Commits

Author SHA1 Message Date
Daniel Aleksandersen
3be1aaa01c Recognize Sina Weibo meta tags
http://open.weibo.com/wiki/Weibo_meta_tag
2018-08-28 11:04:29 +01:00
Daniel Aleksandersen
5a69d4a8eb Improve metadata extraction (#478)
* Improve metadata extraction

* Recognize meta[property] as a space-separated list
* Recognize Dublin Core (dc|dcterm): metadata.
* Prefer Dublin Core, Open Graph, Twitter, and HTML in that order.
* _getArticleTitle() is now only used as fallback if document
 doesn't provide good metadata.
2018-08-25 00:28:00 +01:00
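The precedence described in the commit above can be sketched as follows. This is a minimal illustration, not Readability's actual code: `collectMetadata`, `pickTitle`, `TITLE_KEYS`, and the `{ property, content }` input shape are hypothetical names chosen for the example.

```javascript
// Hypothetical sketch, not the library's implementation.
// `metaTags` stands in for <meta property="..." content="..."> elements.
function collectMetadata(metaTags) {
  const values = {};
  for (const { property, content } of metaTags) {
    // A property attribute may hold several space-separated names.
    for (const name of property.trim().split(/\s+/)) {
      if (name && content) {
        values[name.toLowerCase()] = content.trim();
      }
    }
  }
  return values;
}

// Assumed key names, ordered as in the commit message:
// Dublin Core, then Open Graph, then Twitter, then plain HTML.
const TITLE_KEYS = ["dc:title", "dcterm:title", "og:title", "twitter:title", "title"];

function pickTitle(values, fallback) {
  for (const key of TITLE_KEYS) {
    if (values[key]) {
      return values[key];
    }
  }
  // Use the fallback only when no metadata supplies a title.
  return fallback;
}
```

Here `fallback` plays the role the commit assigns to `_getArticleTitle()`: it is consulted only when the document's own metadata yields nothing usable.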
Daniel Aleksandersen
0449dbf186 Recognize more iframe video embed video services
* TenCent QQ Video, Alexa Rank 8
* Twitch clips and streams, Alexa Rank 33
* Internet Archive, Alexa Rank 265
* Wikimedia, Alexa Rank 347
2018-08-22 16:08:46 +01:00
Gijs Kruitbosch
f782bc5f06 Avoid global flag when looking for metadata using regexes 2018-08-21 17:56:25 +02:00
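The commit above does not state its motivation. A common reason to avoid the global flag, assumed here, is that a `g` regex keeps its `lastIndex` between calls, so reusing one pattern across several strings can skip matches. The names in this sketch are illustrative only.

```javascript
// Illustrative patterns; not taken from Readability's source.
const globalPattern = /title/g;
const plainPattern = /title/;

// Count how many inputs the pattern reports as matching.
function countMatches(pattern, inputs) {
  let hits = 0;
  for (const input of inputs) {
    if (pattern.test(input)) {
      hits++;
    }
  }
  return hits;
}

const inputs = ["title", "title", "title"];

// With the g flag, the second test() starts at lastIndex 5 and fails,
// which resets lastIndex to 0, so only the first and third inputs match.
const withGlobal = countMatches(globalPattern, inputs); // 2

// Without the g flag, every call starts from the beginning of the string.
const withoutGlobal = countMatches(plainPattern, inputs); // 3
```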
Johann Hofmann
93a2f1b026 Merge pull request #471 from gijsk/moar-eslint
Add more eslint rules (fixes #457)
2018-07-16 07:15:08 +02:00
Gijs Kruitbosch
30611cc57f Fix quotes issues in test and benchmark files 2018-07-15 15:43:50 +01:00
Gijs Kruitbosch
f511d1aa2b Enable eslint checks for quotes and single-line loops/conditionals 2018-07-14 22:09:14 +01:00
Gijs Kruitbosch
7cf95bd427 Fix same-line loops and if statements 2018-07-14 22:09:00 +01:00
Gijs Kruitbosch
d9f7bb2965 Fix quotes 2018-07-14 14:28:41 +01:00
Gijs Kruitbosch
7d03bec52d Fix issues with finding nytimes content caused by in-article ads 2018-07-10 14:13:01 +01:00
tmm2018
076bf2017b [docs] - mozilla/readibility - README.md - fixing tiny little issues (grammar, rethorics, spelling, etc.) (#462)
* [docs] - mozilla/readibility - README.md - add articles to the description of the properties of the Readability output
2018-06-13 08:14:36 -07:00
Gijs
4b193ccd6a Include URI information for jsdom in the README.
See #453 for an example of where this led to confusion.
2018-06-12 10:12:55 -07:00
Gijs Kruitbosch
8fec62d246 Strip XML namespaces from tag names to deal with broken serializations 2018-06-09 09:51:16 +01:00
Gijs Kruitbosch
8e92a1fa19 Reuse textNode variable for CDATA blocks, too 2018-06-09 09:49:25 +01:00
David A Roberts
ea4165721f Remove single-cell tables 2018-06-09 09:49:01 +01:00
David A Roberts
bf64b58d90 Update tests 2018-06-08 11:06:01 +01:00
David A Roberts
72bd1a8532 Don't nest paragraphs 2018-06-08 11:06:01 +01:00
David A Roberts
68c9af4ffa Use numeric encoding for non-XML entities
JSDOMParser can't handle HTML named entities like `&nbsp;`
2018-06-08 11:06:01 +01:00
David A Roberts
611e9e3a6f JSDOMParser: handle CDATA sections 2018-06-08 11:06:01 +01:00
David A Roberts
afcc4b8e49 Fix titles not being trimmed sometimes 2018-06-08 11:06:01 +01:00
Gijs Kruitbosch
d4b842c82a Match headings on trimmed strings to avoid whitespace causing mismatches 2018-05-24 16:32:16 +01:00
Gijs Kruitbosch
8c02a0d34c Fix #283 and remove hidden nodes 2018-05-16 21:54:12 +01:00
David A Roberts
656a6673d9 Don't put non-phrasing content into paragraphs 2018-05-16 10:43:30 +01:00
David A Roberts
5ae90930cd Don't convert DIVs to Ps when more than 25% links 2018-05-15 13:29:55 +01:00
David A Roberts
9f2c5cb42e Put phrasing content into paragraphs
This removes the need for `p.readability-styled` elements.
2018-05-15 13:29:55 +01:00
David A Roberts
c823a6efb2 Fix generate-testcase.js 2018-05-14 09:44:33 +01:00
Gijs Kruitbosch
f4ab856992 Check for a document being passed
This provides a descriptive error message if no document is passed, and
ignores the first argument if the second argument looks like
a reasonable DOM document instance.
2018-05-01 12:29:22 +01:00
David A Roberts
7a24801958 Don't include root html node in candidates
Fixes #435
2018-04-29 09:22:00 +01:00
David A Roberts
acfd3759a1 Generate XHTML-compatible input for test cases
Fixes the bug noted in the README
2018-04-28 22:28:16 +01:00
David A Roberts
d60184966c Remove unused URI parameter from constructor 2018-04-26 10:30:15 +01:00
David A Roberts
5ee03bc960 Stop Readability depending on Node.* constants 2018-04-22 16:23:46 +01:00
Andres Rey
3c76104adb Fix engadget test case 2018-03-23 00:27:50 +00:00
Andres Rey
4b99f41ec9 Add engadget test case 2018-03-23 00:27:50 +00:00
Andres Rey
6c5bc62959 Remove aside tags on test cases 2018-03-23 00:27:50 +00:00
Andres Rey
6fd816496c Clean <aside> tags on _prepArticle 2018-03-23 00:27:50 +00:00
David A Roberts
f8d9b1c224 Update test expectations 2018-03-19 13:20:29 +00:00
David A Roberts
8414158fa9 Fix _replaceBrs
Previously, `nextElem` was not actually proceeding to the next element, and therefore aborting the paragraph at the first `<br>` (rather than the first `<br><br>` as the comment indicates).
2018-03-19 13:20:29 +00:00
Joan Espasa Arxer
3ff9a166fb Changed wordThreshold to charThreshold to better reflect the semantics. 2018-03-13 14:37:27 +00:00
Brad Philips
8525c6af36 Fix relative URIs given <base> tags (#422) 2018-03-02 11:38:14 +00:00
Gijs Kruitbosch
d598baf02b Improve URL handling in JSDOMParser and Readability.js
This change ups the required node version to 7.0 because it relies on the builtin url module.

We now pass a url when constructing a jsdom document or JSDOMParser document.
Because this is an API change, I'm increasing the package version.

Ultimately, I would like to remove the URI argument from the readability constructor. It should
use the documentURI from the document it is passed.
2018-02-28 11:29:29 +00:00
Andres Rey
834672ef86 Return longest text after failing to detect text longer than the configured value (#423)
Save extracted text across attempts and return the longest one when all attempts fail, and add a test case from hukumusume
2018-02-27 14:26:54 +00:00
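The fallback described in the commit above can be sketched like this. It is a simplified illustration under stated assumptions: `extractWithFallback`, `attempts`, and `threshold` are hypothetical names, and each attempt is modelled as a function that returns extracted text.

```javascript
// Hypothetical sketch, not the library's implementation.
// `attempts` is an array of functions, each returning the text extracted
// under one configuration; `threshold` is the minimum acceptable length.
function extractWithFallback(attempts, threshold) {
  let longest = "";
  for (const attempt of attempts) {
    const text = attempt();
    if (text.length >= threshold) {
      // Long enough: accept this result immediately.
      return text;
    }
    if (text.length > longest.length) {
      // Remember the best result seen so far.
      longest = text;
    }
  }
  // Every attempt fell short, so return the longest text that was saved.
  return longest;
}
```

Keeping the best result across attempts means a short article still produces output instead of nothing when no attempt reaches the threshold.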
Tom Z?hner
264b8e8968 Remove link elements when preparing article for display 2018-01-30 15:11:59 +00:00
Thomas Jaggi
fd1557560a [Docs] Fixed JSDOM usage note 2018-01-02 22:12:53 +00:00
Andres Rey
fa9d8bda48 Add la-nacion test case 2017-12-11 14:00:48 +00:00
Andres Rey
01ffd0c617 Remove "modal" from strings to remove 2017-12-11 14:00:48 +00:00
Gijs
8da91b9eed Fix omitted semicolon 2017-12-05 13:22:32 +00:00
Gijs
0a30527c85 Explicitly mention lack of Node in node.js environments 2017-12-05 13:22:01 +00:00
Gijs Kruitbosch
807bf05aa3 Fix className usage so it deals correctly with SVG nodes (fixes #412). 2017-12-05 11:06:39 +00:00
Gijs Kruitbosch
c586aeb404 Fix generate-testcase.js script so it keeps caption classes 2017-11-30 10:41:15 +00:00
Gijs Kruitbosch
ad4dd26448 Update test expectations 2017-11-30 10:41:15 +00:00