* include more ancestors in candidate scoring
* fix medium-3 testcase
The original source file contained two copies of the document, which
was causing incorrect results
* remove unnecessary nested elements
* fix removal of empty elements
* add option to regenerate all testcases
* update tests
* fix quanta testcase
* fix creating testcase from network
* fix early exit in testcase generation
* format HTML before comparing while testing
* upgrade js-beautify
* don't merge outer readability div
* Add initial test case for kinja's lazy image
* Implement method to remove small data URI images
* Convert relative uri in poster and srcset of media nodes
* ESLint doesn't like arrow functions
* Unescape HTML entities in metadata
* Fix wrong regex for parsing srcset urls
* Remove line to check data URL since it is already handled by new URL
* Replace String.matchAll since it is only supported in Node 12+
* Use numeric code when unescaping HTML
* Don't remove data URL src if it's svg
* Don't remove b64 src if it's the only attr that contains image
* Make the comma part non-optional in regex for srcset url
* Fix wrong code for unescaping HTML
* Don't capture comma and semicolon in data URL regex
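
The entity-unescaping items above can be illustrated with a minimal sketch. It is not the project's actual code: the function name `unescapeHtmlEntities` and the entity table are assumptions made for illustration. It decodes a handful of named entities plus decimal and hexadecimal numeric references in a single pass, so `&amp;#39;` is not unescaped twice.

```javascript
// Hypothetical table of named entities; only these five are handled here.
var NAMED_ENTITIES = { quot: '"', amp: "&", apos: "'", lt: "<", gt: ">" };

// Matches &name;, &#xHEX; and &#DEC; in one pattern.
var ENTITY_PATTERN =
  /&(?:(quot|amp|apos|lt|gt)|#[xX]([0-9a-fA-F]{1,6})|#([0-9]{1,7}));/g;

function unescapeHtmlEntities(str) {
  if (!str) {
    return str;
  }
  return str.replace(ENTITY_PATTERN, function (match, name, hex, dec) {
    if (name) {
      return NAMED_ENTITIES[name];
    }
    // Numeric reference: parse with the radix that matches its form.
    var codePoint = hex ? parseInt(hex, 16) : parseInt(dec, 10);
    // String.fromCodePoint throws above U+10FFFF, so substitute U+FFFD.
    if (codePoint === 0 || codePoint > 0x10ffff) {
      return "\uFFFD";
    }
    return String.fromCodePoint(codePoint);
  });
}
```

Plain function expressions are used instead of arrow functions, in keeping with the lint note above.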
This avoids `contentWithSidebar` causing complete removal of the content.
As a side-effect, it slightly improves byline detection by not removing
content as early on as before.
A newer version of JSDOM is more literal about listing whitespace
as part of textContent: it includes newlines and does not
normalize multiple spaces.
It seems prudent to just always normalize whitespace for titles,
which are guaranteed to be pretty short anyway.
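
A minimal sketch of the title normalization described above, assuming a hypothetical helper named `normalizeTitle` (the name and exact regex are illustrative, not taken from the code base). It trims the string and collapses every run of whitespace, including newlines and tabs, to a single space.

```javascript
// Hypothetical helper: collapse whitespace runs so a title reads the same
// regardless of how literally the DOM implementation reports textContent.
function normalizeTitle(rawTitle) {
  return String(rawTitle).trim().replace(/\s+/g, " ");
}
```

Because titles are short, running this unconditionally costs next to nothing.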
* Improve metadata extraction
* Recognize meta[property] as a space-separated list
* Recognize Dublin Core (dc|dcterm): metadata.
* Prefer Dublin Core, Open Graph, Twitter, and HTML in that order.
* _getArticleTitle() is now only used as a fallback if the document
doesn't provide good metadata.
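
The lookup order above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: `collectMetadata`, `pickTitle`, and the plain objects standing in for `<meta>` elements are all hypothetical. Each `property` value is split on whitespace so that one tag can supply several keys, and the title is chosen by preferring Dublin Core, then Open Graph, then Twitter, then plain HTML.

```javascript
// Preference order for the title; earlier keys win.
var TITLE_KEYS = ["dc:title", "dcterm:title", "og:title", "twitter:title", "title"];

// metaEntries: plain objects shaped like { property, name, content },
// a stand-in (assumption) for real <meta> elements.
function collectMetadata(metaEntries) {
  var values = {};
  metaEntries.forEach(function (entry) {
    if (!entry.content) {
      return;
    }
    // A property attribute may hold a space-separated list of names.
    var keys = (entry.property || entry.name || "").toLowerCase().split(/\s+/);
    keys.forEach(function (key) {
      if (key) {
        values[key] = entry.content.trim();
      }
    });
  });
  return values;
}

// fallbackTitle plays the role of a title derived from the document itself.
function pickTitle(values, fallbackTitle) {
  for (var i = 0; i < TITLE_KEYS.length; i++) {
    if (values[TITLE_KEYS[i]]) {
      return values[TITLE_KEYS[i]];
    }
  }
  return fallbackTitle;
}
```

The fallback argument is consulted only when none of the metadata keys yields a value, mirroring the role described for `_getArticleTitle()`.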