mercury-parser

Commit Graph

Author: John Brayton
SHA1: 143631b4b7
Date: 2 years ago
Message:

feat: arstechnica.com extractor (#553)

* feat: Add a custom extractor for ma.ttias.be.

When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows:

* Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements to avoid them being removed.
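The three adjustments above could be expressed as content transforms in a custom extractor. The following is only a minimal sketch, not the code from this commit: the object name `MaTtiasBeExtractorSketch` and the `.content` selector are assumptions made for illustration, and it relies on the custom-extractor convention that a transform function receives a cheerio-wrapped node and may return a tag name to convert the node to.

```javascript
// Hypothetical sketch of the transforms described in the commit message.
// The name and the '.content' selector are assumptions, not taken from the commit.
const MaTtiasBeExtractorSketch = {
  domain: 'ma.ttias.be',

  content: {
    // Assumed selector for the article body.
    selectors: ['.content'],

    transforms: {
      // Drop the "id" attribute so the heading is not scored with a low weight.
      h1: $node => {
        $node.removeAttr('id');
      },

      // Drop the "id" attribute, then demote h2 to h3 by returning the new
      // tag name (h1 elements are themselves demoted to h2 by the parser).
      h2: $node => {
        $node.removeAttr('id');
        return 'h3';
      },

      // Mark lists with this class so they are not removed during cleaning.
      ul: $node => {
        $node.attr('class', 'entry-content-asset');
      },
    },
  },
};
```

Returning `'h3'` from the `h2` transform keeps the attribute removal and the demotion in one place, so the two steps cannot drift apart.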

* Removed redundant comment.

* feat: Add a custom extractor for engadget.com.

* Works, but I need to figure out how to make pagination work correctly.

* Fixed pagination: only the first or second page would be retrieved, because contentOnly: true was sent on subsequent pages (page 2).
Removed failover: true from preview.

* Rolled back { fallback: false } option removal.

* Clarified comments.

Co-authored-by: John Holdun <john@johnholdun.com>

1 commit (7b68bcd94c18f11499031324ddf22575a5400fdd)