* feat: Add a custom extractor for ma.ttias.be.
When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows (see the sketch below):
* Remove "id" attributes from "h1" and "h2" elements; those attributes caused the elements to receive a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements so they are not removed.
* Removed a redundant comment.
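A minimal sketch of how these rules can be expressed as transforms in a Mercury Parser custom extractor; the content selector is an assumption for illustration, not the shipped extractor.

```js
export const MaTtiasBeExtractor = {
  domain: 'ma.ttias.be',

  content: {
    // Hypothetical content selector, used here only for illustration.
    selectors: ['div.entry-content'],

    transforms: {
      // Strip the id so the heading is not down-weighted by the scorer.
      h1: $node => {
        $node.removeAttr('id');
      },
      // Mercury already demotes h1 to h2, so demote h2 to h3 (and strip its id).
      h2: $node => {
        $node.removeAttr('id');
        return 'h3';
      },
      // Flag lists as content assets so the cleaner keeps them.
      ul: $node => {
        $node.attr('class', 'entry-content-asset');
      },
    },
  },
};
```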
* feat: Add a custom extractor for engadget.com.
* feat: Add a custom extractor for www.ndtv.com.
* Works, but I still need to figure out how to make pagination work correctly.
* Fixed pagination: only the first or second page would be retrieved because contentOnly: true was sent on subsequent pages (page 2).
Removed failover: true from the preview.
* Rolled back the removal of the { fallback: false } option.
* Clarified comments.
* Rolled back yarn.lock changes.
Co-authored-by: John Holdun <john@johnholdun.com>
* Implemented a custom extractor for www.gruene.de
* Cleaner output for the www.gruene.de custom extractor
* Updated the www.gruene.de fixture from the real page
* Tried to pick the lead image from og:image; it doesn't work yet (see the note below).
Co-authored-by: John Holdun <john@johnholdun.com>
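A likely reason, for the record: Mercury Parser normalizes meta tags before extraction (property becomes name, content becomes value), so a selector written against the raw og:image markup won't match. A hedged sketch of a lead image selector under that assumption:

```js
export const WwwGrueneDeExtractor = {
  domain: 'www.gruene.de',

  lead_image_url: {
    selectors: [
      // After normalization, <meta property="og:image" content="..."> is
      // exposed as name="og:image" with the URL in the "value" attribute.
      ['meta[name="og:image"]', 'value'],
    ],
  },
};
```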
* feat: Add a custom extractor for pastebin.com
* feat: transform <li> elements to <p> in the pastebin.com extractor (see the sketch below)
Co-authored-by: Felipe Canejo <felipecanejo@gmail.com>
Co-authored-by: John Holdun <john@johnholdun.com>
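A minimal sketch of the li-to-p transform using the string shorthand that custom extractors support; the content selector is a guess for illustration.

```js
export const PastebinComExtractor = {
  domain: 'pastebin.com',

  content: {
    // Hypothetical selector; the real extractor may target a different node.
    selectors: ['#selectable .text'],

    transforms: {
      // Rewriting each list item as a paragraph preserves one paste line per <p>.
      li: 'p',
    },
  },
};
```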
Several of the custom extractors above were failing tests as I tried updating dependencies. It seems that some of the format detection logic has changed, and making the date detectors more explicit fixes them (a sketch follows).
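As an illustration of what "more explicit" can mean here, assuming the date_published block accepts format and timezone hints (the domain, selector, and format string below are all made up):

```js
export const ExampleComExtractor = {
  domain: 'example.com', // placeholder

  date_published: {
    selectors: ['.post-date'], // hypothetical selector

    // Assumed hints: state the exact date format and timezone rather than
    // relying on automatic format detection.
    format: 'MMM D, YYYY',
    timezone: 'America/New_York',
  },
};
```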
* Added a custom parser for Times of India
* Moved transforms to clean
The transforms were effectively acting as cleans, so they were moved accordingly, as recommended (see the sketch below).
Co-authored-by: Postlight Bot <adam.pash+postlight-bot@postlight.com>
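A sketch of the distinction, with assumed selectors: nodes that only need to be deleted belong in clean, while transforms is reserved for actually rewriting nodes.

```js
export const TimesOfIndiaExtractor = {
  domain: 'timesofindia.indiatimes.com', // assumed domain

  content: {
    selectors: ['div.article-content'], // hypothetical selector

    // Matched nodes are simply removed from the extracted content, which is
    // all the earlier "transforms" were effectively doing.
    clean: ['.ad-slot', '.related-stories'], // hypothetical selectors
  },
};
```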
* feat: ability to add custom extractors via the API (see the sketch after this list)
* docs: updating readme
* fix: example.com was being used in another test
* fix: timezone was messing up date_published test
* fix: using a unique site for testing
* fix: updated custom extractor api
* docs: updating readme
* fix: removing unused fixture
* fix: updating test description
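A minimal sketch of adding a custom extractor at runtime, assuming the API exposes an addExtractor method as this change describes; the extractor below is a placeholder.

```js
import Mercury from '@postlight/mercury-parser';

// Placeholder extractor used only to illustrate the runtime API.
const customExtractor = {
  domain: 'www.example.com',
  title: { selectors: ['h1'] },
  content: { selectors: ['article'] },
};

Mercury.addExtractor(customExtractor);

Mercury.parse('https://www.example.com/some-post').then(result => {
  console.log(result.title);
});
```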
* feat: ability to add custom extractors via the CLI
* fix: medium extractor now pulls content
* fix: remove youtube caption if no preview available
* fix: remove youtube node if no image
* fix: removing dek from medium.com extractor
* generate the custom extractor and get the first test to pass
* add the basic extractors (title, author, date, etc.)
* select the score as well as the review text, and break the content test
* prepend the score to the content
* get the date from the datetime attribute
* mangle this test a little, but just a little (it does work properly)
* move from prepending the score to the review text to adding it as a custom field in the extractor (see the sketch below)
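A sketch of both techniques, with a made-up domain and selectors: the date is read from a datetime attribute via a [selector, attribute] pair, and the score is surfaced as a custom result field through extend rather than being prepended to the review text.

```js
export const ExampleReviewsExtractor = {
  domain: 'reviews.example.com', // hypothetical domain

  date_published: {
    selectors: [
      // A [selector, attribute] pair reads an attribute instead of text.
      ['time.review-date', 'datetime'],
    ],
  },

  extend: {
    // Custom field: the score comes back alongside the standard fields
    // instead of being prepended to the content.
    score: {
      selectors: ['span.review-score'],
    },
  },
};
```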
* Add prototype of custom extractor for clinicaltrials.gov
* Add .DS_Store to gitignore
* Make tests for title, author and date_published selectors pass
* Make content selector test pass
* Fix date_published test
* Rebuild
* Remove .DS_Store from gitignore
* Improve extractor and test fixture for clinicaltrials.gov
* fix: new yorker extractor
* fix: date_published selector
* fix: remove footer from content
* feat: add additional selector for title
* feat: support articles with multiple authors (see the sketch below)
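A sketch that ties these fixes into the custom-extractor format; the selectors are assumptions rather than the ones actually used for newyorker.com.

```js
export const NewYorkerExtractor = {
  domain: 'www.newyorker.com',

  title: {
    // The second selector acts as a fallback when the first does not match.
    selectors: ['h1[class^="ArticleHeader"]', 'h1.title'],
  },

  author: {
    // A byline container that can hold several linked author names.
    selectors: ['div[class^="Byline"]'],
  },

  date_published: {
    selectors: [['meta[name="article:published_time"]', 'value']],
  },

  content: {
    selectors: ['article.article-body'],
    // Strip the site footer that was leaking into the extracted content.
    clean: ['footer'],
  },
};
```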