* feat: Add a custom extractor for ma.ttias.be.
When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows:
* Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements to avoid them being removed.
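The three changes above could be expressed as content transforms in a custom extractor, roughly as sketched below. This is an illustration, not the committed code: the `.entry-content` selector is a placeholder, and the sketch assumes the Mercury Parser convention that a transform function receives a cheerio-wrapped node and may return a tag name to rename the element.

```javascript
// Sketch of a custom extractor for ma.ttias.be.
// In a real extractor file this object would be exported.
const MaTtiasBeExtractor = {
  domain: 'ma.ttias.be',

  content: {
    // Hypothetical container selector; the real one may differ.
    selectors: ['.entry-content'],

    transforms: {
      // Drop the id so the heading is not given a low weight.
      h1: $node => {
        $node.removeAttr('id');
      },
      // Drop the id, then demote to h3, since h1 is demoted to h2 later.
      h2: $node => {
        $node.removeAttr('id');
        return 'h3';
      },
      // Mark lists as content assets so they are not removed.
      ul: $node => {
        $node.attr('class', 'entry-content-asset');
      },
    },

    clean: [],
  },
};
```

Returning a string from the `h2` transform is what requests the rename; the `h1` and `ul` transforms return nothing and only adjust attributes.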
* removed redundant comment.
Co-authored-by: John Holdun <john@johnholdun.com>
* Adding custom parser for Times of India
* moved transforms to clean
The transforms were only acting as cleans, so they were moved into clean as recommended.
Co-authored-by: Postlight Bot <adam.pash+postlight-bot@postlight.com>
* generate the custom extractor and get the first test to pass
* add the basic extractors (title, author, date, etc)
* select the score as well as the review text, and break the content test
* prepend the score to the content
* get the date from the datetime attribute
* mangle this test a little, but just a little (it does work properly)
* move from prepending the score to the review text to adding it as a custom field in the extractor
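A minimal sketch of what the final shape of this series might look like, with the score exposed as a custom field instead of being prepended to the content. The domain and every CSS selector here are hypothetical, and the sketch assumes Mercury Parser's `extend` key for custom fields and its `[selector, attribute]` pair form for reading an attribute value.

```javascript
// Sketch only: domain and selectors are placeholders, not the real ones.
const ReviewSiteExtractor = {
  domain: 'reviews.example.com',

  // Read the date from the datetime attribute rather than the text.
  date_published: {
    selectors: [['time', 'datetime']],
  },

  // The review text stays the only content.
  content: {
    selectors: ['.review-text'],
  },

  // The score is returned as its own field in the parse result.
  extend: {
    score: {
      selectors: ['.score'],
    },
  },
};
```

Keeping the score out of `content` means the content test only has to match the review text.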
* Add prototype of custom extractor for clinicaltrials.gov
* Add .DS_Store to gitignore
* Make tests for title, author and date_published selectors pass
* Make content selector test pass
* Fix date_published test
* Rebuild
* Remove .DS_Store from gitignore
* Improve extractor and test/fixture of clinicaltrials.gov
* feat: prospect magazine parser
Couldn’t find a way to parse the date but I think it’s good otherwise.
* fix: pulls date
* fix: add timezone
* fix: generalize
* feat: forward.com parser
LGTM although image didn’t show up in preview
* feat: also pull image into content
* fix: generalize selectors
* fix: generalize selector
* feat: qdaily parser
First, I accidentally tried to generate the parser on the master
branch, and I’m not sure where it ended up; maybe it’s floating in
the netherworld.
On to the parser: this one was a bit tricky because things were in
Chinese! The content appears to be parsing (as seen in preview) but
it’s not passing the test. I noticed the second “‘” mark isn’t
appearing on the parser side.
Additionally, some of the lazy-loaded images aren’t appearing in the
preview (I cleaned out the incorrect lazy-load images that did
appear), so someone will probably have to work on that (I don’t know
how to do transforms yet).
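For whoever picks up the lazy-loaded images, one possible approach is a content transform that copies the deferred URL into `src`. This is a sketch under assumptions: the `data-src` attribute name is a guess about how the site defers its images and has not been checked against the fixture, and the function relies on the transform receiving a cheerio-wrapped node.

```javascript
// Sketch: would sit under content.transforms in the extractor.
// 'data-src' is a hypothetical attribute name; check the fixture HTML.
const lazyImageTransforms = {
  img: $node => {
    const lazySrc = $node.attr('data-src');
    // Only overwrite src when a deferred URL is actually present.
    if (lazySrc) {
      $node.attr('src', lazySrc);
    }
  },
};
```

Images without the deferred attribute are left untouched, so the transform should be safe to apply to every `img` in the content.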
* fix tests
* fix: selector generalization