120 Commits (e217648c0be8fb48f40033e43da2134e71266c56)

Author SHA1 Message Date
John Brayton e217648c0b
feat: ma.ttias.be extractor (#551)
* feat: Add a custom extractor for ma.ttias.be.

When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows:

* Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements to avoid them being removed.

* removed redundant comment.

Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
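
The three adjustments listed in the ma.ttias.be commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `makeNode`, `contentTransforms`, and `applyTransform` are hypothetical names, the node object is a plain stand-in for the DOM wrapper a real parser would pass in, and the convention that a transform returning a string renames the element is an assumption.

```javascript
// Hypothetical stand-in for the element wrapper handed to a transform.
function makeNode(tag, attrs) {
  return {
    tag,
    attrs: { ...attrs },
    removeAttr(name) {
      delete this.attrs[name];
      return this;
    },
    addClass(name) {
      const current = this.attrs.class ? this.attrs.class.split(/\s+/) : [];
      if (!current.includes(name)) current.push(name);
      this.attrs.class = current.join(' ');
      return this;
    },
  };
}

// Sketch of the content transforms described in the commit message.
// Assumption: returning a string renames the element to that tag.
const contentTransforms = {
  // Drop the id so the heading is not scored down; h1 -> h2 is left to the parser.
  h1: node => { node.removeAttr('id'); },
  // Drop the id and demote h2 to h3 to stay below the demoted h1 elements.
  h2: node => { node.removeAttr('id'); return 'h3'; },
  // Mark lists as content assets so they are not removed.
  ul: node => { node.addClass('entry-content-asset'); },
};

// Apply the matching transform to one node, renaming it if a tag name is returned.
function applyTransform(node) {
  const transform = contentTransforms[node.tag];
  if (!transform) return node;
  const newTag = transform(node);
  if (typeof newTag === 'string') node.tag = newTag;
  return node;
}

const heading = applyTransform(makeNode('h2', { id: 'news' }));
const list = applyTransform(makeNode('ul', {}));
console.log(heading.tag, heading.attrs, list.attrs.class);
```

Keeping each rule as a small per-tag function mirrors the one-bullet-per-change structure of the commit message, so each adjustment can be tested on its own.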
Nitin Khanna 8c9982247b
feat: Ladbible.com extractor (#624)
* Ladbible.com extractors and test

* CircleCI says timezone needs to be Europe/London aka BST

Co-authored-by: Postlight Bot <adam.pash+postlight-bot@postlight.com>
Co-authored-by: Jad Termsani <32297675+JadTermsani@users.noreply.github.com>
3 years ago
Nitin Khanna 30d6f472ee
feat: Times of India extractor (#503)
* Adding custom parser for Times of India

* moved transforms to clean

The transforms were just working as cleans. Moved things around as per recommendations.

Co-authored-by: Postlight Bot <adam.pash+postlight-bot@postlight.com>
3 years ago
Sven Wiegand f95947fe88 Implemented custom extractor epaper.zeit.de (#488) 5 years ago
david0leong 911b0f87c8 Add custom extractor for biorxiv.org (#467)
* Add custom extractor for biorxiv.org

* Fix content selector

* Improve content selector
5 years ago
Ben Ubois 0942c37876 feat: custom parser for phoronix.com. (#431) 5 years ago
Michael P. Geraci 571a913745 feat: pitchfork extractor (#439)
* generate the custom extractor and get the first test to pass

* add the basic extractors (title, author, date, etc)

* select the score as well as the review text, and break the content test

* prepend the score to the content

* get the date from the datetime attribute

* mangle this test a little, but just a little (it does work properly)

* move from prepending the score to the review text to adding it as a custom field in the extractor
5 years ago
david0leong 694ea820aa Custom Extractor for clinicaltrials.gov (#305)
* Add prototype of custom extractor for clinicaltrials.gov

* Add .DS_Store to gitignore

* Make tests for title, author and date_published selectors pass

* Make content selector test pass

* Fix date_published test

* Rebuild

* Remove .DS_Store from gitignore

* Improve extractor and test/fixture of clinicaltrials.gov
5 years ago
Wajeeh Zantout e66ad8b81c feat: add le monde extractor (#415) 5 years ago
kik0220 f81dc63617 feat: add rbbtoday.com custom parser (#411)
* feat: add rbbtoday.com custom parser

* fix: content test

* fix: dek and content
5 years ago
kik0220 5e1113b3a9 feat: add japan.zdnet.com custom parser (#410)
* feat: add japan.zdnet.com custom parser

* fix: author and date_published selector
5 years ago
kik0220 77e3bc00e2 feat: add wired.jp custom parser (#409)
* feat: add wired.jp custom parser

* fix: author test

* fix: date_published selector

* test: fix dek and content

* test: fix content (without clean dek)
5 years ago
kik0220 0b36c96de0 feat: add techlog.iij.ad.jp custom parser (#405)
* feat: add techlog.iij.ad.jp custom parser

* fix: date_published and content selector
5 years ago
kik0220 406bf1b1a9 feat: add weekly.ascii.jp custom parser (#401)
* feat: add weekly.ascii.jp custom parser

* fix: title and date_published selector
5 years ago
kik0220 216bfade00 feat: add www.ipa.go.jp custom parser (#408) 5 years ago
kik0220 3ae8f3bde3 feat: add www.oreilly.co.jp custom parser (#407) 5 years ago
kik0220 7396e81b72 feat: add sect.iij.ad.jp custom parser (#404) 5 years ago
kik0220 3f1d9030ee feat: add www.lifehacker.jp custom parser (#403) 5 years ago
kik0220 b077000c4a feat: add getnews.jp custom parser (#402) 5 years ago
kik0220 b5425c3e8a feat: add www.gizmodo.jp custom parser (#400) 5 years ago
kik0220 a38c727a0a feat: add deadline.com custom parser (#383)
* feat: add deadline.com custom parser

* fix: timezone

* fix: date_published selectors

* fix: title and author selector

* test: transform .embed-twitter

* fix: regenerate the fixture and fix content selector
5 years ago
kik0220 74a3c49a3c feat: add japan.cnet.com custom parser (#382)
* feat: add japan.cnet.com custom parser

* fix: remove transform
5 years ago
kik0220 7b07f88448 feat: add www.yomiuri.co.jp custom parser (#381) 5 years ago
kik0220 8ca2894751 feat: add bookwalker.jp custom parser (#374) 5 years ago
kik0220 a5f06ce27a feat: add takagi-hiromitsu.jp custom parser (#364) 5 years ago
kik0220 b9c57dbc2f feat: add www.publickey1.jp custom parser (#365)
* feat: add www.publickey1.jp custom parser

* fix: date_published selector
5 years ago
kik0220 d7dbea8a95 feat: add www.itmedia.co.jp custom parser (#366)
* feat: add www.itmedia.co.jp custom parser

* feat: add nlab.itmedia.co.jp support

* fix: title selectors
5 years ago
kik0220 9218f80da6 feat: add www.moongift.jp custom parser (#367)
* feat: add www.moongift.jp custom parser

* fix: date_published selectors

* fix: pass test

* fix: add timezone
5 years ago
kik0220 4eb73dffb0 feat: add www.infoq.com custom parser (#368)
* feat: add www.infoq.com custom parser

* fix: date_published selector
5 years ago
kik0220 ce5cd2dd0d feat: add phpspot.org custom parser (#369)
* feat: add phpspot.org custom parser

* fix: date_published selector
5 years ago
kik0220 73be0c5a10 feat: add www.jnsa.org custom parser (#346)
* feat: add www.jnsa.org custom parser
5 years ago
Adam Pash eacd1ee97f feat: custom genius parser. (#284)
also adds ability to transform value returned by an attribute selector
5 years ago
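
The genius commit above mentions transforming the value returned by an attribute selector. One way such a feature could look is sketched below, assuming a three-element selector of the form `[selector, attribute, transform]`. That shape is an assumption, as are the `resolveAttributeSelector` helper, the `page` lookup table, and the `page_data` meta name with its JSON payload. None of them are taken from the commit itself.

```javascript
// Hypothetical resolver: reads an attribute and, when a third entry is
// present, passes the raw attribute value through that function.
function resolveAttributeSelector(elements, [selector, attribute, transform]) {
  const element = elements[selector];
  if (!element || !(attribute in element)) return null;
  const raw = element[attribute];
  return typeof transform === 'function' ? transform(raw) : raw;
}

// Toy "page": selector string -> attribute map (stand-in for a real DOM query).
// The meta name and its JSON content are invented for illustration.
const page = {
  'meta[name="page_data"]': { content: '{"release_date":"2017-06-01"}' },
};

// Two-element form: the attribute value is returned unchanged.
const plain = resolveAttributeSelector(page, ['meta[name="page_data"]', 'content']);

// Three-element form: the value is post-processed before being returned.
const released = resolveAttributeSelector(page, [
  'meta[name="page_data"]',
  'content',
  value => JSON.parse(value).release_date,
]);

console.log(plain, released);
```

Making the transform an optional third entry would keep existing two-element selectors working unchanged.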
kik0220 c389c966d7 feat: add jvndb.jvn.jp custom parser (#345) 5 years ago
kik0220 8493d05cb5 feat: add scan.netsecurity.ne.jp custom parser (#347) 5 years ago
kik0220 2a76c6c212 feat: add www.elecom.co.jp custom parser (#348) 5 years ago
kik0220 a9e010b718 feat: add www.sanwa.co.jp custom parser (#349) 5 years ago
kik0220 1639eae324 feat: add www.asahi.com custom parser (#350) 5 years ago
kik0220 21f7de70c1 feat: add buzzap.jp custom parser (#351) 5 years ago
kik0220 f3a7e393a3 feat: add www.ossnews.jp custom parser (#352) 5 years ago
kik0220 c309bdb373 feat: add otrs.com custom parser (#353) 5 years ago
Ben Ubois a7e4c67d1d Extract content from GitHub repos. (#306)
* Extract content from GitHub repos.

* Add published and dek.

* Timezone fix.
5 years ago
Toufic Mouallem 7844129fda feat: Add custom parser for Reddit (#307) 5 years ago
Jordan Hotmann 83d1c2401b feat: add custom extractor for blisterreview.com (#299) 5 years ago
kik0220 d9a1e7b22b feat: add news.mynavi.jp custom parser (#287) 5 years ago
Wajeeh Zantout 1ccd14e1e9 feat: add fortinet custom parser (#188)
* feat: add fortinet custom parser

* fix: eslint error

* fix: transform noscript images

* fix: transform method

* test: transform method

* fix: fs import
5 years ago
Wajeeh Zantout 9b36003b62 feat: add fastcompany custom parser (#191)
* feat: add fastcompany custom parser

* fix: eslint error

* fix: test for date_published

* fix: fs import
5 years ago
Janet f13bb721f6 feat: prospect magazine parser (#147)
* feat: prospect magazine parser

Couldn’t find a way to parse the date but I think it’s good otherwise.

* fix: pulls date

* fix: add timezone

* fix: generalize
7 years ago
Kevin Ngao 1b28713cf5 feat: fool.com parser (#158)
* feat: add fool.com custom parser
7 years ago
Janet c18959779d feat: forward.com parser (#144)
* feat: forward.com parser

LGTM although image didn’t show up in preview

* feat: also pull image into content

* fix: generalize selectors

* fix: generalize selector
7 years ago
Janet 50e548bac2 feat: qdaily parser (#146)
* feat: qdaily parser

Firstly — I accidentally tried to generate the parser on the master
branch, and I’m not sure where it is, maybe floating in the nether
world.

On to the parser — this one was a bit tricky because things were in
Chinese! The content appears to be parsing (as seen in preview) but
it’s not passing the test. I noticed the second “ ‘ “ mark isn’t
appearing on the parser side.

Additionally, some of the lazy loading images aren’t appearing in the
preview (I cleaned the wrong lazy load images that appeared), so
someone will probably have to work on that (I don’t know how to do
transforms yet).

* fix tests

* fix: selector generalization
7 years ago