Commit Graph

361 Commits

Author SHA1 Message Date
Simon Reinhardt
035aa65dbc
Added custom extractor for www.spektrum.de (#677)
Co-authored-by: Simon Reinhardt <simon.reinhardt@hype.de>
Co-authored-by: John Holdun <john@johnholdun.com>
2022-08-10 15:37:06 -07:00
John Holdun
f259d13753
feat: Add figcaption to list of non-convertible span parents (#682)
Based on this comment: https://github.com/postlight/mercury-parser/issues/530#issuecomment-580105171
2022-08-10 15:31:08 -07:00
Nate Weaver
de314a9728
Add li to the list of non-convertible parents for spans (#531)
Co-authored-by: John Holdun <john@johnholdun.com>
2022-08-10 15:26:03 -07:00
John Brayton
9a961aa595
feat: Add a custom extractor for www.ndtv.com. (#554)
* feat: Add a custom extractor for ma.ttias.be.

When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows:

* Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements to avoid them being removed.

* removed redundant comment.

* feat: Add a custom extractor for engadget.com.

* feat: Add a custom extractor for www.ndtv.com.

* Works, but I need to figure how to make pagination work correctly.

* fixed pagination - would only retrieve first or second page because we would send contentOnly: true on subsequent pages (page 2).
removed failover: true from preview.

* rolled back { fallback: false } option removal

* Clarified comments.

* rolling back yarn.lock changes

Co-authored-by: John Holdun <john@johnholdun.com>
2022-08-10 15:16:14 -07:00
John Brayton
143631b4b7
feat: arstechnica.com extractor (#553)
* feat: Add a custom extractor for ma.ttias.be.

When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows:

* Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements to avoid them being removed.

* removed redundant comment.

* feat: Add a custom extractor for engadget.com.

* Works, but I need to figure how to make pagination work correctly.

* fixed pagination - would only retrieve first or second page because we would send contentOnly: true on subsequent pages (page 2).
removed failover: true from preview.

* rolled back { fallback: false } option removal

* Clarified comments.

Co-authored-by: John Holdun <john@johnholdun.com>
2022-08-10 15:10:35 -07:00
John Brayton
3c5c0bdba9
feat: Add a custom extractor for www.engadget.com. (#552)
* feat: Add a custom extractor for ma.ttias.be.

When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows:

* Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements to avoid them being removed.

* removed redundant comment.

* feat: Add a custom extractor for engadget.com.

Co-authored-by: John Holdun <john@johnholdun.com>
2022-08-10 15:02:27 -07:00
Sven Wiegand
13dfe720bd
Custom extractor for www.gruene.de (#485)
* Implemented custom extractor gruene.de

* Cleaner output of custom extractor www.gruene.de

* Updated fixture for www.gruene.de from real page

* Trying to pick image from og:image -- doesn't work ...

Co-authored-by: John Holdun <john@johnholdun.com>
2022-08-10 14:50:43 -07:00
Marco Wiedemeyer
d0c78911e6
Add a new custom extractor for www.abendblatt.de (#559)
* Add custom extractor for www.abendblatt.de

* update

Co-authored-by: Marco Wiedemeyer <marco.wiedemeyer@ottogroup.com>
Co-authored-by: John Holdun <john@johnholdun.com>
2022-05-09 09:19:33 -07:00
Felipe Canejo
6014016283
feat: Add a custom extractor for pastebin.com (#556)
* feat: Add a custom extractor for pastebin.com

* feat: transforms <li> to <p> in pastebin.com

Co-authored-by: Felipe Canejo <felipecanejo@gmail.com>
Co-authored-by: John Holdun <john@johnholdun.com>
2022-05-09 09:10:57 -07:00
John Brayton
e217648c0b
feat: ma.ttias.be extractor (#551)
* feat: Add a custom extractor for ma.ttias.be.

When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows:

* Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements to avoid them being removed.

* removed redundant comment.

Co-authored-by: John Holdun <john@johnholdun.com>
2022-05-09 09:07:27 -07:00
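
The three ma.ttias.be adjustments described in e217648c0b (strip "id" from "h1" and "h2", demote "h2" to "h3", tag "ul" with entry-content-asset) can be sketched as content transforms. This is a minimal illustration, not the code from the commit: it assumes the custom-extractor convention in which a transform receives a cheerio-style node and may return a new tag name. The `.content` selector, `maTtiasBeSketch`, `FakeNode`, and `applyTransform` are all hypothetical names introduced here so the example runs without any dependencies.

```javascript
// Hypothetical sketch of the three transforms from the commit message.
// Assumption: a transform gets a cheerio-style node and may return a
// string, in which case the node is converted to that tag.
const maTtiasBeSketch = {
  domain: 'ma.ttias.be',
  content: {
    selectors: ['.content'], // assumed selector, not taken from the commit
    transforms: {
      // Drop the id so the heading is not given a low weight.
      h1: $node => {
        $node.removeAttr('id');
      },
      // Drop the id and demote h2 to h3, because h1 is demoted to h2.
      h2: $node => {
        $node.removeAttr('id');
        return 'h3';
      },
      // Mark lists so they are not removed during cleaning.
      ul: $node => {
        $node.attr('class', 'entry-content-asset');
      },
    },
  },
};

// Minimal stand-in for a cheerio node, only so the sketch is runnable.
class FakeNode {
  constructor(tag, attrs = {}) {
    this.tag = tag;
    this.attrs = { ...attrs };
  }

  attr(name, value) {
    if (value === undefined) return this.attrs[name];
    this.attrs[name] = value;
    return this;
  }

  removeAttr(name) {
    delete this.attrs[name];
    return this;
  }
}

// Hypothetical helper: run the transform registered for the node's tag
// and rename the node when the transform returns a string.
function applyTransform(extractor, node) {
  const transform = extractor.content.transforms[node.tag];
  if (!transform) return node;
  const newTag = transform(node);
  if (typeof newTag === 'string') node.tag = newTag;
  return node;
}
```

For example, `applyTransform(maTtiasBeSketch, new FakeNode('h2', { id: 'news' }))` yields a node whose tag is `h3` and which no longer carries an `id`, while a node with no registered transform is returned unchanged.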
James Shakespeare
70e99d56cf
feat: update qz.com selectors and tests (#538)
* feat: update qz.com selectors and tests

* chore: remove out of date fixture
2022-05-09 09:02:20 -07:00
Ethan Jucovy
af9cfcd120
fix: don't try to re-decode prepared response (#498)
* fix: don't try to re-decode prepared response

* Remove stray console.log
2022-05-09 08:51:15 -07:00
Joe Moon
fb44ab0244
Bugfix new yorker wired extractors (#604)
* www.newyorker.com: add updated fixtures and fix extractors

* www.wired.com: add updated fixtures and fix extractors

Co-authored-by: John Holdun <john@johnholdun.com>
2022-05-09 08:40:54 -07:00
John Holdun
65e338a403
feat: Add date formats to two extractors (#660)
These extractors were variously failing tests as I tried updating dependencies. It seems like some of the format detection logic has changed, and making these date detectors more explicit fixes them.
2022-05-06 14:45:56 -07:00
Nitin Khanna
8c9982247b
feat: Ladbible.com extractor (#624)
* Ladbible.com extractors and test

* CircleCI says timezone needs to be Europe/London aka BST

Co-authored-by: Postlight Bot <adam.pash+postlight-bot@postlight.com>
Co-authored-by: Jad Termsani <32297675+JadTermsani@users.noreply.github.com>
2021-09-08 12:03:23 -05:00
Nitin Khanna
30d6f472ee
feat: Times of India extractor (#503)
* Adding custom parser for Times of India

* moved transforms to clean

The transforms were just working as cleans. Moved things around as per recommendations.

Co-authored-by: Postlight Bot <adam.pash+postlight-bot@postlight.com>
2021-09-08 12:00:28 -05:00
Wajeeh Zantout
b0e708aac6
feat: update nytimes extractor (#506)
* feat: update custom extractor for nytimes.com
2019-10-17 09:54:42 +03:00
Michael Ashley
e12c916499
feat: ability to add custom extractors via api (#484)
* feat: ability to add custom extractors via api

* docs: updating readme

* fix: example.com was being used in another test

* fix: timezone was messing up date_published test

* fix: using a unique site for testing

* fix: updated custom extractor api

* docs: updating readme

* fix: removing unused fixture

* fix: updating test description

* feat: ability to add custom extractors via cli
2019-09-04 07:32:28 -07:00
Sven Wiegand
f95947fe88 Implemented custom extractor epaper.zeit.de (#488) 2019-08-28 07:15:14 -07:00
Michael Ashley
2422e4717d
fix: incorrect parsing on medium.com (#477)
* fix: medium extractor now pulls content

* fix: remove youtube caption if no preview available

* fix: remove youtube node if no image

* fix: removing dek from medium.com extractor
2019-08-28 07:04:27 -07:00
Jakob Fix
a918a9d6fa doc: correct link that points to wrong line (#469) 2019-08-21 10:10:10 -07:00
Michael Ashley
0686ee7956
fix: incorrect parsing on theatlantic.com (#475)
* fix: incorrect parsing on theatlantic.com

* chore: updating theatlantic.com tests & fixtures

* chore: removing script data from minified fixture
2019-08-20 09:58:24 -07:00
david0leong
911b0f87c8 Add custom extractor for biorxiv.org (#467)
* Add custom extractor for biorxiv.org

* Fix content selector

* Improve content selector
2019-08-19 13:46:03 -07:00
Jakob Fix
76d59f2d58 doc: correct internal page links (#470)
Specifically, to the cleaning content and using transform sections.
2019-08-16 14:41:46 -07:00
Kirill Danshin
592f175270 tests: remove a duplicate test (#448) 2019-07-03 09:30:10 -07:00
Toufic Mouallem
939d181951 fix: support query strings in lazy-loaded srcsets (#387) 2019-06-26 10:13:58 -07:00
Ben Ubois
0942c37876 feat: custom parser for phoronix.com. (#431) 2019-06-26 09:55:13 -07:00
Michael P. Geraci
571a913745 feat: pitchfork extractor (#439)
* generate the custom extractor and get the first test to pass

* add the basic extractors (title, author, date, etc)

* select the score as well as the review text, and break the content test

* prepend the score to the content

* get the date from the datetime attribute

* mangle this test a little, but just a little (it does work properly)

* move from prepending the score to the review text to adding it as a custom field in the extractor
2019-06-26 09:02:17 -07:00
david0leong
694ea820aa Custom Extractor for clinicaltrials.gov (#305)
* Add prototype of custom extractor for clinicaltrials.gov

* Add .DS_Store to gitignore

* Make tests for title, author and date_published selectors pass

* Make content selector test pass

* Fix date_published test

* Rebuild

* Remove .DS-Store from gitignore

* Improve extractor and text/fixture of clinicaltrials.gov
2019-05-27 09:25:51 +03:00
Wajeeh Zantout
7c8de71c52 fix: new yorker extractor (#414)
* fix: new yorker extractor

* fix: date_published selector

* fix: remove footer from content

* feat: add additional selector for title

* feat: support article with multiple authors
2019-05-15 11:00:50 +03:00
Wajeeh Zantout
e66ad8b81c feat: add le monde extractor (#415) 2019-05-14 14:53:49 +03:00
kik0220
f81dc63617 feat: add rbbtoday.com custom parser (#411)
* feat: add rbbtoday.com custom parser

* fix: content test

* fix: dek and content
2019-05-08 14:04:03 +03:00
kik0220
5e1113b3a9 feat: add japan.zdnet.com custom parser (#410)
* feat: add japan.zdnet.com custom parser

* fix: author and date_published selector
2019-05-08 13:51:03 +03:00
kik0220
77e3bc00e2 feat: add wired.jp custom parser (#409)
* feat: add wired.jp custom parser

* fix: author test

* fix: date_published selector

* test: fix dek and content

* test: fix content (without clean dek)
2019-05-08 13:32:04 +03:00
kik0220
0b36c96de0 feat: add techlog.iij.ad.jp custom parser (#405)
* feat: add techlog.iij.ad.jp custom parser

* fix: date_published and content selector
2019-05-08 13:20:47 +03:00
kik0220
406bf1b1a9 feat: add weekly.ascii.jp custom parser (#401)
* feat: add weekly.ascii.jp custom parser

* fix: title and date_published selector
2019-05-08 13:10:42 +03:00
kik0220
216bfade00 feat: add www.ipa.go.jp custom parser (#408) 2019-05-03 13:40:42 +03:00
kik0220
3ae8f3bde3 feat: add www.oreilly.co.jp custom parser (#407) 2019-05-03 13:30:48 +03:00
kik0220
7396e81b72 feat: add sect.iij.ad.jp custom parser (#404) 2019-05-03 13:19:06 +03:00
kik0220
3f1d9030ee feat: add www.lifehacker.jp custom parser (#403) 2019-05-03 13:14:53 +03:00
kik0220
b077000c4a feat: add getnews.jp custom parser (#402) 2019-05-03 13:10:55 +03:00
kik0220
b5425c3e8a feat: add www.gizmodo.jp custom parser (#400) 2019-05-03 13:06:51 +03:00
kik0220
a38c727a0a feat: add deadline.com custom parser (#383)
* feat: add deadline.com custom parser

* fix: timezone

* fix: date_published selectors

* fix: title and author selector

* test: transform .embed-twitter

* fix: regenerate the fixture and fix content selector
2019-04-24 15:29:02 +03:00
kik0220
74a3c49a3c feat: add japan.cnet.com custom parser (#382)
* feat: add japan.cnet.com custom parser

* fix: remove transform
2019-04-24 14:39:54 +03:00
kik0220
7b07f88448 feat: add www.yomiuri.co.jp custom parser (#381) 2019-04-24 11:00:56 +03:00
Toufic Mouallem
3f46859d14
fix: skip absolutizing invalid srcsets (#386)
* fix: skip absolutizing empty srcsets

* test: empty srcsets are handled properly
2019-04-24 10:18:57 +03:00
kik0220
779c1154fb fix: add date_published selector in www.sanwa.co.jp extractor (#378) 2019-04-16 13:46:24 +03:00
kik0220
ea5b65f019 fix: add date_published selector in www.elecom.co.jp extractor (#377) 2019-04-16 13:41:40 +03:00
kik0220
7c0949e587 fix: add date_published selector in www.ossnews.jp extractor (#376) 2019-04-16 13:36:42 +03:00
kik0220
3e91ac55db fix: add date_published selector in jvndb.jvn.jp extractor (#375) 2019-04-16 13:32:41 +03:00