mercury-parser

mirror of https://github.com/postlight/mercury-parser synced 2024-11-11 01:10:35 +00:00

Author	SHA1	Message	Date
Simon Reinhardt	035aa65dbc	Added custom extractor for www.spektrum.de (#677 ) Co-authored-by: Simon Reinhardt <simon.reinhardt@hype.de> Co-authored-by: John Holdun <john@johnholdun.com>	2022-08-10 15:37:06 -07:00
John Holdun	f259d13753	feat: Add figcaption to list of non-convertible span parents (#682 ) Based on this comment: https://github.com/postlight/mercury-parser/issues/530#issuecomment-580105171	2022-08-10 15:31:08 -07:00
Nate Weaver	de314a9728	Add li to the list of non-convertible parents for spans (#531 ) Co-authored-by: John Holdun <john@johnholdun.com>	2022-08-10 15:26:03 -07:00
John Brayton	9a961aa595	feat: Add a custom extractor for www.ndtv.com. (#554 ) * feat:Add a custom extractor for ma.ttias.be. When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows: * Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight. * Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3". * Add class="entry-content-asset" to "ul" elements to avoid them being removed. * removed redundant comment. * feat: Add a custom extractor for engadget.com. * feat: Add a custom extractor for www.ndtv.com. * Works, but I need to figure how to make pagination work correctly. * fixed pagination - would only retrieve first or second page because we would send contentOnly: true on subsequent pages (page 2). removed failover: true from preview. * rolled back { fallback: false } option removal * Clarified comments. * rolling back yarn.lock changes Co-authored-by: John Holdun <john@johnholdun.com>	2022-08-10 15:16:14 -07:00
John Brayton	143631b4b7	feat: arstechnica.com extractor (#553 ) * feat:Add a custom extractor for ma.ttias.be. When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows: * Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight. * Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3". * Add class="entry-content-asset" to "ul" elements to avoid them being removed. * removed redundant comment. * feat: Add a custom extractor for engadget.com. * Works, but I need to figure how to make pagination work correctly. * fixed pagination - would only retrieve first or second page because we would send contentOnly: true on subsequent pages (page 2). removed failover: true from preview. * rolled back { fallback: false } option removal * Clarified comments. Co-authored-by: John Holdun <john@johnholdun.com>	2022-08-10 15:10:35 -07:00
John Brayton	3c5c0bdba9	feat: Add a custom extractor for www.engadget.com. (#552 ) * feat:Add a custom extractor for ma.ttias.be. When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows: * Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight. * Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3". * Add class="entry-content-asset" to "ul" elements to avoid them being removed. * removed redundant comment. * feat: Add a custom extractor for engadget.com. Co-authored-by: John Holdun <john@johnholdun.com>	2022-08-10 15:02:27 -07:00
Sven Wiegand	13dfe720bd	Custom extractor for www.gruene.de (#485 ) * Implemented custom extractor gruene.de * Cleaner output of custom extracter www.gruene.de * Updated fixture for www.gruene.de from real page * Trying to pick image from og:image -- doesn't work ... Co-authored-by: John Holdun <john@johnholdun.com>	2022-08-10 14:50:43 -07:00
Marco Wiedemeyer	d0c78911e6	Add a new custom extractor for www.abendblatt.de (#559 ) * Add custom extractor for www.abendblatt.de * update Co-authored-by: Marco Wiedemeyer <marco.wiedemeyer@ottogroup.com> Co-authored-by: John Holdun <john@johnholdun.com>	2022-05-09 09:19:33 -07:00
Felipe Canejo	6014016283	feat: Add a custom extractor for pastebin.com (#556 ) * feat: Add a custom extractor for pastebin.com * feat: transforms <li> to <p> in pastebin.com Co-authored-by: Felipe Canejo <felipecanejo@gmail.com> Co-authored-by: John Holdun <john@johnholdun.com>	2022-05-09 09:10:57 -07:00
John Brayton	e217648c0b	feat: ma.ttias.be extractor (#551 ) * feat:Add a custom extractor for ma.ttias.be. When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows: * Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight. * Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3". * Add class="entry-content-asset" to "ul" elements to avoid them being removed. * removed redundant comment. Co-authored-by: John Holdun <john@johnholdun.com>	2022-05-09 09:07:27 -07:00
James Shakespeare	70e99d56cf	Feat: update qz.com selectors and tests (#538 ) * feat: update qz.com selectors and tests * chore: remove out of date fixture	2022-05-09 09:02:20 -07:00
Ethan Jucovy	af9cfcd120	fix: don't try to re-decode prepared response (#498 ) * fix: don't try to re-decode prepared response * Remove stray console.log	2022-05-09 08:51:15 -07:00
Joe Moon	fb44ab0244	Bugfix new yorker wired extractors (#604 ) * www.newyorker.com: add updated fixtures and fix extractors * www.wired.com: add updated fixtures and fix extractors Co-authored-by: John Holdun <john@johnholdun.com>	2022-05-09 08:40:54 -07:00
John Holdun	65e338a403	feat: Add date formats to two extractors (#660 ) These extractors were variously failing tests as I tried updating dependencies. It seems like some of the format detection logic has changed, and making these date detectors more explicit fixes them.	2022-05-06 14:45:56 -07:00
Nitin Khanna	8c9982247b	feat: Ladbible.com extractor (#624 ) * Ladbible.com extractors and test * CircleCI says timezone needs to be Europe/London aka BST Co-authored-by: Postlight Bot <adam.pash+postlight-bot@postlight.com> Co-authored-by: Jad Termsani <32297675+JadTermsani@users.noreply.github.com>	2021-09-08 12:03:23 -05:00
Nitin Khanna	30d6f472ee	feat: Times of India extractor (#503 ) * Adding custom parser for Times of India * moved transforms to clean The transforms were just working as cleans. Moved things around as per recommendations. Co-authored-by: Postlight Bot <adam.pash+postlight-bot@postlight.com>	2021-09-08 12:00:28 -05:00
Wajeeh Zantout	b0e708aac6	feat: update nytimes extractor (#506 ) * feat: update custom extractor for nytimes.com	2019-10-17 09:54:42 +03:00
Michael Ashley	e12c916499	feat: ability to add custom extractors via api (#484 ) * feat: ability to add custom extractors via api * docs: updating readme * fix: example.com was being used in another test * fix: timezone was messing up date_published test * fix: using a unique site for testing * fix: updated custom extractor api * docs: updating readme * fix: removing unused fixture * fix: updating test description * feat: ability to add custom extractors via cli	2019-09-04 07:32:28 -07:00
Sven Wiegand	f95947fe88	Implemented custom extractor epaper.zeit.de (#488 )	2019-08-28 07:15:14 -07:00
Michael Ashley	2422e4717d	fix: incorrect parsing on medium.com (#477 ) * fix: medium extractor now pulls content * fix: remove youtube caption if no preview available * fix: remove youtube node if no image * fix: removing dek from medium.com extractor	2019-08-28 07:04:27 -07:00
Jakob Fix	a918a9d6fa	doc: correct link that points to wrong line (#469 )	2019-08-21 10:10:10 -07:00
Michael Ashley	0686ee7956	fix: incorrect parsing on theatlantic.com (#475 ) * fix: incorrect parsing on theatlantic.com * chore: updating theatlantic.com tests & fixtures * chore: removing script data from minified fixture	2019-08-20 09:58:24 -07:00
david0leong	911b0f87c8	Add custom extractor for biorxiv.org (#467 ) * Add custom extractor for biorxiv.org * Fix content selector * Improve content selector	2019-08-19 13:46:03 -07:00
Jakob Fix	76d59f2d58	doc: correct internal page links (#470 ) Specifically, to the cleaning content and using transform sections.	2019-08-16 14:41:46 -07:00
Kirill Danshin	592f175270	tests: remove a duplicate test (#448 )	2019-07-03 09:30:10 -07:00
Toufic Mouallem	939d181951	fix: support query strings in lazy-loaded srcsets (#387 )	2019-06-26 10:13:58 -07:00
Ben Ubois	0942c37876	feat: custom parser for phoronix.com. (#431 )	2019-06-26 09:55:13 -07:00
Michael P. Geraci	571a913745	feat: pitchfork extractor (#439 ) * generate the custom extractor and get the first test to pass * add the basic extractors (title, author, date, etc) * select the score as well as the review text, and break the content test * prepend the score to the content * get the date from the datetime attribute * mangle this test a little, but just a little (it does work properly) * move from prepending the score to the review text to adding it as a custom field in the extractor	2019-06-26 09:02:17 -07:00
david0leong	694ea820aa	Custom Extractor for clinicaltrials.gov (#305 ) * Add prototype of custom extractor for clinicaltrials.gov * Add .DS_Store to gitignore * Make tests for title, author and date_published selectors pass * Make content selector test pass * Fix date_published test * Rebuild * Remove .DS-Store from gitignore * Improve extractor and text/fixture of clinicaltrials.gov	2019-05-27 09:25:51 +03:00
Wajeeh Zantout	7c8de71c52	fix: new yorker extractor (#414 ) * fix: new yorker extractor * fix: date_published selector * fix: remove footer from content * feat: add additional selector for title * feat: support article with multiple authors	2019-05-15 11:00:50 +03:00
Wajeeh Zantout	e66ad8b81c	feat: add le monde extractor (#415 )	2019-05-14 14:53:49 +03:00
kik0220	f81dc63617	feat: add rbbtoday.com custom parser (#411 ) * feat: add rbbtoday.com custom parser * fix: content test * fix: dek and content	2019-05-08 14:04:03 +03:00
kik0220	5e1113b3a9	feat: add japan.zdnet.com custom parser (#410 ) * feat: add japan.zdnet.com custom parser * fix: author and date_published selector	2019-05-08 13:51:03 +03:00
kik0220	77e3bc00e2	feat: add wired.jp custom parser (#409 ) * feat: add wired.jp custom parser * fix: author test * fix: date_published selector * test: fix dek and contest * test: fix content (without clean dek)	2019-05-08 13:32:04 +03:00
kik0220	0b36c96de0	feat: add techlog.iij.ad.jp custom parser (#405 ) * feat: add techlog.iij.ad.jp custom parser * fix: date_published and content selector	2019-05-08 13:20:47 +03:00
kik0220	406bf1b1a9	feat: add weekly.ascii.jp custom parser (#401 ) * feat: add weekly.ascii.jp custom parser * fix: title and date_published selector	2019-05-08 13:10:42 +03:00
kik0220	216bfade00	feat: add www.ipa.go.jp custom parser (#408 )	2019-05-03 13:40:42 +03:00
kik0220	3ae8f3bde3	feat: add www.oreilly.co.jp custom parser (#407 )	2019-05-03 13:30:48 +03:00
kik0220	7396e81b72	feat: add sect.iij.ad.jp custom parser (#404 )	2019-05-03 13:19:06 +03:00
kik0220	3f1d9030ee	feat: add www.lifehacker.jp custom parser (#403 )	2019-05-03 13:14:53 +03:00
kik0220	b077000c4a	feat: add getnews.jp custom parser (#402 )	2019-05-03 13:10:55 +03:00
kik0220	b5425c3e8a	feat: add www.gizmodo.jp custom parser (#400 )	2019-05-03 13:06:51 +03:00
kik0220	a38c727a0a	feat: add deadline.com custom parser (#383 ) * feat: add deadline.com custom parser * fix: timezone * fix: date_published selectors * fix: title and author selector * test: transform .embed-twitter * fix: regenerate the fixture and fix content selector	2019-04-24 15:29:02 +03:00
kik0220	74a3c49a3c	feat: add japan.cnet.com custom parser (#382 ) * feat: add japan.cnet.com custom parser * fix: remove transform	2019-04-24 14:39:54 +03:00
kik0220	7b07f88448	feat: add www.yomiuri.co.jp custom parser (#381 )	2019-04-24 11:00:56 +03:00
Toufic Mouallem	3f46859d14	fix: skip absolutizing invalid srcsets (#386 ) * fix: skip absolutizing empty srcsets * test: empty srcsets are handled properly	2019-04-24 10:18:57 +03:00
kik0220	779c1154fb	fix: add date_published selector in www.sanwa.co.jp extractor (#378 )	2019-04-16 13:46:24 +03:00
kik0220	ea5b65f019	fix: add date_published selector in www.elecom.co.jp extractor (#377 )	2019-04-16 13:41:40 +03:00
kik0220	7c0949e587	fix: add date_published selector in www.ossnews.jp extractor (#376 )	2019-04-16 13:36:42 +03:00
kik0220	3e91ac55db	fix: add date_published selector in jvndb.jvn.jp extractor (#375 )	2019-04-16 13:32:41 +03:00

361 Commits