mercury-parser

Commit Graph

Author	SHA1	Message	Date
touchRED	9cf23f197b	fix: update al.com timezone	1 year ago
touchRED	c49b5efc7a	fix: few more extractor updates	1 year ago
touchRED	3c336ef155	fix: add UTC timezone to extractors where needed	1 year ago
touchRED	8ad9309e3b	fix: update some extractors to remove unnecessary timezones	1 year ago
touchRED	6a5f892c68	fix: update tests to remove dayjs where possible, update formats	1 year ago
touchRED	98b8f69d41	fix: replace moment with dayjs	1 year ago
connor trotter	e8ba7ece29	fix: select extended types before content (#733 )	1 year ago
Sarah Doire	c0364ec52b	feat: update all fixtures and custom parsers to match (#713 ) * feat: Refactor and update fixtures This patch changes how fixtures are stored. Previously, a fixture's folder identified its domain and its filename identified when it was fetched. This has been changed so that the filename indicates the domain and the modified time of the file indicates how recently it was fetched. A fixture's filename can optionally include a modifier to distinguish between two different page types on the same domain, for example. Also included here are changes to the update-fixture script, both to accommodate the new filename scheme as well as to actually update all fixtures. The functionality for running automatically and opening PRs has been removed but will likely be reintroduced. Finally, all fixtures have been updated. * Remove reference to deleted extractor * feat: first batch of test and parser updates due to new fixtures * feat: update more custom parsers and unit tests * feat: update more custom parsers and unit tests and remove unnecessary parser * feat: update more custom parsers and unit tests * feat: update more parsers and add correct bloomberg html files * fix: remove console statement * feat: all parsers updated and tests passing * fix: update date_published tests to account for test server time difference * fix: cleanup remaining fixtures in folders * feat: move fixtures for newest custom parsers * feat: remove script changes * fix: update dist files to account for reverting script changes * adding .DS_Store to .gitignore * adding .DS_Store to .gitignore -- 2 * adding .DS_Store to .gitignore -- 3 lol * cleaning up some tests * fix: ran build:generator command to update generate-custom-parser dist file * fix: update rollup configs to generate source maps and update source maps * fix: use underscore in place of unused error variable * fix: remove unused fixture Co-authored-by: Postlight Bot <adam.pash+postlight-bot@postlight.com> Co-authored-by: flbn <overasc@gmail.com>	1 year ago
Sarah Doire	7b68bcd94c	feat: remove obsolete custom extractors (#712 )	2 years ago
Andrei Zhemaituk	4981355628	fixed and improved extraction for latest layout of politico.com (#701 ) * fixed and improved extraction for latest layout of politico.com * explicit timezone for politico.com extractor * handling more layout of politico.com Co-authored-by: Andrei Zhemaituk <azhemoytuk@workfusion.com> Co-authored-by: Sarah Doire <sarah.doire@postlight.com>	2 years ago
Andrei Zhemaituk	45bb28e217	custom parser for www.investmentexecutive.com (#700 ) Co-authored-by: Andrei Zhemaituk <azhemoytuk@workfusion.com> Co-authored-by: Sarah Doire <sarah.doire@postlight.com>	2 years ago
Andrei Zhemaituk	6532316973	custom parser for cbc.ca (#699 ) Co-authored-by: Andrei Zhemaituk <azhemoytuk@workfusion.com> Co-authored-by: Sarah Doire <sarah.doire@gmail.com> Co-authored-by: Sarah Doire <sarah.doire@postlight.com>	2 years ago
Sarah Doire	6b4359d062	fix: postlight parser test (#710 )	2 years ago
Austin	4c843be377	adjust postlight insights custom selectors (#707 )	2 years ago
Austin	635fcf6356	fix: handle sec & ms timestamps properly (#702 )	2 years ago
Michael Ashley	ab401822aa	maintenance update - october 2022 (#696 ) * fix: add alternative word count method * fix: replace pages_rendered key with rendered_pages for consistency * fix: return first lead_image_url when multiple og:image present * fix: properly pull image src from lazy loaded img * fix: allow drop cap character in medium custom extractor * fix: refined medium parser	2 years ago
Sarah Doire	8ca8a5f7e5	feat: add postlight.com custom extractor (#695 )	2 years ago
John Holdun	97472cf4f8	Change Name (#688 ) Mercury Parser is now Postlight Parser!	2 years ago
John Holdun	112846f74f	chore: Inline test fixtures (#683 ) Not to be confused with extractor fixtures, which are snapshots of a webpage. This change removes the pattern of separate JS files that provide "fixtures" for tests, which are used as provided or expected strings in tests. They were inconsistent and disorganized, and generally just served to add indirection to test files. So now all those strings are defined where they are used in their respective tests.	2 years ago
Simon Reinhardt	035aa65dbc	Added custom extractor for www.spektrum.de (#677 ) Co-authored-by: Simon Reinhardt <simon.reinhardt@hype.de> Co-authored-by: John Holdun <john@johnholdun.com>	2 years ago
John Holdun	f259d13753	feat: Add figcaption to list of non-convertible span parents (#682 ) Based on this comment: https://github.com/postlight/mercury-parser/issues/530#issuecomment-580105171	2 years ago
Nate Weaver	de314a9728	Add li to the list of non-convertible parents for spans (#531 ) Co-authored-by: John Holdun <john@johnholdun.com>	2 years ago
John Brayton	9a961aa595	feat: Add a custom extractor for www.ndtv.com. (#554 ) * feat:Add a custom extractor for ma.ttias.be. When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows: * Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight. * Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3". * Add class="entry-content-asset" to "ul" elements to avoid them being removed. * removed redundant comment. * feat: Add a custom extractor for engadget.com. * feat: Add a custom extractor for www.ndtv.com. * Works, but I need to figure how to make pagination work correctly. * fixed pagination - would only retrieve first or second page because we would send contentOnly: true on subsequent pages (page 2). removed failover: true from preview. * rolled back { fallback: false } option removal * Clarified comments. * rolling back yarn.lock changes Co-authored-by: John Holdun <john@johnholdun.com>	2 years ago
John Brayton	143631b4b7	feat: arstechnica.com extractor (#553 ) * feat:Add a custom extractor for ma.ttias.be. When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows: * Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight. * Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3". * Add class="entry-content-asset" to "ul" elements to avoid them being removed. * removed redundant comment. * feat: Add a custom extractor for engadget.com. * Works, but I need to figure how to make pagination work correctly. * fixed pagination - would only retrieve first or second page because we would send contentOnly: true on subsequent pages (page 2). removed failover: true from preview. * rolled back { fallback: false } option removal * Clarified comments. Co-authored-by: John Holdun <john@johnholdun.com>	2 years ago
John Brayton	3c5c0bdba9	feat: Add a custom extractor for www.engadget.com. (#552 ) * feat:Add a custom extractor for ma.ttias.be. When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows: * Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight. * Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3". * Add class="entry-content-asset" to "ul" elements to avoid them being removed. * removed redundant comment. * feat: Add a custom extractor for engadget.com. Co-authored-by: John Holdun <john@johnholdun.com>	2 years ago
Sven Wiegand	13dfe720bd	Custom extractor for www.gruene.de (#485 ) * Implemented custom extractor gruene.de * Cleaner output of custom extractor www.gruene.de * Updated fixture for www.gruene.de from real page * Trying to pick image from og:image -- doesn't work ... Co-authored-by: John Holdun <john@johnholdun.com>	2 years ago
Marco Wiedemeyer	d0c78911e6	Add a new custom extractor for www.abendblatt.de (#559 ) * Add custom extractor for www.abendblatt.de * update Co-authored-by: Marco Wiedemeyer <marco.wiedemeyer@ottogroup.com> Co-authored-by: John Holdun <john@johnholdun.com>	2 years ago
Felipe Canejo	6014016283	feat: Add a custom extractor for pastebin.com (#556 ) * feat: Add a custom extractor for pastebin.com * feat: transforms <li> to <p> in pastebin.com Co-authored-by: Felipe Canejo <felipecanejo@gmail.com> Co-authored-by: John Holdun <john@johnholdun.com>	2 years ago
John Brayton	e217648c0b	feat: ma.ttias.be extractor (#551 ) * feat:Add a custom extractor for ma.ttias.be. When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows: * Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight. * Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3". * Add class="entry-content-asset" to "ul" elements to avoid them being removed. * removed redundant comment. Co-authored-by: John Holdun <john@johnholdun.com>	2 years ago
James Shakespeare	70e99d56cf	Feat: update qz.com selectors and tests (#538 ) * feat: update qz.com selectors and tests * chore: remove out of date fixture	2 years ago
Ethan Jucovy	af9cfcd120	fix: don't try to re-decode prepared response (#498 ) * fix: don't try to re-decode prepared response * Remove stray console.log	2 years ago
Joe Moon	fb44ab0244	Bugfix new yorker wired extractors (#604 ) * www.newyorker.com: add updated fixtures and fix extractors * www.wired.com: add updated fixtures and fix extractors Co-authored-by: John Holdun <john@johnholdun.com>	2 years ago
John Holdun	65e338a403	feat: Add date formats to two extractors (#660 ) These extractors were variously failing tests as I tried updating dependencies. It seems like some of the format detection logic has changed, and making these date detectors more explicit fixes them.	2 years ago
Nitin Khanna	8c9982247b	feat: Ladbible.com extractor (#624 ) * Ladbible.com extractors and test * CircleCI says timezone needs to be Europe/London aka BST Co-authored-by: Postlight Bot <adam.pash+postlight-bot@postlight.com> Co-authored-by: Jad Termsani <32297675+JadTermsani@users.noreply.github.com>	3 years ago
Nitin Khanna	30d6f472ee	feat: Times of India extractor (#503 ) * Adding custom parser for Times of India * moved transforms to clean The transforms were just working as cleans. Moved things around as per recommendations. Co-authored-by: Postlight Bot <adam.pash+postlight-bot@postlight.com>	3 years ago
Wajeeh Zantout	b0e708aac6	feat: update nytimes extractor (#506 ) * feat: update custom extractor for nytimes.com	5 years ago
Michael Ashley	e12c916499	feat: ability to add custom extractors via api (#484 ) * feat: ability to add custom extractors via api * docs: updating readme * fix: example.com was being used in another test * fix: timezone was messing up date_published test * fix: using a unique site for testing * fix: updated custom extractor api * docs: updating readme * fix: removing unused fixture * fix: updating test description * feat: ability to add custom extractors via cli	5 years ago
Sven Wiegand	f95947fe88	Implemented custom extractor epaper.zeit.de (#488 )	5 years ago
Michael Ashley	2422e4717d	fix: incorrect parsing on medium.com (#477 ) * fix: medium extractor now pulls content * fix: remove youtube caption if no preview available * fix: remove youtube node if no image * fix: removing dek from medium.com extractor	5 years ago
Jakob Fix	a918a9d6fa	doc: correct link that points to wrong line (#469 )	5 years ago
Michael Ashley	0686ee7956	fix: incorrect parsing on theatlantic.com (#475 ) * fix: incorrect parsing on theatlantic.com * chore: updating theatlantic.com tests & fixtures * chore: removing script data from minified fixture	5 years ago
david0leong	911b0f87c8	Add custom extractor for biorxiv.org (#467 ) * Add custom extractor for biorxiv.org * Fix content selector * Improve content selector	5 years ago
Jakob Fix	76d59f2d58	doc: correct internal page links (#470 ) Specifically, to the cleaning content and using transform sections.	5 years ago
Kirill Danshin	592f175270	tests: remove a duplicate test (#448 )	5 years ago
Toufic Mouallem	939d181951	fix: support query strings in lazy-loaded srcsets (#387 )	5 years ago
Ben Ubois	0942c37876	feat: custom parser for phoronix.com. (#431 )	5 years ago
Michael P. Geraci	571a913745	feat: pitchfork extractor (#439 ) * generate the custom extractor and get the first test to pass * add the basic extractors (title, author, date, etc) * select the score as well as the review text, and break the content test * prepend the score to the content * get the date from the datetime attribute * mangle this test a little, but just a little (it does work properly) * move from prepending the score to the review text to adding it as a custom field in the extractor	5 years ago
david0leong	694ea820aa	Custom Extractor for clinicaltrials.gov (#305 ) * Add prototype of custom extractor for clinicaltrials.gov * Add .DS_Store to gitignore * Make tests for title, author and date_published selectors pass * Make content selector test pass * Fix date_published test * Rebuild * Remove .DS-Store from gitignore * Improve extractor and text/fixture of clinicaltrials.gov	5 years ago
Wajeeh Zantout	7c8de71c52	fix: new yorker extractor (#414 ) * fix: new yorker extractor * fix: date_published selector * fix: remove footer from content * feat: add additional selector for title * feat: support article with multiple authors	5 years ago
Wajeeh Zantout	e66ad8b81c	feat: add le monde extractor (#415 )	5 years ago

380 Commits (fix-remove-moment-js)