mercury-parser

mirror of https://github.com/postlight/mercury-parser synced 2024-10-31 03:20:40 +00:00

Author	SHA1	Message	Date
Sarah Doire	21da9f5151	feat: update scripts to reflect new fixture structure	2022-12-13 12:33:08 -06:00
Sarah Doire	c0364ec52b	feat: update all fixtures and custom parsers to match (#713 ) * feat: Refactor and update fixtures This patch changes how fixtures are stored. Previously, a fixture's folder identified its domain and its filename identified when it was fetched. This has been changed so that the filename indicates the domain and the modified time of the file indicates how recently it was fetched. A fixture's filename can optionally include a modifier to distinguish between two different page types on the same domain, for example. Also included here are changes to the update-fixture script, both to accomodate the new filename scheme as well as to actually update all fixtures. The functionality for running automatically and opening PRs has been removed but will likely be reintroduced. Finally, all fixtures have been updated. * Remove reference to deleted extractor * feat: first batch of test and parser updates due to new fixtures * feat: update more custom parsers and unit tests * feat: update more custom parsers and unit tests and remove unnecessary parser * feat: update more custom parsers and unit tests * feat: update more parsers and add correct bloomberg html files * fix: remove console statement * feat: all parsers updated and tests passing * fix: update date_published tests to account for test server time difference * fix: cleanup remaining fixtures in folders * feat: move fixtures for newest custom parsers * feat: remove script changes * fix: update dist files to account for reverting script changes * adding .DS_Store to .gitignore * adding .DS_Store to .gitignore -- 2 * adding .DS_Store to .gitignore -- 3 lol * cleaning up some tests * fix: ran build:generator command to update generate-custom-parser dist file * fix: update rollup configs to generate source maps and update source maps * fix: use underscore in place of unused error variable * fix: remove unused fixture Co-authored-by: Postlight Bot <adam.pash+postlight-bot@postlight.com> Co-authored-by: flbn <overasc@gmail.com>	2022-12-13 10:05:33 -06:00
Sarah Doire	7b68bcd94c	feat: remove obsolete custom extractors (#712 )	2022-11-10 12:35:27 -06:00
Andrei Zhemaituk	4981355628	fixed and improved extraction for latest layout of politico.com (#701 ) * fixed and improved extraction for latest layout of politico.com * explicit timezone for politico.com extractor * handling more layout of politico.com Co-authored-by: Andrei Zhemaituk <azhemoytuk@workfusion.com> Co-authored-by: Sarah Doire <sarah.doire@postlight.com>	2022-11-09 14:01:22 -06:00
Andrei Zhemaituk	45bb28e217	custom parser for www.investmentexecutive.com (#700 ) Co-authored-by: Andrei Zhemaituk <azhemoytuk@workfusion.com> Co-authored-by: Sarah Doire <sarah.doire@postlight.com>	2022-11-09 13:46:34 -06:00
Andrei Zhemaituk	6532316973	custom parser for cbc.ca (#699 ) Co-authored-by: Andrei Zhemaituk <azhemoytuk@workfusion.com> Co-authored-by: Sarah Doire <sarah.doire@gmail.com> Co-authored-by: Sarah Doire <sarah.doire@postlight.com>	2022-11-09 13:37:36 -06:00
Sarah Doire	6b4359d062	fix: postlight parser test (#710 )	2022-11-09 13:33:31 -06:00
Austin	4c843be377	adjust postlight insights custom selectors (#707 )	2022-11-02 10:43:50 -07:00
John Holdun	ad8d4aa268	release: 2.2.3 (#703 )	2022-10-24 15:55:24 -07:00
Austin	635fcf6356	fix: handle sec & ms timestamps properly (#702 )	2022-10-24 15:04:43 -07:00
Michael Ashley	ab401822aa	maintenance update - october 2022 (#696 ) * fix: add alternative word count method * fix: replace pages_rendered key with rendered_pages for consistency * fix: return first lead_image_url when multiple og:image present * fix: properly pull image src from lazy loaded img * fix: allow drop cap character in medium custom extractor * fix: refined medium parser	2022-10-07 08:47:41 -07:00
Sarah Doire	8ca8a5f7e5	feat: add postlight.com custom extractor (#695 )	2022-10-06 11:06:50 -07:00
John Holdun	39b9ff55c4	release: 2.2.2 (#689 )	2022-09-08 15:36:42 -07:00
John Holdun	f1932e3672	Update README.md	2022-09-08 15:29:17 -07:00
John Holdun	97472cf4f8	Change Name (#688 ) Mercury Parser is now Postlight Parser!	2022-09-08 15:28:23 -07:00
John Holdun	eb9d0bc5e8	Update more dependencies (#687 ) * Update more dependencies Bumps almost everything up, removing almost all warnings from yarn audit. Doesn't touch cheerio or jest, as they require more attention and QA still. * Adjust more dependencies, tweak build files	2022-09-08 13:46:26 -07:00
John Holdun	112846f74f	chore: Inline test fixtures (#683 ) Not to be confused with extractor fixtures, which are snapshots of a webpage. This change removes the pattern of separate JS files that provide "fixtures" for tests, which are used as provided or expected strings in tests. They were inconsistent and disorganized, and generally just served to add indirection to test files. So now all those strings are defined where they are used in their respective tests.	2022-08-15 17:00:04 -07:00
John Holdun	0d2bad544c	chore: Update builds	2022-08-11 12:05:44 -07:00
Simon Reinhardt	035aa65dbc	Added custom extractor for www.spektrum.de (#677 ) Co-authored-by: Simon Reinhardt <simon.reinhardt@hype.de> Co-authored-by: John Holdun <john@johnholdun.com>	2022-08-10 15:37:06 -07:00
John Holdun	f259d13753	feat: Add figcaption to list of non-convertible span parents (#682 ) Based on this comment: https://github.com/postlight/mercury-parser/issues/530#issuecomment-580105171	2022-08-10 15:31:08 -07:00
Nate Weaver	de314a9728	Add li to the list of non-convertible parents for spans (#531 ) Co-authored-by: John Holdun <john@johnholdun.com>	2022-08-10 15:26:03 -07:00
John Brayton	9a961aa595	feat: Add a custom extractor for www.ndtv.com. (#554 ) * feat:Add a custom extractor for ma.ttias.be. When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows: * Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight. * Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3". * Add class="entry-content-asset" to "ul" elements to avoid them being removed. * removed redundant comment. * feat: Add a custom extractor for engadget.com. * feat: Add a custom extractor for www.ndtv.com. * Works, but I need to figure how to make pagination work correctly. * fixed pagination - would only retrieve first or second page because we would send contentOnly: true on subsequent pages (page 2). removed failover: true from preview. * rolled back { fallback: false } option removal * Clarified comments. * rolling back yarn.lock changes Co-authored-by: John Holdun <john@johnholdun.com>	2022-08-10 15:16:14 -07:00
John Brayton	143631b4b7	feat: arstechnica.com extractor (#553 ) * feat:Add a custom extractor for ma.ttias.be. When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows: * Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight. * Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3". * Add class="entry-content-asset" to "ul" elements to avoid them being removed. * removed redundant comment. * feat: Add a custom extractor for engadget.com. * Works, but I need to figure how to make pagination work correctly. * fixed pagination - would only retrieve first or second page because we would send contentOnly: true on subsequent pages (page 2). removed failover: true from preview. * rolled back { fallback: false } option removal * Clarified comments. Co-authored-by: John Holdun <john@johnholdun.com>	2022-08-10 15:10:35 -07:00
John Brayton	3c5c0bdba9	feat: Add a custom extractor for www.engadget.com. (#552 ) * feat:Add a custom extractor for ma.ttias.be. When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows: * Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight. * Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3". * Add class="entry-content-asset" to "ul" elements to avoid them being removed. * removed redundant comment. * feat: Add a custom extractor for engadget.com. Co-authored-by: John Holdun <john@johnholdun.com>	2022-08-10 15:02:27 -07:00
Sven Wiegand	13dfe720bd	Custom extractor for www.gruene.de (#485 ) * Implemented custom extractor gruene.de * Cleaner output of custom extracter www.gruene.de * Updated fixture for www.gruene.de from real page * Trying to pick image from og:image -- doesn't work ... Co-authored-by: John Holdun <john@johnholdun.com>	2022-08-10 14:50:43 -07:00
dependabot[bot]	025261c120	chore(deps): Bump ws from 5.2.2 to 5.2.3 (#673 ) Bumps [ws](https://github.com/websockets/ws) from 5.2.2 to 5.2.3. - [Release notes](https://github.com/websockets/ws/releases) - [Commits](https://github.com/websockets/ws/compare/5.2.2...5.2.3) --- updated-dependencies: - dependency-name: ws dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-07-22 12:40:26 -05:00
dependabot[bot]	34bc6facc7	chore(deps): Bump moment from 2.29.2 to 2.29.4 (#672 ) Bumps [moment](https://github.com/moment/moment) from 2.29.2 to 2.29.4. - [Release notes](https://github.com/moment/moment/releases) - [Changelog](https://github.com/moment/moment/blob/develop/CHANGELOG.md) - [Commits](https://github.com/moment/moment/compare/2.29.2...2.29.4) --- updated-dependencies: - dependency-name: moment dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: John Holdun <john@johnholdun.com>	2022-07-22 12:34:24 -05:00
dependabot[bot]	7b15df58be	chore(deps): Bump terser from 4.8.0 to 4.8.1 (#671 ) Bumps [terser](https://github.com/terser/terser) from 4.8.0 to 4.8.1. - [Release notes](https://github.com/terser/terser/releases) - [Changelog](https://github.com/terser/terser/blob/master/CHANGELOG.md) - [Commits](https://github.com/terser/terser/compare/v4.8.0...v4.8.1) --- updated-dependencies: - dependency-name: terser dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: John Holdun <john@johnholdun.com>	2022-07-22 12:29:21 -05:00
John Holdun	fb74196d79	chore: Update CircleCI config (#661 ) Removing a couple extraneous CircleCI commands to see if they're still needed. I think one of the removed lines is causing #654 to fail, but let's see.	2022-07-22 12:03:31 -05:00
Jae Hanley	f7439ec3fd	modifies check-build to differentiate between test env (#665 ) Co-authored-by: John Holdun <john@johnholdun.com>	2022-07-22 11:56:00 -05:00
John Holdun	6ffa1a746e	chore: Update jQuery to 3.5.0 (#662 ) Resolves #607	2022-07-22 11:51:41 -05:00
dependabot[bot]	8d18b0ed0d	chore(deps): Bump shell-quote from 1.6.1 to 1.7.3 (#668 ) Bumps [shell-quote](https://github.com/substack/node-shell-quote) from 1.6.1 to 1.7.3. - [Release notes](https://github.com/substack/node-shell-quote/releases) - [Changelog](https://github.com/substack/node-shell-quote/blob/master/CHANGELOG.md) - [Commits](https://github.com/substack/node-shell-quote/compare/1.6.1...1.7.3) --- updated-dependencies: - dependency-name: shell-quote dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-07-22 11:48:56 -05:00
Samuel Clay	d5dabae20b	Update CHANGELOG.md (#663 ) Typo on v2.2.1 release date	2022-05-16 07:35:21 -07:00
Jim Nielsen	9cd9662bcb	support build of es modules (#570 )	2022-05-09 09:27:39 -07:00
Marco Wiedemeyer	d0c78911e6	Add a new custom extractor for www.abendblatt.de (#559 ) * Add custom extractor for www.abendblatt.de * update Co-authored-by: Marco Wiedemeyer <marco.wiedemeyer@ottogroup.com> Co-authored-by: John Holdun <john@johnholdun.com>	2022-05-09 09:19:33 -07:00
Felipe Canejo	6014016283	feat: Add a custom extractor for pastebin.com (#556 ) * feat: Add a custom extractor for pastebin.com * feat: transforms <li> to <p> in pastebin.com Co-authored-by: Felipe Canejo <felipecanejo@gmail.com> Co-authored-by: John Holdun <john@johnholdun.com>	2022-05-09 09:10:57 -07:00
John Brayton	e217648c0b	feat: ma.ttias.be extractor (#551 ) * feat:Add a custom extractor for ma.ttias.be. When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows: * Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight. * Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3". * Add class="entry-content-asset" to "ul" elements to avoid them being removed. * removed redundant comment. Co-authored-by: John Holdun <john@johnholdun.com>	2022-05-09 09:07:27 -07:00
James Shakespeare	70e99d56cf	Feat: update qz.com selectors and tests (#538 ) * feat: update qz.com selectors and tests * chore: remove out of date fixture	2022-05-09 09:02:20 -07:00
Michael Ashley	56a19bf934	fix: updating generate-parser dist (#499 )	2022-05-09 08:58:26 -07:00
Ethan Jucovy	af9cfcd120	fix: don't try to re-decode prepared response (#498 ) * fix: don't try to re-decode prepared response * Remove stray console.log	2022-05-09 08:51:15 -07:00
Peter Dave Hello	9515dc28c1	chore: update node version in .nvmrc & CONTRIBUTING.md (#599 ) Ref: #579, `a5a066c69d`	2022-05-09 08:45:12 -07:00
Joe Moon	fb44ab0244	Bugfix new yorker wired extractors (#604 ) * www.newyorker.com: add updated fixtures and fix extractors * www.wired.com: add updated fixtures and fix extractors Co-authored-by: John Holdun <john@johnholdun.com>	2022-05-09 08:40:54 -07:00
Nick Sweeting	99062da034	Add --version CLI flag (#610 ) * add --version CLI flag * move import to top of file for consistency Co-authored-by: John Holdun <john@johnholdun.com>	2022-05-09 08:37:10 -07:00
dependabot[bot]	32dff4aedb	chore(deps-dev): bump karma from 3.1.4 to 6.3.16 (#654 ) * chore(deps-dev): bump karma from 3.1.4 to 6.3.16 Bumps [karma](https://github.com/karma-runner/karma) from 3.1.4 to 6.3.16. - [Release notes](https://github.com/karma-runner/karma/releases) - [Changelog](https://github.com/karma-runner/karma/blob/master/CHANGELOG.md) - [Commits](https://github.com/karma-runner/karma/compare/v3.1.4...v6.3.16) --- updated-dependencies: - dependency-name: karma dependency-type: direct:development ... Signed-off-by: dependabot[bot] <support@github.com> * chore: Update CircleCI config Removing a couple extraneous CircleCI commands to see if they're still needed. I think one of the removed lines is causing #654 to fail, but let's see. * chore: Update karma-browserify * Revert "chore: Update CircleCI config" This reverts commit `c474be7433`. Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: John Holdun <john@johnholdun.com>	2022-05-09 08:32:42 -07:00
dependabot[bot]	736778d2e7	chore(deps): bump moment from 2.23.0 to 2.29.2 (#656 ) * chore(deps): bump moment from 2.23.0 to 2.29.2 Bumps [moment](https://github.com/moment/moment) from 2.23.0 to 2.29.2. - [Release notes](https://github.com/moment/moment/releases) - [Changelog](https://github.com/moment/moment/blob/develop/CHANGELOG.md) - [Commits](https://github.com/moment/moment/compare/2.23.0...2.29.2) --- updated-dependencies: - dependency-name: moment dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * feat: Add stricter format definitions to extractors for failing tests Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: John Holdun <john@johnholdun.com>	2022-05-06 14:46:43 -07:00
John Holdun	65e338a403	feat: Add date formats to two extractors (#660 ) These extractors were variously failing tests as I tried updating dependencies. It seems like some of the format detection logic has changed, and making these date detectors more explicit fixes them.	2022-05-06 14:45:56 -07:00
dependabot[bot]	8dd3c7078a	chore(deps): bump jquery from 3.4.1 to 3.5.0 (#557 ) Bumps [jquery](https://github.com/jquery/jquery) from 3.4.1 to 3.5.0. - [Release notes](https://github.com/jquery/jquery/releases) - [Commits](https://github.com/jquery/jquery/compare/3.4.1...3.5.0) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-05-06 13:27:00 -07:00
dependabot[bot]	88718d4caf	chore(deps): bump cached-path-relative from 1.0.2 to 1.1.0 (#647 ) Bumps [cached-path-relative](https://github.com/ashaffer/cached-path-relative) from 1.0.2 to 1.1.0. - [Release notes](https://github.com/ashaffer/cached-path-relative/releases) - [Commits](https://github.com/ashaffer/cached-path-relative/commits) --- updated-dependencies: - dependency-name: cached-path-relative dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-05-06 13:13:11 -07:00
dependabot[bot]	af5974f6ea	chore(deps): bump async from 2.6.1 to 2.6.4 (#658 ) Bumps [async](https://github.com/caolan/async) from 2.6.1 to 2.6.4. - [Release notes](https://github.com/caolan/async/releases) - [Changelog](https://github.com/caolan/async/blob/v2.6.4/CHANGELOG.md) - [Commits](https://github.com/caolan/async/compare/v2.6.1...v2.6.4) --- updated-dependencies: - dependency-name: async dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-05-06 13:04:44 -07:00
dependabot[bot]	5d5e833ff0	chore(deps): bump tmpl from 1.0.4 to 1.0.5 (#633 ) Bumps [tmpl](https://github.com/daaku/nodejs-tmpl) from 1.0.4 to 1.0.5. - [Release notes](https://github.com/daaku/nodejs-tmpl/releases) - [Commits](https://github.com/daaku/nodejs-tmpl/commits/v1.0.5) --- updated-dependencies: - dependency-name: tmpl dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-05-06 11:57:58 -07:00

568 Commits