Commit Graph

560 Commits (ad8d4aa268fd5ebf29ac7964d62c99d3bb9c8f4a)
 

Author SHA1 Message Date
John Holdun ad8d4aa268
release: 2.2.3 (#703) 2 years ago
Austin 635fcf6356
fix: handle sec & ms timestamps properly (#702) 2 years ago
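The fix above concerns numeric timestamps that may be given in seconds or in milliseconds. A minimal sketch of one way to tell the two apart, assuming a digit-count heuristic. The helper name `normalizeTimestamp` and the heuristic itself are illustrative assumptions, not code taken from the parser:

```typescript
// Hypothetical helper, not from the Postlight Parser source.
// Assumption: a 13-digit string is Unix milliseconds and a 10-digit
// string is Unix seconds; anything else is not treated as a timestamp.
function normalizeTimestamp(raw: string): number | null {
  const trimmed = raw.trim();
  if (/^\d{13}$/.test(trimmed)) return Number(trimmed); // already ms
  if (/^\d{10}$/.test(trimmed)) return Number(trimmed) * 1000; // sec -> ms
  return null; // leave other inputs to regular date parsing
}

const fromSeconds = normalizeTimestamp('1666000000');
const fromMillis = normalizeTimestamp('1666000000000');
const notATimestamp = normalizeTimestamp('October 17, 2022');
```

Both numeric inputs normalize to the same millisecond value, which can then be passed to `new Date(...)`.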
Michael Ashley ab401822aa
maintenance update - october 2022 (#696)
* fix: add alternative word count method

* fix: replace pages_rendered key with rendered_pages for consistency

* fix: return first lead_image_url when multiple og:image present

* fix: properly pull image src from lazy loaded img

* fix: allow drop cap character in medium custom extractor

* fix: refined medium parser
2 years ago
Sarah Doire 8ca8a5f7e5
feat: add postlight.com custom extractor (#695) 2 years ago
John Holdun 39b9ff55c4
release: 2.2.2 (#689) 2 years ago
John Holdun f1932e3672
Update README.md 2 years ago
John Holdun 97472cf4f8
Change Name (#688)
Mercury Parser is now Postlight Parser!
2 years ago
John Holdun eb9d0bc5e8
Update more dependencies (#687)
* Update more dependencies

Bumps almost everything up, removing almost all warnings from yarn audit. Doesn't touch cheerio or jest, as they require more attention and QA still.

* Adjust more dependencies, tweak build files
2 years ago
John Holdun 112846f74f
chore: Inline test fixtures (#683)
Not to be confused with extractor fixtures, which are snapshots of a webpage.

This change removes the pattern of separate JS files that provide "fixtures" for tests, which are used as provided or expected strings in tests. They were inconsistent and disorganized, and generally just served to add indirection to test files. So now all those strings are defined where they are used in their respective tests.
2 years ago
John Holdun 0d2bad544c
chore: Update builds 2 years ago
Simon Reinhardt 035aa65dbc
Added custom extractor for www.spektrum.de (#677)
Co-authored-by: Simon Reinhardt <simon.reinhardt@hype.de>
Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
John Holdun f259d13753
feat: Add figcaption to list of non-convertible span parents (#682)
Based on this comment: https://github.com/postlight/mercury-parser/issues/530#issuecomment-580105171
2 years ago
Nate Weaver de314a9728
Add li to the list of non-convertible parents for spans (#531)
Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
John Brayton 9a961aa595
feat: Add a custom extractor for www.ndtv.com. (#554)
* feat:Add a custom extractor for ma.ttias.be.

When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows:

* Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements to avoid them being removed.

* removed redundant comment.

* feat: Add a custom extractor for engadget.com.

* feat: Add a custom extractor for www.ndtv.com.

* Works, but I need to figure how to make pagination work correctly.

* fixed pagination - would only retrieve first or second page because we would send contentOnly: true on subsequent pages (page 2).
removed failover: true from preview.

* rolled back { fallback: false } option removal

* Clarified comments.

* rolling back yarn.lock changes

Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
John Brayton 143631b4b7
feat: arstechnica.com extractor (#553)
* feat:Add a custom extractor for ma.ttias.be.

When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows:

* Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements to avoid them being removed.

* removed redundant comment.

* feat: Add a custom extractor for engadget.com.

* Works, but I need to figure how to make pagination work correctly.

* fixed pagination - would only retrieve first or second page because we would send contentOnly: true on subsequent pages (page 2).
removed failover: true from preview.

* rolled back { fallback: false } option removal

* Clarified comments.

Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
John Brayton 3c5c0bdba9
feat: Add a custom extractor for www.engadget.com. (#552)
* feat:Add a custom extractor for ma.ttias.be.

When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows:

* Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements to avoid them being removed.

* removed redundant comment.

* feat: Add a custom extractor for engadget.com.

Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
Sven Wiegand 13dfe720bd
Custom extractor for www.gruene.de (#485)
* Implemented custom extractor gruene.de

* Cleaner output of custom extractor www.gruene.de

* Updated fixture for www.gruene.de from real page

* Trying to pick image from og:image -- doesn't work ...

Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
dependabot[bot] 025261c120
chore(deps): Bump ws from 5.2.2 to 5.2.3 (#673)
Bumps [ws](https://github.com/websockets/ws) from 5.2.2 to 5.2.3.
- [Release notes](https://github.com/websockets/ws/releases)
- [Commits](https://github.com/websockets/ws/compare/5.2.2...5.2.3)

---
updated-dependencies:
- dependency-name: ws
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2 years ago
dependabot[bot] 34bc6facc7
chore(deps): Bump moment from 2.29.2 to 2.29.4 (#672)
Bumps [moment](https://github.com/moment/moment) from 2.29.2 to 2.29.4.
- [Release notes](https://github.com/moment/moment/releases)
- [Changelog](https://github.com/moment/moment/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/moment/moment/compare/2.29.2...2.29.4)

---
updated-dependencies:
- dependency-name: moment
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
dependabot[bot] 7b15df58be
chore(deps): Bump terser from 4.8.0 to 4.8.1 (#671)
Bumps [terser](https://github.com/terser/terser) from 4.8.0 to 4.8.1.
- [Release notes](https://github.com/terser/terser/releases)
- [Changelog](https://github.com/terser/terser/blob/master/CHANGELOG.md)
- [Commits](https://github.com/terser/terser/compare/v4.8.0...v4.8.1)

---
updated-dependencies:
- dependency-name: terser
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
John Holdun fb74196d79
chore: Update CircleCI config (#661)
Removing a couple extraneous CircleCI commands to see if they're still needed. I think one of the removed lines is causing #654 to fail, but let's see.
2 years ago
Jae Hanley f7439ec3fd
modifies check-build to differentiate between test env (#665)
Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
John Holdun 6ffa1a746e
chore: Update jQuery to 3.5.0 (#662)
Resolves #607
2 years ago
dependabot[bot] 8d18b0ed0d
chore(deps): Bump shell-quote from 1.6.1 to 1.7.3 (#668)
Bumps [shell-quote](https://github.com/substack/node-shell-quote) from 1.6.1 to 1.7.3.
- [Release notes](https://github.com/substack/node-shell-quote/releases)
- [Changelog](https://github.com/substack/node-shell-quote/blob/master/CHANGELOG.md)
- [Commits](https://github.com/substack/node-shell-quote/compare/1.6.1...1.7.3)

---
updated-dependencies:
- dependency-name: shell-quote
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2 years ago
Samuel Clay d5dabae20b
Update CHANGELOG.md (#663)
Typo on v2.2.1 release date
2 years ago
Jim Nielsen 9cd9662bcb
support build of es modules (#570) 2 years ago
Marco Wiedemeyer d0c78911e6
Add a new custom extractor for www.abendblatt.de (#559)
* Add custom extractor for www.abendblatt.de

* update

Co-authored-by: Marco Wiedemeyer <marco.wiedemeyer@ottogroup.com>
Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
Felipe Canejo 6014016283
feat: Add a custom extractor for pastebin.com (#556)
* feat: Add a custom extractor for pastebin.com

* feat: transforms <li> to <p> in pastebin.com

Co-authored-by: Felipe Canejo <felipecanejo@gmail.com>
Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
John Brayton e217648c0b
feat: ma.ttias.be extractor (#551)
* feat:Add a custom extractor for ma.ttias.be.

When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows:

* Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements to avoid them being removed.

* removed redundant comment.

Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
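The three steps listed in the commit message above can be sketched as a small pre-processing pass. This is only an illustration under stated assumptions: the real extractor operates on a cheerio/jQuery document, while the `SimpleNode` type and the `prepare` function here are hypothetical stand-ins.

```typescript
// Hypothetical stand-in for a DOM element: a tag name plus attributes.
interface SimpleNode {
  tag: string;
  attrs: Record<string, string>;
}

// Applies the three adjustments described in the commit message
// to a flat list of nodes and returns new nodes (inputs are not mutated).
function prepare(nodes: SimpleNode[]): SimpleNode[] {
  return nodes.map(node => {
    const attrs = { ...node.attrs };
    let tag = node.tag;
    if (tag === 'h1' || tag === 'h2') {
      delete attrs.id; // per the commit, "id" gave these elements a low weight
    }
    if (tag === 'h2') {
      tag = 'h3'; // the parser demotes h1 to h2, so h2 moves down as well
    }
    if (tag === 'ul') {
      // mark the list so it is not removed during cleaning
      attrs.class = attrs.class
        ? `${attrs.class} entry-content-asset`
        : 'entry-content-asset';
    }
    return { tag, attrs };
  });
}

const result = prepare([
  { tag: 'h1', attrs: { id: 'title' } },
  { tag: 'h2', attrs: { id: 'news' } },
  { tag: 'ul', attrs: {} },
]);
```

After this pass the `h1` keeps its tag but loses its `id`, the `h2` becomes an `h3`, and the `ul` carries the `entry-content-asset` class.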
James Shakespeare 70e99d56cf
Feat: update qz.com selectors and tests (#538)
* feat: update qz.com selectors and tests

* chore: remove out of date fixture
2 years ago
Michael Ashley 56a19bf934
fix: updating generate-parser dist (#499) 2 years ago
Ethan Jucovy af9cfcd120
fix: don't try to re-decode prepared response (#498)
* fix: don't try to re-decode prepared response

* Remove stray console.log
2 years ago
Peter Dave Hello 9515dc28c1
chore: update node version in .nvmrc & CONTRIBUTING.md (#599)
Ref: #579, a5a066c69d
2 years ago
Joe Moon fb44ab0244
Bugfix new yorker wired extractors (#604)
* www.newyorker.com: add updated fixtures and fix extractors

* www.wired.com: add updated fixtures and fix extractors

Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
Nick Sweeting 99062da034
Add --version CLI flag (#610)
* add --version CLI flag

* move import to top of file for consistency

Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
dependabot[bot] 32dff4aedb
chore(deps-dev): bump karma from 3.1.4 to 6.3.16 (#654)
* chore(deps-dev): bump karma from 3.1.4 to 6.3.16

Bumps [karma](https://github.com/karma-runner/karma) from 3.1.4 to 6.3.16.
- [Release notes](https://github.com/karma-runner/karma/releases)
- [Changelog](https://github.com/karma-runner/karma/blob/master/CHANGELOG.md)
- [Commits](https://github.com/karma-runner/karma/compare/v3.1.4...v6.3.16)

---
updated-dependencies:
- dependency-name: karma
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: Update CircleCI config

Removing a couple extraneous CircleCI commands to see if they're still needed. I think one of the removed lines is causing #654 to fail, but let's see.

* chore: Update karma-browserify

* Revert "chore: Update CircleCI config"

This reverts commit c474be7433.

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
dependabot[bot] 736778d2e7
chore(deps): bump moment from 2.23.0 to 2.29.2 (#656)
* chore(deps): bump moment from 2.23.0 to 2.29.2

Bumps [moment](https://github.com/moment/moment) from 2.23.0 to 2.29.2.
- [Release notes](https://github.com/moment/moment/releases)
- [Changelog](https://github.com/moment/moment/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/moment/moment/compare/2.23.0...2.29.2)

---
updated-dependencies:
- dependency-name: moment
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* feat: Add stricter format definitions to extractors for failing tests

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: John Holdun <john@johnholdun.com>
2 years ago
John Holdun 65e338a403
feat: Add date formats to two extractors (#660)
These extractors were variously failing tests as I tried updating dependencies. It seems like some of the format detection logic has changed, and making these date detectors more explicit fixes them.
2 years ago
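To illustrate why an explicit date format is more robust than format detection, here is a minimal sketch. It does not use the parser's date library; the `parseWithFormat` function and its three supported tokens (`YYYY`, `MM`, `DD`) are assumptions made for this example only.

```typescript
// Hypothetical strict parser: the value must match the format exactly.
// Supports only the tokens YYYY, MM and DD plus literal separators.
function parseWithFormat(value: string, format: string): Date | null {
  const tokens: string[] = [];
  // Build a regex with one capture group per token, escaping separators.
  const pattern = format.replace(/YYYY|MM|DD|[^A-Za-z0-9]/g, piece => {
    if (piece === 'YYYY') {
      tokens.push(piece);
      return '(\\d{4})';
    }
    if (piece === 'MM' || piece === 'DD') {
      tokens.push(piece);
      return '(\\d{2})';
    }
    return '\\' + piece;
  });
  const found = new RegExp(`^${pattern}$`).exec(value);
  if (!found) return null;

  const parts: Record<string, number> = {};
  for (let i = 0; i < tokens.length; i++) {
    parts[tokens[i]] = Number(found[i + 1]);
  }
  const { YYYY, MM, DD } = parts;
  if (YYYY === undefined || MM === undefined || DD === undefined) return null;
  return new Date(Date.UTC(YYYY, MM - 1, DD)); // months are zero-based
}

// The same string means different dates depending on the declared format.
const dayFirst = parseWithFormat('03/04/2022', 'DD/MM/YYYY');
const monthFirst = parseWithFormat('03/04/2022', 'MM/DD/YYYY');
const mismatch = parseWithFormat('2022-04-03', 'DD/MM/YYYY');
```

With the format stated up front, an ambiguous string such as `03/04/2022` has exactly one reading, and a value that does not fit the format is rejected instead of being guessed at.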
dependabot[bot] 8dd3c7078a
chore(deps): bump jquery from 3.4.1 to 3.5.0 (#557)
Bumps [jquery](https://github.com/jquery/jquery) from 3.4.1 to 3.5.0.
- [Release notes](https://github.com/jquery/jquery/releases)
- [Commits](https://github.com/jquery/jquery/compare/3.4.1...3.5.0)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2 years ago
dependabot[bot] 88718d4caf
chore(deps): bump cached-path-relative from 1.0.2 to 1.1.0 (#647)
Bumps [cached-path-relative](https://github.com/ashaffer/cached-path-relative) from 1.0.2 to 1.1.0.
- [Release notes](https://github.com/ashaffer/cached-path-relative/releases)
- [Commits](https://github.com/ashaffer/cached-path-relative/commits)

---
updated-dependencies:
- dependency-name: cached-path-relative
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2 years ago
dependabot[bot] af5974f6ea
chore(deps): bump async from 2.6.1 to 2.6.4 (#658)
Bumps [async](https://github.com/caolan/async) from 2.6.1 to 2.6.4.
- [Release notes](https://github.com/caolan/async/releases)
- [Changelog](https://github.com/caolan/async/blob/v2.6.4/CHANGELOG.md)
- [Commits](https://github.com/caolan/async/compare/v2.6.1...v2.6.4)

---
updated-dependencies:
- dependency-name: async
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2 years ago
dependabot[bot] 5d5e833ff0
chore(deps): bump tmpl from 1.0.4 to 1.0.5 (#633)
Bumps [tmpl](https://github.com/daaku/nodejs-tmpl) from 1.0.4 to 1.0.5.
- [Release notes](https://github.com/daaku/nodejs-tmpl/releases)
- [Commits](https://github.com/daaku/nodejs-tmpl/commits/v1.0.5)

---
updated-dependencies:
- dependency-name: tmpl
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2 years ago
dependabot[bot] 78cd8e06ee
chore(deps): bump tar from 4.4.8 to 4.4.19 (#630)
Bumps [tar](https://github.com/npm/node-tar) from 4.4.8 to 4.4.19.
- [Release notes](https://github.com/npm/node-tar/releases)
- [Changelog](https://github.com/npm/node-tar/blob/main/CHANGELOG.md)
- [Commits](https://github.com/npm/node-tar/compare/v4.4.8...v4.4.19)

---
updated-dependencies:
- dependency-name: tar
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2 years ago
dependabot[bot] 9dbe885364
chore(deps): bump path-parse from 1.0.5 to 1.0.7 (#628)
Bumps [path-parse](https://github.com/jbgutierrez/path-parse) from 1.0.5 to 1.0.7.
- [Release notes](https://github.com/jbgutierrez/path-parse/releases)
- [Commits](https://github.com/jbgutierrez/path-parse/commits/v1.0.7)

---
updated-dependencies:
- dependency-name: path-parse
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2 years ago
dependabot[bot] 09558599fa
chore(deps): bump y18n from 3.2.1 to 3.2.2 (#609)
Bumps [y18n](https://github.com/yargs/y18n) from 3.2.1 to 3.2.2.
- [Release notes](https://github.com/yargs/y18n/releases)
- [Changelog](https://github.com/yargs/y18n/blob/master/CHANGELOG.md)
- [Commits](https://github.com/yargs/y18n/commits)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2 years ago
dependabot[bot] 633c172ff0
chore(deps): bump mixin-deep from 1.3.1 to 1.3.2 (#489)
Bumps [mixin-deep](https://github.com/jonschlinkert/mixin-deep) from 1.3.1 to 1.3.2.
- [Release notes](https://github.com/jonschlinkert/mixin-deep/releases)
- [Commits](https://github.com/jonschlinkert/mixin-deep/compare/1.3.1...1.3.2)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2 years ago
dependabot[bot] 6fcd2233a8
chore(deps): bump browserslist from 4.4.0 to 4.20.3 (#659)
Bumps [browserslist](https://github.com/browserslist/browserslist) from 4.4.0 to 4.20.3.
- [Release notes](https://github.com/browserslist/browserslist/releases)
- [Changelog](https://github.com/browserslist/browserslist/blob/main/CHANGELOG.md)
- [Commits](https://github.com/browserslist/browserslist/compare/4.4.0...4.20.3)

---
updated-dependencies:
- dependency-name: browserslist
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2 years ago
dependabot[bot] d2d4287581
chore(deps): bump ajv from 6.7.0 to 6.12.6 (#650)
Bumps [ajv](https://github.com/ajv-validator/ajv) from 6.7.0 to 6.12.6.
- [Release notes](https://github.com/ajv-validator/ajv/releases)
- [Commits](https://github.com/ajv-validator/ajv/compare/v6.7.0...v6.12.6)

---
updated-dependencies:
- dependency-name: ajv
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2 years ago
dependabot[bot] 11b2f4414a
chore(deps): bump pathval from 1.1.0 to 1.1.1 (#648)
Bumps [pathval](https://github.com/chaijs/pathval) from 1.1.0 to 1.1.1.
- [Release notes](https://github.com/chaijs/pathval/releases)
- [Changelog](https://github.com/chaijs/pathval/blob/master/CHANGELOG.md)
- [Commits](https://github.com/chaijs/pathval/compare/v1.1.0...v1.1.1)

---
updated-dependencies:
- dependency-name: pathval
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2 years ago
dependabot[bot] adecdcfef0
chore(deps): bump node-fetch from 2.3.0 to 2.6.7 (#642)
Bumps [node-fetch](https://github.com/node-fetch/node-fetch) from 2.3.0 to 2.6.7.
- [Release notes](https://github.com/node-fetch/node-fetch/releases)
- [Changelog](https://github.com/node-fetch/node-fetch/blob/main/docs/CHANGELOG.md)
- [Commits](https://github.com/node-fetch/node-fetch/compare/v2.3.0...v2.6.7)

---
updated-dependencies:
- dependency-name: node-fetch
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2 years ago