* feat: Refactor and update fixtures
This patch changes how fixtures are stored. Previously, a fixture's folder identified its domain and its filename identified when it was fetched. Now the filename identifies the domain, and the file's modified time indicates how recently it was fetched. A fixture's filename can optionally include a modifier, for example to distinguish between two different page types on the same domain.
Also included here are changes to the update-fixture script, both to accommodate the new filename scheme and to actually update all fixtures. The functionality for running automatically and opening PRs has been removed but will likely be reintroduced.
Finally, all fixtures have been updated.
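As a rough illustration of the new naming scheme, the sketch below reads a fixture's domain and optional modifier from its filename and its fetch time from the file's modified time. The `--` separator, the `describeFixture` and `isStale` helpers, and the example filename are assumptions made for illustration; they are not taken from the update-fixture script.

```javascript
const fs = require('fs');
const path = require('path');

// Assumption: fixtures are named "<domain>.html" or "<domain>--<modifier>.html".
const MODIFIER_SEPARATOR = '--';

// Hypothetical helper: derive fixture metadata from the file itself.
function describeFixture(filePath) {
  const base = path.basename(filePath, path.extname(filePath));
  const [domain, modifier = null] = base.split(MODIFIER_SEPARATOR);
  // The modified time stands in for "when was this fixture last fetched".
  const { mtime } = fs.statSync(filePath);
  return { domain, modifier, fetchedAt: mtime };
}

// Hypothetical helper: decide whether a fixture is due for a refresh.
function isStale(fixture, maxAgeDays, now = new Date()) {
  const ageMs = now.getTime() - fixture.fetchedAt.getTime();
  return ageMs > maxAgeDays * 24 * 60 * 60 * 1000;
}
```

With these assumptions, `describeFixture('fixtures/www.example.com--article.html')` would report the domain `www.example.com` and the modifier `article`, and `isStale` could be used to pick which fixtures to refetch.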
* Remove reference to deleted extractor
* feat: first batch of test and parser updates due to new fixtures
* feat: update more custom parsers and unit tests
* feat: update more custom parsers and unit tests and remove unnecessary parser
* feat: update more custom parsers and unit tests
* feat: update more parsers and add correct bloomberg html files
* fix: remove console statement
* feat: all parsers updated and tests passing
* fix: update date_published tests to account for test server time difference
* fix: cleanup remaining fixtures in folders
* feat: move fixtures for newest custom parsers
* feat: remove script changes
* fix: update dist files to account for reverting script changes
* adding .DS_Store to .gitignore
* adding .DS_Store to .gitignore -- 2
* adding .DS_Store to .gitignore -- 3 lol
* cleaning up some tests
* fix: ran build:generator command to update generate-custom-parser dist file
* fix: update rollup configs to generate source maps and update source maps
* fix: use underscore in place of unused error variable
* fix: remove unused fixture
Co-authored-by: Postlight Bot <adam.pash+postlight-bot@postlight.com>
Co-authored-by: flbn <overasc@gmail.com>
* fixed and improved extraction for latest layout of politico.com
* explicit timezone for politico.com extractor
* handling more layout of politico.com
Co-authored-by: Andrei Zhemaituk <azhemoytuk@workfusion.com>
Co-authored-by: Sarah Doire <sarah.doire@postlight.com>
Co-authored-by: Sarah Doire <sarah.doire@gmail.com>
* fix: add alternative word count method
* fix: replace pages_rendered key with rendered_pages for consistency
* fix: return first lead_image_url when multiple og:image present
* fix: properly pull image src from lazy loaded img
* fix: allow drop cap character in medium custom extractor
* fix: refined medium parser
* Update more dependencies
Bumps almost everything up, removing almost all warnings from yarn audit. Doesn't touch cheerio or jest, as they still require more attention and QA.
* Adjust more dependencies, tweak build files
This change removes the pattern of separate JS files that provide "fixtures" for tests, that is, the input and expected strings used by those tests. (These are not to be confused with extractor fixtures, which are snapshots of a webpage.) The files were inconsistent and disorganized, and generally just added indirection to test files. All of those strings are now defined where they are used, in their respective tests.
* feat: Add a custom extractor for ma.ttias.be.
When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows:
* Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements to avoid them being removed.
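The three steps above could be expressed as content transforms roughly like this. It is a sketch only: the content selector is an assumption, and it relies on the custom-extractor convention that a transform returning a string renames the matched tag. The `$node` argument is expected to be a cheerio-style node.

```javascript
// Sketch of a custom extractor; not the committed implementation.
const MaTtiasBeExtractor = {
  domain: 'ma.ttias.be',
  content: {
    // Assumption: the article body lives under this selector.
    selectors: ['.content'],
    transforms: {
      // Drop the "id" so the heading is not given a low weight.
      h1: $node => {
        $node.removeAttr('id');
      },
      // Drop the "id" and demote to h3, since h1 elements become h2.
      h2: $node => {
        $node.removeAttr('id');
        return 'h3';
      },
      // Mark lists as content assets so they are not removed.
      ul: $node => {
        $node.attr('class', 'entry-content-asset');
      },
    },
  },
};
```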
* removed redundant comment.
* feat: Add a custom extractor for engadget.com.
* feat: Add a custom extractor for www.ndtv.com.
* Works, but I need to figure out how to make pagination work correctly.
* fixed pagination: it would only retrieve the first or second page because we sent contentOnly: true on subsequent pages (page 2).
Also removed failover: true from preview.
* rolled back { fallback: false } option removal
* Clarified comments.
* rolling back yarn.lock changes
Co-authored-by: John Holdun <john@johnholdun.com>
* Implemented custom extractor gruene.de
* Cleaner output of custom extractor www.gruene.de
* Updated fixture for www.gruene.de from real page
* Trying to pick image from og:image -- doesn't work ...
Co-authored-by: John Holdun <john@johnholdun.com>
* feat: Add a custom extractor for pastebin.com
* feat: transforms <li> to <p> in pastebin.com
Co-authored-by: Felipe Canejo <felipecanejo@gmail.com>
Co-authored-by: John Holdun <john@johnholdun.com>
* chore(deps-dev): bump karma from 3.1.4 to 6.3.16
Bumps [karma](https://github.com/karma-runner/karma) from 3.1.4 to 6.3.16.
- [Release notes](https://github.com/karma-runner/karma/releases)
- [Changelog](https://github.com/karma-runner/karma/blob/master/CHANGELOG.md)
- [Commits](https://github.com/karma-runner/karma/compare/v3.1.4...v6.3.16)
---
updated-dependencies:
- dependency-name: karma
  dependency-type: direct:development
...
Signed-off-by: dependabot[bot] <support@github.com>
* chore: Update CircleCI config
Removing a couple extraneous CircleCI commands to see if they're still needed. I think one of the removed lines is causing #654 to fail, but let's see.
* chore: Update karma-browserify
* Revert "chore: Update CircleCI config"
This reverts commit c474be7433.
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: John Holdun <john@johnholdun.com>
These extractors were variously failing tests as I tried updating dependencies. It seems like some of the format detection logic has changed, and making these date detectors more explicit fixes them.