* feat: Refactor and update fixtures
This patch changes how fixtures are stored. Previously, a fixture's folder identified its domain and its filename identified when it was fetched. Now the filename indicates the domain and the file's modified time indicates how recently it was fetched. A fixture's filename can optionally include a modifier, for example to distinguish between two different page types on the same domain.
Also included here are changes to the update-fixture script, both to accommodate the new filename scheme and to actually update all fixtures. The functionality for running automatically and opening PRs has been removed but will likely be reintroduced.
Finally, all fixtures have been updated.
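As a rough illustration of the naming scheme described above, the sketch below splits a fixture filename into its domain and optional modifier, and checks a fixture's age from its modified time. The `--` separator, the `.html` extension, and the names `parseFixtureName` and `isStale` are assumptions made for this example; they are not taken from the update-fixture script.

```javascript
const path = require('path');

// Assumed convention for this sketch: "<domain>[--<modifier>].html".
// The "--" separator is hypothetical, not confirmed by the patch.
function parseFixtureName(filePath) {
  const base = path.basename(filePath, '.html');
  const [domain, modifier = null] = base.split('--');
  return { domain, modifier };
}

// Treat a fixture as stale when its modified time is more than
// maxAgeDays in the past. `mtime` would typically come from
// fs.statSync(filePath).mtime.
function isStale(mtime, maxAgeDays, now = new Date()) {
  const ageMs = now.getTime() - mtime.getTime();
  return ageMs > maxAgeDays * 24 * 60 * 60 * 1000;
}
```

With this layout, deciding which fixtures to refetch needs only a directory listing and a stat call per file, with no date encoded in the name.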
* Remove reference to deleted extractor
* feat: first batch of test and parser updates due to new fixtures
* feat: update more custom parsers and unit tests
* feat: update more custom parsers and unit tests and remove unnecessary parser
* feat: update more custom parsers and unit tests
* feat: update more parsers and add correct bloomberg html files
* fix: remove console statement
* feat: all parsers updated and tests passing
* fix: update date_published tests to account for test server time difference
* fix: cleanup remaining fixtures in folders
* feat: move fixtures for newest custom parsers
* feat: remove script changes
* fix: update dist files to account for reverting script changes
* adding .DS_Store to .gitignore
* adding .DS_Store to .gitignore -- 2
* adding .DS_Store to .gitignore -- 3 lol
* cleaning up some tests
* fix: ran build:generator command to update generate-custom-parser dist file
* fix: update rollup configs to generate source maps and update source maps
* fix: use underscore in place of unused error variable
* fix: remove unused fixture
Co-authored-by: Postlight Bot <adam.pash+postlight-bot@postlight.com>
Co-authored-by: flbn <overasc@gmail.com>
* feat: extract custom types with extend option
Adds an `extend` option that lets you add custom types to be extracted
and returned alongside the defaults, either in a call to `parse()` or in
a custom extractor.
```
Mercury.parse(url, {
  extend: {
    last_edited: { selectors: ['#last-edited'], defaultCleaner: false },
  },
});
```
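To make the shape of the results concrete, here is a minimal sketch of how extended types could be resolved. It is not Mercury's implementation: the `query` callback is a hypothetical stand-in for a DOM lookup that returns an array of matched strings, and the body of `selectExtendedTypes` is a guess that borrows only the name and the `allowMultiple` behaviour from the commits listed in this log.

```javascript
// Hypothetical sketch. `extend` maps a type name to
// { selectors, allowMultiple }, and `query(selector)` returns an
// array of matched values (empty when nothing matches).
function selectExtendedTypes(extend, query) {
  const results = {};
  Reflect.ownKeys(extend).forEach(type => {
    const { selectors, allowMultiple = false } = extend[type];
    let matches = [];
    // Use the first selector that matches anything.
    for (const selector of selectors) {
      matches = query(selector);
      if (matches.length > 0) break;
    }
    if (allowMultiple) {
      // Always an array, even for zero or one match.
      results[type] = matches;
    } else {
      results[type] = matches.length > 0 ? matches[0] : null;
    }
  });
  return results;
}
```

Returning an array whenever `allowMultiple` is set means callers never have to branch on whether a single match came back as a string or as a list.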
* chore: use Reflect.ownKeys
* feat: add CLI options
* doc: add extend param to cli help
* refactor: extract selectExtendedTypes
* feat: only overwrite null extended results
* feat: add allowMultiple extraction option
* feat: accept extendList CLI args
* feat: allow attribute selectors in extends on CLI
* test: update extend tests
* fix: don't invoke cleaner for custom types
* feat: always return array if allowMultiple
* test: add test for array of single result
* refactor: extract extractHtml
* refactor: destructure allowMultiple
* fix: wrap multiple matches in $ for cheerio shim
* fix: find extended types before any other munging
* feat: absolutize all links
* fix: clean content more directly
* doc: Update CLI docs in README
* chore: update dist
* doc: Document extend in custom extractor README
* fix: scrub meta viewport tags
They leak to the parent page when using the web version of Mercury
Parser.
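The effect of this fix can be sketched with a plain string-based filter. This is an illustration only: the function name `scrubViewportMeta` is hypothetical, and a regular expression is used here to keep the example dependency-free, whereas the actual fix presumably operates on the parsed DOM.

```javascript
// Hypothetical sketch: drop <meta name="viewport" ...> tags from an
// HTML string so they cannot affect a page that embeds the output.
// A regex is a simplification; it does not handle every HTML edge case.
function scrubViewportMeta(html) {
  const viewportMeta = /<meta\b[^>]*\bname\s*=\s*["']?viewport["']?(?=[\s/>])[^>]*>/gi;
  return html.replace(viewportMeta, '');
}
```

Other `meta` tags, such as a charset declaration, are left untouched.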
* chore: build
* fix: keep DOM in memory to avoid conflicts
Big undertaking to support Mercury in the browser. Builds are working and all tests are passing both for web and node builds. Most code is closely shared.