* feat: extract custom types with extend option
Adds an `extend` option that lets you add custom types to be extracted
and returned alongside the defaults, either in a call to `parse()` or in
a custom extractor.
```
Mercury.parse(url, {
  extend: {
    last_edited: { selectors: ['#last-edited'], defaultCleaner: false },
  },
});
```
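For the custom-extractor case, a minimal sketch, assuming an extractor is a plain object keyed by field name. The `ExampleExtractor` name, the domain, and the `title` selector are hypothetical; only the `extend` entry mirrors the `parse()` call above.

```javascript
// Hypothetical custom extractor: the name, domain and title selector are
// illustrative only and do not come from this change set.
const ExampleExtractor = {
  domain: 'www.example.com',
  title: { selectors: ['h1'] },
  // Custom type, extracted and returned alongside the default fields.
  extend: {
    last_edited: { selectors: ['#last-edited'], defaultCleaner: false },
  },
};
```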
* chore: use Reflect.ownKeys
* feat: add CLI options
* doc: add extend param to cli help
* refactor: extract selectExtendedTypes
* feat: only overwrite null extended results
* feat: add allowMultiple extraction option
* feat: accept extendList CLI args
* feat: allow attribute selectors in extends on CLI
* test: update extend tests
* fix: don't invoke cleaner for custom types
* feat: always return array if allowMultiple
* test: add test for array of single result
* refactor: extract extractHtml
* refactor: destructure allowMultiple
* fix: wrap multiple matches in $ for cheerio shim
* fix: find extended types before any other munging
* feat: absolutize all links
* fix: clean content more directly
* doc: Update CLI docs in README
* chore: update dist
* doc: Document extend in custom extractor README
* chore: add missing fields to package.json
* feat: add postlight org scope to package name
* feat: automate npm publish
* test: npm publish without filters
* fix: add docker image
* test: change directory
* test: add working directory
* fix: defaults syntax
* test: add workspace
* fix: attach workspace
* fix: use standard mercury email
* fix: use ISO time format and preserve original timezone offset
* fix: do not match time zone offset
* chore: move babel runtime-corejs2 to prod deps
* chore: uncomment config to deploy on git tag
* feat: publish to npm public
* adding browser-request
It doesn't seem to impact the build, but technically it should be there
so for good measure, why not...
* chore: roll version back to original state
* dx: remove commented code and obvious comments that can be looked up
* dx: remove commented out eslint options
* dx: remove commented out code
* dx: remove commented out code
* dx: remove commented out code
* dx: remove test block as all its code was commented out
* dx: remove commented out code
* dx: remove commented out code
* dx: remove commented out code
* dx: remove regex example comments
* dx: remove commented out code
* dx: remove commented out code
* dx: remove commented out import
* dx: remove commented out code
* dx: remove commented out code
* dx: remove commented out code
* dx: remove commented out code
* dx: remove commented out code
* dx: remove commented out code
* dx: remove commented out code
* dx: remove commented out code
* dx: remove commented out code
* dx: remove commented out code
* dx: remove commented out code
* chore: remove empty files
* chore: re-prettier code that may have missed it
* added back necessary comments
* chore: update .nvmrc
* added prettier and pre-commit hooks
* update docker image to new node
* add karma-cli to get web tests working
* explicitly install karma... seems to fix problem
* remove pre-built phantomjs
* swap install order
* feat: prospect magazine parser
Couldn’t find a way to parse the date but I think it’s good otherwise.
* fix: pulls date
* fix: add timezone
* fix: generalize
* feat: forward.com parser
LGTM although image didn’t show up in preview
* feat: also pull image into content
* fix: generalize selectors
* fix: generalize selector
* feat: qdaily parser
Firstly — I accidentally tried to generate the parser on the master
branch, and I’m not sure where it is, maybe floating in the nether
world.
On to the parser — this one was a bit tricky because things were in
Chinese! The content appears to be parsing (as seen in preview) but
it’s not passing the test. I noticed the second “‘” mark isn’t
appearing on the parser side.
Additionally, some of the lazy loading images aren’t appearing in the
preview (I cleaned the wrong lazy load images that appeared), so
someone will probably have to work on that (I don’t know how to do
transforms yet).
* fix tests
* fix: selector generalization
* feat: gothamist extractor
* feat: add other gothamist network sites
* fix: try getting date another way
* fix: add gothamist timezone
* fix: generalize selectors
* fix: h1 is inside entry-header, needs to be specific because of another h1 on the page
* fix: general and specific selector
* feat: natgeo parser
For some reason, the local copy of the article didn’t grab the author
name in it, so I couldn’t figure out how to parse it. The generic
parser took a name of an author of a paper mentioned in the article,
and thought that was the author name, which was funny.
I cleaned a large block quote that didn’t make sense as it was shown in
the preview, although I noticed that the Mercury chrome extension
didn’t even display it.
* fix: add date_published transform
* fix: date_published assertion
* disable: author assertion, generalize author selector
* rm: author assertion
* fix: image lead
* fix: guard against missing img url
* fix: generalize dek and title selectors
* feat: natgeo parser
Same as the news.nationalgeographic.com parser - for some reason the
author name doesn’t appear to be getting pulled into the local copy of
the file.
* fix: content assertion
* fix: generalize author byline
* disable: author assertion
* rm: author assertion
* fix: image lead, handles image-group
* fix: guard against missing img url
* fix: generalize dek and title selectors