380 Commits (fix-remove-moment-js)

Author SHA1 Message Date
kik0220 f81dc63617 feat: add rbbtoday.com custom parser (#411)
* feat: add rbbtoday.com custom parser

* fix: content test

* fix: dek and content
5 years ago
kik0220 5e1113b3a9 feat: add japan.zdnet.com custom parser (#410)
* feat: add japan.zdnet.com custom parser

* fix: author and date_published selector
5 years ago
kik0220 77e3bc00e2 feat: add wired.jp custom parser (#409)
* feat: add wired.jp custom parser

* fix: author test

* fix: date_published selector

* test: fix dek and content

* test: fix content (without clean dek)
5 years ago
kik0220 0b36c96de0 feat: add techlog.iij.ad.jp custom parser (#405)
* feat: add techlog.iij.ad.jp custom parser

* fix: date_published and content selector
5 years ago
kik0220 406bf1b1a9 feat: add weekly.ascii.jp custom parser (#401)
* feat: add weekly.ascii.jp custom parser

* fix: title and date_published selector
5 years ago
kik0220 216bfade00 feat: add www.ipa.go.jp custom parser (#408) 5 years ago
kik0220 3ae8f3bde3 feat: add www.oreilly.co.jp custom parser (#407) 5 years ago
kik0220 7396e81b72 feat: add sect.iij.ad.jp custom parser (#404) 5 years ago
kik0220 3f1d9030ee feat: add www.lifehacker.jp custom parser (#403) 5 years ago
kik0220 b077000c4a feat: add getnews.jp custom parser (#402) 5 years ago
kik0220 b5425c3e8a feat: add www.gizmodo.jp custom parser (#400) 5 years ago
kik0220 a38c727a0a feat: add deadline.com custom parser (#383)
* feat: add deadline.com custom parser

* fix: timezone

* fix: date_published selectors

* fix: title and author selector

* test: transform .embed-twitter

* fix: regenerate the fixture and fix content selector
5 years ago
kik0220 74a3c49a3c feat: add japan.cnet.com custom parser (#382)
* feat: add japan.cnet.com custom parser

* fix: remove transform
5 years ago
kik0220 7b07f88448 feat: add www.yomiuri.co.jp custom parser (#381) 5 years ago
Toufic Mouallem 3f46859d14 fix: skip absolutizing invalid srcsets (#386)
* fix: skip absolutizing empty srcsets

* test: empty srcsets are handled properly
5 years ago
kik0220 779c1154fb fix: add date_published selector in www.sanwa.co.jp extractor (#378) 5 years ago
kik0220 ea5b65f019 fix: add date_published selector in www.elecom.co.jp extractor (#377) 5 years ago
kik0220 7c0949e587 fix: add date_published selector in www.ossnews.jp extractor (#376) 5 years ago
kik0220 3e91ac55db fix: add date_published selector in jvndb.jvn.jp extractor (#375) 5 years ago
kik0220 8ca2894751 feat: add bookwalker.jp custom parser (#374) 5 years ago
kik0220 a5f06ce27a feat: add takagi-hiromitsu.jp custom parser (#364) 5 years ago
kik0220 b9c57dbc2f feat: add www.publickey1.jp custom parser (#365)
* feat: add www.publickey1.jp custom parser

* fix: date_published selector
5 years ago
kik0220 d7dbea8a95 feat: add www.itmedia.co.jp custom parser (#366)
* feat: add www.itmedia.co.jp custom parser

* feat: add nlab.itmedia.co.jp support

* fix: title selectors
5 years ago
kik0220 9218f80da6 feat: add www.moongift.jp custom parser (#367)
* feat: add www.moongift.jp custom parser

* fix: date_published selectors

* fix: pass test

* fix: add timezone
5 years ago
kik0220 4eb73dffb0 feat: add www.infoq.com custom parser (#368)
* feat: add www.infoq.com custom parser

* fix: date_published selector
5 years ago
kik0220 ce5cd2dd0d feat: add phpspot.org custom parser (#369)
* feat: add phpspot.org custom parser

* fix: date_published selector
5 years ago
Toufic Mouallem 3614e31abc fix: skip absolutizing empty hrefs (#372) 5 years ago
kik0220 73be0c5a10 feat: add www.jnsa.org custom parser (#346)
* feat: add www.jnsa.org custom parser
5 years ago
Adam Pash eacd1ee97f feat: custom genius parser. (#284)
also adds ability to transform value returned by an attribute selector
5 years ago
kik0220 c389c966d7 feat: add jvndb.jvn.jp custom parser (#345) 5 years ago
kik0220 8493d05cb5 feat: add scan.netsecurity.ne.jp custom parser (#347) 5 years ago
kik0220 2a76c6c212 feat: add www.elecom.co.jp custom parser (#348) 5 years ago
kik0220 a9e010b718 feat: add www.sanwa.co.jp custom parser (#349) 5 years ago
kik0220 1639eae324 feat: add www.asahi.com custom parser (#350) 5 years ago
kik0220 21f7de70c1 feat: add buzzap.jp custom parser (#351) 5 years ago
kik0220 f3a7e393a3 feat: add www.ossnews.jp custom parser (#352) 5 years ago
kik0220 c309bdb373 feat: add otrs.com custom parser (#353) 5 years ago
John Holdun 437f50a5c8 fix: Initialize Content-Type as empty string if not present (#359) 5 years ago
Toufic Mouallem 262dda94b3 fix: explicitly reject non-200 status codes (#342) 5 years ago
Toufic Mouallem 144a797564 feat: Support passing custom headers in requests (#337) 5 years ago
Toufic Mouallem 3ed778b53e fix: Adapt CNBC extractor to article redesign (#336) 5 years ago
Drew Bell b3e2a0ffd1 feat: extract custom types with extend option (#313)
* feat: extract custom types with extend option

Adds an `extend` option that lets you add custom types to be extracted
and returned alongside the defaults, either in a call to `parse()` or in
a custom extractor.

```
Mercury.parse(url, {
  extend: {
    last_edited: { selectors: ['#last-edited'], defaultCleaner: false }
  }
})
```

* chore: use Reflect.ownKeys

* feat: add CLI options

* doc: add extend param to cli help

* refactor: extract selectExtendedTypes

* feat: only overwrite null extended results

* feat: add allowMultiple extraction option

* feat: accept extendList CLI args

* feat: allow attribute selectors in extends on CLI

* test: update extend tests

* fix: don't invoke cleaner for custom types

* feat: always return array if allowMultiple

* test: add test for array of single result

* refactor: extract extractHtml

* refactor: destructure allowMultiple

* fix: wrap multiple matches in $ for cheerio shim

* fix: find extended types before any other munging

* feat: absolutize all links

* fix: clean content more directly

* doc: Update CLI docs in README

* chore: update dist

* doc: Document extend in custom extractor README
5 years ago
Toufic Mouallem 136d6df798 feat: Return specific errors on failed parse attempts 5 years ago
Toufic Mouallem a250f403f5 fix: Preserve whitespace in certain HTML elements (#333) 5 years ago
Ben Ubois a7e4c67d1d Extract content from GitHub repos. (#306)
* Extract content from GitHub repos.

* Add published and dek.

* Timezone fix.
5 years ago
Toufic Mouallem 0940971069 fix: better handling for responsive images (#312) 5 years ago
Drew Bell 785a22245f feat: switch from forked request to postman-request (#319) 5 years ago
Toufic Mouallem 7844129fda feat: Add custom parser for Reddit (#307) 5 years ago
Drew Bell 91fb0dfb46 fix: update parse signature in tests (#315) 5 years ago
Toufic Mouallem 9714cb70c5 feat: Use Deadspin parser for all Kinja websites (#304) 5 years ago
Jordan Hotmann 83d1c2401b feat: add custom extractor for blisterreview.com (#299) 5 years ago
kik0220 d9a1e7b22b feat: add news.mynavi.jp custom parser (#287) 5 years ago
Olli Sulopuisto 44a7ec791d docs: typofix (#300) 5 years ago
Ben Ubois ed14203e97 fix: return early if creating the resource failed. (#285) 5 years ago
Adam Pash 2afd8c9fa8 fix: jquery doesn't like the case insensitive selector (#274) 5 years ago
Adam Pash 9bf88b0ba3 chore: refactor format output adjustments (#272)
I had previously done this in an overly complicated manner. This PR cleans
it up a bit.
5 years ago
Ben Ubois 0e27448866 feat: Various Character Encoding Improvements (#270)
* Support HTML5 charset tag

In HTML5 `<meta charset="">` is shorthand for `<meta http-equiv="content-type" content="">`
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta

* Handle more character encoding declaration methods.
5 years ago
Adam Pash 9b0664bc91 feat: add content format output options (#256) 5 years ago
Adam Pash c6f42c1278 docs: cleanup and update docs (#238) 5 years ago
George Haddad 5c0325f5a7 feat: hook up ci to publish to npm (#226)
* chore: add missing fields to package.json

* feat: add postlight org scope to package name

* feat: automate npm publish

* test: npm publish without filters

* fix: add docker image

* test: change directory

* test: add working directory

* fix: defaults syntax

* test: add workspace

* fix: attach workspace

* fix: use standard mercury email

* fix: use ISO time format and preserve original timezone offset

* fix: do not match time zone offset

* chore: move babel runtime-corejs2 to prod deps

* chore: uncomment config to deploy on git tag

* feat: publish to npm public

* adding browser-request

It doesn't seem to impact the build, but technically it should be there
so for good measure, why not...

* chore: roll version back to original state
5 years ago
Adam Pash 663cc45bf4 fresh run of prettier; remove NOTES.md (#233) 5 years ago
Wajeeh Zantout 1ccd14e1e9 feat: add fortinet custom parser (#188)
* feat: add fortinet custom parser

* fix: eslint error

* fix: transform noscript images

* feat: add fortinet custom parser

* fix: eslint error

* fix: transform noscript images

* fix: transform method

* test: transform method

* fix: fs import
5 years ago
Wajeeh Zantout 9b36003b62 feat: add fastcompany custom parser (#191)
* feat: add fastcompany custom parser

* fix: eslint error

* fix: test for date_published

* feat: add fastcompany custom parser

* fix: eslint error

* fix: test for date_published

* fix: fs import
5 years ago
Toufic Mouallem bb6ad2682b fix: Transform relative URLs in srcset attributes to absolute URLs (#190) 5 years ago
Jad Termsani 15a5229998 fix: womansay.net image urls (#196) 5 years ago
Adam Pash 0e22947e2c fix: non-forked packages breaking web build (#225) 5 years ago
Ralph Jbeily f3f6e21fd8 fix: author and date published selectors (#189) 5 years ago
Jad Termsani 28cf41304c fix: timezone comparison (#222)
* fix: use format() instead of toISOString()

* fix: timezone comparison
5 years ago
Ralph Jbeily ca44ce3dd1 docs: add install build and test guide (#215)
* docs: add install build and test guide

* docs: remove install build and test guides

* docs: add installation guide
5 years ago
Ralph Jbeily 2e1e4d90c9 feat: add remarklint for md docs (#213)
* feat: add remarklint for md docs

* fix: remarkrc file and run linter on commit hook
5 years ago
Adam Pash 76d333f0be deps: upgrade (#218) 5 years ago
George Haddad 56badb51f5 dx: remove unnec comments in source (#205)
* dx: remove commented code and obvious comments that can be looked up

* dx: remove commented out eslint options

* dx: remove commented out code

* dx: remove commented out code

* dx: remove commented out code

* dx: remove test block as all its code was commented out

* dx: remove commented out code

* dx: remove commented out code

* dx: remove commented out code

* dx: remove regex example comments

* dx: remove commented out code

* dx: remove commented out code

* dx: remove commented out import

* dx: remove commented out code

* dx: remove commented out code

* dx: remove commented out code

* dx: remove commented out code

* dx: remove commented out code

* dx: remove commented out code

* dx: remove commented out code

* dx: remove commented out code

* dx: remove commented out code

* dx: remove commented out code

* dx: remove commented out code

* chore: remove empty files

* chore: re-prettier code that may have missed it

* added back nec comments
5 years ago
Adam Pash e4b057f9ea chore: update node and some deps (#209)
* chore: update .nvmrc

* added prettier and pre-commit hooks

* update docker image to new node

* add karma-cli to get web tests working

* explicitly install karma... seems to fix problem

* remove pre-built phantomjs

* swap install order
5 years ago
Adam Pash 96640e3564 fix: failing fetchResource test (#187)
I think it was a fixture problem
6 years ago
Adam Pash d850177b68 docs: Update README.md (#184) 6 years ago
Adam Pash 5663660f76 fix: nytimes custom parser title selector (#181)
* fix: nytimes custom parser title selector

* upgrade node version

* circle ci tweak
6 years ago
Jeremy Mack 5fcea1c5c3 fix: PARSING_NODE undefined (#172)
* fix: PARSING_NODE undefined

* chore: remove unused cleanup function/call
7 years ago
Jeremy Mack e92e798880 fix: viewport tags leaking to parent page (#170)
* fix: scrub meta viewport tags

They leak to the parent page when using the web version of Mercury
Parser.

* chore: build

* fix: keep DOM in memory to avoid conflicts
7 years ago
Adam Pash b8aa87c777 feat: improve wh parser (#168) 7 years ago
Adam Pash 61f0f4e1af fix: kept elements being removed (#166)
Elements marked to keep were removable under specific circumstances.
This PR fixes these edge cases.
7 years ago
Adam Pash 453419de72 feat: improve wh.gov parser (#163)
* feat: support youtube-nocookie domain

* feat: updated wh.gov parser to support speeches
7 years ago
Janet f13bb721f6 feat: prospect magazine parser (#147)
* feat: prospect magazine parser

Couldn’t find a way to parse the date but I think it’s good otherwise.

* fix: pulls date

* fix: add timezone

* fix: generalize
7 years ago
Kevin Ngao 1b28713cf5 feat: fool.com parser (#158)
* feat: add fool.com custom parser
7 years ago
Janet c18959779d feat: forward.com parser (#144)
* feat: forward.com parser

LGTM although image didn’t show up in preview

* feat: also pull image into content

* fix: generalize selectors

* fix: generalize selector
7 years ago
Janet 50e548bac2 feat: qdaily parser (#146)
* feat: qdaily parser

Firstly — I accidentally tried to generate the parser on the master
branch, and I’m not sure where it is, maybe floating in the nether
world.

On to the parser — this one was a bit tricky because things were in
Chinese! The content appears to be parsing (as seen in preview) but
it’s not passing the test. I noticed the second “ ‘ “ mark isn’t
appearing on the parser side.

Additionally, some of the lazy loading images aren’t appearing in the
preview (I cleaned the wrong lazy load images that appeared), so
someone will probably have to work on that (I don’t know how to do
transforms yet).

* fix tests

* fix: selector generalization
7 years ago
Silas Burton 51a4d1d12f feat: newrepublic parser shows image on page (#159) 7 years ago
Silas Burton 11382ce651 Feat: Slate extractor (#153)
* feat: slate extractor

* fix: generalize selectors

* fix: add Slate timezone
7 years ago
Silas Burton 5acaa6ab56 feat: ici.radio-canada.ca extractor (#156)
* feat: ici.radio-canada.ca extractor

* fix: add timezone
7 years ago
Silas Burton 4509b341e6 feat: better cleanup of atlantic articles (#157) 7 years ago
Kevin Ngao f2e3f055c2 Fixes an issue with encoding (#154)
* fix: fixes an issue with encoding on the fetch level
7 years ago
Silas Burton 9b371e51ac Feat: gothamist extractor (#151)
* feat: gothamist extractor

* feat: add other gothamist network sites

* fix: try getting date another way

* fix: add gothamist timezone

* fix: generalize selectors

* fix: h1 is inside entry-header, needs to be specific because of another h1 on the page

* fix: general and specific selector
7 years ago
Kevin Ngao afbef9bc39 Fix Encoding on Body (#143)
* fix: check encoding on body
7 years ago
Janet 93d2baf5cf feat: news.natgeo parser (#88)
* feat: natgeo parser

For some reason, the local copy of the article didn’t grab the author
name in it, so I couldn’t figure out how to parse it. The generic
parser took a name of an author of a paper mentioned in the article,
and thought that was the author name, which was funny.

I cleaned a large block quote that didn’t make sense as it was shown in
the preview, although I noticed that the Mercury chrome extension
didn’t even display it.

* fix: add date_published transform

* fix: date_published assertion

* disable: author assertion, generalize author selector

* rm: author assertion

* fix: image lead

* fix: guard against missing img url

* fix: generalize dek and title selectors
7 years ago
Janet 2279c2d486 feat: natgeo parser (#89)
* feat: natgeo parser

Same as the news.nationalgeographic.com parser - for some reason the
author name doesn’t appear to be getting pulled into the local copy of
the file.

* fix: content assertion

* fix: generalize author byline

* disable: author assertion

* rm: author assertion

* fix: image lead, handles image-group

* fix: guard against missing img url

* fix: generalize dek and title selectors
7 years ago
Adam Pash 08b5bb7ff1 feat: allow parser to define custom date formats (#141)
* feat: allow parser to define custom date formats

* feat: updating macrumors to test/verify format working correctly
7 years ago
Janet 11f466ccb3 feat: latimes parser (#92)
* feat: latimes parser
7 years ago
Kevin Ngao 26a8e4f75a feat: macrumors parser (#120)
* feat: add macrumors
7 years ago
Kevin Ngao b4fec6af98 feat: androidcentral parser (#119)
* feat: androidcentral parser
7 years ago
Janet beb0b89a4f feat: pagesix parser (#97)
* feat: pagesix parser
7 years ago
Janet f2160eb5b6 feat: si parser (#118)
* feat: si parser
7 years ago