Commit Graph

179 Commits

Author SHA1 Message Date
Janet
7709d69379 feat: npr parser (#86)
* feat: npr parser

Lead image appears in preview, but the test fails for some reason.

AssertionError: null ==
'https://media.npr.org/assets/img/2016/12/15/gettyimages-540681598_wide-8b160732b96c083dc115134c3c019f3ac73586ba.jpg?s=1400'

Looks okay otherwise.

* feat: transformed figures/figcaptions, improved date_published and
addressed NPR's bad image metadata
2017-01-23 17:23:02 -08:00
Janet
8a82f2c0ab feat: recode parser (#85)
* feat: recode parser

Thumbs up, as far as I can tell.

Note: No image appeared in the preview.

* feat: pulling in lead image
2017-01-23 17:02:33 -08:00
Janet
ad29acd7b7 feat: fortune parser (#84)
* feat: fortune parser

For some reason, the dek doesn’t appear in the local version of the
article I selected. I tried parsing the meta tag containing
og:description but it’s not working, and the description is slightly
longer than the dek in the original article.

I’m not sure why, but for the lead image, the meta tag for og:image is
not parsing the image url.

:(

* feat: fortune redesigned, so re-did extractor

* fix: added timezone
2017-01-23 16:47:06 -08:00
Janet
c133ddf614 feat: qz parser (#81)
* feat: qz parser

I couldn’t figure out how to parse the date, but otherwise should be
fine. I added a clean for the div.article-aside element based on what I
saw in how the chrome extension worked.

* feat: updated content to grab top image

test: date is null :/
2017-01-23 16:08:07 -08:00
Janet
84312b6ef1 feat: dmagazine parser (#80)
* feat: dmagazine parser

I’m sorry to have failed you. :-( These are the issues I encountered:

1) author - does not have a unique selector to distinguish it from the
date, couldn’t parse it
2) date - no meta data in the head
3) no meta og:image in the head (my go to), so I couldn’t get the image
test to pass, but it appears to be parsing. The caption below it is the
same size as the body copy in the preview. I couldn’t figure out how to
“transform” it to caption size.

* feat: update date, image, and author selectors and corresponding tests

* feat: generalized content selector
2017-01-23 15:52:05 -08:00
Janet
e035f36361 feat: reuters parser (#78)
* feat: reuters parser

Date parses correctly but fails test because of format discrepancy.

Author tags are nested within the content, which is why the author
names are appearing twice. I wasn’t sure how to address this.

Additionally, the location appears twice, so I cleaned the location
tags from the content.

* test: fix date format

* transform .article-subtitle to h4; cleaning author but leaving location
2017-01-23 15:16:37 -08:00
Janet
dec49ab073 feat: mashable parser (#76)
* feat: mashable parser

As usual the date is giving me issues because of formatting
discrepancies:
AssertionError: '2016-12-13T22:33:06.000Z' == '2016-12-14T03:33:06.000Z'

Not sure how we wanna deal with Twitter card embeds that don’t show up?

Also, image credits did not show up in preview.

* test: fixed date format

* transforming .image-credit to figcaption
2017-01-23 15:00:18 -08:00
Janet
cddc1afb69 feat: chicago tribune parser (#75)
* feat: chicago tribune parser

Date is parsing but failing the test because:
AssertionError: '2016-12-13T21:45:00.000Z' == '2016-12-13T13:45:00-0800'

I tried to insert a line of code for Time Zone but I’m a n00b so I
don’t think I did it right.

No image showing up in the preview.

* fix: remove timezone from date_published extractor

* test: update unit tests to assert the correct value for date_published
2017-01-22 12:18:10 -05:00
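
The date failures noted in the Chicago Tribune and Mashable messages above come from comparing timestamp strings. Below is a minimal sketch, assuming a test would rather compare instants than strings. `normalizeOffset`, `toIsoUtc`, and `sameInstant` are hypothetical helpers written for illustration; they are not functions from this repository.

```javascript
// Hypothetical helpers for illustration; they are not part of the parser.

// Rewrite a trailing "-0800" style offset as "-08:00", the form the
// ECMAScript date-time string format specifies.
function normalizeOffset(value) {
  return value.replace(/(:\d{2}(?:\.\d+)?[+-]\d{2})(\d{2})$/, '$1:$2');
}

// Convert any parseable timestamp to a canonical UTC ISO 8601 string.
function toIsoUtc(value) {
  const ms = Date.parse(normalizeOffset(value));
  if (Number.isNaN(ms)) {
    throw new Error(`Unparseable date: ${value}`);
  }
  return new Date(ms).toISOString();
}

// True when both strings name the same moment, whatever their offsets.
function sameInstant(a, b) {
  return toIsoUtc(a) === toIsoUtc(b);
}
```

Under these assumptions, the two Chicago Tribune strings name the same instant and differ only in formatting, while the two Mashable strings are five hours apart, which suggests a timezone interpretation problem rather than a formatting one.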
Janet
aff651c2d8 feat: hellogiggles parser (#107)
Looks good to me!
2017-01-21 14:07:20 -05:00
Janet
11ad7b9a92 feat: thought catalog parser (#102)
Looks good!
2017-01-21 13:52:00 -05:00
Janet
aa43a6091c feat: cnbc parser (#96)
Should be good to go!
2017-01-21 13:25:23 -05:00
Janet
cd245f7980 feat: popsugar parser (#93)
I think this one is good to go!
2017-01-21 13:11:00 -05:00
Janet
a8ab7135e1 feat: observer parser (#91)
no problems
2017-01-21 12:47:26 -05:00
Janet
3bee7224cb feat: nbc news parser (#74) 2017-01-18 17:28:21 -08:00
Janet
88242dd233 feat: nj.com parser (#73) 2017-01-18 16:49:05 -08:00
Janet
1ac5670a54 feat: inquisitor parser (#72) 2017-01-18 16:34:22 -08:00
Janet
9e5b91ed8b feat: refinery29 parser (#71) 2016-12-21 21:57:13 -08:00
Janet
b78c58c43a feat: miami herald parser (#69) 2016-12-21 21:35:34 -08:00
Janet
aedf83edc6 feat: eonline parser (#68) 2016-12-21 21:24:14 -08:00
Janet
a20da5eb31 uproxx extractor (#66) 2016-12-21 21:05:10 -08:00
Janet
87c42b6358 feat: 247sports.com extractor (#64) 2016-12-21 20:52:23 -08:00
Janet
22e6c884fb feat: rolling stone extractor (#65) 2016-12-21 20:30:34 -08:00
Janet
6337231697 feat: usmagazine extractor (#63) 2016-12-21 20:06:47 -08:00
Janet
c06b19efe7 feat: people extractor (#70)
No major problems!
2016-12-21 19:46:48 -08:00
Janet
3cf2bb78c4 feat: vox custom parser (#67) 2016-12-15 17:48:15 -08:00
Janet
861c5f0dcb feat: bustle extractor (#60) 2016-12-08 15:32:08 -05:00
Adam Pash
06397a4360 feat: browser-friendly selector for medium (#61) 2016-12-07 17:58:29 -05:00
Adam Pash
3297ab079d feat: bloomberg extractor (#59)
Bloomberg has several templates. I'm supporting three different templates here, but I'm not sure that this is complete by any means.

It's also worth noting that SVGs don't make it through the parser terribly well for many reasons. One, for example, is that a lot of SVGs require custom CSS in order for them to make sense. I'm not sure this is something we can expect to address in the parser.
2016-12-07 14:39:00 -05:00
Janet
e55e9da534 feat: sbnation extractor (#55) 2016-12-07 14:25:57 -05:00
Adam Pash
8070e4790b test: streamlined guardian tests w/new single-extraction (#58) 2016-12-07 13:17:25 -05:00
Adam Pash
bdb751fb53 feat: more cleaning for wired (#56) 2016-12-07 12:15:39 -05:00
Janet
e7e41bd242 feat: the guardian custom extractor (#41) 2016-12-07 12:05:18 -05:00
Adam Pash
81aa89f2c1 feat: youtube custom extractor (#53) 2016-12-06 12:36:51 -05:00
Adam Pash
2fb47640f2 Feat: detect platforms (#52)
Detectors for matching extractors for publishing platforms. Currently supporting Medium and Blogger.
2016-12-06 12:17:03 -05:00
Adam Pash
15656cb3e1 Refactor: running tests more efficiently (#49)
Only running one parser per page we're testing rather than a parser per field we're testing.
2016-12-05 15:39:45 -05:00
Adam Pash
f9902cfa05 Fix: extension bugs (#47)
* feat: lead image on atlantic stories now included

* feat: supporting buzzfeed "longform" template

* feat: cleaning .parter-box from the atlantic
2016-12-02 16:02:00 -08:00
Adam Pash
16860f1d85 feat: improved nyt parser (#46)
NYT was one of the first, and its test was stale and it didn't have all
of its fields well defined.
2016-12-02 15:41:26 -08:00
Adam Pash
d0453efbf8 feat: improvements for nyer magazine articles (#45)
adds dek and date_published for magazine template
2016-12-02 15:30:09 -08:00
Adam Pash
00f8965c1f fix: cleaning up deks (#44)
We've solidified what we consider a dek. This PR removes the dek selectors that do not fit that mold.
2016-12-02 15:17:49 -08:00
Janet
b415d1d37c feat: aol custom extractor (#42)
* feat: aol custom parser

* removed work from other commits. merged with latest master
2016-12-01 17:05:15 -08:00
Matt
4cc3b68b5e feat: remove footer links (#40)
The links at the bottom of the stories feel a little spammy because of how we treat links versus how they are displayed on the Times, so we would like to clean them.
2016-12-01 08:31:43 -08:00
Adam Pash
ff1963bdca feat: new cleaner for wapo (#38) 2016-11-30 17:01:53 -08:00
Silas Burton
c3d98a0d76 Feat cnn extractor (#34)
* wip: cnn custom extractor

* wip: cnn works except first paragraph

* final touches on cnn parser

* cleanup
2016-11-30 14:55:04 -08:00
Silas Burton
a0570f8e94 feat: extractor for the verge (#33)
* feat: extractor for the verge's standard article template

* feat: basic support for the verge feature template

* feat: allow multiple links to be previewed

* feat: content selector arrays

Content selector arrays allow custom parsers to select multiple elements
to match and include in the result.

* feat: updated verge parser to use multimatch selectors

* lint fix

* cleanup test builds
2016-11-30 14:08:56 -08:00
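
The content selector arrays described in the message above can be pictured with a toy model. This is a sketch under stated assumptions, not the parser's implementation: `page` is a plain Map standing in for a parsed document, `selectContent` is a hypothetical function, the selectors and markup are invented, and requiring every selector in an array to match is an assumed rule.

```javascript
// Toy stand-in for a parsed page: selector -> matching HTML fragments.
// The selectors and markup are invented for this example.
const page = new Map([
  ['.hero-image', ['<img src="lead.jpg">']],
  ['.intro', ['<p>Intro</p>']],
  ['.body', ['<p>One</p>', '<p>Two</p>']],
]);

// Each entry in `selectors` is a single selector or an array of selectors.
// Assumed rule: an array matches only if every selector in it matches, and
// all matched fragments are included in the result, in selector order.
function selectContent(doc, selectors) {
  for (const entry of selectors) {
    const group = Array.isArray(entry) ? entry : [entry];
    const matches = group.map((selector) => doc.get(selector) || []);
    if (matches.every((found) => found.length > 0)) {
      return matches.flat().join('');
    }
  }
  return null; // nothing matched; a caller could fall back to another strategy
}
```

Calling `selectContent(page, [['.hero-image', '.intro', '.body'], '.body'])` would combine the hero image, intro, and body fragments, and would fall through to the plain `.body` entry only if one of the three failed to match.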
Adam Pash
233ca11a33 fix: added timezone to new republic date (#32) 2016-11-29 16:54:52 -08:00
Adam Pash
9a23b24a89 feat: adjustment for huffpo. skipping overly aggressive default cleaners (#30) 2016-11-29 16:16:39 -08:00
Silas Burton
be2e4b5c80 Feat: huffington post extractor (#28)
* wip: huffpo custom extractor

* wip: some huffpo cleanup
2016-11-29 15:50:48 -08:00
Adam Pash
94198c0a65 feat: new republic custom extractor (#25)
* wip: new republic custom extractor

* feat: new republic article extractor

* feat: new republic minutes article extractor
2016-11-29 15:30:52 -08:00
Janet
c4d72fb735 feat: add money.cnn custom parser (#26)
* feat: add money.cnn custom parser

* added timezone to cnn custom parser
2016-11-29 15:13:29 -08:00
Adam Pash
6343946dd8 Feat: custom timezones (#29)
* using moment-timezone to allow custom timezones

* added tz to tmz, even though still so-so
2016-11-29 14:46:46 -08:00
Adam Pash
a8face796a Fix extension bugs (#23)
* feat: cleaning supplemental elements in nytimes (visible in web only)

closes https://github.com/postlight/mercury-reader-chrome-extension/issues/102

* wip

* fix: more generous date published bits

* feat: added washington post extractor (including figure transforms)

closes https://github.com/postlight/mercury-reader-chrome-extension/issues/100

* feat: cleaning zoom lightbox from gizmodo/kinja

* lint fix
2016-11-28 16:58:21 -08:00
Adam Pash
3a2f32b0eb feat: added tmz custom parser (#22) 2016-11-28 15:10:28 -08:00
Adam Pash
c30fb2e4c0 chore: updated readme 2016-11-22 08:41:35 -08:00
Adam Pash
60a6861e18 Feat: browser support (#19)
Big undertaking to support Mercury in the browser. Builds are working and all tests are passing both for web and node builds. Most code is closely shared.
2016-11-21 14:17:06 -08:00
Adam Pash
048d654417 feat: parser auto-generates name; lint is more specific 2016-10-27 14:54:38 -07:00
Adam Pash
65c641a879 feat: enforcing line break rules in linter 2016-10-27 11:00:27 -07:00
Adam Pash
4d1d950807 updated generator templates for new style of import/export. also some
adjustments for usability
2016-10-27 10:44:06 -07:00
Adam Pash
7fa90f59b7 making all.js export a generic function to decrease possibility of error 2016-10-27 10:19:21 -07:00
Adam Pash
de5b120b79 feat: allowing extractors to support multiple domains 2016-10-27 09:20:53 -07:00
Adam Pash
d038a36544 feat: custom medium extractor 2016-10-27 08:47:25 -07:00
Adam Pash
b65b0c98b0 feat: supporting all GMG sites using DeadspinExtractor 2016-10-26 16:05:15 -07:00
Adam Pash
40768fa188 feat: support lazy loading video on deadspin 2016-10-26 11:53:42 -07:00
Adam Pash
c63f500433 fix: narrowed selector to fix blogspot title selector 2016-10-26 11:16:31 -07:00
Adam Pash
5c7f2cd28e fix: better selector for nytimes authors 2016-10-17 18:55:58 -07:00
Drew Bell
76db95e884 feat: Add custom extractor for Apartment Therapy 2016-10-17 10:35:22 -05:00
Drew Bell
a708ad3b4f feat: Add custom parser for broadwayworld.com 2016-10-13 16:22:33 -05:00
Adam Pash
896021227d feat: added deadspin custom parser 2016-10-13 13:46:36 -07:00
Adam Pash
422deb4600 feat: generator generates potential selectors for all custom selectable fields 2016-10-10 15:57:47 -07:00
Adam Pash
c314e3befa feat: dek returns null if it's basically the same as the excerpt
Squashed commit of the following:

commit 0ee7d51ce609ad23d2deca1af41e7b4e56681bd7
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 15:44:28 2016 -0700

    feat: dek does not return if it's basically the same as the excerpt

commit 6ad27f994fff3652e04ffe7c81f1ae0b1647e941
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 14:35:54 2016 -0700

    feat: added excerpt util
2016-10-10 15:44:58 -07:00
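
The rule in the commit above, where a dek is dropped when it is basically the same as the excerpt, could be sketched as follows. `leadingWords` and `cleanDek` are hypothetical names, and the ten-word window and the lowercase/whitespace normalization are assumptions chosen for illustration, not details taken from the commit.

```javascript
// Hypothetical helpers; the word count and normalization are assumptions.

// Collapse whitespace, lowercase, and keep the first `words` words.
function leadingWords(text, words = 10) {
  return (text || '').trim().toLowerCase().split(/\s+/).slice(0, words).join(' ');
}

// Return null when the dek is empty or opens the same way the excerpt does;
// otherwise return the dek unchanged.
function cleanDek(dek, excerpt) {
  if (!dek) return null;
  return leadingWords(dek) === leadingWords(excerpt) ? null : dek;
}
```

Comparing only the opening words keeps the check cheap and tolerant of an excerpt that runs longer than the dek, at the cost of occasionally dropping a dek that merely starts like the article body.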
Adam Pash
63c06c8a00 fix: babel-polyfill mess (I think) 2016-10-10 14:16:14 -07:00
Adam Pash
eb0aa0b1f6 feat: some small tweaks to toy's excellent parsers ☺️
Squashed commit of the following:

commit 9638220124a325322d6cda7d16c645185d5fe827
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 11:02:29 2016 -0700

    fix: removed eslint plugin that was adding unneeded async parens

commit ce2268c0f7c1b093c06f156730a0f1bc2aaba39c
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 10:47:36 2016 -0700

    style: fix async in parens

commit 9591856915eddaf93170da1ce9225b8a378bdf55
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 10:37:11 2016 -0700

    fix: remove parens around async

commit 6c56054717acc1f7e5499691780f8273f6d07bac
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 10:35:50 2016 -0700

    fix msn fixture; adjusted yahoo test

commit 4fc117ad5fdc5528f29b0873d60a6a1709642f15
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 10:14:38 2016 -0700

    removed dek and date_published tests; neither exists in littlethings

commit 401094b4abc52901255fd2461f5839624f11d8a3
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 10:08:44 2016 -0700

    feat: updated buzzfeed for content extraction

commit 19548a5485f70ff9b65e3e725d2364d07734ac9c
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 09:54:30 2016 -0700

    fix: generator should make transforms an object, not array

commit b92113f9f7c97aca9e6d3ce9243abac967d26b63
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 08:54:38 2016 -0700

    feat: updated politico

commit c026591040f7671cb2a6dd5177a995e21d015482
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 08:48:52 2016 -0700

    fix: typos

commit 14aa8fa4ce38ff1c2a212cd0225437ae3042c2c3
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 08:36:12 2016 -0700

    fix: incorrect command in readme

commit fe260e6122877e2cb0130a1ecde0e503017057a3
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 08:31:11 2016 -0700

    fix: removed dek test because there is no dek on wikia
2016-10-10 11:03:43 -07:00
Toy Vano
3c99404566 Merge pull request #11 from postlight/feat-politico-extractor
feat: added politico extractor
2016-10-05 13:52:57 -04:00
Toy Vano
e766494922 feat: added politico extractor 2016-10-05 13:51:11 -04:00
Toy Vano
dd20e56bc5 Merge pull request #10 from postlight/feat-littlethings-extractor
feat: added littlethings extractor
2016-10-04 15:04:28 -04:00
Toy Vano
fd1ac3f2b9 feat: added littlethings extractor 2016-10-04 15:02:23 -04:00
Toy Vano
6c18551ed0 Merge pull request #9 from postlight/feat-wikia-extractor
feat: added wikia extractor
2016-10-04 12:11:29 -04:00
Toy Vano
017b9dfcc2 Merge pull request #8 from postlight/feat-buzzfeed-extractor
feat: added incomplete buzzfeed extractor
2016-10-04 12:11:22 -04:00
Toy Vano
bdf66314ea Merge pull request #7 from postlight/feat-yahoo-extractor
feat: added incomplete yahoo extractor
2016-10-04 12:11:15 -04:00
Toy Vano
b0e1a873c0 Merge pull request #6 from postlight/feat-msn-extractor
feat: added incomplete msn extractor
2016-10-04 12:11:07 -04:00
Toy Vano
1519eed3e5 feat: added wikia extractor 2016-10-04 12:06:19 -04:00
Toy Vano
9416ec73a4 feat: added incomplete buzzfeed extractor 2016-10-04 11:28:01 -04:00
Toy Vano
c6c35bd237 feat: added incomplete yahoo extractor 2016-10-03 17:48:11 -04:00
Toy Vano
320c740676 feat: added incomplete msn extractor 2016-10-03 13:27:51 -04:00
Adam Pash
e3ee5e93bf chore: small doc fixes 2016-09-30 15:01:48 -04:00
Toy Vano
7ecc696248 feat: added wired custom extractor 2016-09-30 14:32:28 -04:00
Adam Pash
20b7c5a8b6 chore: fix a few typos/links 2016-09-30 12:46:46 -04:00
Adam Pash
173f885674 feat: custom parser + generator + detailed readme instructions
Squashed commit of the following:

commit 02563daa67712c3679258ebebac60dfa9568dffb
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 30 12:25:44 2016 -0400

    updated readme, added newyorker parser for readme guide

commit 0ac613ef823efbffbf4cc9a89e5cb2489d1c4f6f
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 30 11:16:52 2016 -0400

    feat: updated parser so the saved fixture absolutizes urls

commit 85c7a2660b21f95c2205ca4a4378a7570687fed0
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 30 10:15:26 2016 -0400

    refactor: attribute selectors must be an array for custom extractors

commit f60f93d5d3d9b2f2d9ec6f28d27ae9dcf16ef01e
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 29 10:13:14 2016 -0400

    fix: whitelisting srcset and alt attributes

commit e31cb1f4e8a9fc9c3d9b20ef9f40ca6c8d6ad51a
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 29 09:44:21 2016 -0400

    some housekeeping for coverage tests

commit 39eafe420c776a1fe7f9fea634fb529a3ed75a71
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Sep 28 17:52:08 2016 -0400

    fix: word count for multi-page articles

commit b04e0066b52f190481b1b604c64e3d0b1226ff02
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 22 10:40:23 2016 -0400

    major improvements to output

commit 3f3a880b63b47fe21953485da670b6e291ac60e5
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Sep 21 17:27:53 2016 -0400

    updated test command

commit 14503426557a870755453572221d95c92cff4bd2
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Sep 21 16:00:30 2016 -0400

    shortened generator command

commit 5ebd8343cd4b87b3f5787dab665bff0de96846e1
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Sep 21 15:59:14 2016 -0400

    feat: can disable fallback to generic parser (this will be useful for testing custom parsers)
2016-09-30 12:26:25 -04:00
Adam Pash
39a3c0690d chore: readme improvement 2016-09-21 15:00:00 -04:00
Adam Pash
ef047107ea feat: content cleaner still runs, but can disable some cleaners 2016-09-21 14:38:03 -04:00
Adam Pash
75b1880f01 chore: cleaned up unused files, slight reorg 2016-09-20 11:08:02 -04:00
Adam Pash
ad42055f8f feat: switched test framework to jest 2016-09-20 10:52:16 -04:00
Adam Pash
8f42e119e8 feat: generator for custom parsers and some documentation
Squashed commit of the following:

commit deaf9e60d031d9ee06e74b8c0895495b187032a5
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 20 10:31:09 2016 -0400

    chore: README for custom parsers

commit a8e8ad633e0d1576a52dbc90ce31b98fb2ec21ee
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 19 23:36:09 2016 -0400

    draft of readme

commit 4f0f463f821465c282ce006378e5d55f8f41df5f
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 19 17:56:34 2016 -0400

    custom extractor used to build basic parser for theatlantic

commit c5562a3cede41f56c4e723dcfa1181b49dcaae4d
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 19 17:20:13 2016 -0400

    pre-commit to test custom parser generator

commit 7d50d5b7ab780b79fae38afcb87a7d1da5d139b2
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 19 17:19:55 2016 -0400

    feat: added nytimes parser

commit 58b8d83a56927177984ddfdf70830bc4f328f200
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 19 17:17:28 2016 -0400

    feat: can do fuzzy search or go straight to file

commit c99add753723a8e2ac64d51d7379ac8e23125526
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 19 10:52:26 2016 -0400

    refactored export for custom extractors for easier renames

commit 22563413669651bb497f1bb2a92085b71f2ae324
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 16 17:36:13 2016 -0400

    feat: custom extractor generation in place

commit 2285a29908a7f82a5de3c81f6b2b902ddec9bdaa
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 16 16:42:20 2016 -0400

    good progress
2016-09-20 10:37:03 -04:00
Adam Pash
7ade83692a feat: improve wikipedia parser 2016-09-16 13:59:05 -04:00
Adam Pash
2ae2dba690 chore: renamed iris to mercury 2016-09-16 13:26:37 -04:00
Adam Pash
005ba47f6f fix: wikipedia transform only grabs one image from .infobox 2016-09-16 13:17:21 -04:00
Adam Pash
cbd0636dcf chore: cleaned up python and other unneeded comments 2016-09-16 11:21:23 -04:00
Adam Pash
bf13b38a9b feat: some basic error handling for bad urls 2016-09-15 17:41:29 -04:00
Adam Pash
ffaf7db0f1 fix: some improvements to date parsing. punting on localization issues 2016-09-15 16:57:14 -04:00
Adam Pash
396313aeae feat: added twitter custom extractor
Squashed commit of the following:

commit 8116f14364869b72a8afabfcb44b2ac154caed96
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 15 16:27:27 2016 -0400

    feat: added twitter custom extractor

commit e478eb1b0bcdcb65fdd5fa64e37be92b6defd702
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 15 16:22:54 2016 -0400

    fix: made custom extractors and cleaners adhere to underscore keys
2016-09-15 16:27:46 -04:00
Adam Pash
d60d396c98 feat: added text direction to response 2016-09-15 15:08:04 -04:00