Commit Graph

380 Commits (fix-remove-moment-js)

Author SHA1 Message Date
Janet 2af0f6179a feat: rawstory parser (#109)
* feat: rawstory parser

Finished, with a little help from Frankie (thanks Frankie!)

* fix: date_published timezone
7 years ago
Janet 765032452d feat: thefederalistpapers parser (#101)
* feat: thefederalistpapers parser
7 years ago
Janet fb5eb2e104 feat: cnet parser (#104)
* feat: cnet parser

Date test fail - please take a look!

Also, image didn’t load in preview.

* fix: timezone

* fix: image lead
7 years ago
Janet 3c5fa28f10 feat: cbs sports parser (#98)
* feat: cbs sports parser
7 years ago
Janet 3cf2d0d3ef feat: msnbc parser (#100)
* feat: msnbc parser
7 years ago
Janet f9ab9eb885 feat: howtogeek extractor (#108)
* feat: howtogeek extractor

This one is a bit tricky - the author and date info appear in a comment
section at the bottom. Was able to parse the author but not the date
info. Halp

* howtogeek update

Thanks to @fdsimms I was able to parse the date, but not sure what to
test it against, so I left it blank.

* fix: date_published assertion, it was comparing against empty string

* fix: timezone

* amend: generalize author selector
7 years ago
Janet 258acdfd02 feat: opposing views parser (#103)
* feat: opposing views parser
7 years ago
Janet b63dd33579 feat: today parser (#106)
* feat: today parser

This looks fine — there are a couple of lines of “Related” but they are
within the body (and don’t have their own classes) so I couldn’t clean
them out.

* fix: fix content assertion
7 years ago
Janet c94eee7f92 feat: cinema blend parser (#105)
* feat: cinema blend parser

all systems go

* fix: timezone
7 years ago
Janet 64e3c205e8 feat: the political insider parser (#99)
* feat: the political insider parser with timezone
7 years ago
Janet 7b52d3d1fc feat: al.com parser (#110)
* feat: al.com parser

I think this is good but could you pls double check time zone on the
date? Thanks

* fix: date_published timezone
7 years ago
Janet 15df58496f feat: westernjournalism parser (#113)
* feat: westernjournalism parser

Adjacent sibling selector FTW!

Image not displaying in preview.

* feat: fix assertion, body does not include _Advertisement_ subtext
7 years ago
Janet ae12a1d701 feat: mental floss parser (#94)
* feat: mental floss parser
7 years ago
Janet bf29291395 feat: thepennyhoarder parser (#112)
* feat: thepennyhoarder parser

Looks good, although no image in preview!

* fix: adds selector for article lead image
7 years ago
Janet fadd198d04 feat: abcnewsgo parser (#90)
* feat: abcnewsgo parser
7 years ago
Adam Pash 25d9642ff9 feat: support cleaning and transforms for all fields (#138) 7 years ago
Janet 1054d854dd feat: america now parser (#114)
* feat: america now parser

Looks good but lead image did not display in preview.

* feat: adds selector for lead image
7 years ago
dviramontes 93c8ba0e56 feat: adds selector for lead image 7 years ago
dviramontes f71fe7685d feat: adds video embed transform 7 years ago
dviramontes a77515d861 fix: author selector, less brittle 7 years ago
Janet 4c48acba59 feat: fusion parser
Looks okay — image did not load in preview.
7 years ago
David A. Viramontes c679e493de Merge branch 'master' into feat-the-verge-polygon-supported-domain 7 years ago
Janet d292d8ef3a feat: ny daily news parser (#87)
* feat: ny daily news parser
7 years ago
dviramontes a53587acef feat: adds www.polygon.com to list of www.theverge.com supportedDomains 7 years ago
Janet 385b9d76a3 feat: sciencefly extractor (#116)
* feat: sciencefly extractor, use loading image rather than 404'ing meta
7 years ago
Adam Pash 6bd6278a07 feat: custom parser for wh blog (#130) 7 years ago
Adam Pash aa682d71e8 fix: medium bug (#129)
* fix: improved medium parser for images and multi-section content

* fix: duplicate video
7 years ago
Adam Pash 31eb4f9222 Feat: LinkedIn parser (#123)
* feat: rebuild custom parser

* feat: linkedin custom parser
7 years ago
Adam Pash 8662474d8a feat: changed user agent to latest chrome (#121)
* feat: changed user agent to latest chrome

* removed dead link
7 years ago
Janet 7709d69379 feat: npr parser (#86)
* feat: npr parser

Lead image appears in preview, but the test fails for some reason.

AssertionError: null ==
'https://media.npr.org/assets/img/2016/12/15/gettyimages-540681598_wide-8b160732b96c083dc115134c3c019f3ac73586ba.jpg?s=1400'

Looks okay otherwise.

* feat: transformed figures/figcaptions, improved date_published and
addressed NPR's bad image metadata
7 years ago
Janet 8a82f2c0ab feat: recode parser (#85)
* feat: recode parser

Thumbs up, as far as I can tell.

Note: No image appeared in the preview.

* feat: pulling in lead image
7 years ago
Janet ad29acd7b7 feat: fortune parser (#84)
* feat: fortune parser

For some reason, the dek doesn’t appear in the local version of the
article I selected. I tried parsing the meta tag containing
og:description but it’s not working, and the description is slightly
longer than the dek in the original article.

I’m not sure why, but for the lead image, the meta tag for og:image is
not parsing the image url.

:(

* feat: fortune redesigned, so re-did extractor

* fix: added timezone
7 years ago
Janet c133ddf614 feat: qz parser (#81)
* feat: qz parser

I couldn’t figure out how to parse the date, but otherwise should be
fine. I added a clean for the div.article-aside element based on what I
saw in how the chrome extension worked.

* feat: updated content to grab top image

test: date is null :/
7 years ago
Janet 84312b6ef1 feat: dmagazine parser (#80)
* feat: dmagazine parser

I’m sorry to have failed you. :-( These are the issues I encountered:

1) author - does not have a unique selector to distinguish it from the
date, couldn’t parse it
2) date - no meta data in the head
3) no meta og:image in the head (my go to), so I couldn’t get the image
test to pass, but it appears to be parsing. The caption below it is the
same size as the body copy in the preview. I couldn’t figure out how to
“transform” it to caption size.

* feat: update date, image, and author selectors and corresponding tests

* feat: generalized content selector
7 years ago
Janet e035f36361 feat: reuters parser (#78)
* feat: reuters parser

Date parses correctly but fails test because of format discrepancy.

Author tags are nested within the content, which is why the author
names are appearing twice. I wasn’t sure how to address this.

Additionally, the location appears twice, so I cleaned the location
tags from the content.

* test: fix date format

* transform .article-subtitle to h4; cleaning author but leaving location
7 years ago
Janet dec49ab073 feat: mashable parser (#76)
* feat: mashable parser

As usual the date is giving me issues because of formatting
discrepancies:
AssertionError: '2016-12-13T22:33:06.000Z' == '2016-12-14T03:33:06.000Z'

Not sure how we wanna deal with Twitter card embeds that don’t show up?

Also, image credits did not show up in preview.

* test: fixed date format

* transforming .image-credit to figcaption
7 years ago
Janet cddc1afb69 feat: chicago tribune parser (#75)
* feat: chicago tribune parser

Date is parsing but failing the test because:
AssertionError: '2016-12-13T21:45:00.000Z' == '2016-12-13T13:45:00-0800'

I tried to insert a line of code for Time Zone but I’m a n00b so I
don’t think I did it right.

No image showing up in the preview.

* fix: remove timezone from date_published extractor

* test: update unit tests to assert the correct value for date_published
7 years ago
Janet aff651c2d8 feat: hellogiggles parser (#107)
Looks good to me!
7 years ago
Janet 11ad7b9a92 feat: thought catalog parser (#102)
Looks good!
7 years ago
Janet aa43a6091c feat: cnbc parser (#96)
Should be good to go!
7 years ago
Janet cd245f7980 feat: popsugar parser (#93)
I think this one is good to go!
7 years ago
Janet a8ab7135e1 feat: observer parser (#91)
no problems
7 years ago
Janet 3bee7224cb feat: nbc news parser (#74) 7 years ago
Janet 88242dd233 feat: nj.com parser (#73) 7 years ago
Janet 1ac5670a54 feat: inquisitor parser (#72) 7 years ago
Janet 9e5b91ed8b feat: refinery29 parser (#71) 8 years ago
Janet b78c58c43a feat: miami herald parser (#69) 8 years ago
Janet aedf83edc6 feat: eonline parser (#68) 8 years ago
Janet a20da5eb31 uproxx extractor (#66) 8 years ago
Janet 87c42b6358 feat: 247sports.com extractor (#64) 8 years ago
Janet 22e6c884fb feat: rolling stone extractor (#65) 8 years ago
Janet 6337231697 feat: usmagazine extractor (#63) 8 years ago
Janet c06b19efe7 feat: people extractor (#70)
No major problems!
8 years ago
Janet 3cf2bb78c4 feat: vox custom parser (#67) 8 years ago
Janet 861c5f0dcb feat: bustle extractor (#60) 8 years ago
Adam Pash 06397a4360 feat: browser-friendly selector for medium (#61) 8 years ago
Adam Pash 3297ab079d feat: bloomberg extractor (#59)
Bloomberg has several templates. I'm supporting three different templates here, but I'm not sure that this is complete by any means.

It's also worth noting that SVGs don't make it through the parser terribly well for many reasons. One, for example, is that a lot of SVGs require custom CSS in order for them to make sense. I'm not sure this is something we can expect to address in the parser.
8 years ago
Janet e55e9da534 feat: sbnation extractor (#55) 8 years ago
Adam Pash 8070e4790b test: streamlined guardian tests w/new single-extraction (#58) 8 years ago
Adam Pash bdb751fb53 feat: more cleaning for wired (#56) 8 years ago
Janet e7e41bd242 feat: the guardian custom extractor (#41) 8 years ago
Adam Pash 81aa89f2c1 feat: youtube custom extractor (#53) 8 years ago
Adam Pash 2fb47640f2 Feat: detect platforms (#52)
Detectors for matching extractors for publishing platforms. Currently supporting Medium and Blogger.
8 years ago
Adam Pash 64c0fad2fd fix: preserve whitespace (#51)
No longer normalizing whitespace in html
8 years ago
Adam Pash 15656cb3e1 Refactor: running tests more efficiently (#49)
Only running one parser per page we're testing rather than a parser per field we're testing.
8 years ago
Adam Pash f9902cfa05 Fix: extension bugs (#47)
* feat: lead image on atlantic stories now included

* feat: supporting buzzfeed "longform" template

* feat: cleaning .parter-box from the atlantic
8 years ago
Adam Pash 16860f1d85 feat: improved nyt parser (#46)
NYT was one of the first, and its test was stale and it didn't have all
of its fields well defined.
8 years ago
Adam Pash d0453efbf8 feat: improvements for nyer magazine articles (#45)
adds dek and date_published for magazine template
8 years ago
Adam Pash 00f8965c1f fix: cleaning up deks (#44)
We've solidified what we consider a dek. This PR removes the dek selectors that do not fit that mold.
8 years ago
Janet b415d1d37c feat: aol custom extractor (#42)
* feat: aol custom parser

* removed work from other commits. merged with latest master
8 years ago
Matt 4cc3b68b5e feat: remove footer links (#40)
the links at the bottom of the stories feel a little spammy because of how we treat links vs. the way they are displayed on the Times, would like to clean them
8 years ago
Adam Pash ff1963bdca feat: new cleaner for wapo (#38) 8 years ago
Adam Pash 0e6ccdf622 fix: browser cleanup (#35)
Cleaning up after the parser when it's done in the browser, before
returning result.
8 years ago
Silas Burton c3d98a0d76 Feat cnn extractor (#34)
* wip: cnn custom extractor

* wip: cnn works except first paragraph

* final touches on cnn parser

* cleanup
8 years ago
Silas Burton a0570f8e94 feat: extractor for the verge (#33)
* feat: extractor for the verge's standard article template

* feat: basic support for the verge feature template

* feat: allow multiple links to be previewed

* feat: content selector arrays

Content selector arrays allow custom parsers to select multiple elements
to match and include in the result.

* feat: updated verge parser to use multimatch selectors

* lint fix

* cleanup test builds
8 years ago
Adam Pash 233ca11a33 fix: added timezone to new republic date (#32) 8 years ago
Adam Pash cfe7f34be4 fix: normalizing spaces for authors/dek/title (#31)
* fix: normalizing spaces for authors/dek/title
8 years ago
Adam Pash 9a23b24a89 feat: adjustment for huffpo. skipping overly aggressive default cleaners (#30) 8 years ago
Silas Burton be2e4b5c80 Feat: huffington post extractor (#28)
* wip: huffpo custom extractor

* wip: some huffpo cleanup
8 years ago
Adam Pash 94198c0a65 feat: new republic custom extractor (#25)
* wip: new republic custom extractor

* feat: new republic article extractor

* feat: new republic minutes article extractor
8 years ago
Janet c4d72fb735 feat: add money.cnn custom parser (#26)
* feat: add money.cnn custom parser

* added timezone to cnn custom parser
8 years ago
Adam Pash 6343946dd8 Feat: custom timezones (#29)
* using moment-timezone to allow custom timezones

* added tz to tmz, even though still so-so
8 years ago
Adam Pash a8face796a Fix extension bugs (#23)
* feat: cleaning supplemental elements in nytimes (visible in web only)

closes https://github.com/postlight/mercury-reader-chrome-extension/issues/102

* wip

* fix: more generous date published bits

* feat: added washington post extractor (including figure transforms)

closes https://github.com/postlight/mercury-reader-chrome-extension/issues/100

* feat: cleaning zoom lightbox from gizmodo/kinja

* lint fix
8 years ago
Adam Pash 3a2f32b0eb feat: added tmz custom parser (#22) 8 years ago
Adam Pash 783a9cfb2f fix: changed overly liberal regex for removing transparent images 8 years ago
Adam Pash 7411922c55 feat: encoding response body based on content-type charset (#21)
Also some small code organization
8 years ago
Adam Pash c30fb2e4c0 chore: updated readme 8 years ago
Adam Pash 60a6861e18 Feat: browser support (#19)
Big undertaking to support Mercury in the browser. Builds are working and all tests are passing both for web and node builds. Most code is closely shared.
8 years ago
Adam Pash eaea57461a fix: servers returning bad headers was breaking request. temporarily (#20)
using fork with a fix for this until request merges the necessary pull request
8 years ago
Adam Pash 629eada1f7 feat: recording/playing back network requests with nock (#18)
* feat: recording/playing back network requests with nock

* lint fix
8 years ago
Adam Pash e325d860fd Feat: improving ci (#16)
This commit also swaps in yarn for npm and tweaks circle ci a bit.

* appveyor.yml first go

* changing node

* ps

* narrow it down

* trying this

* fix airbnb module

* trying with yarn

* logging

* hybrid?

* trying yarn w/circle

* bump workers?

* build off?

* updating script

* tweaking script for appveyor

* bumping maxworkers

* cleaning up

* build step?

* yarn it

* added appveyor badge
8 years ago
Adam Pash 048d654417 feat: parser auto-generates name; lint is more specific 8 years ago
Adam Pash 65c641a879 feat: enforcing line break rules in linter 8 years ago
Adam Pash 4d1d950807 updated generator templates for new style of import/export. also some
adjustments for usability
8 years ago
Adam Pash 7fa90f59b7 making all.js export a generic function to decrease possibility of error 8 years ago
Adam Pash de5b120b79 feat: allowing extractors to support multiple domains 8 years ago
Adam Pash d038a36544 feat: custom medium extractor 8 years ago
Adam Pash 007ddec8ac feat: allowing iframes from src domain 8 years ago
Adam Pash b65b0c98b0 feat: supporting all GMG sites using DeadspinExtractor 8 years ago
Adam Pash 17317823de fix: bug that stopped proper attr cleaning in certain cases 8 years ago
Adam Pash 40768fa188 feat: support lazy loading video on deadspin 8 years ago
Adam Pash 38c90d239e fix: removeEmpty shouldn't remove elements with images or iframes inside 8 years ago
Adam Pash c63f500433 fix: narrowed selector to fix blogspot title selector 8 years ago
Adam Pash d3b11be473 feat: keeping youtube and vimeo iframe embeds (#14)
* feat: keeping youtube and vimeo iframe embeds

* fix: removing class from article correctly
8 years ago
Adam Pash 5c7f2cd28e fix: better selector for nytimes authors 8 years ago
Adam Pash 3b87b557be feat: pulling score from whitelist 8 years ago
Drew Bell 76db95e884 feat: Add custom extractor for Apartment Therapy 8 years ago
Drew Bell a708ad3b4f feat: Add custom parser for broadwayworld.com 8 years ago
Adam Pash 896021227d feat: added deadspin custom parser 8 years ago
Adam Pash 422deb4600 feat: generator generates potential selectors for all custom selectable fields 8 years ago
Adam Pash c314e3befa feat: dek returns null if it's basically the same as the excerpt
Squashed commit of the following:

commit 0ee7d51ce609ad23d2deca1af41e7b4e56681bd7
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 15:44:28 2016 -0700

    feat: dek does not return if it's basically the same as the excerpt

commit 6ad27f994fff3652e04ffe7c81f1ae0b1647e941
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 14:35:54 2016 -0700

    feat: added excerpt util
8 years ago
Adam Pash 63c06c8a00 fix: babel-polyfill mess (I think) 8 years ago
Adam Pash eb0aa0b1f6 feat: some small tweaks to toy's excellent parsers ☺️
Squashed commit of the following:

commit 9638220124a325322d6cda7d16c645185d5fe827
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 11:02:29 2016 -0700

    fix: removed eslint plugin that was adding unneeded async parens

commit ce2268c0f7c1b093c06f156730a0f1bc2aaba39c
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 10:47:36 2016 -0700

    style: fix async in parens

commit 9591856915eddaf93170da1ce9225b8a378bdf55
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 10:37:11 2016 -0700

    fix: remove parens around async

commit 6c56054717acc1f7e5499691780f8273f6d07bac
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 10:35:50 2016 -0700

    fix msn fixture; adjusted yahoo test

commit 4fc117ad5fdc5528f29b0873d60a6a1709642f15
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 10:14:38 2016 -0700

    removed dek and date_published tests; neither exist in littlethings

commit 401094b4abc52901255fd2461f5839624f11d8a3
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 10:08:44 2016 -0700

    feat: updated buzzfeed for content extraction

commit 19548a5485f70ff9b65e3e725d2364d07734ac9c
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 09:54:30 2016 -0700

    fix: generator should make transforms an object, not array

commit b92113f9f7c97aca9e6d3ce9243abac967d26b63
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 08:54:38 2016 -0700

    feat: updated politico

commit c026591040f7671cb2a6dd5177a995e21d015482
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 08:48:52 2016 -0700

    fix: typos

commit 14aa8fa4ce38ff1c2a212cd0225437ae3042c2c3
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 08:36:12 2016 -0700

    fix: incorrect command in readme

commit fe260e6122877e2cb0130a1ecde0e503017057a3
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 08:31:11 2016 -0700

    fix: removed dek test because there is no dek on wikia
8 years ago
Toy Vano 3c99404566 Merge pull request #11 from postlight/feat-politico-extractor
feat: added politico extractor
8 years ago
Toy Vano e766494922 feat: added politico extractor 8 years ago
Toy Vano dd20e56bc5 Merge pull request #10 from postlight/feat-littlethings-extractor
feat: added littlethings extractor
8 years ago
Toy Vano fd1ac3f2b9 feat: added littlethings extractor 8 years ago
Toy Vano 6c18551ed0 Merge pull request #9 from postlight/feat-wikia-extractor
feat: added wikia extractor
8 years ago
Toy Vano 017b9dfcc2 Merge pull request #8 from postlight/feat-buzzfeed-extractor
feat: added incomplete buzzfeed extractor
8 years ago
Toy Vano bdf66314ea Merge pull request #7 from postlight/feat-yahoo-extractor
feat: added incomplete yahoo extractor
8 years ago
Toy Vano b0e1a873c0 Merge pull request #6 from postlight/feat-msn-extractor
feat: added incomplete msn extractor
8 years ago
Toy Vano 1519eed3e5 feat: added wikia extractor 8 years ago
Toy Vano 9416ec73a4 feat: added incomplete buzzfeed extractor 8 years ago
Toy Vano c6c35bd237 feat: added incomplete yahoo extractor 8 years ago
Toy Vano 320c740676 feat: added incomplete msn extractor 8 years ago
Adam Pash e3ee5e93bf chore: small doc fixes 8 years ago
Toy Vano 7ecc696248 feat: added wired custom extractor 8 years ago
Adam Pash 20b7c5a8b6 chore: fix a few typos/links 8 years ago
Adam Pash 173f885674 feat: custom parser + generator + detailed readme instructions
Squashed commit of the following:

commit 02563daa67712c3679258ebebac60dfa9568dffb
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 30 12:25:44 2016 -0400

    updated readme, added newyorker parser for readme guide

commit 0ac613ef823efbffbf4cc9a89e5cb2489d1c4f6f
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 30 11:16:52 2016 -0400

    feat: updated parser so the saved fixture absolutizes urls

commit 85c7a2660b21f95c2205ca4a4378a7570687fed0
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 30 10:15:26 2016 -0400

    refactor: attribute selectors must be an array for custom extractors

commit f60f93d5d3d9b2f2d9ec6f28d27ae9dcf16ef01e
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 29 10:13:14 2016 -0400

    fix: whitelisting srcset and alt attributes

commit e31cb1f4e8a9fc9c3d9b20ef9f40ca6c8d6ad51a
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 29 09:44:21 2016 -0400

    some housekeeping for coverage tests

commit 39eafe420c776a1fe7f9fea634fb529a3ed75a71
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Sep 28 17:52:08 2016 -0400

    fix: word count for multi-page articles

commit b04e0066b52f190481b1b604c64e3d0b1226ff02
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 22 10:40:23 2016 -0400

    major improvements to output

commit 3f3a880b63b47fe21953485da670b6e291ac60e5
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Sep 21 17:27:53 2016 -0400

    updated test command

commit 14503426557a870755453572221d95c92cff4bd2
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Sep 21 16:00:30 2016 -0400

    shortened generator command

commit 5ebd8343cd4b87b3f5787dab665bff0de96846e1
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Sep 21 15:59:14 2016 -0400

    feat: can disable fallback to generic parser (this will be useful for testing custom parsers)
8 years ago
Adam Pash 39a3c0690d chore: readme improvement 8 years ago
Adam Pash ef047107ea feat: content cleaner still runs, but can disable some cleaners 8 years ago
Adam Pash 75b1880f01 chore: cleaned up unused files, slight reorg 8 years ago
Adam Pash ad42055f8f feat: switched test framework to jest 8 years ago
Adam Pash 8f42e119e8 feat: generator for custom parsers and some documentation
Squashed commit of the following:

commit deaf9e60d031d9ee06e74b8c0895495b187032a5
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 20 10:31:09 2016 -0400

    chore: README for custom parsers

commit a8e8ad633e0d1576a52dbc90ce31b98fb2ec21ee
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 19 23:36:09 2016 -0400

    draft of readme

commit 4f0f463f821465c282ce006378e5d55f8f41df5f
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 19 17:56:34 2016 -0400

    custom extractor used to build basic parser for theatlantic

commit c5562a3cede41f56c4e723dcfa1181b49dcaae4d
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 19 17:20:13 2016 -0400

    pre-commit to test custom parser generator

commit 7d50d5b7ab780b79fae38afcb87a7d1da5d139b2
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 19 17:19:55 2016 -0400

    feat: added nytimes parser

commit 58b8d83a56927177984ddfdf70830bc4f328f200
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 19 17:17:28 2016 -0400

    feat: can do fuzzy search or go straight to file

commit c99add753723a8e2ac64d51d7379ac8e23125526
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 19 10:52:26 2016 -0400

    refactored export for custom extractors for easier renames

commit 22563413669651bb497f1bb2a92085b71f2ae324
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 16 17:36:13 2016 -0400

    feat: custom extractor generation in place

commit 2285a29908a7f82a5de3c81f6b2b902ddec9bdaa
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 16 16:42:20 2016 -0400

    good progress
8 years ago
Adam Pash 7ade83692a feat: improve wikipedia parser 8 years ago
Adam Pash 2ae2dba690 chore: renamed iris to mercury 8 years ago
Adam Pash 005ba47f6f fix: wikipedia transform only grabs one image from .infobox 8 years ago
Adam Pash 8dc6042dc9 build for comparisons 8 years ago
Adam Pash cbd0636dcf chore: cleaned up python and other unneeded comments 8 years ago
Adam Pash bf13b38a9b feat: some basic error handling for bad urls 8 years ago
Adam Pash ffaf7db0f1 fix: some improvements to date parsing. punting on localization issues 8 years ago
Adam Pash 396313aeae feat: added twitter custom extractor
Squashed commit of the following:

commit 8116f14364869b72a8afabfcb44b2ac154caed96
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 15 16:27:27 2016 -0400

    feat: added twitter custom extractor

commit e478eb1b0bcdcb65fdd5fa64e37be92b6defd702
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 15 16:22:54 2016 -0400

    fix: made custom extractors and cleaners adhere to underscore keys
8 years ago
Adam Pash d60d396c98 feat: added text direction to response 8 years ago
Adam Pash f0f216c7b9 feat: add option to allow custom extractors to skip default cleaners 8 years ago
Adam Pash 97a0728ecf test: added sanity test for get-extractor 8 years ago
Adam Pash 7c375aded7 chore: cleanup 8 years ago
Adam Pash 4cdc4165d6 fix: encodeURI before fetching 8 years ago
Adam Pash 1343469b6c fix: explicit/better decoding of gzipped content 8 years ago
Adam Pash c338098f21 refactor: renamed child to sibling for clarity 8 years ago
Adam Pash 6263e505d5 fix: handling case where node.get(0) returns null 8 years ago
Adam Pash 3b36a33e36 chore: change result keys to match python api 8 years ago
Adam Pash cc060b794d fix: wordcount calling excerpt 8 years ago
Adam Pash 7fc1f7f6bb checking in dist 8 years ago
Adam Pash daa9266182 feat: generic extractor for word count
Squashed commit of the following:

commit 0aba26ef9efba71a72c76fa351a9037e97fc1e9e
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Sep 14 14:56:45 2016 -0400

    fix: normalizeSpaces regex fix broke a test

commit 07d60c1c8c6599d6c94d92e5a70649c28d03d6ea
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Sep 14 14:52:41 2016 -0400

    feat: generic extractor for word count
8 years ago
Adam Pash 76df30e303 chore: cleanup 8 years ago
Adam Pash b3481a2c45 feat: generic excerpt extraction 8 years ago
Adam Pash 457075889d fix: selection should not be empty 8 years ago
Adam Pash 81ed4f00ed feat: improve nymag.com extractor to grab deks from features 8 years ago
Adam Pash 21f444367f feat: added page counts 8 years ago
Adam Pash f3a5d0ecca feat: added domain and url extractor (using same extractor)
commit 43ab423d575cd15cc55041fb3fe2f21ffdd7adff
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Sep 14 11:57:25 2016 -0400
8 years ago
Adam Pash 67296691c2 refactor: page collection 8 years ago
Adam Pash b325a4acdd chore: clean up junk tests 8 years ago
Adam Pash 547ee2b4ca Merge pull request #1 from postlight/test-fix-fixture-locations
Fix Fixture Locations
8 years ago
Adam Pash 62ae330db2 fix: bug in scoring and converting to paragraphs 8 years ago
Jeremy Mack 7ca19d2e6f test: fix fixture locations 8 years ago
Adam Pash 7e2a34945f chore: refactored and linted 8 years ago
Adam Pash 9906bd36a4 chore: moved content scoring out of utils, removed no-longer-necessary utils 8 years ago
Adam Pash 7ec0ed0d31 feat: nextPageUrl handles multi-page articles
Squashed commit of the following:

commit b5070c0967a7f1a0c0c449ba7ea40aebe8fe4bb8
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 13 10:03:00 2016 -0400

    root extractor includes next page url

commit 79be83127d5342d89eef33665586fabea227d6b3
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 13 09:58:20 2016 -0400

    small score adjustment

commit 0f00507dbff43401145a892e849311518edec68a
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 12 18:17:38 2016 -0400

    feat: nextPageUrl generic parser up and running

commit be91c589fc0c6d6f9b573080a76c9b1ac7af710c
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 12 11:53:58 2016 -0400

    feat: pageNumFromUrl extracts the pagenum of the current url

commit ad879d7aabedadfd051c01b42d841703bf4763fa
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 12 11:52:37 2016 -0400

    feat: isWordpress checks if a page is generated by wordpress
8 years ago
Adam Pash a89b9b785e feat: small improvement to author selectors 8 years ago
Adam Pash acaab70ee2 fix: scorePs parent scoring was overwriting child scoring 8 years ago
Adam Pash 8fe3bec6b6 fix: accepting cookies with request (required for sites like
nytimes.com)
8 years ago
Adam Pash 74694ba8e2 debugging: cheerio isn't always consistent in setting scores 8 years ago
Adam Pash 47ac7e9803 refactor: limiting calls to $ function
Squashed commit of the following:

commit c72da261cb5319d1eef207bff63b3c9cd49018df
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 9 15:28:43 2016 -0400

    refactor: limiting calls to $ function

commit eeae88247d844d5c6acbc529dbc3ce4d14e04191
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 9 15:14:33 2016 -0400

    refactor: convertNodeTo; requires a cheerio object
8 years ago
Adam Pash 81e9e7a317 feat: whitelisting attrs to keep 8 years ago
Adam Pash 7b97559778 chore: remove logic for fetching meta tags with custom attrs (resource
normalizes this now)
8 years ago
Adam Pash c48e3485c0 chore: code reorganization
Squashed commit of the following:

commit 636296841d5cf5e685237fe70db7a15305d8e966
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 9 13:37:21 2016 -0400

    final cleanup

commit 51f712b3074d41a1f2da91519289d4dd09719ad0
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 9 13:25:28 2016 -0400

    Another big pass

commit 3860e6d872a9adb9290093fd9c8708dfcc773c28
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 9 12:49:52 2016 -0400

    chore: started reorganizing
8 years ago
Adam Pash f2729a5ee6 improved wiki extractor 8 years ago
Adam Pash 52e89a0229 fix: cleaning embed and object nodes 8 years ago
Adam Pash edfb54c532 feat: links are rewritten to absolute in cleaner
Squashed commit of the following:

commit 9057d411a5458f80c316604559c469a239ef3a40
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 9 11:42:19 2016 -0400

    feat: links are rewritten to absolute in cleaner
8 years ago
Adam Pash bdc2c0c1da feat: can now fetch attrs in RootExtractor's select method 8 years ago
Adam Pash 33c7e0d1c9 feat: Improved dateString parsing to handle more; first trying to parse without cleaning 8 years ago
Adam Pash 91881df523 refactor: cleaners now run on custom extractors
Squashed commit of the following:

commit e4c7d1d149d1846f0d589b3653655b81b477c682
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 8 19:29:26 2016 -0400

    refactor: cleaners now run on custom extractors

commit ca08d2482c54bf6a40f50758da9353f00987a4d7
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 8 14:42:19 2016 -0400

    moved cleaners, refactored as necessary

commit ec2c5d36410b255c6d8ee264deca990c46709c3c
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 8 14:07:01 2016 -0400

    moved datePublished cleaner

commit 5e55e397eecb3e88d64cd2aa2c6071c9cffed272
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 8 13:34:21 2016 -0400

    moved dek cleaner

commit 2dfb0c44d7882336992fdc864792df6eac094c21
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 8 13:29:37 2016 -0400

    moved lead-image-url

commit cef7a213b80ddd671249225622f1388f9e68896c
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 8 13:26:20 2016 -0400

    moved author
8 years ago
Adam Pash 603682239d feat: basic wikipedia custom extractor 8 years ago
Adam Pash 9665fe7209 feat: blogspot.com custom extractor 8 years ago
Adam Pash 6c6451b34b fix: duplicate key bug 8 years ago
Adam Pash 93ca688955 fix: dek and leadImg should not be html 8 years ago
Adam Pash 45ef18ba37 fix: brought .html fixtures into project dir 8 years ago
Adam Pash 7d88fee199 feat: RootExtractor performs extraction using custom and generic
extraction methods
8 years ago
Adam Pash 937138c7bb refactor: improve extractor args; passing as object 8 years ago
Adam Pash ecacc6ce12 Some good basic restructuring 8 years ago
Adam Pash b3f90c489e basic merging of extracting sources 8 years ago
Adam Pash 0f45b39ca2 refactor: preparing for extraction merging 8 years ago
Adam Pash a022252a14 feat: getExtractor returns generic extractor 8 years ago
Adam Pash c40b702b93 clean formatting 8 years ago
Adam Pash dfb5334f18 fix: encoding request response as null
This fixes an issue with gzipped content
8 years ago
Adam Pash ddc684c7d3 updated constants 8 years ago
Adam Pash 189361dc20 cleanup 8 years ago
Adam Pash ac62e0fba0 fix: pre-loading html in resource 8 years ago
Adam Pash 3128baeda1 cleanup 8 years ago
Adam Pash 86b2ee194c feat: can pass in raw html if already fetched 8 years ago