Adam Pash
8662474d8a
feat: changed user agent to latest chrome ( #121 )
...
* feat: changed user agent to latest chrome
* removed dead link
8 years ago
Janet
7709d69379
feat: npr parser ( #86 )
...
* feat: npr parser
Lead image appears in preview, but the test fails for some reason.
AssertionError: null == 'https://media.npr.org/assets/img/2016/12/15/gettyimages-540681598_wide-8b160732b96c083dc115134c3c019f3ac73586ba.jpg?s=1400'
Looks okay otherwise.
* feat: transformed figures/figcaptions, improved date_published and
addressed NPR's bad image metadata
8 years ago
Janet
8a82f2c0ab
feat: recode parser ( #85 )
...
* feat: recode parser
Thumbs up, as far as I can tell.
Note: No image appeared in the preview.
* feat: pulling in lead image
8 years ago
Janet
ad29acd7b7
feat: fortune parser ( #84 )
...
* feat: fortune parser
For some reason, the dek doesn’t appear in the local version of the
article I selected. I tried parsing the meta tag containing
og:description but it’s not working, and the description is slightly
longer than the dek in the original article.
I’m not sure why, but for the lead image, the meta tag for og:image is
not parsing the image url.
:(
* feat: fortune redesigned, so re-did extractor
* fix: added timezone
8 years ago
Janet
c133ddf614
feat: qz parser ( #81 )
...
* feat: qz parser
I couldn’t figure out how to parse the date, but otherwise should be
fine. I added a clean for the div.article-aside element based on what I
saw in how the chrome extension worked.
* feat: updated content to grab top image
test: date is null :/
8 years ago
Janet
84312b6ef1
feat: dmagazine parser ( #80 )
...
* feat: dmagazine parser
I’m sorry to have failed you. :-( These are the issues I encountered:
1) author - does not have a unique selector to distinguish it from the
date, couldn’t parse it
2) date - no meta data in the head
3) no meta og:image in the head (my go to), so I couldn’t get the image
test to pass, but it appears to be parsing. The caption below it is the
same size as the body copy in the preview. I couldn’t figure out how to
“transform” it to caption size.
* feat: update date, image, and author selectors and corresponding tests
* feat: generalized content selector
8 years ago
Janet
e035f36361
feat: reuters parser ( #78 )
...
* feat: reuters parser
Date parses correctly but fails test because of format discrepancy.
Author tags are nested within the content, which is why the author
names are appearing twice. I wasn’t sure how to address this.
Additionally, the location appears twice, so I cleaned the location
tags from the content.
* test: fix date format
* transform .article-subtitle to h4; cleaning author but leaving location
8 years ago
Janet
dec49ab073
feat: mashable parser ( #76 )
...
* feat: mashable parser
As usual the date is giving me issues because of formatting
discrepancies:
AssertionError: '2016-12-13T22:33:06.000Z' == '2016-12-14T03:33:06.000Z'
Not sure how we wanna deal with Twitter card embeds that don’t show up?
Also, image credits did not show up in preview.
* test: fixed date format
* transforming .image-credit to figcaption
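The two timestamps in the assertion above are exactly five hours apart. That gap would be consistent with a US Eastern wall-clock time being read as UTC, though this is an assumption; the commit message does not name the cause. A small sketch of the arithmetic, where `hoursBetween` is a hypothetical helper and not a function from this repository:

```javascript
// Hypothetical helper for illustration; not part of the parser's code.
// Returns the signed gap between two ISO 8601 timestamps, in hours.
function hoursBetween(earlierIso, laterIso) {
  const msPerHour = 60 * 60 * 1000;
  const earlier = new Date(earlierIso).getTime();
  const later = new Date(laterIso).getTime();
  return (later - earlier) / msPerHour;
}

// Both values are copied from the AssertionError quoted in the commit message.
const actualDate = '2016-12-13T22:33:06.000Z';
const expectedDate = '2016-12-14T03:33:06.000Z';

console.log(hoursBetween(actualDate, expectedDate)); // 5
```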
8 years ago
Janet
cddc1afb69
feat: chicago tribune parser ( #75 )
...
* feat: chicago tribune parser
Date is parsing but failing the test because:
AssertionError: '2016-12-13T21:45:00.000Z' == '2016-12-13T13:45:00-0800'
I tried to insert a line of code for Time Zone but I’m a n00b so I
don’t think I did it right.
No image showing up in the preview.
* fix: remove timezone from date_published extractor
* test: update unit tests to assert the correct value for date_published
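The two values in the assertion above name the same instant in different notations: 13:45 at a -0800 offset is 21:45 UTC. A minimal sketch of comparing such strings by instant rather than by text follows; `normalizeOffset`, `toIsoUtc`, and `sameInstant` are hypothetical helper names, not functions from this repository.

```javascript
// Hypothetical helpers for illustration; not part of the parser's code.

// Insert a colon into a trailing "+hhmm" / "-hhmm" offset so the string
// matches the ISO 8601 profile that Date is required to parse.
function normalizeOffset(dateString) {
  return dateString.replace(/([+-]\d{2})(\d{2})$/, '$1:$2');
}

// Convert any supported date string to its canonical UTC ISO form.
function toIsoUtc(dateString) {
  return new Date(normalizeOffset(dateString)).toISOString();
}

// Two strings are equivalent when they resolve to the same UTC instant.
function sameInstant(a, b) {
  return toIsoUtc(a) === toIsoUtc(b);
}

// Both values are copied from the AssertionError quoted in the commit message.
const parsedDate = '2016-12-13T21:45:00.000Z';
const fixtureDate = '2016-12-13T13:45:00-0800';

console.log(toIsoUtc(fixtureDate));
console.log(sameInstant(parsedDate, fixtureDate));
```

Asserting on the normalized UTC string, as in `toIsoUtc`, keeps a test independent of the offset notation used in the fixture.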
8 years ago
Janet
aff651c2d8
feat: hellogiggles parser ( #107 )
...
Looks good to me!
8 years ago
Janet
11ad7b9a92
feat: thought catalog parser ( #102 )
...
Looks good!
8 years ago
Janet
aa43a6091c
feat: cnbc parser ( #96 )
...
Should be good to go!
8 years ago
Janet
cd245f7980
feat: popsugar parser ( #93 )
...
I think this one is good to go!
8 years ago
Janet
a8ab7135e1
feat: observer parser ( #91 )
...
no problems
8 years ago
Janet
3bee7224cb
feat: nbc news parser ( #74 )
8 years ago
Janet
88242dd233
feat: nj.com parser ( #73 )
8 years ago
Janet
1ac5670a54
feat: inquisitor parser ( #72 )
8 years ago
Janet
9e5b91ed8b
feat: refinery29 parser ( #71 )
8 years ago
Janet
b78c58c43a
feat: miami herald parser ( #69 )
8 years ago
Janet
aedf83edc6
feat: eonline parser ( #68 )
8 years ago
Janet
a20da5eb31
uproxx extractor ( #66 )
8 years ago
Janet
87c42b6358
feat: 247sports.com extractor ( #64 )
8 years ago
Janet
22e6c884fb
feat: rolling stone extractor ( #65 )
8 years ago
Janet
6337231697
feat: usmagazine extractor ( #63 )
8 years ago
Janet
c06b19efe7
feat: people extractor ( #70 )
...
No major problems!
8 years ago
Janet
3cf2bb78c4
feat: vox custom parser ( #67 )
8 years ago
Janet
861c5f0dcb
feat: bustle extractor ( #60 )
8 years ago
Adam Pash
06397a4360
feat: browser-friendly selector for medium ( #61 )
8 years ago
Adam Pash
3297ab079d
feat: bloomberg extractor ( #59 )
...
Bloomberg has several templates. I'm supporting three different templates here, but I'm not sure that this is complete by any means.
It's also worth noting that SVGs don't make it through the parser terribly well for many reasons. One, for example, is that a lot of SVGs require custom CSS in order for them to make sense. I'm not sure this is something we can expect to address in the parser.
8 years ago
Janet
e55e9da534
feat: sbnation extractor ( #55 )
8 years ago
Adam Pash
8070e4790b
test: streamlined guardian tests w/new single-extraction ( #58 )
8 years ago
Adam Pash
bdb751fb53
feat: more cleaning for wired ( #56 )
8 years ago
Janet
e7e41bd242
feat: the guardian custom extractor ( #41 )
8 years ago
Adam Pash
81aa89f2c1
feat: youtube custom extractor ( #53 )
8 years ago
Adam Pash
2fb47640f2
Feat: detect platforms ( #52 )
...
Detectors for matching extractors for publishing platforms. Currently supporting Medium and Blogger.
8 years ago
Adam Pash
64c0fad2fd
fix: preserve whitespace ( #51 )
...
No longer normalizing whitespace in html
8 years ago
Adam Pash
15656cb3e1
Refactor: running tests more efficiently ( #49 )
...
Only running one parser per page we're testing rather than a parser per field we're testing.
8 years ago
Adam Pash
f9902cfa05
Fix: extension bugs ( #47 )
...
* feat: lead image on atlantic stories now included
* feat: supporting buzzfeed "longform" template
* feat: cleaning .parter-box from the atlantic
8 years ago
Adam Pash
16860f1d85
feat: improved nyt parser ( #46 )
...
NYT was one of the first, and its test was stale and it didn't have all
of its fields well defined.
8 years ago
Adam Pash
d0453efbf8
feat: improvements for nyer magazine articles ( #45 )
...
adds dek and date_published for magazine template
8 years ago
Adam Pash
00f8965c1f
fix: cleaning up deks ( #44 )
...
We've solidified what we consider a dek. This PR removes the dek selectors that do not fit that mold.
8 years ago
Janet
b415d1d37c
feat: aol custom extractor ( #42 )
...
* feat: aol custom parser
* removed work from other commits. merged with latest master
8 years ago
Matt
4cc3b68b5e
feat: remove footer links ( #40 )
...
The links at the bottom of the stories feel a little spammy because of how we treat links versus how they are displayed on the Times, so I would like to clean them.
8 years ago
Adam Pash
ff1963bdca
feat: new cleaner for wapo ( #38 )
8 years ago
Adam Pash
0e6ccdf622
fix: browser cleanup ( #35 )
...
Cleaning up after the parser when it's done in the browser, before
returning result.
8 years ago
Silas Burton
c3d98a0d76
Feat cnn extractor ( #34 )
...
* wip: cnn custom extractor
* wip: cnn works except first paragraph
* final touches on cnn parser
* cleanup
8 years ago
Silas Burton
a0570f8e94
feat: extractor for the verge ( #33 )
...
* feat: extractor for the verge's standard article template
* feat: basic support for the verge feature template
* feat: allow multiple links to be previewed
* feat: content selector arrays
Content selector arrays allow custom parsers to select multiple elements
to match and include in the result.
* feat: updated verge parser to use multimatch selectors
* lint fix
* cleanup test builds
8 years ago
Adam Pash
233ca11a33
fix: added timezone to new republic date ( #32 )
8 years ago
Adam Pash
cfe7f34be4
fix: normalizing spaces for authors/dek/title ( #31 )
...
* fix: normalizing spaces for authors/dek/title
8 years ago
Adam Pash
9a23b24a89
feat: adjustment for huffpo. skipping overly aggressive default cleaners ( #30 )
8 years ago
Silas Burton
be2e4b5c80
Feat: huffington post extractor ( #28 )
...
* wip: huffpo custom extractor
* wip: some huffpo cleanup
8 years ago
Adam Pash
94198c0a65
feat: new republic custom extractor ( #25 )
...
* wip: new republic custom extractor
* feat: new republic article extractor
* feat: new republic minutes article extractor
8 years ago
Janet
c4d72fb735
feat: add money.cnn custom parser ( #26 )
...
* feat: add money.cnn custom parser
* added timezone to cnn custom parser
8 years ago
Adam Pash
6343946dd8
Feat: custom timezones ( #29 )
...
* using moment-timezone to allow custom timezones
* added tz to tmz, even though still so-so
8 years ago
Adam Pash
a8face796a
Fix extension bugs ( #23 )
...
* feat: cleaning supplemental elements in nytimes (visible in web only)
closes https://github.com/postlight/mercury-reader-chrome-extension/issues/102
* wip
* fix: more generous date published bits
* feat: added washington post extractor (including figure transforms)
closes https://github.com/postlight/mercury-reader-chrome-extension/issues/100
* feat: cleaning zoom lightbox from gizmodo/kinja
* lint fix
8 years ago
Adam Pash
3a2f32b0eb
feat: added tmz custom parser ( #22 )
8 years ago
Adam Pash
783a9cfb2f
fix: changed overly liberal regex for removing transparent images
8 years ago
Adam Pash
7411922c55
feat: encoding response body based on content-type charset ( #21 )
...
Also some small code organization
8 years ago
Adam Pash
c30fb2e4c0
chore: updated readme
8 years ago
Adam Pash
60a6861e18
Feat: browser support ( #19 )
...
Big undertaking to support Mercury in the browser. Builds are working and all tests are passing both for web and node builds. Most code is closely shared.
8 years ago
Adam Pash
eaea57461a
fix: servers returning bad headers was breaking request. temporarily ( #20 )
...
using fork with a fix for this until request merges the necessary pull request
8 years ago
Adam Pash
629eada1f7
feat: recording/playing back network requests with nock ( #18 )
...
* feat: recording/playing back network requests with nock
* lint fix
8 years ago
Adam Pash
e325d860fd
Feat: improving ci ( #16 )
...
This commit also swaps in yarn for npm and tweaks circle ci a bit.
* appveyor.yml first go
* changing node
* ps
* narrow it down
* trying this
* fix airbnb module
* trying with yarn
* logging
* hybrid?
* trying yarn w/circle
* bump workers?
* build off?
* updating script
* tweaking script for appveyor
* bumping maxworkers
* cleaning up
* build step?
* yarn it
* added appveyor badge
8 years ago
Adam Pash
048d654417
feat: parser auto-generates name; lint is more specific
8 years ago
Adam Pash
65c641a879
feat: enforcing line break rules in linter
8 years ago
Adam Pash
4d1d950807
updated generator templates for new style of import/export. also some adjustments for usability
8 years ago
Adam Pash
7fa90f59b7
making all.js export a generic function to decrease possibility of error
8 years ago
Adam Pash
de5b120b79
feat: allowing extractors to support multiple domains
8 years ago
Adam Pash
d038a36544
feat: custom medium extractor
8 years ago
Adam Pash
007ddec8ac
feat: allowing iframes from src domain
8 years ago
Adam Pash
b65b0c98b0
feat: supporting all GMG sites using DeadspinExtractor
8 years ago
Adam Pash
17317823de
fix: bug that stopped proper attr cleaning in certain cases
8 years ago
Adam Pash
40768fa188
feat: support lazy loading video on deadspin
8 years ago
Adam Pash
38c90d239e
fix: removeEmpty shouldn't remove elements with images or iframes inside
8 years ago
Adam Pash
c63f500433
fix: narrowed selector to fix blogspot title selector
8 years ago
Adam Pash
d3b11be473
feat: keeping youtube and vimeo iframe embeds ( #14 )
...
* feat: keeping youtube and vimeo iframe embeds
* fix: removing class from article correctly
8 years ago
Adam Pash
5c7f2cd28e
fix: better selector for nytimes authors
8 years ago
Adam Pash
3b87b557be
feat: pulling score from whitelist
8 years ago
Drew Bell
76db95e884
feat: Add custom extractor for Apartment Therapy
8 years ago
Drew Bell
a708ad3b4f
feat: Add custom parser for broadwayworld.com
8 years ago
Adam Pash
896021227d
feat: added deadspin custom parser
8 years ago
Adam Pash
422deb4600
feat: generator generates potential selectors for all custom selectable fields
8 years ago
Adam Pash
c314e3befa
feat: dek returns null if it's basically the same as the excerpt
...
Squashed commit of the following:
commit 0ee7d51ce609ad23d2deca1af41e7b4e56681bd7
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Oct 10 15:44:28 2016 -0700
feat: dek does not return if it's basically the same as the excerpt
commit 6ad27f994fff3652e04ffe7c81f1ae0b1647e941
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Oct 10 14:35:54 2016 -0700
feat: added excerpt util
8 years ago
Adam Pash
63c06c8a00
fix: babel-polyfill mess (I think)
8 years ago
Adam Pash
eb0aa0b1f6
feat: some small tweaks to toy's excellent parsers ☺️
...
Squashed commit of the following:
commit 9638220124a325322d6cda7d16c645185d5fe827
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Oct 10 11:02:29 2016 -0700
fix: removed eslint plugin that was adding unneeded async parens
commit ce2268c0f7c1b093c06f156730a0f1bc2aaba39c
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Oct 10 10:47:36 2016 -0700
style: fix async in parens
commit 9591856915eddaf93170da1ce9225b8a378bdf55
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Oct 10 10:37:11 2016 -0700
fix: remove parens around async
commit 6c56054717acc1f7e5499691780f8273f6d07bac
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Oct 10 10:35:50 2016 -0700
fix msn fixture; adjusted yahoo test
commit 4fc117ad5fdc5528f29b0873d60a6a1709642f15
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Oct 10 10:14:38 2016 -0700
removed dek and date_published tests; neither exists in littlethings
commit 401094b4abc52901255fd2461f5839624f11d8a3
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Oct 10 10:08:44 2016 -0700
feat: updated buzzfeed for content extraction
commit 19548a5485f70ff9b65e3e725d2364d07734ac9c
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Oct 10 09:54:30 2016 -0700
fix: generator should make transforms an object, not array
commit b92113f9f7c97aca9e6d3ce9243abac967d26b63
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Oct 10 08:54:38 2016 -0700
feat: updated politico
commit c026591040f7671cb2a6dd5177a995e21d015482
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Oct 10 08:48:52 2016 -0700
fix: typos
commit 14aa8fa4ce38ff1c2a212cd0225437ae3042c2c3
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Oct 10 08:36:12 2016 -0700
fix: incorrect command in readme
commit fe260e6122877e2cb0130a1ecde0e503017057a3
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Oct 10 08:31:11 2016 -0700
fix: removed dek test because there is no dek on wikia
8 years ago
Toy Vano
3c99404566
Merge pull request #11 from postlight/feat-politico-extractor
...
feat: added politico extractor
8 years ago
Toy Vano
e766494922
feat: added politico extractor
8 years ago
Toy Vano
dd20e56bc5
Merge pull request #10 from postlight/feat-littlethings-extractor
...
feat: added littlethings extractor
8 years ago
Toy Vano
fd1ac3f2b9
feat: added littlethings extractor
8 years ago
Toy Vano
6c18551ed0
Merge pull request #9 from postlight/feat-wikia-extractor
...
feat: added wikia extractor
8 years ago
Toy Vano
017b9dfcc2
Merge pull request #8 from postlight/feat-buzzfeed-extractor
...
feat: added incomplete buzzfeed extractor
8 years ago
Toy Vano
bdf66314ea
Merge pull request #7 from postlight/feat-yahoo-extractor
...
feat: added incomplete yahoo extractor
8 years ago
Toy Vano
b0e1a873c0
Merge pull request #6 from postlight/feat-msn-extractor
...
feat: added incomplete msn extractor
8 years ago
Toy Vano
1519eed3e5
feat: added wikia extractor
8 years ago
Toy Vano
9416ec73a4
feat: added incomplete buzzfeed extractor
8 years ago
Toy Vano
c6c35bd237
feat: added incomplete yahoo extractor
8 years ago
Toy Vano
320c740676
feat: added incomplete msn extractor
8 years ago
Adam Pash
e3ee5e93bf
chore: small doc fixes
8 years ago
Toy Vano
7ecc696248
feat: added wired custom extractor
8 years ago
Adam Pash
20b7c5a8b6
chore: fix a few typos/links
8 years ago
Adam Pash
173f885674
feat: custom parser + generator + detailed readme instructions
...
Squashed commit of the following:
commit 02563daa67712c3679258ebebac60dfa9568dffb
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 30 12:25:44 2016 -0400
updated readme, added newyorker parser for readme guide
commit 0ac613ef823efbffbf4cc9a89e5cb2489d1c4f6f
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 30 11:16:52 2016 -0400
feat: updated parser so the saved fixture absolutizes urls
commit 85c7a2660b21f95c2205ca4a4378a7570687fed0
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 30 10:15:26 2016 -0400
refactor: attribute selectors must be an array for custom extractors
commit f60f93d5d3d9b2f2d9ec6f28d27ae9dcf16ef01e
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 29 10:13:14 2016 -0400
fix: whitelisting srcset and alt attributes
commit e31cb1f4e8a9fc9c3d9b20ef9f40ca6c8d6ad51a
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 29 09:44:21 2016 -0400
some housekeeping for coverage tests
commit 39eafe420c776a1fe7f9fea634fb529a3ed75a71
Author: Adam Pash <adam.pash@gmail.com>
Date: Wed Sep 28 17:52:08 2016 -0400
fix: word count for multi-page articles
commit b04e0066b52f190481b1b604c64e3d0b1226ff02
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 22 10:40:23 2016 -0400
major improvements to output
commit 3f3a880b63b47fe21953485da670b6e291ac60e5
Author: Adam Pash <adam.pash@gmail.com>
Date: Wed Sep 21 17:27:53 2016 -0400
updated test command
commit 14503426557a870755453572221d95c92cff4bd2
Author: Adam Pash <adam.pash@gmail.com>
Date: Wed Sep 21 16:00:30 2016 -0400
shortened generator command
commit 5ebd8343cd4b87b3f5787dab665bff0de96846e1
Author: Adam Pash <adam.pash@gmail.com>
Date: Wed Sep 21 15:59:14 2016 -0400
feat: can disable fallback to generic parser (this will be useful for testing custom parsers)
8 years ago
Adam Pash
39a3c0690d
chore: readme improvement
8 years ago
Adam Pash
ef047107ea
feat: content cleaner still runs, but can disable some cleaners
8 years ago
Adam Pash
75b1880f01
chore: cleaned up unused files, slight reorg
8 years ago
Adam Pash
ad42055f8f
feat: switched test framework to jest
8 years ago
Adam Pash
8f42e119e8
feat: generator for custom parsers and some documentation
...
Squashed commit of the following:
commit deaf9e60d031d9ee06e74b8c0895495b187032a5
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Sep 20 10:31:09 2016 -0400
chore: README for custom parsers
commit a8e8ad633e0d1576a52dbc90ce31b98fb2ec21ee
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Sep 19 23:36:09 2016 -0400
draft of readme
commit 4f0f463f821465c282ce006378e5d55f8f41df5f
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Sep 19 17:56:34 2016 -0400
custom extractor used to build basic parser for theatlantic
commit c5562a3cede41f56c4e723dcfa1181b49dcaae4d
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Sep 19 17:20:13 2016 -0400
pre-commit to test custom parser generator
commit 7d50d5b7ab780b79fae38afcb87a7d1da5d139b2
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Sep 19 17:19:55 2016 -0400
feat: added nytimes parser
commit 58b8d83a56927177984ddfdf70830bc4f328f200
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Sep 19 17:17:28 2016 -0400
feat: can do fuzzy search or go straight to file
commit c99add753723a8e2ac64d51d7379ac8e23125526
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Sep 19 10:52:26 2016 -0400
refactored export for custom extractors for easier renames
commit 22563413669651bb497f1bb2a92085b71f2ae324
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 16 17:36:13 2016 -0400
feat: custom extractor generation in place
commit 2285a29908a7f82a5de3c81f6b2b902ddec9bdaa
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 16 16:42:20 2016 -0400
good progress
8 years ago
Adam Pash
7ade83692a
feat: improve wikipedia parser
8 years ago
Adam Pash
2ae2dba690
chore: renamed iris to mercury
8 years ago
Adam Pash
005ba47f6f
fix: wikipedia transform only grabs one image from .infobox
8 years ago
Adam Pash
8dc6042dc9
build for comparisons
8 years ago
Adam Pash
cbd0636dcf
chore: cleaned up python and other unneeded comments
8 years ago
Adam Pash
bf13b38a9b
feat: some basic error handling for bad urls
8 years ago
Adam Pash
ffaf7db0f1
fix: some improvements to date parsing. punting on localization issues
8 years ago
Adam Pash
396313aeae
feat: added twitter custom extractor
...
Squashed commit of the following:
commit 8116f14364869b72a8afabfcb44b2ac154caed96
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 15 16:27:27 2016 -0400
feat: added twitter custom extractor
commit e478eb1b0bcdcb65fdd5fa64e37be92b6defd702
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 15 16:22:54 2016 -0400
fix: made custom extractors and cleaners adhere to underscore keys
8 years ago
Adam Pash
d60d396c98
feat: added text direction to response
8 years ago
Adam Pash
f0f216c7b9
feat: add option to allow custom extractors to skip default cleaners
8 years ago
Adam Pash
97a0728ecf
test: added sanity test for get-extractor
8 years ago
Adam Pash
7c375aded7
chore: cleanup
8 years ago
Adam Pash
4cdc4165d6
fix: encodeURI before fetching
8 years ago
Adam Pash
1343469b6c
fix: explicit/better decoding of gzipped content
8 years ago
Adam Pash
c338098f21
refactor: renamed child to sibling for clarity
8 years ago
Adam Pash
6263e505d5
fix: handling case where node.get(0) returns null
8 years ago
Adam Pash
3b36a33e36
chore: change result keys to match python api
8 years ago
Adam Pash
cc060b794d
fix: wordcount calling excerpt
8 years ago
Adam Pash
7fc1f7f6bb
checking in dist
8 years ago
Adam Pash
daa9266182
feat: generic extractor for word count
...
Squashed commit of the following:
commit 0aba26ef9efba71a72c76fa351a9037e97fc1e9e
Author: Adam Pash <adam.pash@gmail.com>
Date: Wed Sep 14 14:56:45 2016 -0400
fix: normalizeSpaces regex fix broke a test
commit 07d60c1c8c6599d6c94d92e5a70649c28d03d6ea
Author: Adam Pash <adam.pash@gmail.com>
Date: Wed Sep 14 14:52:41 2016 -0400
feat: generic extractor for word count
8 years ago
Adam Pash
76df30e303
chore: cleanup
8 years ago
Adam Pash
b3481a2c45
feat: generic excerpt extraction
8 years ago
Adam Pash
457075889d
fix: selection should not be empty
8 years ago
Adam Pash
81ed4f00ed
feat: improve nymag.com extractor to grab deks from features
8 years ago
Adam Pash
21f444367f
feat: added page counts
8 years ago
Adam Pash
f3a5d0ecca
feat: added domain and url extractor (using same extractor)
...
commit 43ab423d575cd15cc55041fb3fe2f21ffdd7adff
Author: Adam Pash <adam.pash@gmail.com>
Date: Wed Sep 14 11:57:25 2016 -0400
8 years ago
Adam Pash
67296691c2
refactor: page collection
8 years ago
Adam Pash
b325a4acdd
chore: clean up junk tests
8 years ago
Adam Pash
547ee2b4ca
Merge pull request #1 from postlight/test-fix-fixture-locations
...
Fix Fixture Locations
8 years ago
Adam Pash
62ae330db2
fix: bug in scoring and converting to paragraphs
8 years ago
Jeremy Mack
7ca19d2e6f
test: fix fixture locations
8 years ago
Adam Pash
7e2a34945f
chore: refactored and linted
8 years ago
Adam Pash
9906bd36a4
chore: moved content scoring out of utils, removed no-longer-necessary utils
8 years ago
Adam Pash
7ec0ed0d31
feat: nextPageUrl handles multi-page articles
...
Squashed commit of the following:
commit b5070c0967a7f1a0c0c449ba7ea40aebe8fe4bb8
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Sep 13 10:03:00 2016 -0400
root extractor includes next page url
commit 79be83127d5342d89eef33665586fabea227d6b3
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Sep 13 09:58:20 2016 -0400
small score adjustment
commit 0f00507dbff43401145a892e849311518edec68a
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Sep 12 18:17:38 2016 -0400
feat: nextPageUrl generic parser up and running
commit be91c589fc0c6d6f9b573080a76c9b1ac7af710c
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Sep 12 11:53:58 2016 -0400
feat: pageNumFromUrl extracts the pagenum of the current url
commit ad879d7aabedadfd051c01b42d841703bf4763fa
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Sep 12 11:52:37 2016 -0400
feat: isWordpress checks if a page is generated by wordpress
8 years ago
Adam Pash
a89b9b785e
feat: small improvement to author selectors
8 years ago
Adam Pash
acaab70ee2
fix: scorePs parent scoring was overwriting child scoring
8 years ago
Adam Pash
8fe3bec6b6
fix: accepting cookies with request (required for sites like nytimes.com)
8 years ago
Adam Pash
74694ba8e2
debugging: cheerio isn't always consistent in setting scores
8 years ago
Adam Pash
47ac7e9803
refactor: limiting calls to $ function
...
Squashed commit of the following:
commit c72da261cb5319d1eef207bff63b3c9cd49018df
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 9 15:28:43 2016 -0400
refactor: limiting calls to $ function
commit eeae88247d844d5c6acbc529dbc3ce4d14e04191
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 9 15:14:33 2016 -0400
refactor: convertNodeTo; requires a cheerio object
8 years ago
Adam Pash
81e9e7a317
feat: whitelisting attrs to keep
8 years ago
Adam Pash
7b97559778
chore: remove logic for fetching meta tags with custom attrs (resource normalizes this now)
8 years ago
Adam Pash
c48e3485c0
chore: code reorganization
...
Squashed commit of the following:
commit 636296841d5cf5e685237fe70db7a15305d8e966
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 9 13:37:21 2016 -0400
final cleanup
commit 51f712b3074d41a1f2da91519289d4dd09719ad0
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 9 13:25:28 2016 -0400
Another big pass
commit 3860e6d872a9adb9290093fd9c8708dfcc773c28
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 9 12:49:52 2016 -0400
chore: started reorganizing
8 years ago
Adam Pash
f2729a5ee6
improved wiki extractor
8 years ago
Adam Pash
52e89a0229
fix: cleaning embed and object nodes
8 years ago
Adam Pash
edfb54c532
feat: links are rewritten to absolute in cleaner
...
Squashed commit of the following:
commit 9057d411a5458f80c316604559c469a239ef3a40
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 9 11:42:19 2016 -0400
feat: links are rewritten to absolute in cleaner
8 years ago
Adam Pash
bdc2c0c1da
feat: can now fetch attrs in RootExtractor's select method
8 years ago
Adam Pash
33c7e0d1c9
feat: Improved dateString parsing to handle more; first trying to parse without cleaning
8 years ago
Adam Pash
91881df523
refactor: cleaners now run on custom extractors
...
Squashed commit of the following:
commit e4c7d1d149d1846f0d589b3653655b81b477c682
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 8 19:29:26 2016 -0400
refactor: cleaners now run on custom extractors
commit ca08d2482c54bf6a40f50758da9353f00987a4d7
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 8 14:42:19 2016 -0400
moved cleaners, refactored as necessary
commit ec2c5d36410b255c6d8ee264deca990c46709c3c
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 8 14:07:01 2016 -0400
moved datePublished cleaner
commit 5e55e397eecb3e88d64cd2aa2c6071c9cffed272
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 8 13:34:21 2016 -0400
moved dek cleaner
commit 2dfb0c44d7882336992fdc864792df6eac094c21
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 8 13:29:37 2016 -0400
moved lead-image-url
commit cef7a213b80ddd671249225622f1388f9e68896c
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 8 13:26:20 2016 -0400
moved author
8 years ago
Adam Pash
603682239d
feat: basic wikipedia custom extractor
8 years ago
Adam Pash
9665fe7209
feat: blogspot.com custom extractor
8 years ago
Adam Pash
6c6451b34b
fix: duplicate key bug
8 years ago
Adam Pash
93ca688955
fix: dek and leadImg should not be html
8 years ago
Adam Pash
45ef18ba37
fix: brought .html fixtures into project dir
8 years ago
Adam Pash
7d88fee199
feat: RootExtractor performs extraction using custom and generic extraction methods
8 years ago
Adam Pash
937138c7bb
refactor: improve extractor args; passing as object
8 years ago
Adam Pash
ecacc6ce12
Some good basic restructuring
8 years ago
Adam Pash
b3f90c489e
basic merging of extracting sources
8 years ago
Adam Pash
0f45b39ca2
refactor: preparing for extraction merging
8 years ago
Adam Pash
a022252a14
feat: getExtractor returns generic extractor
8 years ago
Adam Pash
c40b702b93
clean formatting
8 years ago
Adam Pash
dfb5334f18
fix: encoding request response as null
...
This fixes an issue with gzipped content
8 years ago
Adam Pash
ddc684c7d3
updated constants
8 years ago
Adam Pash
189361dc20
cleanup
8 years ago
Adam Pash
ac62e0fba0
fix: pre-loading html in resource
8 years ago
Adam Pash
3128baeda1
cleanup
8 years ago
Adam Pash
86b2ee194c
feat: can pass in raw html if already fetched
8 years ago
Adam Pash
8da2425e59
feat: resource fetches content from a URL and prepares for parsing
...
Squashed commit of the following:
commit 7ba2d2b36d175f5ccbc02f918322ea0dd44bf2c1
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Sep 6 17:55:10 2016 -0400
feat: resource fetches content from a URL and prepares for parsing
commit 0abdfa49eed5b363169070dac6d65d0a5818c918
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Sep 6 17:54:07 2016 -0400
fix: this was messing up double Esses ('ss', as in class => cla)
commit 9dc65a99631e3a68267a68b2b4629c4be8f61546
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Sep 6 14:58:57 2016 -0400
fix: test suite working w/new dirs
commit 993dc33a5229bfa22ea998e3c4fe105be9d91c21
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Sep 6 14:49:39 2016 -0400
feat: convertLazyLoadedImages puts img urls in the src
commit e7fb105443dd16d036e460ad21fbcb47191f475b
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Sep 6 14:30:43 2016 -0400
feat: makeLinksAbsolute to fully qualify urls
commit dbd665078af854efe84bbbfe9b55acd02e1a652f
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Sep 6 13:38:33 2016 -0400
feat: fetchResource to fetch a url and validate the response
commit 42d3937c8f0f8df693996c2edee93625f13dced7
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Sep 6 10:25:34 2016 -0400
feat: normalizing meta tags
8 years ago
Adam Pash
bc97156718
fix: better scoring for image extensions
8 years ago
Adam Pash
11a2286659
notes, cleanup
8 years ago
Adam Pash
752331eaae
feat: bundling with rollup
...
Squashed commit of the following:
commit 52bcf0f2dd79bcb2ee21bc134522edd259a3d35e
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 2 13:42:29 2016 -0400
fix: converting date to ISO string
commit 11e827e27129ac229a96f66ca03f0b18dc5d289d
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 2 13:42:12 2016 -0400
feat: bundling with rollup
commit 1ff752a3e44e5836b955f7f15c799abbbdfc9207
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 2 12:11:39 2016 -0400
clean
8 years ago
Adam Pash
0ff3082295
feat: GenericExtractLeadImageUrl
...
Squashed commit of the following:
commit 22d37ebf26dbbd0a3daebbfde3509a6ce04aaf72
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 1 17:50:13 2016 -0400
feat: GenericExtractLeadImageUrl
commit 3327a0a7929dd0e9267dc9c26f4e2aa78c32586f
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 1 15:33:42 2016 -0400
feat: can pass custom attributes to extractFromMeta
8 years ago
Adam Pash
467b600721
feat: extract dek stubbed (not currently functional)
8 years ago
Adam Pash
d3b791d516
fix: title wasn't cleaning html tags
8 years ago
Adam Pash
956fd678f7
feat: GenericDatePublishedExtractor
...
Squashed commit of the following:
commit 8eda4606e773147ae8dd67666d1a64d659f9fdad
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 1 12:28:06 2016 -0400
feat: GenericDatePublishedExtractor
commit 935510fe9bc0a92f68fca7faf66019cb45330097
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 1 09:28:42 2016 -0400
updated todo
8 years ago
Adam Pash
29db4a6ee0
feat: extract author
8 years ago
Adam Pash
7e28871a02
chore: plumbing
8 years ago
Adam Pash
746d07d4a2
feat: title extraction and scaffolding for more
...
Squashed commit of the following:
commit 31d8b63dcb3ec9bbd6c8e7a10852fbd060e91103
Author: Adam Pash <adam.pash@gmail.com>
Date: Wed Aug 31 15:52:27 2016 -0400
feat: title extraction
commit 7002c552a9f5bb54630455d983b699c041c629fc
Author: Adam Pash <adam.pash@gmail.com>
Date: Wed Aug 31 14:21:29 2016 -0400
feat: withinComment checks if a node is inside a comment
commit 57f06ef5b499c2f747edee0c9eb276e38984de9a
Author: Adam Pash <adam.pash@gmail.com>
Date: Wed Aug 31 13:40:36 2016 -0400
feat: extractFromMeta function
commit 0947f21aae94fa5ce462246ed5cb53144d563931
Author: Adam Pash <adam.pash@gmail.com>
Date: Wed Aug 31 13:32:30 2016 -0400
fix: returning original string if no tags in string
commit dd6b032e5f9877395b9600480dd96c6fdf60cecd
Author: Adam Pash <adam.pash@gmail.com>
Date: Wed Aug 31 12:03:58 2016 -0400
feat: clean title function removes junk from titles
commit f33b3eef29ad7692441bd0e5aa26b11dd4411dde
Author: Adam Pash <adam.pash@gmail.com>
Date: Wed Aug 31 12:03:35 2016 -0400
chore: renamed function to correct name
commit 076a986b12df68a939a8efa773e01d08780d79aa
Author: Adam Pash <adam.pash@gmail.com>
Date: Wed Aug 31 12:02:18 2016 -0400
feat: utility method to strip tags from text
commit f3e98cdf0a0d7601fab9e8824c0cde73ded51651
Author: Adam Pash <adam.pash@gmail.com>
Date: Wed Aug 31 11:31:33 2016 -0400
feat: resolveSplitTitle cleans raw title text
8 years ago
Adam Pash
07834c0e15
refactor: restructuring for metadata extraction
8 years ago
Adam Pash
95085d1a11
chore: cleanup
8 years ago
Adam Pash
e1ef25aab1
fix: added babel-polyfill for bug in Reflect
8 years ago
Adam Pash
93e844cdfe
feat: implemented extractBestNode functionality
...
Squashed commit of the following:
commit 9af554dd975ff1778ed70c71fa9bde667fc5f880
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Aug 30 15:19:32 2016 -0400
feat: add cleanHeaders
commit 0dfea98eedc4f97fcbd78866322595c705e20521
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Aug 30 14:30:49 2016 -0400
fix: scoring parent nodes recursively
commit b6e5897a694adeb81e25a905aba72c0f45a8cc94
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Aug 30 12:47:24 2016 -0400
feat: extract clean node up and running
commit fb652c5db13db6bce7271efd68ba4b20515e9549
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Aug 30 09:57:21 2016 -0400
chore: added test for p tags with nested tags (e.g., img, iframe)
commit 731d0a2e4d89121dfafad195e9d0911805c4f8e4
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Aug 29 17:50:33 2016 -0400
feat: extract clean node integrates most functions
commit 322bc6534d30feb7c1c08d3813132badc6286b40
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Aug 29 16:46:04 2016 -0400
feat: removing empty nodes as defined in constants
commit f1d38932ea12a865814d2326970031fcb8515baa
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Aug 29 16:33:31 2016 -0400
feat: cleaning attributes from nodes
commit 0aa73ada6854af0ecd504bfe3d926a9524787ab5
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Aug 29 16:09:56 2016 -0400
feat: cleaning h1s from text
commit 12d4a309246285c278ce7765e4fbaa8271bb5889
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Aug 29 15:52:03 2016 -0400
feat: removing spacer images
commit 4e74ff830cc67586560f6fc72e2cfa432a3a2647
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Aug 29 15:38:49 2016 -0400
feat: stripping unwanted html from doc
commit c774166e90169fd0c1aa89898d3f7a975e82bf0a
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Aug 29 15:17:32 2016 -0400
feat: removing small images, height attribute from images
commit 3a8642f42cda451669c832482c5e1611b1ff2ea9
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Aug 29 12:57:45 2016 -0400
feat: rewrite top level
commit a1c03e779234b0aea02206d92ec3dcc15758507e
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Aug 26 17:34:36 2016 -0400
in a weird place rn
8 years ago
Adam Pash
9da7a6f2a9
feat: find top candidate function
8 years ago
Adam Pash
e2600231ac
feat: added linkDensity function
8 years ago
Adam Pash
c470261d41
fix: changed parseInt to parseFloat
8 years ago
Adam Pash
44eae5e931
feat: added scoreContent function
8 years ago
Adam Pash
bd7ed77f23
Lots of progress on score-content
8 years ago
Adam Pash
cc734c7e7d
chore: cleaned up repetitive testing for dom
8 years ago
Adam Pash
f3b1fefba6
chore: refactored tests
8 years ago
Adam Pash
d4a19e6a27
feat: ported scoring methods with unit tests
8 years ago
Adam Pash
97087bd626
chore: refactored to slightly cleaner file structure (more to do here)
8 years ago
Adam Pash
67e212ffac
feat: convertToParagraphs function working
8 years ago
Adam Pash
c237245e89
Converting multiple line breaks to p
8 years ago
Adam Pash
95d02dadd1
simple logic in place for brsToPs
8 years ago
Adam Pash
777e11c25c
Stripping unlikely candidates from DOM
8 years ago