Adam Pash
81ed4f00ed
feat: improve nymag.com extractor to grab deks from features
8 years ago
Adam Pash
21f444367f
feat: added page counts
8 years ago
Adam Pash
f3a5d0ecca
feat: added domain and url extractor (using same extractor)
...
commit 43ab423d575cd15cc55041fb3fe2f21ffdd7adff
Author: Adam Pash <adam.pash@gmail.com>
Date: Wed Sep 14 11:57:25 2016 -0400
8 years ago
Adam Pash
67296691c2
refactor: page collection
8 years ago
Adam Pash
b325a4acdd
chore: clean up junk tests
8 years ago
Adam Pash
547ee2b4ca
Merge pull request #1 from postlight/test-fix-fixture-locations
...
Fix Fixture Locations
8 years ago
Adam Pash
62ae330db2
fix: bug in scoring and converting to paragraphs
8 years ago
Adam Pash
3694c2d12c
chore: improve linter/babelrc
8 years ago
Jeremy Mack
7ca19d2e6f
test: fix fixture locations
8 years ago
Adam Pash
7e2a34945f
chore: refactored and linted
8 years ago
Adam Pash
9906bd36a4
chore: moved content scoring out of utils, removed no-longer-necessary utils
8 years ago
Adam Pash
7ec0ed0d31
feat: nextPageUrl handles multi-page articles
...
Squashed commit of the following:
commit b5070c0967a7f1a0c0c449ba7ea40aebe8fe4bb8
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Sep 13 10:03:00 2016 -0400
root extractor includes next page url
commit 79be83127d5342d89eef33665586fabea227d6b3
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Sep 13 09:58:20 2016 -0400
small score adjustment
commit 0f00507dbff43401145a892e849311518edec68a
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Sep 12 18:17:38 2016 -0400
feat: nextPageUrl generic parser up and running
commit be91c589fc0c6d6f9b573080a76c9b1ac7af710c
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Sep 12 11:53:58 2016 -0400
feat: pageNumFromUrl extracts the pagenum of the current url
commit ad879d7aabedadfd051c01b42d841703bf4763fa
Author: Adam Pash <adam.pash@gmail.com>
Date: Mon Sep 12 11:52:37 2016 -0400
feat: isWordpress checks if a page is generated by wordpress
8 years ago
Adam Pash
a89b9b785e
feat: small improvement to author selectors
8 years ago
Adam Pash
acaab70ee2
fix: scorePs parent scoring was overwriting child scoring
8 years ago
Adam Pash
8fe3bec6b6
fix: accepting cookies with request (required for sites like nytimes.com)
8 years ago
Adam Pash
74694ba8e2
debugging: cheerio isn't always consistent in setting scores
8 years ago
Adam Pash
47ac7e9803
refactor: limiting calls to $ function
...
Squashed commit of the following:
commit c72da261cb5319d1eef207bff63b3c9cd49018df
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 9 15:28:43 2016 -0400
refactor: limiting calls to $ function
commit eeae88247d844d5c6acbc529dbc3ce4d14e04191
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 9 15:14:33 2016 -0400
refactor: convertNodeTo; requires a cheerio object
8 years ago
Adam Pash
81e9e7a317
feat: whitelisting attrs to keep
8 years ago
Adam Pash
7b97559778
chore: remove logic for fetching meta tags with custom attrs (resource normalizes this now)
8 years ago
Adam Pash
c48e3485c0
chore: code reorganization
...
Squashed commit of the following:
commit 636296841d5cf5e685237fe70db7a15305d8e966
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 9 13:37:21 2016 -0400
final cleanup
commit 51f712b3074d41a1f2da91519289d4dd09719ad0
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 9 13:25:28 2016 -0400
Another big pass
commit 3860e6d872a9adb9290093fd9c8708dfcc773c28
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 9 12:49:52 2016 -0400
chore: started reorganizing
8 years ago
Adam Pash
f2729a5ee6
improved wiki extractor
8 years ago
Adam Pash
52e89a0229
fix: cleaning embed and object nodes
8 years ago
Adam Pash
edfb54c532
feat: links are rewritten to absolute in cleaner
...
Squashed commit of the following:
commit 9057d411a5458f80c316604559c469a239ef3a40
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 9 11:42:19 2016 -0400
feat: links are rewritten to absolute in cleaner
8 years ago
Adam Pash
bdc2c0c1da
feat: can now fetch attrs in RootExtractor's select method
8 years ago
Adam Pash
33c7e0d1c9
feat: Improved dateString parsing to handle more; first trying to parse without cleaning
8 years ago
Adam Pash
91881df523
refactor: cleaners now run on custom extractors
...
Squashed commit of the following:
commit e4c7d1d149d1846f0d589b3653655b81b477c682
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 8 19:29:26 2016 -0400
refactor: cleaners now run on custom extractors
commit ca08d2482c54bf6a40f50758da9353f00987a4d7
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 8 14:42:19 2016 -0400
moved cleaners, refactored as necessary
commit ec2c5d36410b255c6d8ee264deca990c46709c3c
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 8 14:07:01 2016 -0400
moved datePublished cleaner
commit 5e55e397eecb3e88d64cd2aa2c6071c9cffed272
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 8 13:34:21 2016 -0400
moved dek cleaner
commit 2dfb0c44d7882336992fdc864792df6eac094c21
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 8 13:29:37 2016 -0400
moved lead-image-url
commit cef7a213b80ddd671249225622f1388f9e68896c
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 8 13:26:20 2016 -0400
moved author
8 years ago
Adam Pash
603682239d
feat: basic wikipedia custom extractor
8 years ago
Adam Pash
9665fe7209
feat: blogspot.com custom extractor
8 years ago
Adam Pash
6c6451b34b
fix: duplicate key bug
8 years ago
Adam Pash
93ca688955
fix: dek and leadImg should not be html
8 years ago
Adam Pash
45ef18ba37
fix: brought .html fixtures into project dir
8 years ago
Adam Pash
7d88fee199
feat: RootExtractor performs extraction using custom and generic extraction methods
8 years ago
Adam Pash
937138c7bb
refactor: improve extractor args; passing as object
8 years ago
Adam Pash
ecacc6ce12
Some good basic restructuring
8 years ago
Adam Pash
b3f90c489e
basic merging of extracting sources
8 years ago
Adam Pash
0f45b39ca2
refactor: preparing for extraction merging
8 years ago
Adam Pash
a022252a14
feat: getExtractor returns generic extractor
8 years ago
Adam Pash
c40b702b93
clean formatting
8 years ago
Adam Pash
dfb5334f18
fix: encoding request response as null
...
This fixes an issue with gzipped content
8 years ago
Adam Pash
ddc684c7d3
updated constants
8 years ago
Adam Pash
189361dc20
cleanup
8 years ago
Adam Pash
ac62e0fba0
fix: pre-loading html in resource
8 years ago
Adam Pash
3128baeda1
cleanup
8 years ago
Adam Pash
86b2ee194c
feat: can pass in raw html if already fetched
8 years ago
Adam Pash
8da2425e59
feat: resource fetches content from a URL and prepares for parsing
...
Squashed commit of the following:
commit 7ba2d2b36d175f5ccbc02f918322ea0dd44bf2c1
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Sep 6 17:55:10 2016 -0400
feat: resource fetches content from a URL and prepares for parsing
commit 0abdfa49eed5b363169070dac6d65d0a5818c918
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Sep 6 17:54:07 2016 -0400
fix: this was messing up double Esses ('ss', as in class => cla)
commit 9dc65a99631e3a68267a68b2b4629c4be8f61546
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Sep 6 14:58:57 2016 -0400
fix: test suite working w/new dirs
commit 993dc33a5229bfa22ea998e3c4fe105be9d91c21
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Sep 6 14:49:39 2016 -0400
feat: convertLazyLoadedImages puts img urls in the src
commit e7fb105443dd16d036e460ad21fbcb47191f475b
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Sep 6 14:30:43 2016 -0400
feat: makeLinksAbsolute to fully qualify urls
commit dbd665078af854efe84bbbfe9b55acd02e1a652f
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Sep 6 13:38:33 2016 -0400
feat: fetchResource to fetch a url and validate the response
commit 42d3937c8f0f8df693996c2edee93625f13dced7
Author: Adam Pash <adam.pash@gmail.com>
Date: Tue Sep 6 10:25:34 2016 -0400
feat: normalizing meta tags
8 years ago
Adam Pash
bc97156718
fix: better scoring for image extensions
8 years ago
Adam Pash
11a2286659
notes, cleanup
8 years ago
Adam Pash
752331eaae
feat: bundling with rollup
...
Squashed commit of the following:
commit 52bcf0f2dd79bcb2ee21bc134522edd259a3d35e
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 2 13:42:29 2016 -0400
fix: converting date to ISO string
commit 11e827e27129ac229a96f66ca03f0b18dc5d289d
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 2 13:42:12 2016 -0400
feat: bundling with rollup
commit 1ff752a3e44e5836b955f7f15c799abbbdfc9207
Author: Adam Pash <adam.pash@gmail.com>
Date: Fri Sep 2 12:11:39 2016 -0400
clean
8 years ago
Adam Pash
0ff3082295
feat: GenericExtractLeadImageUrl
...
Squashed commit of the following:
commit 22d37ebf26dbbd0a3daebbfde3509a6ce04aaf72
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 1 17:50:13 2016 -0400
feat: GenericExtractLeadImageUrl
commit 3327a0a7929dd0e9267dc9c26f4e2aa78c32586f
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 1 15:33:42 2016 -0400
feat: can pass custom attributes to extractFromMeta
8 years ago
Adam Pash
467b600721
feat: extract dek stubbed (not currently functional)
8 years ago