Commit Graph

158 Commits

Author SHA1 Message Date
Adam Pash
33c7e0d1c9 feat: Improved dateString parsing to handle more; first trying to parse without cleaning 2016-09-09 09:59:56 -04:00
Adam Pash
91881df523 refactor: cleaners now run on custom extractors
Squashed commit of the following:

commit e4c7d1d149d1846f0d589b3653655b81b477c682
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 8 19:29:26 2016 -0400

    refactor: cleaners now run on custom extractors

commit ca08d2482c54bf6a40f50758da9353f00987a4d7
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 8 14:42:19 2016 -0400

    moved cleaners, refactored as necessary

commit ec2c5d36410b255c6d8ee264deca990c46709c3c
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 8 14:07:01 2016 -0400

    moved datePublished cleaner

commit 5e55e397eecb3e88d64cd2aa2c6071c9cffed272
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 8 13:34:21 2016 -0400

    moved dek cleaner

commit 2dfb0c44d7882336992fdc864792df6eac094c21
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 8 13:29:37 2016 -0400

    moved lead-image-url

commit cef7a213b80ddd671249225622f1388f9e68896c
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 8 13:26:20 2016 -0400

    moved author
2016-09-08 19:31:45 -04:00
Adam Pash
603682239d feat: basic wikipedia custom extractor 2016-09-08 13:19:06 -04:00
Adam Pash
9665fe7209 feat: blogspot.com custom extractor 2016-09-08 12:19:54 -04:00
Adam Pash
6c6451b34b fix: duplicate key bug 2016-09-08 12:15:45 -04:00
Adam Pash
93ca688955 fix: dek and leadImg should not be html 2016-09-08 11:24:19 -04:00
Adam Pash
45ef18ba37 fix: brought .html fixtures into project dir 2016-09-08 11:07:51 -04:00
Adam Pash
7d88fee199 feat: RootExtractor performs extraction using custom and generic
extraction methods
2016-09-08 11:00:29 -04:00
Adam Pash
937138c7bb refactor: improve extractor args; passing as object 2016-09-07 17:53:59 -04:00
Adam Pash
ecacc6ce12 Some good basic restructuring 2016-09-07 15:47:40 -04:00
Adam Pash
b3f90c489e basic merging of extracting sources 2016-09-07 15:36:05 -04:00
Adam Pash
0f45b39ca2 refactor: preparing for extraction merging 2016-09-07 14:40:22 -04:00
Adam Pash
a022252a14 feat: getExtractor returns generic extractor 2016-09-07 13:56:57 -04:00
Adam Pash
c40b702b93 clean formatting 2016-09-07 13:43:12 -04:00
Adam Pash
dfb5334f18 fix: encoding request response as null
This fixes an issue with gzipped content
2016-09-07 13:29:11 -04:00
Adam Pash
ddc684c7d3 updated constants 2016-09-07 12:46:03 -04:00
Adam Pash
189361dc20 cleanup 2016-09-07 11:26:31 -04:00
Adam Pash
ac62e0fba0 fix: pre-loading html in resource 2016-09-07 11:01:20 -04:00
Adam Pash
3128baeda1 cleanup 2016-09-07 11:01:02 -04:00
Adam Pash
86b2ee194c feat: can pass in raw html if already fetched 2016-09-07 10:08:55 -04:00
Adam Pash
8da2425e59 feat: resource fetches content from a URL and prepares for parsing
Squashed commit of the following:

commit 7ba2d2b36d175f5ccbc02f918322ea0dd44bf2c1
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 17:55:10 2016 -0400

    feat: resource fetches content from a URL and prepares for parsing

commit 0abdfa49eed5b363169070dac6d65d0a5818c918
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 17:54:07 2016 -0400

    fix: this was messing up double Esses ('ss', as in class => cla)

commit 9dc65a99631e3a68267a68b2b4629c4be8f61546
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 14:58:57 2016 -0400

    fix: test suite working w/new dirs

commit 993dc33a5229bfa22ea998e3c4fe105be9d91c21
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 14:49:39 2016 -0400

    feat: convertLazyLoadedImages puts img urls in the src

commit e7fb105443dd16d036e460ad21fbcb47191f475b
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 14:30:43 2016 -0400

    feat: makeLinksAbsolute to fully qualify urls

commit dbd665078af854efe84bbbfe9b55acd02e1a652f
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 13:38:33 2016 -0400

    feat: fetchResource to fetch a url and validate the response

commit 42d3937c8f0f8df693996c2edee93625f13dced7
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 10:25:34 2016 -0400

    feat: normalizing meta tags
2016-09-06 17:55:45 -04:00
Adam Pash
bc97156718 fix: better scoring for image extensions 2016-09-06 10:01:56 -04:00
Adam Pash
11a2286659 notes, cleanup 2016-09-06 09:55:36 -04:00
Adam Pash
752331eaae feat: bundling with rollup
Squashed commit of the following:

commit 52bcf0f2dd79bcb2ee21bc134522edd259a3d35e
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 2 13:42:29 2016 -0400

    fix: converting date to ISO string

commit 11e827e27129ac229a96f66ca03f0b18dc5d289d
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 2 13:42:12 2016 -0400

    feat: bundling with rollup

commit 1ff752a3e44e5836b955f7f15c799abbbdfc9207
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 2 12:11:39 2016 -0400

    clean
2016-09-02 13:43:03 -04:00
Adam Pash
0ff3082295 feat: GenericExtractLeadImageUrl
Squashed commit of the following:

commit 22d37ebf26dbbd0a3daebbfde3509a6ce04aaf72
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 1 17:50:13 2016 -0400

    feat: GenericExtractLeadImageUrl

commit 3327a0a7929dd0e9267dc9c26f4e2aa78c32586f
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 1 15:33:42 2016 -0400

    feat: can pass custom attributes to extractFromMeta
2016-09-01 17:50:42 -04:00
Adam Pash
467b600721 feat: extract dek stubbed (not currently functional) 2016-09-01 14:09:28 -04:00
Adam Pash
d3b791d516 fix: title wasn't cleaning html tags 2016-09-01 13:45:00 -04:00
Adam Pash
956fd678f7 feat: GenericDatePublishedExtractor
Squashed commit of the following:

commit 8eda4606e773147ae8dd67666d1a64d659f9fdad
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 1 12:28:06 2016 -0400

    feat: GenericDatePublishedExtractor

commit 935510fe9bc0a92f68fca7faf66019cb45330097
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 1 09:28:42 2016 -0400

    updated todo
2016-09-01 12:28:39 -04:00
Adam Pash
29db4a6ee0 feat: extract author 2016-08-31 18:04:19 -04:00
Adam Pash
7e28871a02 chore: plumbing 2016-08-31 16:08:52 -04:00
Adam Pash
746d07d4a2 feat: title extraction and scaffolding for more
Squashed commit of the following:

commit 31d8b63dcb3ec9bbd6c8e7a10852fbd060e91103
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Aug 31 15:52:27 2016 -0400

    feat: title extraction

commit 7002c552a9f5bb54630455d983b699c041c629fc
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Aug 31 14:21:29 2016 -0400

    feat: withinComment checks if a node is inside a comment

commit 57f06ef5b499c2f747edee0c9eb276e38984de9a
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Aug 31 13:40:36 2016 -0400

    feat: extractFromMeta function

commit 0947f21aae94fa5ce462246ed5cb53144d563931
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Aug 31 13:32:30 2016 -0400

    fix: returning original string if no tags in string

commit dd6b032e5f9877395b9600480dd96c6fdf60cecd
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Aug 31 12:03:58 2016 -0400

    feat: clean title function removes junk from titles

commit f33b3eef29ad7692441bd0e5aa26b11dd4411dde
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Aug 31 12:03:35 2016 -0400

    chore: renamed function to correct name

commit 076a986b12df68a939a8efa773e01d08780d79aa
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Aug 31 12:02:18 2016 -0400

    feat: utility method to strip tags from text

commit f3e98cdf0a0d7601fab9e8824c0cde73ded51651
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Aug 31 11:31:33 2016 -0400

    feat: resolveSplitTitle cleans raw title text
2016-08-31 15:52:48 -04:00
Adam Pash
07834c0e15 refactor: restructuring for metadata extraction 2016-08-31 09:42:04 -04:00
Adam Pash
ebea6254b5 ignore npm-debug.log 2016-08-31 09:30:43 -04:00
Adam Pash
95085d1a11 chore: cleanup 2016-08-30 17:08:55 -04:00
Adam Pash
e1ef25aab1 fix: added babel-polyfill for bug in Reflect 2016-08-30 16:07:09 -04:00
Adam Pash
93e844cdfe feat: implemented extractBestNode functionality
Squashed commit of the following:

commit 9af554dd975ff1778ed70c71fa9bde667fc5f880
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Aug 30 15:19:32 2016 -0400

    feat: add cleanHeaders

commit 0dfea98eedc4f97fcbd78866322595c705e20521
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Aug 30 14:30:49 2016 -0400

    fix: scoring parent nodes recursively

commit b6e5897a694adeb81e25a905aba72c0f45a8cc94
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Aug 30 12:47:24 2016 -0400

    feat: extract clean node up and running

commit fb652c5db13db6bce7271efd68ba4b20515e9549
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Aug 30 09:57:21 2016 -0400

    chore: added test for p tags with nested tags (e.g., img, iframe)

commit 731d0a2e4d89121dfafad195e9d0911805c4f8e4
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 17:50:33 2016 -0400

    feat: extract clean node integrates most functions

commit 322bc6534d30feb7c1c08d3813132badc6286b40
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 16:46:04 2016 -0400

    feat: removing empty nodes as defined in constants

commit f1d38932ea12a865814d2326970031fcb8515baa
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 16:33:31 2016 -0400

    feat: cleaning attributes from nodes

commit 0aa73ada6854af0ecd504bfe3d926a9524787ab5
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 16:09:56 2016 -0400

    feat: cleaning h1s from text

commit 12d4a309246285c278ce7765e4fbaa8271bb5889
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 15:52:03 2016 -0400

    feat: removing spacer images

commit 4e74ff830cc67586560f6fc72e2cfa432a3a2647
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 15:38:49 2016 -0400

    feat: stripping unwanted html from doc

commit c774166e90169fd0c1aa89898d3f7a975e82bf0a
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 15:17:32 2016 -0400

    feat: removing small images, height attribute from images

commit 3a8642f42cda451669c832482c5e1611b1ff2ea9
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 12:57:45 2016 -0400

    feat: rewrite top level

commit a1c03e779234b0aea02206d92ec3dcc15758507e
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Aug 26 17:34:36 2016 -0400

    in a weird place rn
2016-08-30 15:25:25 -04:00
Adam Pash
9da7a6f2a9 feat: find top candidate function 2016-08-26 09:21:47 -04:00
Adam Pash
e2600231ac feat: added linkDensity function 2016-08-25 17:43:29 -04:00
Adam Pash
c470261d41 fix: changed parseInt to parseFloat 2016-08-25 15:50:59 -04:00
Adam Pash
44eae5e931 feat: added scoreContent function 2016-08-25 15:31:09 -04:00
Adam Pash
bd7ed77f23 Lots of progress on score-content 2016-08-24 18:23:51 -04:00
Adam Pash
cc734c7e7d chore: cleaned up repetitive testing for dom 2016-08-24 15:50:51 -04:00
Adam Pash
f3b1fefba6 chore: refactored tests 2016-08-24 15:35:27 -04:00
Adam Pash
d4a19e6a27 feat: ported scoring methods with unit tests 2016-08-24 15:30:16 -04:00
Adam Pash
97087bd626 chore: refactored to slightly cleaner file structure (more to do here) 2016-08-24 11:20:13 -04:00
Adam Pash
67e212ffac feat: convertToParagraphs function working 2016-08-24 10:52:29 -04:00
Adam Pash
c237245e89 Converting multiple line breaks to p 2016-08-24 10:02:46 -04:00
Adam Pash
95d02dadd1 simple logic in place for brsToPs 2016-08-23 16:04:00 -04:00
Adam Pash
d70b9f6709 updated todo 2016-08-23 15:15:12 -04:00
Adam Pash
777e11c25c Stripping unlikely candidates from DOM 2016-08-23 15:03:03 -04:00