Commit Graph

34 Commits

Author SHA1 Message Date
Adam Pash
eaea57461a fix: servers returning bad headers was breaking request. temporarily (#20)
using fork with a fix for this until request merges the necessary pull request
2016-11-15 13:17:01 -08:00
Adam Pash
629eada1f7 feat: recording/playing back network requests with nock (#18)
* feat: recording/playing back network requests with nock

* lint fix
2016-10-28 14:54:12 -07:00
Adam Pash
e325d860fd Feat: improving ci (#16)
This commit also swaps in yarn for npm and tweaks circle ci a bit.

* appveyor.yml first go

* changing node

* ps

* narrow it down

* trying this

* fix airbnb module

* trying with yarn

* logging

* hybrid?

* trying yarn w/circle

* bump workers?

* build off?

* updating script

* tweaking script for appveyor

* bumping maxworkers

* cleaning up

* build step?

* yarn it

* added appveyor badge
2016-10-28 09:16:21 -07:00
Adam Pash
071218ab3c chore: added repo 2016-10-27 16:53:25 -07:00
Adam Pash
048d654417 feat: parser auto-generates name; lint is more specific 2016-10-27 14:54:38 -07:00
Adam Pash
7fa90f59b7 making all.js export a generic function to decrease possibility of error 2016-10-27 10:19:21 -07:00
Adam Pash
a73246306d feat: quicker lint by being more specific 2016-10-26 16:05:00 -07:00
Adam Pash
4b5c029093 feat: added all-contributors 2016-10-26 15:42:55 -07:00
Adam Pash
eb0aa0b1f6 feat: some small tweaks to toy's excellent parsers ☺️
Squashed commit of the following:

commit 9638220124a325322d6cda7d16c645185d5fe827
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 11:02:29 2016 -0700

    fix: removed eslint plugin that was adding unneeded async parens

commit ce2268c0f7c1b093c06f156730a0f1bc2aaba39c
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 10:47:36 2016 -0700

    style: fix async in parens

commit 9591856915eddaf93170da1ce9225b8a378bdf55
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 10:37:11 2016 -0700

    fix: remove parens around async

commit 6c56054717acc1f7e5499691780f8273f6d07bac
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 10:35:50 2016 -0700

    fix msn fixture; adjusted yahoo test

commit 4fc117ad5fdc5528f29b0873d60a6a1709642f15
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 10:14:38 2016 -0700

    removed dek and date_published tests; neither exist in littlethings

commit 401094b4abc52901255fd2461f5839624f11d8a3
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 10:08:44 2016 -0700

    feat: updated buzzfeed for content extraction

commit 19548a5485f70ff9b65e3e725d2364d07734ac9c
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 09:54:30 2016 -0700

    fix: generator should make transforms an object, not array

commit b92113f9f7c97aca9e6d3ce9243abac967d26b63
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 08:54:38 2016 -0700

    feat: updated politico

commit c026591040f7671cb2a6dd5177a995e21d015482
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 08:48:52 2016 -0700

    fix: typos

commit 14aa8fa4ce38ff1c2a212cd0225437ae3042c2c3
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 08:36:12 2016 -0700

    fix: incorrect command in readme

commit fe260e6122877e2cb0130a1ecde0e503017057a3
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Oct 10 08:31:11 2016 -0700

    fix: removed dek test because there is no dek on wikia
2016-10-10 11:03:43 -07:00
Adam Pash
173f885674 feat: custom parser + generator + detailed readme instructions
Squashed commit of the following:

commit 02563daa67712c3679258ebebac60dfa9568dffb
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 30 12:25:44 2016 -0400

    updated readme, added newyorker parser for readme guide

commit 0ac613ef823efbffbf4cc9a89e5cb2489d1c4f6f
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 30 11:16:52 2016 -0400

    feat: updated parser so the saved fixture absolutizes urls

commit 85c7a2660b21f95c2205ca4a4378a7570687fed0
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 30 10:15:26 2016 -0400

    refactor: attribute selectors must be an array for custom extractors

commit f60f93d5d3d9b2f2d9ec6f28d27ae9dcf16ef01e
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 29 10:13:14 2016 -0400

    fix: whitelisting srcset and alt attributes

commit e31cb1f4e8a9fc9c3d9b20ef9f40ca6c8d6ad51a
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 29 09:44:21 2016 -0400

    some housekeeping for coverage tests

commit 39eafe420c776a1fe7f9fea634fb529a3ed75a71
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Sep 28 17:52:08 2016 -0400

    fix: word count for multi-page articles

commit b04e0066b52f190481b1b604c64e3d0b1226ff02
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 22 10:40:23 2016 -0400

    major improvements to output

commit 3f3a880b63b47fe21953485da670b6e291ac60e5
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Sep 21 17:27:53 2016 -0400

    updated test command

commit 14503426557a870755453572221d95c92cff4bd2
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Sep 21 16:00:30 2016 -0400

    shortened generator command

commit 5ebd8343cd4b87b3f5787dab665bff0de96846e1
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Sep 21 15:59:14 2016 -0400

    feat: can disable fallback to generic parser (this will be useful for testing custom parsers)
2016-09-30 12:26:25 -04:00
Adam Pash
ad42055f8f feat: switched test framework to jest 2016-09-20 10:52:16 -04:00
Adam Pash
8f42e119e8 feat: generator for custom parsers and some documentation
Squashed commit of the following:

commit deaf9e60d031d9ee06e74b8c0895495b187032a5
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 20 10:31:09 2016 -0400

    chore: README for custom parsers

commit a8e8ad633e0d1576a52dbc90ce31b98fb2ec21ee
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 19 23:36:09 2016 -0400

    draft of readme

commit 4f0f463f821465c282ce006378e5d55f8f41df5f
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 19 17:56:34 2016 -0400

    custom extractor used to build basic parser for theatlantic

commit c5562a3cede41f56c4e723dcfa1181b49dcaae4d
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 19 17:20:13 2016 -0400

    pre-commit to test custom parser generator

commit 7d50d5b7ab780b79fae38afcb87a7d1da5d139b2
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 19 17:19:55 2016 -0400

    feat: added nytimes parser

commit 58b8d83a56927177984ddfdf70830bc4f328f200
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 19 17:17:28 2016 -0400

    feat: can do fuzzy search or go straight to file

commit c99add753723a8e2ac64d51d7379ac8e23125526
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 19 10:52:26 2016 -0400

    refactored export for custom extractors for easier renames

commit 22563413669651bb497f1bb2a92085b71f2ae324
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 16 17:36:13 2016 -0400

    feat: custom extractor generation in place

commit 2285a29908a7f82a5de3c81f6b2b902ddec9bdaa
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 16 16:42:20 2016 -0400

    good progress
2016-09-20 10:37:03 -04:00
Adam Pash
f58ccec7aa fix: including babel-runtime as a bandaid for polyfill error 2016-09-19 11:24:43 -04:00
Adam Pash
59fb4c4974 fix: using transform-runtime to avoid babel-polyfill conflicts when used
in external code
2016-09-19 11:04:35 -04:00
Adam Pash
2ae2dba690 chore: renamed iris to mercury 2016-09-16 13:26:37 -04:00
Adam Pash
d60d396c98 feat: added text direction to response 2016-09-15 15:08:04 -04:00
Adam Pash
c76435ce62 updated name in package.json 2016-09-14 15:06:54 -04:00
Adam Pash
76df30e303 chore: cleanup 2016-09-14 14:28:45 -04:00
Adam Pash
67296691c2 refactor: page collection 2016-09-14 11:12:28 -04:00
Adam Pash
3694c2d12c chore: improve linter/babelrc 2016-09-14 10:14:19 -04:00
Adam Pash
7e2a34945f chore: refactored and linted 2016-09-13 15:22:27 -04:00
Adam Pash
7ec0ed0d31 feat: nextPageUrl handles multi-page articles
Squashed commit of the following:

commit b5070c0967a7f1a0c0c449ba7ea40aebe8fe4bb8
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 13 10:03:00 2016 -0400

    root extractor includes next page url

commit 79be83127d5342d89eef33665586fabea227d6b3
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 13 09:58:20 2016 -0400

    small score adjustment

commit 0f00507dbff43401145a892e849311518edec68a
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 12 18:17:38 2016 -0400

    feat: nextPageUrl generic parser up and running

commit be91c589fc0c6d6f9b573080a76c9b1ac7af710c
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 12 11:53:58 2016 -0400

    feat: pageNumFromUrl extracts the pagenum of the current url

commit ad879d7aabedadfd051c01b42d841703bf4763fa
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Sep 12 11:52:37 2016 -0400

    feat: isWordpress checks if a page is generated by wordpress
2016-09-13 10:08:49 -04:00
Adam Pash
c48e3485c0 chore: code reorganization
Squashed commit of the following:

commit 636296841d5cf5e685237fe70db7a15305d8e966
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 9 13:37:21 2016 -0400

    final cleanup

commit 51f712b3074d41a1f2da91519289d4dd09719ad0
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 9 13:25:28 2016 -0400

    Another big pass

commit 3860e6d872a9adb9290093fd9c8708dfcc773c28
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 9 12:49:52 2016 -0400

    chore: started reorganizing
2016-09-09 13:44:58 -04:00
Adam Pash
8da2425e59 feat: resource fetches content from a URL and prepares for parsing
Squashed commit of the following:

commit 7ba2d2b36d175f5ccbc02f918322ea0dd44bf2c1
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 17:55:10 2016 -0400

    feat: resource fetches content from a URL and prepares for parsing

commit 0abdfa49eed5b363169070dac6d65d0a5818c918
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 17:54:07 2016 -0400

    fix: this was messing up double Esses ('ss', as in class => cla)

commit 9dc65a99631e3a68267a68b2b4629c4be8f61546
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 14:58:57 2016 -0400

    fix: test suite working w/new dirs

commit 993dc33a5229bfa22ea998e3c4fe105be9d91c21
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 14:49:39 2016 -0400

    feat: convertLazyLoadedImages puts img urls in the src

commit e7fb105443dd16d036e460ad21fbcb47191f475b
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 14:30:43 2016 -0400

    feat: makeLinksAbsolute to fully qualify urls

commit dbd665078af854efe84bbbfe9b55acd02e1a652f
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 13:38:33 2016 -0400

    feat: fetchResource to fetch a url and validate the response

commit 42d3937c8f0f8df693996c2edee93625f13dced7
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 10:25:34 2016 -0400

    feat: normalizing meta tags
2016-09-06 17:55:45 -04:00
Adam Pash
752331eaae feat: bundling with rollup
Squashed commit of the following:

commit 52bcf0f2dd79bcb2ee21bc134522edd259a3d35e
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 2 13:42:29 2016 -0400

    fix: converting date to ISO string

commit 11e827e27129ac229a96f66ca03f0b18dc5d289d
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 2 13:42:12 2016 -0400

    feat: bundling with rollup

commit 1ff752a3e44e5836b955f7f15c799abbbdfc9207
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Sep 2 12:11:39 2016 -0400

    clean
2016-09-02 13:43:03 -04:00
Adam Pash
0ff3082295 feat: GenericExtractLeadImageUrl
Squashed commit of the following:

commit 22d37ebf26dbbd0a3daebbfde3509a6ce04aaf72
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 1 17:50:13 2016 -0400

    feat: GenericExtractLeadImageUrl

commit 3327a0a7929dd0e9267dc9c26f4e2aa78c32586f
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 1 15:33:42 2016 -0400

    feat: can pass custom attributes to extractFromMeta
2016-09-01 17:50:42 -04:00
Adam Pash
956fd678f7 feat: GenericDatePublishedExtractor
Squashed commit of the following:

commit 8eda4606e773147ae8dd67666d1a64d659f9fdad
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 1 12:28:06 2016 -0400

    feat: GenericDatePublishedExtractor

commit 935510fe9bc0a92f68fca7faf66019cb45330097
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 1 09:28:42 2016 -0400

    updated todo
2016-09-01 12:28:39 -04:00
Adam Pash
746d07d4a2 feat: title extraction and scaffolding for more
Squashed commit of the following:

commit 31d8b63dcb3ec9bbd6c8e7a10852fbd060e91103
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Aug 31 15:52:27 2016 -0400

    feat: title extraction

commit 7002c552a9f5bb54630455d983b699c041c629fc
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Aug 31 14:21:29 2016 -0400

    feat: withinComment checks if a node is inside a comment

commit 57f06ef5b499c2f747edee0c9eb276e38984de9a
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Aug 31 13:40:36 2016 -0400

    feat: extractFromMeta function

commit 0947f21aae94fa5ce462246ed5cb53144d563931
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Aug 31 13:32:30 2016 -0400

    fix: returning original string if no tags in string

commit dd6b032e5f9877395b9600480dd96c6fdf60cecd
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Aug 31 12:03:58 2016 -0400

    feat: clean title function removes junk from titles

commit f33b3eef29ad7692441bd0e5aa26b11dd4411dde
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Aug 31 12:03:35 2016 -0400

    chore: renamed function to correct name

commit 076a986b12df68a939a8efa773e01d08780d79aa
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Aug 31 12:02:18 2016 -0400

    feat: utility method to strip tags from text

commit f3e98cdf0a0d7601fab9e8824c0cde73ded51651
Author: Adam Pash <adam.pash@gmail.com>
Date:   Wed Aug 31 11:31:33 2016 -0400

    feat: resolveSplitTitle cleans raw title text
2016-08-31 15:52:48 -04:00
Adam Pash
e1ef25aab1 fix: added babel-polyfill for bug in Reflect 2016-08-30 16:07:09 -04:00
Adam Pash
93e844cdfe feat: implemented extractBestNode functionality
Squashed commit of the following:

commit 9af554dd975ff1778ed70c71fa9bde667fc5f880
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Aug 30 15:19:32 2016 -0400

    feat: add cleanHeaders

commit 0dfea98eedc4f97fcbd78866322595c705e20521
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Aug 30 14:30:49 2016 -0400

    fix: scoring parent nodes recursively

commit b6e5897a694adeb81e25a905aba72c0f45a8cc94
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Aug 30 12:47:24 2016 -0400

    feat: extract clean node up and running

commit fb652c5db13db6bce7271efd68ba4b20515e9549
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Aug 30 09:57:21 2016 -0400

    chore: added test for p tags with nested tags (e.g., img, iframe)

commit 731d0a2e4d89121dfafad195e9d0911805c4f8e4
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 17:50:33 2016 -0400

    feat: extract clean node integrates most functions

commit 322bc6534d30feb7c1c08d3813132badc6286b40
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 16:46:04 2016 -0400

    feat: removing empty nodes as defined in constants

commit f1d38932ea12a865814d2326970031fcb8515baa
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 16:33:31 2016 -0400

    feat: cleaning attributes from nodes

commit 0aa73ada6854af0ecd504bfe3d926a9524787ab5
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 16:09:56 2016 -0400

    feat: cleaning h1s from text

commit 12d4a309246285c278ce7765e4fbaa8271bb5889
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 15:52:03 2016 -0400

    feat: removing spacer images

commit 4e74ff830cc67586560f6fc72e2cfa432a3a2647
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 15:38:49 2016 -0400

    feat: stripping unwanted html from doc

commit c774166e90169fd0c1aa89898d3f7a975e82bf0a
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 15:17:32 2016 -0400

    feat: removing small images, height attribute from images

commit 3a8642f42cda451669c832482c5e1611b1ff2ea9
Author: Adam Pash <adam.pash@gmail.com>
Date:   Mon Aug 29 12:57:45 2016 -0400

    feat: rewrite top level

commit a1c03e779234b0aea02206d92ec3dcc15758507e
Author: Adam Pash <adam.pash@gmail.com>
Date:   Fri Aug 26 17:34:36 2016 -0400

    in a weird place rn
2016-08-30 15:25:25 -04:00
Adam Pash
89a2cfbb82 getWeight with tests 2016-08-23 13:06:43 -04:00
Adam Pash
f3aebb2a16 Basic testing in place 2016-08-23 11:03:31 -04:00
Adam Pash
8efcc70eef bringing in cheerio 2016-08-23 10:30:40 -04:00
Adam Pash
b349a1eac5 using rollup 2016-08-22 14:53:42 -04:00