Commit Graph

155 Commits

Author SHA1 Message Date
Michael Ashley
e12c916499 feat: ability to add custom extractors via api (#484)
* feat: ability to add custom extractors via api

* docs: updating readme

* fix: example.com was being used in another test

* fix: timezone was messing up date_published test

* fix: using a unique site for testing

* fix: updated custom extractor api

* docs: updating readme

* fix: removing unused fixture

* fix: updating test description

* feat: ability to add custom extractors via cli
2019-09-04 07:32:28 -07:00
Sven Wiegand
f95947fe88 Implemented custom extractor epaper.zeit.de (#488) 2019-08-28 07:15:14 -07:00
Michael Ashley
2422e4717d fix: incorrect parsing on medium.com (#477)
* fix: medium extractor now pulls content

* fix: remove youtube caption if no preview available

* fix: remove youtube node if no image

* fix: removing dek from medium.com extractor
2019-08-28 07:04:27 -07:00
Michael Ashley
0686ee7956 fix: incorrect parsing on theatlantic.com (#475)
* fix: incorrect parsing on theatlantic.com

* chore: updating theatlantic.com tests & fixtures

* chore: removing script data from minified fixture
2019-08-20 09:58:24 -07:00
Michael Ashley
5e33263d25 chore: minifying biorxiv.com fixture (#478) 2019-08-20 09:46:15 -07:00
david0leong
911b0f87c8 Add custom extractor for biorxiv.org (#467)
* Add custom extractor for biorxiv.org

* Fix content selector

* Improve content selector
2019-08-19 13:46:03 -07:00
Ben Ubois
0942c37876 feat: custom parser for phoronix.com. (#431) 2019-06-26 09:55:13 -07:00
Michael P. Geraci
571a913745 feat: pitchfork extractor (#439)
* generate the custom extractor and get the first test to pass

* add the basic extractors (title, author, date, etc)

* select the score as well as the review text, and break the content test

* prepend the score to the content

* get the date from the datetime attribute

* mangle this test a little, but just a little (it does work properly)

* move from prepending the score to the review text to adding it as a custom field in the extractor
2019-06-26 09:02:17 -07:00
david0leong
694ea820aa Custom Extractor for clinicaltrials.gov (#305)
* Add prototype of custom extractor for clinicaltrials.gov

* Add .DS_Store to gitignore

* Make tests for title, author and date_published selectors pass

* Make content selector test pass

* Fix date_published test

* Rebuild

* Remove .DS_Store from gitignore

* Improve extractor and test/fixture of clinicaltrials.gov
2019-05-27 09:25:51 +03:00
Wajeeh Zantout
7c8de71c52 fix: new yorker extractor (#414)
* fix: new yorker extractor

* fix: date_published selector

* fix: remove footer from content

* feat: add additional selector for title

* feat: support article with multiple authors
2019-05-15 11:00:50 +03:00
Wajeeh Zantout
e66ad8b81c feat: add le monde extractor (#415) 2019-05-14 14:53:49 +03:00
kik0220
f81dc63617 feat: add rbbtoday.com custom parser (#411)
* feat: add rbbtoday.com custom parser

* fix: content test

* fix: dek and content
2019-05-08 14:04:03 +03:00
kik0220
5e1113b3a9 feat: add japan.zdnet.com custom parser (#410)
* feat: add japan.zdnet.com custom parser

* fix: author and date_published selector
2019-05-08 13:51:03 +03:00
kik0220
77e3bc00e2 feat: add wired.jp custom parser (#409)
* feat: add wired.jp custom parser

* fix: author test

* fix: date_published selector

* test: fix dek and content

* test: fix content (without clean dek)
2019-05-08 13:32:04 +03:00
kik0220
0b36c96de0 feat: add techlog.iij.ad.jp custom parser (#405)
* feat: add techlog.iij.ad.jp custom parser

* fix: date_published and content selector
2019-05-08 13:20:47 +03:00
kik0220
406bf1b1a9 feat: add weekly.ascii.jp custom parser (#401)
* feat: add weekly.ascii.jp custom parser

* fix: title and date_published selector
2019-05-08 13:10:42 +03:00
kik0220
216bfade00 feat: add www.ipa.go.jp custom parser (#408) 2019-05-03 13:40:42 +03:00
kik0220
3ae8f3bde3 feat: add www.oreilly.co.jp custom parser (#407) 2019-05-03 13:30:48 +03:00
kik0220
7396e81b72 feat: add sect.iij.ad.jp custom parser (#404) 2019-05-03 13:19:06 +03:00
kik0220
3f1d9030ee feat: add www.lifehacker.jp custom parser (#403) 2019-05-03 13:14:53 +03:00
kik0220
b077000c4a feat: add getnews.jp custom parser (#402) 2019-05-03 13:10:55 +03:00
kik0220
b5425c3e8a feat: add www.gizmodo.jp custom parser (#400) 2019-05-03 13:06:51 +03:00
kik0220
a38c727a0a feat: add deadline.com custom parser (#383)
* feat: add deadline.com custom parser

* fix: timezone

* fix: date_published selectors

* fix: title and author selector

* test: transform .embed-twitter

* fix: regenerate the fixture and fix content selector
2019-04-24 15:29:02 +03:00
kik0220
74a3c49a3c feat: add japan.cnet.com custom parser (#382)
* feat: add japan.cnet.com custom parser

* fix: remove transform
2019-04-24 14:39:54 +03:00
kik0220
7b07f88448 feat: add www.yomiuri.co.jp custom parser (#381) 2019-04-24 11:00:56 +03:00
kik0220
8ca2894751 feat: add bookwalker.jp custom parser (#374) 2019-04-15 11:06:10 +03:00
kik0220
a5f06ce27a feat: add takagi-hiromitsu.jp custom parser (#364) 2019-04-12 18:11:05 +03:00
kik0220
b9c57dbc2f feat: add www.publickey1.jp custom parser (#365)
* feat: add www.publickey1.jp custom parser

* fix: date_published selector
2019-04-12 18:00:51 +03:00
kik0220
d7dbea8a95 feat: add www.itmedia.co.jp custom parser (#366)
* feat: add www.itmedia.co.jp custom parser

* feat: add nlab.itmedia.co.jp support

* fix: title selectors
2019-04-12 17:51:16 +03:00
kik0220
9218f80da6 feat: add www.moongift.jp custom parser (#367)
* feat: add www.moongift.jp custom parser

* fix: date_published selectors

* fix: pass test

* fix: add timezone
2019-04-12 17:40:55 +03:00
kik0220
4eb73dffb0 feat: add www.infoq.com custom parser (#368)
* feat: add www.infoq.com custom parser

* fix: date_published selector
2019-04-12 17:30:46 +03:00
kik0220
ce5cd2dd0d feat: add phpspot.org custom parser (#369)
* feat: add phpspot.org custom parser

* fix: date_published selector
2019-04-12 17:18:47 +03:00
kik0220
73be0c5a10 feat: add www.jnsa.org custom parser (#346)
* feat: add www.jnsa.org custom parser
2019-04-09 16:51:25 +03:00
Adam Pash
eacd1ee97f feat: custom genius parser. (#284)
also adds ability to transform value returned by an attribute selector
2019-04-09 12:49:24 +03:00
kik0220
c389c966d7 feat: add jvndb.jvn.jp custom parser (#345) 2019-04-09 12:05:03 +03:00
kik0220
8493d05cb5 feat: add scan.netsecurity.ne.jp custom parser (#347) 2019-04-09 11:59:27 +03:00
kik0220
2a76c6c212 feat: add www.elecom.co.jp custom parser (#348) 2019-04-09 11:54:57 +03:00
kik0220
a9e010b718 feat: add www.sanwa.co.jp custom parser (#349) 2019-04-09 11:50:48 +03:00
kik0220
1639eae324 feat: add www.asahi.com custom parser (#350) 2019-04-09 11:42:14 +03:00
kik0220
21f7de70c1 feat: add buzzap.jp custom parser (#351) 2019-04-09 11:35:40 +03:00
kik0220
f3a7e393a3 feat: add www.ossnews.jp custom parser (#352) 2019-04-09 11:30:56 +03:00
kik0220
c309bdb373 feat: add otrs.com custom parser (#353) 2019-04-09 11:17:58 +03:00
Toufic Mouallem
3ed778b53e fix: Adapt CNBC extractor to article redesign (#336) 2019-03-25 15:43:40 -07:00
Ben Ubois
a7e4c67d1d Extract content from GitHub repos. (#306)
* Extract content from GitHub repos.

* Add published and dek.

* Timezone fix.
2019-03-14 08:48:33 -07:00
Toufic Mouallem
7844129fda feat: Add custom parser for Reddit (#307) 2019-03-08 14:37:24 -08:00
Jordan Hotmann
83d1c2401b feat: add custom extractor for blisterreview.com (#299) 2019-03-01 16:48:26 -08:00
kik0220
d9a1e7b22b feat: add news.mynavi.jp custom parser (#287) 2019-03-01 16:45:32 -08:00
Adam Pash
9698d9a0c4 dx: comment on custom parser pr fix (#278)
* dx: comment on custom parser pr fix

* fix path

* write json

* chore: rename comment script
2019-02-28 11:11:03 -08:00
Ben Ubois
ed14203e97 fix: return early if creating the resource failed. (#285) 2019-02-20 16:48:51 -08:00
Ben Ubois
0e27448866 feat: Various Character Encoding Improvements (#270)
* Support HTML5 charset tag

In HTML5 `<meta charset="">` is shorthand for `<meta http-equiv="content-type" content="">`
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta

* Handle more character encoding declaration methods.
2019-02-12 15:15:19 -08:00
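
The note in #270 about the two `<meta>` forms can be illustrated with a small sketch. This is not the code from that commit: it is a minimal standalone Python example, and the names `detect_meta_charset`, `_MetaCharsetParser`, and `CHARSET_IN_CONTENT` are hypothetical. It only shows how the HTML5 shorthand and the older `http-equiv` form can both resolve to the same charset value.

```python
import re
from html.parser import HTMLParser
from typing import Optional

# Hypothetical helper: pulls "charset=..." out of a content attribute such as
# "text/html; charset=utf-8".
CHARSET_IN_CONTENT = re.compile(r'charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)


class _MetaCharsetParser(HTMLParser):
    """Records the first charset declared by a <meta> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only the first declaration counts; ignore everything after it.
        if tag != "meta" or self.charset is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("charset"):
            # HTML5 shorthand: <meta charset="utf-8">
            self.charset = attr["charset"].strip().lower()
        elif attr.get("http-equiv", "").lower() == "content-type":
            # Older form: <meta http-equiv="content-type" content="...; charset=...">
            match = CHARSET_IN_CONTENT.search(attr.get("content", ""))
            if match:
                self.charset = match.group(1).lower()


def detect_meta_charset(html: str) -> Optional[str]:
    """Return the lowercased charset declared in a <meta> tag, or None."""
    parser = _MetaCharsetParser()
    parser.feed(html)
    return parser.charset
```

A caller might pass the first chunk of a fetched page to `detect_meta_charset` and fall back to the HTTP `Content-Type` header or a default encoding when it returns `None`; that fallback order is an assumption here, not something the commit log states.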