Commit Graph

114 Commits

Author SHA1 Message Date
kik0220
c309bdb373 feat: add otrs.com custom parser (#353) 2019-04-09 11:17:58 +03:00
Toufic Mouallem
3ed778b53e fix: Adapt CNBC extractor to article redesign (#336) 2019-03-25 15:43:40 -07:00
Ben Ubois
a7e4c67d1d Extract content from GitHub repos. (#306)
* Extract content from GitHub repos.

* Add published and dek.

* Timezone fix.
2019-03-14 08:48:33 -07:00
Toufic Mouallem
7844129fda feat: Add custom parser for Reddit (#307) 2019-03-08 14:37:24 -08:00
Jordan Hotmann
83d1c2401b feat: add custom extractor for blisterreview.com (#299) 2019-03-01 16:48:26 -08:00
kik0220
d9a1e7b22b feat: add news.mynavi.jp custom parser (#287) 2019-03-01 16:45:32 -08:00
Adam Pash
9698d9a0c4 dx: comment on custom parser pr fix (#278)
* dx: comment on custom parser pr fix

* fix path

* write json

* chore: rename comment script
2019-02-28 11:11:03 -08:00
Ben Ubois
ed14203e97 fix: return early if creating the resource failed. (#285) 2019-02-20 16:48:51 -08:00
Ben Ubois
0e27448866 feat: Various Character Encoding Improvements (#270)
* Support HTML5 charset tag

In HTML5 `<meta charset="">` is shorthand for `<meta http-equiv="content-type" content="">`
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta

* Handle more character encoding declaration methods.
2019-02-12 15:15:19 -08:00
Wajeeh Zantout
1ccd14e1e9 feat: add fortinet custom parser (#188)
* feat: add fortinet custom parser

* fix: eslint error

* fix: transform noscript images

* feat: add fortinet custom parser

* fix: eslint error

* fix: transform noscript images

* fix: transform method

* test: transform method

* fix: fs import
2019-01-30 09:33:36 +02:00
Wajeeh Zantout
9b36003b62 feat: add fastcompany custom parser (#191)
* feat: add fastcompany custom parser

* fix: eslint error

* fix: test for date_published

* feat: add fastcompany custom parser

* fix: eslint error

* fix: test for date_published

* fix: fs import
2019-01-30 09:30:24 +02:00
Ralph Jbeily
f3f6e21fd8 fix: author and date published selectors (#189) 2019-01-25 11:28:43 -08:00
Adam Pash
96640e3564 fix: failing fetchResource test (#187)
I think it was a fixture problem
2018-12-20 10:06:16 -08:00
Adam Pash
5663660f76 fix: nytimes custom parser title selector (#181)
* fix: nytimes custom parser title selector

* upgrade node version

* circle ci tweak
2018-10-12 13:39:41 -07:00
Adam Pash
b8aa87c777 feat: improve wh parser (#168) 2017-03-24 14:41:40 -07:00
Adam Pash
61f0f4e1af fix: kept elements being removed (#166)
Elements marked to keep were removable under specific circumstances.
This PR fixes these edge cases.
2017-03-23 13:16:21 -07:00
Adam Pash
453419de72 feat: improve wh.gov parser (#163)
* feat: support youtube-nocookie domain

* feat: updated wh.gov parser to support speeches
2017-03-22 13:16:54 -07:00
Janet
f13bb721f6 feat: prospect magazine parser (#147)
* feat: prospect magazine parser

Couldn’t find a way to parse the date but I think it’s good otherwise.

* fix: pulls date

* fix: add timezone

* fix: generalize
2017-03-14 18:34:40 -04:00
Kevin Ngao
1b28713cf5 feat: fool.com parser (#158)
* feat: add fool.com custom parser
2017-03-14 18:04:19 -04:00
Janet
c18959779d feat: forward.com parser (#144)
* feat: forward.com parser

LGTM although image didn’t show up in preview

* feat: also pull image into content

* fix: generalize selectors

* fix: generalize selector
2017-03-14 17:53:23 -04:00
Janet
50e548bac2 feat: qdaily parser (#146)
* feat: qdaily parser

Firstly — I accidentally tried to generate the parser on the master
branch, and I’m not sure where it is, maybe floating in the nether
world.

On to the parser — this one was a bit tricky because things were in
Chinese! The content appears to be parsing (as seen in preview) but
it’s not passing the test. I noticed the second “ ‘ “ mark isn’t
appearing on the parser side.

Additionally, some of the lazy loading images aren’t appearing in the
preview (I cleaned the wrong lazy load images that appeared), so
someone will probably have to work on that (I don’t know how to do
transforms yet).

* fix tests

* fix: selector generalization
2017-03-14 17:37:53 -04:00
Silas Burton
11382ce651 Feat: Slate extractor (#153)
* feat: slate extractor

* fix: generalize selectors

* fix: add Slate timezone
2017-03-13 17:44:04 -04:00
Silas Burton
5acaa6ab56 feat: ici.radio-canada.ca extractor (#156)
* feat: ici.radio-canada.ca extractor

* fix: add timezone
2017-03-13 17:23:20 -04:00
Silas Burton
9b371e51ac Feat: gothamist extractor (#151)
* feat: gothamist extractor

* feat: add other gothamist network sites

* fix: try getting date another way

* fix: add gothamist timezone

* fix: generalize selectors

* fix: h1 is inside entry-header, needs to be specific because of another h1 on the page

* fix: general and specific selector
2017-03-09 13:13:46 -05:00
Kevin Ngao
afbef9bc39 Fix Encoding on Body (#143)
* fix: check encoding on body
2017-03-06 11:36:56 -05:00
Janet
93d2baf5cf feat: news.natgeo parser (#88)
* feat: natgeo parser

For some reason, the local copy of the article didn’t grab the author
name in it, so I couldn’t figure out how to parse it. The generic
parser took a name of an author of a paper mentioned in the article,
and thought that was the author name, which was funny.

I cleaned a large block quote that didn’t make sense as it was shown in
the preview, although I noticed that the Mercury chrome extension
didn’t even display it.

* fix: add date_published transform

* fix: date_published assertion

* disable: author assertion, generalize author selector

* rm: author assertion

* fix: image lead

* fix: guard against missing img url

* fix: generalize dek and title selectors
2017-02-08 15:27:35 -07:00
Janet
2279c2d486 feat: natgeo parser (#89)
* feat: natgeo parser

Same as the news.nationalgeographic.com parser - for some reason the
author name doesn’t appear to be getting pulled into the local copy of
the file.

* fix: content assertion

* fix: generalize author byline

* disable: author assertion

* rm: author assertion

* fix: image lead, handles image-group

* fix: guard against missing img url

* fix: generalize dek and title selectors
2017-02-08 15:01:55 -07:00
Janet
11f466ccb3 feat: latimes parser (#92)
* feat: latimes parser
2017-02-08 11:29:03 -05:00
Kevin Ngao
26a8e4f75a feat: macrumors parser (#120)
* feat: add macrumors
2017-02-07 19:15:29 -05:00
Kevin Ngao
b4fec6af98 feat: androidcentral parser (#119)
* feat: androidcentral parser
2017-02-07 18:20:04 -05:00
Janet
beb0b89a4f feat: pagesix parser (#97)
* feat: pagesix parser
2017-02-07 17:38:09 -05:00
Janet
f2160eb5b6 feat: si parser (#118)
* feat: si parser
2017-02-07 16:52:11 -05:00
Janet
2af0f6179a feat: rawstory parser (#109)
* feat: rawstory parser

Finished, with a little help from Frankie (thanks Frankie!)

* fix: date_published timezone
2017-02-07 12:53:05 -07:00
Janet
765032452d feat: thefederalistpapers parser (#101)
* feat: thefederalistpapers parser
2017-02-07 14:30:52 -05:00
Janet
fb5eb2e104 feat: cnet parser (#104)
* feat: cnet parser

Date test fails - please take a look!

Also, image didn’t load in preview.

* fix: timezone

* fix: image lead
2017-02-07 11:55:04 -07:00
Janet
3c5fa28f10 feat: cbs sports parser (#98)
* feat: cbs sports parser
2017-02-07 10:45:48 -05:00
Janet
3cf2d0d3ef feat: msnbc parser (#100)
* feat: msnbc parser
2017-02-06 18:08:49 -05:00
Janet
f9ab9eb885 feat: howtogeek extractor (#108)
* feat: howtogeek extractor

This one is a bit tricky - the author and date info appear in a comment
section at the bottom. Was able to parse the author but not the date
info. Halp

* howtogeek update

Thanks to @fdsimms I was able to parse the date, but not sure what to
test it against, so I left it blank.

* fix: date_published assertion, it was comparing against empty string

* fix: timezone

* amend: generalize author selector
2017-02-06 15:23:15 -07:00
Janet
258acdfd02 feat: opposing views parser (#103)
* feat: opposing views parser
2017-02-06 12:22:42 -05:00
Janet
b63dd33579 feat: today parser (#106)
* feat: today parser

This looks fine — there are a couple of lines of “Related” but they are
within the body (and don’t have their own classes) so I couldn’t clean
them out.

* fix: fix content assertion
2017-02-06 09:20:12 -07:00
Janet
c94eee7f92 feat: cinema blend parser (#105)
* feat: cinema blend parser

all systems go

* fix: timezone
2017-02-06 09:02:11 -07:00
Janet
64e3c205e8 feat: the political insider parser (#99)
* feat: the political insider parser with timezone
2017-02-03 16:25:16 -05:00
Janet
7b52d3d1fc feat: al.com parser (#110)
* feat: al.com parser

I think this is good but could you pls double check time zone on the
date? Thanks

* fix: date_published timezone
2017-02-03 11:45:45 -07:00
Janet
15df58496f feat: westernjournalism parser (#113)
* feat: westernjournalism parser

Adjacent sibling selector FTW!

Image not displaying in preview.

* feat: fix assertion, body does not include _Advertisement_ subtext
2017-02-03 11:15:50 -07:00
Janet
ae12a1d701 feat: mental floss parser (#94)
* feat: mental floss parser
2017-02-03 11:40:01 -05:00
Janet
bf29291395 feat: thepennyhoarder parser (#112)
* feat: thepennyhoarder parser

Looks good, although no image in preview!

* fix: adds selector for article lead image
2017-02-03 08:56:15 -07:00
Janet
fadd198d04 feat: abcnewsgo parser (#90)
* feat: abcnewsgo parser
2017-02-02 17:43:35 -05:00
Janet
1054d854dd feat: america now parser (#114)
* feat: america now parser

Looks good but lead image did not display in preview.

* feat: adds selector for lead image
2017-02-02 13:46:20 -07:00
Janet
4c48acba59 feat: fusion parser
Looks okay — image did not load in preview.
2017-02-02 10:54:49 -07:00
Janet
d292d8ef3a feat: ny daily news parser (#87)
* feat: ny daily news parser
2017-02-02 12:30:16 -05:00