81 Commits (c309bdb37359b67b247208a0e06abe599a4504db)

Author SHA1 Message Date
kik0220 c309bdb373 feat: add otrs.com custom parser (#353) 5 years ago
Ben Ubois a7e4c67d1d Extract content from GitHub repos. (#306)
* Extract content from GitHub repos.

* Add published and dek.

* Timezone fix.
5 years ago
Toufic Mouallem 7844129fda feat: Add custom parser for Reddit (#307) 5 years ago
Jordan Hotmann 83d1c2401b feat: add custom extractor for blisterreview.com (#299) 5 years ago
kik0220 d9a1e7b22b feat: add news.mynavi.jp custom parser (#287) 5 years ago
Wajeeh Zantout 1ccd14e1e9 feat: add fortinet custom parser (#188)
* feat: add fortinet custom parser

* fix: eslint error

* fix: transform noscript images

* feat: add fortinet custom parser

* fix: eslint error

* fix: transform noscript images

* fix: transform method

* test: transform method

* fix: fs import
5 years ago
Wajeeh Zantout 9b36003b62 feat: add fastcompany custom parser (#191)
* feat: add fastcompany custom parser

* fix: eslint error

* fix: test for date_published

* feat: add fastcompany custom parser

* fix: eslint error

* fix: test for date_published

* fix: fs import
5 years ago
Janet f13bb721f6 feat: prospect magazine parser (#147)
* feat: prospect magazine parser

Couldn’t find a way to parse the date but I think it’s good otherwise.

* fix: pulls date

* fix: add timezone

* fix: generalize
7 years ago
Kevin Ngao 1b28713cf5 feat: fool.com parser (#158)
* feat: add fool.com custom parser
7 years ago
Janet c18959779d feat: forward.com parser (#144)
* feat: forward.com parser

LGTM although image didn’t show up in preview

* feat: also pull image into content

* fix: generalize selectors

* fix: generalize selector
7 years ago
Janet 50e548bac2 feat: qdaily parser (#146)
* feat: qdaily parser

Firstly — I accidentally tried to generate the parser on the master
branch, and I’m not sure where it is, maybe floating in the nether
world.

On to the parser — this one was a bit tricky because things were in
Chinese! The content appears to be parsing (as seen in preview) but
it’s not passing the test. I noticed the second “ ‘ “ mark isn’t
appearing on the parser side.

Additionally, some of the lazy loading images aren’t appearing in the
preview (I cleaned the wrong lazy load images that appeared), so
someone will probably have to work on that (I don’t know how to do
transforms yet).

* fix tests

* fix: selector generalization
7 years ago
Silas Burton 11382ce651 Feat: Slate extractor (#153)
* feat: slate extractor

* fix: generalize selectors

* fix: add Slate timezone
7 years ago
Silas Burton 5acaa6ab56 feat: ici.radio-canada.ca extractor (#156)
* feat: ici.radio-canada.ca extractor

* fix: add timezone
7 years ago
Silas Burton 9b371e51ac Feat: gothamist extractor (#151)
* feat: gothamist extractor

* feat: add other gothamist network sites

* fix: try getting date another way

* fix: add gothamist timezone

* fix: generalize selectors

* fix: h1 is inside entry-header, needs to be specific because of another h1 on the page

* fix: general and specific selector
7 years ago
Janet 93d2baf5cf feat: news.natgeo parser (#88)
* feat: natgeo parser

For some reason, the local copy of the article didn’t grab the author
name in it, so I couldn’t figure out how to parse it. The generic
parser took a name of an author of a paper mentioned in the article,
and thought that was the author name, which was funny.

I cleaned a large block quote that didn’t make sense as it was shown in
the preview, although I noticed that the Mercury chrome extension
didn’t even display it.

* fix: add date_published transform

* fix: date_published assertion

* disable: author assertion, generalize author selector

* rm: author assertion

* fix: image lead

* fix: guard against missing img url

* fix: generalize dek and title selectors
7 years ago
Janet 2279c2d486 feat: natgeo parser (#89)
* feat: natgeo parser

Same as the news.nationalgeographic.com parser - for some reason the
author name doesn’t appear to be getting pulled into the local copy of
the file.

* fix: content assertion

* fix: generalize author byline

* disable: author assertion

* rm: author assertion

* fix: image lead, handles image-group

* fix: guard against missing img url

* fix: generalize dek and title selectors
7 years ago
Janet 11f466ccb3 feat: latimes parser (#92)
* feat: latimes parser
7 years ago
Kevin Ngao 26a8e4f75a feat: macrumors parser (#120)
* feat: add macrumors
7 years ago
Kevin Ngao b4fec6af98 feat: androidcentral parser (#119)
* feat: androidcentral parser
7 years ago
Janet beb0b89a4f feat: pagesix parser (#97)
* feat: pagesix parser
7 years ago
Janet f2160eb5b6 feat: si parser (#118)
* feat: si parser
7 years ago
Janet 2af0f6179a feat: rawstory parser (#109)
* feat: rawstory parser

Finished, with a little help from Frankie (thanks Frankie!)

* fix: date_published timezone
7 years ago
Janet 765032452d feat: thefederalistpapers parser (#101)
* feat: thefederalistpapers parser
7 years ago
Janet fb5eb2e104 feat: cnet parser (#104)
* feat: cnet parser

Date test fail - please take a look!

Also, image didn’t load in preview.

* fix: timezone

* fix: image lead
7 years ago
Janet 3c5fa28f10 feat: cbs sports parser (#98)
* feat: cbs sports parser
7 years ago
Janet 3cf2d0d3ef feat: msnbc parser (#100)
* feat: msnbc parser
7 years ago
Janet f9ab9eb885 feat: howtogeek extractor (#108)
* feat: howtogeek extractor

This one is a bit tricky - the author and date info appear in a comment
section at the bottom. Was able to parse the author but not the date
info. Halp

* howtogeek update

Thanks to @fdsimms I was able to parse the date, but not sure what to
test it against, so I left it blank.

* fix: date_published assertion, it was comparing against empty string

* fix: timezone

* amend: generalize author selector
7 years ago
Janet 258acdfd02 feat: opposing views parser (#103)
* feat: opposing views parser
8 years ago
Janet b63dd33579 feat: today parser (#106)
* feat: today parser

This looks fine — there are a couple of lines of “Related” but they are
within the body (and don’t have their own classes) so I couldn’t clean
them out.

* fix: fix content assertion
8 years ago
Janet c94eee7f92 feat: cinema blend parser (#105)
* feat: cinema blend parser

all systems go

* fix: timezone
8 years ago
Janet 64e3c205e8 feat: the political insider parser (#99)
* feat: the political insider parser with timezone
8 years ago
Janet 7b52d3d1fc feat: al.com parser (#110)
* feat: al.com parser

I think this is good but could you pls double check time zone on the
date? Thanks

* fix: date_published timezone
8 years ago
Janet 15df58496f feat: westernjournalism parser (#113)
* feat: westernjournalism parser

Adjacent sibling selector FTW!

Image not displaying in preview.

* feat: fix assertion, body does not include _Advertisement_ subtext
8 years ago
Janet ae12a1d701 feat: mental floss parser (#94)
* feat: mental floss parser
8 years ago
Janet bf29291395 feat: thepennyhoarder parser (#112)
* feat: thepennyhoarder parser

Looks good, although no image in preview!

* fix: adds selector for article lead image
8 years ago
Janet fadd198d04 feat: abcnewsgo parser (#90)
* feat: abcnewsgo parser
8 years ago
Janet 1054d854dd feat: america now parser (#114)
* feat: america now parser

Looks good but lead image did not display in preview.

* feat: adds selector for lead image
8 years ago
Janet 4c48acba59 feat: fusion parser
Looks okay — image did not load in preview.
8 years ago
Janet d292d8ef3a feat: ny daily news parser (#87)
* feat: ny daily news parser
8 years ago
Janet 385b9d76a3 feat: sciencefly extractor (#116)
* feat: sciencefly extractor, use loading image rather than 404'ing meta
8 years ago
Adam Pash 6bd6278a07 feat: custom parser for wh blog (#130) 8 years ago
Adam Pash 31eb4f9222 Feat: LinkedIn parser (#123)
* feat: rebuild custom parser

* feat: linkedin custom parser
8 years ago
Janet 7709d69379 feat: npr parser (#86)
* feat: npr parser

Lead image appears in preview, but the test fails for some reason.

AssertionError: null ==
'https://media.npr.org/assets/img/2016/12/15/gettyimages-540681598_wide-8b160732b96c083dc115134c3c019f3ac73586ba.jpg?s=1400'

Looks okay otherwise.

* feat: transformed figures/figcaptions, improved date_published and
addressed NPR's bad image metadata
8 years ago
Janet 8a82f2c0ab feat: recode parser (#85)
* feat: recode parser

Thumbs up, as far as I can tell.

Note: No image appeared in the preview.

* feat: pulling in lead image
8 years ago
Janet ad29acd7b7 feat: fortune parser (#84)
* feat: fortune parser

For some reason, the dek doesn’t appear in the local version of the
article I selected. I tried parsing the meta tag containing
og:description but it’s not working, and the description is slightly
longer than the dek in the original article.

I’m not sure why, but for the lead image, the meta tag for og:image is
not parsing the image url.

:(

* feat: fortune redesigned, so re-did extractor

* fix: added timezone
8 years ago
Janet c133ddf614 feat: qz parser (#81)
* feat: qz parser

I couldn’t figure out how to parse the date, but otherwise should be
fine. I added a clean for the div.article-aside element based on what I
saw in how the chrome extension worked.

* feat: updated content to grab top image

test: date is null :/
8 years ago
Janet 84312b6ef1 feat: dmagazine parser (#80)
* feat: dmagazine parser

I’m sorry to have failed you. :-( These are the issues I encountered:

1) author - does not have a unique selector to distinguish it from the
date, couldn’t parse it
2) date - no meta data in the head
3) no meta og:image in the head (my go to), so I couldn’t get the image
test to pass, but it appears to be parsing. The caption below it is the
same size as the body copy in the preview. I couldn’t figure out how to
“transform” it to caption size.

* feat: update date, image, and author selectors and corresponding tests

* feat: generalized content selector
8 years ago
Janet e035f36361 feat: reuters parser (#78)
* feat: reuters parser

Date parses correctly but fails test because of format discrepancy.

Author tags are nested within the content, which is why the author
names are appearing twice. I wasn’t sure how to address this.

Additionally, the location appears twice, so I cleaned the location
tags from the content.

* test: fix date format

* transform .article-subtitle to h4; cleaning author but leaving location
8 years ago
Janet dec49ab073 feat: mashable parser (#76)
* feat: mashable parser

As usual the date is giving me issues because of formatting
discrepancies:
AssertionError: '2016-12-13T22:33:06.000Z' == '2016-12-14T03:33:06.000Z'

Not sure how we wanna deal with Twitter card embeds that don’t show up?

Also, image credits did not show up in preview.

* test: fixed date format

* transforming .image-credit to figcaption
8 years ago
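Note on the date mismatch reported in #76: the two timestamps in that assertion differ by exactly five hours, which is what you get when the same wall-clock time is read once as UTC and once at a UTC-5 offset. The sketch below reproduces both values under that assumption. The -05:00 offset is inferred from the gap, not stated in the commit, and `toIso` is a hypothetical helper, not part of the parser.

```javascript
// Hypothetical helper: interpret a zone-less wall-clock timestamp at a
// given UTC offset and return the normalized ISO 8601 string in UTC.
function toIso(wallClock, offset) {
  // Appending an explicit offset ("Z" or "+HH:mm" / "-HH:mm") produces a
  // string in the ECMAScript date-time format, so the result does not
  // depend on the local time zone of the machine running the test.
  return new Date(`${wallClock}${offset}`).toISOString();
}

// Assumed wall-clock time from the article page.
const wallClock = '2016-12-13T22:33:06';

// Read as UTC: matches the value on the left of the failed assertion.
const asUtc = toIso(wallClock, 'Z');

// Read at an assumed -05:00 offset: matches the value on the right.
const asMinusFive = toIso(wallClock, '-05:00');

console.log(asUtc);       // 2016-12-13T22:33:06.000Z
console.log(asMinusFive); // 2016-12-14T03:33:06.000Z
```

Making the offset explicit in one place is what keeps the test fixture and the extracted value in agreement, which is roughly what the later "add timezone" fixes in this log do.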
Janet cddc1afb69 feat: chicago tribune parser (#75)
* feat: chicago tribune parser

Date is parsing but failing the test because:
AssertionError: '2016-12-13T21:45:00.000Z' == '2016-12-13T13:45:00-0800'

I tried to insert a line of code for Time Zone but I’m a n00b so I
don’t think I did it right.

No image showing up in the preview.

* fix: remove timezone from date_published extractor

* test: update unit tests to assert the correct value for date_published
8 years ago
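Note on the date assertion in #75: the two strings in that error denote the same instant (13:45 at -08:00 is 21:45 UTC), so the failure comes from comparing string forms rather than times. A minimal sketch of comparing instants instead, with `withColonOffset` and `sameInstant` as hypothetical helpers and the variable names `left` and `right` chosen only to mirror the two sides of the assertion:

```javascript
// Hypothetical helper: rewrite a trailing "+HHMM" / "-HHMM" offset as
// "+HH:MM" / "-HH:MM", the form the ECMAScript date-time format specifies.
function withColonOffset(timestamp) {
  return timestamp.replace(/([+-]\d{2})(\d{2})$/, '$1:$2');
}

// Hypothetical helper: true when two timestamp strings denote the same
// instant. Unparseable input yields NaN, and NaN === NaN is false.
function sameInstant(a, b) {
  return Date.parse(withColonOffset(a)) === Date.parse(withColonOffset(b));
}

const left = '2016-12-13T21:45:00.000Z';
const right = '2016-12-13T13:45:00-0800';

console.log(left === right);            // false: the strings differ
console.log(sameInstant(left, right));  // true: the instants match

// Normalizing the offset form to UTC gives the other string exactly.
const normalized = new Date(withColonOffset(right)).toISOString();
console.log(normalized);                // 2016-12-13T21:45:00.000Z
```

The follow-up commits took the other route and updated the test to assert the normalized UTC string, which avoids needing a comparison helper at all.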