* feat: prospect magazine parser
Couldn’t find a way to parse the date but I think it’s good otherwise.
* fix: pulls date
* fix: add timezone
* fix: generalize
* feat: forward.com parser
LGTM although image didn’t show up in preview
* feat: also pull image into content
* fix: generalize selectors
* fix: generalize selector
* feat: qdaily parser
Firstly — I accidentally tried to generate the parser on the master
branch, and I’m not sure where it is, maybe floating in the nether
world.
On to the parser — this one was a bit tricky because things were in
Chinese! The content appears to be parsing (as seen in preview) but
it’s not passing the test. I noticed the second “‘” mark isn’t
appearing on the parser side.
Additionally, some of the lazy loading images aren’t appearing in the
preview (I cleaned the wrong lazy load images that appeared), so
someone will probably have to work on that (I don’t know how to do
transforms yet).
* fix tests
* fix: selector generalization
* feat: gothamist extractor
* feat: add other gothamist network sites
* fix: try getting date another way
* fix: add gothamist timezone
* fix: generalize selectors
* fix: h1 is inside entry-header, needs to be specific because of another h1 on the page
* fix: general and specific selector
* feat: natgeo parser
For some reason, the local copy of the article didn’t grab the author
name in it, so I couldn’t figure out how to parse it. The generic
parser took a name of an author of a paper mentioned in the article,
and thought that was the author name, which was funny.
I cleaned a large block quote that didn’t make sense as it was shown in
the preview, although I noticed that the Mercury chrome extension
didn’t even display it.
* fix: add date_published transform
* fix: date_published assertion
* disable: author assertion, generalize author selector
* rm: author assertion
* fix: image lead
* fix: guard against missing img url
* fix: generalize dek and title selectors
* feat: natgeo parser
Same as the news.nationalgeographic.com parser - for some reason the
author name doesn’t appear to be getting pulled into the local copy of
the file.
* fix: content assertion
* fix: generalize author byline
* disable: author assertion
* rm: author assertion
* fix: image lead, handles image-group
* fix: guard against missing img url
* fix: generalize dek and title selectors
* feat: howtogeek extractor
This one is a bit tricky - the author and date info appear in a comment
section at the bottom. Was able to parse the author but not the date
info. Halp
* howtogeek update
Thanks to @fdsimms I was able to parse the date, but not sure what to
test it against, so I left it blank.
* fix: date_published assertion, it was comparing against empty string
* fix: timezone
* amend: generalize author selector
* feat: today parser
This looks fine — there are a couple of lines of “Related” but they are
within the body (and don’t have their own classes) so I couldn’t clean
them out.
* fix: fix content assertion
* feat: westernjournalism parser
Adjacent sibling selector FTW!
Image not displaying in preview.
* feat: fix assertion, body does not include _Advertisement_ subtext
* feat: npr parser
Lead image appears in preview, but the test fails for some reason.
AssertionError: null == 'https://media.npr.org/assets/img/2016/12/15/gettyimages-540681598_wide-8b160732b96c083dc115134c3c019f3ac73586ba.jpg?s=1400'
Looks okay otherwise.
* feat: transformed figures/figcaptions, improved date_published and
addressed NPR's bad image metadata
* feat: fortune parser
For some reason, the dek doesn’t appear in the local version of the
article I selected. I tried parsing the meta tag containing
og:description but it’s not working, and the description is slightly
longer than the dek in the original article.
I’m not sure why, but for the lead image, the meta tag for og:image is
not parsing the image url.
:(
* feat: fortune redesigned, so re-did extractor
* fix: added timezone
* feat: qz parser
I couldn’t figure out how to parse the date, but otherwise should be
fine. I added a clean for the div.article-aside element based on what I
saw in how the chrome extension worked.
* feat: updated content to grab top image
test: date is null :/
* feat: dmagazine parser
I’m sorry to have failed you. :-( These are the issues I encountered:
1) author - does not have a unique selector to distinguish it from the
date, couldn’t parse it
2) date - no metadata in the head
3) no meta og:image in the head (my go-to), so I couldn’t get the image
test to pass, but it appears to be parsing. The caption below it is the
same size as the body copy in the preview. I couldn’t figure out how to
“transform” it to caption size.
* feat: update date, image, and author selectors and corresponding tests
* feat: generalized content selector
* feat: reuters parser
Date parses correctly but fails test because of format discrepancy.
Author tags are nested within the content, which is why the author
names are appearing twice. I wasn’t sure how to address this.
Additionally, the location appears twice, so I cleaned the location
tags from the content.
* test: fix date format
* transform .article-subtitle to h4; cleaning author but leaving location
* feat: mashable parser
As usual the date is giving me issues because of formatting
discrepancies:
AssertionError: '2016-12-13T22:33:06.000Z' == '2016-12-14T03:33:06.000Z'
Not sure how we wanna deal with Twitter card embeds that don’t show up?
Also, image credits did not show up in preview.
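The two timestamps in the assertion above differ by exactly five hours, which is what you would see if a wall-clock time without a zone were read as UTC on one side and as US Eastern (UTC-5 in December) on the other. The sketch below reproduces both values under that assumption; `wallClockToUtcIso` is a hypothetical helper, not part of the parser, and it uses a fixed offset rather than a named timezone.

```typescript
// Hypothetical helper: treat a zone-less wall-clock timestamp as being at a
// fixed UTC offset and return the equivalent instant as a UTC ISO string.
function wallClockToUtcIso(wallClock: string, utcOffsetHours: number): string {
  // Parse the wall-clock value as if it were already UTC...
  const parsedAsUtc = new Date(`${wallClock}Z`);
  // ...then subtract the offset to get the actual UTC instant.
  const utcMillis = parsedAsUtc.getTime() - utcOffsetHours * 60 * 60 * 1000;
  return new Date(utcMillis).toISOString();
}

// Assumption: the page shows 22:33:06 on 2016-12-13 with no zone information.
const withoutZone = wallClockToUtcIso("2016-12-13T22:33:06", 0);
// Assumption: the publisher's clock is US Eastern, UTC-5 in December.
const withEastern = wallClockToUtcIso("2016-12-13T22:33:06", -5);

console.log(withoutZone); // 2016-12-13T22:33:06.000Z
console.log(withEastern); // 2016-12-14T03:33:06.000Z
```

A fixed offset ignores daylight saving time, so a named timezone is the safer choice when the publisher's zone is known.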
* test: fixed date format
* transforming .image-credit to figcaption
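For reference, a tag-renaming transform like the ones mentioned for Reuters and Mashable could look roughly like the fragment below. This is a sketch under assumptions: the `selectors` and `clean` values are placeholders rather than the ones committed, and the local `ContentConfig` type is hypothetical, written only to keep the example self-contained.

```typescript
// Hypothetical local types describing the content section of a custom extractor.
type Transform = string | ((node: unknown) => string | void);

interface ContentConfig {
  selectors: string[];
  transforms: Record<string, Transform>;
  clean: string[];
}

const content: ContentConfig = {
  // Placeholder selector, assumed for illustration only.
  selectors: ["article"],
  transforms: {
    // A string value renames the matched element's tag.
    ".image-credit": "figcaption",
    ".article-subtitle": "h4",
  },
  // Placeholder: selectors to strip from the extracted content.
  clean: [],
};

console.log(Object.keys(content.transforms));
```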