mirror of
https://github.com/postlight/mercury-parser
synced 2024-11-05 12:00:13 +00:00
8da2425e59
Squashed commit of the following:

commit 7ba2d2b36d175f5ccbc02f918322ea0dd44bf2c1
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 17:55:10 2016 -0400

    feat: resource fetches content from a URL and prepares for parsing

commit 0abdfa49eed5b363169070dac6d65d0a5818c918
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 17:54:07 2016 -0400

    fix: this was messing up double Esses ('ss', as in class => cla)

commit 9dc65a99631e3a68267a68b2b4629c4be8f61546
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 14:58:57 2016 -0400

    fix: test suite working w/new dirs

commit 993dc33a5229bfa22ea998e3c4fe105be9d91c21
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 14:49:39 2016 -0400

    feat: convertLazyLoadedImages puts img urls in the src

commit e7fb105443dd16d036e460ad21fbcb47191f475b
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 14:30:43 2016 -0400

    feat: makeLinksAbsolute to fully qualify urls

commit dbd665078af854efe84bbbfe9b55acd02e1a652f
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 13:38:33 2016 -0400

    feat: fetchResource to fetch a url and validate the response

commit 42d3937c8f0f8df693996c2edee93625f13dced7
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 10:25:34 2016 -0400

    feat: normalizing meta tags
TODO:
- run makeLinksAbsolute on extracted content before returning
- remove logic for fetching meta attrs with custom props
- Resource (fetches page, validates it, cleans it, normalizes meta tags (!), converts lazy-loaded images, makes links absolute, etc)
- extractNextPageUrl
- Rename all cleaners from cleanThing to clean
- Make sure weightNodes flag is being passed properly
- Get a better sense of when cheerio returns a raw node and when it returns a cheerio object
- Remove $ from function calls to getScore
- Remove $ whenever possible
- Test whether the .is method is faster than the regex methods
- Separate constants into activity-specific folders (dom, scoring)
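For the makeLinksAbsolute item above, a minimal sketch of the URL-resolution step, assuming Node's built-in URL class. The names absolutize and resolveAll are hypothetical and are not taken from the project, which would apply the same idea to href/src attributes on a cheerio document rather than to a plain array:

```javascript
// Hypothetical helper: resolve one href/src value against the page URL.
// Returns the original value unchanged if it cannot be parsed.
function absolutize(rootUrl, value) {
  try {
    return new URL(value, rootUrl).href;
  } catch (err) {
    return value;
  }
}

// Hypothetical stand-in for the attribute loop: takes a list of raw
// attribute values instead of walking a parsed document.
function resolveAll(rootUrl, values) {
  return values.map((value) => absolutize(rootUrl, value));
}
```

Already-absolute values pass through unchanged, so running this on extracted content more than once should be harmless.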
DONE:
x Check that lead-image-url extractor isn't looking for end-of-string file extension matches (i.e., it could be ...foo.jpg?otherstuff)
x extractLeadImageUrl
x extractDek
x extractDatePublished
x Title metadata
x Test re-initializing $ if/when it needs to loop again
x cleanHeaders
Remove any headers that come before any p tags, that match the title, etc.
x extract
(this kicks it all off)
x node_is_sufficient
x _extract_best_node
x get_weight
x _strip_unlikely_candidates
x _convert_to_paragraphs
x _brs_to_paragraphs
x _paragraphize
Scoring
x _get_score
x _set_score
x _add_score
x _score_content
x _score_node
x _score_paragraph
Top Candidate
x _find_top_candidate
x extract_clean_node
x _clean_conditionally
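For the lead-image-url item at the top of the DONE list, a minimal sketch of an extension check that does not depend on the extension sitting at the very end of the string. The helper name hasImageExtension and the extension list are assumptions for illustration, not the project's actual code:

```javascript
// Assumed set of image extensions; the project's real list may differ.
const IMAGE_EXT_RE = /\.(jpe?g|png|gif)$/i;

// Hypothetical helper: test the extension on the URL path only, so a
// query string or fragment after the file name does not hide it.
function hasImageExtension(url) {
  let pathname;
  try {
    pathname = new URL(url).pathname;
  } catch (err) {
    // Not an absolute URL: strip query and fragment by hand.
    pathname = url.split(/[?#]/)[0];
  }
  return IMAGE_EXT_RE.test(pathname);
}
```

Checking the path rather than the whole string also avoids a false positive when an image file name appears only inside a query parameter.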