mirror of
https://github.com/postlight/mercury-parser
synced 2024-11-05 12:00:13 +00:00
0ff3082295
Squashed commit of the following:

commit 22d37ebf26dbbd0a3daebbfde3509a6ce04aaf72
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 1 17:50:13 2016 -0400

    feat: GenericExtractLeadImageUrl

commit 3327a0a7929dd0e9267dc9c26f4e2aa78c32586f
Author: Adam Pash <adam.pash@gmail.com>
Date: Thu Sep 1 15:33:42 2016 -0400

    feat: can pass custom attributes to extractFromMeta
TODO:
Tmrw:
- extractNextPageUrl
- Try Closure webpack compiler
- Rename all cleaners from cleanThing to clean
- Make sure weightNodes flag is being passed properly
- Get a better sense of when cheerio returns a raw node and when it returns a cheerio object
- Remove $ from function calls to getScore
- Remove $ whenever possible
- Test if .is method is faster than regex methods
- Separate constants into activity-specific folders (dom, scoring)
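
The raw-node vs. cheerio-object question and the "remove $ from getScore" items above could be handled with one small normalizing helper. What follows is only a sketch, not the project's code: `toRawNode` is a hypothetical helper, storing the score in a `score` attribute is an assumption, and the stand-in objects only mimic the shapes involved (a raw node carries `attribs`, a cheerio object exposes `.get(i)`).

```javascript
// Hypothetical helper: accept either a raw node or a cheerio-style wrapper
// and always hand back the raw node (or null).
function toRawNode(nodeOrWrapper) {
  if (nodeOrWrapper && typeof nodeOrWrapper.get === 'function') {
    // Wrapper: .get(0) returns the first underlying raw node.
    return nodeOrWrapper.get(0) || null;
  }
  return nodeOrWrapper || null;
}

// Read the score straight off the raw node, so no $ has to be passed in.
// Assumption: the score lives in a `score` attribute.
function getScore(nodeOrWrapper) {
  const node = toRawNode(nodeOrWrapper);
  if (!node || !node.attribs) return null;
  const score = parseFloat(node.attribs.score);
  return Number.isNaN(score) ? null : score;
}

function setScore(nodeOrWrapper, score) {
  const node = toRawNode(nodeOrWrapper);
  if (node) {
    node.attribs = node.attribs || {};
    node.attribs.score = String(score);
  }
  return nodeOrWrapper;
}

function addScore(nodeOrWrapper, amount) {
  const current = getScore(nodeOrWrapper) || 0;
  return setScore(nodeOrWrapper, current + amount);
}

// Stand-ins for illustration only (no cheerio import needed to run this).
const rawNode = { type: 'tag', name: 'p', attribs: { score: '12.5' } };
const wrapped = {
  length: 1,
  0: rawNode,
  get: (i) => (i === 0 ? rawNode : undefined),
};

console.log(getScore(rawNode), getScore(wrapped));
```

Both calls read the same underlying node, so callers would not need to know which shape they are holding. Whether that is worth the extra indirection depends on how often raw nodes actually leak out of cheerio calls.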
DONE:
x extractLeadImageUrl
x extractDek
x extractDatePublished
x Title metadata
x Test re-initializing $ if/when it needs to loop again
x cleanHeaders
Remove any headers that come before any p tags, that match the title, etc
x extract
(this kicks it all off)
x node_is_sufficient
x _extract_best_node
x get_weight
x _strip_unlikely_candidates
x _convert_to_paragraphs
x _brs_to_paragraphs
x _paragraphize
Scoring
x _get_score
x _set_score
x _add_score
x _score_content
x _score_node
x _score_paragraph
Top Candidate
x _find_top_candidate
x extract_clean_node
x _clean_conditionally
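
The cleanHeaders note in the DONE list can be illustrated with a simplified sketch. It works on a flat array of `{ tag, text }` blocks instead of a real DOM, which is an assumption made for brevity. It covers only the two rules the note names (headers before any p tag, headers matching the title), and treating a document with no p tags as "every header comes before any p" is also an assumption.

```javascript
const HEADER_TAGS = new Set(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']);

// Collapse whitespace and case so "  My   Title " matches "my title".
function normalize(text) {
  return text.trim().replace(/\s+/g, ' ').toLowerCase();
}

// Simplified stand-in for cleanHeaders: returns a new array without
// headers that precede the first paragraph or that repeat the title.
function cleanHeaders(blocks, title = '') {
  const firstParagraph = blocks.findIndex((block) => block.tag === 'p');
  const normalizedTitle = normalize(title);

  return blocks.filter((block, index) => {
    if (!HEADER_TAGS.has(block.tag)) return true;

    // Rule 1: header appears before any p tag.
    const beforeParagraphs = firstParagraph === -1 || index < firstParagraph;
    // Rule 2: header text matches the article title.
    const matchesTitle =
      normalizedTitle !== '' && normalize(block.text) === normalizedTitle;

    return !(beforeParagraphs || matchesTitle);
  });
}

const blocks = [
  { tag: 'h1', text: 'My Title' },
  { tag: 'p', text: 'Intro' },
  { tag: 'h2', text: ' my   title ' },
  { tag: 'h2', text: 'Section' },
  { tag: 'p', text: 'Body' },
];

console.log(cleanHeaders(blocks, 'My Title').map((block) => block.tag));
```

In this input the leading h1 is dropped by the first rule and the repeated title h2 by the second, while the "Section" header survives.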