TODO:
Tomorrow:
- extractDek
- extractNextPageUrl
- extractLeadImageUrl
- Try Closure webpack compiler
- Make sure weightNodes flag is being passed properly
- Get a better sense of when cheerio returns a raw node and when it returns a cheerio object
- Remove $ from function calls to getScore
- Remove $ whenever possible
- Test whether the `.is` method is faster than regex methods
- Separate constants into activity-specific folders (dom, scoring)
DONE:
x extractDatePublished
x Title metadata
x Test re-initializing $ if/when it needs to loop again
x `cleanHeaders`: remove any headers that appear before any p tags, that match the title, etc.
x `extract` (this kicks it all off)
x `node_is_sufficient`
x `_extract_best_node`
x `get_weight`
x `_strip_unlikely_candidates`
x `_convert_to_paragraphs`
x `_brs_to_paragraphs`
x `_paragraphize`
## Scoring
x `_get_score`
x `_set_score`
x `_add_score`
x `_score_content`
x `_score_node`
x `_score_paragraph`
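
A rough sketch of how the get/set/add score helpers might look once `$` is dropped from their signatures. This is an assumption for illustration, not the project's actual code: it supposes the score is stored as a `score` attribute on a raw node's `attribs` map, and the names `getScore`, `setScore`, and `addScore` are hypothetical camelCase stand-ins for the items above.

```javascript
// Hypothetical helpers: the score is assumed to live in a `score` attribute
// on the raw node's `attribs` map, so no `$` argument is needed.
function getScore(node) {
  const score = parseFloat(node.attribs && node.attribs.score);
  // An unscored node yields NaN from parseFloat; report that as null.
  return Number.isNaN(score) ? null : score;
}

function setScore(node, score) {
  node.attribs = node.attribs || {};
  node.attribs.score = String(score);
  return node;
}

function addScore(node, amount) {
  // Treat an unscored node as starting from 0.
  const current = getScore(node) || 0;
  return setScore(node, current + amount);
}

// Usage with a plain object shaped like a raw node.
const paragraph = { name: 'p', attribs: {} };
setScore(paragraph, 10);
addScore(paragraph, 2.5);
console.log(getScore(paragraph)); // 12.5
```

Keeping the helpers on raw nodes would avoid wrapping each node in a cheerio object just to read one attribute, which is the motivation assumed here.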
## Top Candidate
x `_find_top_candidate`
x `extract_clean_node`
x `_clean_conditionally`
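
One possible shape for the top-candidate step, sketched under assumptions: candidates are raw-node-like objects carrying a numeric `score` attribute, and the "top" candidate is simply the highest-scoring one. The name `findTopCandidate` is hypothetical, and the real step may apply further rules (such as skipping certain tags or merging siblings) that are not shown here.

```javascript
// Hypothetical sketch: return the candidate with the highest numeric score,
// or null when no candidate has been scored.
function findTopCandidate(candidates) {
  let top = null;
  for (const node of candidates) {
    const score = parseFloat(node.attribs && node.attribs.score);
    if (Number.isNaN(score)) continue; // skip unscored nodes
    if (top === null || score > top.score) {
      top = { node, score };
    }
  }
  return top ? top.node : null;
}

// Usage with plain objects shaped like raw nodes.
const candidates = [
  { name: 'div', attribs: { score: '12' } },
  { name: 'article', attribs: { score: '48.5' } },
  { name: 'aside', attribs: {} },
];
console.log(findTopCandidate(candidates).name); // article
```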