mercury-parser/TODO.md
Adam Pash 0ff3082295 feat: GenericExtractLeadImageUrl
Squashed commit of the following:

commit 22d37ebf26dbbd0a3daebbfde3509a6ce04aaf72
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 1 17:50:13 2016 -0400

    feat: GenericExtractLeadImageUrl

commit 3327a0a7929dd0e9267dc9c26f4e2aa78c32586f
Author: Adam Pash <adam.pash@gmail.com>
Date:   Thu Sep 1 15:33:42 2016 -0400

    feat: can pass custom attributes to extractFromMeta
2016-09-01 17:50:42 -04:00

TODO:

Tmrw:

  • extractNextPageUrl

  • Try Closure webpack compiler
  • Rename all cleaners from cleanThing to clean
  • Make sure weightNodes flag is being passed properly
  • Get a better sense of when cheerio returns a raw node and when it returns a cheerio object
    • Remove $ from function calls to getScore
    • Remove $ whenever possible
  • Test whether the .is method is faster than regex methods
  • Separate constants into activity-specific folders (dom, scoring)

DONE:

x extractLeadImageUrl
x extractDek
x extractDatePublished
x Title metadata
x Test re-initializing $ if/when it needs to loop again
x cleanHeaders: Remove any headers that are before any p tags, matching title, etc
x extract (this kicks it all off)
x node_is_sufficient
x _extract_best_node
x get_weight
x _strip_unlikely_candidates
x _convert_to_paragraphs
x _brs_to_paragraphs
x _paragraphize

Scoring

x _get_score
x _set_score
x _add_score
x _score_content
x _score_node
x _score_paragraph

Top Candidate

x _find_top_candidate
x extract_clean_node
x _clean_conditionally