mercury-parser/TODO.md
Adam Pash 8da2425e59 feat: resource fetches content from a URL and prepares for parsing
Squashed commit of the following:

commit 7ba2d2b36d175f5ccbc02f918322ea0dd44bf2c1
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 17:55:10 2016 -0400

    feat: resource fetches content from a URL and prepares for parsing

commit 0abdfa49eed5b363169070dac6d65d0a5818c918
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 17:54:07 2016 -0400

    fix: this was messing up double Esses ('ss', as in class => cla)

commit 9dc65a99631e3a68267a68b2b4629c4be8f61546
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 14:58:57 2016 -0400

    fix: test suite working w/new dirs

commit 993dc33a5229bfa22ea998e3c4fe105be9d91c21
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 14:49:39 2016 -0400

    feat: convertLazyLoadedImages puts img urls in the src

commit e7fb105443dd16d036e460ad21fbcb47191f475b
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 14:30:43 2016 -0400

    feat: makeLinksAbsolute to fully qualify urls

commit dbd665078af854efe84bbbfe9b55acd02e1a652f
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 13:38:33 2016 -0400

    feat: fetchResource to fetch a url and validate the response

commit 42d3937c8f0f8df693996c2edee93625f13dced7
Author: Adam Pash <adam.pash@gmail.com>
Date:   Tue Sep 6 10:25:34 2016 -0400

    feat: normalizing meta tags
2016-09-06 17:55:45 -04:00


TODO:

  • run makeLinksAbsolute on extracted content before returning
  • remove logic for fetching meta attrs with custom props
  • Resource (fetches page, validates it, cleans it, normalizes meta tags (!), converts lazy-loaded images, makes links absolute, etc)
  • extractNextPageUrl
  • Rename all cleaners from cleanThing to clean
  • Make sure weightNodes flag is being passed properly
  • Get better sense of when cheerio returns a raw node and when a cheerio object
    • Remove $ from function calls to getScore
    • Remove $ whenever possible
  • Test if .is method is faster than regex methods
  • Separate constants into activity-specific folders (dom, scoring)
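
As a rough illustration of the makeLinksAbsolute item above, here is a minimal sketch of resolving link values against the page URL using the standard `URL` class. The helper names `absolutize` and `absolutizeAll` are hypothetical and are not taken from the codebase; the real function presumably walks the `href`/`src` attributes of a cheerio selection, which is left out here.

```javascript
// Hypothetical helper (not from the codebase): resolve a single href/src
// value against the URL of the page it was found on.
function absolutize(value, baseUrl) {
  try {
    // The WHATWG URL constructor handles relative paths, root-relative
    // paths, protocol-relative URLs, and already-absolute URLs.
    return new URL(value, baseUrl).href;
  } catch (err) {
    // Leave values that cannot be parsed untouched.
    return value;
  }
}

// Hypothetical helper: apply absolutize to a list of attribute values.
function absolutizeAll(values, baseUrl) {
  return values.map((value) => absolutize(value, baseUrl));
}

const pageUrl = 'https://example.com/posts/1';
console.log(absolutizeAll(['/about', 'img/a.png', '//cdn.example.com/a.js'], pageUrl));
```

Running the same step on extracted content before returning it, as the first item suggests, would only need the original page URL to be passed along with the content.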

DONE:

  x Check that lead-image-url extractor isn't looking for end-of-string file extension matches (i.e., it could be ...foo.jpg?otherstuff)
  x extractLeadImageUrl
  x extractDek
  x extractDatePublished
  x Title metadata
  x Test re-initializing $ if/when it needs to loop again
  x cleanHeaders: remove any headers that are before any p tags, matching title, etc
  x extract (this kicks it all off)
  x node_is_sufficient
  x _extract_best_node
  x get_weight
  x _strip_unlikely_candidates
  x _convert_to_paragraphs
  x _brs_to_paragraphs
  x _paragraphize
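
The lead-image-url item above concerns URLs where the file extension is followed by a query string. A small sketch of the difference, assuming the check is regex-based; the constant names and the extension list are illustrative, not the project's actual patterns:

```javascript
// Anchored pattern: only matches when the extension ends the string,
// so it misses URLs that carry a query string or fragment.
const ENDS_WITH_IMAGE_EXT = /\.(jpe?g|png|gif)$/i;

// Looser pattern: the extension may be followed by end-of-string,
// a query string (?) or a fragment (#).
const HAS_IMAGE_EXT = /\.(jpe?g|png|gif)(?=$|[?#])/i;

const url = 'http://example.com/foo.jpg?otherstuff';
console.log(ENDS_WITH_IMAGE_EXT.test(url)); // false
console.log(HAS_IMAGE_EXT.test(url)); // true
```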

Scoring

  x _get_score
  x _set_score
  x _add_score
  x _score_content
  x _score_node
  x _score_paragraph

Top Candidate

  x _find_top_candidate
  x extract_clean_node
  x _clean_conditionally