mercury-parser/TODO.md at b3481a2c45ca5ca5e243c46bdbd4a308d8dd47af

mirror of https://github.com/postlight/mercury-parser synced 2024-11-12 19:10:45 +00:00

Adam Pash 81ed4f00ed feat: improve nymag.com extractor to grab deks from features

2016-09-14 13:12:40 -04:00

TODO:

Complete response:
- add excerpt
- add word count
Test if .is method is faster than regex methods

DONE:
x add total pages
x add rendered pages
x add canonicalUrl
x add domain
x Separate constants into activity-specific folders (dom, scoring)
x extractNextPageUrl
x Make sure weightNodes flag is being passed properly
x Rename all cleaners from cleanThing to clean
x Remove $ from function calls to getScore
x remove all but attributes whitelist. research what attributes are important beyond SRC and href
x remove logic for fetching meta attrs with custom props
x cleaning embed and object nodes
x run makeLinksAbsolute on extracted content before returning
x add option to fetch attrs in RootExtractor's select method
x get custom datePublished selector to convert to date object (prob through cleaner)
x extract and generalize cleaners
x move arguments to cleaners to object
x Check that lead-image-url extractor isn't looking for end-of-string file extension matches (i.e., it could be ...foo.jpg?otherstuff)
x extractLeadImageUrl
x Resource (fetches page, validates it, cleans it, normalizes meta tags (!), converts lazy-loaded images, makes links absolute, etc)
x extractDek
x extractDatePublished
x Title metadata
x Test re-initializing $ if/when it needs to loop again
x cleanHeaders Remove any headers that are before any p tags, matching title, etc
x extract (this kicks it all off)
x node_is_sufficient
x _extract_best_node
x get_weight
x _strip_unlikely_candidates
x _convert_to_paragraphs
x _brs_to_paragraphs
x _paragraphize
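
The lead-image-url item above concerns image URLs where a query string follows the file extension. A minimal sketch of one way to handle that case is below. The names `IMAGE_EXT_RE`, `STRICT_RE`, and `looksLikeImageUrl` are hypothetical and are not taken from the mercury-parser source, and the list of extensions is an assumption.

```javascript
// Hypothetical helper, not mercury-parser's actual implementation.
// Matches an image extension that is followed by the end of the string,
// a query string (?), or a fragment (#).
const IMAGE_EXT_RE = /\.(?:jpe?g|png|gif|webp)(?=$|[?#])/i;

// For comparison: an end-of-string anchor alone rejects URLs that
// carry a query string after the extension.
const STRICT_RE = /\.(?:jpe?g|png|gif|webp)$/i;

function looksLikeImageUrl(url) {
  return IMAGE_EXT_RE.test(url);
}

console.log(looksLikeImageUrl('http://example.com/foo.jpg'));            // true
console.log(looksLikeImageUrl('http://example.com/foo.jpg?otherstuff')); // true
console.log(STRICT_RE.test('http://example.com/foo.jpg?otherstuff'));    // false
```

The lookahead keeps the match tied to the extension itself, so `foo.jpgx` is still rejected. Because the pattern runs on the whole URL, an extension that appears only inside a query value (for example `?img=foo.jpg`) would also match. Checking the parsed path instead would avoid that.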

Scoring

x _get_score
x _set_score
x _add_score
x _score_content
x _score_node
x _score_paragraph

Top Candidate

x _find_top_candidate
x extract_clean_node
x _clean_conditionally
