1.7 KiB
TODO:
- Complete response:
- add excerpt
- add word count
- Test if .is method is faster than regex methods
DONE:
x add total pages
x add rendered pages
x add canonicalUrl
x add domain
x Separate constants into activity-specific folders (dom, scoring)
x extractNextPageUrl
x Make sure weightNodes flag is being passed properly
x Rename all cleaners from cleanThing to clean
x Remove $ from function calls to getScore
x remove all but attributes whitelist. research what attributes are important beyond SRC and href
x remove logic for fetching meta attrs with custom props
x cleaning embed and object nodes
x run makeLinksAbsolute on extracted content before returning
x add option to fetch attrs in RootExtractor's select method
x get custom datePublished selector to convert to date object (prob through cleaner)
x extract and generalize cleaners
x move arguments to cleaners to object
x Check that lead-image-url extractor isn't looking for end-of-string file extension matches (i.e., it could be ...foo.jpg?otherstuff
x extractLeadImageUrl
x Resource (fetches page, validates it, cleans it, normalizes meta tags (!), converts lazy-loaded images, makes links absolute, etc)
x extractDek
x extractDatePublished
x Title metadata
x Test re-initializing $ if/when it needs to loop again
x cleanHeaders
Remove any headers that are before any p tags, matching title, etc
x extract
(this kicks it all off)
x node_is_sufficient
x _extract_best_node
x get_weight
x _strip_unlikely_candidates
x _convert_to_paragraphs
x _brs_to_paragraphs
x _paragraphize
Scoring
x _get_score
x _set_score
x _add_score
x _score_content
x _score_node
x _score_paragraph
Top Candidate
x _find_top_candidate
x extract_clean_node
x _clean_conditionally