mercury-parser/src/cleaners/content.js

95 lines
3.1 KiB
JavaScript

feat: implemented extractBestNode functionality Squashed commit of the following: commit 9af554dd975ff1778ed70c71fa9bde667fc5f880 Author: Adam Pash <adam.pash@gmail.com> Date: Tue Aug 30 15:19:32 2016 -0400 feat: add cleanHeaders commit 0dfea98eedc4f97fcbd78866322595c705e20521 Author: Adam Pash <adam.pash@gmail.com> Date: Tue Aug 30 14:30:49 2016 -0400 fix: scoring parent nodes recursively commit b6e5897a694adeb81e25a905aba72c0f45a8cc94 Author: Adam Pash <adam.pash@gmail.com> Date: Tue Aug 30 12:47:24 2016 -0400 feat: extract clean node up and running commit fb652c5db13db6bce7271efd68ba4b20515e9549 Author: Adam Pash <adam.pash@gmail.com> Date: Tue Aug 30 09:57:21 2016 -0400 chore: added test for p tags with nested tags (e.g., img, iframe) commit 731d0a2e4d89121dfafad195e9d0911805c4f8e4 Author: Adam Pash <adam.pash@gmail.com> Date: Mon Aug 29 17:50:33 2016 -0400 feat: extact clean node integrates most functions commit 322bc6534d30feb7c1c08d3813132badc6286b40 Author: Adam Pash <adam.pash@gmail.com> Date: Mon Aug 29 16:46:04 2016 -0400 feat: removing empty nodes as defined in constants commit f1d38932ea12a865814d2326970031fcb8515baa Author: Adam Pash <adam.pash@gmail.com> Date: Mon Aug 29 16:33:31 2016 -0400 feat: cleaning attributes from nodes commit 0aa73ada6854af0ecd504bfe3d926a9524787ab5 Author: Adam Pash <adam.pash@gmail.com> Date: Mon Aug 29 16:09:56 2016 -0400 feat: cleaning h1s from text commit 12d4a309246285c278ce7765e4fbaa8271bb5889 Author: Adam Pash <adam.pash@gmail.com> Date: Mon Aug 29 15:52:03 2016 -0400 feat: removing spacer images commit 4e74ff830cc67586560f6fc72e2cfa432a3a2647 Author: Adam Pash <adam.pash@gmail.com> Date: Mon Aug 29 15:38:49 2016 -0400 feat: stripping unwanted html from doc commit c774166e90169fd0c1aa89898d3f7a975e82bf0a Author: Adam Pash <adam.pash@gmail.com> Date: Mon Aug 29 15:17:32 2016 -0400 feat: removing small images, height attribute from images commit 3a8642f42cda451669c832482c5e1611b1ff2ea9 Author: Adam Pash 
<adam.pash@gmail.com> Date: Mon Aug 29 12:57:45 2016 -0400 feat: rewrite top level commit a1c03e779234b0aea02206d92ec3dcc15758507e Author: Adam Pash <adam.pash@gmail.com> Date: Fri Aug 26 17:34:36 2016 -0400 in a weird place rn
8 years ago
import {
  cleanAttributes,
  cleanHeaders,
  cleanHOnes,
  cleanImages,
  cleanTags,
  removeEmpty,
  rewriteTopLevel,
  stripJunkTags,
  makeLinksAbsolute,
} from 'utils/dom';
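The helpers imported above are applied one after another to the article node by the function that follows. As a rough illustration of that pattern, here is a minimal sketch of two of the passes (the H1 rule and link absolutizing) chained together. It assumes a plain-object tree (`{ tag, attrs, children }`) instead of cheerio nodes, and every name ending in `Sketch`, along with `findByTag` and `removeByTag`, is hypothetical; none of this is the actual `utils/dom` implementation.

```javascript
// Hypothetical sketch: nodes are plain objects, not cheerio elements.
// These helpers are illustrations, not the real utils/dom functions.

// Collect every descendant of `node` whose tag matches `tag`.
function findByTag(node, tag, found = []) {
  for (const child of node.children || []) {
    if (child.tag === tag) found.push(child);
    findByTag(child, tag, found);
  }
  return found;
}

// Remove every descendant with the given tag, in place.
function removeByTag(node, tag) {
  if (!node.children) return;
  node.children = node.children.filter(child => child.tag !== tag);
  node.children.forEach(child => removeByTag(child, tag));
}

// H1 rule: fewer than 3 H1s are stripped, otherwise they become H2s.
function cleanHOnesSketch(article) {
  const hOnes = findByTag(article, 'h1');
  if (hOnes.length < 3) {
    removeByTag(article, 'h1');
  } else {
    hOnes.forEach((node) => { node.tag = 'h2'; });
  }
  return article;
}

// Resolve href/src values against the page URL with the standard URL API.
function makeLinksAbsoluteSketch(article, url) {
  const visit = (node) => {
    const attrs = node.attrs || {};
    ['href', 'src'].forEach((name) => {
      if (!attrs[name]) return;
      try {
        attrs[name] = new URL(attrs[name], url).href;
      } catch (e) {
        // No usable base URL: leave the attribute untouched.
      }
    });
    (node.children || []).forEach(visit);
  };
  visit(article);
  return article;
}

// Run the passes in order, each one mutating and returning the same node.
function extractCleanNodeSketch(article, { url = '' } = {}) {
  const passes = [
    cleanHOnesSketch,
    node => makeLinksAbsoluteSketch(node, url),
  ];
  return passes.reduce((node, pass) => pass(node), article);
}

const article = {
  tag: 'div',
  children: [
    { tag: 'h1', children: [] },
    { tag: 'p', children: [{ tag: 'a', attrs: { href: '/about' }, children: [] }] },
  ],
};
const cleaned = extractCleanNodeSketch(article, { url: 'https://example.com/posts/1' });
console.log(JSON.stringify(cleaned));
```

In this sketch the single H1 is dropped and the relative link resolves to `https://example.com/about`. Order matters in a pipeline like this: a pass that removes nodes early saves later passes from processing content that would be discarded anyway.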
// Clean our article content, returning a new, cleaned node.
export default function extractCleanNode(
  article,
  {
    $,
    cleanConditionally = true,
    title = '',
    url = '',
  }
) {
  // Rewrite the tag name to div if it's a top level node like body or
  // html to avoid later complications with multiple body tags.
  rewriteTopLevel(article, $);
  // Drop small images and spacer images
  cleanImages(article, $);
  // Drop junk tags such as <title>.
  // This is mostly for cleanliness, not security.
  stripJunkTags(article, $);
  // H1 tags are typically the article title, which should be extracted
  // by the title extractor instead. If there are fewer than 3 of them,
  // strip them. Otherwise, turn them into H2s.
  cleanHOnes(article, $);
  // Clean headers
  cleanHeaders(article, $, title);
  // Make links absolute
  makeLinksAbsolute(article, $, url);
  // Remove style or align attributes
  cleanAttributes(article);
  // We used to clean ULs and OLs here, but that removed too many
  // in-article lists. Consider a better way to detect menus
  // specifically and remove them.
  cleanTags(article, $, cleanConditionally);
  // Remove empty paragraph nodes
  removeEmpty(article, $);
return article;
}
// headers = doc.xpath('.//h2 | .//h3 | .//h4 | .//h5 | .//h6')
// for header in headers:
//     drop_header = False
//
//     # Remove any headers that are before any p tags in the
//     # document. This probably means that it was part of the title, a
//     # subtitle or something else extraneous like a datestamp or byline,
//     # all of which should be handled by other metadata handling.
//     no_previous_ps = int(header.xpath("count(preceding::p[1])")) == 0
//     if no_previous_ps:
//         similar_header_count = int(doc.xpath('count(.//%s)' % header.tag))
//         if similar_header_count < 3:
//             drop_header = True
//
//     # Remove any headers that match the title exactly.
//     if inner_text(header) == self.title:
//         drop_header = True
//
//     # If this header has a negative weight, it's probably junk.
//     # Get rid of it.
//     if self.get_weight(header) < 0:
//         drop_header = True
//
//     if drop_header:
//         try:
//             header.drop_tree()
//         except AssertionError:
//             # No parent exists for this node, so just blank it out.
//             header.text = ''
//
// if clean_conditionally:
//     # We used to clean UL's and OL's here, but it was leading to
//     # too many in-article lists being removed. Consider a better
//     # way to detect menus particularly and remove them.
//     self._clean_conditionally(doc, ['ul', 'ol', 'table', 'div'])
//
// return doc
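
// A rough JavaScript sketch of the header-dropping rules from the Python
// reference above, kept here only as an illustration. It is not the
// project's cleanHeaders implementation. It assumes a hypothetical flat
// model instead of a cheerio document: `nodes` is an array of
// { tag, text, weight } objects in document order, where `weight` stands in
// for whatever get_weight() would return. "Preceding p" is approximated as
// "any earlier p in the array". The names HEADER_TAGS, shouldDropHeader and
// cleanHeadersSketch are made up for this example.

```javascript
const HEADER_TAGS = new Set(['h2', 'h3', 'h4', 'h5', 'h6']);

// Decide whether the header at `index` in `nodes` should be dropped.
function shouldDropHeader(nodes, index, title) {
  const header = nodes[index];

  // Headers that appear before any <p> are probably part of the title,
  // a subtitle, a datestamp or a byline. Drop them unless the same tag
  // is used at least three times in the document.
  const hasPrecedingP = nodes.slice(0, index).some(n => n.tag === 'p');
  if (!hasPrecedingP) {
    const similarCount = nodes.filter(n => n.tag === header.tag).length;
    if (similarCount < 3) return true;
  }

  // Drop headers that match the title exactly.
  if (header.text === title) return true;

  // A negative weight suggests the header is junk.
  if (header.weight < 0) return true;

  return false;
}

// Return a new array with the unwanted headers removed. Like the Python
// loop, each check sees the document as it stands after earlier drops.
function cleanHeadersSketch(nodes, title) {
  const kept = nodes.slice();
  for (const node of nodes) {
    if (!HEADER_TAGS.has(node.tag)) continue;
    const index = kept.indexOf(node);
    if (shouldDropHeader(kept, index, title)) kept.splice(index, 1);
  }
  return kept;
}

// Example usage with made-up data.
const sampleNodes = [
  { tag: 'h2', text: 'My Title', weight: 0 },
  { tag: 'p', text: 'intro', weight: 0 },
  { tag: 'h3', text: 'Section', weight: 5 },
  { tag: 'h3', text: 'Ads', weight: -10 },
  { tag: 'p', text: 'body', weight: 0 },
];
const cleaned = cleanHeadersSketch(sampleNodes, 'My Title');
console.log(cleaned.map(n => n.tag).join(','));
```

// In a real port the same three checks would run against cheerio nodes,
// with the conditional cleaning of ul/ol/table/div handled separately.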