Fix for #52: <input type="hidden"> elements are no longer counted by the "form removal" heuristic.

Yuri Baburov 2014-09-22 15:31:31 +07:00
parent 2fab5ffa6b
commit 638f73f6a2

@@ -452,6 +452,7 @@ class Document:
 for kind in ['p', 'img', 'li', 'a', 'embed', 'input']:
     counts[kind] = len(el.findall('.//%s' % kind))
 counts["li"] -= 100
+counts["input"] -= len(el.findall('.//input[@type="hidden"]'))
 # Count the text length excluding any surrounding whitespace
 content_length = text_length(el)
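
The effect of the added line can be checked in isolation. This is a minimal sketch, not the project's code: it assumes the standard library's xml.etree.ElementTree as a stand-in for the element type used in the patch (it accepts the same findall path expressions), and the helper name visible_input_count and the sample form are hypothetical.

```python
import xml.etree.ElementTree as ET


def visible_input_count(el):
    """Count <input> descendants of el, ignoring type="hidden" ones.

    Hypothetical helper: it mirrors the two findall() calls in the patch.
    """
    total = len(el.findall('.//input'))
    hidden = len(el.findall('.//input[@type="hidden"]'))
    return total - hidden


# Hypothetical sample: two hidden bookkeeping fields and one visible field.
form = ET.fromstring(
    '<form>'
    '<input type="hidden" name="csrf" value="x"/>'
    '<input type="hidden" name="next" value="/"/>'
    '<input type="text" name="q"/>'
    '<p>Search the archive</p>'
    '</form>'
)

print(visible_input_count(form))  # prints 1
```

Before the fix all three inputs would have been counted; with the subtraction only the visible text field remains. In ElementTree the attribute predicate is an exact, case-sensitive match, so in this sketch type="HIDDEN" would still be counted.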