Merge branch 'master' of https://github.com/bookieio/breadability into upstream-sync

Conflicts:
	CHANGELOG.rst
	README.rst
	breadability/document.py
	breadability/scoring.py
	breadability/scripts/client.py
	setup.py
	tests/test_articles/test_sweetshark/article.html
	tests/test_articles/test_sweetshark/test.py
pull/21/head
Mišo Belica 10 years ago
commit 687d2ecfdf

@ -1,19 +1,41 @@
.. :changelog:
Changelog for readability
Changelog for breadability
==========================
- Sibling node is appended only when sibling doesn't already exist.
- Treat images a little differently so they get more inclusion.
- Added User-Agent string into HTTP requests.
- Added property ``Article.main_text`` for getting text annotated with semantic HTML tags (<em>, <strong>, ...); a usage sketch follows below.
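
The new property is easiest to picture with a quick sketch (a minimal, hypothetical usage; the shape of the returned value is an assumption, not documented in this diff):

.. code-block:: python

    from breadability.readable import Article

    # Hypothetical usage of the ``Article.main_text`` property noted above;
    # the exact return structure is assumed, not shown in this diff.
    with open("article.html") as f:
        article = Article(f.read())

    # Unlike the plain readable output, ``main_text`` keeps semantic
    # annotations (<em>, <strong>, ...) attached to the extracted text.
    print(article.main_text)
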
0.1.17 (Jan 22nd 2014)
----------------------
- More log quieting down to INFO vs WARN
0.1.16 (Jan 22nd 2014)
----------------------
- Clean up logging output: don't log at WARNING when it's not a true warning
0.1.15 (Nov 29th 2013)
-----------------------
- Merged changes from breadability 0.1.14 with the fork https://github.com/miso-belica/readability.py and tweaked things to return to the name breadability.
- Fork: Added property ``Article.main_text`` for getting text annotated with
semantic HTML tags (<em>, <strong>, ...).
- Join node with 1 child of the same type. From
- Fork: Join node with 1 child of the same type. From
``<div><div>...</div></div>`` we get ``<div>...</div>``.
- Don't change <div> to <p> if it contains <p> elements.
- Renamed test generation helper 'readability_newtest' -> 'readability_test'.
- Renamed package to readability.
- Added support for Python >= 3.2.
- Py3k compatible package 'charade' is used instead of 'chardet'.
- Fork: Don't change <div> to <p> if it contains <p> elements.
- Fork: Renamed test generation helper 'readability_newtest' -> 'readability_test'.
- Fork: Renamed package to readability. (Renamed back)
- Fork: Added support for Python >= 3.2.
- Fork: Py3k compatible package 'charade' is used instead of 'chardet'.
0.1.14 (Nov 7th 2013)
----------------------
- Update sibling append to only happen when sibling doesn't already exist.
0.1.13 (Aug 31st 2013)
-----------------------
- Give images in content body a better chance of survival
- Add tests
0.1.12 (July 28th 2013)
-----------------------
- Add a user agent to requests.
0.1.11 (Dec 12th 2012)
-----------------------

@ -0,0 +1,4 @@
Rick Harding
nhnifong
Craig Maloney
Mišo Belica

@ -0,0 +1,72 @@
# Makefile to help automate tasks
WD := $(shell pwd)
PY := bin/python
PIP := bin/pip
PEP8 := bin/pep8
NOSE := bin/nosetests

# ###########
# Tests rule!
# ###########
.PHONY: test
test: venv develop $(NOSE)
	$(NOSE) -s tests

$(NOSE):
	$(PIP) install nose nose-selecttests pep8 pylint coverage

# #######
# INSTALL
# #######
.PHONY: all
all: venv deps develop

venv: bin/python
bin/python:
	virtualenv .

.PHONY: deps
deps: venv
	pip install -r requirements.txt

.PHONY: clean_venv
clean_venv:
	rm -rf bin include lib local man share

.PHONY: develop
develop: lib/python*/site-packages/breadability.egg-link
lib/python*/site-packages/breadability.egg-link:
	$(PY) setup.py develop

# ###########
# Development
# ###########
.PHONY: clean_all
clean_all: clean_venv
	if [ -d dist ]; then \
		rm -r dist; \
	fi

bin/flake8: venv
	bin/pip install flake8

lint: bin/flake8
	flake8 breadability

# ###########
# Deploy
# ###########
.PHONY: dist
dist:
	$(PY) setup.py sdist

.PHONY: upload
upload:
	$(PY) setup.py sdist upload

.PHONY: version_update
version_update:
	$(EDITOR) setup.py CHANGELOG.rst
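
With these targets, ``make test`` bootstraps the virtualenv, installs nose and friends, and runs the suite; ``make lint`` installs flake8 into the venv and lints the ``breadability`` package.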

@ -1,14 +1,14 @@
Readability.py - another readability Python port
==============================================
.. image:: https://api.travis-ci.org/miso-belica/readability.py.png?branch=master
:target: https://travis-ci.org/miso-belica/readability.py
breadability - another readability Python (v2.6-v3.3) port
===========================================================
.. image:: https://api.travis-ci.org/bookieio/breadability.png?branch=master
:target: https://travis-ci.org/bookieio/breadability.py
I've tried to work with the various forks of some ancient codebase that ported
`readability`_ to Python. The lack of tests, unused regexes, and commented-out
sections of code in other Python ports just drove me nuts.
I put forth an effort to bring in several of the better forks into one
codebase, but they've diverged so much that I just can't work with it.
code base, but they've diverged so much that I just can't work with it.
So what's any sane person to do? Re-port it with my own repo, add some tests,
infrastructure, and try to make this port better. OSS FTW (and yea, NIH FML,
@ -47,7 +47,7 @@ things from pip so that it can compile.
.. code-block:: bash
$ [sudo] apt-get install libxml2-dev libxslt-dev
$ [sudo] pip install git+git://github.com/miso-belica/readability.py.git
$ [sudo] pip install git+git://github.com/bookieio/breadability.git
Tests
-----
@ -63,7 +63,7 @@ Command line
.. code-block:: bash
$ readability http://wiki.python.org/moin/BeginnersGuide
$ breadability http://wiki.python.org/moin/BeginnersGuide
Options
```````
@ -85,7 +85,7 @@ Python API
from __future__ import print_function
from readability.readable import Article
from breadability.readable import Article
if __name__ == "__main__":
@ -103,7 +103,7 @@ hopefully things are setup in a way that those can/will be added.
Fortunately, I need this library for my tools:
- https://bmark.us
- http://readable.bmark.us
- http://r.bmark.us
so I really need this to be an active and improving project.
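
For reference, a fleshed-out version of the README's (truncated) Python API snippet might look like this; the input file name is an illustrative assumption:

.. code-block:: python

    from __future__ import print_function

    from breadability.readable import Article

    if __name__ == "__main__":
        # "article.html" is a placeholder; any fetched HTML string works.
        with open("article.html") as f:
            document = Article(f.read())
        # ``readable`` holds the cleaned article HTML, wrapped in
        # <div id="readabilityBody"> as the tests in this diff assert.
        print(document.readable)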

@ -0,0 +1,11 @@
# -*- coding: utf8 -*-
from __future__ import (
    absolute_import,
    division,
    print_function,
    unicode_literals
)

import pkg_resources

__version__ = pkg_resources.get_distribution("breadability").version

@ -19,9 +19,13 @@ string_types = (bytes, unicode,)
try:
    # Assert to hush pyflakes about the unused import. This is a _compat
    # module and we expect this to aid in other code importing urllib.
    import urllib2 as urllib
    assert urllib
except ImportError:
    import urllib.request as urllib
    assert urllib
def unicode_compatible(cls):
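
The point of the shim above is that consuming code imports a single name and gets ``urllib2`` on Python 2 or ``urllib.request`` on Python 3. A minimal sketch of the consuming side (the call site here is assumed, not taken from this diff):

.. code-block:: python

    # Both urllib2 (Python 2) and urllib.request (Python 3) expose
    # urlopen, so callers can stay version-agnostic.
    from breadability._compat import urllib

    response = urllib.urlopen("http://example.com")
    html = response.read()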

@ -8,17 +8,31 @@ import re
import logging
import charade
from lxml.etree import tounicode, XMLSyntaxError
from lxml.html import document_fromstring, HTMLParser
from ._compat import unicode, to_bytes, to_unicode, unicode_compatible
from lxml.etree import (
tounicode,
XMLSyntaxError,
)
from lxml.html import (
document_fromstring,
HTMLParser,
)
from ._compat import (
to_bytes,
to_unicode,
unicode,
unicode_compatible,
)
from .utils import cached_property
logger = logging.getLogger("readability")
logger = logging.getLogger("breadability")
TAG_MARK_PATTERN = re.compile(to_bytes(r"</?[^>]*>\s*"))
UTF8_PARSER = HTMLParser(encoding="utf8")
def determine_encoding(page):
encoding = "utf8"
text = TAG_MARK_PATTERN.sub(to_bytes(" "), page)
@ -43,7 +57,12 @@ def determine_encoding(page):
return encoding
BREAK_TAGS_PATTERN = re.compile(to_unicode(r"(?:<\s*[bh]r[^>]*>\s*)+"), re.IGNORECASE)
BREAK_TAGS_PATTERN = re.compile(
to_unicode(r"(?:<\s*[bh]r[^>]*>\s*)+"),
re.IGNORECASE
)
def convert_breaks_to_paragraphs(html):
"""
Converts <hr> tags and runs of multiple <br> tags into paragraphs.
@ -64,7 +83,6 @@ def _replace_break_tags(match):
return tags
UTF8_PARSER = HTMLParser(encoding="utf8")
def build_document(html_content, base_href=None):
"""Requires that the `html_content` not be None"""
assert html_content is not None
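
The behaviour of ``convert_breaks_to_paragraphs`` above is pinned down by the test expectations later in this diff; as a quick reference:

.. code-block:: python

    from breadability.document import convert_breaks_to_paragraphs

    # Runs of <br> tags (and <hr>) become paragraph boundaries; the
    # expected string is taken verbatim from the tests in this diff.
    returned = convert_breaks_to_paragraphs(
        "<div>HI<br><br>How are you?<br><br> \t \n <br>Fine\n I guess</div>")
    assert returned == "<div>HI</p><p>How are you?</p><p>Fine\n I guess</div>"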

@ -13,12 +13,17 @@ from lxml.html import fragment_fromstring, fromstring
from .document import OriginalDocument
from .annotated_text import AnnotatedTextHandler
from .scoring import (score_candidates, get_link_density, get_class_weight,
is_unlikely_node)
from .scoring import (
get_class_weight,
get_link_density,
is_unlikely_node,
score_candidates,
)
from .utils import cached_property, shrink_text
html_cleaner = Cleaner(scripts=True, javascript=True, comments=True,
html_cleaner = Cleaner(
scripts=True, javascript=True, comments=True,
style=True, links=True, meta=False, add_nofollow=False,
page_structure=False, processing_instructions=True,
embedded=False, frames=False, forms=False,
@ -44,7 +49,7 @@ NULL_DOCUMENT = """
</html>
"""
logger = logging.getLogger("readability")
logger = logging.getLogger("breadability")
def ok_embedded_video(node):
@ -129,7 +134,8 @@ def check_siblings(candidate_node, candidate_list):
content_bonus += candidate_node.content_score * 0.2
if sibling in candidate_list:
adjusted_score = candidate_list[sibling].content_score + content_bonus
adjusted_score = \
candidate_list[sibling].content_score + content_bonus
if adjusted_score >= sibling_target_score:
append = True
@ -146,7 +152,8 @@ def check_siblings(candidate_node, candidate_list):
append = True
if append:
logger.debug("Sibling appended: %s %r", sibling.tag, sibling.attrib)
logger.debug(
"Sibling appended: %s %r", sibling.tag, sibling.attrib)
if sibling.tag not in ("div", "p"):
# We have a node that isn't a common block level element, like
# a form or td tag. Turn it into a div so it doesn't get
@ -191,7 +198,8 @@ def clean_document(node):
if n.tag in ("div", "p"):
text_content = shrink_text(n.text_content())
if len(text_content) < 5 and not n.getchildren():
logger.debug("Dropping %s %r without content.", n.tag, n.attrib)
logger.debug(
"Dropping %s %r without content.", n.tag, n.attrib)
to_drop.append(n)
# finally try out the conditional cleaning of the target node
@ -206,7 +214,8 @@ def clean_document(node):
def drop_nodes_with_parents(nodes):
for node in nodes:
if node.getparent() is not None:
logger.debug("Droping node with parent %s %r", node.tag, node.attrib)
logger.debug(
"Droping node with parent %s %r", node.tag, node.attrib)
node.drop_tree()
@ -231,7 +240,8 @@ def clean_conditionally(node):
commas_count = node.text_content().count(',')
if commas_count < 10:
logger.debug("There are %d commas so we're processing more.", commas_count)
logger.debug(
"There are %d commas so we're processing more.", commas_count)
# If there are not very many commas, and the number of
# non-paragraph elements is more than paragraphs or other ominous
@ -267,7 +277,8 @@ def clean_conditionally(node):
logger.debug('Conditional drop: weight big but link heavy')
remove_node = True
elif (embed == 1 and content_length < 75) or embed > 1:
logger.debug('Conditional drop: embed w/o much content or many embed')
logger.debug(
'Conditional drop: embed w/o much content or many embed')
remove_node = True
if remove_node:
@ -305,10 +316,12 @@ def find_candidates(document):
for node in document.iter():
if is_unlikely_node(node):
logger.debug("We should drop unlikely: %s %r", node.tag, node.attrib)
logger.debug(
"We should drop unlikely: %s %r", node.tag, node.attrib)
should_remove.add(node)
elif is_bad_link(node):
logger.debug("We should drop bad link: %s %r", node.tag, node.attrib)
logger.debug(
"We should drop bad link: %s %r", node.tag, node.attrib)
should_remove.add(node)
elif node.tag in SCORABLE_TAGS:
nodes_to_score.add(node)
@ -399,11 +412,12 @@ class Article(object):
def _readable(self):
"""The readable parsed article"""
if not self.candidates:
logger.warning("No candidates found in document.")
logger.info("No candidates found in document.")
return self._handle_no_candidates()
# right now we return the highest scoring candidate content
best_candidates = sorted((c for c in self.candidates.values()),
best_candidates = sorted(
(c for c in self.candidates.values()),
key=attrgetter("content_score"), reverse=True)
printer = PrettyPrinter(indent=2)
@ -415,9 +429,11 @@ class Article(object):
updated_winner = check_siblings(winner, self.candidates)
updated_winner.node = prep_article(updated_winner.node)
if updated_winner.node is not None:
dom = build_base_document(updated_winner.node, self._return_fragment)
dom = build_base_document(
updated_winner.node, self._return_fragment)
else:
logger.warning('Had candidates but failed to find a cleaned winning DOM.')
logger.info(
'Had candidates but failed to find a cleaned winning DOM.')
dom = self._handle_no_candidates()
return self._remove_orphans(dom.get_element_by_id("readabilityBody"))
@ -437,9 +453,10 @@ class Article(object):
if self.dom is not None and len(self.dom):
dom = prep_article(self.dom)
dom = build_base_document(dom, self._return_fragment)
return self._remove_orphans(dom.get_element_by_id("readabilityBody"))
return self._remove_orphans(
dom.get_element_by_id("readabilityBody"))
else:
logger.warning("No document to use.")
logger.info("No document to use.")
return build_error_document(self._return_fragment)
@ -454,7 +471,8 @@ def leaf_div_elements_into_paragraphs(document):
for element in document.iter(tag="div"):
child_tags = tuple(n.tag for n in element.getchildren())
if "div" not in child_tags and "p" not in child_tags:
logger.debug("Changing leaf block element <%s> into <p>", element.tag)
logger.debug(
"Changing leaf block element <%s> into <p>", element.tag)
element.tag = "p"
return document
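
As the tests further down exercise, ``leaf_div_elements_into_paragraphs`` rewrites only <div> elements with no <div> or <p> children; for example:

.. code-block:: python

    from lxml.etree import tounicode
    from lxml.html import document_fromstring

    from breadability.readable import leaf_div_elements_into_paragraphs

    # Only the leaf <div> is converted to <p>; the outer <div> keeps its
    # tag. Input and output mirror the tests later in this diff.
    dom = document_fromstring(
        "<html><body><div>text<div>child</div>aftertext</div></body></html>")
    print(tounicode(leaf_div_elements_into_paragraphs(dom)))
    # <html><body><div>text<p>child</p>aftertext</div></body></html>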

@ -17,9 +17,9 @@ from .utils import normalize_whitespace
# A series of sets of attributes we check to help in determining if a node is
# a potential candidate or not.
CLS_UNLIKELY = re.compile(
"combx|comment|community|disqus|extra|foot|header|menu|remark|rss|shoutbox|"
"sidebar|sponsor|ad-break|agegate|pagination|pager|perma|popup|tweet|"
"twitter|social|breadcrumb",
"combx|comment|community|disqus|extra|foot|header|menu|remark|rss|"
"shoutbox|sidebar|sponsor|ad-break|agegate|pagination|pager|perma|popup|"
"tweet|twitter|social|breadcrumb",
re.IGNORECASE
)
CLS_MAYBE = re.compile(
@ -32,12 +32,12 @@ CLS_WEIGHT_POSITIVE = re.compile(
)
CLS_WEIGHT_NEGATIVE = re.compile(
"combx|comment|com-|contact|foot|footer|footnote|head|masthead|media|meta|"
"outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|shopping|tags|tool|"
"widget",
"outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|shopping|tags|"
"tool|widget",
re.IGNORECASE
)
logger = logging.getLogger("readability")
logger = logging.getLogger("breadability")
def check_node_attributes(pattern, node, *attributes):
@ -146,8 +146,8 @@ def score_candidates(nodes):
for node in nodes:
logger.debug("* Scoring candidate %s %r", node.tag, node.attrib)
# if the node has no parent it knows of
# then it ends up creating a body & html tag to parent the html fragment
# if the node has no parent it knows of then it ends up creating a
# body & html tag to parent the html fragment
parent = node.getparent()
if parent is None:
logger.debug("Skipping candidate - parent node is 'None'.")
@ -161,7 +161,9 @@ def score_candidates(nodes):
# if paragraph is < `MIN_HIT_LENTH` characters don't even count it
inner_text = node.text_content().strip()
if len(inner_text) < MIN_HIT_LENTH:
logger.debug("Skipping candidate - inner text < %d characters.", MIN_HIT_LENTH)
logger.debug(
"Skipping candidate - inner text < %d characters.",
MIN_HIT_LENTH)
continue
# initialize readability data for the parent
@ -184,7 +186,8 @@ def score_candidates(nodes):
# subtract 0.5 points for each double quote within this paragraph
double_quotes_count = inner_text.count('"')
content_score += double_quotes_count * -0.5
logger.debug("Penalty points for %d double-quotes.", double_quotes_count)
logger.debug(
"Penalty points for %d double-quotes.", double_quotes_count)
# for every 100 characters in this paragraph, add another point
# up to 3 points
@ -193,12 +196,14 @@ def score_candidates(nodes):
logger.debug("Bonus points for length of text: %f", length_points)
# add the score to the parent
logger.debug("Bonus points for parent %s %r with score %f: %f",
logger.debug(
"Bonus points for parent %s %r with score %f: %f",
parent.tag, parent.attrib, candidates[parent].content_score,
content_score)
candidates[parent].content_score += content_score
# the grand node gets half
logger.debug("Bonus points for grand %s %r with score %f: %f",
logger.debug(
"Bonus points for grand %s %r with score %f: %f",
grand.tag, grand.attrib, candidates[grand].content_score,
content_score / 2.0)
candidates[grand].content_score += content_score / 2.0
@ -210,7 +215,8 @@ def score_candidates(nodes):
for candidate in candidates.values():
adjustment = 1.0 - get_link_density(candidate.node)
candidate.content_score *= adjustment
logger.debug("Link density adjustment for %s %r: %f",
logger.debug(
"Link density adjustment for %s %r: %f",
candidate.node.tag, candidate.node.attrib, adjustment)
return candidates
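
Two of the scoring rules quoted above are simple arithmetic; a worked sketch (not the full algorithm, and the sample text is made up):

.. code-block:: python

    # Each double quote costs 0.5 points; every 100 characters adds a
    # point, capped at 3. Other rules from score_candidates are omitted.
    inner_text = 'She said "yes" and then "no" again.' * 10

    content_score = 0.0
    content_score += inner_text.count('"') * -0.5      # double-quote penalty
    content_score += min(len(inner_text) / 100.0, 3)   # capped length bonus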

@ -4,9 +4,9 @@
A fast python port of arc90's readability tool
Usage:
readability [options] <resource>
readability --version
readability --help
breadability [options] <resource>
breadability --version
breadability --help
Arguments:
<resource> URL or file path to process in readable form.
@ -37,7 +37,10 @@ from ..readable import Article
HEADERS = {
"User-Agent": "Readability (Readable content parser; https://github.com/miso-belica/readability.py) Version/%s" % __version__,
"User-Agent": 'breadability/{version} ({url})'.format(
url="https://github.com/bookieio/breadability",
version=__version__
)
}
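
With ``__version__`` equal to, say, 0.1.17, this renders as ``breadability/0.1.17 (https://github.com/bookieio/breadability)``.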
@ -47,7 +50,7 @@ def parse_args():
def main():
args = parse_args()
logger = logging.getLogger("readability")
logger = logging.getLogger("breadability")
if args["--verbose"]:
logger.setLevel(logging.DEBUG)

@ -1,12 +1,12 @@
# -*- coding: utf8 -*-
"""
Helper to generate a new set of article test files for readability.
Helper to generate a new set of article test files for breadability.
Usage:
readability_test --name <name> <url>
readability_test --version
readability_test --help
breadability_test --name <name> <url>
breadability_test --version
breadability_test --help
Arguments:
<url> The url of content to fetch for the article.html
@ -39,7 +39,7 @@ from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from os.path import join, dirname
from readability.readable import Article
from breadability.readable import Article
from ...compat import unittest

@ -6,6 +6,9 @@ from __future__ import division, print_function, unicode_literals
import re
MULTIPLE_WHITESPACE_PATTERN = re.compile(r"\s+", re.UNICODE)
def is_blank(text):
"""
Returns ``True`` if string contains only whitespace characters
@ -18,7 +21,6 @@ def shrink_text(text):
return normalize_whitespace(text.strip())
MULTIPLE_WHITESPACE_PATTERN = re.compile(r"\s+", re.UNICODE)
def normalize_whitespace(text):
"""
Translates multiple whitespace into single space character.

@ -1,7 +0,0 @@
# -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
__version__ = "0.1.11"

@ -1,5 +1,8 @@
docopt>=0.6.1,<0.7
charade
lxml
coverage
docopt>=0.6.1,<0.7
lxml
nose
nose-selecttests
pep8
pylint

@ -1,4 +1,4 @@
[nosetests]
with-coverage=1
cover-package=readability
cover-package=breadability
cover-erase=1

@ -2,8 +2,8 @@ import sys
from os.path import abspath, dirname, join
from setuptools import setup, find_packages
from readability import __version__
VERSION = "0.1.17"
VERSION_SUFFIX = "%d.%d" % sys.version_info[:2]
CURRENT_DIRECTORY = abspath(dirname(__file__))
@ -28,24 +28,34 @@ tests_require = [
if sys.version_info < (2, 7):
install_requires.append("unittest2")
console_script_targets = [
"breadability = breadability.scripts.client:main",
"breadability-{0} = breadability.scripts.client:main",
"breadability_test = breadability.scripts.test_helper:main",
"breadability_test-{0} = breadability.scripts.test_helper:main",
]
console_script_targets = [
target.format(VERSION_SUFFIX) for target in console_script_targets
]
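
After formatting with ``VERSION_SUFFIX`` ("%d.%d" of the running interpreter), a Python 3.3 install therefore exposes ``breadability``, ``breadability-3.3``, ``breadability_test``, and ``breadability_test-3.3`` entry points.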
setup(
name="readability",
version=__version__,
name="breadability",
version=VERSION,
description="Port of Readability HTML parser in Python",
long_description=long_description,
keywords=[
"bookie",
"breadability",
"content",
"HTML",
"parsing",
"readability",
"readable",
"parsing",
"HTML",
"content",
],
author="Rick Harding",
author_email="rharding@mitechie.com",
maintainer="Michal Belica",
maintainer_email="miso.belica@gmail.com",
url="https://github.com/miso-belica/readability.py",
url="https://github.com/bookieio/breadability",
license="BSD",
classifiers=[
"Development Status :: 5 - Production/Stable",
@ -64,7 +74,6 @@ setup(
"Topic :: Software Development :: Pre-processors",
"Topic :: Text Processing :: Filters",
"Topic :: Text Processing :: Markup :: HTML",
],
packages=find_packages(),
include_package_data=True,
@ -73,11 +82,6 @@ setup(
tests_require=tests_require,
test_suite="tests.run_tests.run",
entry_points={
"console_scripts": [
"readability = readability.scripts.client:main",
"readability-%s = readability.scripts.client:main" % VERSION_SUFFIX,
"readability_test = readability.scripts.test_helper:main",
"readability_test-%s = readability.scripts.test_helper:main" % VERSION_SUFFIX,
]
"console_scripts": console_script_targets,
}
)

@ -12,7 +12,7 @@ from os.path import dirname, abspath
DEFAULT_PARAMS = [
"nosetests",
"--with-coverage",
"--cover-package=readability",
"--cover-package=breadability",
"--cover-erase",
]

@ -1,11 +1,15 @@
# -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from __future__ import (
absolute_import,
division,
print_function,
unicode_literals
)
from lxml.html import fragment_fromstring, document_fromstring
from readability.readable import Article
from readability.annotated_text import AnnotatedTextHandler
from breadability.readable import Article
from breadability.annotated_text import AnnotatedTextHandler
from .compat import unittest
from .utils import load_snippet, load_article

@ -5,7 +5,7 @@ from __future__ import division, print_function, unicode_literals
import os
from readability.readable import Article
from breadability.readable import Article
from ...compat import unittest

File diff suppressed because it is too large.

@ -0,0 +1,33 @@
import os

try:
    # Python < 2.7
    import unittest2 as unittest
except ImportError:
    import unittest

from breadability.readable import Article


class TestBusinessInsiderArticle(unittest.TestCase):
    """Test the scoring and parsing of the Blog Post"""

    def setUp(self):
        """Load up the article for us"""
        article_path = os.path.join(os.path.dirname(__file__), 'article.html')
        self.article = open(article_path).read()

    def tearDown(self):
        """Drop the article"""
        self.article = None

    def test_parses(self):
        """Verify we can parse the document."""
        doc = Article(self.article)
        self.assertTrue('id="readabilityBody"' in doc.readable)

    def test_images_preserved(self):
        """Verify the post's inline images survive the cleaning."""
        doc = Article(self.article)
        self.assertTrue('bharath-kumar-a-co-founder-at-pugmarksme-suggests-working-on-a-sunday-late-night.jpg' in doc.readable)
        self.assertTrue('bryan-guido-hassin-a-university-professor-and-startup-junkie-uses-airplane-days.jpg' in doc.readable)

@ -4,7 +4,7 @@ from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from os.path import join, dirname
from readability.readable import Article
from breadability.readable import Article
from ...compat import unittest

@ -4,8 +4,8 @@ from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from os.path import join, dirname
from readability.readable import Article
from readability._compat import unicode
from breadability.readable import Article
from breadability._compat import unicode
from ...compat import unittest

@ -1,14 +1,18 @@
# -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from __future__ import (
absolute_import,
division,
print_function,
unicode_literals
)
import os
from operator import attrgetter
from readability.readable import Article
from readability.readable import check_siblings
from readability.readable import prep_article
from breadability.readable import Article
from breadability.readable import check_siblings
from breadability.readable import prep_article
from ...compat import unittest
@ -57,7 +61,8 @@ class TestArticle(unittest.TestCase):
for node in doc._should_drop:
self.assertFalse(node == found.node)
by_score = sorted([c for c in doc.candidates.values()],
by_score = sorted(
[c for c in doc.candidates.values()],
key=attrgetter('content_score'), reverse=True)
self.assertTrue(by_score[0].node == found.node)

@ -4,11 +4,11 @@ from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from os.path import join, dirname
from readability.readable import Article
from breadability.readable import Article
from ...compat import unittest
class TestArticle(unittest.TestCase):
class TestSweetsharkBlog(unittest.TestCase):
"""
Test the scoring and parsing of the article from URL below:
http://sweetshark.livejournal.com/11564.html

@ -4,9 +4,12 @@ from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from collections import defaultdict
from readability._compat import to_unicode, to_bytes
from readability.document import (OriginalDocument, determine_encoding,
convert_breaks_to_paragraphs)
from breadability._compat import to_unicode, to_bytes
from breadability.document import (
convert_breaks_to_paragraphs,
determine_encoding,
OriginalDocument,
)
from .compat import unittest
from .utils import load_snippet
@ -18,14 +21,16 @@ class TestOriginalDocument(unittest.TestCase):
returned = convert_breaks_to_paragraphs(
"<div>HI<br><br>How are you?<br><br> \t \n <br>Fine\n I guess</div>")
self.assertEqual(returned,
self.assertEqual(
returned,
"<div>HI</p><p>How are you?</p><p>Fine\n I guess</div>")
def test_convert_hr_tags_to_paragraphs(self):
returned = convert_breaks_to_paragraphs(
"<div>HI<br><br>How are you?<hr/> \t \n <br>Fine\n I guess</div>")
self.assertEqual(returned,
self.assertEqual(
returned,
"<div>HI</p><p>How are you?</p><p>Fine\n I guess</div>")
def test_readin_min_document(self):
@ -79,7 +84,7 @@ class TestOriginalDocument(unittest.TestCase):
def test_encoding(self):
text = "ľščťžýáíéäúňôůě".encode("iso-8859-2")
encoding = determine_encoding(text)
determine_encoding(text)
def test_encoding_short(self):
text = "ľščťžýáíé".encode("iso-8859-2")

@ -6,14 +6,16 @@ from __future__ import division, print_function, unicode_literals
from lxml.etree import tounicode
from lxml.html import document_fromstring
from lxml.html import fragment_fromstring
from readability._compat import to_unicode
from readability.readable import Article
from readability.readable import get_class_weight
from readability.readable import get_link_density
from readability.readable import is_bad_link
from readability.readable import score_candidates
from readability.readable import leaf_div_elements_into_paragraphs
from readability.scoring import ScoredNode
from breadability._compat import to_unicode
from breadability.readable import (
Article,
get_class_weight,
get_link_density,
is_bad_link,
leaf_div_elements_into_paragraphs,
score_candidates,
)
from breadability.scoring import ScoredNode
from .compat import unittest
from .utils import load_snippet, load_article
@ -27,6 +29,14 @@ class TestReadableDocument(unittest.TestCase):
# We get back the document as a div tag currently by default.
self.assertEqual(doc.readable_dom.tag, 'div')
def test_title_loads(self):
"""Verify we can fetch the title of the parsed article"""
doc = Article(load_snippet('document_min.html'))
self.assertEqual(
doc._original_document.title,
'Min Document Title'
)
def test_doc_no_scripts_styles(self):
"""Step #1 remove all scripts from the document"""
doc = Article(load_snippet('document_scripts.html'))
@ -80,10 +90,11 @@ class TestCleaning(unittest.TestCase):
"""Verify we wipe out things from our unlikely list."""
doc = Article(load_snippet('test_readable_unlikely.html'))
readable = doc.readable_dom
must_not_appear = ['comment', 'community', 'disqus', 'extra', 'foot',
'header', 'menu', 'remark', 'rss', 'shoutbox', 'sidebar',
'sponsor', 'ad-break', 'agegate', 'pagination' '', 'pager',
'popup', 'tweet', 'twitter', 'imgBlogpostPermalink']
must_not_appear = [
'comment', 'community', 'disqus', 'extra', 'foot',
'header', 'menu', 'remark', 'rss', 'shoutbox', 'sidebar',
'sponsor', 'ad-break', 'agegate', 'pagination', 'pager',
'popup', 'tweet', 'twitter', 'imgBlogpostPermalink']
want_to_appear = ['and', 'article', 'body', 'column', 'main', 'shadow']
@ -128,17 +139,24 @@ class TestCleaning(unittest.TestCase):
self.assertEqual(
tounicode(
leaf_div_elements_into_paragraphs(test_doc2)),
to_unicode('<html><body><p>simple<a href="">link</a></p></body></html>')
to_unicode(
'<html><body><p>simple<a href="">link</a></p></body></html>')
)
def test_dont_transform_div_with_div(self):
"""Verify that only child <div> element is replaced by <p>."""
dom = document_fromstring(
"<html><body><div>text<div>child</div>aftertext</div></body></html>")
"<html><body><div>text<div>child</div>"
"aftertext</div></body></html>"
)
self.assertEqual(
tounicode(leaf_div_elements_into_paragraphs(dom)),
to_unicode("<html><body><div>text<p>child</p>aftertext</div></body></html>")
tounicode(
leaf_div_elements_into_paragraphs(dom)),
to_unicode(
"<html><body><div>text<p>child</p>"
"aftertext</div></body></html>"
)
)
def test_bad_links(self):

@ -8,14 +8,18 @@ import re
from operator import attrgetter
from lxml.html import document_fromstring
from lxml.html import fragment_fromstring
from readability.readable import Article
from readability.scoring import check_node_attributes
from readability.scoring import get_class_weight
from readability.scoring import ScoredNode
from readability.scoring import score_candidates
from readability.scoring import generate_hash_id
from readability.readable import get_link_density
from readability.readable import is_unlikely_node
from breadability.readable import Article
from breadability.scoring import (
check_node_attributes,
generate_hash_id,
get_class_weight,
score_candidates,
ScoredNode,
)
from breadability.readable import (
get_link_density,
is_unlikely_node,
)
from .compat import unittest
from .utils import load_snippet
@ -60,7 +64,8 @@ class TestCheckNodeAttr(unittest.TestCase):
test_node = fragment_fromstring('<div/>')
test_node.set('class', 'test2 comment')
self.assertTrue(check_node_attributes(test_pattern, test_node, 'class'))
self.assertTrue(
check_node_attributes(test_pattern, test_node, 'class'))
def test_has_id(self):
"""Verify that a node has an id in our set."""
@ -75,7 +80,8 @@ class TestCheckNodeAttr(unittest.TestCase):
test_pattern = re.compile('test1|test2', re.I)
test_node = fragment_fromstring('<div/>')
test_node.set('class', 'test4 comment')
self.assertFalse(check_node_attributes(test_pattern, test_node, 'class'))
self.assertFalse(
check_node_attributes(test_pattern, test_node, 'class'))
def test_lacks_id(self):
"""Verify that a node does not have an id in our set."""
@ -266,7 +272,8 @@ class TestScoreCandidates(unittest.TestCase):
div_nodes = dom.findall(".//div")
candidates = score_candidates(div_nodes)
ordered = sorted((c for c in candidates.values()), reverse=True,
ordered = sorted(
(c for c in candidates.values()), reverse=True,
key=attrgetter("content_score"))
self.assertEqual(ordered[0].node.tag, "div")
