Sync up with the fork

pull/23/head
Richard Harding 11 years ago
commit d6317cd2ce

1
.gitignore vendored

@ -1,5 +1,6 @@
*.pyc
*.prof
.coverage
.installed.cfg
bin

@ -1,9 +1,11 @@
language: python
python:
- "2.7"
- "2.6"
- "2.7"
- "3.2"
- "3.3"
before_install: sudo apt-get install libxml2-dev libxslt-dev
# command to install dependencies
install: pip install -r requirements.txt --use-mirrors
# command to run tests
script: python setup.py install && nosetests src/breadability/tests
script: python setup.py install && nosetests tests

@ -0,0 +1,3 @@
Rick Harding (original author)
Michal Belica (current maintainer)
nhnifong

@ -0,0 +1,71 @@
.. :changelog:
Changelog for readability
==========================
- Added property ``Article.main_text`` for getting text annotated with
semantic HTML tags (<em>, <strong>, ...).
- Join node with 1 child of the same type. From
``<div><div>...</div></div>`` we get ``<div>...</div>``.
- Don't change <div> to <p> if it contains <p> elements.
- Renamed test generation helper 'readability_newtest' -> 'readability_test'.
- Renamed package to readability.
- Added support for Python >= 3.2.
- Py3k compatible package 'charade' is used instead of 'chardet'.
0.1.11 (Dec 12th 2012)
-----------------------
- Add argparse to the install requires for python < 2.7
0.1.10 (Sept 13th 2012)
-----------------------
- Updated scoring bonus and penalty with , and " characters.
0.1.9 (Aug 27nd 2012)
----------------------
- In case of an issue dealing with candidates we need to act like we didn't
find any candidates for the article content. #10
0.1.8 (Aug 27nd 2012)
----------------------
- Add code/tests for an empty document.
- Fixes #9 to handle xml parsing issues.
0.1.7 (July 21nd 2012)
----------------------
- Change the encode 'replace' kwarg into a normal arg for older python
version.
0.1.6 (June 17th 2012)
----------------------
- Fix the link removal, add tests and a place to process other bad links.
0.1.5 (June 16th 2012)
----------------------
- Start to look at removing bad links from content in the conditional cleaning
state. This was really used for the scripting.com site's garbage.
0.1.4 (June 16th 2012)
----------------------
- Add a test generation helper readability_newtest script.
- Add tests and fixes for the scripting news parse failure.
0.1.3 (June 15th 2012)
----------------------
- Add actual testing of full articles for regression tests.
- Update parser to properly clean after winner doc node is chosen.
0.1.2 (May 28th 2012)
----------------------
- Bugfix: #4 issue with logic of the 100char bonus points in scoring
- Garden with PyLint/PEP8
- Add a bunch of tests to readable/scoring code.
0.1.1 (May 11th 2012)
---------------------
- Fix bugs in scoring to help in getting right content
- Add concept of -d which shows scoring/decisions on nodes
- Update command line client to be able to pipe output to other tools
0.1.0 (May 6th 2012)
--------------------
- Initial release and upload to PyPi

@ -0,0 +1,10 @@
Copyright (c) 2013 Rick Harding, Michal Belica and contributors
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

@ -1,2 +1,3 @@
include README.rst
include NEWS.txt
include CHANGELOG.rst
include LICENSE.rst

@ -1,61 +0,0 @@
# Makefile to help automate tasks
WD := $(shell pwd)
PY := bin/python
PIP := bin/pip
PEP8 := bin/pep8
NOSE := bin/nosetests
# ###########
# Tests rule!
# ###########
.PHONY: test
test: venv develop $(NOSE)
$(NOSE) --with-id -s src/breadability/tests
$(NOSE):
$(PIP) install nose pep8 pylint coverage
# #######
# INSTALL
# #######
.PHONY: all
all: venv develop
venv: bin/python
bin/python:
virtualenv .
.PHONY: clean_venv
clean_venv:
rm -rf bin include lib local man share
.PHONY: develop
develop: lib/python*/site-packages/readability_lxml.egg-link
lib/python*/site-packages/readability_lxml.egg-link:
$(PY) setup.py develop
# ###########
# Development
# ###########
.PHONY: clean_all
clean_all: clean_venv
if [ -d dist ]; then \
rm -r dist; \
fi
# ###########
# Deploy
# ###########
.PHONY: dist
dist:
$(PY) setup.py sdist
.PHONY: upload
upload:
$(PY) setup.py sdist upload
.PHONY: version_update
version_update:
$(EDITOR) setup.py src/breadability/__init__.py NEWS.txt

@ -1,5 +1,8 @@
breadability - another readability Python port
===============================================
Readability.py - another readability Python port
==============================================
.. image:: https://api.travis-ci.org/miso-belica/readability.py.png?branch=master
:target: https://travis-ci.org/miso-belica/readability.py
I've tried to work with the various forks of some ancient codebase that ported
`readability`_ to Python. The lack of tests, unused regex's, and commented out
sections of code in other Python ports just drove me nuts.
@ -14,51 +17,79 @@ but oh well I did try)
This is a pretty straight port of the JS here:
- http://code.google.com/p/arc90labs-readability/source/browse/trunk/js/readability.js#82
- http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
Some other ports:
- https://github.com/aidanf/BTE
- http://www.unixuser.org/~euske/python/webstemmer/#extract
- https://github.com/al3xandru/readability.py
- https://github.com/rcarmo/soup-strainer
- https://github.com/bcampbell/decruft
- https://github.com/gfxmonk/python-readability
- https://github.com/srid/readability
- https://github.com/dcramer/decruft
- https://github.com/reorx/readability
- https://github.com/mote/python-readability
- https://github.com/predatell/python-readability-lxml
- https://github.com/Harshavardhana/boilerpipy
- https://github.com/raptium/hitomi
- https://github.com/kingwkb/readability
Installation
-------------
------------
This does depend on lxml so you'll need some C headers in order to install
things from pip so that it can compile.
::
.. code-block:: bash
sudo apt-get install libxml2-dev libxslt-dev
pip install breadability
$ [sudo] apt-get install libxml2-dev libxslt-dev
$ [sudo] pip install git+git://github.com/miso-belica/readability.py.git
Tests
-----
.. code-block:: bash
Usage
------
$ nosetests --with-coverage --cover-package=readability --cover-erase tests
$ nosetests-3.3 --with-coverage --cover-package=readability --cover-erase tests
cmd line
~~~~~~~~~
::
Usage
-----
Command line
~~~~~~~~~~~~
.. code-block:: bash
$ breadability http://wiki.python.org/moin/BeginnersGuide
$ readability http://wiki.python.org/moin/BeginnersGuide
Options
``````````
```````
- **b** will write out the parsed content to a temp file and open it in a
browser for viewing.
- **d** will write out debug scoring statements to help track why a node was
chosen as the document and why some nodes were removed from the final
product.
- **f** will override the default behaviour of getting an html fragment (<div>)
and give you back a full <html> document.
- **v** will output in verbose debug mode and help let you know why it parsed
how it did.
- b will write out the parsed content to a temp file and open it in a
browser for viewing.
- d will write out debug scoring statements to help track why a node was
chosen as the document and why some nodes were removed from the final
product.
- f will override the default behaviour of getting an html fragment (<div>)
and give you back a full <html> document.
- v will output in verbose debug mode and help let you know why it parsed
how it did.
Python API
~~~~~~~~~~
.. code-block:: python
Using from Python
~~~~~~~~~~~~~~~~~~
from __future__ import print_function
::
from readability.readable import Article
from breadability.readable import Article
doc = Article(html_text, url=url_came_from)
print doc.readable
if __name__ == "__main__":
document = Article(html_as_text, url=source_url)
print(document.readable)
Work to be done
@ -76,39 +107,26 @@ Fortunately, I need this library for my tools:
so I really need this to be an active and improving project.
Off the top of my heads todo list:
- Support metadata from parsed article [url, confidence scores, all
candidates we thought about?]
- More tests, more thorough tests
- More sample articles we need to test against in the test_articles
- Tests that run through and check for regressions of the test_articles
- Tidy'ing the HTML that comes out, might help with regression tests ^^
- Multiple page articles
- Performance tuning, we do a lot of looping and re-drop some nodes that
should be skipped. We should have a set of regression tests for this so
that if we implement a change that blows up performance we know it right
away.
- More docs for things, but sphinx docs and in code comments to help
understand wtf we're doing and why. That's the biggest hurdle to some of
this stuff.
Helping out
------------
If you want to help, shoot me a pull request, an issue report with broken
urls, etc.
You can ping me on irc, I'm always in the `#bookie` channel in freenode.
Important Links
----------------
Off the top of my heads TODO list:
- `Builds`_ are done on `TravisCI`_
- Support metadata from parsed article [url, confidence scores, all
candidates we thought about?]
- More tests, more thorough tests
- More sample articles we need to test against in the test_articles
- Tests that run through and check for regressions of the test_articles
- Tidy'ing the HTML that comes out, might help with regression tests ^^
- Multiple page articles
- Performance tuning, we do a lot of looping and re-drop some nodes that
should be skipped. We should have a set of regression tests for this so
that if we implement a change that blows up performance we know it right
away.
- More docs for things, but sphinx docs and in code comments to help
understand wtf we're doing and why. That's the biggest hurdle to some of
this stuff.
Inspiration
~~~~~~~~~~~~
~~~~~~~~~~~
- `python-readability`_
- `decruft`_
@ -117,7 +135,6 @@ Inspiration
.. _readability: http://code.google.com/p/arc90labs-readability/
.. _Builds: http://travis-ci.org/#!/mitechie/breadability
.. _TravisCI: http://travis-ci.org/
.. _decruft: https://github.com/dcramer/decruft
.. _python-readability: https://github.com/buriy/python-readability

@ -0,0 +1,7 @@
# -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
__version__ = "0.1.14"

@ -0,0 +1,101 @@
# -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from sys import version_info
PY3 = version_info[0] == 3
if PY3:
bytes = bytes
unicode = str
else:
bytes = str
unicode = unicode
string_types = (bytes, unicode,)
try:
import urllib2 as urllib
except ImportError:
import urllib.request as urllib
def unicode_compatible(cls):
"""
Decorator for unicode compatible classes. Method ``__unicode__``
has to be implemented to work decorator as expected.
"""
if PY3:
cls.__str__ = cls.__unicode__
cls.__bytes__ = lambda self: self.__str__().encode("utf8")
else:
cls.__str__ = lambda self: self.__unicode__().encode("utf8")
return cls
def to_string(object):
return to_unicode(object) if PY3 else to_bytes(object)
def to_bytes(object):
try:
if isinstance(object, bytes):
return object
elif isinstance(object, unicode):
return object.encode("utf8")
else:
# try encode instance to bytes
return instance_to_bytes(object)
except UnicodeError:
# recover from codec error and use 'repr' function
return to_bytes(repr(object))
def to_unicode(object):
try:
if isinstance(object, unicode):
return object
elif isinstance(object, bytes):
return object.decode("utf8")
else:
# try decode instance to unicode
return instance_to_unicode(object)
except UnicodeError:
# recover from codec error and use 'repr' function
return to_unicode(repr(object))
def instance_to_bytes(instance):
if PY3:
if hasattr(instance, "__bytes__"):
return bytes(instance)
elif hasattr(instance, "__str__"):
return unicode(instance).encode("utf8")
else:
if hasattr(instance, "__str__"):
return bytes(instance)
elif hasattr(instance, "__unicode__"):
return unicode(instance).encode("utf8")
return to_bytes(repr(instance))
def instance_to_unicode(instance):
if PY3:
if hasattr(instance, "__str__"):
return unicode(instance)
elif hasattr(instance, "__bytes__"):
return bytes(instance).decode("utf8")
else:
if hasattr(instance, "__unicode__"):
return unicode(instance)
elif hasattr(instance, "__str__"):
return bytes(instance).decode("utf8")
return to_unicode(repr(instance))

@ -0,0 +1,89 @@
# -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from itertools import groupby
from lxml.sax import saxify, ContentHandler
from .utils import is_blank, shrink_text
from ._compat import to_unicode
_SEMANTIC_TAGS = frozenset((
"a", "abbr", "acronym", "b", "big", "blink", "blockquote", "cite", "code",
"dd", "del", "dfn", "dir", "dl", "dt", "em", "h", "h1", "h2", "h3", "h4",
"h5", "h6", "i", "ins", "kbd", "li", "marquee", "menu", "ol", "pre", "q",
"s", "samp", "strike", "strong", "sub", "sup", "tt", "u", "ul", "var",
))
class AnnotatedTextHandler(ContentHandler):
"""A class for converting a HTML DOM into annotated text."""
@classmethod
def parse(cls, dom):
"""Converts DOM into paragraphs."""
handler = cls()
saxify(dom, handler)
return handler.content
def __init__(self):
self._content = []
self._paragraph = []
self._dom_path = []
@property
def content(self):
return self._content
def startElementNS(self, name, qname, attrs):
namespace, name = name
if name in _SEMANTIC_TAGS:
self._dom_path.append(to_unicode(name))
def endElementNS(self, name, qname):
namespace, name = name
if name == "p" and self._paragraph:
self._append_paragraph(self._paragraph)
elif name in ("ol", "ul", "pre") and self._paragraph:
self._append_paragraph(self._paragraph)
self._dom_path.pop()
elif name in _SEMANTIC_TAGS:
self._dom_path.pop()
def endDocument(self):
if self._paragraph:
self._append_paragraph(self._paragraph)
def _append_paragraph(self, paragraph):
paragraph = self._process_paragraph(paragraph)
self._content.append(paragraph)
self._paragraph = []
def _process_paragraph(self, paragraph):
current_paragraph = []
for annotation, items in groupby(paragraph, key=lambda i: i[1]):
if annotation and "li" in annotation:
for text, _ in items:
text = shrink_text(text)
current_paragraph.append((text, annotation))
else:
text = "".join(i[0] for i in items)
text = shrink_text(text)
current_paragraph.append((text, annotation))
return tuple(current_paragraph)
def characters(self, content):
if is_blank(content):
return
if self._dom_path:
pair = (content, tuple(sorted(frozenset(self._dom_path))))
else:
pair = (content, None)
self._paragraph.append(pair)

@ -0,0 +1,130 @@
# -*- coding: utf8 -*-
"""Generate a clean nice starting html document to process for an article."""
from __future__ import absolute_import
import re
import logging
import charade
from lxml.etree import tostring, tounicode, XMLSyntaxError
from lxml.html import document_fromstring, HTMLParser
from ._compat import unicode, to_bytes, to_unicode, unicode_compatible
from .utils import cached_property
logger = logging.getLogger("readability")
TAG_MARK_PATTERN = re.compile(to_bytes(r"</?[^>]*>\s*"))
def determine_encoding(page):
encoding = "utf8"
text = TAG_MARK_PATTERN.sub(to_bytes(" "), page)
# don't venture to guess
if not text.strip() or len(text) < 10:
return encoding
# try enforce UTF-8
diff = text.decode(encoding, "ignore").encode(encoding)
sizes = len(diff), len(text)
# 99% of UTF-8
if abs(len(text) - len(diff)) < max(sizes) * 0.01:
return encoding
# try detect encoding
encoding_detector = charade.detect(text)
if encoding_detector["encoding"]:
encoding = encoding_detector["encoding"]
return encoding
BREAK_TAGS_PATTERN = re.compile(to_unicode(r"(?:<\s*[bh]r[^>]*>\s*)+"), re.IGNORECASE)
def convert_breaks_to_paragraphs(html):
"""
Converts <hr> tag and multiple <br> tags into paragraph.
"""
logger.debug("Converting multiple <br> & <hr> tags into <p>.")
return BREAK_TAGS_PATTERN.sub(_replace_break_tags, html)
def _replace_break_tags(match):
tags = match.group()
if to_unicode("<hr") in tags:
return to_unicode("</p><p>")
elif tags.count(to_unicode("<br")) > 1:
return to_unicode("</p><p>")
else:
return tags
UTF8_PARSER = HTMLParser(encoding="utf8")
def build_document(html_content, base_href=None):
"""Requires that the `html_content` not be None"""
assert html_content is not None
if isinstance(html_content, unicode):
html_content = html_content.encode("utf8", "replace")
try:
document = document_fromstring(html_content, parser=UTF8_PARSER)
except XMLSyntaxError:
raise ValueError("Failed to parse document contents.")
if base_href:
document.make_links_absolute(base_href, resolve_base_href=True)
else:
document.resolve_base_href()
return document
@unicode_compatible
class OriginalDocument(object):
"""The original document to process."""
def __init__(self, html, url=None):
self._html = html
self._url = url
@property
def url(self):
"""Source URL of HTML document."""
return self._url
def __unicode__(self):
"""Renders the document as a string."""
return tounicode(self.dom)
@cached_property
def dom(self):
"""Parsed HTML document from the input."""
html = self._html
if not isinstance(html, unicode):
encoding = determine_encoding(html)
html = html.decode(encoding)
html = convert_breaks_to_paragraphs(html)
document = build_document(html, self._url)
return document
@cached_property
def links(self):
"""Links within the document."""
return self.dom.findall(".//a")
@cached_property
def title(self):
"""Title attribute of the parsed document."""
title_element = self.dom.find(".//title")
if title_element is None or title_element.text is None:
return ""
else:
return title_element.text.strip()

@ -0,0 +1,460 @@
# -*- coding: utf8 -*-
from __future__ import absolute_import
import re
import logging
from copy import deepcopy
from operator import attrgetter
from pprint import PrettyPrinter
from lxml.html.clean import Cleaner
from lxml.etree import tounicode, tostring
from lxml.html import fragment_fromstring, fromstring
from .document import OriginalDocument
from .annotated_text import AnnotatedTextHandler
from .scoring import (score_candidates, get_link_density, get_class_weight,
is_unlikely_node)
from .utils import cached_property, shrink_text
html_cleaner = Cleaner(scripts=True, javascript=True, comments=True,
style=True, links=True, meta=False, add_nofollow=False,
page_structure=False, processing_instructions=True,
embedded=False, frames=False, forms=False,
annoying_tags=False, remove_tags=None, kill_tags=("noscript", "iframe"),
remove_unknown_tags=False, safe_attrs_only=False)
SCORABLE_TAGS = ("div", "p", "td", "pre", "article")
ANNOTATION_TAGS = (
"a", "abbr", "acronym", "b", "big", "blink", "blockquote", "br", "cite",
"code", "dd", "del", "dir", "dl", "dt", "em", "font", "h", "h1", "h2",
"h3", "h4", "h5", "h6", "hr", "i", "ins", "kbd", "li", "marquee", "menu",
"ol", "p", "pre", "q", "s", "samp", "span", "strike", "strong", "sub",
"sup", "tt", "u", "ul", "var",
)
NULL_DOCUMENT = """
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
</head>
<body>
</body>
</html>
"""
logger = logging.getLogger("readability")
def ok_embedded_video(node):
"""Check if this embed/video is an ok one to count."""
good_keywords = ('youtube', 'blip.tv', 'vimeo')
node_str = tounicode(node)
for key in good_keywords:
if key in node_str:
return True
return False
def build_base_document(dom, return_fragment=True):
"""
Builds a base document with the body as root.
:param dom: Parsed lxml tree (Document Object Model).
:param bool return_fragment: If True only <div> fragment is returned.
Otherwise full HTML document is returned.
"""
body_element = dom.find(".//body")
if body_element is None:
fragment = fragment_fromstring('<div id="readabilityBody"/>')
fragment.append(dom)
else:
body_element.tag = "div"
body_element.set("id", "readabilityBody")
fragment = body_element
return document_from_fragment(fragment, return_fragment)
def build_error_document(dom, return_fragment=True):
"""
Builds an empty erorr document with the body as root.
:param bool return_fragment: If True only <div> fragment is returned.
Otherwise full HTML document is returned.
"""
fragment = fragment_fromstring(
'<div id="readabilityBody" class="parsing-error"/>')
return document_from_fragment(fragment, return_fragment)
def document_from_fragment(fragment, return_fragment):
if return_fragment:
document = fragment
else:
document = fromstring(NULL_DOCUMENT)
body_element = document.find(".//body")
body_element.append(fragment)
document.doctype = "<!DOCTYPE html>"
return document
def check_siblings(candidate_node, candidate_list):
"""
Looks through siblings for content that might also be related.
Things like preambles, content split by ads that we removed, etc.
"""
candidate_css = candidate_node.node.get("class")
potential_target = candidate_node.content_score * 0.2
sibling_target_score = potential_target if potential_target > 10 else 10
parent = candidate_node.node.getparent()
siblings = parent.getchildren() if parent is not None else []
for sibling in siblings:
append = False
content_bonus = 0
if sibling is candidate_node.node:
append = True
# Give a bonus if sibling nodes and top candidates have the example
# same class name
if candidate_css and sibling.get("class") == candidate_css:
content_bonus += candidate_node.content_score * 0.2
if sibling in candidate_list:
adjusted_score = candidate_list[sibling].content_score + content_bonus
if adjusted_score >= sibling_target_score:
append = True
if sibling.tag == "p":
link_density = get_link_density(sibling)
content = sibling.text_content()
content_length = len(content)
if content_length > 80 and link_density < 0.25:
append = True
elif content_length < 80 and link_density == 0:
if ". " in content:
append = True
if append:
logger.debug("Sibling appended: %s %r", sibling.tag, sibling.attrib)
if sibling.tag not in ("div", "p"):
# We have a node that isn't a common block level element, like
# a form or td tag. Turn it into a div so it doesn't get
# filtered out later by accident.
sibling.tag = "div"
candidate_node.node.append(sibling)
return candidate_node
def clean_document(node):
"""Cleans up the final document we return as the readable article."""
if node is None or len(node) == 0:
return None
logger.debug("Cleaning document.")
to_drop = []
for n in node.iter():
logger.debug("Cleaning node: %s %r", n.tag, n.attrib)
# clean out any in-line style properties
if "style" in n.attrib:
n.set("style", "")
# remove embended objects unless it's wanted video
if n.tag in ("object", "embed") and not ok_embedded_video(n):
logger.debug("Dropping node %s %r", n.tag, n.attrib)
to_drop.append(n)
# clean headings with bad css or high link density
if n.tag in ("h1", "h2", "h3", "h4") and get_class_weight(n) < 0:
logger.debug("Dropping <%s>, it's insignificant", n.tag)
to_drop.append(n)
if n.tag in ("h3", "h4") and get_link_density(n) > 0.33:
logger.debug("Dropping <%s>, it's insignificant", n.tag)
to_drop.append(n)
# drop block element without content and children
if n.tag in ("div", "p"):
text_content = shrink_text(n.text_content())
if len(text_content) < 5 and not n.getchildren():
logger.debug("Dropping %s %r without content.", n.tag, n.attrib)
to_drop.append(n)
# finally try out the conditional cleaning of the target node
if clean_conditionally(n):
to_drop.append(n)
drop_nodes_with_parents(to_drop)
return node
def drop_nodes_with_parents(nodes):
for node in nodes:
if node.getparent() is not None:
logger.debug("Droping node with parent %s %r", node.tag, node.attrib)
node.drop_tree()
def clean_conditionally(node):
"""Remove the clean_el if it looks like bad content based on rules."""
logger.debug('Cleaning conditionally node: %s %r', node.tag, node.attrib)
if node.tag not in ('form', 'table', 'ul', 'div', 'p'):
# this is not the tag you're looking for
logger.debug('Node cleared: %s %r', node.tag, node.attrib)
return
weight = get_class_weight(node)
# content_score = LOOK up the content score for this node we found
# before else default to 0
content_score = 0
if weight + content_score < 0:
logger.debug('Dropping conditional node')
logger.debug('Weight + score < 0')
return True
commas_count = node.text_content().count(',')
if commas_count < 10:
logger.debug("There are %d commas so we're processing more.", commas_count)
# If there are not very many commas, and the number of
# non-paragraph elements is more than paragraphs or other ominous
# signs, remove the element.
p = len(node.findall('.//p'))
img = len(node.findall('.//img'))
li = len(node.findall('.//li')) - 100
inputs = len(node.findall('.//input'))
embed = 0
embeds = node.findall('.//embed')
for e in embeds:
if ok_embedded_video(e):
embed += 1
link_density = get_link_density(node)
content_length = len(node.text_content())
remove_node = False
if li > p and node.tag != 'ul' and node.tag != 'ol':
logger.debug('Conditional drop: li > p and not ul/ol')
remove_node = True
elif inputs > p / 3.0:
logger.debug('Conditional drop: inputs > p/3.0')
remove_node = True
elif content_length < 25 and (img == 0 or img > 2):
logger.debug('Conditional drop: len < 25 and 0/>2 images')
remove_node = True
elif weight < 25 and link_density > 0.2:
logger.debug('Conditional drop: weight small and link is dense')
remove_node = True
elif weight >= 25 and link_density > 0.5:
logger.debug('Conditional drop: weight big but link heavy')
remove_node = True
elif (embed == 1 and content_length < 75) or embed > 1:
logger.debug('Conditional drop: embed w/o much content or many embed')
remove_node = True
if remove_node:
logger.debug('Node will be removed')
else:
logger.debug('Node cleared: %s %r', node.tag, node.attrib)
return remove_node
# nope, don't remove anything
logger.debug('Node Cleared final.')
return False
def prep_article(doc):
"""Once we've found our target article we want to clean it up.
Clean out:
- inline styles
- forms
- strip empty <p>
- extra tags
"""
return clean_document(doc)
def find_candidates(document):
"""
Finds cadidate nodes for the readable version of the article.
Here's we're going to remove unlikely nodes, find scores on the rest,
clean up and return the final best match.
"""
nodes_to_score = set()
should_remove = set()
for node in document.iter():
if is_unlikely_node(node):
logger.debug("We should drop unlikely: %s %r", node.tag, node.attrib)
should_remove.add(node)
elif is_bad_link(node):
logger.debug("We should drop bad link: %s %r", node.tag, node.attrib)
should_remove.add(node)
elif node.tag in SCORABLE_TAGS:
nodes_to_score.add(node)
return score_candidates(nodes_to_score), should_remove
def is_bad_link(node):
"""
Helper to determine if the node is link that is useless.
We've hit articles with many multiple links that should be cleaned out
because they're just there to pollute the space. See tests for examples.
"""
if node.tag != "a":
return False
name = node.get("name")
href = node.get("href")
if name and not href:
return True
if href:
href_parts = href.split("#")
if len(href_parts) == 2 and len(href_parts[1]) > 25:
return True
return False
class Article(object):
"""Parsed readable object"""
def __init__(self, html, url=None, return_fragment=True):
"""
Create the Article we're going to use.
:param html: The string of HTML we're going to parse.
:param url: The url so we can adjust the links to still work.
:param return_fragment: Should we return a <div> fragment or
a full <html> document.
"""
self._original_document = OriginalDocument(html, url=url)
self._return_fragment = return_fragment
def __str__(self):
return tostring(self._readable())
def __unicode__(self):
return tounicode(self._readable())
@cached_property
def dom(self):
"""Parsed lxml tree (Document Object Model) of the given html."""
try:
dom = self._original_document.dom
# cleaning doesn't return, just wipes in place
html_cleaner(dom)
return leaf_div_elements_into_paragraphs(dom)
except ValueError:
return None
@cached_property
def candidates(self):
"""Generates list of candidates from the DOM."""
dom = self.dom
if dom is None or len(dom) == 0:
return None
candidates, unlikely_candidates = find_candidates(dom)
drop_nodes_with_parents(unlikely_candidates)
return candidates
@cached_property
def main_text(self):
dom = deepcopy(self.readable_dom).get_element_by_id("readabilityBody")
return AnnotatedTextHandler.parse(dom)
@cached_property
def readable(self):
return tounicode(self.readable_dom)
@cached_property
def readable_dom(self):
return self._readable()
def _readable(self):
"""The readable parsed article"""
if not self.candidates:
logger.warning("No candidates found in document.")
return self._handle_no_candidates()
# right now we return the highest scoring candidate content
best_candidates = sorted((c for c in self.candidates.values()),
key=attrgetter("content_score"), reverse=True)
printer = PrettyPrinter(indent=2)
logger.debug(printer.pformat(best_candidates))
# since we have several candidates, check the winner's siblings
# for extra content
winner = best_candidates[0]
updated_winner = check_siblings(winner, self.candidates)
updated_winner.node = prep_article(updated_winner.node)
if updated_winner.node is not None:
dom = build_base_document(updated_winner.node, self._return_fragment)
else:
logger.warning('Had candidates but failed to find a cleaned winning DOM.')
dom = self._handle_no_candidates()
return self._remove_orphans(dom.get_element_by_id("readabilityBody"))
def _remove_orphans(self, dom):
for node in dom.iterdescendants():
if len(node) == 1 and tuple(node)[0].tag == node.tag:
node.drop_tag()
return dom
def _handle_no_candidates(self):
"""
If we fail to find a good candidate we need to find something else.
"""
# since we've not found a good candidate we're should help this
if self.dom is not None and len(self.dom):
dom = prep_article(self.dom)
dom = build_base_document(dom, self._return_fragment)
return self._remove_orphans(dom.get_element_by_id("readabilityBody"))
else:
logger.warning("No document to use.")
return build_error_document(self._return_fragment)
def leaf_div_elements_into_paragraphs(document):
"""
Turn some block elements that don't have children block level
elements into <p> elements.
Since we can't change the tree as we iterate over it, we must do this
before we process our document.
"""
for element in document.iter(tag="div"):
child_tags = tuple(n.tag for n in element.getchildren())
if "div" not in child_tags and "p" not in child_tags:
logger.debug("Changing leaf block element <%s> into <p>", element.tag)
element.tag = "p"
return document

@ -0,0 +1,251 @@
# -*- coding: utf8 -*-
"""Handle dealing with scoring nodes and content for our parsing."""
from __future__ import absolute_import
from __future__ import division, print_function
import re
import logging
from hashlib import md5
from lxml.etree import tostring
from ._compat import to_bytes
from .utils import normalize_whitespace
# A series of sets of attributes we check to help in determining if a node is
# a potential candidate or not.
CLS_UNLIKELY = re.compile(
"combx|comment|community|disqus|extra|foot|header|menu|remark|rss|shoutbox|"
"sidebar|sponsor|ad-break|agegate|pagination|pager|perma|popup|tweet|"
"twitter|social|breadcrumb",
re.IGNORECASE
)
CLS_MAYBE = re.compile(
"and|article|body|column|main|shadow|entry",
re.IGNORECASE
)
CLS_WEIGHT_POSITIVE = re.compile(
"article|body|content|entry|main|page|pagination|post|text|blog|story",
re.IGNORECASE
)
CLS_WEIGHT_NEGATIVE = re.compile(
"combx|comment|com-|contact|foot|footer|footnote|head|masthead|media|meta|"
"outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|shopping|tags|tool|"
"widget",
re.IGNORECASE
)
logger = logging.getLogger("readability")
def check_node_attributes(pattern, node, *attributes):
"""
Searches match in attributes against given pattern and if
finds the match against any of them returns True.
"""
for attribute_name in attributes:
attribute = node.get(attribute_name)
if attribute is not None and pattern.search(attribute):
return True
return False
def generate_hash_id(node):
"""
Generates a hash_id for the node in question.
:param node: lxml etree node
"""
try:
content = tostring(node)
except Exception as e:
logger.exception("Generating of hash failed")
content = to_bytes(repr(node))
hash_id = md5(content).hexdigest()
return hash_id[:8]
def get_link_density(node, node_text=None):
"""
Computes the ratio for text in given node and text in links
contained in the node. It is computed from number of
characters in the texts.
:parameter Element node:
HTML element in which links density is computed.
:parameter string node_text:
Text content of given node if it was obtained before.
:returns float:
Returns value of computed 0 <= density <= 1, where 0 means
no links and 1 means that node contains only links.
"""
if node_text is None:
node_text = node.text_content()
node_text = normalize_whitespace(node_text.strip())
text_length = len(node_text)
if text_length == 0:
return 0.0
links_length = sum(map(_get_normalized_text_length, node.findall(".//a")))
return links_length / text_length
def _get_normalized_text_length(node):
return len(normalize_whitespace(node.text_content().strip()))
def get_class_weight(node):
"""
Computes weight of element according to its class/id.
We're using sets to help efficiently check for existence of matches.
"""
weight = 0
if check_node_attributes(CLS_WEIGHT_NEGATIVE, node, "class"):
weight -= 25
if check_node_attributes(CLS_WEIGHT_POSITIVE, node, "class"):
weight += 25
if check_node_attributes(CLS_WEIGHT_NEGATIVE, node, "id"):
weight -= 25
if check_node_attributes(CLS_WEIGHT_POSITIVE, node, "id"):
weight += 25
return weight
def is_unlikely_node(node):
"""
Short helper for checking unlikely status.
If the class or id are in the unlikely list, and there's not also a
class/id in the likely list then it might need to be removed.
"""
unlikely = check_node_attributes(CLS_UNLIKELY, node, "class", "id")
maybe = check_node_attributes(CLS_MAYBE, node, "class", "id")
return bool(unlikely and not maybe and node.tag != "body")
def score_candidates(nodes):
"""Given a list of potential nodes, find some initial scores to start"""
MIN_HIT_LENTH = 25
candidates = {}
for node in nodes:
logger.debug("* Scoring candidate %s %r", node.tag, node.attrib)
# if the node has no parent it knows of
# then it ends up creating a body & html tag to parent the html fragment
parent = node.getparent()
if parent is None:
logger.debug("Skipping candidate - parent node is 'None'.")
continue
grand = parent.getparent()
if grand is None:
logger.debug("Skipping candidate - grand parent node is 'None'.")
continue
# if paragraph is < `MIN_HIT_LENTH` characters don't even count it
inner_text = node.text_content().strip()
if len(inner_text) < MIN_HIT_LENTH:
logger.debug("Skipping candidate - inner text < %d characters.", MIN_HIT_LENTH)
continue
# initialize readability data for the parent
# add parent node if it isn't in the candidate list
if parent not in candidates:
candidates[parent] = ScoredNode(parent)
if grand not in candidates:
candidates[grand] = ScoredNode(grand)
# add a point for the paragraph itself as a base
content_score = 1
if inner_text:
# add 0.25 points for any commas within this paragraph
commas_count = inner_text.count(",")
content_score += commas_count * 0.25
logger.debug("Bonus points for %d commas.", commas_count)
# subtract 0.5 points for each double quote within this paragraph
double_quotes_count = inner_text.count('"')
content_score += double_quotes_count * -0.5
logger.debug("Penalty points for %d double-quotes.", double_quotes_count)
# for every 100 characters in this paragraph, add another point
# up to 3 points
length_points = len(inner_text) / 100
content_score += min(length_points, 3.0)
logger.debug("Bonus points for length of text: %f", length_points)
# add the score to the parent
logger.debug("Bonus points for parent %s %r with score %f: %f",
parent.tag, parent.attrib, candidates[parent].content_score,
content_score)
candidates[parent].content_score += content_score
# the grand node gets half
logger.debug("Bonus points for grand %s %r with score %f: %f",
grand.tag, grand.attrib, candidates[grand].content_score,
content_score / 2.0)
candidates[grand].content_score += content_score / 2.0
if node not in candidates:
candidates[node] = ScoredNode(node)
candidates[node].content_score += content_score
for candidate in candidates.values():
adjustment = 1.0 - get_link_density(candidate.node)
candidate.content_score *= adjustment
logger.debug("Link density adjustment for %s %r: %f",
candidate.node.tag, candidate.node.attrib, adjustment)
return candidates
class ScoredNode(object):
"""
We need Scored nodes we use to track possible article matches
We might have a bunch of these so we use __slots__ to keep memory usage
down.
"""
__slots__ = ('node', 'content_score')
def __init__(self, node):
"""Given node, set an initial score and weigh based on css and id"""
self.node = node
self.content_score = 0
if node.tag in ('div', 'article'):
self.content_score = 5
if node.tag in ('pre', 'td', 'blockquote'):
self.content_score = 3
if node.tag in ('address', 'ol', 'ul', 'dl', 'dd', 'dt', 'li', 'form'):
self.content_score = -3
if node.tag in ('h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'th'):
self.content_score = -5
self.content_score += get_class_weight(node)
@property
def hash_id(self):
return generate_hash_id(self.node)
def __repr__(self):
if self.node is None:
return "<NullScoredNode with score {2:0.1F}>" % self.content_score
return "<ScoredNode {0} {1}: {2:0.1F}>".format(
self.node.tag,
self.node.attrib,
self.content_score
)

@ -0,0 +1,87 @@
# -*- coding: utf8 -*-
"""
A fast python port of arc90's readability tool
Usage:
readability [options] <resource>
readability --version
readability --help
Arguments:
<resource> URL or file path to process in readable form.
Options:
-f, --fragment Output html fragment by default.
-b, --browser Open the parsed content in your web browser.
-d, --debug Output the detailed scoring information for debugging
parsing.
-v, --verbose Increase logging verbosity to DEBUG.
--version Display program's version number and exit.
-h, --help Display this help message and exit.
"""
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
import logging
import locale
import webbrowser
from tempfile import NamedTemporaryFile
from docopt import docopt
from .. import __version__
from .._compat import urllib
from ..readable import Article
HEADERS = {
"User-Agent": "Readability (Readable content parser) Version/%s" % __version__,
}
def parse_args():
return docopt(__doc__, version=__version__)
def main():
args = parse_args()
logger = logging.getLogger("readability")
if args["--verbose"]:
logger.setLevel(logging.DEBUG)
resource = args["<resource>"]
if resource.startswith("www"):
resource = "http://" + resource
url = None
if resource.startswith("http://") or resource.startswith("https://"):
url = resource
request = urllib.Request(url, headers=HEADERS)
response = urllib.urlopen(request)
content = response.read()
response.close()
else:
with open(resource, "r") as file:
content = file.read()
document = Article(content, url=url, return_fragment=args["--fragment"])
if args["--browser"]:
html_file = NamedTemporaryFile(mode="wb", suffix=".html", delete=False)
content = document.readable.encode("utf8")
html_file.write(content)
html_file.close()
webbrowser.open(html_file.name)
else:
encoding = locale.getpreferredencoding()
content = document.readable.encode(encoding)
print(content)
if __name__ == '__main__':
main()

@ -0,0 +1,127 @@
# -*- coding: utf8 -*-
"""
Helper to generate a new set of article test files for readability.
Usage:
readability_test --name <name> <url>
readability_test --version
readability_test --help
Arguments:
<url> The url of content to fetch for the article.html
Options:
-n <name>, --name=<name> Name of the test directory.
--version Show program's version number and exit.
-h, --help Show this help message and exit.
"""
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from os import mkdir
from os.path import join, dirname, pardir, exists as path_exists
from docopt import docopt
from .. import __version__
from .._compat import to_unicode, urllib
TEST_PATH = join(
dirname(__file__),
pardir, pardir,
"tests/test_articles"
)
TEST_TEMPLATE = '''# -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from os.path import join, dirname
from readability.readable import Article
from ...compat import unittest
class TestArticle(unittest.TestCase):
"""
Test the scoring and parsing of the article from URL below:
%(source_url)s
"""
def setUp(self):
"""Load up the article for us"""
article_path = join(dirname(__file__), "article.html")
with open(article_path, "rb") as file:
self.document = Article(file.read(), "%(source_url)s")
def tearDown(self):
"""Drop the article"""
self.document = None
def test_parses(self):
"""Verify we can parse the document."""
self.assertIn('id="readabilityBody"', self.document.readable)
def test_content_exists(self):
"""Verify that some content exists."""
self.assertIn("#&@#&@#&@", self.document.readable)
def test_content_does_not_exist(self):
"""Verify we cleaned out some content that shouldn't exist."""
self.assertNotIn("", self.document.readable)
'''
def parse_args():
return docopt(__doc__, version=__version__)
def make_test_directory(name):
"""Generates a new directory for tests."""
directory_name = "test_" + name.replace(" ", "_")
directory_path = join(TEST_PATH, directory_name)
if not path_exists(directory_path):
mkdir(directory_path)
return directory_path
def make_test_files(directory_path, url):
init_file = join(directory_path, "__init__.py")
open(init_file, "a").close()
data = TEST_TEMPLATE % {
"source_url": to_unicode(url)
}
test_file = join(directory_path, "test.py")
with open(test_file, "w") as file:
file.write(data)
def fetch_article(directory_path, url):
"""Get the content of the url and make it the article.html"""
opener = urllib.build_opener()
opener.addheaders = [("Accept-Charset", "utf-8")]
response = opener.open(url)
html_data = response.read()
response.close()
path = join(directory_path, "article.html")
with open(path, "wb") as file:
file.write(html_data)
def main():
"""Run the script."""
args = parse_args()
directory = make_test_directory(args["--name"])
make_test_files(directory, args["<url>"])
fetch_article(directory, args["<url>"])
if __name__ == "__main__":
main()

@ -0,0 +1,58 @@
# -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
import re
def is_blank(text):
"""
Returns ``True`` if string contains only whitespace characters
or is empty. Otherwise ``False`` is returned.
"""
return not text or text.isspace()
def shrink_text(text):
return normalize_whitespace(text.strip())
MULTIPLE_WHITESPACE_PATTERN = re.compile(r"\s+", re.UNICODE)
def normalize_whitespace(text):
"""
Translates multiple whitespace into single space character.
If there is at least one new line character chunk is replaced
by single LF (Unix new line) character.
"""
return MULTIPLE_WHITESPACE_PATTERN.sub(_replace_whitespace, text)
def _replace_whitespace(match):
text = match.group()
if "\n" in text or "\r" in text:
return "\n"
else:
return " "
def cached_property(getter):
"""
Decorator that converts a method into memoized property.
The decorator works as expected only for classes with
attribute '__dict__' and immutable properties.
"""
def decorator(self):
key = "_cached_property_" + getter.__name__
if not hasattr(self, key):
setattr(self, key, getter(self))
return getattr(self, key)
decorator.__name__ = getter.__name__
decorator.__module__ = getter.__module__
decorator.__doc__ = getter.__doc__
return property(decorator)

@ -0,0 +1,508 @@
import re
from lxml.etree import tounicode
from lxml.etree import tostring
from lxml.html.clean import Cleaner
from lxml.html import fragment_fromstring
from lxml.html import fromstring
from operator import attrgetter
from pprint import PrettyPrinter
from breadability.document import OriginalDocument
from breadability.logconfig import LOG
from breadability.logconfig import LNODE
from breadability.scoring import score_candidates
from breadability.scoring import get_link_density
from breadability.scoring import get_class_weight
from breadability.scoring import is_unlikely_node
from breadability.utils import cached_property
html_cleaner = Cleaner(scripts=True, javascript=True, comments=True,
style=True, links=True, meta=False, add_nofollow=False,
page_structure=False, processing_instructions=True,
embedded=False, frames=False, forms=False,
annoying_tags=False, remove_tags=None,
remove_unknown_tags=False, safe_attrs_only=False)
BASE_DOC = """
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
</head>
<body>
</body>
</html>
"""
SCORABLE_TAGS = ['div', 'p', 'td', 'pre', 'article']
def drop_tag(doc, *tags):
"""Helper to just remove any nodes that match this html tag passed in
:param *tags: one or more html tag strings to remove e.g. style, script
"""
for tag in tags:
found = doc.iterfind(".//" + tag)
for n in found:
LNODE.log(n, 1, "Dropping tag")
n.drop_tree()
return doc
def is_bad_link(a_node):
"""Helper to determine if the link is something to clean out
We've hit articles with many multiple links that should be cleaned out
because they're just there to pollute the space. See tests for examples.
"""
if a_node.tag == 'a':
name = a_node.get('name')
href = a_node.get('href')
if name and not href:
return True
if href:
url_bits = href.split('#')
if len(url_bits) == 2:
if len(url_bits[1]) > 25:
return True
return False
def ok_embedded_video(node):
"""Check if this embed/video is an ok one to count."""
keep_keywords = ['youtube', 'blip.tv', 'vimeo']
node_str = tounicode(node)
for key in keep_keywords:
if key in node_str:
return True
return False
def build_base_document(html, fragment=True):
"""Return a base document with the body as root.
:param html: Parsed Element object
:param fragment: Should we return a <div> doc fragment or a full <html>
doc.
"""
if html.tag == 'body':
html.tag = 'div'
found_body = html
else:
found_body = html.find('.//body')
if found_body is None:
frag = fragment_fromstring('<div/>')
frag.set('id', 'readabilityBody')
frag.append(html)
if not fragment:
output = fromstring(BASE_DOC)
insert_point = output.find('.//body')
insert_point.append(frag)
else:
output = frag
else:
found_body.tag = 'div'
found_body.set('id', 'readabilityBody')
if not fragment:
output = fromstring(BASE_DOC)
insert_point = output.find('.//body')
insert_point.append(found_body)
else:
output = found_body
output.doctype = "<!DOCTYPE html>"
return output
def build_error_document(html, fragment=True):
"""Return an empty erorr document with the body as root.
:param fragment: Should we return a <div> doc fragment or a full <html>
doc.
"""
frag = fragment_fromstring('<div/>')
frag.set('id', 'readabilityBody')
frag.set('class', 'parsing-error')
if not fragment:
output = fromstring(BASE_DOC)
insert_point = output.find('.//body')
insert_point.append(frag)
else:
output = frag
output.doctype = "<!DOCTYPE html>"
return output
def transform_misused_divs_into_paragraphs(doc):
"""Turn all divs that don't have children block level elements into p's
Since we can't change the tree as we iterate over it, we must do this
before we process our document.
The idea is that we process all divs and if the div does not contain
another list of divs, then we replace it with a p tag instead appending
it's contents/children to it.
"""
for elem in doc.iter(tag='div'):
child_tags = [n.tag for n in elem.getchildren()]
if 'div' not in child_tags:
# if there is no div inside of this div...then it's a leaf
# node in a sense.
# We need to create a <p> and put all it's contents in there
# We'll just stringify it, then regex replace the first/last
# div bits to turn them into <p> vs <div>.
LNODE.log(elem, 1, 'Turning leaf <div> into <p>')
orig = tounicode(elem).strip()
started = re.sub(r'^<\s*div', '<p', orig)
ended = re.sub(r'div>$', 'p>', started)
elem.getparent().replace(elem, fromstring(ended))
return doc
def check_siblings(candidate_node, candidate_list):
"""Look through siblings for content that might also be related.
Things like preambles, content split by ads that we removed, etc.
"""
candidate_css = candidate_node.node.get('class')
potential_target = candidate_node.content_score * 0.2
sibling_target_score = potential_target if potential_target > 10 else 10
parent = candidate_node.node.getparent()
siblings = parent.getchildren() if parent is not None else []
for sibling in siblings:
append = False
content_bonus = 0
if sibling is candidate_node.node:
LNODE.log(sibling, 1, 'Sibling is the node so append')
append = True
# Give a bonus if sibling nodes and top candidates have the example
# same class name
if candidate_css and sibling.get('class') == candidate_css:
content_bonus += candidate_node.content_score * 0.2
if sibling in candidate_list:
adjusted_score = candidate_list[sibling].content_score + \
content_bonus
if adjusted_score >= sibling_target_score:
append = True
if sibling.tag == 'p':
link_density = get_link_density(sibling)
content = sibling.text_content()
content_length = len(content)
if content_length > 80 and link_density < 0.25:
append = True
elif content_length < 80 and link_density == 0:
if ". " in content:
append = True
if append:
LNODE.log(sibling, 1, 'Sibling being appended')
if sibling.tag not in ['div', 'p']:
# We have a node that isn't a common block level element, like
# a form or td tag. Turn it into a div so it doesn't get
# filtered out later by accident.
sibling.tag = 'div'
if candidate_node.node != sibling:
candidate_node.node.append(sibling)
return candidate_node
def clean_document(node):
"""Clean up the final document we return as the readable article"""
if node is None or len(node) == 0:
return
LNODE.log(node, 2, "Processing doc")
clean_list = ['object', 'h1']
to_drop = []
# If there is only one h2, they are probably using it as a header and
# not a subheader, so remove it since we already have a header.
if len(node.findall('.//h2')) == 1:
LOG.debug('Adding H2 to list of nodes to clean.')
clean_list.append('h2')
for n in node.iter():
LNODE.log(n, 2, "Cleaning iter node")
# clean out any in-line style properties
if 'style' in n.attrib:
n.set('style', '')
# remove all of the following tags
# Clean a node of all elements of type "tag".
# (Unless it's a youtube/vimeo video. People love movies.)
is_embed = True if n.tag in ['object', 'embed'] else False
if n.tag in clean_list:
allow = False
# Allow youtube and vimeo videos through as people usually
# want to see those.
if is_embed:
if ok_embedded_video(n):
allow = True
if not allow:
LNODE.log(n, 2, "Dropping Node")
to_drop.append(n)
if n.tag in ['h1', 'h2', 'h3', 'h4']:
# clean headings
# if the heading has no css weight or a high link density,
# remove it
if get_class_weight(n) < 0 or get_link_density(n) > .33:
LNODE.log(n, 2, "Dropping <hX>, it's insignificant")
to_drop.append(n)
# clean out extra <p>
if n.tag == 'p':
# if the p has no children and has no content...well then down
# with it.
if not n.getchildren() and len(n.text_content()) < 5:
LNODE.log(n, 2, 'Dropping extra <p>')
to_drop.append(n)
# finally try out the conditional cleaning of the target node
if clean_conditionally(n):
to_drop.append(n)
[n.drop_tree() for n in to_drop if n.getparent() is not None]
return node
def clean_conditionally(node):
"""Remove the clean_el if it looks like bad content based on rules."""
target_tags = ['form', 'table', 'ul', 'div', 'p']
LNODE.log(node, 2, 'Cleaning conditionally node.')
if node.tag not in target_tags:
# this is not the tag you're looking for
LNODE.log(node, 2, 'Node cleared.')
return
weight = get_class_weight(node)
# content_score = LOOK up the content score for this node we found
# before else default to 0
content_score = 0
if (weight + content_score < 0):
LNODE.log(node, 2, 'Dropping conditional node')
LNODE.log(node, 2, 'Weight + score < 0')
return True
if node.text_content().count(',') < 10:
LOG.debug("There aren't 10 ,s so we're processing more")
# If there are not very many commas, and the number of
# non-paragraph elements is more than paragraphs or other ominous
# signs, remove the element.
p = len(node.findall('.//p'))
img = len(node.findall('.//img'))
li = len(node.findall('.//li')) - 100
inputs = len(node.findall('.//input'))
embed = 0
embeds = node.findall('.//embed')
for e in embeds:
if ok_embedded_video(e):
embed += 1
link_density = get_link_density(node)
content_length = len(node.text_content())
remove_node = False
if li > p and node.tag != 'ul' and node.tag != 'ol':
LNODE.log(node, 2, 'Conditional drop: li > p and not ul/ol')
remove_node = True
elif inputs > p / 3.0:
LNODE.log(node, 2, 'Conditional drop: inputs > p/3.0')
remove_node = True
elif content_length < 25 and (img == 0 or img > 2):
LNODE.log(node, 2,
'Conditional drop: len < 25 and 0/>2 images')
remove_node = True
elif weight < 25 and link_density > 0.2:
LNODE.log(node, 2,
'Conditional drop: weight small and link is dense')
remove_node = True
elif weight >= 25 and link_density > 0.5:
LNODE.log(node, 2,
'Conditional drop: weight big but link heavy')
remove_node = True
elif (embed == 1 and content_length < 75) or embed > 1:
LNODE.log(node, 2,
'Conditional drop: embed w/o much content or many embed')
remove_node = True
if remove_node:
LNODE.log(node, 2, 'Node will be removed')
else:
LNODE.log(node, 2, 'Node cleared')
return remove_node
# nope, don't remove anything
LNODE.log(node, 2, 'Node Cleared final.')
return False
def prep_article(doc):
"""Once we've found our target article we want to clean it up.
Clean out:
- inline styles
- forms
- strip empty <p>
- extra tags
"""
doc = clean_document(doc)
return doc
def find_candidates(doc):
"""Find cadidate nodes for the readable version of the article.
Here's we're going to remove unlikely nodes, find scores on the rest, and
clean up and return the final best match.
"""
scorable_node_tags = SCORABLE_TAGS
nodes_to_score = []
should_remove = []
for node in doc.iter():
if is_unlikely_node(node):
LOG.debug('We should drop unlikely: ' + str(node))
should_remove.append(node)
continue
if node.tag == 'a' and is_bad_link(node):
LOG.debug('We should drop bad link: ' + str(node))
should_remove.append(node)
continue
if node.tag in scorable_node_tags and node not in nodes_to_score:
nodes_to_score.append(node)
return score_candidates(nodes_to_score), should_remove
class Article(object):
"""Parsed readable object"""
_should_drop = []
def __init__(self, html, url=None, fragment=True):
"""Create the Article we're going to use.
:param html: The string of html we're going to parse.
:param url: The url so we can adjust the links to still work.
:param fragment: Should we return a <div> fragment or a full <html>
doc.
"""
LOG.debug('Url: ' + str(url))
self.orig = OriginalDocument(html, url=url)
self.fragment = fragment
def __str__(self):
return tostring(self._readable)
def __unicode__(self):
return tounicode(self._readable)
@cached_property(ttl=600)
def doc(self):
"""The doc is the parsed xml tree of the given html."""
try:
doc = self.orig.html
# cleaning doesn't return, just wipes in place
html_cleaner(doc)
doc = drop_tag(doc, 'noscript', 'iframe')
doc = transform_misused_divs_into_paragraphs(doc)
return doc
except ValueError:
return None
@cached_property(ttl=600)
def candidates(self):
"""Generate the list of candidates from the doc."""
doc = self.doc
if doc is not None and len(doc):
candidates, should_drop = find_candidates(doc)
self._should_drop = should_drop
return candidates
else:
return None
@cached_property(ttl=600)
def readable(self):
return tounicode(self._readable)
@cached_property(ttl=600)
def _readable(self):
"""The readable parsed article"""
if self.candidates:
LOG.debug('Candidates found:')
pp = PrettyPrinter(indent=2)
# cleanup by removing the should_drop we spotted.
[n.drop_tree() for n in self._should_drop
if n.getparent() is not None]
# right now we return the highest scoring candidate content
by_score = sorted([c for c in self.candidates.values()],
key=attrgetter('content_score'), reverse=True)
LOG.debug(pp.pformat(by_score))
# since we have several candidates, check the winner's siblings
# for extra content
winner = by_score[0]
LOG.debug('Selected winning node: ' + str(winner))
updated_winner = check_siblings(winner, self.candidates)
LOG.debug('Begin final prep of article')
updated_winner.node = prep_article(updated_winner.node)
if updated_winner.node is not None:
doc = build_base_document(updated_winner.node, self.fragment)
else:
LOG.warning('Had candidates but failed to find a cleaned winning doc.')
doc = self._handle_no_candidates()
else:
LOG.warning('No candidates found: using document.')
LOG.debug('Begin final prep of article')
doc = self._handle_no_candidates()
return doc
def _handle_no_candidates(self):
"""If we fail to find a good candidate we need to find something else."""
# since we've not found a good candidate we're should help this
if self.doc is not None and len(self.doc):
# cleanup by removing the should_drop we spotted.
[n.drop_tree() for n in self._should_drop
if n.getparent() is not None]
doc = prep_article(self.doc)
doc = build_base_document(doc, self.fragment)
else:
LOG.warning('No document to use.')
doc = build_error_document(self.fragment)
return doc

@ -1,6 +1,5 @@
chardet
docopt>=0.6.1,<0.7
charade
lxml
coverage
nose
pep8
pylint

@ -0,0 +1,237 @@
"""Handle dealing with scoring nodes and content for our parsing."""
import re
from hashlib import md5
from lxml.etree import tounicode
from breadability.logconfig import LNODE
from breadability.logconfig import LOG
# A series of sets of attributes we check to help in determining if a node is
# a potential candidate or not.
CLS_UNLIKELY = re.compile(('combx|comment|community|disqus|extra|foot|header|'
'menu|remark|rss|shoutbox|sidebar|sponsor|ad-break|agegate|pagination|'
'pager|perma|popup|tweet|twitter'), re.I)
CLS_MAYBE = re.compile('and|article|body|column|main|shadow', re.I)
CLS_WEIGHT_POSITIVE = re.compile(('article|body|content|entry|hentry|main|'
'page|pagination|post|text|blog|story'), re.I)
CLS_WEIGHT_NEGATIVE = re.compile(('combx|comment|com-|contact|foot|footer|'
'footnote|head|masthead|media|meta|outbrain|promo|related|scroll|shoutbox|'
'sidebar|sponsor|shopping|tags|tool|widget'), re.I)
def check_node_attr(node, attr, checkset):
value = node.get(attr) or ""
check = checkset.search(value)
if check:
return True
else:
return False
def generate_hash_id(node):
"""Generate a hash_id for the node in question.
:param node: lxml etree node
"""
content = tounicode(node)
hashed = md5()
try:
hashed.update(content.encode('utf-8', "replace"))
except Exception, e:
LOG.error("BOOM! " + str(e))
return hashed.hexdigest()[0:8]
def get_link_density(node, node_text=None):
"""Generate a value for the number of links in the node.
:param node: pared elementree node
:param node_text: if we already have the text_content() make this easier
on us.
:returns float:
"""
link_length = sum([len(a.text_content()) or 0
for a in node.findall(".//a")])
# For each img, give 50 bonus chars worth of length.
# Tweaking this 50 down a notch should help if we hit false positives.
link_length = max(link_length -
sum([50 for img in node.findall(".//img")]), 0)
if node_text:
text_length = len(node_text)
else:
text_length = len(node.text_content())
return float(link_length) / max(text_length, 1)
def get_class_weight(node):
"""Get an elements class/id weight.
We're using sets to help efficiently check for existence of matches.
"""
weight = 0
if check_node_attr(node, 'class', CLS_WEIGHT_NEGATIVE):
weight = weight - 25
if check_node_attr(node, 'class', CLS_WEIGHT_POSITIVE):
weight = weight + 25
if check_node_attr(node, 'id', CLS_WEIGHT_NEGATIVE):
weight = weight - 25
if check_node_attr(node, 'id', CLS_WEIGHT_POSITIVE):
weight = weight + 25
return weight
def is_unlikely_node(node):
"""Short helper for checking unlikely status.
If the class or id are in the unlikely list, and there's not also a
class/id in the likely list then it might need to be removed.
"""
unlikely = check_node_attr(node, 'class', CLS_UNLIKELY) or \
check_node_attr(node, 'id', CLS_UNLIKELY)
maybe = check_node_attr(node, 'class', CLS_MAYBE) or \
check_node_attr(node, 'id', CLS_MAYBE)
if unlikely and not maybe and node.tag != 'body':
return True
else:
return False
def score_candidates(nodes):
"""Given a list of potential nodes, find some initial scores to start"""
MIN_HIT_LENTH = 25
candidates = {}
for node in nodes:
LNODE.log(node, 1, "Scoring Node")
content_score = 0
# if the node has no parent it knows of, then it ends up creating a
# body and html tag to parent the html fragment.
parent = node.getparent()
grand = parent.getparent() if parent is not None else None
innertext = node.text_content()
if parent is None or grand is None:
LNODE.log(
node, 1,
"Skipping candidate because parent/grand are none")
continue
# If this paragraph is less than 25 characters, don't even count it.
if innertext and len(innertext) < MIN_HIT_LENTH:
LNODE.log(
node, 1,
"Skipping candidate because not enough content.")
continue
# Initialize readability data for the parent.
# if the parent node isn't in the candidate list, add it
if parent not in candidates:
candidates[parent] = ScoredNode(parent)
if grand not in candidates:
candidates[grand] = ScoredNode(grand)
# Add a point for the paragraph itself as a base.
content_score += 1
if innertext:
# Add 0.25 points for any commas within this paragraph
content_score += innertext.count(',') * 0.25
LNODE.log(node, 1,
"Bonus points for ,: " + str(innertext.count(',')))
# Subtract 0.5 points for each double quote within this paragraph
content_score += innertext.count('"') * (-0.5)
LNODE.log(node, 1,
'Penalty points for ": ' + str(innertext.count('"')))
# For every 100 characters in this paragraph, add another point.
# Up to 3 points.
length_points = len(innertext) / 100
if length_points > 3:
content_score += 3
else:
content_score += length_points
LNODE.log(
node, 1,
"Length/content points: {0} : {1}".format(length_points,
content_score))
# Add the score to the parent.
LNODE.log(node, 1, "From this current node.")
candidates[parent].content_score += content_score
LNODE.log(
candidates[parent].node,
1,
"Giving parent bonus points: " + str(
candidates[parent].content_score))
# The grandparent gets half.
LNODE.log(candidates[grand].node, 1, "Giving grand bonus points")
candidates[grand].content_score += (content_score / 2.0)
LNODE.log(
candidates[parent].node,
1,
"Giving grand bonus points: " + str(
candidates[grand].content_score))
for candidate in candidates.values():
adjustment = 1 - get_link_density(candidate.node)
LNODE.log(
candidate.node,
1,
"Getting link density adjustment: {0} * {1} ".format(
candidate.content_score, adjustment))
candidate.content_score = candidate.content_score * (adjustment)
return candidates
class ScoredNode(object):
"""We need Scored nodes we use to track possible article matches
We might have a bunch of these so we use __slots__ to keep memory usage
down.
"""
__slots__ = ['node', 'content_score']
def __repr__(self):
"""Helpful representation of our Scored Node"""
return "{0}: {1:0.1F}\t{2}".format(
self.hash_id,
self.content_score,
self.node)
def __init__(self, node):
"""Given node, set an initial score and weigh based on css and id"""
self.node = node
content_score = 0
if node.tag in ['div', 'article']:
content_score = 5
if node.tag in ['pre', 'td', 'blockquote']:
content_score = 3
if node.tag in ['address', 'ol', 'ul', 'dl', 'dd', 'dt', 'li',
'form']:
content_score = -3
if node.tag in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'th']:
content_score = -5
content_score += get_class_weight(node)
self.content_score = content_score
@property
def hash_id(self):
return generate_hash_id(self.node)

@ -1,59 +1,81 @@
from setuptools import setup, find_packages
import sys
import os
here = os.path.abspath(os.path.dirname(__file__))
README = open(os.path.join(here, 'README.rst')).read()
NEWS = open(os.path.join(here, 'NEWS.txt')).read()
from os.path import abspath, dirname, join
from setuptools import setup, find_packages
from readability import __version__
VERSION_SUFFIX = "%d.%d" % sys.version_info[:2]
CURRENT_DIRECTORY = abspath(dirname(__file__))
with open(join(CURRENT_DIRECTORY, "README.rst")) as readme:
with open(join(CURRENT_DIRECTORY, "CHANGELOG.rst")) as changelog:
long_description = "%s\n\n%s" % (readme.read(), changelog.read())
version = '0.1.14'
install_requires = [
# List your project dependencies here.
# For more details, see:
# http://packages.python.org/distribute/setuptools.html#declaring-dependencies
'chardet',
'lxml',
"docopt>=0.6.1,<0.7",
"charade",
"lxml>=2.0",
]
tests_require = [
'coverage',
'nose',
'pep8',
'pylint',
"coverage",
"nose",
]
if sys.version_info < (2, 7):
# Require argparse since it's not in the stdlib yet.
install_requires.append('argparse')
install_requires.append('unittest2')
install_requires.append("unittest2")
setup(
name='breadability',
version=version,
description="Redone port of Readability API in Python",
long_description=README + '\n\n' + NEWS,
name="readability",
version=__version__,
description="Port of Readability HTML parser in Python",
long_description=long_description,
keywords=[
"readability",
"readable",
"parsing",
"HTML",
"content",
],
author="Michal Belica",
author_email="miso.belica@gmail.com",
url="https://github.com/miso-belica/readability.py",
license="BSD",
classifiers=[
# Get strings from
# http://pypi.python.org/pypi?%3Aaction=list_classifiers
"Development Status :: 5 - Production/Stable",
"Intended Audience :: Developers",
"License :: OSI Approved :: BSD License",
"Operating System :: OS Independent",
"Programming Language :: Python",
"Programming Language :: Python :: 2",
"Programming Language :: Python :: 2.6",
"Programming Language :: Python :: 2.7",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.2",
"Programming Language :: Python :: 3.3",
"Programming Language :: Python :: Implementation :: CPython",
"Topic :: Internet :: WWW/HTTP",
"Topic :: Software Development :: Pre-processors",
"Topic :: Text Processing :: Filters",
"Topic :: Text Processing :: Markup :: HTML",
],
keywords='readable parsing html content bookie',
author='Rick Harding',
author_email='rharding@mitechie.com',
url='http://docs.bmark.us',
license='BSD',
packages=find_packages('src'),
package_dir={'': 'src'},
packages=find_packages(),
include_package_data=True,
zip_safe=False,
install_requires=install_requires,
tests_require=tests_require,
extras_require={
'test': tests_require
},
test_suite="tests.run_tests.run",
entry_points={
'console_scripts': [
'breadability=breadability:client.main',
'breadability_newtest=breadability:newtest.main',
"console_scripts": [
"readability = readability.scripts.client:main",
"readability-%s = readability.scripts.client:main" % VERSION_SUFFIX,
"readability_test = readability.scripts.test_helper:main",
"readability_test-%s = readability.scripts.test_helper:main" % VERSION_SUFFIX,
]
}
)

@ -1,115 +0,0 @@
"""Generate a clean nice starting html document to process for an article."""
import chardet
import re
from lxml.etree import tostring
from lxml.etree import tounicode
from lxml.etree import XMLSyntaxError
from lxml.html import document_fromstring
from lxml.html import HTMLParser
from breadability.logconfig import LOG
from breadability.utils import cached_property
utf8_parser = HTMLParser(encoding='utf-8')
def get_encoding(page):
text = re.sub('</?[^>]*>\s*', ' ', page)
enc = 'utf-8'
if not text.strip() or len(text) < 10:
return enc # can't guess
try:
diff = text.decode(enc, 'ignore').encode(enc)
sizes = len(diff), len(text)
# 99% of utf-8
if abs(len(text) - len(diff)) < max(sizes) * 0.01:
return enc
except UnicodeDecodeError:
pass
res = chardet.detect(text)
enc = res['encoding']
# print '->', enc, "%.2f" % res['confidence']
if enc == 'MacCyrillic':
enc = 'cp1251'
if not enc:
enc = 'utf-8'
return enc
def replace_multi_br_to_paragraphs(html):
"""Convert multiple <br>s into paragraphs"""
LOG.debug('Replacing multiple <br/> to <p>')
rep = re.compile("(<br[^>]*>[ \n\r\t]*){2,}", re.I)
return rep.sub('</p><p>', html)
def build_doc(page):
"""Requires that the `page` not be None"""
if page is None:
LOG.error("Page content is None, can't build_doc")
return ''
if isinstance(page, unicode):
page_unicode = page
else:
enc = get_encoding(page)
page_unicode = page.decode(enc, 'replace')
try:
doc = document_fromstring(
page_unicode.encode('utf-8', 'replace'),
parser=utf8_parser)
return doc
except XMLSyntaxError, exc:
LOG.error('Failed to parse: ' + str(exc))
raise ValueError('Failed to parse document contents.')
class OriginalDocument(object):
"""The original document to process"""
_base_href = None
def __init__(self, html, url=None):
self.orig_html = html
self.url = url
def __str__(self):
"""Render out our document as a string"""
return tostring(self.html)
def __unicode__(self):
"""Render out our document as a string"""
return tounicode(self.html)
def _parse(self, html):
"""Generate an lxml document from our html."""
html = replace_multi_br_to_paragraphs(html)
doc = build_doc(html)
# doc = html_cleaner.clean_html(doc)
base_href = self.url
if base_href:
LOG.debug('Making links absolute')
doc.make_links_absolute(base_href, resolve_base_href=True)
else:
doc.resolve_base_href()
return doc
@cached_property(ttl=600)
def html(self):
"""The parsed html document from the input"""
return self._parse(self.orig_html)
@cached_property(ttl=600)
def links(self):
"""Links within the document"""
return self.html.findall(".//a")
@cached_property(ttl=600)
def title(self):
"""Pull the title attribute out of the parsed document"""
titleElem = self.html.find('.//title')
if titleElem is None or titleElem.text is None:
return ''
else:
return titleElem.text

@ -1,190 +0,0 @@
"""Setup a logging helper for our module.
Helpers:
LOG - out active logger instance
set_logging_level(level) - adjust the current logging level
"""
import logging
import sys
import time
from collections import namedtuple
from hashlib import md5
from lxml.etree import tounicode
# For pretty log messages, if available
try:
import curses
except ImportError:
curses = None
LOGLEVEL = "WARNING"
# Logging bits stolen and adapted from:
# http://www.tornadoweb.org/documentation/_modules/tornado/options.html
LogOptions = namedtuple('LogOptions', [
'loglevel',
'log_file_prefix',
'log_file_max_size',
'log_file_num_backups',
'log_to_stderr',
])
options = LogOptions(
loglevel=LOGLEVEL,
log_file_prefix="",
log_file_max_size=100 * 1000 * 1000,
log_file_num_backups=5,
log_to_stderr=True,
)
def set_logging_level(level):
"""Adjust the current logging level.
Expect a string of DEBUG, WARNING, INFO, etc.
"""
logging.getLogger('breadable').setLevel(getattr(logging, level))
def enable_pretty_logging():
"""Turns on formatted logging output as configured.
This is called automatically by `parse_command_line`.
"""
root_logger = logging.getLogger()
if options.log_file_prefix:
channel = logging.handlers.RotatingFileHandler(
filename=options.log_file_prefix,
maxBytes=options.log_file_max_size,
backupCount=options.log_file_num_backups)
channel.setFormatter(_LogFormatter(color=False))
root_logger.addHandler(channel)
if (options.log_to_stderr or
(options.log_to_stderr is None and not root_logger.handlers)):
# Set up color if we are in a tty and curses is installed
color = False
if curses and sys.stderr.isatty():
try:
curses.setupterm()
if curses.tigetnum("colors") > 0:
color = True
except Exception:
pass
channel = logging.StreamHandler()
channel.setFormatter(_LogFormatter(color=color))
root_logger.addHandler(channel)
class LogHelper(object):
"""Helper to allow us to log as we want for debugging"""
scoring = 1
removing = 2
_active = False
_actions = None
def __init__(self, log, actions=None, content=False):
if actions is None:
self._actions = tuple()
else:
self._actions = actions
self._log = log
self.content = content
@property
def actions(self):
"""Return a tuple of the actions we want to log"""
return self._actions
def activate(self):
"""Turn on this logger."""
self._active = True
def deactivate(self):
"""Turn off the logger"""
self._active = False
def log(self, node, action, description):
"""Write out our log info based on the node and event specified.
We only log this information if we're are DEBUG loglevel
"""
if self._active:
content = tounicode(node)
hashed = md5()
try:
hashed.update(content.encode('utf-8', errors="replace"))
except Exception, exc:
LOG.error("Cannot hash the current node." + str(exc))
hash_id = hashed.hexdigest()[0:8]
# if hash_id in ['9c880b27', '8393b7d7', '69bfebdd']:
print(u"{0} :: {1}\n{2}".format(
hash_id,
description,
content.replace(u"\n", u"")[0:202],
))
class _LogFormatter(logging.Formatter):
def __init__(self, color, *args, **kwargs):
logging.Formatter.__init__(self, *args, **kwargs)
self._color = color
if color:
# The curses module has some str/bytes confusion in python3.
# Most methods return bytes, but only accept strings.
# The explict calls to unicode() below are harmless in python2,
# but will do the right conversion in python3.
fg_color = unicode(curses.tigetstr("setaf") or
curses.tigetstr("setf") or "", "ascii")
self._colors = {
logging.DEBUG: unicode(
curses.tparm(fg_color, curses.COLOR_CYAN),
"ascii"),
logging.INFO: unicode(
curses.tparm(fg_color, curses.COLOR_GREEN),
"ascii"),
logging.WARNING: unicode(
curses.tparm(fg_color, curses.COLOR_YELLOW), # Yellow
"ascii"),
logging.ERROR: unicode(
curses.tparm(fg_color, curses.COLOR_RED), # Red
"ascii"),
}
self._normal = unicode(curses.tigetstr("sgr0"), "ascii")
def format(self, record):
try:
record.message = record.getMessage()
except Exception, e:
record.message = "Bad message (%r): %r" % (e, record.__dict__)
record.asctime = time.strftime(
"%y%m%d %H:%M:%S", self.converter(record.created))
prefix = '[%(levelname)1.1s %(asctime)s %(module)s:%(lineno)d]' % \
record.__dict__
if self._color:
prefix = (self._colors.get(record.levelno, self._normal) +
prefix + self._normal)
formatted = prefix + " " + record.message
if record.exc_info:
if not record.exc_text:
record.exc_text = self.formatException(record.exc_info)
if record.exc_text:
formatted = formatted.rstrip() + "\n" + record.exc_text
return formatted.replace("\n", "\n ")
# Set up log level and pretty console logging by default
logging.getLogger('breadable').setLevel(getattr(logging, LOGLEVEL))
enable_pretty_logging()
LOG = logging.getLogger('breadable')
LNODE = LogHelper(LOG,
actions=(LogHelper.scoring, LogHelper.removing),
content=True
)

@ -1,109 +0,0 @@
import argparse
import codecs
import urllib2
from os import mkdir
from os import path
from breadability import VERSION
TESTPATH = path.join(
path.dirname(path.dirname(__file__)),
'tests', 'test_articles')
TESTTPL = """
import os
try:
# Python < 2.7
import unittest2 as unittest
except ImportError:
import unittest
from breadability.readable import Article
class TestArticle(unittest.TestCase):
\"\"\"Test the scoring and parsing of the Article\"\"\"
def setUp(self):
\"\"\"Load up the article for us\"\"\"
article_path = os.path.join(os.path.dirname(__file__), 'article.html')
self.article = open(article_path).read()
def tearDown(self):
\"\"\"Drop the article\"\"\"
self.article = None
def test_parses(self):
\"\"\"Verify we can parse the document.\"\"\"
doc = Article(self.article)
self.assertTrue('id="readabilityBody"' in doc.readable)
def test_content_exists(self):
\"\"\"Verify that some content exists.\"\"\"
pass
def test_content_does_not_exist(self):
\"\"\"Verify we cleaned out some content that shouldn't exist.\"\"\"
pass
"""
def parse_args():
desc = "breadability helper to generate a new set of article test files."
parser = argparse.ArgumentParser(description=desc)
parser.add_argument('--version',
action='version', version=VERSION)
parser.add_argument('-n', '--name',
action='store',
required=True,
help='Name of the test directory')
parser.add_argument('url', metavar='URL', type=str, nargs=1,
help='The url of content to fetch for the article.html')
args = parser.parse_args()
return args
def make_dir(name):
"""Generate a new directory for tests.
"""
dir_name = 'test_' + name.replace(' ', '_')
updated_name = path.join(TESTPATH, dir_name)
mkdir(updated_name)
return updated_name
def make_files(dirname):
init_file = path.join(dirname, '__init__.py')
test_file = path.join(dirname, 'test.py')
open(init_file, "a").close()
with open(test_file, 'w') as f:
f.write(TESTTPL)
def fetch_article(dirname, url):
"""Get the content of the url and make it the article.html"""
opener = urllib2.build_opener()
opener.addheaders = [('Accept-Charset', 'utf-8')]
url_response = opener.open(url)
dl_html = url_response.read().decode('utf-8')
fh = codecs.open(path.join(dirname, 'article.html'), "w", "utf-8")
fh.write(dl_html)
fh.close()
def main():
"""Run the script."""
args = parse_args()
new_dir = make_dir(args.name)
make_files(new_dir)
fetch_article(new_dir, args.url[0])
if __name__ == '__main__':
main()

@ -1,14 +0,0 @@
from os import path
TEST_DIR = path.dirname(__file__)
def load_snippet(filename):
"""Helper to fetch in the content of a test snippet"""
return open(path.join(TEST_DIR, 'test_snippets', filename)).read()
def load_article(filename):
"""Helper to fetch in the content of a test article"""
return open(path.join(TEST_DIR, 'test_articles', filename)).read()

@ -1,49 +0,0 @@
from collections import defaultdict
try:
# Python < 2.7
import unittest2 as unittest
except ImportError:
import unittest
from breadability.document import OriginalDocument
from breadability.tests import load_snippet
class TestOriginalDocument(unittest.TestCase):
"""Verify we can process html into a document to work off of."""
def test_readin_min_document(self):
"""Verify we can read in a min html document"""
doc = OriginalDocument(load_snippet('document_min.html'))
self.assertTrue(str(doc).startswith(u'<html>'))
self.assertEqual(doc.title, 'Min Document Title')
def test_readin_with_base_url(self):
"""Passing a url should update links to be absolute links"""
doc = OriginalDocument(
load_snippet('document_absolute_url.html'),
url="http://blog.mitechie.com/test.html")
self.assertTrue(str(doc).startswith(u'<html>'))
# find the links on the page and make sure each one starts with out
# base url we told it to use.
links = doc.links
self.assertEqual(len(links), 3)
# we should have two links that start with our blog url
# and one link that starts with amazon
link_counts = defaultdict(int)
for link in links:
if link.get('href').startswith('http://blog.mitechie.com'):
link_counts['blog'] += 1
else:
link_counts['other'] += 1
self.assertEqual(link_counts['blog'], 2)
self.assertEqual(link_counts['other'], 1)
def test_no_br_allowed(self):
"""We convert all <br/> tags to <p> tags"""
doc = OriginalDocument(load_snippet('document_min.html'))
self.assertIsNone(doc.html.find('.//br'))

@ -1,61 +0,0 @@
import time
#
# ? 2011 Christopher Arndt, MIT License
#
class cached_property(object):
'''Decorator for read-only properties evaluated only once within TTL
period.
It can be used to created a cached property like this::
import random
# the class containing the property must be a new-style class
class MyClass(object):
# create property whose value is cached for ten minutes
@cached_property(ttl=600) def randint(self):
# will only be evaluated every 10 min. at maximum.
return random.randint(0, 100)
The value is cached in the '_cache' attribute of the object instance that
has the property getter method wrapped by this decorator. The '_cache'
attribute value is a dictionary which has a key for every property of the
object which is wrapped by this decorator. Each entry in the cache is
created only when the property is accessed for the first time and is a
two-element tuple with the last computed property value and the last time
it was updated in seconds since the epoch.
The default time-to-live (TTL) is 300 seconds (5 minutes). Set the TTL to
zero for the cached value to never expire.
To expire a cached property value manually just do::
del instance._cache[<property name>]
'''
def __init__(self, ttl=300):
self.ttl = ttl
def __call__(self, fget, doc=None):
self.fget = fget
self.__doc__ = doc or fget.__doc__
self.__name__ = fget.__name__
self.__module__ = fget.__module__
return self
def __get__(self, inst, owner):
now = time.time()
try:
value, last_update = inst._cache[self.__name__]
if self.ttl > 0 and now - last_update > self.ttl:
raise AttributeError
except (KeyError, AttributeError):
value = self.fget(inst)
try:
cache = inst._cache
except AttributeError:
cache = inst._cache = {}
cache[self.__name__] = (value, now)
return value

@ -0,0 +1,9 @@
# -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
try:
import unittest2 as unittest
except ImportError:
import unittest

@ -0,0 +1,310 @@
<!DOCTYPE html>
<html lang="cs-CZ" prefix="og: http://ogp.me/ns#">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<title>Automatické zabezpečení | Zdroják</title>
<link rel="apple-touch-icon" href="http://www.zdrojak.cz/wp-content/themes/zdrojak/img/touch-icon-iphone.png"/>
<link rel="apple-touch-icon" sizes="72x72" href="http://www.zdrojak.cz/wp-content/themes/zdrojak/img/touch-icon-ipad.png"/>
<link rel="apple-touch-icon" sizes="114x114"
href="http://www.zdrojak.cz/wp-content/themes/zdrojak/img/touch-icon-iphone-retina.png"/>
<link rel="apple-touch-icon" sizes="144x144"
href="http://www.zdrojak.cz/wp-content/themes/zdrojak/img/touch-icon-ipad-retina.png"/>
<link rel="apple-touch-icon" sizes="512x512"
href="http://www.zdrojak.cz/wp-content/themes/zdrojak/img/apple-touch-icon-itunes.png"/>
<meta name="apple-mobile-web-app-title" content="Zdrojak.cz"/>
<link rel="profile" href="http://gmpg.org/xfn/11"/>
<link rel="pingback" href="http://www.zdrojak.cz/xmlrpc.php"/>
<!-- This site is optimized with the Yoast WordPress SEO plugin v1.4.4 - http://yoast.com/wordpress/seo/ -->
<meta name="description" content="Nespoléhejte se na to, že do kódu nezapomenete na všechna místa připsat ošetření dat. Snažte se aplikaci navrhnout tak, aby se na nic zapomenout nedalo. Za cenu o něco složitějšího jádra bude veškerý kód, který ho používá, obvykle taky mnohem jednodušší."/>
<link rel="canonical" href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/" />
<meta property='og:locale' content='cs_CZ'/>
<meta property='og:title' content='Automatické zabezpečení - Zdroják'/>
<meta property='og:description' content='Nespoléhejte se na to, že do kódu nezapomenete na všechna místa připsat ošetření dat. Snažte se aplikaci navrhnout tak, aby se na nic zapomenout nedalo. Za cenu o něco složitějšího jádra bude veškerý kód, který ho používá, obvykle taky mnohem jednodušší.'/>
<meta property='og:url' content='http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/'/>
<meta property='og:site_name' content='Zdroják'/>
<meta property='og:type' content='article'/>
<meta property='og:image' content='http://www.zdrojak.cz/wp-content/uploads/2008/11/security-122617975844093.png'/>
<!-- / Yoast WordPress SEO plugin. -->
<link rel="alternate" type="application/rss+xml" title="Zdroják &raquo; RSS zdroj" href="http://www.zdrojak.cz/feed/" />
<link rel="alternate" type="application/rss+xml" title="Zdroják &raquo; RSS komentářů" href="http://www.zdrojak.cz/comments/feed/" />
<link rel="alternate" type="application/rss+xml" title="Zdroják &raquo; RSS komentářů pro Automatické zabezpečení" href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/feed/" />
<link rel='stylesheet' id='admin-bar-css' href='http://www.zdrojak.cz/wp-includes/css/admin-bar.min.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='kindle-css-css' href='http://www.zdrojak.cz/wp-content/plugins/omSocialButtons/kindle/kindle.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='twentytwelve-style-css' href='http://www.zdrojak.cz/wp-content/themes/zdrojak/style.css?ver=5f54643' type='text/css' media='all' />
<script type='text/javascript' src='http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js?ver=3.5.1'></script>
<meta name="generator" content="WordPress 3.5.1" />
<!-- Google Plus -->
<script type="text/javascript">
window.___gcfg = {lang: 'cs'}; (function () {
var po = document.createElement('script');
po.type = 'text/javascript';
po.async = true;
po.src = 'https://apis.google.com/js/plusone.js';
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(po, s);
})();
</script>
<!-- Google Plus -->
<style type="text/css">.recentcomments a{display:inline !important;padding:0 !important;margin:0 !important;}</style>
<style type="text/css" media="print">#wpadminbar { display:none; }</style>
<style type="text/css" id="custom-background-css">
body.custom-background { background-color: #e6e6e6; }
</style>
<link rel="shortcut icon" href="/favicon.ico"/>
<link rel="author" href="/humans.txt"/>
</head>
<body class="single single-post postid-3773 single-format-standard admin-bar no-customize-support custom-background">
<div id="page" class="hfeed site">
<div class="site-top">
<div class="wrapper">
</div> <div id="main" class="wrapper">
<div id="primary" class="site-content">
<div id="content" role="main">
<article id="post-3773" class="post-3773 post type-post status-publish format-standard hentry category-ruzne tag-bezpecnost tag-ruzne full">
<header class="entry-header">
<h1 class="entry-title">Automatické zabezpečení</h1>
</header>
<div class="entry-content">
<p>Úroveň zabezpečení aplikace bych rozdělil do tří úrovní:</p>
<ol>
<li>Aplikace zabezpečená není, neošetřuje uživatelské vstupy ani své výstupy.</li>
<li>Aplikace se o zabezpečení snaží, ale takovým způsobem, že na ně lze zapomenout.</li>
<li>Aplikace se o zabezpečení stará sama, prakticky se nedá udělat chyba.</li>
</ol>
<p>Jak se tyto úrovně projevují v jednotlivých oblastech?</p>
<h2><a href="http://php.vrana.cz/cross-site-scripting.php">XSS</a></h2>
<p>Druhou úroveň představuje ruční ošetřování pomocí <a href="http://www.php.net/manual/en/function.htmlspecialchars.php"><kbd>htmlspecialchars</kbd></a>. Třetí úroveň zdánlivě reprezentuje automatické ošetřování v šablonách, např. v <a href="http://doc.nette.org/cs/templating"><strong>Nette Latte</strong></a>. Proč píšu zdánlivě? Problém je v tom, že ošetření se dá obvykle snadno zakázat, např. v Latte pomocí <code>{!$var}</code>. Viděl jsem šablony plné vykřičníků i na místech, kde být neměly. Autor to vysvětlil tak, že psaní <code>{$var}</code> někde způsobovalo problémy, které po přidání vykřičníku zmizely, tak je začal psát všude.</p>
<pre class="brush: php">&lt;?php
$safeHtml = $texy-&gt;process($content_texy);
$content = Html::el()-&gt;setHtml($safeHtml);
// v šabloně pak můžeme použít {$content}
?&gt;</pre>
<p>Ideální by bylo, když by už samotná metoda <code>process()</code> vracela instanci <code>Html</code>.</p>
<div class="social-buttons"><div class="wrapper"> <!-- Facebook -->
<div id="fb-root"></div>
<script>(function(d, s, id) {
var js, fjs = d.getElementsByTagName(s)[0];
if (d.getElementById(id)) return;
js = d.createElement(s); js.id = id;
js.src = "//connect.facebook.net/en_US/all.js#xfbml=1&appId=556615607691233";
fjs.parentNode.insertBefore(js, fjs);
}(document, 'script', 'facebook-jssdk'));</script>
<!-- Facebook -->
<div class="facebook">
<div class="fb-like"
data-href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/" data-width="450" data-layout="button_count"></div>
</div><div class="twitter">
<a href="https://twitter.com/share" class="twitter-share-button" data-via="zdrojak" data-url="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/" data-lang="en" data-text="Automatické zabezpečení"></a>
<script>!function (d, s, id) {
var js, fjs = d.getElementsByTagName(s)[0];
if (!d.getElementById(id)) {
js = d.createElement(s);
js.id = id;
js.src = "//platform.twitter.com/widgets.js";
fjs.parentNode.insertBefore(js, fjs);
}
}(document, "script", "twitter-wjs");</script>
</div><div class="google">
<div class="g-plusone"
data-href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/" data-width="300" data-size="medium"></div>
</div><div class="kindle">
<div class="kindleWidget kindleLight">
<img src="https://d1xnn692s7u6t6.cloudfront.net/black-15.png"/><span>Kindle</span>
</div>
</div>
</div></div> <a name="bottom"></a>
</div>
</article>
</div>
<tr id="tr-comment-23965" class="comment byuser comment-author-okbob even thread-odd thread-alt depth-1 recent new">
<td class="author"><img alt='' src='/wp-content/uploads/avatars/2005/01/okbob.png' class='avatar avatar-16 photo' height='16' width='16' />okbob</td>
<th class="comment-title"><a href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/?show=comments#comment-23965">trochu jiný přístup</a></th>
<td class="comments-date">
<time datetime="2013-01-30T09:29:35+00:00">30.1.2013 v 07:29</time>
</td>
</tr><tr id="tr-comment-23966" class="comment odd alt depth-2 recent new">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />Aleš Roubíček</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/?show=comments#comment-23966">Re: trochu jiný přístup</a></th>
<td class="comments-date">
<time datetime="2013-01-30T09:37:35+00:00">30.1.2013 v 07:37</time>
</td>
</tr><tr id="tr-comment-23967" class="comment even depth-3 recent new">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />Futrál</td>
<th class="comment-title"><span class="line"></span><span class="line"></span><a href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/?show=comments#comment-23967">Re: trochu jiný přístup</a></th>
<td class="comments-date">
<time datetime="2013-01-30T11:23:28+00:00">30.1.2013 v 09:23</time>
</td>
</tr></div>
</div>
<tr id="tr-comment-23968" class="comment odd alt depth-2 recent new">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />Futrál</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/?show=comments#comment-23968">Re: trochu jiný přístup</a></th>
<td class="comments-date">
<time datetime="2013-01-30T11:24:59+00:00">30.1.2013 v 09:24</time>
</td>
</tr></div>
</div>
<tr id="tr-comment-23969" class="comment even thread-even depth-1 recent new">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />Monty</td>
<th class="comment-title"><a href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/?show=comments#comment-23969">Jaké ošetření sloupce?</a></th>
<td class="comments-date">
<time datetime="2013-01-30T14:30:19+00:00">30.1.2013 ve 12:30</time>
</td>
</tr><tr id="tr-comment-23971" class="comment byuser comment-author-jakub-vrana bypostauthor odd alt depth-2 recent new">
<td class="author"><img alt='' src='/wp-content/uploads/avatars/2004/12/jakubvrana.jpg' class='avatar avatar-16 photo' height='16' width='16' />Jakub Vrána</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/?show=comments#comment-23971">Re: Jaké ošetření sloupce?</a></th>
<td class="comments-date">
<time datetime="2013-01-30T20:14:25+00:00">30.1.2013 ve 18:14</time>
</td>
</tr></div>
</div>
<tr id="tr-comment-23973" class="comment even thread-odd thread-alt depth-1 recent new">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />bene</td>
<th class="comment-title"><a href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/?show=comments#comment-23973">Re: Automatické zabezpečení</a></th>
<td class="comments-date">
<time datetime="2013-01-31T08:58:47+00:00">31.1.2013 v 06:58</time>
</td>
</tr></div>
<tr id="tr-comment-23974" class="comment odd alt thread-even depth-1 recent new">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />5o</td>
<th class="comment-title"><a href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/?show=comments#comment-23974">ACL assertion</a></th>
<td class="comments-date">
<time datetime="2013-01-31T12:04:27+00:00">31.1.2013 v 10:04</time>
</td>
</tr></div>
</table>
</div>
<div class="comments-actions">
<a class="btn add-comment" href="?show=comments#respond">Přidat komentář</a>
<a class="btn btn-primary show-comments" href="?show=comments#comments">Zobrazit komentáře</a>
</div>
</div> <nav class="nav-single no-print">
<h3 class="assistive-text">Navigace pro příspěvky</h3>
<span class="nav-previous">
<a href="http://www.zdrojak.cz/clanky/tvorba-moderniho-eshopu-kategorie-a-parametricke-hledani/" rel="next"><i class="icon-left"></i>Tvorba moderního eshopu: kategorie a parametrické hledání</a> </span>
<span class="nav-next">
<a href="http://www.zdrojak.cz/clanky/bezpecny-sandboxovany-iframe/" rel="prev">Bezpečný sandboxovaný iframe<i class="icon-right"></i></a> </span>
</nav>
<div class="visible-print">Zdroj: http://www.zdrojak.cz/?p=3773</div>
</div>
</div>
<div id="secondary" class="widget-area" role="complementary">
<aside id="recent_category_posts-0" class="with-thumb widget widget_recent_entries"> <h3 class="widget-title">Mobilní vývoj</h3> <div class="thumbnail">
<img width="180" height="180" src="http://www.zdrojak.cz/wp-content/uploads/2013/02/develcz.jpg" class="attachment- wp-post-image" alt="Mobilní vývoj" title="Mobilní vývoj" /> </div>
<ul>
<li><a href="http://www.zdrojak.cz/clanky/reportaz-z-devel-cz-konference-2013/" title="Reportáž z Devel.cz konference 2013">Reportáž z Devel.cz konference 2013</a></li>
<li><a href="http://www.zdrojak.cz/clanky/webapi-firefox-os/" title="Prohlédněte si možnosti WebAPI ve Firefox OS">Prohlédněte si možnosti WebAPI ve Firefox OS</a></li>
<li><a href="http://www.zdrojak.cz/clanky/nova-vyvojarska-konzole-v-google-play/" title="Nová Vývojářská konzole v Google Play">Nová Vývojářská konzole v Google Play</a></li>
</ul>
</aside> <aside id="recent-posts-0" class="widget widget_recent_entries"> <h3 class="widget-title">Nejnovější příspěvky</h3> <ul>
<li>
<a href="http://www.zdrojak.cz/clanky/mloc-js-staticke-typovani/" title="Typické! O maďarské konferenci mloc.js a statickém typování při vývoji webových aplikací">Typické! O maďarské konferenci mloc.js a statickém typování při vývoji webových aplikací</a>
</li>
<li>
<a href="http://www.zdrojak.cz/clanky/poslete-zdrojak-do-kindlu-novinky/" title="Pošlete Zdroják do Kindlu a další jarní novinky">Pošlete Zdroják do Kindlu a další jarní novinky</a>
</li>
<li>
<a href="http://www.zdrojak.cz/clanky/formulare-v-html5-a-nove-atributy/" title="Formuláře v HTML5 a nové atributy">Formuláře v HTML5 a nové atributy</a>
</li>
<li>
<a href="http://www.zdrojak.cz/clanky/modularni-webove-aplikace-s-minimem-rucni-prace/" title="Modulární webové aplikace s minimem ruční práce">Modulární webové aplikace s minimem ruční práce</a>
</li>
<li>
<a href="http://www.zdrojak.cz/clanky/zapier-propojovani-api/" title="Zapier &#8211; dejte propojování API ten správný šmrnc">Zapier &#8211; dejte propojování API ten správný šmrnc</a>
</li>
</ul>
</aside><aside id="related_posts-0" class="widget widget_related_posts"><h3 class="widget-title">Související články</h3><ul><li><a href="http://www.zdrojak.cz/clanky/do-hlubin-implementaci-javascriptu-1-dil-uvod/" rel="bookmark">Do hlubin implementací JavaScriptu: 1. díl &#8211; úvod</a></li><li><a href="http://www.zdrojak.cz/clanky/uvodni-analyza-pro-moderni-e-shop/" rel="bookmark">Úvodní analýza pro moderní e-shop</a></li><li><a href="http://www.zdrojak.cz/clanky/django-databazovy-model/" rel="bookmark">Django: Databázový model</a></li><li><a href="http://www.zdrojak.cz/clanky/dojo-toolkit-pokrocile-techniky/" rel="bookmark">Dojo Toolkit: pokročilé techniky</a></li><li><a href="http://www.zdrojak.cz/clanky/uvod-do-architektury-mvc/" rel="bookmark">Úvod do architektury MVC</a></li></ul></aside><aside id="recent-comments-0" class="widget widget_recent_comments"><h3 class="widget-title">Nejnovější komentáře</h3><ul id="recentcomments"><li class="recentcomments">Martin Kozák u <a href="http://www.zdrojak.cz/clanky/mloc-js-staticke-typovani/?show=comments#comment-24239">Typické! O maďarské konferenci mloc.js a statickém typování při vývoji webových aplikací</a></li><li class="recentcomments">maras u <a href="http://www.zdrojak.cz/clanky/mloc-js-staticke-typovani/?show=comments#comment-24238">Typické! O maďarské konferenci mloc.js a statickém typování při vývoji webových aplikací</a></li><li class="recentcomments">Ladislav Thon u <a href="http://www.zdrojak.cz/clanky/mloc-js-staticke-typovani/?show=comments#comment-24237">Typické! O maďarské konferenci mloc.js a statickém typování při vývoji webových aplikací</a></li><li class="recentcomments">satai u <a href="http://www.zdrojak.cz/clanky/mloc-js-staticke-typovani/?show=comments#comment-24236">Typické! O maďarské konferenci mloc.js a statickém typování při vývoji webových aplikací</a></li><li class="recentcomments">Ladislav Thon u <a href="http://www.zdrojak.cz/clanky/mloc-js-staticke-typovani/?show=comments#comment-24235">Typické! O maďarské konferenci mloc.js a statickém typování při vývoji webových aplikací</a></li></ul></aside> </div><!-- #secondary -->
</div>
<div id="bottom" class="site-bottom no-print">
<div class="wrapper">
<aside class="follow">
<h3>Sledujte</h3>
<div class="content">
<ul class="menu">
<li>
<a href="/feed" class="rss">
<i class="icon-rss"></i>
<span class="assistive-text">RSS</span>
</a>
</li>
<li>
<a href="/mail" class="mail" title="Zpravodaj">
<i class="icon-mail"></i>
<span class="assistive-text">Zpravodaj</span>
</a>
</li>
<li>
<a target="_blank" href="https://twitter.com/zdrojak" title="@zdrojak" class="twitter">
<i class="icon-twitter"></i>
<span class="assistive-text">Twitter @zdrojak</span>
</a>
</li>
<li>
<a target="_blank" href="https://plus.google.com/101725826130888424314/posts" class="googleplus">
<i class="icon-gplus"></i>
<span class="assistive-text">Google + Zdroják</span>
</a>
</li>
</ul>
</div>
</aside> </div>
</div>
</div>
<script type='text/javascript' src='http://www.zdrojak.cz/wp-includes/js/admin-bar.min.js?ver=3.5.1'></script>
<script type='text/javascript'>
/* <![CDATA[ */
(function k(){window.$SendToKindle&&window.$SendToKindle.Widget?$SendToKindle.Widget.init({"title":".entry-title","published":".entry-date","content":".post","exclude":".no-kindle"}):setTimeout(k,500);})();
/* ]]> */
</script>
<script type='text/javascript' src='http://d1xnn692s7u6t6.cloudfront.net/widget.js'></script>
<script type='text/javascript' src='http://www.zdrojak.cz/wp-content/themes/zdrojak/js/zdrojak.min.js?ver=5f54643'></script>
<div id="webstats">
<script src="http://c1.navrcholu.cz/code?site=72;t=t1x1" type="text/javascript"></script>
<noscript>
<div>
<img src="http://c1.navrcholu.cz/hit?site=72;t=t1x1;ref=;jss=0" width="1" height="1" alt=""/>
</div>
</noscript>
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-30960355-1']);
_gaq.push(['_trackPageview']);
(function () {
var ga = document.createElement('script');
ga.type = 'text/javascript';
ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(ga, s);
})();
</script>
</div>
<!-- d0e88e1c2777b950889c170b9b92adf8 -->
</body>
</html>

@ -0,0 +1,21 @@
<html>
<head>
<meta http-equiv="charset" content="utf-8"/>
<title>This is title of document</title>
</head>
<body>
<div>Inline text is not so good, but it's here.</div>
<div class="article">
<div class="wrapper">
<p>
Paragraph is more <em>better</em>.
This text is very <strong>pretty</strong> 'cause she's girl.
</p>
<p>
This is not <big>crap</big> so <dfn title="Make me readable">readability</dfn> me :)
</p>
</div>
</div>
<div>And some next not so <b>good</b> text.</div>
</body>
</html>

@ -0,0 +1,18 @@
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>Paragraphs</title>
</head>
<body>
<div>
<h1>Nadpis H1, ktorý chce byť prvý s textom ale predbehol ho "title"</h1>
<p>
Toto je prvý odstavec a to je fajn.
</p>
<p>
Tento text je tu aby vyplnil prázdne miesto v srdci súboru.
Aj súbory majú predsa city.
</p>
</div>
</body>
</html>

@ -0,0 +1,35 @@
# -*- coding: utf8 -*-
from __future__ import print_function
import sys
import atexit
import nose
from os.path import dirname, abspath
DEFAULT_PARAMS = [
"nosetests",
"--with-coverage",
"--cover-package=readability",
"--cover-erase",
]
@atexit.register
def exit_function(msg="Shutting down"):
print(msg, file=sys.stderr)
def run(argv=[]):
sys.exitfunc = exit_function
nose.run(
argv=DEFAULT_PARAMS + argv,
defaultTest=abspath(dirname(__file__)),
)
if __name__ == "__main__":
run(sys.argv[1:])

@ -0,0 +1,169 @@
# -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from lxml.html import fragment_fromstring, document_fromstring
from readability.readable import Article
from readability.annotated_text import AnnotatedTextHandler
from .compat import unittest
from .utils import load_snippet, load_article
class TestAnnotatedText(unittest.TestCase):
def test_simple_document(self):
dom = fragment_fromstring("<p>This is\n\tsimple\ttext.</p>")
annotated_text = AnnotatedTextHandler.parse(dom)
expected = [
(
("This is\nsimple text.", None),
),
]
self.assertEqual(annotated_text, expected)
def test_empty_paragraph(self):
dom = fragment_fromstring("<div><p>Paragraph <p>\t \n</div>")
annotated_text = AnnotatedTextHandler.parse(dom)
expected = [
(
("Paragraph", None),
),
]
self.assertEqual(annotated_text, expected)
def test_multiple_paragraphs(self):
dom = fragment_fromstring("<div><p> 1 first<p> 2\tsecond <p>3\rthird </div>")
annotated_text = AnnotatedTextHandler.parse(dom)
expected = [
(
("1 first", None),
),
(
("2 second", None),
),
(
("3\nthird", None),
),
]
self.assertEqual(annotated_text, expected)
def test_single_annotation(self):
dom = fragment_fromstring("<div><p> text <em>emphasis</em> <p> last</div>")
annotated_text = AnnotatedTextHandler.parse(dom)
expected = [
(
("text", None),
("emphasis", ("em",)),
),
(
("last", None),
),
]
self.assertEqual(annotated_text, expected)
def test_recursive_annotation(self):
dom = fragment_fromstring("<div><p> text <em><i><em>emphasis</em></i></em> <p> last</div>")
annotated_text = AnnotatedTextHandler.parse(dom)
expected = [
(
("text", None),
("emphasis", ("em", "i")),
),
(
("last", None),
),
]
self.assertEqual(annotated_text, expected)
def test_annotations_without_explicit_paragraph(self):
dom = fragment_fromstring("<div>text <strong>emphasis</strong>\t<b>hmm</b> </div>")
annotated_text = AnnotatedTextHandler.parse(dom)
expected = [
(
("text", None),
("emphasis", ("strong",)),
("hmm", ("b",)),
),
]
self.assertEqual(annotated_text, expected)
def test_process_paragraph_with_chunked_text(self):
handler = AnnotatedTextHandler()
paragraph = handler._process_paragraph([
(" 1", ("b", "del")),
(" 2", ("b", "del")),
(" 3", None),
(" 4", None),
(" 5", None),
(" 6", ("em",)),
])
expected = (
("1 2", ("b", "del")),
("3 4 5", None),
("6", ("em",)),
)
self.assertEqual(paragraph, expected)
def test_include_heading(self):
dom = document_fromstring(load_snippet("h1_and_2_paragraphs.html"))
annotated_text = AnnotatedTextHandler.parse(dom.find("body"))
expected = [
(
('Nadpis H1, ktorý chce byť prvý s textom ale predbehol ho "title"', ("h1",)),
("Toto je prvý odstavec a to je fajn.", None),
),
(
("Tento text je tu aby vyplnil prázdne miesto v srdci súboru.\nAj súbory majú predsa city.", None),
),
]
self.assertSequenceEqual(annotated_text, expected)
def test_real_article(self):
article = Article(load_article("zdrojak_automaticke_zabezpeceni.html"))
annotated_text = article.main_text
expected = [
(
("Automatické zabezpečení", ("h1",)),
("Úroveň zabezpečení aplikace bych rozdělil do tří úrovní:", None),
),
(
("Aplikace zabezpečená není, neošetřuje uživatelské vstupy ani své výstupy.", ("li", "ol")),
("Aplikace se o zabezpečení snaží, ale takovým způsobem, že na ně lze zapomenout.", ("li", "ol")),
("Aplikace se o zabezpečení stará sama, prakticky se nedá udělat chyba.", ("li", "ol")),
),
(
("Jak se tyto úrovně projevují v jednotlivých oblastech?", None),
),
(
("XSS", ("a", "h2")),
("Druhou úroveň představuje ruční ošetřování pomocí", None),
("htmlspecialchars", ("a", "kbd")),
(". Třetí úroveň zdánlivě reprezentuje automatické ošetřování v šablonách, např. v", None),
("Nette Latte", ("a", "strong")),
(". Proč píšu zdánlivě? Problém je v tom, že ošetření se dá obvykle snadno zakázat, např. v Latte pomocí", None),
("{!$var}", ("code",)),
(". Viděl jsem šablony plné vykřičníků i na místech, kde být neměly. Autor to vysvětlil tak, že psaní", None),
("{$var}", ("code",)),
("někde způsobovalo problémy, které po přidání vykřičníku zmizely, tak je začal psát všude.", None),
),
(
("<?php\n$safeHtml = $texy->process($content_texy);\n$content = Html::el()->setHtml($safeHtml);\n// v šabloně pak můžeme použít {$content}\n?>", ("pre", )),
),
(
("Ideální by bylo, když by už samotná metoda", None),
("process()", ("code",)),
("vracela instanci", None),
("Html", ("code",)),
(".", None),
),
]
self.assertSequenceEqual(annotated_text, expected)

@ -1,11 +1,12 @@
# -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
import os
try:
# Python < 2.7
import unittest2 as unittest
except ImportError:
import unittest
from breadability.readable import Article
from readability.readable import Article
from ...compat import unittest
class TestAntipopeBlog(unittest.TestCase):

@ -0,0 +1,658 @@
<!DOCTYPE html>
<!--[if IE 7 | IE 8]>
<html class="ie" lang="cs-CZ" prefix="og: http://ogp.me/ns#"><![endif]-->
<!--[if !(IE 7) | !(IE 8) ]><!--><html lang="cs-CZ" prefix="og: http://ogp.me/ns#"><!--<![endif]-->
<head>
<meta charset="UTF-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<link rel="apple-touch-icon" href="http://www.zdrojak.cz/wp-content/themes/zdrojak/images/touch-icon-iphone.png" />
<link rel="apple-touch-icon" sizes="72x72" href="http://www.zdrojak.cz/wp-content/themes/zdrojak/images/touch-icon-ipad.png" />
<link rel="apple-touch-icon" sizes="114x114" href="http://www.zdrojak.cz/wp-content/themes/zdrojak/images/touch-icon-iphone-retina.png" />
<link rel="apple-touch-icon" sizes="144x144" href="http://www.zdrojak.cz/wp-content/themes/zdrojak/images/touch-icon-ipad-retina.png" />
<link rel="apple-touch-icon" sizes="512x512" href="http://www.zdrojak.cz/wp-content/themes/zdrojak/images/apple-touch-icon-itunes.png" />
<meta name="apple-mobile-web-app-title" content="Zdrojak.cz" />
<title>Ještě k testování | Zdroják</title>
<link rel="profile" href="http://gmpg.org/xfn/11"/>
<link rel="pingback" href="http://www.zdrojak.cz/xmlrpc.php"/>
<!--[if lt IE 9]>
<script src="http://www.zdrojak.cz/wp-content/themes/twentytwelve/js/html5.js" type="text/javascript"></script>
<![endif]-->
<!-- This site is optimized with the Yoast WordPress SEO plugin v1.4.1 - http://yoast.com/wordpress/seo/ -->
<meta name="description" content="SEO, MVC, návrhové vzory, knihovny a AJAX už všichni umí, nebo jsou o tom alespoň přesvědčeni. O použitelnosti má ponětí stále víc vývojářů. Kdekdo se zaklíná „čistým kódem“… Jen jedna věc vzbuzuje zatím stále silný odpor testování! Racionálně vzato to nedává smysl, takže příčina bude někde jinde…"/>
<link rel="canonical" href="http://www.zdrojak.cz/clanky/jeste-k-testovani/" />
<meta property='og:locale' content='cs_CZ'/>
<meta property='og:title' content='Ještě k testování - Zdroják'/>
<meta property='og:description' content='SEO, MVC, návrhové vzory, knihovny a AJAX už všichni umí, nebo jsou o tom alespoň přesvědčeni. O použitelnosti má ponětí stále víc vývojářů. Kdekdo se zaklíná „čistým kódem“… Jen jedna věc vzbuzuje zatím stále silný odpor testování! Racionálně vzato to nedává smysl, takže příčina bude někde jinde…'/>
<meta property='og:url' content='http://www.zdrojak.cz/clanky/jeste-k-testovani/'/>
<meta property='og:site_name' content='Zdroják'/>
<meta property='og:type' content='article'/>
<meta property='og:image' content='http://www.zdrojak.cz/wp-content/uploads/2011/01/know-how-1.png'/>
<!-- / Yoast WordPress SEO plugin. -->
<link rel="alternate" type="application/rss+xml" title="Zdroják &raquo; RSS zdroj" href="http://www.zdrojak.cz/feed/" />
<link rel="alternate" type="application/rss+xml" title="Zdroják &raquo; RSS komentářů" href="http://www.zdrojak.cz/comments/feed/" />
<link rel="alternate" type="application/rss+xml" title="Zdroják &raquo; RSS komentářů pro Ještě k testování" href="http://www.zdrojak.cz/clanky/jeste-k-testovani/feed/" />
<link rel='stylesheet' id='admin-bar-css' href='http://www.zdrojak.cz/wp-includes/css/admin-bar.min.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='boxes-css' href='http://www.zdrojak.cz/wp-content/plugins/wordpress-seo/css/adminbar.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='twentytwelve-style-css' href='http://www.zdrojak.cz/wp-content/themes/zdrojak/style.css?ver=cdf3a74' type='text/css' media='all' />
<!--[if lt IE 9]>
<link rel='stylesheet' id='twentytwelve-ie-css' href='http://www.zdrojak.cz/wp-content/themes/twentytwelve/css/ie.css?ver=cdf3a74' type='text/css' media='all' />
<![endif]-->
<link rel='stylesheet' id='wp-slimbox-css' href='http://www.zdrojak.cz/wp-content/plugins/omSlimBox/slimbox/css/slimbox2.css?ver=2' type='text/css' media='all' />
<script type='text/javascript' src='http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.zdrojak.cz/wp-content/themes/zdrojak/js/zdrojak.min.js?ver=cdf3a74'></script>
<script type='text/javascript' src='http://www.zdrojak.cz/wp-includes/js/comment-reply.min.js?ver=3.5.1'></script>
<meta name="generator" content="WordPress 3.5.1" />
<style type="text/css">.recentcomments a{display:inline !important;padding:0 !important;margin:0 !important;}</style>
<style type="text/css" media="print">#wpadminbar { display:none; }</style>
<style type="text/css" media="screen">
html { margin-top: 28px !important; }
* html body { margin-top: 28px !important; }
</style>
<style type="text/css" id="custom-background-css">
body.custom-background { background-color: #e6e6e6; }
</style>
<link rel="shortcut icon" href="/favicon.ico"/>
<link rel="author" href="/humans.txt"/>
</head>
<body class="single single-post postid-3450 single-format-standard admin-bar no-customize-support custom-background">
<div id="page" class="hfeed site">
<header id="masthead" class="site-header no-print" role="banner">
<div class="wrapper">
<h1 class="site-title visib">
<a href="http://www.zdrojak.cz/"
title="Zdroják"
rel="home">Zdroják</a>
</h1>
<h2 class="site-description hidden">o tvorbě webových stránek a aplikací</h2>
<nav id="site-navigation" class="main-navigation" role="navigation">
<div class="skip-link assistive-text hidden">
<a href="#content" title="Přejít k obsahu webu">
Přejít k obsahu webu </a>
</div>
<form method="get" id="searchform" action="http://www.zdrojak.cz/">
<fieldset>
<legend>Hledat</legend>
<label for="s">Hledat</label>
<input type="search" class="field" name="s" id="s" autocomplete="off"/>
<input type="submit" class="submit" name="submit" id="searchsubmit"
value="Hledat"/>
</fieldset>
</form>
<h3 class="menu-toggle">Menu</h3>
<div class="menu-top-menu-container"><ul id="menu-top-menu" class="nav-menu"><li id="menu-item-7124" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-7124"><a href="/clanky/">Rubriky</a>
<ul class="sub-menu">
<li id="menu-item-7125" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-7125"><a href="http://www.zdrojak.cz/databaze/">Databáze</a></li>
<li id="menu-item-7126" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-7126"><a href="http://www.zdrojak.cz/javascript/">JavaScript</a></li>
<li id="menu-item-7127" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-7127"><a href="http://www.zdrojak.cz/mobilni-vyvoj/">Mobilní vývoj</a></li>
<li id="menu-item-7128" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-7128"><a href="http://www.zdrojak.cz/php/">PHP</a></li>
<li id="menu-item-7129" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-7129"><a href="http://www.zdrojak.cz/rozhovory/">Rozhovory</a></li>
<li id="menu-item-7130" class="menu-item menu-item-type-taxonomy menu-item-object-category current-post-ancestor current-menu-parent current-post-parent menu-item-7130"><a href="http://www.zdrojak.cz/ruzne/">Různé</a></li>
<li id="menu-item-7131" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-7131"><a href="http://www.zdrojak.cz/webdesign/">Webdesign</a></li>
</ul>
</li>
<li id="menu-item-7132" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-7132"><a href="http://www.zdrojak.cz/n/">Nálepky</a></li>
<li id="menu-item-7133" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-7133"><a href="/serialy/">Seriály</a></li>
<li id="menu-item-7134" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-7134"><a target="_blank" href="http://www.root.cz/skoleni/">Školení</a></li>
</ul></div> </nav>
</div>
</header>
<div class="site-top">
<div class="wrapper">
<p id="breadcrumbs"><span xmlns:v="http://rdf.data-vocabulary.org/#"><span typeof="v:Breadcrumb"><a href="http://www.zdrojak.cz" rel="v:url" property="v:title">Zdroják</a></span> &raquo; <span typeof="v:Breadcrumb"><a href="http://www.zdrojak.cz/clanky/" rel="v:url" property="v:title">Články</a></span> &raquo; <span typeof="v:Breadcrumb"><a href="http://www.zdrojak.cz/ruzne/" rel="v:url" property="v:title">Různé</a></span> &raquo; <span typeof="v:Breadcrumb"><strong class="breadcrumb_last" property="v:title">Ještě k testování</strong></span></span></p> </div>
</div> <div id="main" class="wrapper">
<div id="primary" class="site-content">
<div id="content" role="main">
<article id="post-3450" class="post-3450 post type-post status-publish format-standard hentry category-ruzne tag-programovaci-jazyky tag-ruzne tag-subjektivne tag-testovani full">
<header class="entry-header">
<h1 class="entry-title">Ještě k testování</h1>
<div class="entry-meta"><a class="post" href="http://www.zdrojak.cz/clanky/">Články</a> - <span class="author vcard"><a class="url fn n" href="http://www.zdrojak.cz/autori/martin-maly/" title="Zobrazit všechny příspěvky, jejichž autorem je Martin Malý" rel="author">Martin Malý</a></span> - <a href="http://www.zdrojak.cz/ruzne/" title="Zobrazit všechny příspěvky z rubriky Různé" >Různé</a> - <time class="entry-date" datetime="2011-03-14T00:00:00+00:00">14.3.2011</time> </div>
</header>
<div class="entry-summary with-thumbnail">
<div class='thumbnail'><img width="80" height="80" src="http://www.zdrojak.cz/wp-content/uploads/2011/01/know-how-1.png" class="attachment-post-thumbnail wp-post-image" alt="Know how" /></div>
<div class="excerpt">
<p>SEO, MVC, návrhové vzory, knihovny a AJAX už všichni umí, nebo jsou o tom alespoň přesvědčeni. O použitelnosti má ponětí stále víc vývojářů. Kdekdo se zaklíná „čistým kódem“… Jen jedna věc vzbuzuje zatím stále silný odpor  testování! Racionálně vzato to nedává smysl, takže příčina bude někde jinde…</p>
</div>
</div>
<div class="entry-content">
<div class="entry-terms tags">
<div class="entry-tags">
<h4>Nálepky:</h4><ul><li><a href="http://www.zdrojak.cz/n/programovaci-jazyky/" rel="tag">Programovací jazyky</a></li><li><a href="http://www.zdrojak.cz/n/ruzne/" rel="tag">Různé</a></li><li><a href="http://www.zdrojak.cz/n/subjektivne/" rel="tag">Subjektivně</a></li><li><a href="http://www.zdrojak.cz/n/testovani/" rel="tag">Testování</a></li></ul> </div>
</div>
<p>S automatizovaným testováním kódu (a ve zbytku článku budu mít na mysli právě to) jsem se setkal v několika firmách. Nikde ho nedělali. Když jsem se ptal proč, dozvěděl jsem se vždy nějakou variaci na <em>starý příběh o testování</em>.</p>
<div class="rs-box">
<h2>Starý příběh o testování</h2>
<p>U nás ve firmě jsme vždycky psali kód tak, jak jsme uměli nejlíp. Postupně jsme se naučili pracovat s CVS a s knihovnami kódu a když byl čas, a že skoro nikdy nebyl, tak jsme zkoušeli i novinky. Mladý zapálený programátor nám jednou říkal, co se dozvěděl o Agile, jako že tam dělají ty scrumy a iterace a že programujou dva najednou, no to jsme se zasmáli, to jsou nesmysly, ale něco z toho jsme si vzali  zavedli jsme podle toho scrumu každodenní ranní porady.</p>
<p>No a tenhle vendelín jednou taky přišel s tím testováním. Já programuju patnáct let, takže nějaké zkušenosti mám. Od začátku mi bylo jasný, že to je spousta práce navíc, kterou nám nikdo nezaplatí. Kluci budou hodinu psát třídu a dvě hodiny test  jako k čemu to je? No, ale všichni to chválej, tak na tom asi něco bude, tak jsme to v létě, když bylo volnějc, zkusili. U jednoho takovýho projektu, co jsme dělali, jsme začali psát ke každý třídě testovací skripty.</p>
<p>Byl to šílenej vopich, kluci nadávali, že mají dvakrát tolik práce, že by za tu dobu byli už hotoví s celým projektem, a že je to jen zdržuje. Pár chyb to našlo, to sice jo, ale žádná sláva, na tu spoustu práce, co jsme s tím měli… Navíc to třeba vůbec nenašlo jasný chyby, co jsi v tom kódu viděl jen kdyžs ho přečetl! A nejhorší bylo, že u malých tříd to bylo OK, ale když jsme to dali dohromady a pustili proti databázi, tak se třeba ukázalo, že to vůbec nefunguje  a přitom ty unit testy byly všechny OK, OK, OK… Takovýhle testování je naprd. Navíc pak přišly nějaký změny a ty testy bysme museli stejně přepisovat, a to by se nám už vůbec nevyplatilo, udržovat dvojí kód, takže u nás jednoznačně #fail.</p>
<p>Vono teoreticky to zní hezky a pro takový ty malý třídy, kde se něco počítá, je to možná dobrý, ale v praxi to je k ničemu… Jediný testování, který má smysl, je to, že si každý zkusí, jestli to, co napsal, taky funguje. Takhle to dělají programátoři odjakživa a šlape to.</p>
<div class="cb"></div>
</div>
<p>Inu, v praxi je k ničemu každý pracovní postup, který aplikujete mechanicky, bez pochopení jeho podstaty (taková <em>kargokultická metodika</em>). Vzít si z agile jen „ranní porady“ je nejjistější způsob, jak zjistit, že „to nefunguje“.</p>
<p>Ruku na srdce  kolikrát se vám stalo, že jste o něčem prohlásili, že to je „naprosto na houby“, až vám jednoho dne někdo ukázal, jak to používat, a vy jste museli uznat, že nástroj je výborný a „na houby“ bylo hlavně to, že jste s ním neuměli nebo nechápali, k čemu je? Mně mockrát.</p>
<p>V pozadí mnohých sporů a odmítání je leckdy nepochopení. Dovolte mi, abych byl tedy chvíli „advokátem pro testování“; mým cílem není přesvědčit vás v článku o tom, že byste měli testovat a že se vám to vyplatí, ale zkusit vyviklat některé protiargumenty, v jejichž základu je právě nepochopení. Čímž neříkám, že můj pohled na testování je jediný správný (to ale nedělají ani advokáti; místo toho to nazývají „právní názor“).</p>
<h2>Zvolený obor</h2>
<p>Testování je velmi široká oblast a mnoha lidem splývá, proto než se pustím do obhajoby, musím nejprve vymezit oblast, které se bude obhajoba týkat. Rád bych se věnoval jednotkovým (unit) testům. Jsou pravděpodobně nejznámější, nejčastěji vyjmenovávané, ale na druhou stranu hodně specifické.</p>
<p>Jednotkové testy jsou automatizované postupy pro otestování jednotky kódu (třída, knihovna, unit, skript, … – tedy <em>něco, co lze samostatně testovat</em>). Jejich cílem je strojově otestovat, zda daná jednotka dělá to, co dělat má, a zda nedělá něco, co dělat nemá. Je jasné, že automaticky můžeme otestovat pouze to, co se automaticky otestovat dá, resp. co lze automaticky otestovat snadno.</p>
<div class="rs-tip-major">
<p>Automatizované testování nenahrazuje ruční; doplňuje ho.</p>
<div class="cb"></div>
</div>
<p>U jednotek testujeme, zda:</p>
<ol>
<li>vrací správné hodnoty na správné vstupní údaje</li>
<li>vrací správné hodnoty na mezní vstupní údaje</li>
<li>legitimně zhavaruje tehdy, když zhavarovat má</li>
</ol>
<p>Co to znamená? U jednoduchých funkcí zadáváme vstupní hodnoty a kontrolujeme výstupní, resp. chování funkce. U složitějších testujeme to, co testovat lze. Kupříkladu u třídy, která bude generovat CAPTCHA obrázek, nebudeme psát OCR, který bude vyhodnocovat, zda výsledek opravdu obsahuje požadované znaky, to je extremistický nesmysl. Otestujeme, zda při zadání dobrých vstupních údajů vygeneruje třída obrázek, jestli ten obrázek má patřičné rozměry a patřičný formát. To je snadné otestovat. To, jestli obrázek obsahuje opravdu daný text daným fontem, už nebudeme řešit unit testem; ověříme to metodou „kouknu a vidím“.</p>
<div class="rs-tip-major">
<p>Ne každé testování je automatizované; ne každé automatizované testování je unit test.</p>
<div class="cb"></div>
</div>
<p>Jednotkové testy by měly v ideálním případě otestovat každou metodu třídy, každou funkci v knihovně, každý řádek kódu, navíc takovým způsobem, který je nezávislý na zbytku systému či na vnitřních stavech. Každý „testovací případ“ by měl pracovat s čistou kopií jednotky.</p>
<p>Pokud jednotka používá nějaké komplexní funkce „zvenčí“, pak pro testování podstrčíme „mock object“, který se bude navenek tvářit tak, že opravdu funguje, ve skutečnosti ale jen vrátí testovací data. Řekněme, že budeme testovat HTML generátor, který generuje stránky ze záznamů v databázi. Namísto objektu, který přebírá data z databáze, podstrčíme „mock“  jednoduchou třídu, která má stejné rozhraní, ale na <code>getTextById()</code> vrátí testovací „Lorem ipsum“. Jednotkové testy tak proběhnou nezávisle na okolí.</p>
<div class="rs-tip-major">
<p>Unit testy nezjistí, jestli celý dům bude stát. Testujeme jednotlivé cihly, maltu, tvárnice, tedy základní stavební prvky, a ověřujeme, jestli fungují tak, jak od nich očekáváme.</p>
<div class="cb"></div>
</div>
<p><strong>Jednotkovými testy netestujeme, zda jednotka funguje spolu s ostatními; od toho jsou integrační testy. Netestujeme jimi ani to, jestli celá aplikace funguje.</strong> Očekávat, že jednotka pro generování HTML funguje, a tím pádem musí fungovat celý web, je bláhové. V dalším textu se nebudeme zabývat ani integračními testy, ani testováním aplikace, <strong>zůstaneme jen u automatizovaných jednotkových tes­tů.</strong></p>
<h2>ISO9001</h2>
<p>K čemu nám tedy takové testování je? Nezjistíme tím, jestli to spolupracuje se zbytkem aplikace, nezjistíme, jestli aplikace funguje… <strong>Automatizované unit testy mají jinou hodnotu: jsou automatické (můžou tedy běžet bez zásahu člověka, např. na serveru jako hook u verzovacího nástroje), opakovatelné a jejich výsledky lze dobře zpracovat.</strong></p>
<p>Trochu to připomíná známou (a mnohými proklínanou) normu ISO9001. Tato norma nezajišťuje, jak si mnozí lidé myslí, jakost výrobků. Tato norma je zaměřena na to, aby veškeré procesy byly jasně popsané, specifikované a opakovatelné. ISO9001 vám nezaručí, že při výrobě neuděláte chybu. Postup podle této normy pouze zaručí, že chybu uděláte vždy stejně (pokud je procesní), nebo že zjistíte, kde vzniká, protože jednotlivé kroky jsou přesně popsané. Ano, je to opruz, popisovat přesně všechny procesy, sepisovat lejstra o tom, co se dělá a jak se to přesně dělá. Ale když je někde chyba, můžete se postupů popsaných v lejstrech při hledání držet. Buď zjistíte, že někdo postup nedodržel, nebo že je v procesu chyba  a pak ji můžete opravit a popsat proces znovu.</p>
<p>S testováním je to podobné. Test není vaše ověření, že vše funguje; na to by byl leckdy opravdu drahý. Test je nástroj pro dlouhodobou udržitelnost kódu a pro rozumnou práci s ním. Dobře napsané testy dokáží odhalit problémy při zásahu do kódu. Většinu situací „tady přepíšu pár řádků, bude to ale fungovat stejně“, které vedou k prapodivným chybám, můžete s jednotkovými testy zachytit dřív, než si zavlečete do kódu skryté chyby.</p>
<h2>Test je tak dobrý, jako jeho autor</h2>
<p><strong>Testování je jako španělská hospoda  najdete tam jen to, co si s sebou přinesete.</strong> Žádný test neobjeví v kódu nic, co autor nedokáže popsat. Myslet si, že unit test objeví chybu tam, kde nikoho nenapadlo, že by mohla být, je naivní.</p>
<p>Napsat dobrý test je trošku umění, především proto, že mnozí lidé postupují při ověřování chybně. Lidský mozek má tendenci hledat případy, které naši teorii potvrzují, namísto toho, aby hledal případy, které by jí vyvracely, kdyby fungovaly. Jinými slovy: <strong>musíme testovat nejen správnou funkčnost, ale i správnou nefunkčnost. </strong></p>
<p>Najít ale všelijaké kombinace, které by měly zhavarovat, vyžaduje opravdu zkušeného programátora s dobrou fantazií. Taky nikdo netvrdí, že napsat dobré testy je hračka!</p>
<h2>Jednotkové testování není všespásné</h2>
<p>Myslet si, že napíšu jednotkový test a knihovna bude automaticky dobrá a použitelná je bláhové. Myslet si, že jednotkový test zaručí kvalitní kód, je taky nesmysl. Přesto mnozí očekávají od jednotkových testů něco, co jim jednotkové testy nemohou nabídnout, a jsou pak rozčarováni z toho, že jejich očekávání nebylo naplněno. Často pak z neúspěchu viní testování jako takové.</p>
<p>Zopakujme si ještě jednou: <strong>Jednotkové testy slouží k automatizovanému, opakovatelnému a strojově zpracovatelnému testování izolovaných funkcí.</strong> Není to nástroj pro zajištění kvality nebo vhodnosti pro daný účel; nenahradí to dobrou analýzu ani dobrý návrh. Použijete je hlavně při dlouhodobé údržbě vlastního kódu. Naprosto neocenitelné jsou jejich služby ve chvíli, kdy napíšete „verzi 2“, která „<em>by měla být kompatibilní s verzí 1</em>“. Máte-li „verzi 1“ pokrytou dobrými testy, uvidíte na první pohled, jak to s tou kompatibilitou ve skutečnosti je.</p>
<p>Pokud píšete kód vždy důsledně jen na jedno použití, nasadíte ho do aplikace a pak už se k němu nikdy nevrátíte, tak pravděpodobně tuhle výhodu neoceníte. <em>Po pravdě řečeno v takovém případě máte hlavně úplně jiný problém než to, že netestujete…</em></p>
<h2>K čemu tedy?</h2>
<p>Pokud se držíte metodiky <abbr title="Test-Driven Development">TDD</abbr>, tedy že nejprve píšete testy a až po nich kód, tak můžete brát <strong>psaní testů jako první použití vašeho nového kódu</strong>. Berte to jako příležitost zjistit, jak se s ním pracuje, a to ještě dřív, než ho opravdu napíšete. Uvidíte svůj kód očima jeho uživatele, což je zkušenost k nezaplacení. Třeba zjistíte, že budete muset něco v API změnit či upravit…</p>
<p>TDD bývá někdy některými hodnocena jako příliš ortodoxní. Asi není třeba být vždy a za všech okolností doslovný a pokud napíšete nejdřív jednoduchou knihovnu a až po ní testy, nebude to jistě žádné velké neštěstí. Jen <em>pozor na to, že při obráceném postupu má člověk stále ještě v hlavě vlastní kód a mnohdy píše testy „na míru svému kódu a svým chybám“</em>.</p>
<div class="rs-box">
<p>Kupříkladu píšeme funkci, u níž je parametr i, jehož hodnota smí být max. 10 (včetně). Při psaní se překoukneme a  do funkce napíšeme test, který vyhodí výjimku, když <code>i&lt;10</code>. Pokud jsme nejprve napsali kód, tak máme mnohdy tendenci ověřovat, že pro <code>i=9</code> projde a pro <code>i=10</code> zhavaruje. Ve skutečnosti tedy testujeme to, že napsaný kód dělá to, co je v něm napsáno, nikoli že dělá to, co dělat má. Pokud začneme nejprve testem, pravděpodobně jej napíšeme správně.</p>
<div class="cb"></div>
</div>
<h2>Testy patří k bontonu!</h2>
<p>Testy jsou v podobné roli jako dokumentace: programátoři mají odpor k vytváření, protože to je „neproduktivní práce“. Když člověk programuje, v hlavě mu letí myšlenky a na nějaké psaní dokumentace není čas… Maximálně tak nějaký ten komentář do kódu.</p>
<p>Propagátoři nových jazyků a čistého kódu hovoří o dokumentačních komentářích jako o samozřejmosti; měly by patřit do kódu stejně samozřejmě jako odsazování. Stejný pohled se začíná prosazovat i v oblasti testování. <strong>Pustit open source knihovnu do světa bez sady testů</strong> (a bez dokumentace) <strong>je v jistých kruzích už programátorské faux pas</strong>: <em>k čemu mi je kód, který si můžu upravit, když nemůžu rychle zjistit, jestli mi úprava něco nerozbila?</em></p>
<p><strong>Napsat dobrý test je nutnost, pokud chceme svým kódem přispět do většího projektu.</strong> I ve firmách, které nedělají open source, je často používáno automatické testování, ať už kvůli <em>Continuous Integration</em>, tak třeba i pro měření kvality práce programátorů  pokud někdo soustavně <em>commituje</em> změny, které neprojdou testem, lze to snadno dohledat a zjistit příčiny.</p>
<p><strong>Testy, podobně jako dokumentace, nejsou v podstatě nikdy hotové a kompletní. </strong>To, že se v kódu objeví chyba, kterou test nezachytil, není důkaz toho, že jednotkové testování nemá smysl, ale toho, že byl test neúplný. Můžete se rozčílit na všechny propagátory testů a napsat jim to do diskusí, nebo můžete problém popsat testem; to druhé bývá rychlejší a smysluplnější. Stejně tak když vám kolega řekne, že mu vaše třída nefunguje za takových a takových podmínek: to je ideální příležitost ty podmínky nasimulovat v testu!</p>
<p>A nezapomeňte: <em>dobrý test vám kryje záda, když jde do tuhého a hledá se viník!</em></p>
<h2>Stejně ale…</h2>
<p class="rs-question">Pro nás je to drahé a zdržuje to.</p>
<p class="rs-answer">Zkusili jste si to, testy jste psali tak, jak se psát mají, všechno jste udělali správně, ale zdržovalo vás to. Knihovny totiž nikdy nepřepisujete a ty testy byste stejně spustili jen jednou. Pak asi ano, pokud jste si jisti, že jste všechno udělali správně, a přesto jste si spočítali, že se vám to nevyplatí, tak OK. </p>
<p class="rs-question">Nám chyby v kódu nevadí.</p>
<p class="rs-answer">Komu by vadily, že? Místo psaní testů vymyslíme, jak opravy kódu prodat zákazníkovi jako vícepráce, a vyděláme na tom!</p>
<p class="rs-question">Můj kód je vždy perfektní, protože jsem špičkový programátor.</p>
<p class="rs-answer">Pardon, testem jsme vás nechtěli urazit. Víme, že jste špičkový stroj na kód, který není nikdy unavený, nikdy nedělá chyby, nikdy se nepřepíše, vždy je stoprocentně koncentrovaný  a že tomu věříte. Máte pro to ale i nějaký jiný důkaz než svoje tvrzení?</p>
<p class="rs-question">
Všechny tyhlety takzvaný „metodiky“ jsou jen tlamocviky mladých frikulínů, které mají zakrýt, že vlastně vůbec neuměj&#8216; programovat…</p>
<p class="rs-answer">
Ale jistě… „Opravdový programátor“ napíše cyklus DO přes tisíc řádků, a nesplete se! Přidejme ještě „pravidla jsou pro slabochy“ a „čára není zeď“, ať to máme komplet. Ale upřímně  pokud si myslíte, že programování je umění, měli byste programy vystavovat na výstavách, a ne je cpát lidem do počítačů, aby s nima pracovali…</p>
<h2>Shrnutí</h2>
<ul>
<li>Testy nejsou kouzlo; je to metoda. Když ji neumíte a děláte ji špatně, nebude vám fungovat, tak prosté to je.</li>
<li>Jednotkové testy testují to, co říkají: funkčnost jednotek kódu.</li>
<li>Automatizovaný test otestuje jen to, co do něj napíšete.</li>
<li>Jednotkový test nenahrazuje jiné metody testování; doplňuje je. Pokud chcete testovat, jak to bude fungovat dohromady, slouží k tomu integrační testy.</li>
<li>Jednotkové testy děláme proto, že jsou opakovatelné, automatizovatelné a jejich výstup lze strojově vyhodnotit.</li>
<li>Automatizovaný test nemá, na rozdíl od člověka, „své dny“ a vždy testuje vše tak jak má. Neznamená to ale, že můžou člověka plně nahradit  jen mu ulehčují mechanickou práci.</li>
<li>Testování není ladění.</li>
<li>Test je jen tak dobrý jako jeho autor; je-li autor lemrouch, je i test špatný.</li>
<li>Hodina vynaložená na psaní testu ušetří den hledání podivné chyby za půl roku. Pokud hodláte ještě za půl roku pracovat ve stejném oboru, zvažte tento aspekt.</li>
</ul>
<p>Ke čtení naleznete mnoho různých materiálů, od teoretických po praktické ukázky. V češtině mě zaujala velmi podrobná <a href="http://www.poeta.cz/Zaklady_testovani.pdf">Příručka o testování (pdf)</a> od <span class="rs-person">Anny Borovcové</span> (<a href="http://testovanisoftwaru.blogspot.com/2009/08/zaklady-testovani.html">blog</a>).</p>
<div class="rs-tip-major">
<p>Pokud vás problematika zajímá, využijte možnosti navštívit školení <span class="rs-person">Jiřího Knesla</span> na téma <a href="http://akademie.blog.root.cz/skoleni-unit-testovani-v-php/">Testování v PHP</a> (viz <a href="http://weblog.shopio.cz/zajimavosti/skoleni-php-testovani/">zkušenosti</a> <a href="http://blog.mareksudak.cz/dojmy-ze-skoleni-unit-testovani-v-php">účastníků</a>).</p>
<div class="cb"></div>
</div>
</div>
<div class="entry-social bottom">
<div class="post-social-buttons social visible-desktop">
<h3 class="screen-reader-text">Sdílejte:</h3>
<ul>
<li class="facebook">
<a title="Sdílet na Facebooku" href="http://www.facebook.com/sharer/sharer.php?u=http%3A%2F%2Fwww.zdrojak.cz%2Fclanky%2Fjeste-k-testovani%2F&amp;t=Je%C5%A1t%C4%9B+k+testov%C3%A1n%C3%AD"
onclick="javascript:window.open(this.href, '', 'menubar=no,toolbar=no,resizable=yes,scrollbars=yes,height=600,width=626');return false;"
target="_blank" rel="nofollow">
<span class="count">0</span></a>
</li>
<li class="twitter">
<a title="Sdílet na Twitteru" href="https://twitter.com/intent/tweet?via=zdrojak&amp;url=http%3A%2F%2Fwww.zdrojak.cz%2Fclanky%2Fjeste-k-testovani%2F&amp;text=Je%C5%A1t%C4%9B+k+testov%C3%A1n%C3%AD"
onclick="javascript:window.open(this.href, '', 'menubar=no,toolbar=no,resizable=yes,scrollbars=yes,height=600,width=626');return false;"
target="_blank" rel="nofollow">
<span class="count">0</span></a>
</li>
<li class="googleplus">
<a title="Sdílet na Google+" href="https://plus.google.com/share?url=http%3A%2F%2Fwww.zdrojak.cz%2Fclanky%2Fjeste-k-testovani%2F&amp;title=Je%C5%A1t%C4%9B+k+testov%C3%A1n%C3%AD"
onclick="javascript:window.open(this.href, '', 'menubar=no,toolbar=no,resizable=yes,scrollbars=yes,height=600,width=626');return false;"
target="_blank" rel="nofollow">
<span class="count">0</span></a>
</li>
</ul>
</div> </div>
<footer class="entry-meta">
<div class="author-info">
<div class="author-avatar">
<a href="http://www.zdrojak.cz/autori/martin-maly/" rel="author"><img width="96" height="120" src="http://www.zdrojak.cz/wp-content/uploads/2013/01/mmaly-122686257455681.png" class="attachment-thumbnail" alt="Martin Malý" /></a>
</div>
<div class="author-description">
<h2>
<a href="http://www.zdrojak.cz/autori/martin-maly/" rel=author">Martin Malý</a>
</h2>
<div class="text"><p>Začal programovat v roce 1984 s programovatelnou kalkulačkou. Pokračoval k BASICu, assembleru Z80, Forthu, Pascalu, Céčku, dalším assemblerům, před časem v PHP a teď je rád, že neprogramuje&#8230;</p>
</div>
</div>
</div> </footer>
</article>
<div id="comments" class="comments-area table no-print">
<a class="comments_bubble visible-desktop" rel="nofollow" title="43 komentářů" href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comments">
<strong class="screen-reader-text">Komentáře: </strong>
<span class="count">43</span>
</a> <h2 class="comments-title">Přehled komentářů</h2>
<div class="comments-table">
<table>
<tr id="tr-comment-16006" class="comment even thread-even depth-1">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />danaketh</td>
<th class="comment-title"><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16006">Psaní testů</a></th>
<td class="comments-date"><time datetime="2011-03-14T08:06:00+00:00">14.3.2011 v 06:06</time></td>
</tr><tr id="tr-comment-16008" class="comment byuser comment-author-martin-maly bypostauthor odd alt depth-2">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/cf68d5ecb446e39c3cebe0873daf6f01?s=16&amp;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D16&amp;r=G' class='avatar avatar-16 photo' height='16' width='16' />Martin Malý</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16008">Re: Psaní testů</a></th>
<td class="comments-date"><time datetime="2011-03-14T08:18:07+00:00">14.3.2011 v 06:18</time></td>
</tr></div>
<tr id="tr-comment-16018" class="comment even depth-2">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />myshpa</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16018">Re: Psaní testů</a></th>
<td class="comments-date"><time datetime="2011-03-14T09:52:45+00:00">14.3.2011 v 07:52</time></td>
</tr></div>
<tr id="tr-comment-16032" class="comment byuser comment-author-martin-maly bypostauthor odd alt depth-2">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/cf68d5ecb446e39c3cebe0873daf6f01?s=16&amp;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D16&amp;r=G' class='avatar avatar-16 photo' height='16' width='16' />Martin Malý</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16032">Re: Psaní testů</a></th>
<td class="comments-date"><time datetime="2011-03-14T14:22:14+00:00">14.3.2011 ve 12:22</time></td>
</tr></div>
</div>
<tr id="tr-comment-16007" class="comment even thread-odd thread-alt depth-1">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />jáchym</td>
<th class="comment-title"><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16007">tip na testování v javascriptu</a></th>
<td class="comments-date"><time datetime="2011-03-14T08:17:35+00:00">14.3.2011 v 06:17</time></td>
</tr><tr id="tr-comment-16010" class="comment byuser comment-author-martin-maly bypostauthor odd alt depth-2">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/cf68d5ecb446e39c3cebe0873daf6f01?s=16&amp;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D16&amp;r=G' class='avatar avatar-16 photo' height='16' width='16' />Martin Malý</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16010">Re: tip na testování v javascriptu</a></th>
<td class="comments-date"><time datetime="2011-03-14T08:20:46+00:00">14.3.2011 v 06:20</time></td>
</tr><tr id="tr-comment-16024" class="comment even depth-3">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />Jáchym</td>
<th class="comment-title"><span class="line"></span><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16024">Re: tip na testování v javascriptu</a></th>
<td class="comments-date"><time datetime="2011-03-14T12:42:37+00:00">14.3.2011 v 10:42</time></td>
</tr><tr id="tr-comment-16030" class="comment byuser comment-author-martin-maly bypostauthor odd alt depth-4">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/cf68d5ecb446e39c3cebe0873daf6f01?s=16&amp;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D16&amp;r=G' class='avatar avatar-16 photo' height='16' width='16' />Martin Malý</td>
<th class="comment-title"><span class="line"></span><span class="line"></span><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16030">Re: tip na testování v javascriptu</a></th>
<td class="comments-date"><time datetime="2011-03-14T13:58:09+00:00">14.3.2011 v 11:58</time></td>
</tr></div>
</div>
</div>
<tr id="tr-comment-16046" class="comment even depth-2">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />Aleš Roubíček</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16046">Re: tip na testování v javascriptu</a></th>
<td class="comments-date"><time datetime="2011-03-14T20:40:51+00:00">14.3.2011 ve 18:40</time></td>
</tr></div>
</div>
<tr id="tr-comment-16009" class="comment odd alt thread-even depth-1">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />Mastodont</td>
<th class="comment-title"><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16009">Dotaz</a></th>
<td class="comments-date"><time datetime="2011-03-14T08:18:43+00:00">14.3.2011 v 06:18</time></td>
</tr><tr id="tr-comment-16011" class="comment byuser comment-author-martin-maly bypostauthor even depth-2">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/cf68d5ecb446e39c3cebe0873daf6f01?s=16&amp;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D16&amp;r=G' class='avatar avatar-16 photo' height='16' width='16' />Martin Malý</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16011">Re: Dotaz</a></th>
<td class="comments-date"><time datetime="2011-03-14T08:34:00+00:00">14.3.2011 v 06:34</time></td>
</tr><tr id="tr-comment-16012" class="comment odd alt depth-3">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />tdvorak</td>
<th class="comment-title"><span class="line"></span><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16012">Re: Dotaz</a></th>
<td class="comments-date"><time datetime="2011-03-14T08:43:48+00:00">14.3.2011 v 06:43</time></td>
</tr><tr id="tr-comment-16027" class="comment byuser comment-author-michal-augustyn even depth-4">
<td class="author"><img alt='' src='/wp-content/uploads/avatars/2009/07/augi.jpg' class='avatar avatar-16 photo' height='16' width='16' />Michal Augustýn</td>
<th class="comment-title"><span class="line"></span><span class="line"></span><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16027">Re: Dotaz</a></th>
<td class="comments-date"><time datetime="2011-03-14T13:20:06+00:00">14.3.2011 v 11:20</time></td>
</tr></div>
</div>
</div>
<tr id="tr-comment-16021" class="comment odd alt depth-2">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />jos</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16021">Re: Dotaz</a></th>
<td class="comments-date"><time datetime="2011-03-14T10:33:47+00:00">14.3.2011 v 08:33</time></td>
</tr><tr id="tr-comment-16022" class="comment byuser comment-author-martin-maly bypostauthor even depth-3">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/cf68d5ecb446e39c3cebe0873daf6f01?s=16&amp;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D16&amp;r=G' class='avatar avatar-16 photo' height='16' width='16' />Martin Malý</td>
<th class="comment-title"><span class="line"></span><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16022">Re: Dotaz</a></th>
<td class="comments-date"><time datetime="2011-03-14T11:41:11+00:00">14.3.2011 v 09:41</time></td>
</tr></div>
</div>
</div>
<tr id="tr-comment-16013" class="comment odd alt thread-odd thread-alt depth-1">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />Tomáš Herceg</td>
<th class="comment-title"><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16013">Unit testy</a></th>
<td class="comments-date"><time datetime="2011-03-14T09:20:42+00:00">14.3.2011 v 07:20</time></td>
</tr><tr id="tr-comment-16015" class="comment byuser comment-author-martin-maly bypostauthor even depth-2">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/cf68d5ecb446e39c3cebe0873daf6f01?s=16&amp;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D16&amp;r=G' class='avatar avatar-16 photo' height='16' width='16' />Martin Malý</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16015">Re: Unit testy</a></th>
<td class="comments-date"><time datetime="2011-03-14T09:39:36+00:00">14.3.2011 v 07:39</time></td>
</tr><tr id="tr-comment-16029" class="comment byuser comment-author-michal-augustyn odd alt depth-3">
<td class="author"><img alt='' src='/wp-content/uploads/avatars/2009/07/augi.jpg' class='avatar avatar-16 photo' height='16' width='16' />Michal Augustýn</td>
<th class="comment-title"><span class="line"></span><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16029">Re: Unit testy</a></th>
<td class="comments-date"><time datetime="2011-03-14T13:47:15+00:00">14.3.2011 v 11:47</time></td>
</tr><tr id="tr-comment-16031" class="comment byuser comment-author-martin-maly bypostauthor even depth-4">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/cf68d5ecb446e39c3cebe0873daf6f01?s=16&amp;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D16&amp;r=G' class='avatar avatar-16 photo' height='16' width='16' />Martin Malý</td>
<th class="comment-title"><span class="line"></span><span class="line"></span><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16031">Re: Unit testy</a></th>
<td class="comments-date"><time datetime="2011-03-14T13:59:53+00:00">14.3.2011 v 11:59</time></td>
</tr></div>
<tr id="tr-comment-16034" class="comment odd alt depth-4">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />vlk</td>
<th class="comment-title"><span class="line"></span><span class="line"></span><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16034">Re: Unit testy</a></th>
<td class="comments-date"><time datetime="2011-03-14T15:51:05+00:00">14.3.2011 ve 13:51</time></td>
</tr><tr id="tr-comment-16039" class="comment byuser comment-author-michal-augustyn even depth-5">
<td class="author"><img alt='' src='/wp-content/uploads/avatars/2009/07/augi.jpg' class='avatar avatar-16 photo' height='16' width='16' />Michal Augustýn</td>
<th class="comment-title"><span class="line"></span><span class="line"></span><span class="line"></span><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16039">Re: Unit testy</a></th>
<td class="comments-date"><time datetime="2011-03-14T17:09:55+00:00">14.3.2011 ve 15:09</time></td>
</tr></div>
<tr id="tr-comment-16040" class="comment byuser comment-author-koubel odd alt depth-5">
<td class="author"><img alt='' src='http://1.gravatar.com/avatar/b9554e2b58b2863fceb0f53132f55a56?s=16&amp;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D16&amp;r=G' class='avatar avatar-16 photo' height='16' width='16' />koubel</td>
<th class="comment-title"><span class="line"></span><span class="line"></span><span class="line"></span><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16040">TDD nedovolí prasit</a></th>
<td class="comments-date"><time datetime="2011-03-14T17:38:12+00:00">14.3.2011 ve 15:38</time></td>
</tr></div>
</div>
</div>
<tr id="tr-comment-16035" class="comment even depth-3">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />tdvorak</td>
<th class="comment-title"><span class="line"></span><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16035">Re: Unit testy</a></th>
<td class="comments-date"><time datetime="2011-03-14T16:16:26+00:00">14.3.2011 ve 14:16</time></td>
</tr></div>
<tr id="tr-comment-16036" class="comment odd alt depth-3">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />Tomáš Herceg</td>
<th class="comment-title"><span class="line"></span><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16036">Re: Unit testy</a></th>
<td class="comments-date"><time datetime="2011-03-14T16:36:00+00:00">14.3.2011 ve 14:36</time></td>
</tr><tr id="tr-comment-16037" class="comment byuser comment-author-drevolution even depth-4">
<td class="author"><img alt='' src='/wp-content/uploads/avatars/2006/03/drevolution.jpg' class='avatar avatar-16 photo' height='16' width='16' />drevolution</td>
<th class="comment-title"><span class="line"></span><span class="line"></span><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16037">Re: Unit testy</a></th>
<td class="comments-date"><time datetime="2011-03-14T16:47:57+00:00">14.3.2011 ve 14:47</time></td>
</tr></div>
</div>
<tr id="tr-comment-16038" class="comment byuser comment-author-ped odd alt depth-3">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/880ab36eec5a04a4afb1443836aae55f?s=16&amp;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D16&amp;r=G' class='avatar avatar-16 photo' height='16' width='16' />Ped</td>
<th class="comment-title"><span class="line"></span><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16038">Re: Unit testy</a></th>
<td class="comments-date"><time datetime="2011-03-14T16:59:15+00:00">14.3.2011 ve 14:59</time></td>
</tr></div>
<tr id="tr-comment-16048" class="comment byuser comment-author-frantisek-kucera even depth-3">
<td class="author"><img alt='' src='/wp-content/uploads/avatars/2006/09/xkucf03.jpg' class='avatar avatar-16 photo' height='16' width='16' />František Kučera</td>
<th class="comment-title"><span class="line"></span><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16048">Re: Unit testy</a></th>
<td class="comments-date"><time datetime="2011-03-14T20:49:24+00:00">14.3.2011 ve 18:49</time></td>
</tr></div>
</div>
</div>
<tr id="tr-comment-16014" class="comment odd alt thread-even depth-1">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />Michal</td>
<th class="comment-title"><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16014">Testy v PHP</a></th>
<td class="comments-date"><time datetime="2011-03-14T09:31:39+00:00">14.3.2011 v 07:31</time></td>
</tr></div>
<tr id="tr-comment-16028" class="comment even thread-odd thread-alt depth-1">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />rox</td>
<th class="comment-title"><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16028">Jeste se musime hodne ucit...</a></th>
<td class="comments-date"><time datetime="2011-03-14T13:38:04+00:00">14.3.2011 v 11:38</time></td>
</tr></div>
<tr id="tr-comment-16041" class="comment odd alt thread-even depth-1">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />Opravdový odborník :-)</td>
<th class="comment-title"><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16041">Re: Ještě k testování</a></th>
<td class="comments-date"><time datetime="2011-03-14T18:19:19+00:00">14.3.2011 ve 16:19</time></td>
</tr><tr id="tr-comment-16044" class="comment byuser comment-author-martin-maly bypostauthor even depth-2">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/cf68d5ecb446e39c3cebe0873daf6f01?s=16&amp;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D16&amp;r=G' class='avatar avatar-16 photo' height='16' width='16' />Martin Malý</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16044">Re: Ještě k testování</a></th>
<td class="comments-date"><time datetime="2011-03-14T20:07:29+00:00">14.3.2011 ve 18:07</time></td>
</tr></div>
<tr id="tr-comment-16047" class="comment byuser comment-author-martin-maly bypostauthor odd alt depth-2">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/cf68d5ecb446e39c3cebe0873daf6f01?s=16&amp;d=http%3A%2F%2F0.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D16&amp;r=G' class='avatar avatar-16 photo' height='16' width='16' />Martin Malý</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16047">Re: Ještě k testování</a></th>
<td class="comments-date"><time datetime="2011-03-14T20:44:03+00:00">14.3.2011 ve 18:44</time></td>
</tr><tr id="tr-comment-16699" class="comment even depth-3">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />Opravdový odborník :-)</td>
<th class="comment-title"><span class="line"></span><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16699">Re: Ještě k testování</a></th>
<td class="comments-date"><time datetime="2011-04-20T09:42:34+00:00">20.4.2011 v 07:42</time></td>
</tr></div>
</div>
<tr id="tr-comment-16052" class="comment odd alt depth-2">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />valnoha</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16052">Re: Ještě k testování</a></th>
<td class="comments-date"><time datetime="2011-03-14T21:57:06+00:00">14.3.2011 ve 19:57</time></td>
</tr></div>
<tr id="tr-comment-23312" class="comment even depth-2">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />Charvi</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-23312">Re: Ještě k testování</a></th>
<td class="comments-date"><time datetime="2012-10-06T00:06:25+00:00">6.10.2012 ve 22:06</time></td>
</tr></div>
</div>
<tr id="tr-comment-16042" class="comment odd alt thread-odd thread-alt depth-1">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />heptau</td>
<th class="comment-title"><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16042">Testy nad databazi</a></th>
<td class="comments-date"><time datetime="2011-03-14T18:21:41+00:00">14.3.2011 ve 16:21</time></td>
</tr><tr id="tr-comment-16045" class="comment byuser comment-author-michal-augustyn even depth-2">
<td class="author"><img alt='' src='/wp-content/uploads/avatars/2009/07/augi.jpg' class='avatar avatar-16 photo' height='16' width='16' />Michal Augustýn</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16045">Re: Testy nad databazi</a></th>
<td class="comments-date"><time datetime="2011-03-14T20:31:44+00:00">14.3.2011 ve 18:31</time></td>
</tr></div>
<tr id="tr-comment-16049" class="comment odd alt depth-2">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />Aleš Roubíček</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16049">Re: Testy nad databazi</a></th>
<td class="comments-date"><time datetime="2011-03-14T20:49:59+00:00">14.3.2011 ve 18:49</time></td>
</tr></div>
<tr id="tr-comment-16050" class="comment byuser comment-author-frantisek-kucera even depth-2">
<td class="author"><img alt='' src='/wp-content/uploads/avatars/2006/09/xkucf03.jpg' class='avatar avatar-16 photo' height='16' width='16' />František Kučera</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16050">Re: Testy nad databazi</a></th>
<td class="comments-date"><time datetime="2011-03-14T21:33:11+00:00">14.3.2011 ve 19:33</time></td>
</tr></div>
<tr id="tr-comment-16051" class="comment byuser comment-author-maio odd alt depth-2">
<td class="author"><img alt='' src='/wp-content/uploads/avatars/2011/03/maio.jpg' class='avatar avatar-16 photo' height='16' width='16' />maio</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16051">Re: Testy nad databazi</a></th>
<td class="comments-date"><time datetime="2011-03-14T21:53:50+00:00">14.3.2011 ve 19:53</time></td>
</tr></div>
</div>
<tr id="tr-comment-16053" class="comment byuser comment-author-maio even thread-even depth-1">
<td class="author"><img alt='' src='/wp-content/uploads/avatars/2011/03/maio.jpg' class='avatar avatar-16 photo' height='16' width='16' />maio</td>
<th class="comment-title"><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16053">Test-driven development</a></th>
<td class="comments-date"><time datetime="2011-03-14T22:41:07+00:00">14.3.2011 ve 20:41</time></td>
</tr></div>
<tr id="tr-comment-16057" class="comment odd alt thread-odd thread-alt depth-1">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />kert</td>
<th class="comment-title"><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16057">Šedivé příspěvky</a></th>
<td class="comments-date"><time datetime="2011-03-15T10:58:27+00:00">15.3.2011 v 08:58</time></td>
</tr><tr id="tr-comment-16058" class="comment even depth-2">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />Kdyby</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/jeste-k-testovani/?show=comments#comment-16058">Re: Šedivé příspěvky</a></th>
<td class="comments-date"><time datetime="2011-03-15T11:17:32+00:00">15.3.2011 v 09:17</time></td>
</tr></div>
</div>
</table>
</div>
<div class="comments-actions">
<a class="menu-toggle comment-reply-link respond" href="?show=comments#respond">
Přidat komentář
</a>
<a class="all-comments" href="?show=comments#comments">Zobrazit všechny komentáře</a>
</div>
</div> <nav class="nav-single no-print">
<h3 class="assistive-text">Navigace pro příspěvky</h3>
<span class="nav-previous">
<a href="http://www.zdrojak.cz/clanky/co-zaujalo-jiriho-knesla/" rel="prev">Co zaujalo Jiřího Knesla</a> </span>
<span class="nav-next">
<a href="http://www.zdrojak.cz/clanky/zlepseni-vykonu-animaci-v-javascriptu/" rel="next">Zlepšení výkonu animací v JavaScriptu</a> </span>
</nav>
<div class="print">Zdroj: http://www.zdrojak.cz/?p=3450</div>
</div>
</div>
<div id="secondary" class="widget-area" role="complementary">
<aside id="recent_category_posts-0" class="with-thumb widget widget_recent_entries"> <h3 class="widget-title">Webdesign</h3> <div class="thumbnail">
<img width="80" height="80" src="http://www.zdrojak.cz/wp-content/uploads/2011/02/html5-badge-1.png" class="attachment- wp-post-image" alt="Webdesign" title="Webdesign" /> </div>
<ul>
<li><a href="http://www.zdrojak.cz/clanky/bezpecny-sandboxovany-iframe/" title="Bezpečný sandboxovaný iframe">Bezpečný sandboxovaný iframe</a></li>
<li><a href="http://www.zdrojak.cz/clanky/webove-stranky-dostanou-rozpoznavani-reci-prichazi-web-speech-api/" title="Webové stránky dostanou rozpoznávání řeči &#8211; přichází Web Speech API">Webové stránky dostanou rozpoznávání řeči &#8211; přichází Web Speech API</a></li>
<li><a href="http://www.zdrojak.cz/clanky/metody-poskytovani-textovych-alternativ-obrazku-shrnuti/" title="Metody poskytování textových alternativ obrázků &#8211; shrnutí">Metody poskytování textových alternativ obrázků &#8211; shrnutí</a></li>
</ul>
</aside> <aside id="recent-posts-0" class="widget widget_recent_entries"> <h3 class="widget-title">Nejnovější příspěvky</h3> <ul>
<li>
<a href="http://www.zdrojak.cz/clanky/zapier-propojovani-api/" title="Zapier &#8211; dejte propojování API ten správný šmrnc">Zapier &#8211; dejte propojování API ten správný šmrnc</a>
</li>
<li>
<a href="http://www.zdrojak.cz/clanky/testovani-v-php-tvorba-testovatelneho-kodu-ii/" title="Testování v PHP: tvorba testovatelného kódu II.">Testování v PHP: tvorba testovatelného kódu II.</a>
</li>
<li>
<a href="http://www.zdrojak.cz/clanky/menard-instant-angularjs-starter-recenze/" title="Dan Menard: Instant AngularJS Starter (recenze první knihy o AngularJS)">Dan Menard: Instant AngularJS Starter (recenze první knihy o AngularJS)</a>
</li>
<li>
<a href="http://www.zdrojak.cz/clanky/reportaz-z-devel-cz-konference-2013/" title="Reportáž z Devel.cz konference 2013">Reportáž z Devel.cz konference 2013</a>
</li>
<li>
<a href="http://www.zdrojak.cz/clanky/tvorba-moderniho-e-shopu-nahravani-obrazku-k-produktu/" title="Tvorba moderního e-shopu: nahrávání obrázků k produktu">Tvorba moderního e-shopu: nahrávání obrázků k produktu</a>
</li>
</ul>
</aside><aside id="related_posts-0" class="widget widget_related_posts"><h3 class="widget-title">Související články</h3><ul><li><a href="http://www.zdrojak.cz/clanky/doctrine-2-a-notorm-videotutorial/" rel="bookmark">Doctrine 2 a NotORM &#8211; videotutoriál</a></li><li><a href="http://www.zdrojak.cz/clanky/prasit-ci-neprasit/" rel="bookmark">Prasit, či neprasit?</a></li><li><a href="http://www.zdrojak.cz/clanky/vite-jak-uzivatele-pouzivaji-vasi-aplikaci/" rel="bookmark">Víte, jak uživatelé používají vaši aplikaci?</a></li><li><a href="http://www.zdrojak.cz/clanky/phpexcel-tabulky-jednoduse/" rel="bookmark">PHPExcel: tabulky jednoduše</a></li><li><a href="http://www.zdrojak.cz/clanky/knihy-pro-vyvojare-leden-unor-2010/" rel="bookmark">Knihy pro vývojáře &#8211; leden/únor 2010</a></li></ul></aside><aside id="recent-comments-0" class="widget widget_recent_comments"><h3 class="widget-title">Nejnovější komentáře</h3><ul id="recentcomments"><li class="recentcomments">Honza Marek u <a href="http://www.zdrojak.cz/clanky/testovani-v-php-tvorba-testovatelneho-kodu-ii/?show=comments#comment-24141">Testování v PHP: tvorba testovatelného kódu II.</a></li><li class="recentcomments">arron u <a href="http://www.zdrojak.cz/clanky/testovani-v-php-tvorba-testovatelneho-kodu-ii/?show=comments#comment-24140">Testování v PHP: tvorba testovatelného kódu II.</a></li><li class="recentcomments">pav u <a href="http://www.zdrojak.cz/clanky/testovani-v-php-tvorba-testovatelneho-kodu-ii/?show=comments#comment-24139">Testování v PHP: tvorba testovatelného kódu II.</a></li><li class="recentcomments">Martin Hassman u <a href="http://www.zdrojak.cz/clanky/uvodni-analyza-pro-moderni-e-shop/?show=comments#comment-24138">Úvodní analýza pro moderní e-shop</a></li><li class="recentcomments">xlilien u <a href="http://www.zdrojak.cz/clanky/uvodni-analyza-pro-moderni-e-shop/?show=comments#comment-24137">Úvodní analýza pro moderní e-shop</a></li></ul></aside> </div><!-- #secondary -->
</div>
<div id="bottom" class="site-bottom no-print">
<div class="wrapper">
<aside class="social widget widget_nav_menu">
<h3 class="widget-title">Sledujte</h3>
<div class="content">
<ul class="menu">
<li class="googleplus">
<a target="_blank" href="https://plus.google.com/101725826130888424314/posts">Google + Zdroják</a>
</li>
<li class="twitter">
<a target="_blank" href="https://twitter.com/zdrojak">Twitter @zdrojak</a>
</li>
<li class="mail">
<a href="/mail">Zpravodaj</a>
</li>
<li class="rss">
<a href="/feed">RSS</a>
</li>
</ul>
</div>
</aside> </div>
</div>
<footer id="colophon" class="site-footer no-print" role="contentinfo">
<div>
<p>
<a href="http://www.zdrojak.cz">Zdroják</a>, o tvorbě webových stránek a aplikací.
ISSN <span id="issn">1803-5620</span>
</p>
<p>Vydává <a href="http://www.devel.cz">Devel.cz Lab s.r.o.</a>. Všechna práva vyhrazena.</p>
</div>
<ul>
<li>
<a href="/redakce/" title="Redakční texty" class="redakce">Redakce</a>
<span> | </span>
</li>
<li>
<a href="/autori/" title="Autoři" class="autori">Autoři</a>
<span> | </span>
</li>
<li>
<a href="/redakce/kontakt/" title="Kontakt" class="kontakt">Kontakt</a>
<span> | </span>
</li>
<li>
<a href="http://www.iinfo.cz/reklama/" target="_blank" title="Reklama" class="reklama">Reklama</a>
<span> | </span>
</li>
<li>
<a href="/redakce/podminky-uzivani/" title="Podmínky užívání" class="podminky-uzivani">Podmínky užívání</a>
<span> | </span>
</li>
<li>
<a href="/redakce/podpora/" title="Podpořte Zdroják" class="podporte-zdrojak">Podpořte Zdroják</a>
<span> | </span>
</li>
<li>
<a href="/slovnicek/" title="Slovníček" class="slovnicek">Slovníček</a>
<span> | </span>
</li>
<li>
<a href="/redakce/rss-exporty/" title="RSS" class="rss">RSS</a>
<span></span>
</li>
</ul>
</footer>
</div>
<script type='text/javascript' src='http://www.zdrojak.cz/wp-includes/js/admin-bar.min.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.zdrojak.cz/wp-content/themes/twentytwelve/js/navigation.js?ver=1.0'></script>
<div id="wpadminbar" class="nojq nojs" role="navigation">
<a class="screen-reader-shortcut" href="#wp-toolbar" tabindex="1">Přejít k navigační liště</a>
<div class="quicklinks" id="wp-toolbar" role="navigation" aria-label="Horní navigační lišta" tabindex="0">
<ul id="wp-admin-bar-top-secondary" class="ab-top-secondary ab-top-menu">
<li id="wp-admin-bar-zd_login"><a class="ab-item" href="http://www.zdrojak.cz/wp-login.php">Přihlásit se</a> </li>
<li id="wp-admin-bar-zd_register"><a class="ab-item" href="http://www.zdrojak.cz/wp-login.php?action=register">Registrace</a> </li></ul><ul id="wp-admin-bar-root-default" class="ab-top-menu">
<li id="wp-admin-bar-devel_cz" class="link root"><a class="ab-item" href="http://devel.cz" target="_blank" title="Devel.cz">Devel.cz</a> </li>
<li id="wp-admin-bar-root_cz" class="link root"><a class="ab-item" href="http://www.root.cz" target="_blank" title="Root.cz">Root.cz</a> </li>
<li id="wp-admin-bar-lupa_cz" class="link lupa"><a class="ab-item" href="http://www.lupa.cz" target="_blank" title="Lupa.cz">Lupa.cz</a> </li>
<li id="wp-admin-bar-testomato_com" class="link testomato"><a class="ab-item" href="http://www.testomato.com/" target="_blank" title="Testomato.com">Testomato.com</a> </li></ul> </div>
<a class="screen-reader-shortcut" href="http://www.zdrojak.cz/wp-login.php?action=logout&#038;_wpnonce=09e0bc3e07">Odhlásit se</a>
</div>
<div id="webstats">
<script src="http://c1.navrcholu.cz/code?site=72;t=t1x1" type="text/javascript"></script>
<noscript>
<div>
<img src="http://c1.navrcholu.cz/hit?site=72;t=t1x1;ref=;jss=0" width="1" height="1" alt=""/>
</div>
</noscript>
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-30960355-1']);
_gaq.push(['_trackPageview']);
(function () {
var ga = document.createElement('script');
ga.type = 'text/javascript';
ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(ga, s);
})();
</script>
</div>
<!-- d0e88e1c2777b950889c170b9b92adf8 -->
</body>
</html>

@ -0,0 +1,44 @@
# -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from os.path import join, dirname
from readability.readable import Article
from readability._compat import unicode
from ...compat import unittest
class TestArticle(unittest.TestCase):
"""
Test the scoring and parsing of the article from URL below:
http://www.zdrojak.cz/clanky/jeste-k-testovani/
"""
def setUp(self):
"""Load up the article for us"""
article_path = join(dirname(__file__), "article.html")
with open(article_path, "rb") as file:
self.document = Article(file.read(), "http://www.zdrojak.cz/clanky/jeste-k-testovani/")
def tearDown(self):
"""Drop the article"""
self.document = None
def test_parses(self):
"""Verify we can parse the document."""
self.assertIn('id="readabilityBody"', self.document.readable)
def test_content_exists(self):
"""Verify that some content exists."""
self.assertIsInstance(self.document.readable, unicode)
text = "S automatizovaným testováním kódu (a ve zbytku článku budu mít na mysli právě to) jsem se setkal v několika firmách."
self.assertIn(text, self.document.readable)
text = "Ke čtení naleznete mnoho různých materiálů, od teoretických po praktické ukázky."
self.assertIn(text, self.document.readable)
def test_content_does_not_exist(self):
"""Verify we cleaned out some content that shouldn't exist."""
self.assertNotIn("Pokud vás problematika zajímá, využijte možnosti navštívit školení", self.document.readable)

@ -1,15 +1,15 @@
import os
from operator import attrgetter
try:
# Python < 2.7
import unittest2 as unittest
except ImportError:
import unittest
# -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
import os
from breadability.readable import Article
from breadability.readable import check_siblings
from breadability.readable import prep_article
from operator import attrgetter
from readability.readable import Article
from readability.readable import check_siblings
from readability.readable import prep_article
from ...compat import unittest
class TestArticle(unittest.TestCase):
@ -37,16 +37,14 @@ class TestArticle(unittest.TestCase):
self.assertFalse(
'#anExampleGoogleDoesntIntendToShareBlogAndItWill' in doc.readable)
@unittest.skip("Test fails because of some weird hash.")
def test_candidates(self):
"""Verify we have candidates."""
doc = Article(self.article)
# from lxml.etree import tounicode
found = False
wanted_hash = '04e46055'
# from breadability.logconfig import LNODE
# from breadability.logconfig import set_logging_level
# set_logging_level('DEBUG')
# LNODE.activate()
for node in doc.candidates.values():
if node.hash_id == wanted_hash:
found = node
@ -69,5 +67,3 @@ class TestArticle(unittest.TestCase):
# This article hits up against the img > p conditional filtering
# because of the many .gif images in the content. We've removed that
# rule.
# set_logging_level('INFO')
# LNODE.deactivate()

@ -0,0 +1,91 @@
# -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from collections import defaultdict
from readability._compat import to_unicode, to_bytes
from readability.document import (OriginalDocument, determine_encoding,
convert_breaks_to_paragraphs)
from .compat import unittest
from .utils import load_snippet
class TestOriginalDocument(unittest.TestCase):
"""Verify we can process html into a document to work off of."""
def test_convert_br_tags_to_paragraphs(self):
returned = convert_breaks_to_paragraphs(
"<div>HI<br><br>How are you?<br><br> \t \n <br>Fine\n I guess</div>")
self.assertEqual(returned,
"<div>HI</p><p>How are you?</p><p>Fine\n I guess</div>")
def test_convert_hr_tags_to_paragraphs(self):
returned = convert_breaks_to_paragraphs(
"<div>HI<br><br>How are you?<hr/> \t \n <br>Fine\n I guess</div>")
self.assertEqual(returned,
"<div>HI</p><p>How are you?</p><p>Fine\n I guess</div>")
def test_readin_min_document(self):
"""Verify we can read in a min html document"""
doc = OriginalDocument(load_snippet('document_min.html'))
self.assertTrue(to_unicode(doc).startswith('<html>'))
self.assertEqual(doc.title, 'Min Document Title')
def test_readin_with_base_url(self):
"""Passing a url should update links to be absolute links"""
doc = OriginalDocument(
load_snippet('document_absolute_url.html'),
url="http://blog.mitechie.com/test.html")
self.assertTrue(to_unicode(doc).startswith('<html>'))
# find the links on the page and make sure each one starts with out
# base url we told it to use.
links = doc.links
self.assertEqual(len(links), 3)
# we should have two links that start with our blog url
# and one link that starts with amazon
link_counts = defaultdict(int)
for link in links:
if link.get('href').startswith('http://blog.mitechie.com'):
link_counts['blog'] += 1
else:
link_counts['other'] += 1
self.assertEqual(link_counts['blog'], 2)
self.assertEqual(link_counts['other'], 1)
def test_no_br_allowed(self):
"""We convert all <br/> tags to <p> tags"""
doc = OriginalDocument(load_snippet('document_min.html'))
self.assertIsNone(doc.dom.find('.//br'))
def test_empty_title(self):
"""We convert all <br/> tags to <p> tags"""
document = OriginalDocument("<html><head><title></title></head><body></body></html>")
self.assertEqual(document.title, "")
def test_title_only_with_tags(self):
"""We convert all <br/> tags to <p> tags"""
document = OriginalDocument("<html><head><title><em></em></title></head><body></body></html>")
self.assertEqual(document.title, "")
def test_no_title(self):
"""We convert all <br/> tags to <p> tags"""
document = OriginalDocument("<html><head></head><body></body></html>")
self.assertEqual(document.title, "")
def test_encoding(self):
text = "ľščťžýáíéäúňôůě".encode("iso-8859-2")
encoding = determine_encoding(text)
def test_encoding_short(self):
text = "ľščťžýáíé".encode("iso-8859-2")
encoding = determine_encoding(text)
self.assertEqual(encoding, "utf8")
text = to_bytes("ľščťžýáíé")
encoding = determine_encoding(text)
self.assertEqual(encoding, "utf8")

@ -1,21 +1,21 @@
# -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from lxml.etree import tounicode
from lxml.html import document_fromstring
from lxml.html import fragment_fromstring
try:
# Python < 2.7
import unittest2 as unittest
except ImportError:
import unittest
from breadability.readable import Article
from breadability.readable import get_class_weight
from breadability.readable import get_link_density
from breadability.readable import is_bad_link
from breadability.readable import score_candidates
from breadability.readable import transform_misused_divs_into_paragraphs
from breadability.scoring import ScoredNode
from breadability.tests import load_snippet
from breadability.tests import load_article
from readability._compat import to_unicode
from readability.readable import Article
from readability.readable import get_class_weight
from readability.readable import get_link_density
from readability.readable import is_bad_link
from readability.readable import score_candidates
from readability.readable import leaf_div_elements_into_paragraphs
from readability.scoring import ScoredNode
from .compat import unittest
from .utils import load_snippet, load_article
class TestReadableDocument(unittest.TestCase):
@ -25,12 +25,12 @@ class TestReadableDocument(unittest.TestCase):
"""We get back an element tree from our original doc"""
doc = Article(load_snippet('document_min.html'))
# We get back the document as a div tag currently by default.
self.assertEqual(doc._readable.tag, 'div')
self.assertEqual(doc.readable_dom.tag, 'div')
def test_doc_no_scripts_styles(self):
"""Step #1 remove all scripts from the document"""
doc = Article(load_snippet('document_scripts.html'))
readable = doc._readable
readable = doc.readable_dom
self.assertEqual(readable.findall(".//script"), [])
self.assertEqual(readable.findall(".//style"), [])
self.assertEqual(readable.findall(".//link"), [])
@ -42,8 +42,8 @@ class TestReadableDocument(unittest.TestCase):
"""
doc = Article(load_snippet('document_min.html'))
self.assertEqual(doc._readable.tag, 'div')
self.assertEqual(doc._readable.get('id'), 'readabilityBody')
self.assertEqual(doc.readable_dom.tag, 'div')
self.assertEqual(doc.readable_dom.get('id'), 'readabilityBody')
def test_body_doesnt_exist(self):
"""If we can't find a body, then we create one.
@ -52,8 +52,8 @@ class TestReadableDocument(unittest.TestCase):
"""
doc = Article(load_snippet('document_no_body.html'))
self.assertEqual(doc._readable.tag, 'div')
self.assertEqual(doc._readable.get('id'), 'readabilityBody')
self.assertEqual(doc.readable_dom.tag, 'div')
self.assertEqual(doc.readable_dom.get('id'), 'readabilityBody')
def test_bare_content(self):
"""If the document is just pure content, no html tags we should be ok
@ -62,16 +62,16 @@ class TestReadableDocument(unittest.TestCase):
"""
doc = Article(load_snippet('document_only_content.html'))
self.assertEqual(doc._readable.tag, 'div')
self.assertEqual(doc._readable.get('id'), 'readabilityBody')
self.assertEqual(doc.readable_dom.tag, 'div')
self.assertEqual(doc.readable_dom.get('id'), 'readabilityBody')
def test_no_content(self):
"""Without content we supply an empty unparsed doc."""
doc = Article('')
self.assertEqual(doc._readable.tag, 'div')
self.assertEqual(doc._readable.get('id'), 'readabilityBody')
self.assertEqual(doc._readable.get('class'), 'parsing-error')
self.assertEqual(doc.readable_dom.tag, 'div')
self.assertEqual(doc.readable_dom.get('id'), 'readabilityBody')
self.assertEqual(doc.readable_dom.get('class'), 'parsing-error')
class TestCleaning(unittest.TestCase):
@ -80,7 +80,7 @@ class TestCleaning(unittest.TestCase):
def test_unlikely_hits(self):
"""Verify we wipe out things from our unlikely list."""
doc = Article(load_snippet('test_readable_unlikely.html'))
readable = doc._readable
readable = doc.readable_dom
must_not_appear = ['comment', 'community', 'disqus', 'extra', 'foot',
'header', 'menu', 'remark', 'rss', 'shoutbox', 'sidebar',
'sponsor', 'ad-break', 'agegate', 'pagination' '', 'pager',
@ -119,8 +119,8 @@ class TestCleaning(unittest.TestCase):
test_doc = document_fromstring(test_html)
self.assertEqual(
tounicode(
transform_misused_divs_into_paragraphs(test_doc)),
u"<html><body><p>simple</p></body></html>"
leaf_div_elements_into_paragraphs(test_doc)),
to_unicode("<html><body><p>simple</p></body></html>")
)
test_html2 = ('<html><body><div>simple<a href="">link</a>'
@ -128,8 +128,18 @@ class TestCleaning(unittest.TestCase):
test_doc2 = document_fromstring(test_html2)
self.assertEqual(
tounicode(
transform_misused_divs_into_paragraphs(test_doc2)),
u'<html><body><p>simple<a href="">link</a></p></body></html>'
leaf_div_elements_into_paragraphs(test_doc2)),
to_unicode('<html><body><p>simple<a href="">link</a></p></body></html>')
)
def test_dont_transform_div_with_div(self):
"""Verify that only child <div> element is replaced by <p>."""
dom = document_fromstring(
"<html><body><div>text<div>child</div>aftertext</div></body></html>")
self.assertEqual(
tounicode(leaf_div_elements_into_paragraphs(dom)),
to_unicode("<html><body><div>text<p>child</p>aftertext</div></body></html>")
)
def test_bad_links(self):
@ -173,7 +183,7 @@ class TestCandidateNodes(unittest.TestCase):
def test_article_enables_candidate_access(self):
"""Candidates are accessible after document processing."""
doc = Article(load_article('ars/ars.001.html'))
doc = Article(load_article('ars.001.html'))
self.assertTrue(hasattr(doc, 'candidates'))
@ -206,51 +216,45 @@ class TestScoringNodes(unittest.TestCase):
def test_we_get_candidates(self):
"""Processing candidates should get us a list of nodes to try out."""
# we'll start out using our first real test document
test_nodes = []
doc = document_fromstring(load_article('ars/ars.001.html'))
for node in doc.getiterator():
if node.tag in ['p', 'td', 'pre']:
test_nodes.append(node)
doc = document_fromstring(load_article("ars.001.html"))
test_nodes = tuple(doc.iter("p", "td", "pre"))
candidates = score_candidates(test_nodes)
# this might change as we tweak our algorithm, but if it does change,
# this might change as we tweak our algorithm, but if it does,
# it signifies we need to look at what we changed.
self.assertEqual(len(candidates.keys()), 6)
self.assertEqual(len(candidates.keys()), 37)
# one of these should have a decent score
scores = sorted([c.content_score for c in candidates.values()])
scores = sorted(c.content_score for c in candidates.values())
self.assertTrue(scores[-1] > 100)
def test_bonus_score_per_100_chars_in_p(self):
"""Nodes get 1pt per 100 characters up to 3 max points"""
def build_doc(length):
div = '<div id="content" class=""><p>{0}</p></div>'
document_str = '<html><body>{0}</body></html>'
content = 'c' * length
test_div = div.format(content)
doc = document_fromstring(document_str.format(test_div))
test_nodes = []
for node in doc.getiterator():
if node.tag == 'p':
test_nodes.append(node)
return test_nodes
test_nodes = build_doc(400)
"""Nodes get 1 point per 100 characters up to max. 3 points."""
def build_candidates(length):
html = "<p>%s</p>" % ("c" * length)
node = fragment_fromstring(html)
return [node]
test_nodes = build_candidates(50)
candidates = score_candidates(test_nodes)
pscore_50 = max(c.content_score for c in candidates.values())
test_nodes = build_candidates(100)
candidates = score_candidates(test_nodes)
pscore_400 = max([c.content_score for c in candidates.values()])
pscore_100 = max(c.content_score for c in candidates.values())
test_nodes = build_doc(100)
test_nodes = build_candidates(300)
candidates = score_candidates(test_nodes)
pscore_100 = max([c.content_score for c in candidates.values()])
pscore_300 = max(c.content_score for c in candidates.values())
test_nodes = build_doc(50)
test_nodes = build_candidates(400)
candidates = score_candidates(test_nodes)
pscore_50 = max([c.content_score for c in candidates.values()])
pscore_400 = max(c.content_score for c in candidates.values())
self.assertEqual(pscore_100, pscore_50 + 1)
self.assertEqual(pscore_400, pscore_50 + 3)
self.assertAlmostEqual(pscore_50 + 0.5, pscore_100)
self.assertAlmostEqual(pscore_100 + 2.0, pscore_300)
self.assertAlmostEqual(pscore_300, pscore_400)
class TestLinkDensityScoring(unittest.TestCase):
@ -258,24 +262,69 @@ class TestLinkDensityScoring(unittest.TestCase):
def test_link_density(self):
"""Test that we get a link density"""
doc = document_fromstring(load_article('ars/ars.001.html'))
for node in doc.getiterator():
if node.tag in ['p', 'td', 'pre']:
density = get_link_density(node)
doc = document_fromstring(load_article('ars.001.html'))
for node in doc.iter('p', 'td', 'pre'):
density = get_link_density(node)
# the density must be between 0, 1
self.assertTrue(density >= 0.0 and density <= 1.0)
# the density must be between 0, 1
self.assertTrue(density >= 0.0 and density <= 1.0)
class TestSiblings(unittest.TestCase):
"""Siblings will be included if their content is related."""
@unittest.skip("Not implemented yet.")
def test_bad_siblings_not_counted(self):
""""""
assert True, "TBD"
raise NotImplementedError()
@unittest.skip("Not implemented yet.")
def test_good_siblings_counted(self):
""""""
assert True, "TBD"
raise NotImplementedError()
class TestMainText(unittest.TestCase):
def test_empty(self):
article = Article("")
annotated_text = article.main_text
self.assertEqual(annotated_text, [])
def test_no_annotations(self):
article = Article("<div><p>This is text with no annotations</p></div>")
annotated_text = article.main_text
self.assertEqual(annotated_text,
[(("This is text with no annotations", None),)])
def test_one_annotation(self):
article = Article("<div><p>This is text\r\twith <del>no</del> annotations</p></div>")
annotated_text = article.main_text
expected = [(
("This is text\nwith", None),
("no", ("del",)),
("annotations", None),
)]
self.assertEqual(annotated_text, expected)
def test_simple_snippet(self):
snippet = Article(load_snippet("annotated_1.html"))
annotated_text = snippet.main_text
expected = [
(
("Paragraph is more", None),
("better", ("em",)),
(".\nThis text is very", None),
("pretty", ("strong",)),
("'cause she's girl.", None),
),
(
("This is not", None),
("crap", ("big",)),
("so", None),
("readability", ("dfn",)),
("me :)", None),
)
]
self.assertEqual(annotated_text, expected)

@ -1,21 +1,50 @@
# -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
import re
from operator import attrgetter
from lxml.html import document_fromstring
from lxml.html import fragment_fromstring
from operator import attrgetter
try:
# Python < 2.7
import unittest2 as unittest
except ImportError:
import unittest
from breadability.readable import Article
from breadability.scoring import check_node_attr
from breadability.scoring import get_class_weight
from breadability.scoring import ScoredNode
from breadability.scoring import score_candidates
from breadability.readable import get_link_density
from breadability.readable import is_unlikely_node
from breadability.tests import load_snippet
from readability.readable import Article
from readability.scoring import check_node_attributes
from readability.scoring import get_class_weight
from readability.scoring import ScoredNode
from readability.scoring import score_candidates
from readability.scoring import generate_hash_id
from readability.readable import get_link_density
from readability.readable import is_unlikely_node
from .compat import unittest
from .utils import load_snippet
class TestHashId(unittest.TestCase):
def test_generate_hash(self):
dom = fragment_fromstring("<div>ľščťžýáí</div>")
generate_hash_id(dom)
def test_hash_from_id_on_exception(self):
generate_hash_id(None)
def test_different_hashes(self):
dom = fragment_fromstring("<div>ľščťžýáí</div>")
hash_dom = generate_hash_id(dom)
hash_none = generate_hash_id(None)
self.assertNotEqual(hash_dom, hash_none)
def test_equal_hashes(self):
dom1 = fragment_fromstring("<div>ľščťžýáí</div>")
dom2 = fragment_fromstring("<div>ľščťžýáí</div>")
hash_dom1 = generate_hash_id(dom1)
hash_dom2 = generate_hash_id(dom2)
self.assertEqual(hash_dom1, hash_dom2)
hash_none1 = generate_hash_id(None)
hash_none2 = generate_hash_id(None)
self.assertEqual(hash_none1, hash_none2)
class TestCheckNodeAttr(unittest.TestCase):
@ -27,33 +56,33 @@ class TestCheckNodeAttr(unittest.TestCase):
"""
def test_has_class(self):
"""Verify that a node has a class in our set."""
test_re = re.compile('test1|test2', re.I)
test_pattern = re.compile('test1|test2', re.I)
test_node = fragment_fromstring('<div/>')
test_node.set('class', 'test2 comment')
self.assertTrue(check_node_attr(test_node, 'class', test_re))
self.assertTrue(check_node_attributes(test_pattern, test_node, 'class'))
def test_has_id(self):
"""Verify that a node has an id in our set."""
test_re = re.compile('test1|test2', re.I)
test_pattern = re.compile('test1|test2', re.I)
test_node = fragment_fromstring('<div/>')
test_node.set('id', 'test2')
self.assertTrue(check_node_attr(test_node, 'id', test_re))
self.assertTrue(check_node_attributes(test_pattern, test_node, 'id'))
def test_lacks_class(self):
"""Verify that a node does not have a class in our set."""
test_re = re.compile('test1|test2', re.I)
test_pattern = re.compile('test1|test2', re.I)
test_node = fragment_fromstring('<div/>')
test_node.set('class', 'test4 comment')
self.assertFalse(check_node_attr(test_node, 'class', test_re))
self.assertFalse(check_node_attributes(test_pattern, test_node, 'class'))
def test_lacks_id(self):
"""Verify that a node does not have an id in our set."""
test_re = re.compile('test1|test2', re.I)
test_pattern = re.compile('test1|test2', re.I)
test_node = fragment_fromstring('<div/>')
test_node.set('id', 'test4')
self.assertFalse(check_node_attr(test_node, 'id', test_re))
self.assertFalse(check_node_attributes(test_pattern, test_node, 'id'))
class TestLinkDensity(unittest.TestCase):
@ -61,20 +90,17 @@ class TestLinkDensity(unittest.TestCase):
def test_empty_node(self):
"""An empty node doesn't have much of a link density"""
empty_div = u"<div></div>"
doc = Article(empty_div)
assert 0 == get_link_density(doc._readable), "Link density is nadda"
doc = Article("<div></div>")
self.assertEqual(get_link_density(doc.readable_dom), 0.0)
def test_small_doc_no_links(self):
doc = Article(load_snippet('document_min.html'))
assert 0 == get_link_density(doc._readable), "Still no link density"
self.assertEqual(get_link_density(doc.readable_dom), 0.0)
def test_several_links(self):
"""This doc has a 3 links with the majority of content."""
doc = Article(load_snippet('document_absolute_url.html'))
self.assertAlmostEqual(
get_link_density(doc._readable), 0.349,
places=3)
self.assertAlmostEqual(get_link_density(doc.readable_dom), 22/37)
class TestClassWeight(unittest.TestCase):
@ -82,9 +108,7 @@ class TestClassWeight(unittest.TestCase):
def test_no_matches_zero(self):
"""If you don't have the attribute then you get a weight of 0"""
empty_div = u"<div></div>"
node = fragment_fromstring(empty_div)
node = fragment_fromstring("<div></div>")
self.assertEqual(get_class_weight(node), 0)
def test_id_hits(self):
@ -224,7 +248,7 @@ class TestScoreCandidates(unittest.TestCase):
def test_simple_candidate_set(self):
"""Tests a simple case of two candidate nodes"""
doc = """
html = """
<html>
<body>
<div class="content">
@ -238,18 +262,16 @@ class TestScoreCandidates(unittest.TestCase):
</body>
</html>
"""
d_elem = document_fromstring(doc)
divs = d_elem.findall(".//div")
f_elem = divs[0]
s_elem = divs[1]
res = score_candidates([f_elem, s_elem])
ordered = sorted([c for c in res.values()],
key=attrgetter('content_score'),
reverse=True)
# the body element should have a higher score
self.assertTrue(ordered[0].node.tag == 'body')
# the html element is the outer should come in second
self.assertTrue(ordered[1].node.tag == 'html')
dom = document_fromstring(html)
div_nodes = dom.findall(".//div")
candidates = score_candidates(div_nodes)
ordered = sorted((c for c in candidates.values()), reverse=True,
key=attrgetter("content_score"))
self.assertEqual(ordered[0].node.tag, "div")
self.assertEqual(ordered[0].node.attrib["class"], "content")
self.assertEqual(ordered[1].node.tag, "body")
self.assertEqual(ordered[2].node.tag, "html")
self.assertEqual(ordered[3].node.tag, "div")
self.assertEqual(ordered[3].node.attrib["class"], "footer")

@ -0,0 +1,23 @@
# -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from os.path import abspath, dirname, join
TEST_DIR = abspath(dirname(__file__))
def load_snippet(file_name):
"""Helper to fetch in the content of a test snippet."""
file_path = join(TEST_DIR, "data/snippets", file_name)
with open(file_path, "rb") as file:
return file.read()
def load_article(file_name):
"""Helper to fetch in the content of a test article."""
file_path = join(TEST_DIR, "data/articles", file_name)
with open(file_path, "rb") as file:
return file.read()
Loading…
Cancel
Save