.. image:: https://travis-ci.org/buriy/python-readability.svg?branch=master
    :target: https://travis-ci.org/buriy/python-readability


python-readability
==================

Given an HTML document, this library extracts the main body text and cleans it up.

This is a Python port of a Ruby port of `arc90's Readability
project <http://lab.arc90.com/experiments/readability/>`__.

Installation
------------

Install the package from PyPI with ``pip``:

.. code-block:: bash

    $ pip install readability-lxml

Usage
-----

.. code-block:: python

    >>> import requests
    >>> from readability import Document

    >>> response = requests.get('http://example.com')
    >>> doc = Document(response.text)
    >>> doc.title()
    'Example Domain'

    >>> doc.summary()
    """<html><body><div><body id="readabilityBody">\n<div>\n    <h1>Example Domain</h1>\n
    <p>This domain is established to be used for illustrative examples in documents. You may
    use this\n    domain in examples without prior coordination or asking for permission.</p>
    \n    <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
    \n</body>\n</div></body></html>"""
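
The same calls work on any HTML string, so no network request is needed.
The following is a minimal sketch, not a reference: the sample markup is
invented for illustration, and the exact summary output depends on the
installed version. ``html_partial=True`` asks ``summary()`` to return a
fragment without the ``<html>`` and ``<body>`` wrappers, and
``positive_keywords`` (see the change log) names class or id patterns that
are likely to hold the main content.

.. code-block:: python

    from readability import Document

    # Invented sample page: a short navigation block plus a longer article body.
    html = """
    <html>
      <head><title>Sample page</title></head>
      <body>
        <div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
        <div class="article">
          <p>This is the first paragraph of the sample article, and it is long
          enough, with a few commas, to look like ordinary running prose.</p>
          <p>This is the second paragraph of the sample article, which adds
          some more text, so the article block clearly outweighs the links.</p>
        </div>
      </body>
    </html>
    """

    # "article" is only a hint that matches the class name used in the sample.
    doc = Document(html, positive_keywords=["article"])

    print(doc.title())

    # Request an HTML fragment instead of a full document.
    fragment = doc.summary(html_partial=True)
    print(fragment)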

Change Log
----------

-  0.8beta Replaced XHTML output with HTML5 output in summary() call.
-  0.7.1 Support for Python 3.7. Fixed a slowdown when processing documents with lots of spaces.
-  0.7 Improved HTML5 tags handling. Fixed stripping unwanted HTML nodes (only first matching node was removed before).
-  0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 - 3.6
-  0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
-  0.4 Added video loading and allowed more images per paragraph
-  0.3 Added ``Document.encoding``, ``positive_keywords`` and ``negative_keywords``

Licensing
---------

This code is licensed under the `Apache License
2.0 <http://www.apache.org/licenses/LICENSE-2.0>`__.

Thanks to
---------

-  Latest `readability.js <https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js>`__
-  Ruby port by starrhorne and iterationlabs
-  `Python port <https://github.com/gfxmonk/python-readability>`__ by gfxmonk
-  `Decruft effort <http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/>`__ to move to lxml
-  "BR to P" fix from readability.js which improves quality for smaller texts
-  Contributions from GitHub users