breadability - another readability Python port
==============================================

.. image:: https://api.travis-ci.org/miso-belica/breadability.png?branch=master
   :target: https://travis-ci.org/miso-belica/breadability

I've tried to work with the various forks of an ancient codebase that ported
`readability`_ to Python. The lack of tests, the unused regexes, and the
commented-out sections of code in the other Python ports just drove me nuts.

I made an effort to bring several of the better forks into one codebase, but
they've diverged so much that I just can't work with them.

So what's any sane person to do? Re-port it in my own repo, add some tests and
infrastructure, and try to make this port better. OSS FTW (and yea, NIH FML,
but oh well, I did try).

This is a pretty straight port of the JS here:

- http://code.google.com/p/arc90labs-readability/source/browse/trunk/js/readability.js#82

Installation
------------

This depends on lxml, so you'll need the libxml2 and libxslt C headers
installed before pip can compile it.

.. code-block:: bash

    $ [sudo] apt-get install libxml2-dev libxslt-dev
    $ [sudo] pip install git+git://github.com/miso-belica/breadability.git

Tests
-----

.. code-block:: bash

    $ nosetests --with-coverage --cover-package=breadability --cover-erase tests
    $ nosetests-3.3 --with-coverage --cover-package=breadability --cover-erase tests

Usage
-----

Command line
~~~~~~~~~~~~

.. code-block:: bash

    $ breadability http://wiki.python.org/moin/BeginnersGuide

Options
```````

- **b** will write out the parsed content to a temp file and open it in a
  browser for viewing.
- **d** will write out debug scoring statements to help track why a node was
  chosen as the document and why some nodes were removed from the final
  product.
- **f** will override the default behaviour of returning an HTML fragment
  (<div>) and give you back a full <html> document.
- **v** will output in verbose debug mode to help you see why it parsed the
  way it did.

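As a rough illustration of what the **b** option describes, the snippet below
writes an HTML string to a temporary file and opens it in the default browser
using only the standard library. This is a sketch, not breadability's actual
implementation; the names ``write_to_temp_file`` and ``open_in_browser`` are
hypothetical.

```python
import tempfile
import webbrowser
from pathlib import Path


def write_to_temp_file(html, suffix=".html"):
    """Write ``html`` to a new temporary file and return its path.

    Hypothetical helper for illustration, not part of breadability.
    """
    # delete=False keeps the file on disk after the handle is closed,
    # so the browser can still read it afterwards.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=suffix, delete=False, encoding="utf-8"
    ) as handle:
        handle.write(html)
        return Path(handle.name)


def open_in_browser(html):
    """Save ``html`` to a temp file and open it in the default browser."""
    path = write_to_temp_file(html)
    webbrowser.open(path.as_uri())
    return path
```

Calling ``open_in_browser(document.readable)`` on a parsed article would then
show the extracted content, assuming a browser is available on the machine.
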
Python API
~~~~~~~~~~

.. code-block:: python

    from __future__ import print_function

    from breadability.readable import Article


    if __name__ == "__main__":
        # html_as_text and source_url are placeholders: supply the page's HTML
        # as a string and the URL it was fetched from.
        document = Article(html_as_text, url=source_url)
        print(document.readable)

Work to be done
---------------

Yep, I've got some catching up to do. I don't do pagination, I've got a lot of
custom tweaks I need to get going, and there are some articles that fail to
parse. I also have more tests to write for a lot of the cleaning helpers, but
hopefully things are set up in a way that those can and will be added.

Fortunately, I need this library for my tools:

- https://bmark.us
- http://readable.bmark.us

so I really need this to be an active and improving project.

Off the top of my head, the TODO list:

- Support metadata from the parsed article [url, confidence scores, all
  candidates we thought about?]
- More tests, more thorough tests
- More sample articles we need to test against in the test_articles
- Tests that run through and check for regressions of the test_articles
- Tidying the HTML that comes out, which might help with regression tests ^^
- Multiple page articles
- Performance tuning: we do a lot of looping and re-drop some nodes that
  should be skipped. We should have a set of regression tests for this so
  that if we implement a change that blows up performance we know it right
  away.
- More docs for things, both Sphinx docs and in-code comments, to help
  understand wtf we're doing and why. That's the biggest hurdle to some of
  this stuff.

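One possible shape for the regression tests mentioned in the list above is
sketched here. It is only an assumption about how such a check could look: the
``check_regressions`` name and the ``<name>.expected.html`` file layout are
hypothetical and do not describe the existing test suite. The extractor is
passed in as a plain callable so the sketch stays independent of the library.

```python
import difflib
from pathlib import Path


def check_regressions(articles_dir, extract):
    """Return a dict mapping article file name to a unified diff per mismatch.

    Assumed (hypothetical) layout: every ``<name>.html`` input in
    ``articles_dir`` has a ``<name>.expected.html`` file next to it.
    ``extract`` is any callable taking raw HTML and returning readable HTML.
    """
    failures = {}
    for source in sorted(Path(articles_dir).glob("*.html")):
        if source.name.endswith(".expected.html"):
            continue  # skip the stored outputs themselves
        expected_path = source.with_name(source.stem + ".expected.html")
        if not expected_path.exists():
            continue  # no stored output yet, nothing to compare against
        expected = expected_path.read_text(encoding="utf-8")
        actual = extract(source.read_text(encoding="utf-8"))
        if actual != expected:
            diff = difflib.unified_diff(
                expected.splitlines(), actual.splitlines(),
                fromfile=expected_path.name, tofile="actual", lineterm="",
            )
            failures[source.name] = "\n".join(diff)
    return failures
```

The ``extract`` callable could wrap the ``Article`` call shown in the Python
API section. An empty result means no article's output changed.
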
Inspiration
~~~~~~~~~~~

- `python-readability`_
- `decruft`_
- `readability`_

.. _readability: http://code.google.com/p/arc90labs-readability/
.. _TravisCI: http://travis-ci.org/
.. _decruft: https://github.com/dcramer/decruft
.. _python-readability: https://github.com/buriy/python-readability