breadability

mirror of https://github.com/bookieio/breadability synced 2024-11-04 12:00:19 +00:00

Richard Harding 745598dff9 Update news file with initial release		2012-05-06 20:47:24 -04:00
src/breadability	Work on tweaking out parser algorithm to help find the right candidate: fixes #2	2012-05-06 20:34:42 -04:00
.gitignore	Initial bootstrap of modern package template	2012-05-02 21:43:58 -04:00
HACKING.txt	Initial bootstrap of modern package template	2012-05-02 21:43:58 -04:00
Makefile	Start to add makefile for running life	2012-05-02 23:06:05 -04:00
MANIFEST.in	Initial bootstrap of modern package template	2012-05-02 21:43:58 -04:00
NEWS.txt	Update news file with initial release	2012-05-06 20:47:24 -04:00
README.rst	Update the readme for install info	2012-05-06 20:45:44 -04:00
setup.py	Update cmd line client/interface, update doc builders	2012-05-05 13:08:24 -04:00

README.rst

breadability - another readability Python port
===============================================
I've tried to work with the various forks of an ancient codebase that ported
`readability`_ to Python. The lack of tests, the unused regexes, and the
commented-out sections of code in the other Python ports just drove me nuts.

I made an effort to merge several of the better forks into one codebase, but
they've diverged so much that I just couldn't make it work.

So what's any sane person to do? Re-port it in my own repo, add some tests and
infrastructure, and try to make this port better. OSS FTW (and yeah, NIH FML,
but oh well, I did try).

This is a pretty straight port of the JS here:

- http://code.google.com/p/arc90labs-readability/source/browse/trunk/js/readability.js#82


Installation
-------------
breadability depends on lxml, so you'll need the libxml2 and libxslt C headers
installed for pip to be able to compile it.

::

    pip install breadability


Usage
------

cmd line
~~~~~~~~~

::

    $ breadability http://wiki.python.org/moin/BeginnersGuide

Add the ``-v`` flag to get some details on how the page was actually parsed. I
want to grow that debugging output until it's enough to track the good and bad
decisions made during processing.

::

    $ breadability -v http://wiki.python.org/moin/BeginnersGuide


Using from Python
~~~~~~~~~~~~~~~~~~

::

    from breadability.readable import Article

    # html_text is the page's raw HTML; url_came_from is the URL it was fetched from.
    readable_article = Article(html_text, url=url_came_from)
    print(readable_article)
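
The snippet above leaves open where ``html_text`` and ``url_came_from`` come
from. Here is a minimal sketch of one way to supply them, assuming Python 3
and only the ``Article(html, url=...)`` call shown above. The helper names
``fetch_html`` and ``readable_from_url`` are made up for illustration and are
not part of breadability.

::

    from urllib.request import urlopen


    def fetch_html(url, timeout=10):
        """Download a page and return its markup as text."""
        with urlopen(url, timeout=timeout) as response:
            raw = response.read()
            # Fall back to UTF-8 when no charset is declared.
            charset = response.headers.get_content_charset() or "utf-8"
        return raw.decode(charset, errors="replace")


    def readable_from_url(url):
        """Fetch ``url`` and hand the markup to breadability."""
        # Imported here so fetch_html works without breadability installed.
        from breadability.readable import Article

        return Article(fetch_html(url), url=url)

Calling ``readable_from_url("http://wiki.python.org/moin/BeginnersGuide")``
should then give the same kind of object as the snippet above.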


Work to be done
---------------
Yep, I've got some catching up to do. I don't do pagination, I've got a lot of
custom tweaks I need to get going, and there are some articles that fail to
parse. I also have more tests to write for a lot of the cleaning helpers, but
hopefully things are set up in a way that those can and will be added.

Fortunately, I need this library for my tools:

- https://bmark.us
- http://readable.bmark.us

so I really need this to be an active and improving project.


Off the top of my head, the todo list:

  - Support metadata from parsed article [url, confidence scores, all
    candidates we thought about?]
  - More tests, more thorough tests
  - More sample articles we need to test against in the test_articles
  - Tests that run through and check for regressions of the test_articles
  - Tidying the HTML that comes out, might help with regression tests ^^
  - Multiple page articles
  - Performance tuning, we do a lot of looping and re-drop some nodes that
    should be skipped. We should have a set of regression tests for this so
    that if we implement a change that blows up performance we know it right
    away.
  - Get up on PyPI along with the rest of the ports
  - More docs for things, both Sphinx docs and in-code comments, to help
    understand wtf we're doing and why. That's the biggest hurdle to some of
    this stuff.
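
The regression-test item above could be sketched roughly like this. It is
only an illustration, assuming Python 3 and a hypothetical layout in which
each ``name.html`` sample sits next to a ``name.expected`` file holding the
previously accepted output. The function name ``find_regressions`` and the
``extract`` callable are likewise made up for the example.

::

    from pathlib import Path


    def find_regressions(extract, samples_dir):
        """Return the names of samples whose output has changed.

        extract is any callable that takes raw HTML text and returns the
        cleaned-up HTML text. The directory layout is hypothetical.
        """
        failures = []
        for source in sorted(Path(samples_dir).glob("*.html")):
            expected_file = source.with_suffix(".expected")
            if not expected_file.exists():
                continue  # no baseline recorded yet for this sample
            actual = extract(source.read_text(encoding="utf-8"))
            expected = expected_file.read_text(encoding="utf-8")
            # Ignore leading/trailing whitespace when comparing.
            if actual.strip() != expected.strip():
                failures.append(source.name)
        return failures

An empty list means every sample with a recorded baseline still produces the
same output. Tidying the HTML before comparing would make the check less
sensitive to harmless formatting changes.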

Helping out
------------
If you want to help, shoot me a pull request, an issue report with broken
URLs, etc.

You can ping me on IRC; I'm always in the ``#bookie`` channel on freenode.


.. _readability: http://code.google.com/p/arc90labs-readability/