Commit Graph

51 Commits

Author SHA1 Message Date
Richard Harding
f1a79fb8f8 Update to make sure we don't drop the html tag when ditching elements 2012-04-17 11:04:36 -04:00
Richard Harding
46f0302ebc rename the document_only flag to html_partial 2012-04-17 10:17:14 -04:00
Richard Harding
b8fc399fac Fix rebase issue in the Makefile 2012-04-17 09:20:23 -04:00
Richard Harding
82804b664d Update .gitignore file for venv and nosetests. 2012-04-17 08:47:04 -04:00
Richard Harding
4376eedc13 Add makefile testing, building, uploading.
- Adds a makefile with helpers
- make all will setup a virtualenv and get deps
- make test will install test deps and run nosetests
- make version_update will open the setup.py for updating version string
- make upload will build and upload sdist to pypi
2012-04-17 08:45:42 -04:00
Richard Harding
8d3e39f04e Update readme 2012-04-16 21:24:33 -04:00
Richard Harding
a46dc14251 Try to pep8 all the things but give up when I got close. 2012-04-16 21:23:19 -04:00
Richard Harding
5a98e2c1b8 Correct appending and allow for document only
- Fix the appending of siblings to the correct nested element
- Add a document only flag so that you can get a dom tree you can nest yourself without html/body tags.
2012-04-16 20:55:13 -04:00
Richard Harding
edccec5d3b Work on why we have an empty <body/> tag
- Seems to come because the sanitizer ends up with two nodes, not one. The first is an empty body, the second is the article div.
- Fix up the tabs so we can work with the file. Needs lots of pep8 love.
- Implement an initial hack that at least gets it working atm.
- Start to add test cases, sample html files we can test against, etc.
2012-04-16 17:13:24 -04:00
Yuri Baburov
ab783b25b7 Merge pull request #11 from JanX2/master
Fixing gap in node_length coverage (length=80 was missed)
Continue early in remove_unlikely_candidates() in case there is neither a class nor an id attribute.
Adding comment about oversight in transform_misused_divs_into_paragraphs
2012-03-25 22:39:29 -07:00
Jan Weiß
3cdc3d67af Adding comment about oversight in transform_misused_divs_into_paragraphs(). 2012-03-24 10:00:07 +01:00
Jan Weiß
960f885edf Continue early in remove_unlikely_candidates() in case there is neither a class nor an id attribute. 2012-03-24 09:56:08 +01:00
Jan Weiß
6b3961cd30 Fixing gap in node_length coverage. 2012-03-24 09:54:41 +01:00
Yuri Baburov
f9b604c9a8 Merge pull request #10 from facundo/master
Fix: Document.score_paragraphs should use ._html() not .html in case it's used not from .summary() method.
Thanks to facundo.
2012-02-07 20:02:05 -08:00
facundo
bb93ae1e5f fixed a small issue on the Document score_paragraphs method 2012-02-06 23:05:26 -05:00
Yuri Baburov
fc6a500298 Merge pull request #9 from Psycojoker/master
Add lxml to the dependencies list in the setup.py
Please note that lxml sometimes can't be built from sources, lots of people use binary distributions, which setup.py/pip can't handle properly!
2012-01-07 23:59:08 -08:00
Laurent Peuch
1583d8a794 add lxml missing dependency 2012-01-07 21:48:46 +01:00
Yuri Baburov
11c4d95411 Fixed indentation, encoding issue and README bug. Thanks to Greg Jastrab. Bump version to 0.2.3 2011-07-27 02:05:16 +07:00
Yuri Baburov
6bf4948e69 More README fixes for pypi and github. Bump to version 0.2.2 2011-07-26 13:42:20 +07:00
Yuri Baburov
f189ab905d Fixed README for pypi. 2011-07-02 00:30:31 +07:00
Yuri Baburov
61715dca0a Bump to version 0.2 2011-06-30 12:08:46 +07:00
Yuri Baburov
21906f1c44 Better setup.py, now we're "readability-lxml" in pypi. Thanks to Jerry Charumilind. 2011-06-30 11:54:07 +07:00
Yuri Baburov
c2ec1d1c38 Sorted out unicode issues, thanks to Lee Semel. 2011-06-30 11:51:16 +07:00
Yuri Baburov
45781a600f Added command-line usage 2011-06-30 11:47:14 +07:00
Yuri Baburov
97ba2a0369 Debug utilities. 2011-06-30 11:46:37 +07:00
Lee Semel
f3d0a8d842 Allow passing unicode objects 2011-06-30 12:17:23 +08:00
Jerry Charumilind
ad38fac40a Add chardet to installation requirements 2011-06-30 12:17:08 +08:00
Jerry Charumilind
8c1adc5141 Expose Document in readability package 2011-06-30 12:17:08 +08:00
Jerry Charumilind
bae87079e9 Change to automatically find packages 2011-06-30 12:17:07 +08:00
Jerry Charumilind
5bf5192d03 Add version number to track changes more easily 2011-06-30 12:17:07 +08:00
Yuri Baburov
7a1e063c22 Updated setup.py to my fork, changed package name to lxml-readability 2011-06-25 23:14:01 -07:00
Yuri Baburov
43c34bacc1 Renamed encodings to encoding to avoid conflicts with system module. 2011-06-16 17:53:02 +07:00
Yuri Baburov
096d4db6ce Added usage 2011-06-14 04:33:15 -07:00
Yuri Baburov
f55f16baa1 Updated scoring algorithm to match readability.js v1.7.1 2011-06-01 12:16:32 +07:00
Yuri Baburov
96f476181c Improved title shortener method, and added it to the Document class. 2011-05-11 19:58:27 +07:00
Yuri Baburov
f925e3ef05 Corrected README 2011-05-02 21:45:23 -07:00
Yuri Baburov
dada82099b Moved to lxml (based on decruft version); better encoding recognition. 2011-05-03 11:34:29 +07:00
gfxmonk
b5639a0822 well that was quick; first fork added 2011-01-20 23:03:30 +11:00
gfxmonk
324e280e16 added note to readme to make it clear that I'm not actively working on this library 2011-01-20 22:28:01 +11:00
Tim Cuthbertson
7ebbcc03d2 made setup.py executable 2010-09-16 22:01:13 +10:00
Sean Brant
a5d47a1129 added setup.py 2010-09-14 19:18:35 -05:00
gfxmonk
2b6a2d3db4 removing empty paragraphs is not very useful, and can break some (stupid) websites 2010-05-01 00:08:23 +10:00
gfxmonk
1d862a00c3 fixed bug where only immediate text was being considered for weights, instead of all nested text 2010-05-01 00:07:30 +10:00
gfxmonk
0eacd959a4 failsafe parsing and more logging 2010-04-30 22:34:53 +10:00
gfxmonk
87ad057706 unicode, dammit! 2010-04-26 23:22:54 +10:00
gfxmonk
a224c5b759 minor 2010-04-24 14:24:09 +10:00
gfxmonk
e42a39e1aa modified readme 2010-04-24 13:47:35 +10:00
gfxmonk
f73b5f05c4 split out into content and summary methods 2010-04-24 00:41:09 +10:00
gfxmonk
c952f421b7 clean up content method and debug 2010-04-23 23:28:51 +10:00
gfxmonk
c0ca60ee26 use a more lenient parser 2010-04-23 20:51:56 +10:00