Commit Graph

26 Commits

Author SHA1 Message Date
Richard Harding
a46dc14251 Try to pep8 all the things but give up when I got close. 2012-04-16 21:23:19 -04:00
Richard Harding
5a98e2c1b8 Correct appending and allow for document only
- Fix the appending of siblings to the correct nested element
- Add a document only flag so that you can get a dom tree you can nest yourself without html/body tags.
2012-04-16 20:55:13 -04:00
Richard Harding
edccec5d3b Work on why we have an empty <body/> tag
- Seems to come because the sanitizer ends up with two nodes, not one. The first is an empty body, the second is the article div.
- Fix up the tabs so we can work with the file. Needs lots of pep8 love.
- Implement an initial hack that at least gets it working atm.
- Start to add test cases, sample html files we can test against, etc.
2012-04-16 17:13:24 -04:00
Jan Weiß
3cdc3d67af Adding comment about oversight in transform_misused_divs_into_paragraphs(). 2012-03-24 10:00:07 +01:00
Jan Weiß
960f885edf Continue early in remove_unlikely_candidates() in case there is neither a class nor an id attribute. 2012-03-24 09:56:08 +01:00
Jan Weiß
6b3961cd30 Fixing gap in node_length coverage. 2012-03-24 09:54:41 +01:00
facundo
bb93ae1e5f fixed a small issue on the Document score_paragraphs method 2012-02-06 23:05:26 -05:00
Yuri Baburov
11c4d95411 Fixed indentation, encoding issue and README bug. Thanks to Greg Jastrab. Bump version to 0.2.3 2011-07-27 02:05:16 +07:00
Yuri Baburov
61715dca0a Bump to version 0.2 2011-06-30 12:08:46 +07:00
Yuri Baburov
c2ec1d1c38 Sorted out unicode issues, thanks to Lee Semel. 2011-06-30 11:51:16 +07:00
Yuri Baburov
97ba2a0369 Debug utilities. 2011-06-30 11:46:37 +07:00
Lee Semel
f3d0a8d842 Allow passing unicode objects 2011-06-30 12:17:23 +08:00
Jerry Charumilind
8c1adc5141 Expose Document in readability package 2011-06-30 12:17:08 +08:00
Yuri Baburov
43c34bacc1 Renamed encodings to encoding to avoid conflicts with system module. 2011-06-16 17:53:02 +07:00
Yuri Baburov
f55f16baa1 Updated scoring algorithm to match readability.js v1.7.1 2011-06-01 12:16:32 +07:00
Yuri Baburov
96f476181c Improved title shortener method, and added it to the Document class. 2011-05-11 19:58:27 +07:00
Yuri Baburov
dada82099b Moved to lxml (based on decruft version); better encoding recognition. 2011-05-03 11:34:29 +07:00
gfxmonk
2b6a2d3db4 removing empty paragraphs is not very useful, and can break some (stupid) websites 2010-05-01 00:08:23 +10:00
gfxmonk
1d862a00c3 fixed bug where only immediate text was being considered for weights, instead of all nested text 2010-05-01 00:07:30 +10:00
gfxmonk
0eacd959a4 failsafe parsing and more logging 2010-04-30 22:34:53 +10:00
gfxmonk
87ad057706 unicode, dammit! 2010-04-26 23:22:54 +10:00
gfxmonk
a224c5b759 minor 2010-04-24 14:24:09 +10:00
gfxmonk
f73b5f05c4 split out into content and summary methods 2010-04-24 00:41:09 +10:00
gfxmonk
c952f421b7 clean up content method and debug 2010-04-23 23:28:51 +10:00
gfxmonk
c0ca60ee26 use a more lenient parser 2010-04-23 20:51:56 +10:00
gfxmonk
ad3d52ade4 initial 2010-04-22 21:55:00 +10:00