Richard Harding
a46dc14251
Tried to pep8 all the things but gave up when I got close.
2012-04-16 21:23:19 -04:00
Richard Harding
5a98e2c1b8
Correct appending and allow for document only
- Fix the appending of siblings to the correct nested element
- Add a document-only flag so that you can get a DOM tree you can nest
yourself without html/body tags.
2012-04-16 20:55:13 -04:00
Richard Harding
edccec5d3b
Work on why we have an empty <body/> tag
- Seems to come because the sanitizer ends up with two nodes, not one. The
first is an empty body, the second is the article div.
- Fix up the tabs so we can work with the file. Needs lots of pep8 love.
- Implement an initial hack that at least gets it working atm.
- Start to add test cases, sample html files we can test against, etc.
2012-04-16 17:13:24 -04:00
Jan Weiß
3cdc3d67af
Adding comment about oversight in transform_misused_divs_into_paragraphs().
2012-03-24 10:00:07 +01:00
Jan Weiß
960f885edf
Continue early in remove_unlikely_candidates() in case there is neither a class nor an id attribute.
2012-03-24 09:56:08 +01:00
Jan Weiß
6b3961cd30
Fixing gap in node_length coverage.
2012-03-24 09:54:41 +01:00
facundo
bb93ae1e5f
fixed a small issue on the Document score_paragraphs method
2012-02-06 23:05:26 -05:00
Yuri Baburov
11c4d95411
Fixed indentation, encoding issue and README bug. Thanks to Greg Jastrab. Bump version to 0.2.3
2011-07-27 02:05:16 +07:00
Yuri Baburov
61715dca0a
Bump to version 0.2
2011-06-30 12:08:46 +07:00
Yuri Baburov
c2ec1d1c38
Sorted out unicode issues, thanks to Lee Semel.
2011-06-30 11:51:16 +07:00
Yuri Baburov
f55f16baa1
Updated scoring algorithm to match readability.js v1.7.1
2011-06-01 12:16:32 +07:00
Yuri Baburov
96f476181c
Improved title shortener method, and added it to the Document class.
2011-05-11 19:58:27 +07:00
Yuri Baburov
dada82099b
Moved to lxml (based on decruft version); better encoding recognition.
2011-05-03 11:34:29 +07:00
gfxmonk
2b6a2d3db4
removing empty paragraphs is not very useful, and can break some (stupid) websites
2010-05-01 00:08:23 +10:00
gfxmonk
1d862a00c3
fixed bug where only immediate text was being considered for weights, instead of all nested text
2010-05-01 00:07:30 +10:00
gfxmonk
0eacd959a4
failsafe parsing and more logging
2010-04-30 22:34:53 +10:00
gfxmonk
87ad057706
unicode, dammit!
2010-04-26 23:22:54 +10:00
gfxmonk
a224c5b759
minor
2010-04-24 14:24:09 +10:00
gfxmonk
f73b5f05c4
split out into content and summary methods
2010-04-24 00:41:09 +10:00
gfxmonk
c952f421b7
clean up content method and debug
2010-04-23 23:28:51 +10:00
gfxmonk
c0ca60ee26
use a more lenient parser
2010-04-23 20:51:56 +10:00
gfxmonk
ad3d52ade4
initial
2010-04-22 21:55:00 +10:00