Richard Harding
e9a5cbfe7f
Remove pdb dummy
13 years ago
Richard Harding
f1a79fb8f8
Update to make sure we don't drop the html tag when ditching elements
13 years ago
Richard Harding
46f0302ebc
rename the document_only flag to html_partial
13 years ago
Richard Harding
a46dc14251
Try to pep8 all the things but give up when I got close.
13 years ago
Richard Harding
5a98e2c1b8
Correct appending and allow for document only
- Fix the appending of siblings to the correct nested element
- Add a document only flag so that you can get a dom tree you can nest
yourself without html/body tags.
13 years ago
Richard Harding
edccec5d3b
Work on why we have an empty <body/> tag
- Seems to come because the sanitizer ends up with two nodes, not one. The
first is an empty body, the second is the article div.
- Fix up the tabs so we can work with the file. Needs lots of pep8 love.
- Implement an initial hack that at least gets it working atm.
- Start to add test cases, sample html files we can test against, etc.
13 years ago
Jan Weiß
3cdc3d67af
Adding comment about oversight in transform_misused_divs_into_paragraphs().
13 years ago
Jan Weiß
960f885edf
Continue early in remove_unlikely_candidates() in case there is neither a class nor an id attribute.
13 years ago
Jan Weiß
6b3961cd30
Fixing gap in node_length coverage.
13 years ago
facundo
bb93ae1e5f
fixed a small issue on the Document score_paragraphs method
13 years ago
Yuri Baburov
11c4d95411
Fixed indentation, encoding issue and README bug. Thanks to Greg Jastrab. Bump version to 0.2.3
13 years ago
Yuri Baburov
61715dca0a
Bump to version 0.2
13 years ago
Yuri Baburov
c2ec1d1c38
Sorted out unicode issues, thanks to Lee Semel.
13 years ago
Yuri Baburov
97ba2a0369
Debug utilities.
13 years ago
Lee Semel
f3d0a8d842
Allow passing unicode objects
13 years ago
Jerry Charumilind
8c1adc5141
Expose Document in readability package
13 years ago
Yuri Baburov
43c34bacc1
Renamed encodings to encoding to avoid conflicts with system module.
14 years ago
Yuri Baburov
f55f16baa1
Updated scoring algorithm to match readability.js v1.7.1
14 years ago
Yuri Baburov
96f476181c
Improved title shortener method, and added it to the Document class.
14 years ago
Yuri Baburov
dada82099b
Moved to lxml (based on decruft version); better encoding recognition.
14 years ago
gfxmonk
2b6a2d3db4
removing empty paragraphs is not very useful, and can break some (stupid) websites
15 years ago
gfxmonk
1d862a00c3
fixed bug where only immediate text was being considered for weights, instead of all nested text
15 years ago
gfxmonk
0eacd959a4
failsafe parsing and more logging
15 years ago
gfxmonk
87ad057706
unicode, dammit!
15 years ago
gfxmonk
a224c5b759
minor
15 years ago
gfxmonk
f73b5f05c4
split out into content and summary methods
15 years ago
gfxmonk
c952f421b7
clean up content method and debug
15 years ago
gfxmonk
c0ca60ee26
use a more lenient parser
15 years ago
gfxmonk
ad3d52ade4
initial
15 years ago