Richard Harding
f1a79fb8f8
Update to make sure we don't drop the html tag when ditching elements
2012-04-17 11:04:36 -04:00
Richard Harding
46f0302ebc
rename the document_only flag to html_partial
2012-04-17 10:17:14 -04:00
Richard Harding
b8fc399fac
Fix rebase issue in the Makefile
2012-04-17 09:20:23 -04:00
Richard Harding
82804b664d
Update .gitignore file for venv and nosetests.
2012-04-17 08:47:04 -04:00
Richard Harding
4376eedc13
Add makefile testing, building, uploading.
...
- Adds a makefile with helpers
- make all will set up a virtualenv and get deps
- make test will install test deps and run nosetests
- make version_update will open the setup.py for updating version string
- make upload will build and upload sdist to pypi
2012-04-17 08:45:42 -04:00
Richard Harding
8d3e39f04e
Update readme
2012-04-16 21:24:33 -04:00
Richard Harding
a46dc14251
Try to pep8 all the things but give up when I got close.
2012-04-16 21:23:19 -04:00
Richard Harding
5a98e2c1b8
Correct appending and allow for document only
...
- Fix the appending of siblings to the correct nested element
- Add a document only flag so that you can get a dom tree you can nest
yourself without html/body tags.
2012-04-16 20:55:13 -04:00
Richard Harding
edccec5d3b
Work on why we have an empty <body/> tag
...
- Seems to come because the sanitizer ends up with two nodes, not one. The
first is an empty body, the second is the article div.
- Fix up the tabs so we can work with the file. Needs lots of pep8 love.
- Implement an initial hack that at least gets it working atm.
- Start to add test cases, sample html files we can test against, etc.
2012-04-16 17:13:24 -04:00
Yuri Baburov
ab783b25b7
Merge pull request #11 from JanX2/master
...
Fixing gap in node_length coverage (length=80 was missed)
Continue early in remove_unlikely_candidates() in case there is neither a class nor an id attribute.
Adding comment about oversight in transform_misused_divs_into_paragraphs
2012-03-25 22:39:29 -07:00
Jan Weiß
3cdc3d67af
Adding comment about oversight in transform_misused_divs_into_paragraphs().
2012-03-24 10:00:07 +01:00
Jan Weiß
960f885edf
Continue early in remove_unlikely_candidates() in case there is neither a class nor an id attribute.
2012-03-24 09:56:08 +01:00
Jan Weiß
6b3961cd30
Fixing gap in node_length coverage.
2012-03-24 09:54:41 +01:00
Yuri Baburov
f9b604c9a8
Merge pull request #10 from facundo/master
...
Fix: Document.score_paragraphs should use ._html() not .html in case it's not called from the .summary() method.
Thanks to facundo.
2012-02-07 20:02:05 -08:00
facundo
bb93ae1e5f
fixed a small issue on the Document score_paragraphs method
2012-02-06 23:05:26 -05:00
Yuri Baburov
fc6a500298
Merge pull request #9 from Psycojoker/master
...
Add lxml to the dependencies list in the setup.py
Please note that lxml sometimes can't be built from source; lots of people use binary distributions, which setup.py/pip can't handle properly!
2012-01-07 23:59:08 -08:00
Laurent Peuch
1583d8a794
add missing lxml dependency
2012-01-07 21:48:46 +01:00
Yuri Baburov
11c4d95411
Fixed indentation, encoding issue and README bug. Thanks to Greg Jastrab. Bump version to 0.2.3
2011-07-27 02:05:16 +07:00
Yuri Baburov
6bf4948e69
More README fixes for pypi and github. Bump to version 0.2.2
2011-07-26 13:42:20 +07:00
Yuri Baburov
f189ab905d
Fixed README for pypi.
2011-07-02 00:30:31 +07:00
Yuri Baburov
61715dca0a
Bump to version 0.2
2011-06-30 12:08:46 +07:00
Yuri Baburov
21906f1c44
Better setup.py, now we're "readability-lxml" in pypi. Thanks to Jerry Charumilind.
2011-06-30 11:54:07 +07:00
Yuri Baburov
c2ec1d1c38
Sorted out unicode issues, thanks to Lee Semel.
2011-06-30 11:51:16 +07:00
Yuri Baburov
45781a600f
Added command-line usage
2011-06-30 11:47:14 +07:00
Yuri Baburov
97ba2a0369
Debug utilities.
2011-06-30 11:46:37 +07:00
Lee Semel
f3d0a8d842
Allow passing unicode objects
2011-06-30 12:17:23 +08:00
Jerry Charumilind
ad38fac40a
Add chardet to installation requirements
2011-06-30 12:17:08 +08:00
Jerry Charumilind
8c1adc5141
Expose Document in readability package
2011-06-30 12:17:08 +08:00
Jerry Charumilind
bae87079e9
Change to automatically find packages
2011-06-30 12:17:07 +08:00
Jerry Charumilind
5bf5192d03
Add version number to track changes more easily
2011-06-30 12:17:07 +08:00
Yuri Baburov
7a1e063c22
Updated setup.py to my fork, changed package name to lxml-readability
2011-06-25 23:14:01 -07:00
Yuri Baburov
43c34bacc1
Renamed encodings to encoding to avoid conflicts with system module.
2011-06-16 17:53:02 +07:00
Yuri Baburov
096d4db6ce
Added usage
2011-06-14 04:33:15 -07:00
Yuri Baburov
f55f16baa1
Updated scoring algorithm to match readability.js v1.7.1
2011-06-01 12:16:32 +07:00
Yuri Baburov
96f476181c
Improved title shortener method, and added it to the Document class.
2011-05-11 19:58:27 +07:00
Yuri Baburov
f925e3ef05
Corrected README
2011-05-02 21:45:23 -07:00
Yuri Baburov
dada82099b
Moved to lxml (based on decruft version); better encoding recognition.
2011-05-03 11:34:29 +07:00
gfxmonk
b5639a0822
well that was quick; first fork added
2011-01-20 23:03:30 +11:00
gfxmonk
324e280e16
added note to readme to make it clear that I'm not actively working on this library
2011-01-20 22:28:01 +11:00
Tim Cuthbertson
7ebbcc03d2
made setup.py executable
2010-09-16 22:01:13 +10:00
Sean Brant
a5d47a1129
added setup.py
2010-09-14 19:18:35 -05:00
gfxmonk
2b6a2d3db4
removing empty paragraphs is not very useful, and can break some (stupid) websites
2010-05-01 00:08:23 +10:00
gfxmonk
1d862a00c3
fixed bug where only immediate text was being considered for weights, instead of all nested text
2010-05-01 00:07:30 +10:00
gfxmonk
0eacd959a4
failsafe parsing and more logging
2010-04-30 22:34:53 +10:00
gfxmonk
87ad057706
unicode, dammit!
2010-04-26 23:22:54 +10:00
gfxmonk
a224c5b759
minor
2010-04-24 14:24:09 +10:00
gfxmonk
e42a39e1aa
modified readme
2010-04-24 13:47:35 +10:00
gfxmonk
f73b5f05c4
split out into content and summary methods
2010-04-24 00:41:09 +10:00
gfxmonk
c952f421b7
clean up content method and debug
2010-04-23 23:28:51 +10:00
gfxmonk
c0ca60ee26
use a more lenient parser
2010-04-23 20:51:56 +10:00