39 Commits

Author SHA1 Message Date
Yuri Baburov
154658798b Merge pull request #64 from martinth/master
Added python 3 support (Supported: python 2.6, 2.7, 3.3, 3.4).
Thanks a lot to @martinth
2015-07-26 14:11:37 +05:00
Marko Horvatic
f0ff9b2425 Move logging.basicConfig to main function 2015-06-24 16:21:04 +02:00
Yuri Baburov
e2bc1ea055 Improved #65 which has given warning, added cssselect lib, bumped to 0.5.1 2015-05-06 14:33:14 +06:00
Mariusz Osiecki
bf9e7404fa Failure if best_elem is root (fix #58) 2015-05-06 09:34:55 +02:00
Martin Thurau
ce7ca26835 Adds compatibility raise_with_traceback method to support different raise syntax
Unfortunately the Python 2 `raise` syntax is not supported in Python 3.3 or in all 3.4.x versions, so we deal with that by using conditional imports and a compatibility layer.
2015-04-29 23:35:18 +02:00
Martin Thurau
3ac56329e2 Corrects some things where 2to3 did too much. 2015-04-29 19:33:43 +02:00
Martin Thurau
aa4132f57a Adds Python 3.4 support.
Code now supports Python 2.6, 2.7 and 3.4. Python 3.3 isn't supported
because of some issues with the parser and the difference between old and
new `raise` syntax.
2015-04-29 16:18:21 +02:00
Yuri Baburov
987570bef0 Updated package links for Python 2.7 and Python 3 support 2015-04-27 15:59:31 +06:00
Yuri Baburov
1fac7e685a Added a feature to allow more images per article (with a test) 2015-04-27 14:35:00 +06:00
Miguel Galves
d04d41b749 Insert text inside iframe for correct output 2015-04-27 14:05:31 +06:00
Miguel Galves
f1759c1404 Allows iframes containing youtube or vimeo videos. People like them 2015-04-27 13:52:01 +06:00
Yuri Baburov
638f73f6a2 Fix for #52: <input type="hidden"> are not counted any more for "form removal" heuristic. 2014-09-22 15:31:31 +07:00
Yuri Baburov
08658d1d31 Released v 0.3, and uploaded to the pypi. 2013-10-10 02:39:37 +07:00
hush-hush
e2e78e4d55 Make lxml clean tree available for user modifications. 2012-09-17 13:54:08 +02:00
Richard Harding
e9a5cbfe7f Remove pdb dummy 2012-04-17 11:33:09 -04:00
Richard Harding
f1a79fb8f8 Update to make sure we don't drop the html tag when ditching elements 2012-04-17 11:04:36 -04:00
Richard Harding
46f0302ebc rename the document_only flag to html_partial 2012-04-17 10:17:14 -04:00
Richard Harding
a46dc14251 Try to pep8 all the things but give up when I got close. 2012-04-16 21:23:19 -04:00
Richard Harding
5a98e2c1b8 Correct appending and allow for document only
- Fix the appending of siblings to the correct nested element
- Add a document only flag so that you can get a dom tree you can nest
yourself without html/body tags.
2012-04-16 20:55:13 -04:00
Richard Harding
edccec5d3b Work on why we have an empty <body/> tag
- Seems to come because the sanitizer ends up with two nodes, not one. The
first is an empty body, the second is the article div.
- Fix up the tabs so we can work with the file. Needs lots of pep8 love.
- Implement an initial hack that at least gets it working atm.
- Start to add test cases, sample html files we can test against, etc.
2012-04-16 17:13:24 -04:00
Jan Weiß
3cdc3d67af Adding comment about oversight in transform_misused_divs_into_paragraphs(). 2012-03-24 10:00:07 +01:00
Jan Weiß
960f885edf Continue early in remove_unlikely_candidates() in case there is neither a class nor an id attribute. 2012-03-24 09:56:08 +01:00
Jan Weiß
6b3961cd30 Fixing gap in node_length coverage. 2012-03-24 09:54:41 +01:00
facundo
bb93ae1e5f fixed a small issue on the Document score_paragraphs method 2012-02-06 23:05:26 -05:00
Yuri Baburov
11c4d95411 Fixed indentation, encoding issue and README bug. Thanks to Greg Jastrab. Bump version to 0.2.3 2011-07-27 02:05:16 +07:00
Yuri Baburov
61715dca0a Bump to version 0.2 2011-06-30 12:08:46 +07:00
Yuri Baburov
c2ec1d1c38 Sorted out unicode issues, thanks to Lee Semel. 2011-06-30 11:51:16 +07:00
Yuri Baburov
f55f16baa1 Updated scoring algorithm to match readability.js v1.7.1 2011-06-01 12:16:32 +07:00
Yuri Baburov
96f476181c Improved title shortener method, and added it to the Document class. 2011-05-11 19:58:27 +07:00
Yuri Baburov
dada82099b Moved to lxml (based on decruft version); better encoding recognition. 2011-05-03 11:34:29 +07:00
gfxmonk
2b6a2d3db4 removing empty paragraphs is not very useful, and can break some (stupid) websites 2010-05-01 00:08:23 +10:00
gfxmonk
1d862a00c3 fixed bug where only immediate text was being considered for weights, instead of all nested text 2010-05-01 00:07:30 +10:00
gfxmonk
0eacd959a4 failsafe parsing and more logging 2010-04-30 22:34:53 +10:00
gfxmonk
87ad057706 unicode, dammit! 2010-04-26 23:22:54 +10:00
gfxmonk
a224c5b759 minor 2010-04-24 14:24:09 +10:00
gfxmonk
f73b5f05c4 split out into content and summary methods 2010-04-24 00:41:09 +10:00
gfxmonk
c952f421b7 clean up content method and debug 2010-04-23 23:28:51 +10:00
gfxmonk
c0ca60ee26 use a more lenient parser 2010-04-23 20:51:56 +10:00
gfxmonk
ad3d52ade4 initial 2010-04-22 21:55:00 +10:00