alphapapa
8443a87f5c
Update readability.py
9 years ago
alphapapa
5fc2d3684a
Use Mozilla User-Agent
...
Use a "Mozilla" user-agent to avoid HTTP 403 errors. Fixes #71.
9 years ago
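A minimal sketch of the idea in this commit, using only the standard library. The helper name `build_request` and the exact User-Agent string are assumptions for illustration, not the project's actual code; nothing is sent over the network here.

```python
import urllib.request

# Hypothetical User-Agent value; the commit only says it is "Mozilla"-like.
USER_AGENT = "Mozilla/5.0"


def build_request(url):
    """Build a request that carries a browser-like User-Agent header.

    Some servers answer the default Python user agent with HTTP 403,
    so the header is set explicitly.
    """
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


request = build_request("http://example.com/article.html")
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

The request object can then be passed to `urllib.request.urlopen` when a real fetch is wanted.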
Yuri Baburov
65d1ebb06d
Fixed #70 and added xpath option
9 years ago
Yuri Baburov
c0d794fdd8
Update readability.py
...
Fixed logging namespace
9 years ago
Yuri Baburov
8ff11e68a6
Debugging improvements. Bump to 0.6.0.5
9 years ago
Yuri Baburov
fcdbe563a5
Fixed #49. Bump to 0.6.0.4
9 years ago
Yuri Baburov
24bb20c761
Added dev branch features.
...
Bumped to version 0.6
9 years ago
Yuri Baburov
154658798b
Merge pull request #64 from martinth/master
...
Added python 3 support (Supported: python 2.6, 2.7, 3.3, 3.4).
Thanks a lot to @martinth
9 years ago
Marko Horvatic
f0ff9b2425
Move logging.basicConfig to main function
9 years ago
Yuri Baburov
e2bc1ea055
Improved #65, which was giving a warning; added cssselect lib; bumped to 0.5.1
10 years ago
Mariusz Osiecki
bf9e7404fa
Failure if best_elem is root (fix #58)
10 years ago
Martin Thurau
386e48d29b
Fixes checking of declared encodings in get_encoding.
...
In Python 3, .decode() on bytes requires the name of the encoding to be a str, which means we have to convert the extracted encoding before we can use it.
10 years ago
Martin Thurau
046d2c10c3
Fixes regex declaration in get_encoding.
...
Since get_encoding() is only called when the input is *not* already unicode, we need to declare the regexes as byte type so they continue to work in Python 3.
10 years ago
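The two get_encoding fixes above can be illustrated together. This is a sketch under assumptions: the pattern `CHARSET_RE` and the helper `declared_encodings` are hypothetical stand-ins, not the regexes or function body used in the project.

```python
import re

# Hypothetical pattern. It is a bytes pattern (br'...') because the input
# is raw bytes; a str pattern cannot be applied to bytes in Python 3.
CHARSET_RE = re.compile(br'<meta[^>]*?charset=["\']?([\w-]+)', re.IGNORECASE)


def declared_encodings(page):
    """Return the charset names declared in raw HTML bytes, as str."""
    # findall() on a bytes pattern yields bytes, so each name is converted
    # to str before it is handed to bytes.decode().
    return [name.decode("ascii").lower() for name in CHARSET_RE.findall(page)]


page = b'<html><head><meta charset="UTF-8"></head><body>caf\xc3\xa9</body></html>'
names = declared_encodings(page)
text = page.decode(names[0])
```

Passing the bytes value straight to `page.decode(...)` would raise a `TypeError` in Python 3, which is the failure the later commit addresses.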
Martin Thurau
ce7ca26835
Adds compatibility `raise_with_traceback` method to support different `raise` syntax
...
Unfortunately, the Python 2 `raise` syntax is not supported in Python 3.3 and not in all 3.4.x versions, so we deal with that by using conditional imports and a compatibility layer.
10 years ago
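A sketch of what the Python 3 half of such a compatibility layer might look like. The signature of `raise_with_traceback` and the `Unparseable` exception are assumptions for illustration; the Python 2 half would use the three-argument `raise` statement and would have to live in a separate module, since that syntax does not even parse under Python 3.

```python
import sys


def raise_with_traceback(exc_type, traceback, *args, **kwargs):
    """Raise exc_type(*args, **kwargs) while keeping an existing traceback."""
    # Python 3 spelling; Python 2 would write: raise exc_type, value, traceback
    raise exc_type(*args, **kwargs).with_traceback(traceback)


class Unparseable(ValueError):
    """Hypothetical wrapper exception used only in this example."""


def parse_number(text):
    try:
        return int(text)
    except ValueError as error:
        # Re-raise as a different type, preserving where the failure happened.
        raise_with_traceback(Unparseable, sys.exc_info()[2], str(error))
```

A caller would pick the right module at import time, for example by checking `sys.version_info[0]` and importing the matching implementation.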
Martin Thurau
3ac56329e2
Corrects some things where 2to3 did too much.
10 years ago
Martin Thurau
aa4132f57a
Adds Python 3.4 support.
...
Code now supports Python 2.6, 2.7 and 3.4. Python 3.3 isn't supported
because of some issues with the parser and the difference between old and
new `raise` syntax.
10 years ago
Yuri Baburov
987570bef0
Updated package links for Python 2.7 and Python 3 support
10 years ago
Yuri Baburov
1fac7e685a
Added a feature to allow more images per article (with a test)
10 years ago
Miguel Galves
d04d41b749
Insert text inside iframe for correct output
10 years ago
Miguel Galves
be2a1c4646
Let width and height attributes
10 years ago
Miguel Galves
f1759c1404
Allows iframes containing youtube or vimeo videos. People like them
10 years ago
Yuri Baburov
e4bcbe57d7
Fixes #53
10 years ago
Nathan Breit
75e2e0cb3a
Defaulting to utf-8 when chardet returns None
...
On articles like this one chardet returns None:
http://news.zing.vn/nhip-song-tre/thay-giao-gay-sot-tung-bo-luat-tinh-yeu/a291427.html
This causes exceptions later on when encoding.lower() is called
10 years ago
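The fallback described in this commit can be sketched as follows. The helper `choose_encoding` is hypothetical; it takes a dictionary shaped like the result of `chardet.detect`, whose `encoding` entry may be None, so the example runs without chardet installed.

```python
def choose_encoding(detection, default="utf-8"):
    """Pick an encoding name from a chardet-style detection result.

    Falls back to `default` when detection produced no encoding, so that
    the later .lower() call never runs on None.
    """
    encoding = detection.get("encoding") or default
    return encoding.lower()


# Detection failed: the default is used instead of raising AttributeError.
fallback = choose_encoding({"encoding": None, "confidence": 0.0})

# Detection succeeded: the detected name is normalized to lower case.
detected = choose_encoding({"encoding": "GB2312", "confidence": 0.99})
```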
Yuri Baburov
638f73f6a2
Fix for #52: <input type="hidden"> elements are no longer counted by the "form removal" heuristic.
10 years ago
Mark Perdomo
3a43a3fe7e
Added code to check declared encodings first and check them
...
from kennethreitz/requests/utils.py. Also I added some superset
encodings I have found in Chinese pages that are mishandled by
chardet/character declarations.
11 years ago
Yuri Baburov
d8595b7103
Quickfix for #41
11 years ago
Yuri Baburov
318f25c577
Minor fix in encoding guessing. Claiming it v0.3.0.1
11 years ago
Yuri Baburov
08658d1d31
Released v0.3 and uploaded to PyPI.
11 years ago
hush-hush
e2e78e4d55
Make lxml clean tree available for user modifications.
12 years ago
Drew Vogel
fdba8d9e11
Added check on title.text to avoid a TypeError on None.
12 years ago
Zach Denton
0843d9cdf2
Explicitly check if title is None. Fixes #22
...
This fixes #22 which caused all titles to be blank.
12 years ago
Andrey Popp
95852d5c18
readability.htmls: some docs do not have title elem
13 years ago
Richard Harding
e9a5cbfe7f
Remove pdb dummy
13 years ago
Richard Harding
f1a79fb8f8
Update to make sure we don't drop the html tag when ditching elements
13 years ago
Richard Harding
46f0302ebc
rename the document_only flag to html_partial
13 years ago
Richard Harding
a46dc14251
Try to pep8 all the things but give up when I got close.
13 years ago
Richard Harding
5a98e2c1b8
Correct appending and allow for document only
...
- Fix the appending of siblings to the correct nested element
- Add a document only flag so that you can get a dom tree you can nest
yourself without html/body tags.
13 years ago
Richard Harding
edccec5d3b
Work on why we have an empty <body/> tag
...
- Seems to come because the sanitizer ends up with two nodes, not one. The
first is an empty body, the second is the article div.
- Fix up the tabs so we can work with the file. Needs lots of pep8 love.
- Implement an initial hack that at least gets it working atm.
- Start to add test cases, sample html files we can test against, etc.
13 years ago
Jan Weiß
3cdc3d67af
Adding comment about oversight in transform_misused_divs_into_paragraphs().
13 years ago
Jan Weiß
960f885edf
Continue early in remove_unlikely_candidates() in case there is neither a class nor an id attribute.
13 years ago
Jan Weiß
6b3961cd30
Fixing gap in node_length coverage.
13 years ago
facundo
bb93ae1e5f
fixed a small issue on the Document score_paragraphs method
13 years ago
Yuri Baburov
11c4d95411
Fixed indentation, encoding issue and README bug. Thanks to Greg Jastrab. Bump version to 0.2.3
13 years ago
Yuri Baburov
61715dca0a
Bump to version 0.2
13 years ago
Yuri Baburov
c2ec1d1c38
Sorted out unicode issues, thanks to Lee Semel.
13 years ago
Yuri Baburov
97ba2a0369
Debug utilities.
13 years ago
Lee Semel
f3d0a8d842
Allow passing unicode objects
13 years ago
Jerry Charumilind
8c1adc5141
Expose Document in readability package
13 years ago
Yuri Baburov
43c34bacc1
Renamed encodings to encoding to avoid conflicts with system module.
14 years ago
Yuri Baburov
f55f16baa1
Updated scoring algorithm to match readability.js v1.7.1
14 years ago