python-readability

Commit Graph

Author	SHA1	Message	Date
alphapapa	8443a87f5c	Update readability.py	9 years ago
alphapapa	5fc2d3684a	Use Mozilla User-Agent Use a "Mozilla" user-agent to avoid HTTP 403 errors. Fixes #71.	9 years ago
Yuri Baburov	65d1ebb06d	Fixed #70 and added xpath option	9 years ago
Yuri Baburov	c0d794fdd8	Update readability.py Fixed logging namespace	9 years ago
Yuri Baburov	8ff11e68a6	Debugging improvements. Bump to 0.6.0.5	9 years ago
Yuri Baburov	fcdbe563a5	Fixed #49. Bump to 0.6.0.4	9 years ago
Yuri Baburov	24bb20c761	Added dev branch features. Bumped to version 0.6	9 years ago
Yuri Baburov	154658798b	Merge pull request #64 from martinth/master Added python 3 support (Supported: python 2.6, 2.7, 3.3, 3.4). Thanks a lot to @martinth	9 years ago
Marko Horvatic	f0ff9b2425	Move logging.basicConfig to main function	9 years ago
Yuri Baburov	e2bc1ea055	Improved #65 which has given warning, added cssselect lib, bumped to 0.5.1	10 years ago
Mariusz Osiecki	bf9e7404fa	Failure if best_elem is root (fix #58)	10 years ago
Martin Thurau	386e48d29b	Fixes checking of declared encodings in get_encoding. In PYthon 3 .decode() on bytes requires the name of the encoding to be a str type which means we have to convert the extracted encoding before we can use it.	10 years ago
Martin Thurau	046d2c10c3	Fixes regex declaration in get_encoding. Since get_encoding() is only called when the input is not already unicode we need to declare the regexs as byte type so they continue to work in Python 3.	10 years ago
Martin Thurau	ce7ca26835	Adds compatibility `raise_with_traceback` method to support different `raise` syntax Unfortunately the Python 2 `raise` syntax is not supported in Python 3.3 and not all 3.4.x versions so we deal with that by using conditional imports and a compatibility layer.	10 years ago
Martin Thurau	3ac56329e2	Corrects some things were 2to3 did to much.	10 years ago
Martin Thurau	aa4132f57a	Adds Python 3.4 support. Code now supports Python 2.6, 2.7 and 3.4. PYthon 3.3 isn't support because of some issues with the parser and the difference between old and new `raise` syntax.	10 years ago
Yuri Baburov	987570bef0	Updated package links for Python 2.7 and Python 3 support	10 years ago
Yuri Baburov	1fac7e685a	Added a feature to allow more images per article (with a test)	10 years ago
Miguel Galves	d04d41b749	Insert text inside iframe for correct output	10 years ago
Miguel Galves	be2a1c4646	Let width and height attributes	10 years ago
Miguel Galves	f1759c1404	Allows iframes containing youtube or vimeo videos. People like them	10 years ago
Yuri Baburov	e4bcbe57d7	Fixes #53	10 years ago
Nathan Breit	75e2e0cb3a	Defaulting to utf-8 when chardet returns None On articles like this one chardet returns None: http://news.zing.vn/nhip-song-tre/thay-giao-gay-sot-tung-bo-luat-tinh-yeu/a291427.html This causes exceptions later on when encoding.lower() is called	10 years ago
Yuri Baburov	638f73f6a2	Fix for #52: <input type="hidden"> are not counted any more for "form removal" heuristic.	10 years ago
Mark Perdomo	3a43a3fe7e	Added code to check declared encodings first and check them from kennethreitz/requests/utils.py. Also I added some superset encodings I have found in Chinese pages that are mishandled by chardet/character declarations.	11 years ago
Yuri Baburov	d8595b7103	Quickfix for #41	11 years ago
Yuri Baburov	318f25c577	Minor fix in encoding guessing. Claiming it v0.3.0.1	11 years ago
Yuri Baburov	08658d1d31	Released v 0.3, and uploaded to the pypi.	11 years ago
hush-hush	e2e78e4d55	Make lxml clean tree available for user modifications.	12 years ago
Drew Vogel	fdba8d9e11	Added check on title.text to avoid a TypeError on None.	12 years ago
Zach Denton	0843d9cdf2	Explicitly check if title is None. fixes #22 This fixes #22 which caused all titles to be blank.	12 years ago
Andrey Popp	95852d5c18	readability.htmls: some docs do not have title elem	13 years ago
Richard Harding	e9a5cbfe7f	Remove pdb dummy	13 years ago
Richard Harding	f1a79fb8f8	Update to make sure we don't drop the html tag when ditching elements	13 years ago
Richard Harding	46f0302ebc	rename the document_only flag to html_partial	13 years ago
Richard Harding	a46dc14251	Try to pep8 all the things but give up when I got close.	13 years ago
Richard Harding	5a98e2c1b8	Correct appending and allow for document only - Fix the appending of siblings to the correct nested element - Add a document only flag so that you can get a dom tree you can nest yourself without html/body tags.	13 years ago
Richard Harding	edccec5d3b	Work on why we have an empty <body/> tag - Seems to come because the sanitizer ends up with two nodes, not one. The first is an empty body, the second is the article div. - Fix up the tabs so we can work with the file. Needs lots of pep8 love. - Implement an initial hack that at least gets it working atm. - Start to add test cases, sample html files we can test against, etc.	13 years ago
Jan Weiß	3cdc3d67af	Adding comment about oversight in transform_misused_divs_into_paragraphs().	13 years ago
Jan Weiß	960f885edf	Continue early in remove_unlikely_candidates() in case there is neither a class nor an id attribute.	13 years ago
Jan Weiß	6b3961cd30	Fixing gap in node_length coverage.	13 years ago
facundo	bb93ae1e5f	fixed a small issue on the Document score_paragraphs method	13 years ago
Yuri Baburov	11c4d95411	Fixed indentation, encoding issue and README bug. Thanks to Greg Jastrab. Bump version to 0.2.3	13 years ago
Yuri Baburov	61715dca0a	Bump to version 0.2	13 years ago
Yuri Baburov	c2ec1d1c38	Sorted out unicode issues, thanks to Lee Semel.	13 years ago
Yuri Baburov	97ba2a0369	Debug utilities.	13 years ago
Lee Semel	f3d0a8d842	Allow passing unicode objects	13 years ago
Jerry Charumilind	8c1adc5141	Expose Document in readability package	13 years ago
Yuri Baburov	43c34bacc1	Renamed encodings to encoding to avoid conflicts with system module.	14 years ago
Yuri Baburov	f55f16baa1	Updated scoring algorithm to match readability.js v1.7.1	14 years ago

61 Commits (5337adc590ab4c01ba7c494147108aa284beac61)