python-readability

Author	SHA1	Message	Date
alphapapa	8443a87f5c	Update readability.py	2016-04-03 21:38:17 -05:00
alphapapa	5fc2d3684a	Use Mozilla User-Agent Use a "Mozilla" user-agent to avoid HTTP 403 errors. Fixes #71.	2016-04-03 21:32:36 -05:00
Yuri Baburov	65d1ebb06d	Fixed #70 and added xpath option	2015-09-29 18:40:17 +02:00
Yuri Baburov	c0d794fdd8	Update readability.py Fixed logging namespace	2015-08-26 15:11:12 +05:00
Yuri Baburov	8ff11e68a6	Debugging improvements. Bump to 0.6.0.5	2015-07-27 11:59:17 +06:00
Yuri Baburov	fcdbe563a5	Fixed #49. Bump to 0.6.0.4	2015-07-27 10:06:28 +06:00
Yuri Baburov	24bb20c761	Added dev branch features. Bumped to version 0.6	2015-07-27 00:22:45 +06:00
Yuri Baburov	154658798b	Merge pull request #64 from martinth/master Added python 3 support (Supported: python 2.6, 2.7, 3.3, 3.4). Thanks a lot to @martinth	2015-07-26 14:11:37 +05:00
Marko Horvatic	f0ff9b2425	Move logging.basicConfig to main function	2015-06-24 16:21:04 +02:00
Yuri Baburov	e2bc1ea055	Improved #65 which has given warning, added cssselect lib, bumped to 0.5.1	2015-05-06 14:33:14 +06:00
Mariusz Osiecki	bf9e7404fa	Failure if best_elem is root (fix #58)	2015-05-06 09:34:55 +02:00
Martin Thurau	ce7ca26835	Adds compatibility `raise_with_traceback` method to support different `raise` syntax Unfortunately the Python 2 `raise` syntax is not supported in Python 3.3 and not all 3.4.x versions so we deal with that by using conditional imports and a compatibility layer.	2015-04-29 23:35:18 +02:00
Martin Thurau	3ac56329e2	Corrects some things where 2to3 did too much.	2015-04-29 19:33:43 +02:00
Martin Thurau	aa4132f57a	Adds Python 3.4 support. Code now supports Python 2.6, 2.7 and 3.4. Python 3.3 isn't supported because of some issues with the parser and the difference between old and new `raise` syntax.	2015-04-29 16:18:21 +02:00
Yuri Baburov	987570bef0	Updated package links for Python 2.7 and Python 3 support	2015-04-27 15:59:31 +06:00
Yuri Baburov	1fac7e685a	Added a feature to allow more images per article (with a test)	2015-04-27 14:35:00 +06:00
Miguel Galves	d04d41b749	Insert text inside iframe for correct output	2015-04-27 14:05:31 +06:00
Miguel Galves	f1759c1404	Allows iframes containing youtube or vimeo videos. People like them	2015-04-27 13:52:01 +06:00
Yuri Baburov	638f73f6a2	Fix for #52: <input type="hidden"> are not counted any more for "form removal" heuristic.	2014-09-22 15:31:31 +07:00
Yuri Baburov	08658d1d31	Released v 0.3, and uploaded to the pypi.	2013-10-10 02:39:37 +07:00
hush-hush	e2e78e4d55	Make lxml clean tree available for user modifications.	2012-09-17 13:54:08 +02:00
Richard Harding	e9a5cbfe7f	Remove pdb dummy	2012-04-17 11:33:09 -04:00
Richard Harding	f1a79fb8f8	Update to make sure we don't drop the html tag when ditching elements	2012-04-17 11:04:36 -04:00
Richard Harding	46f0302ebc	rename the document_only flag to html_partial	2012-04-17 10:17:14 -04:00
Richard Harding	a46dc14251	Try to pep8 all the things but give up when I got close.	2012-04-16 21:23:19 -04:00
Richard Harding	5a98e2c1b8	Correct appending and allow for document only - Fix the appending of siblings to the correct nested element - Add a document only flag so that you can get a dom tree you can nest yourself without html/body tags.	2012-04-16 20:55:13 -04:00
Richard Harding	edccec5d3b	Work on why we have an empty <body/> tag - Seems to come because the sanitizer ends up with two nodes, not one. The first is an empty body, the second is the article div. - Fix up the tabs so we can work with the file. Needs lots of pep8 love. - Implement an initial hack that at least gets it working atm. - Start to add test cases, sample html files we can test against, etc.	2012-04-16 17:13:24 -04:00
Jan Weiß	3cdc3d67af	Adding comment about oversight in transform_misused_divs_into_paragraphs().	2012-03-24 10:00:07 +01:00
Jan Weiß	960f885edf	Continue early in remove_unlikely_candidates() in case there is neither a class nor an id attribute.	2012-03-24 09:56:08 +01:00
Jan Weiß	6b3961cd30	Fixing gap in node_length coverage.	2012-03-24 09:54:41 +01:00
facundo	bb93ae1e5f	fixed a small issue on the Document score_paragraphs method	2012-02-06 23:05:26 -05:00
Yuri Baburov	11c4d95411	Fixed indentation, encoding issue and README bug. Thanks to Greg Jastrab. Bump version to 0.2.3	2011-07-27 02:05:16 +07:00
Yuri Baburov	61715dca0a	Bump to version 0.2	2011-06-30 12:08:46 +07:00
Yuri Baburov	c2ec1d1c38	Sorted out unicode issues, thanks to Lee Semel.	2011-06-30 11:51:16 +07:00
Yuri Baburov	f55f16baa1	Updated scoring algorithm to match readability.js v1.7.1	2011-06-01 12:16:32 +07:00
Yuri Baburov	96f476181c	Improved title shortener method, and added it to the Document class.	2011-05-11 19:58:27 +07:00
Yuri Baburov	dada82099b	Moved to lxml (based on decruft version); better encoding recognition.	2011-05-03 11:34:29 +07:00
gfxmonk	2b6a2d3db4	removing empty paragraphs is not very useful, and can break some (stupid) websites	2010-05-01 00:08:23 +10:00
gfxmonk	1d862a00c3	fixed bug where only immediate text was being considered for weights, instead of all nested text	2010-05-01 00:07:30 +10:00
gfxmonk	0eacd959a4	failsafe parsing and more logging	2010-04-30 22:34:53 +10:00
gfxmonk	87ad057706	unicode, dammit!	2010-04-26 23:22:54 +10:00
gfxmonk	a224c5b759	minor	2010-04-24 14:24:09 +10:00
gfxmonk	f73b5f05c4	split out into content and summary methods	2010-04-24 00:41:09 +10:00
gfxmonk	c952f421b7	clean up content method and debug	2010-04-23 23:28:51 +10:00
gfxmonk	c0ca60ee26	use a more lenient parser	2010-04-23 20:51:56 +10:00
gfxmonk	ad3d52ade4	initial	2010-04-22 21:55:00 +10:00

46 Commits