python-readability

Author	SHA1	Message	Date
Richard Harding	a46dc14251	Try to pep8 all the things but give up when I got close.	2012-04-16 21:23:19 -04:00
Richard Harding	5a98e2c1b8	Correct appending and allow for document only - Fix the appending of siblings to the correct nested element - Add a document only flag so that you can get a dom tree you can nest yourself without html/body tags.	2012-04-16 20:55:13 -04:00
Richard Harding	edccec5d3b	Work on why we have an empty <body/> tag - Seems to come because the sanitizer ends up with two nodes, not one. The first is an empty body, the second is the article div. - Fix up the tabs so we can work with the file. Needs lots of pep8 love. - Implement an initial hack that at least gets it working for now. - Start to add test cases, sample html files we can test against, etc.	2012-04-16 17:13:24 -04:00
Jan Weiß	3cdc3d67af	Adding comment about oversight in transform_misused_divs_into_paragraphs().	2012-03-24 10:00:07 +01:00
Jan Weiß	960f885edf	Continue early in remove_unlikely_candidates() in case there is neither a class nor an id attribute.	2012-03-24 09:56:08 +01:00
Jan Weiß	6b3961cd30	Fixing gap in node_length coverage.	2012-03-24 09:54:41 +01:00
facundo	bb93ae1e5f	fixed a small issue on the Document score_paragraphs method	2012-02-06 23:05:26 -05:00
Yuri Baburov	11c4d95411	Fixed indentation, encoding issue and README bug. Thanks to Greg Jastrab. Bump version to 0.2.3	2011-07-27 02:05:16 +07:00
Yuri Baburov	61715dca0a	Bump to version 0.2	2011-06-30 12:08:46 +07:00
Yuri Baburov	c2ec1d1c38	Sorted out unicode issues, thanks to Lee Semel.	2011-06-30 11:51:16 +07:00
Yuri Baburov	97ba2a0369	Debug utilities.	2011-06-30 11:46:37 +07:00
Lee Semel	f3d0a8d842	Allow passing unicode objects	2011-06-30 12:17:23 +08:00
Jerry Charumilind	8c1adc5141	Expose Document in readability package	2011-06-30 12:17:08 +08:00
Yuri Baburov	43c34bacc1	Renamed encodings to encoding to avoid conflicts with system module.	2011-06-16 17:53:02 +07:00
Yuri Baburov	f55f16baa1	Updated scoring algorithm to match readability.js v1.7.1	2011-06-01 12:16:32 +07:00
Yuri Baburov	96f476181c	Improved title shortener method, and added it to the Document class.	2011-05-11 19:58:27 +07:00
Yuri Baburov	dada82099b	Moved to lxml (based on decruft version); better encoding recognition.	2011-05-03 11:34:29 +07:00
gfxmonk	2b6a2d3db4	removing empty paragraphs is not very useful, and can break some (stupid) websites	2010-05-01 00:08:23 +10:00
gfxmonk	1d862a00c3	fixed bug where only immediate text was being considered for weights, instead of all nested text	2010-05-01 00:07:30 +10:00
gfxmonk	0eacd959a4	failsafe parsing and more logging	2010-04-30 22:34:53 +10:00
gfxmonk	87ad057706	unicode, dammit!	2010-04-26 23:22:54 +10:00
gfxmonk	a224c5b759	minor	2010-04-24 14:24:09 +10:00
gfxmonk	f73b5f05c4	split out into content and summary methods	2010-04-24 00:41:09 +10:00
gfxmonk	c952f421b7	clean up content method and debug	2010-04-23 23:28:51 +10:00
gfxmonk	c0ca60ee26	use a more lenient parser	2010-04-23 20:51:56 +10:00
gfxmonk	ad3d52ade4	initial	2010-04-22 21:55:00 +10:00

26 Commits