python-readability

Author	SHA1	Message	Date
Richard Harding	5a98e2c1b8	Correct appending and allow for document only - Fix the appending of siblings to the correct nested element - Add a document only flag so that you can get a dom tree you can nest yourself without html/body tags.	2012-04-16 20:55:13 -04:00
Richard Harding	edccec5d3b	Work on why we have an empty <body/> tag - Seems to come because the sanitizer ends up with two nodes, not one. The first is an empty body, the second is the article div. - Fix up the tabs so we can work with the file. Needs lots of pep8 love. - Implement an initial hack that at least gets it working atm. - Start to add test cases, sample html files we can test against, etc.	2012-04-16 17:13:24 -04:00
Yuri Baburov	ab783b25b7	Merge pull request #11 from JanX2/master. Fixing gap in node_length coverage (length=80 was missed). Continue early in remove_unlikely_candidates() in case there is neither a class nor an id attribute. Adding comment about oversight in transform_misused_divs_into_paragraphs().	2012-03-25 22:39:29 -07:00
Jan Weiß	3cdc3d67af	Adding comment about oversight in transform_misused_divs_into_paragraphs().	2012-03-24 10:00:07 +01:00
Jan Weiß	960f885edf	Continue early in remove_unlikely_candidates() in case there is neither a class nor an id attribute.	2012-03-24 09:56:08 +01:00
Jan Weiß	6b3961cd30	Fixing gap in node_length coverage.	2012-03-24 09:54:41 +01:00
Yuri Baburov	f9b604c9a8	Merge pull request #10 from facundo/master. Fix: Document.score_paragraphs should use ._html() not .html in case it's used not from .summary() method. Thanks to facundo.	2012-02-07 20:02:05 -08:00
facundo	bb93ae1e5f	fixed a small issue on the Document score_paragraphs method	2012-02-06 23:05:26 -05:00
Yuri Baburov	fc6a500298	Merge pull request #9 from Psycojoker/master. Add lxml to the dependencies list in the setup.py. Please note that lxml sometimes can't be built from sources, lots of people use binary distributions, which setup.py/pip can't handle properly!	2012-01-07 23:59:08 -08:00
Laurent Peuch	1583d8a794	add lxml missing dependency	2012-01-07 21:48:46 +01:00
Yuri Baburov	11c4d95411	Fixed indentation, encoding issue and README bug. Thanks to Greg Jastrab. Bump version to 0.2.3	2011-07-27 02:05:16 +07:00
Yuri Baburov	6bf4948e69	More README fixes for pypi and github. Bump to version 0.2.2	2011-07-26 13:42:20 +07:00
Yuri Baburov	f189ab905d	Fixed README for pypi.	2011-07-02 00:30:31 +07:00
Yuri Baburov	61715dca0a	Bump to version 0.2	2011-06-30 12:08:46 +07:00
Yuri Baburov	21906f1c44	Better setup.py, now we're "readability-lxml" in pypi. Thanks to Jerry Charumilind.	2011-06-30 11:54:07 +07:00
Yuri Baburov	c2ec1d1c38	Sorted out unicode issues, thanks to Lee Semel.	2011-06-30 11:51:16 +07:00
Yuri Baburov	45781a600f	Added command-line usage	2011-06-30 11:47:14 +07:00
Yuri Baburov	97ba2a0369	Debug utilities.	2011-06-30 11:46:37 +07:00
Lee Semel	f3d0a8d842	Allow passing unicode objects	2011-06-30 12:17:23 +08:00
Jerry Charumilind	ad38fac40a	Add chardet to installation requirements	2011-06-30 12:17:08 +08:00
Jerry Charumilind	8c1adc5141	Expose Document in readability package	2011-06-30 12:17:08 +08:00
Jerry Charumilind	bae87079e9	Change to automatically find packages	2011-06-30 12:17:07 +08:00
Jerry Charumilind	5bf5192d03	Add version number to track changes more easily	2011-06-30 12:17:07 +08:00
Yuri Baburov	7a1e063c22	Updated setup.py to my fork, changed package name to lxml-readability	2011-06-25 23:14:01 -07:00
Yuri Baburov	43c34bacc1	Renamed encodings to encoding to avoid conflicts with system module.	2011-06-16 17:53:02 +07:00
Yuri Baburov	096d4db6ce	Added usage	2011-06-14 04:33:15 -07:00
Yuri Baburov	f55f16baa1	Updated scoring algorithm to match readability.js v1.7.1	2011-06-01 12:16:32 +07:00
Yuri Baburov	96f476181c	Improved title shortener method, and added it to the Document class.	2011-05-11 19:58:27 +07:00
Yuri Baburov	f925e3ef05	Corrected README	2011-05-02 21:45:23 -07:00
Yuri Baburov	dada82099b	Moved to lxml (based on decruft version); better encoding recognition.	2011-05-03 11:34:29 +07:00
gfxmonk	b5639a0822	well that was quick; first fork added	2011-01-20 23:03:30 +11:00
gfxmonk	324e280e16	added note to readme to make it clear that I'm not actively working on this library	2011-01-20 22:28:01 +11:00
Tim Cuthbertson	7ebbcc03d2	made setup.py executable	2010-09-16 22:01:13 +10:00
Sean Brant	a5d47a1129	added setup.py	2010-09-14 19:18:35 -05:00
gfxmonk	2b6a2d3db4	removing empty paragraphs is not very useful, and can break some (stupid) websites	2010-05-01 00:08:23 +10:00
gfxmonk	1d862a00c3	fixed bug where only immediate text was being considered for weights, instead of all nested text	2010-05-01 00:07:30 +10:00
gfxmonk	0eacd959a4	failsafe parsing and more logging	2010-04-30 22:34:53 +10:00
gfxmonk	87ad057706	unicode, dammit!	2010-04-26 23:22:54 +10:00
gfxmonk	a224c5b759	minor	2010-04-24 14:24:09 +10:00
gfxmonk	e42a39e1aa	modified readme	2010-04-24 13:47:35 +10:00
gfxmonk	f73b5f05c4	split out into content and summary methods	2010-04-24 00:41:09 +10:00
gfxmonk	c952f421b7	clean up content method and debug	2010-04-23 23:28:51 +10:00
gfxmonk	c0ca60ee26	use a more lenient parser	2010-04-23 20:51:56 +10:00
gfxmonk	ad3d52ade4	initial	2010-04-22 21:55:00 +10:00

44 Commits