227 Commits (master)
 

Author SHA1 Message Date
Mišo Belica 573a05f940 Added alternative "python-goose" into README 11 years ago
Mišo Belica d138b6394e Cleanups 11 years ago
Mišo Belica e5401d7ab2 Added URL into User-Agent string 11 years ago
Mišo Belica d530acb8c6 I discovered maintainer meta-data parameter 11 years ago
Mišo Belica c091249162 Changed execution of nosetests 11 years ago
Richard Harding 042779bd12 Update version to 0.1.14 11 years ago
Richard Harding 05e13a4834 Update to only append sibling if we don't already have it 11 years ago
Richard Harding 952ea273c5 Update to version 0.1.13 11 years ago
Craig Maloney 9b9ec5b0e6 Treat images a little differently so they get more inclusion.
- When the body of the article contains screenshots/etc we want to try to keep
those images around.
- Added test for Business Insider article
- Adding sweetshark test from issue #1
- Add craig to the credits
11 years ago
Mišo Belica 471db19a43 Added BTE tool into similar tools to readme 11 years ago
Mišo Belica 43cc38dc7b Cleanup 11 years ago
Richard Harding 37c6c41d29 Update versions for 0.1.12 11 years ago
macmenot 4f2b744a3a Set urllib useragent string.
- Use a custom string to help with identifying traffic
- Update version to 0.1.12
- Small linting

Adjust the user agent string, lint
11 years ago
Mišo Belica 81ba7aec3c Create console scripts with python version suffix 11 years ago
Mišo Belica 51df29f05d Write readable content into temp file in binary mode 11 years ago
Mišo Belica 42530d4af7 Use py3k compatible urllib with own User-Agent header 11 years ago
Mišo Belica 9ed02047dd Added string representation for empty scored node 11 years ago
Mišo Belica 7630237b86 Added missing empty line 11 years ago
Mišo Belica c34bc53d9e Updated list of similar tools 11 years ago
Mišo Belica bf6cfef556 Renamed '_py3k.py' -> '_compat.py' 11 years ago
Mišo Belica bd084a8e28 Fixed named argument name 'fragment' 11 years ago
Mišo Belica 8f3ebf0950 Removed file with version number 11 years ago
Mišo Belica 8c775fee7f Added new test article 11 years ago
Mišo Belica c9afc38c49 Cleanups for function 'clean_document' 11 years ago
Mišo Belica 5c20673d45 Don't remove h1/h2 elements from readable article 11 years ago
Mišo Belica c9e087d077 Cleanups 11 years ago
Mišo Belica e0c87223ae Better log messages while scoring candidates 11 years ago
Mišo Belica df5cb8c8f6 Added scored nodes into candidates 11 years ago
Mišo Belica f858f0dbb0 1 pt for 100 inner text chars is computed as float 11 years ago
Mišo Belica 31b75c1cd8 Updated docstring for 'get_link_density' [ci skip] 11 years ago
Mišo Belica d054823958 Added simple test for parser of annotated text 11 years ago
Mišo Belica 05d2230015 Load articles/snippets as binary strings 11 years ago
Mišo Belica e6191fe0d1 Link density is computed with normalized whitespace
HTML code contains a lot of whitespace, and if there is a
large amount of indentation, the link density comes out
small even if the node contains only links with useful
text.
11 years ago
Mišo Belica 671580ac2c Use groupby to group annotated texts 11 years ago
Mišo Belica c2a5b74230 Changed representation of annotated text 11 years ago
Mišo Belica e366721873 Convert <hr> tag into paragraphs 11 years ago
Mišo Belica e198b94ffb Added string utils for handling whitespace 11 years ago
Mišo Belica 3449a33d87 Test for changing multiple <br> into <p> 11 years ago
Mišo Belica 7bd7231e25 Renamed property of 'OriginalDocument': 'html' -> 'dom' 11 years ago
Mišo Belica 0e748a80a6 Cleaned class 'Article' 11 years ago
Mišo Belica 530b7d8f22 Drop unlikely candidates as soon as you can 11 years ago
Mišo Belica 69dd9ef4fd Changed 'readable_annotated_text' -> 'main_text' 11 years ago
Mišo Belica c47530bfe0 Updated changelog 11 years ago
Mišo Belica 0df3a95c1e Property of ``Article`` with annotated text 11 years ago
Mišo Belica 7337e2fb38 Join node with 1 child of the same type 11 years ago
Mišo Belica ade957cb47 Don't change <div> to <p> if it contains <p> elements 11 years ago
Mišo Belica 35dd10f546 Better logging messages 11 years ago
Mišo Belica f5939f4608 Skip unused tests instead of useless passing 11 years ago
Mišo Belica 6b87ac5e07 Use unicode literals from future, not 'to_string' 11 years ago
Mišo Belica c9e8e00b92 Refactored class ``OriginalDocument`` 11 years ago
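
The message of commit e6191fe0d1 explains why link density is computed on whitespace-normalized text. The log does not show the code itself, so the following is only a minimal sketch of that idea and not the project's implementation. The names `normalize_whitespace`, `_TextCollector`, and `link_density` are hypothetical, and the sketch uses the standard-library `html.parser` so that it runs on its own.

```python
import re
from html.parser import HTMLParser


def normalize_whitespace(text):
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


class _TextCollector(HTMLParser):
    """Collects all text and, separately, the text found inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.all_text = []
        self.link_text = []
        self._link_depth = 0  # > 0 while the parser is inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        self.all_text.append(data)
        if self._link_depth > 0:
            self.link_text.append(data)


def link_density(html, normalize=True):
    """Return len(link text) / len(all text) for an HTML fragment.

    Hypothetical helper: with normalize=True, indentation and line breaks
    no longer inflate the denominator.
    """
    collector = _TextCollector()
    collector.feed(html)
    collector.close()

    all_text = "".join(collector.all_text)
    link_text = "".join(collector.link_text)
    if normalize:
        all_text = normalize_whitespace(all_text)
        link_text = normalize_whitespace(link_text)

    if not all_text:
        return 0.0
    return len(link_text) / len(all_text)


# A heavily indented block that contains nothing but links.
fragment = (
    "<div>\n"
    "        <a href='#'>Read more</a>\n"
    "        <a href='#'>Next page</a>\n"
    "    </div>"
)
raw_density = link_density(fragment, normalize=False)
normalized_density = link_density(fragment, normalize=True)
```

In this fragment the indentation accounts for more than half of the raw characters, so the raw density falls below 0.5 although the block holds only links. After normalization the density is close to 1, which matches the motivation given in the commit message.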