Mišo Belica
573a05f940
Added alternative "python-goose" into README
11 years ago
Mišo Belica
d138b6394e
Cleanups
11 years ago
Mišo Belica
e5401d7ab2
Added URL into User-Agent string
11 years ago
Mišo Belica
d530acb8c6
I discovered maintainer meta-data parameter
11 years ago
Mišo Belica
c091249162
Changed execution of nosetests
11 years ago
Richard Harding
042779bd12
Update version to 0.1.14
11 years ago
Richard Harding
05e13a4834
Update to only append sibling if we don't already have it
11 years ago
Richard Harding
952ea273c5
Update to version 0.1.13
11 years ago
Craig Maloney
9b9ec5b0e6
Treat images a little differently so they get more inclusion.
- When the body of the article contains screenshots/etc we want to try to keep
those images around.
- Added test for Business Insider article
- Adding sweetshark test from issue #1
- Add craig to the credits
11 years ago
Mišo Belica
471db19a43
Added BTE tool to the list of similar tools in README
11 years ago
Mišo Belica
43cc38dc7b
Cleanup
11 years ago
Richard Harding
37c6c41d29
Update versions for 0.1.12
11 years ago
macmenot
4f2b744a3a
Set urllib useragent string.
- Use a custom string to help with identifying traffic
- Update version to 0.1.12
- Small linting
Adjust the user agent string, lint
11 years ago
Mišo Belica
81ba7aec3c
Create console scripts with python version suffix
11 years ago
Mišo Belica
51df29f05d
Write readable content into temp file in binary mode
11 years ago
Mišo Belica
42530d4af7
Use py3k compatible urllib with own User-Agent header
11 years ago
Mišo Belica
9ed02047dd
Added string representation for empty scored node
11 years ago
Mišo Belica
7630237b86
Added missing empty line
11 years ago
Mišo Belica
c34bc53d9e
Updated list of similar tools
11 years ago
Mišo Belica
bf6cfef556
Renamed '_py3k.py' -> '_compat.py'
11 years ago
Mišo Belica
bd084a8e28
Fixed named argument name 'fragment'
11 years ago
Mišo Belica
8f3ebf0950
Removed file with version number
11 years ago
Mišo Belica
8c775fee7f
Added new test article
11 years ago
Mišo Belica
c9afc38c49
Cleanups for function 'clean_document'
11 years ago
Mišo Belica
5c20673d45
Don't remove h1/h2 elements from readable article
11 years ago
Mišo Belica
c9e087d077
Cleanups
11 years ago
Mišo Belica
e0c87223ae
Better log messages while scoring candidates
11 years ago
Mišo Belica
df5cb8c8f6
Added scored nodes into candidates
11 years ago
Mišo Belica
f858f0dbb0
1 pt for 100 inner text chars is computed as float
11 years ago
Mišo Belica
31b75c1cd8
Updated docstring for 'get_link_density' [ci skip]
11 years ago
Mišo Belica
d054823958
Added simple test for parser of annotated text
11 years ago
Mišo Belica
05d2230015
Load articles/snippets as binary strings
11 years ago
Mišo Belica
e6191fe0d1
Link density is computed with normalized whitespace
HTML code contains a lot of whitespace, and if there is
a large amount of indentation characters, link density
is small even if there are only links with useful
text.
11 years ago
Mišo Belica
671580ac2c
Use groupby to group annotated texts
11 years ago
Mišo Belica
c2a5b74230
Changed representation of annotated text
11 years ago
Mišo Belica
e366721873
Convert <hr> tag into paragraphs
11 years ago
Mišo Belica
e198b94ffb
Added string utils for handling whitespace
11 years ago
Mišo Belica
3449a33d87
Test for changing multiple <br> into <p>
11 years ago
Mišo Belica
7bd7231e25
Renamed property of 'OriginalDocument': 'html' -> 'dom'
11 years ago
Mišo Belica
0e748a80a6
Cleaned class 'Article'
11 years ago
Mišo Belica
530b7d8f22
Drop unlikely candidates as soon as you can
11 years ago
Mišo Belica
69dd9ef4fd
Changed 'readable_annotated_text' -> 'main_text'
11 years ago
Mišo Belica
c47530bfe0
Updated changelog
11 years ago
Mišo Belica
0df3a95c1e
Property of ``Article`` with annotated text
11 years ago
Mišo Belica
7337e2fb38
Join node with 1 child of the same type
11 years ago
Mišo Belica
ade957cb47
Don't change <div> to <p> if it contains <p> elements
11 years ago
Mišo Belica
35dd10f546
Better logging messages
11 years ago
Mišo Belica
f5939f4608
Skip unused tests instead of useless passing
11 years ago
Mišo Belica
6b87ac5e07
Use unicode literals from future, not 'to_string'
11 years ago
Mišo Belica
c9e8e00b92
Refactored class ``OriginalDocument``
11 years ago
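
Commit e6191fe0d1 above ("Link density is computed with normalized whitespace") describes why indentation can skew link density. The sketch below illustrates that idea only and is not the project's actual code: the names `normalize_whitespace`, `link_density` and `_TextCollector` are hypothetical, and it assumes link density means the length of text inside `<a>` elements divided by the length of all text.

```python
import re
from html.parser import HTMLParser


def normalize_whitespace(text):
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


class _TextCollector(HTMLParser):
    """Collects all text and, separately, text found inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.all_text = []
        self.link_text = []
        self._link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        self.all_text.append(data)
        if self._link_depth > 0:
            self.link_text.append(data)


def link_density(html, normalize=True):
    """Return len(link text) / len(all text) for an HTML fragment.

    Hypothetical helper: with normalize=True, whitespace is collapsed
    before measuring, so indentation does not dilute the ratio.
    """
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    all_text = "".join(collector.all_text)
    link_text = "".join(collector.link_text)
    if normalize:
        all_text = normalize_whitespace(all_text)
        link_text = normalize_whitespace(link_text)
    if not all_text:
        return 0.0
    return len(link_text) / len(all_text)
```

For a heavily indented block that contains nothing but links, the raw ratio comes out low because the indentation counts as text, while the normalized ratio stays close to 1.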