Richard Harding
05e13a4834
Update to only append sibling if we don't already have it
11 years ago
Richard Harding
952ea273c5
Update to version 0.1.13
11 years ago
Craig Maloney
9b9ec5b0e6
Treat images a little differently so they get more inclusion.
- When the body of the article contains screenshots/etc we want to try to keep
those images around.
- Added test for Business Insider article
- Adding sweetshark test from issue #1
- Add craig to the credits
11 years ago
Mišo Belica
471db19a43
Added BTE tool into similar tools to readme
11 years ago
Mišo Belica
43cc38dc7b
Cleanup
11 years ago
Richard Harding
37c6c41d29
Update versions for 0.1.12
11 years ago
macmenot
4f2b744a3a
Set urllib useragent string.
- Use a custom string to help with identifying traffic
- Update version to 0.1.12
- Small linting
Adjust the user agent string, lint
11 years ago
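
The user agent change above can be illustrated with the standard library alone. This is a minimal sketch, not the project's actual code: the `USER_AGENT` value, the `build_request` helper, and the URL are all hypothetical, and only `urllib.request.Request` itself is a real API.

```python
from urllib.request import Request

# Hypothetical identifier; the string the project actually sends is not shown in this log.
USER_AGENT = "example-client/0.0 (hypothetical)"


def build_request(url):
    """Build a request that carries a custom User-Agent header."""
    # No network traffic happens here; the request is only constructed.
    return Request(url, headers={"User-Agent": USER_AGENT})


request = build_request("http://example.com/article.html")
```

Passing the returned object to `urllib.request.urlopen` would send the header, which lets site operators tell this traffic apart from the default Python client.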
Mišo Belica
81ba7aec3c
Create console scripts with python version suffix
12 years ago
Mišo Belica
51df29f05d
Write readable content into temp file in binary mode
12 years ago
Mišo Belica
42530d4af7
Use py3k compatible urllib with own User-Agent header
12 years ago
Mišo Belica
9ed02047dd
Added string representation for empty scored node
12 years ago
Mišo Belica
7630237b86
Added missing empty line
12 years ago
Mišo Belica
c34bc53d9e
Updated list of similar tools
12 years ago
Mišo Belica
bf6cfef556
Renamed '_py3k.py' -> '_compat.py'
12 years ago
Mišo Belica
bd084a8e28
Fixed named argument name 'fragment'
12 years ago
Mišo Belica
8f3ebf0950
Removed file with version number
12 years ago
Mišo Belica
8c775fee7f
Added new test article
12 years ago
Mišo Belica
c9afc38c49
Cleanups for function 'clean_document'
12 years ago
Mišo Belica
5c20673d45
Don't remove h1/h2 elements from readable article
12 years ago
Mišo Belica
c9e087d077
Cleanups
12 years ago
Mišo Belica
e0c87223ae
Better log messages while scoring candidates
12 years ago
Mišo Belica
df5cb8c8f6
Added scored nodes into candidates
12 years ago
Mišo Belica
f858f0dbb0
1 pt for 100 inner text chars is computed as float
12 years ago
Mišo Belica
31b75c1cd8
Updated docstring for 'get_link_density' [ci skip]
12 years ago
Mišo Belica
d054823958
Added simple test for parser of annotated text
12 years ago
Mišo Belica
05d2230015
Load articles/snippets as binary strings
12 years ago
Mišo Belica
e6191fe0d1
Link density is computed with normalized whitespace
HTML code contains a lot of whitespace, and if there is
a large amount of indentation characters, link density
is small even if there are only links with useful
text.
12 years ago
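
The effect described in the commit above can be reproduced with a small sketch. It assumes link density means the length of the link text divided by the length of all text in a node. The names `normalize_whitespace` and `link_density` here are illustrative and may not match the project's own helpers or signatures.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def link_density(node_text, link_texts, normalize=True):
    """Return the share of a node's text that sits inside links."""
    if normalize:
        node_text = normalize_whitespace(node_text)
        link_texts = [normalize_whitespace(t) for t in link_texts]
    if not node_text:
        return 0.0
    link_length = sum(len(t) for t in link_texts)
    return link_length / len(node_text)


# A menu made only of links, as it might appear in indented HTML source.
menu_text = "\n        Home\n        \n        About us\n        \n        Contact\n    "
menu_links = ["Home", "About us", "Contact"]

raw = link_density(menu_text, menu_links, normalize=False)
clean = link_density(menu_text, menu_links)
```

Without normalization the indentation inflates the denominator, so a block consisting purely of links looks like ordinary text. After normalization the same block scores close to 1 and can be recognized as navigation.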
Mišo Belica
671580ac2c
Use groupby to group annotated texts
12 years ago
Mišo Belica
c2a5b74230
Changed representation of annotated text
12 years ago
Mišo Belica
e366721873
Convert <hr> tag into paragraphs
12 years ago
Mišo Belica
e198b94ffb
Added string utils for handling whitespace
12 years ago
Mišo Belica
3449a33d87
Test for changing multiple <br> into <p>
12 years ago
Mišo Belica
7bd7231e25
Renamed property of 'OriginalDocument': 'html' -> 'dom'
12 years ago
Mišo Belica
0e748a80a6
Cleaned class 'Article'
12 years ago
Mišo Belica
530b7d8f22
Drop unlikely candidates as soon as you can
12 years ago
Mišo Belica
69dd9ef4fd
Changed 'readable_annotated_text' -> 'main_text'
12 years ago
Mišo Belica
c47530bfe0
Updated changelog
12 years ago
Mišo Belica
0df3a95c1e
Property of ``Article`` with annotated text
12 years ago
Mišo Belica
7337e2fb38
Join node with 1 child of the same type
12 years ago
Mišo Belica
ade957cb47
Don't change <div> to <p> if it contains <p> elements
12 years ago
Mišo Belica
35dd10f546
Better logging messages
12 years ago
Mišo Belica
f5939f4608
Skip unused tests instead of useless passing
12 years ago
Mišo Belica
6b87ac5e07
Use unicode literals from future, not 'to_string'
12 years ago
Mišo Belica
c9e8e00b92
Refactored class ``OriginalDocument``
12 years ago
Mišo Belica
eb8a8c5248
Replaced deprecated method 'getiterator' by 'iter'
12 years ago
Mišo Belica
2159625626
Function 'callable' was restored in Python 3.2
12 years ago
Mišo Belica
76832530b4
I don't use Makefile
12 years ago
Mišo Belica
5abe69d917
Added new test article
12 years ago
Mišo Belica
5e41280f77
Updated helper for creating an article test
12 years ago
Mišo Belica
0178cfff5c
Added compatibility file with unittest2 import
12 years ago