146 Commits

Author SHA1 Message Date
Mišo Belica
05d2230015 Load articles/snippets as binary strings 2013-03-26 19:55:50 +01:00
Mišo Belica
e6191fe0d1 Link density is computed with normalized whitespace
HTML code contains a lot of whitespace, and if there is
a large amount of indentation, link density
is small even if there are only links with useful
text.
2013-03-26 19:55:18 +01:00
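The effect described in commit e6191fe0d1 can be illustrated with a minimal sketch. It works on plain strings rather than on a parsed DOM, and the names `normalize_whitespace` and `link_density` as well as the `normalize` flag are assumptions made for this illustration, not the project's actual API.

```python
def normalize_whitespace(text):
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space
    # and strip leading/trailing whitespace.
    return " ".join(text.split())


def link_density(text, link_texts, normalize=True):
    # Ratio of characters inside links to all characters of the block.
    # Hypothetical helper: the real project computes this on DOM nodes.
    if normalize:
        text = normalize_whitespace(text)
        link_texts = [normalize_whitespace(t) for t in link_texts]
    total_length = len(text)
    if total_length == 0:
        return 0.0
    return sum(len(t) for t in link_texts) / total_length


# A heavily indented block that contains nothing but two links.
block = "\n            link one\n            \n            link two\n        "
links = ["link one", "link two"]

raw_density = link_density(block, links, normalize=False)
normalized_density = link_density(block, links)
```

Without normalization the indentation inflates the denominator, so a block made only of links looks like it is mostly plain text. With normalization the density moves close to 1.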
Mišo Belica
671580ac2c Use groupby to group annotated texts 2013-03-25 16:32:52 +01:00
Mišo Belica
c2a5b74230 Changed representation of annotated text 2013-03-25 14:26:03 +01:00
Mišo Belica
e366721873 Convert <hr> tag into paragraphs 2013-03-25 13:57:33 +01:00
Mišo Belica
e198b94ffb Added string utils for handling whitespace 2013-03-25 13:41:43 +01:00
Mišo Belica
3449a33d87 Test for changing multiple <br> into <p> 2013-03-23 17:04:30 +01:00
Mišo Belica
7bd7231e25 Renamed property of 'OriginalDocument': 'html' -> 'dom' 2013-03-23 17:03:54 +01:00
Mišo Belica
0e748a80a6 Cleaned class 'Article' 2013-03-23 16:07:42 +01:00
Mišo Belica
530b7d8f22 Drop unlikely candidates as soon as you can 2013-03-23 16:02:43 +01:00
Mišo Belica
69dd9ef4fd Changed 'readable_annotated_text' -> 'main_text' 2013-03-23 15:47:14 +01:00
Mišo Belica
c47530bfe0 Updated changelog 2013-03-21 19:53:07 +01:00
Mišo Belica
0df3a95c1e Property of `Article` with annotated text 2013-03-21 19:43:22 +01:00
Mišo Belica
7337e2fb38 Join node with 1 child of the same type 2013-03-21 19:42:18 +01:00
Mišo Belica
ade957cb47 Don't change <div> to <p> if it contains <p> elements 2013-03-21 19:41:00 +01:00
Mišo Belica
35dd10f546 Better logging messages 2013-03-21 19:38:54 +01:00
Mišo Belica
f5939f4608 Skip unused tests instead of useless passing 2013-03-21 19:36:04 +01:00
Mišo Belica
6b87ac5e07 Use unicode literals from future, not 'to_string' 2013-03-19 23:49:07 +01:00
Mišo Belica
c9e8e00b92 Refactored class `OriginalDocument` 2013-03-19 23:48:14 +01:00
Mišo Belica
eb8a8c5248 Replaced deprecated method 'getiterator' by 'iter' 2013-03-19 16:06:49 +01:00
Mišo Belica
2159625626 Function 'callable' was brought back in Python 3.2 2013-03-19 15:33:49 +01:00
Mišo Belica
76832530b4 I don't use Makefile 2013-03-19 01:28:30 +01:00
Mišo Belica
5abe69d917 Added new test article 2013-03-19 01:13:46 +01:00
Mišo Belica
5e41280f77 Updated helper for creating an article test 2013-03-19 00:31:44 +01:00
Mišo Belica
0178cfff5c Added compatibility file with unittest2 import 2013-03-18 22:01:11 +01:00
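A compatibility file like the one named in commit 0178cfff5c usually follows the pattern sketched below. The file's real contents are not shown in this log, so this is an assumption about its shape; `CompatExampleTest` is a hypothetical test case added only to show usage.

```python
# Prefer the unittest2 backport when it is installed (it brings newer
# unittest features to old Python versions); otherwise use the stdlib module.
try:
    import unittest2 as unittest
except ImportError:
    import unittest


class CompatExampleTest(unittest.TestCase):
    # Hypothetical test, present only to show that test modules import
    # 'unittest' from the compatibility file and use it as usual.
    def test_addition(self):
        self.assertEqual(1 + 1, 2)
```

Test modules then import `unittest` from the compatibility file instead of importing it directly, so the fallback logic lives in one place.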
Mišo Belica
26fe24789c Made packages from all tests 2013-03-18 21:45:33 +01:00
Mišo Belica
ee483a7f91 Changed location of test HTML files 2013-03-18 21:40:19 +01:00
Mišo Belica
3b5b2b1522 Renamed to readability 2013-03-18 21:25:09 +01:00
Mišo Belica
cf781bc595 Updated implementation of cached property
Cached values of properties are stored
in the instance's '__dict__'.
2013-03-17 00:57:28 +01:00
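The approach named in commit cf781bc595 (storing cached values in the instance's `__dict__`) can be sketched as a non-data descriptor. This is an illustration of the general technique, not the project's actual code; the `Example` class and its `calls` counter are hypothetical.

```python
class cached_property(object):
    """Non-data descriptor that computes a value once per instance."""

    def __init__(self, func):
        self.func = func
        self.name = func.__name__
        self.__doc__ = func.__doc__

    def __get__(self, instance, owner=None):
        if instance is None:
            # Accessed on the class itself: return the descriptor.
            return self
        value = self.func(instance)
        # Store the result under the same name in the instance's __dict__.
        # Because this descriptor defines no __set__, the instance attribute
        # shadows it, so later lookups never reach __get__ again.
        instance.__dict__[self.name] = value
        return value


class Example(object):
    # Hypothetical class used only to demonstrate the caching behaviour.
    def __init__(self):
        self.calls = 0

    @cached_property
    def expensive(self):
        self.calls += 1
        return [1, 2, 3]


example = Example()
first = example.expensive
second = example.expensive
```

Deleting the entry from `example.__dict__` clears the cache, and the next access recomputes the value. No time-to-live bookkeeping is involved.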
Mišo Belica
4e3227521e Less code - fewer bugs (I hope) 2013-03-15 01:40:41 +01:00
Mišo Belica
1a5970b238 Better names and positions for variables 2013-03-15 00:52:56 +01:00
Mišo Belica
930b6ced12 Fixed transformation of leaf <div> into <p> 2013-03-15 00:48:13 +01:00
Mišo Belica
314c999730 Drop useless tags by HTML cleaner 2013-03-15 00:23:41 +01:00
Mišo Belica
272fe480a3 Updated setup.py 2013-03-15 00:10:55 +01:00
Mišo Belica
9eacbd579c Updated LICENSE, AUTHORS, README 2013-03-15 00:10:41 +01:00
Mišo Belica
18b5c9b447 Refactored file 'scoring.py' 2013-03-11 23:06:21 +01:00
Mišo Belica
dcb7c18fd5 Refactored file 'document.py'
Removed non-intuitive parts and dead code
not covered by tests. Better names for objects.
Better coverage by tests.
2013-03-11 22:10:26 +01:00
Mišo Belica
03ff0be266 Moved client script into 'breadability.scripts' 2013-03-11 21:18:04 +01:00
Mišo Belica
c92f61fa53 Fixed docopt version 2013-03-11 12:43:17 +01:00
Mišo Belica
ec88a4efe6 Use docopt as an argument parser 2013-03-11 12:37:15 +01:00
Mišo Belica
8470ef2b45 Purification of file readable.py 2013-03-09 13:15:05 +01:00
Mišo Belica
b3b987440d Added test runner via nosetests 2013-03-09 13:05:16 +01:00
Mišo Belica
2e2e906da7 Purification of document.py 2013-03-09 00:05:49 +01:00
Mišo Belica
9f0fc2d433 Purification 2013-03-08 23:48:35 +01:00
Mišo Belica
baaefeda3c Refactored computing of link density 2013-03-08 23:23:30 +01:00
Mišo Belica
3f71e1b7d4 Refactored checking of node's attribute 2013-03-08 23:19:24 +01:00
Mišo Belica
636a38d705 Refactored generating of hash ID 2013-03-08 23:06:57 +01:00
Mišo Belica
9a613317c0 Make package from tests 2013-03-08 23:05:14 +01:00
Mišo Belica
cc00976533 Replace implementation of 'cached_property'
Parameter 'ttl' isn't needed.
2013-03-08 19:29:15 +01:00
Mišo Belica
e3b6ee2fd6 Suppress warning "ResourceWarning: unclosed file" 2013-03-08 17:46:18 +01:00