PiRSquared17
ac72938d40
Merge pull request #216 from makoshark/master
...
Issue #8 : avoid MemoryError fatal on big histories, remove sha1 for Wikia
2015-02-10 00:37:57 +00:00
PiRSquared17
28fc715b28
Make tests pass (fix/remove URLs)
...
Remove more Gentoo URLs (see 5069119b).
Fix WikiPapers API, and remove it from API test.
(It gives incorrect API URL in its HTML output.)
2015-02-09 23:59:36 +00:00
nemobis
5069119b42
Remove wiki.gentoo.org from tests
...
The test is failing. https://travis-ci.org/WikiTeam/wikiteam/builds/50102997#L546
Might be our fault, but they just updated code:
Tyrian – (f313f23) 12:47, 23 January 2015 GPLv3+ Gentoo's new web theme ported to MediaWiki. Alex Legler
I don't think testing screenscraping against a theme used only by Gentoo makes much sense for us.
2015-02-09 22:13:13 +01:00
Benjamin Mako Hill
eb8b44aef0
strip <sha1> tags returned under <page>
...
The Wikia API is exporting sha1 sums as part of the response for pages.
These are invalid XML and are causing dump parsing code (e.g.,
MediaWiki-Utilities) to fail. Also, sha1 sums should belong to revisions, not
pages, so it's not entirely clear to me what this is referring to.
2015-02-06 18:50:25 -08:00
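As an illustration of the idea in the commit above, here is a minimal sketch of dropping `<sha1>` elements that sit directly under `<page>` while leaving the ones inside `<revision>` alone. It is not the code from the commit: the function name `strip_page_level_sha1` is hypothetical, and the sketch assumes the input is well-formed XML without a namespace on the element names.

```python
import xml.etree.ElementTree as ET


def strip_page_level_sha1(xml_text):
    """Remove <sha1> children of <page>, keeping <sha1> inside <revision>.

    Hypothetical helper for illustration only; assumes un-namespaced tags.
    """
    root = ET.fromstring(xml_text)
    # iter("page") also matches the root element itself if it is a <page>.
    for page in root.iter("page"):
        # findall("sha1") returns only direct children, as a list,
        # so removing items while looping over it is safe.
        for sha1 in page.findall("sha1"):
            page.remove(sha1)
    return ET.tostring(root, encoding="unicode")


sample = (
    "<page><title>A</title>"
    "<revision><id>1</id><sha1>abc</sha1></revision>"
    "<sha1>def</sha1></page>"
)
cleaned = strip_page_level_sha1(sample)
```

Parsing the XML rather than using a regular expression is a design choice of this sketch; it makes the "directly under `<page>`" condition explicit.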
Benjamin Mako Hill
145b2eaaf4
changed getXMLPage() into a generator
...
The program tended to run out of memory when processing very large pages (i.e.,
pages with extremely large numbers of revisions or pages with large numbers of
very large revisions). This mitigates the problem by changing getXMLPage() into
a generator which allows us to write pages after each request to the API.
This required changes to the getXMLPage() function and also changes to other
parts of the code that called it.
Additionally, when the function was called, its text was checked in several
ways. This required a few changes, including a running tally of revisions
instead of a post hoc check, and moving the error checking into an
Exception rather than just an if statement that looked at the final result.
2015-02-06 17:19:24 -08:00
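The pattern described in the commit above can be sketched roughly as follows. This is not the actual getXMLPage() code: `fetch_batch`, `iter_page_chunks`, `write_page` and `PageExportError` are hypothetical names, and `fetch_batch(offset)` is assumed to return a list of revision strings plus the next offset (or `None` when the history is exhausted).

```python
class PageExportError(Exception):
    """Raised when the (hypothetical) API returns no revisions for a page."""


def iter_page_chunks(fetch_batch):
    """Yield revisions batch by batch instead of building one huge string."""
    offset = None
    total = 0  # running tally, replacing a check on the final result
    while True:
        revisions, offset = fetch_batch(offset)
        if not revisions and total == 0:
            # The error is signalled with an exception, not a post hoc "if".
            raise PageExportError("no revisions returned")
        for revision in revisions:
            total += 1
            yield revision
        if offset is None:
            break


def write_page(fetch_batch, out):
    """Write each chunk as soon as it arrives; return the revision count."""
    count = 0
    for chunk in iter_page_chunks(fetch_batch):
        out.write(chunk)
        count += 1
    return count
```

Because the caller writes each chunk as it is yielded, only one batch needs to be held in memory at a time, which is the point of the change.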
Federico Leva
a1921f0919
Update list of wikia.com unarchived wikis
...
The list of unarchived wikis was compared to the list of wikis that we
managed to download with dumpgenerator.py:
https://archive.org/details/wikia_dump_20141219
To allow the comparison, the naming format was aligned to the format
used by dumpgenerator.py for 7z files.
2015-02-06 09:17:53 +01:00
Emilio J. Rodríguez-Posada
9a6570ec5a
Update README.md
2014-12-23 13:33:19 +01:00
Federico Leva
ce6fbfee55
Use curl --fail instead and other fixes; add list
...
Now tested and used to produce the list of some 300k Wikia wikis
which don't yet have a public dump. Will soon be archived.
2014-12-19 08:17:59 +01:00
Federico Leva
7471900e56
It's easier if the list has the actual domains
2014-12-17 22:50:53 +01:00
Federico Leva
8bd3373960
Add wikia.py, to list Wikia wikis we'll dump ourselves
2014-12-17 22:49:10 +01:00
Federico Leva
38e778faad
Add 7z2bz2.sh
2014-12-17 13:35:59 +01:00
Marek Šuppa
e370257aeb
tests: Updated Index endpoint for WikiPapers
...
* Updated API endpoint for WikiPapers on Referata which was previously (http://wikipapers.referata.com/w/index.php ) and now resolves to (http://wikipapers.referata.com/index.php ).
2014-12-08 06:49:03 +01:00
Marek Šuppa
7b9ca8aa6b
tests: Updated API endpoint for WikiPapers
...
* Updated API endpoint for WikiPapers on Referata. It used to be (http://wikipapers.referata.com/w/api.php ), now it resolves to (http://wikipapers.referata.com/api.php ). This was breaking the tests.
2014-12-08 06:37:29 +01:00
Federico Leva
e26711afc9
Merge branch 'master' of github.com:WikiTeam/wikiteam
2014-12-05 15:01:32 +01:00
Federico Leva
ed2d87418c
Update with some wikis done in the last batch
2014-12-05 15:00:43 +01:00
Emilio J. Rodríguez-Posada
43cda4ec01
excluding wiki-site.com farm too
2014-12-03 11:39:53 +01:00
Emilio J. Rodríguez-Posada
7463b16b36
Merge branch 'master' of https://github.com/WikiTeam/wikiteam
2014-11-27 20:12:50 +01:00
Emilio J. Rodríguez-Posada
9681fdfd14
linking to GitHub
2014-11-27 20:12:27 +01:00
Marek Šuppa
b003cf94e2
tests: Disable broken Wiki
...
* Disabled http://wiki.greenmuseum.org/ since it's broken and was breaking the tests `'Unknown' != 'PhpWiki'`
2014-11-26 23:33:44 +01:00
Emilio J. Rodríguez-Posada
8d4def5885
improving duplicate filter, removing www. www1., etc; excluding editthis.info
2014-11-26 17:32:08 +01:00
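A duplicate filter of the kind the commit above mentions might look like the sketch below. It is an assumption about the approach, not the script's actual code: `normalize_host` and `dedupe` are hypothetical names, and the rule "strip a leading `www`, `www1`, etc." is modelled with a simple regular expression.

```python
import re
from urllib.parse import urlsplit

# Matches "www.", "www1.", "www2.", ... at the start of a host name.
_WWW_PREFIX = re.compile(r"^www\d*\.")


def normalize_host(url):
    """Return the lower-cased host of a URL without a leading www/wwwN label."""
    host = urlsplit(url).hostname or ""
    return _WWW_PREFIX.sub("", host.lower())


def dedupe(urls, excluded=("editthis.info",)):
    """Keep the first URL per normalized host, skipping excluded wiki farms."""
    seen = set()
    kept = []
    for url in urls:
        host = normalize_host(url)
        if not host or host in seen:
            continue
        if any(host == d or host.endswith("." + d) for d in excluded):
            continue
        seen.add(host)
        kept.append(url)
    return kept
```

The first URL seen for each host is kept unchanged, so the output stays usable as a list of real addresses.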
Emilio J. Rodríguez-Posada
9ca67fa4d3
not archived wikis script
2014-11-26 16:34:14 +01:00
Federico Leva
8cf4d4e6ea
Add 30k domains from another crawler
...
11011 were found alive by checkalive.py (though there could be more
if one checks more subdomains and subdirectories), some thousands
more by checklive.pl (but mostly or all false positives).
Of the alive ones, about 6245 were new to WikiApiary!
https://wikiapiary.com/wiki/Category:Oct_2014_Import
2014-11-01 22:23:25 +01:00
Federico Leva
7e0071ae7f
Add some UseModWiki-looking domains
2014-11-01 22:03:01 +01:00
nemobis
6b11cef9dc
A few thousand more doku.php URLs from own scraping
2014-10-29 19:02:06 +01:00
nemobis
0624d0303b
Merge pull request #198 from Southparkfan/patch-1
...
Update list of Orain wikis
2014-10-08 19:45:49 +02:00
Southparkfan
8ca9eb8757
Update date of Orain wikilist
2014-10-08 19:11:05 +02:00
Southparkfan
2e2fe9b818
Update list of Orain wikis
2014-10-08 19:10:27 +02:00
Marek Šuppa
8c44cff165
readme: Small wording fixes
...
* Small fixes in `Download Wikimedia dumps` section.
2014-10-04 12:02:48 +02:00
nemobis
6f74781e78
Merge pull request #197 from mrshu/mrshu/autopep8fied-wikiadownloader
...
wikiadownloader: Autopep8fied
2014-10-03 23:27:28 +02:00
mr.Shu
f022b02e47
wikiadownloader: Autopep8fied
...
* Made the source look a bit better, though this script might not be
used anymore.
Signed-off-by: mr.Shu <mr@shu.io>
2014-10-02 23:06:42 +02:00
nemobis
b3ef165529
Merge pull request #194 from mrshu/mrshu/dumpgenerator-pep8fied
...
dumpgenerator: AutoPEP8-fied
2014-10-01 23:56:36 +02:00
mr.Shu
04446a40a5
dumpgenerator: AutoPEP8-fied
...
* Used autopep8 to make sure the code looks nice and is actually PEP8
compliant.
Signed-off-by: mr.Shu <mr@shu.io>
2014-10-01 22:26:56 +02:00
nemobis
23a60fa850
MediaWiki CamelCase
2014-10-01 08:27:10 +02:00
nemobis
31112b3a80
checkalive.py: more checks before accessing stuff
2014-09-29 13:26:26 +02:00
nemobis
225c3eb478
A thousand more doku.php URLs from search
2014-09-29 09:12:33 +02:00
nemobis
e0f8e36bf4
Merge pull request #190 from PiRSquared17/api-allpages-disabled
...
Fallback to getPageTitlesScraper() if API allpages disabled
2014-09-28 16:34:24 +02:00
nemobis
a7e1b13304
Merge pull request #193 from mrshu/mrshu/readme-fix-wording
...
readme: Fix wording
2014-09-28 16:32:56 +02:00
nemobis
3fc7dcb5de
Add some more doku.php URLs
2014-09-26 23:55:57 +02:00
mr.Shu
54c373e9a0
readme: Fix wording
...
* Made a few wording changes to make the README.md more clear.
Signed-off-by: mr.Shu <mr@shu.io>
2014-09-25 18:55:00 +02:00
Marek Šuppa
40d863fb99
README: update wording
...
* Updated wording to make the README more clear.
2014-09-25 18:54:36 +02:00
Emilio J. Rodríguez-Posada
87ce2d4540
Merge pull request #192 from mrshu/mrshu/add-travis-image
...
update: Add TravisCI image to README
2014-09-25 16:05:42 +02:00
mr.Shu
7b0b54b6e5
update: Add TravisCI image to README
...
* Added TravisCI image which specifies whether the tests are passing or
not to Developers section.
Signed-off-by: mr.Shu <mr@shu.io>
2014-09-25 11:59:47 +02:00
Emilio J. Rodríguez-Posada
5c8e316e67
Merge pull request #189 from PiRSquared17/get-wiki-engine
...
Improve getWikiEngine()
2014-09-25 11:58:26 +02:00
Emilio J. Rodríguez-Posada
086415bc00
Merge pull request #191 from mrshu/mrshu/setup-travis
...
tests: Add .travis.yml and Travis CI
2014-09-25 11:55:02 +02:00
mr.Shu
14c62c6587
tests: Add .travis.yml and Travis CI
...
* Added .travis.yml to enable Travis CI
Signed-off-by: mr.Shu <mr@shu.io>
2014-09-23 23:12:40 +02:00
PiRSquared17
757019521a
Fallback to scraper if API allpages disabled
2014-09-23 15:53:51 -04:00
PiRSquared17
4b3c862a58
Comment debugging print, fix test
2014-09-23 15:10:06 -04:00
PiRSquared17
7a1db0525b
Add more wiki engines to getWikiEngine
2014-09-23 15:04:36 -04:00
nemobis
40c406cd00
Merge pull request #188 from PiRSquared17/wikiengine-lists
...
Add subdirectories to listsofwikis for different wiki engines
2014-09-23 07:19:43 +02:00
PiRSquared17
56c2177106
Add (incomplete) list of dokuwikis
2014-09-22 23:56:53 -04:00