Commit Graph

696 Commits (f4ec129bff28bd6c8cafefd592061e1ab646645f)
 

Author SHA1 Message Date
Benjamin Mako Hill f4ec129bff updated wikiadownloader.py to work with new dumps
Bitrot seems to have gotten the best of this script, and it sounds like it
hasn't been used. This at least gets it working by:

- finding both the .gz and the .7z dumps
- parsing the new date format in the HTML
- finding dumps in the correct place
- moving all chatter to stderr instead of stdout
9 years ago
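The fixes above can be sketched roughly like this; the filename pattern and function names are illustrative stand-ins, not the actual wikiadownloader.py code:

```python
import re
import sys

# Illustrative only: accept either compression format when scanning the
# dump listing page. The exact filenames Wikia uses may differ.
DUMP_RE = re.compile(r'\b(\S+\.xml\.(?:gz|7z))\b')

def find_dumps(listing_html):
    # Keep progress chatter on stderr so stdout stays clean for piping.
    print("scanning dump listing...", file=sys.stderr)
    return DUMP_RE.findall(listing_html)
```

Routing status messages to stderr means the script's stdout can be redirected or piped without mixing in diagnostics.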
Benjamin Mako Hill eb8b44aef0 strip <sha1> tags returned under <page>
The Wikia API is exporting sha1 sums as part of the response for pages.
These make the XML invalid and cause dump-parsing code (e.g.,
MediaWiki-Utilities) to fail. Also, sha1 sums should belong to revisions, not
pages, so it's not entirely clear to me what this is referring to.
9 years ago
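A minimal sketch of the workaround, assuming the stray tags can simply be removed with a regular expression (the helper name is hypothetical):

```python
import re

def strip_sha1_tags(xml_chunk):
    # Drop <sha1>...</sha1> elements so strict XML dump parsers
    # (e.g., MediaWiki-Utilities) don't choke on them.
    return re.sub(r'\s*<sha1>[^<]*</sha1>', '', xml_chunk)
```

Applied to each page's XML before it is written to the dump file.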
Benjamin Mako Hill 145b2eaaf4 changed getXMLPage() into a generator
The program tended to run out of memory when processing very large pages (i.e.,
pages with extremely large numbers of revisions or pages with large numbers of
very large revisions). This mitigates the problem by changing getXMLPage() into
a generator which allows us to write pages after each request to the API.

This required changes to the getXMLPage() function and to other
parts of the code that call it.

Additionally, the function's output was checked in several ways by its
callers. This required a few changes, including keeping a running tally of
revisions instead of a post hoc check, and moving error checking into an
Exception rather than an if statement that inspected the final result.
9 years ago
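The generator idea can be sketched as follows; fetch_chunk stands in for the real API request and all names are illustrative, not the actual dumpgenerator.py code:

```python
def get_xml_page(fetch_chunk, title):
    """Yield one XML chunk per API request instead of assembling the
    whole page in memory. fetch_chunk returns (xml, revisions, offset),
    with offset=None when there is no continuation."""
    offset = None
    revision_count = 0
    while True:
        xml, revisions, offset = fetch_chunk(title, offset)
        revision_count += len(revisions)  # running tally of revisions
        yield xml  # the caller writes this to disk immediately
        if offset is None:
            break
    # Error checking moves into an exception raised from the generator,
    # replacing the old post hoc check on the assembled result.
    if revision_count == 0:
        raise ValueError("no revisions exported for %r" % title)
```

Because each chunk is yielded as soon as it arrives, memory use stays bounded even for pages with huge revision histories.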
Federico Leva a1921f0919 Update list of wikia.com unarchived wikis
The list of unarchived wikis was compared to the list of wikis that we
managed to download with dumpgenerator.py:
https://archive.org/details/wikia_dump_20141219
To allow the comparison, the naming format was aligned to the format
used by dumpgenerator.py for 7z files.
9 years ago
Emilio J. Rodríguez-Posada 9a6570ec5a Update README.md 10 years ago
Federico Leva ce6fbfee55 Use curl --fail instead and other fixes; add list
Now tested and used to produce the list of some 300k Wikia wikis
which don't yet have a public dump. Will soon be archived.
10 years ago
Federico Leva 7471900e56 It's easier if the list has the actual domains 10 years ago
Federico Leva 8bd3373960 Add wikia.py, to list Wikia wikis we'll dump ourselves 10 years ago
Federico Leva 38e778faad Add 7z2bz2.sh 10 years ago
Marek Šuppa e370257aeb tests: Updated Index endpoint for WikiPapers
* Updated the index endpoint for WikiPapers on Referata, which was previously http://wikipapers.referata.com/w/index.php and now resolves to http://wikipapers.referata.com/index.php.
10 years ago
Marek Šuppa 7b9ca8aa6b tests: Updated API endpoint for WikiPapers
* Updated the API endpoint for WikiPapers on Referata. It used to be http://wikipapers.referata.com/w/api.php and now resolves to http://wikipapers.referata.com/api.php. This was breaking the tests.
10 years ago
Federico Leva e26711afc9 Merge branch 'master' of github.com:WikiTeam/wikiteam 10 years ago
Federico Leva ed2d87418c Update with some wikis done in the last batch 10 years ago
Emilio J. Rodríguez-Posada 43cda4ec01 excluding wiki-site.com farm too 10 years ago
Emilio J. Rodríguez-Posada 7463b16b36 Merge branch 'master' of https://github.com/WikiTeam/wikiteam 10 years ago
Emilio J. Rodríguez-Posada 9681fdfd14 linking to GitHub 10 years ago
Marek Šuppa b003cf94e2 tests: Disable broken Wiki
* Disabled http://wiki.greenmuseum.org/ since it's broken and was breaking the tests (`'Unknown' != 'PhpWiki'`).
10 years ago
Emilio J. Rodríguez-Posada 8d4def5885 improving duplicate filter, removing www. www1., etc; excluding editthis.info 10 years ago
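The improved duplicate filter could look something like this; EXCLUDED and the function names are illustrative, not the actual script:

```python
import re

# Farms excluded from the "not archived" list (per the commits above).
EXCLUDED = ("editthis.info", "wiki-site.com")

def normalize(domain):
    # Strip leading "www.", "www1.", "www2.", ... labels so mirrors of
    # the same wiki collapse to one entry.
    return re.sub(r'^www\d*\.', '', domain.lower())

def dedupe(domains):
    seen = set()
    kept = []
    for d in domains:
        n = normalize(d)
        if n in seen or any(n == e or n.endswith("." + e) for e in EXCLUDED):
            continue
        seen.add(n)
        kept.append(n)
    return kept
```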
Emilio J. Rodríguez-Posada 9ca67fa4d3 not archived wikis script 10 years ago
Federico Leva 8cf4d4e6ea Add 30k domains from another crawler
11011 were found alive by checkalive.py (though there could be more
if one checked more subdomains and subdirectories), and some thousands
more by checklive.pl (but mostly or entirely false positives).

Of the alive ones, about 6245 were new to WikiApiary!
https://wikiapiary.com/wiki/Category:Oct_2014_Import
10 years ago
Federico Leva 7e0071ae7f Add some UseModWiki-looking domains 10 years ago
nemobis 6b11cef9dc A few thousand more doku.php URLs from own scraping 10 years ago
nemobis 0624d0303b Merge pull request #198 from Southparkfan/patch-1
Update list of Orain wikis
10 years ago
Southparkfan 8ca9eb8757 Update date of Orain wikilist 10 years ago
Southparkfan 2e2fe9b818 Update list of Orain wikis 10 years ago
Marek Šuppa 8c44cff165 readme: Small wording fixes
* Small fixes in the `Download Wikimedia dumps` section.
10 years ago
nemobis 6f74781e78 Merge pull request #197 from mrshu/mrshu/autopep8fied-wikiadownloader
wikiadownloader: Autopep8fied
10 years ago
mr.Shu f022b02e47 wikiadownloader: Autopep8fied
* Made the source look a bit better, though this script might not be
  used anymore.

Signed-off-by: mr.Shu <mr@shu.io>
10 years ago
nemobis b3ef165529 Merge pull request #194 from mrshu/mrshu/dumpgenerator-pep8fied
dumpgenerator: AutoPEP8-fied
10 years ago
mr.Shu 04446a40a5 dumpgenerator: AutoPEP8-fied
* Used autopep8 to make sure the code looks nice and is actually PEP8
  compliant.

Signed-off-by: mr.Shu <mr@shu.io>
10 years ago
nemobis 23a60fa850 MediaWiki CamelCase 10 years ago
nemobis 31112b3a80 checkalive.py: more checks before accessing stuff 10 years ago
nemobis 225c3eb478 A thousand more doku.php URLs from search 10 years ago
nemobis e0f8e36bf4 Merge pull request #190 from PiRSquared17/api-allpages-disabled
Fallback to getPageTitlesScraper() if API allpages disabled
10 years ago
nemobis a7e1b13304 Merge pull request #193 from mrshu/mrshu/readme-fix-wording
readme: Fix wording
10 years ago
nemobis 3fc7dcb5de Add some more doku.php URLs 10 years ago
mr.Shu 54c373e9a0 readme: Fix wording
* Made a few wording changes to make the README.md clearer.

Signed-off-by: mr.Shu <mr@shu.io>
10 years ago
Marek Šuppa 40d863fb99 README: update wording
* Updated the wording to make the README clearer.
10 years ago
Emilio J. Rodríguez-Posada 87ce2d4540 Merge pull request #192 from mrshu/mrshu/add-travis-image
update: Add TravisCI image to README
10 years ago
mr.Shu 7b0b54b6e5 update: Add TravisCI image to README
* Added a Travis CI image to the Developers section that shows whether
  the tests are passing.

Signed-off-by: mr.Shu <mr@shu.io>
10 years ago
Emilio J. Rodríguez-Posada 5c8e316e67 Merge pull request #189 from PiRSquared17/get-wiki-engine
Improve getWikiEngine()
10 years ago
Emilio J. Rodríguez-Posada 086415bc00 Merge pull request #191 from mrshu/mrshu/setup-travis
tests: Add .travis.yml and Travis CI
10 years ago
mr.Shu 14c62c6587 tests: Add .travis.yml and Travis CI
* Added .travis.yml to enable Travis CI

Signed-off-by: mr.Shu <mr@shu.io>
10 years ago
PiRSquared17 757019521a Fallback to scraper if API allpages disabled 10 years ago
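The fallback logic can be sketched like this; both callables are hypothetical stand-ins for the real getPageTitlesAPI/getPageTitlesScraper functions:

```python
class AllPagesDisabled(Exception):
    """Raised when a wiki has disabled the API's allpages list."""

def get_page_titles(api_titles, scraper_titles):
    # Prefer the API; fall back to scraping Special:Allpages when the
    # API's allpages module is unavailable.
    try:
        return api_titles()
    except AllPagesDisabled:
        return scraper_titles()
```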
PiRSquared17 4b3c862a58 Comment debugging print, fix test 10 years ago
PiRSquared17 7a1db0525b Add more wiki engines to getWikiEngine 10 years ago
nemobis 40c406cd00 Merge pull request #188 from PiRSquared17/wikiengine-lists
Add subdirectories to listsofwikis for different wiki engines
10 years ago
PiRSquared17 56c2177106 Add (incomplete) list of dokuwikis 10 years ago
PiRSquared17 03ddde3702 Move wiki lists to mediawiki subdirectory 10 years ago
Emilio J. Rodríguez-Posada 43a105335b Merge pull request #185 from PiRSquared17/fix-tests
Relax delay() test by 10 ms, add test for allpages
10 years ago