mirror of https://github.com/WikiTeam/wikiteam synced 2024-11-12 07:12:41 +00:00

Antoine Musso 362309a2da Add tox env for flake8 linter Most people know about pep8 which enforce coding style. pyflakes goes a step beyond by analyzing the code. flake8 is basically a wrapper around both pep8 and pyflakes and comes with some additional checks. I find it very useful since you only need to require one package to have a lot of code issues reported to you. This patch provides a 'flake8' tox environement to easily install and run the utility on the code base. One simply has to: tox -eflake8 The repository in its current state does not pass checks We can later easily ensure there is no regression by adjusting Travis configuration to run this env. The env has NOT been added to the default list of environement. More informations about flake8: https://pypi.python.org/pypi/flake8		2014-11-16 23:06:27 +01:00
batchdownload	Issue 97: Add new siteinfo.json to the archived 7z	2014-06-26 11:11:19 +02:00
listsofwikis	Add 30k domains from another crawler	2014-11-01 22:23:25 +01:00
research/paper-wikiteam-2014	instructions to compile LaTeX paper;	2014-01-24 20:08:16 +00:00
testing	Comment debugging print, fix test	2014-09-23 15:10:06 -04:00
.gitignore	Easily run tests in a virtualenv with tox and nose	2014-07-01 00:52:04 +02:00
.travis.yml	tests: Add .travis.yml and Travis CI	2014-09-23 23:12:40 +02:00
commonschecker.py	Wrong else: missing indentation, endless loop	2014-07-22 11:03:21 +02:00
commonsdownloader.py	Issue #66 : try your.org first	2014-07-21 10:39:08 +02:00
commonssql.py	Issue 85: more cross-platform shebang on all scripts	2014-02-26 23:22:53 +00:00
dumpgenerator.py	Merge pull request #194 from mrshu/mrshu/dumpgenerator-pep8fied	2014-10-01 23:56:36 +02:00
gui.py	Issue 85: more cross-platform shebang on all scripts	2014-02-26 23:22:53 +00:00
LICENSE	renaming gpl.txt to LICENSE	2014-06-25 16:03:15 +02:00
README.md	readme: Small wording fixes	2014-10-04 12:02:48 +02:00
requirements.txt	Update requirements file	2014-07-03 12:40:12 -04:00
tox.ini	Add tox env for flake8 linter	2014-11-16 23:06:27 +01:00
uploader.py	Issue 85: more cross-platform shebang on all scripts	2014-02-26 23:22:53 +00:00
wikiadownloader.py	Issue 85: more cross-platform shebang on all scripts	2014-02-26 23:22:53 +00:00
wikipediadownloader.py	wikiadownloader: Autopep8fied	2014-10-02 23:06:42 +02:00

README.md

WikiTeam

We archive wikis, from Wikipedia to the tiniest wikis

WikiTeam software is a set of tools for archiving wikis. They work on MediaWiki wikis, but we want to expand to other wiki engines. As of June 2014, WikiTeam has preserved more than 13,000 stand-alone wikis, several wikifarms, regular Wikipedia dumps and 34 TB of Wikimedia Commons images.

There are thousands of wikis on the Internet. Every day some of them stop being publicly available and, for lack of backups, are lost forever. Millions of people download tons of media files (movies, music, books, etc.) from the Internet, which serves as a kind of distributed backup. Wikis, most of them under free licenses, disappear from time to time because nobody grabbed a copy of them. That is a shame, and one we would like to remedy.

WikiTeam is the Archive Team (GitHub) subcommittee on wikis. It was founded and originally developed by Emilio J. Rodríguez-Posada, a veteran Wikipedia editor and amateur archivist. Many people have helped by sending suggestions, reporting bugs, writing documentation, providing help on the mailing list and making wiki backups. Thanks to all, especially to: Federico Leva, Alex Buie, Scott Boyd, Hydriz, Platonides, Ian McEwen, Mike Dupont, balr0g and PiRSquared17.

Quick guide

This is a very quick guide to the most used features of the WikiTeam tools. For further information, read the tutorial and the rest of the documentation. You can also ask on the mailing list.

Requirements

Confirm you satisfy the requirements:

pip install --upgrade -r requirements.txt

or, if you don't have enough permissions for the above,

pip install --user --upgrade -r requirements.txt

Download any wiki

To download any wiki, use one of the following options:

python dumpgenerator.py http://wiki.domain.org --xml --images (complete XML histories and images)

If the script can't find the API and/or index.php paths by itself, you can provide them:

python dumpgenerator.py --api=http://wiki.domain.org/w/api.php --xml --images

python dumpgenerator.py --api=http://wiki.domain.org/w/api.php --index=http://wiki.domain.org/w/index.php --xml --images
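If you are not sure which paths to pass, a short helper can list likely candidates to try by hand. This is only a sketch under stated assumptions: the function name `candidate_endpoints` and the list of prefixes are hypothetical, chosen from common MediaWiki layouts (`/w/api.php` as in the commands above, or `api.php` at the site root). It is not how dumpgenerator.py itself detects the paths.

```python
def candidate_endpoints(base_url, prefixes=("/w", "", "/wiki")):
    """Return (api_url, index_url) pairs worth trying for a MediaWiki site.

    The prefixes are an assumption based on common installations,
    not something read from dumpgenerator.py.
    """
    base = base_url.rstrip("/")
    return [
        (base + prefix + "/api.php", base + prefix + "/index.php")
        for prefix in prefixes
    ]


# Print ready-to-paste --api/--index arguments for each candidate layout.
for api_url, index_url in candidate_endpoints("http://wiki.domain.org/"):
    print("--api=%s --index=%s" % (api_url, index_url))
```

Open each candidate api.php URL in a browser first; the one that returns the MediaWiki API help page is the one to pass with --api.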

If you only want the XML histories, just use --xml. For only the images, just --images. For only the current version of every page, --xml --curonly.

You can resume an aborted download:

python dumpgenerator.py --api=http://wiki.domain.org/w/api.php --xml --images --resume --path=/path/to/incomplete-dump
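When archiving several wikis from a script, the flag combinations described above can be collected in one place. The following is a minimal sketch, not part of the WikiTeam tools: `build_dump_command`, `MODE_FLAGS` and the mode names are hypothetical, and only flags that appear in this README are used.

```python
# Flag sets for the modes described in this guide (mode names are made up here).
MODE_FLAGS = {
    "full": ["--xml", "--images"],        # complete XML histories and images
    "history": ["--xml"],                 # XML histories only
    "images": ["--images"],               # images only
    "current": ["--xml", "--curonly"],    # current version of every page
}


def build_dump_command(api_url, mode="full", index_url=None, resume_path=None):
    """Build the argument list for one dumpgenerator.py run."""
    if mode not in MODE_FLAGS:
        raise ValueError("unknown mode: %r" % (mode,))
    cmd = ["python", "dumpgenerator.py", "--api=" + api_url]
    if index_url:
        cmd.append("--index=" + index_url)
    cmd.extend(MODE_FLAGS[mode])
    if resume_path:
        # Continue an aborted download stored in resume_path.
        cmd.extend(["--resume", "--path=" + resume_path])
    return cmd


print(" ".join(build_dump_command("http://wiki.domain.org/w/api.php", mode="current")))
```

The returned list can be passed to `subprocess.call` from the directory that contains dumpgenerator.py; building a list rather than a single string avoids shell quoting problems with unusual URLs or paths.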

See more options:

python dumpgenerator.py --help

Download Wikimedia dumps

To download Wikimedia XML dumps (Wikipedia, Wikibooks, Wikinews, etc) you can run:

python wikipediadownloader.py (download all projects)

See more options:

python wikipediadownloader.py --help

Download Wikimedia Commons images

There is a script for this, but we have uploaded the tarballs to Internet Archive, so it's more useful to reseed their torrents than to re-generate old ones with the script.

Developers

You can run the tests easily with the tox command. It is probably already available on your operating system; you need version 1.6. If it is not, you can install it from PyPI with: pip install tox.

Example usage:

$ tox
py27 runtests: commands[0] | nosetests --nocapture --nologcapture
Checking http://wiki.annotation.jp/api.php
Trying to parse かずさアノテーション - ソーシャル・ゲノム・アノテーション.jpg from API
Retrieving image filenames
.    Found 266 images
.
-------------------------------------------
Ran 1 test in 2.253s

OK
_________________ summary _________________
  py27: commands succeeded
  congratulations :)
$