Approach suggested by @makoshark; I finally found the time to start
implementing it.
* Do not produce and save the titles list all at once. Instead, use
the scraper and API as generators and save titles on the go. Also,
try to start the generator from the appropriate title.
For now, title sorting is not implemented. Pages will be in the
order given by namespace ID, then page name.
* When resuming, read both the title list and the XML file from the
end rather than the beginning. If the correct terminator is
present, only one line needs to be read (a rough sketch follows at
the end of this message).
* In both cases, use a generator instead of a huge list in memory.
* Also truncate the resumed XML without writing it from scratch.
For now this uses GNU ed: very compact, though shelling out is ugly.
I gave up on using file.seek and file.truncate, because avoiding a
read of the whole file from the beginning would have meant
complicating reverse_readline() with more offset calculations.
This should avoid MemoryError in most cases.
Tested by running a dump over a 1.24 wiki with 11 pages: a complete
dump, and a resumed dump after interrupting a run with ctrl-c.
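As a rough sketch of the resume idea (reverse_readline() is the helper
mentioned above, but the body below and the '--END--' terminator marker are
illustrative assumptions, not a copy of the actual code):

    import os

    def reverse_readline(path, buf_size=8192):
        """Yield the lines of a file from the end, without loading it whole."""
        with open(path, 'rb') as fh:
            fh.seek(0, os.SEEK_END)
            position = fh.tell()
            leftover = b''
            while position > 0:
                read_size = min(buf_size, position)
                position -= read_size
                fh.seek(position)
                chunk = fh.read(read_size) + leftover
                lines = chunk.split(b'\n')
                leftover = lines.pop(0)   # may be a partial line; keep for next pass
                for line in reversed(lines):
                    if line:
                        yield line.decode('utf-8')
            if leftover:
                yield leftover.decode('utf-8')

    def last_saved_title(titles_path):
        """Return the last title written, or None if nothing needs resuming."""
        for line in reverse_readline(titles_path):
            if line == '--END--':   # hypothetical terminator: list is complete
                return None
            return line             # last progress; restart the generator here
        return None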
* In a previous commit I had noticed that maxretries was not respected
in getXMLPageCore, but for some reason I didn't fix it. Done now.
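For illustration only (not the actual getXMLPageCore body), a retry loop that
really honors a maxretries limit could look like this; the names here are
assumptions:

    import time

    def fetch_with_retries(do_request, maxretries=5, delay=10):
        """Call do_request() until it succeeds or maxretries attempts are used."""
        for attempt in range(1, maxretries + 1):
            try:
                return do_request()
            except Exception:
                if attempt == maxretries:
                    raise           # stop retrying instead of looping forever
                time.sleep(delay)   # back off before the next attempt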
* If the "Special" namespace alias doesn't work, fetch the local one.
My previous code broke the missing-page detection, with two negative
outcomes:
- missing pages were not reported in the error log
- every missing page generated an extraneous "</page>" line in the
output, which rendered dumps invalid
This patch improves the exception handling in general and fixes both of
these issues.
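A sketch of the shape of that fix, with a dedicated exception for missing
pages so the caller logs the error instead of emitting a stray "</page>"; the
class and helper names below are my own, not necessarily what the patch uses:

    class PageMissingError(Exception):
        def __init__(self, title, xml):
            self.title = title
            self.xml = xml

    def dump_page(title, get_xml_page, xmlfile, errorsfile):
        """Write one page to the dump; on a missing page, log it and write
        nothing, so no unbalanced </page> ends up in the output."""
        try:
            for chunk in get_xml_page(title):   # assumed generator of XML chunks
                xmlfile.write(chunk)
        except PageMissingError:
            errorsfile.write('error while retrieving the page %s: missing\n' % title)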
Bitrot seems to have gotten the best of this script and it sounds like it
hasn't been used. This at least gets it to work by:
- finding both the .gz and the .7z dumps
- parsing the new date format in the HTML
- finding dumps in the correct place
- moving all chatter to stderr instead of stdout
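For instance, the "chatter to stderr" and "both .gz and .7z" parts boil down
to patterns like the following (illustrative only; the paths and messages are
made up):

    from __future__ import print_function
    import glob
    import os
    import sys

    dump_dir = '.'  # hypothetical location of the downloaded dumps

    # status messages go to stderr so stdout stays clean for real output
    print('Looking for existing dumps...', file=sys.stderr)

    # pick up both compression formats
    dumps = glob.glob(os.path.join(dump_dir, '*.gz')) + \
            glob.glob(os.path.join(dump_dir, '*.7z'))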
The test is failing. https://travis-ci.org/WikiTeam/wikiteam/builds/50102997#L546
Might be our fault, but they just updated their code:
Tyrian – (f313f23) 12:47, 23 January 2015 GPLv3+ Gentoo's new web theme ported to MediaWiki. Alex Legler
I don't think testing screen scraping against a theme used only by Gentoo makes much sense for us.
The Wikia API is exporting sha1 sums as part of the response for pages.
These are not valid in the export XML and are causing dump parsing code (e.g.,
MediaWiki-Utilities) to fail. Also, sha1 should belong to revisions, not pages,
so it's not entirely clear to me what this value is referring to.
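One hedged way to cope with it is to drop <sha1> elements that sit outside any
<revision> before writing the page; this is only a sketch of the idea, not a
copy of the actual change:

    def strip_page_level_sha1(page_xml):
        """Remove <sha1> lines that are not inside a <revision> element."""
        out = []
        in_revision = False
        for line in page_xml.splitlines(True):
            stripped = line.strip()
            if stripped.startswith('<revision'):
                in_revision = True
            elif stripped.startswith('</revision>'):
                in_revision = False
            if stripped.startswith('<sha1') and not in_revision:
                continue    # page-level sha1 is not valid export XML; skip it
            out.append(line)
        return ''.join(out)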
The program tended to run out of memory when processing very large pages (i.e.,
pages with extremely large numbers of revisions, or pages with large numbers of
very large revisions). This mitigates the problem by changing getXMLPage() into
a generator, which allows us to write pages to disk after each request to the API.
This required changes to the getXMLPage() function itself and to the other
parts of the code that called it.
Additionally, the text returned by the function used to be checked in several
ways after the call. This required a few changes, including keeping a running
tally of revisions instead of a post hoc check, and moving the error checking
into an exception rather than an if statement that looked at the final result.
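In outline, the generator version looks something like the sketch below. The
fetch_revisions() callable stands in for the real API request code and the XML
details are simplified; the point is that each chunk is yielded (and can be
written to disk) as soon as it arrives, with a running tally of revisions kept
along the way:

    def xml_page_chunks(title, fetch_revisions):
        """Yield the XML of one page piece by piece.

        fetch_revisions(title, offset) is a stand-in for the real request:
        it returns (list_of_revision_xml_strings, next_offset_or_None).
        """
        yield '  <page>\n    <title>%s</title>\n' % title  # title assumed escaped
        numberofedits = 0           # running tally instead of a post hoc check
        offset = None
        while True:
            revisions, offset = fetch_revisions(title, offset)
            for revision_xml in revisions:
                yield revision_xml
                numberofedits += 1
            if offset is None:      # no continuation left: we have everything
                break
        yield '  </page>\n'

The caller then simply does "for chunk in xml_page_chunks(...): xmlfile.write(chunk)",
which is what keeps memory use flat regardless of how many revisions a page has.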
The list of unarchived wikis was compared to the list of wikis that we
managed to download with dumpgenerator.py:
https://archive.org/details/wikia_dump_20141219
To allow the comparison, the naming format was aligned to the format
used by dumpgenerator.py for 7z files.
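Roughly, the alignment amounts to deriving the same kind of prefix that
dumpgenerator.py uses for its archives; the exact pattern below
('<prefix>-<date>-wikidump.7z', prefix built from the wiki URL) is my reading
of that format and may not match every detail:

    import re

    def dump_filename(wiki_url, date):
        """Build a 7z dump filename in (assumed) dumpgenerator.py style."""
        prefix = re.sub(r'^https?://', '', wiki_url)
        prefix = re.sub(r'/(api|index)\.php.*$', '', prefix)
        prefix = re.sub(r'[^A-Za-z0-9]', '', prefix.lower())
        return '%s-%s-wikidump.7z' % (prefix, date)

    # dump_filename('http://muppet.wikia.com/api.php', '20141219')
    # -> 'muppetwikiacom-20141219-wikidump.7z'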
Most people know about pep8, which enforces coding style. pyflakes
goes a step further by analyzing the code.
flake8 is basically a wrapper around both pep8 and pyflakes and comes
with some additional checks. I find it very useful since you only need
to require one package to have a lot of code issues reported to you.
This patch provides a 'flake8' tox environment to easily install
and run the utility on the code base. One simply has to:
tox -eflake8
The repository in its current state does not pass the checks. We can
later easily ensure there is no regression by adjusting the Travis
configuration to run this env.
The env has NOT been added to the default list of environments.
More information about flake8: https://pypi.python.org/pypi/flake8
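The env itself only needs a few lines in tox.ini; something along these lines
(the exact deps and options in the actual patch may differ):

    [testenv:flake8]
    deps = flake8
    commands = flake8 .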
11011 were found alive by checkalive.py (though there could be more
if one checks more subdomains and subdirectories), some thousands
more by checklive.pl (but mostly or all false positives).
Of the alive ones, about 6245 were new to WikiApiary!
https://wikiapiary.com/wiki/Category:Oct_2014_Import