* http://biografias.bcn.cl/api.php does not like the data to be POSTed.
Just use URL parameters. Some wikis had anti-spam protections which
made us POST everything, but for most wikis this should be fine.
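A minimal sketch of the switch, assuming a requests-style call
(parameter building in dumpgenerator.py happens elsewhere; names here
are illustrative):

    import requests

    def call_api(api_url, params):
        # Send the parameters in the query string instead of POSTing
        # them; some endpoints, e.g. biografias.bcn.cl, reject POST.
        r = requests.get(api_url, params=params, timeout=30)
        r.raise_for_status()
        return r.json()

Wikis whose anti-spam protections block GET can still be special-cased
back to a POST request.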
* If the index is not defined, don't fail.
* Use only the base api.php URL, not parameters, in domain2prefix.
https://github.com/WikiTeam/wikiteam/issues/314
File "./dumpgenerator.py", line 1212, in generateImageDump
if not re.search(r'</mediawiki>', xmlfiledesc):
UnboundLocalError: local variable 'xmlfiledesc' referenced before assignment
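The error happens because xmlfiledesc is only assigned inside a branch
that can be skipped. A hedged sketch of the fix, with a hypothetical
fetch helper standing in for the real request:

    import re

    def get_image_description(fetch_description):
        # Initialize before any conditional assignment so the check
        # below cannot hit an unbound local when fetching fails.
        xmlfiledesc = ''
        try:
            xmlfiledesc = fetch_description()  # hypothetical helper
        except Exception:
            pass
        # Safe even when no description was retrieved at all
        if not re.search(r'</mediawiki>', xmlfiledesc):
            xmlfiledesc = ''  # treat a truncated export as empty
        return xmlfiledesc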
We work well without api.php; this was a needless suicide, especially
as sysadmins sometimes like to disable the API for no reason, leaving
index.php as our only option to archive the wiki.
* Log "XML export on this wiki is broken, quitting." to the error
file so that grepping reveals which dumps were interrupted so.
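A sketch of the call, assuming a logerror()-style helper that appends
to the dump directory's errors file (the exact helper in
dumpgenerator.py may differ):

    import datetime

    def logerror(config, text):
        # Append a timestamped line so that
        #   grep 'XML export on this wiki is broken' */errors.log
        # lists every dump interrupted this way.
        with open('%s/errors.log' % config['path'], 'a') as f:
            f.write('%s: %s\n' % (datetime.datetime.now().isoformat(), text))

    # logerror(config, 'XML export on this wiki is broken, quitting.')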
* Automatically reduce export size for a page when downloading the
entire history at once results in a MemoryError.
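A sketch of the fallback with hypothetical helper names: halve the
number of revisions requested per export until the response fits in
memory.

    def export_page_history(fetch_revisions, title, limit=1000):
        # fetch_revisions(title, limit) stands in for the actual
        # Special:Export request; on MemoryError, ask for fewer
        # revisions at a time instead of aborting the whole dump.
        while limit >= 1:
            try:
                return fetch_revisions(title, limit)
            except MemoryError:
                limit = limit // 2  # reduce the export size, retry
        raise MemoryError('cannot export %s one revision at a time' % title)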
* Truncate the file with a pythonic method (.seek and .truncate)
while reading from the end, by making reverse_readline() a weird
hybrid to avoid an actual coroutine.
Approach suggested by @makoshark, finally found the time to start
implementing it.
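A simplified sketch of the idea; the real reverse_readline() hybrid
also yields lines, so only the backward scan plus .seek/.truncate is
shown here, and the terminator argument is an assumption:

    import os

    def truncate_after_last(path, terminator, block_size=8192):
        # Scan the file backwards in blocks for the last occurrence
        # of `terminator`, then cut the file right after it, without
        # ever reading it from the start.
        with open(path, 'r+b') as fh:
            fh.seek(0, os.SEEK_END)
            offset = fh.tell()
            tail = b''
            while offset > 0:
                read_size = min(block_size, offset)
                offset -= read_size
                fh.seek(offset)
                block = fh.read(read_size) + tail
                pos = block.rfind(terminator)
                if pos != -1:
                    fh.seek(offset + pos + len(terminator))
                    fh.truncate()  # drop everything after it
                    return True
                # keep a few bytes so a terminator straddling two
                # blocks is still found on the next (earlier) pass
                tail = block[:len(terminator) - 1]
            return False

For example, truncate_after_last('wikidump.xml', b'</page>') chops a
partially written page off an interrupted dump before resuming.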
* Do not produce and save the titles list all at once. Instead, use
the scraper and API as generators and save titles on the go. Also,
try to start the generator from the appropriate title.
For now the title sorting is not implemented. Pages will be in the
order given by namespace ID, then page name (generator sketched below).
* When resuming, read both the title list and the XML file from the
end rather than the beginning. If the correct terminator is
present, only one line needs to be read.
* In both cases, use a generator instead of a huge list in memory.
* Also truncate the resumed XML without writing it from scratch.
For now using GNU ed, sketched below: very compact, though shelling
out is ugly. I gave up on using file.seek and file.truncate, to avoid
reading the whole file from the beginning or complicating
reverse_readline() with more offset calculations.
This should avoid MemoryError in most cases.
Tested by running a dump over a 1.24 wiki with 11 pages: a complete
dump and a resumed dump from a dump interrupted with ctrl-c.
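A sketch of the generator side, using the MediaWiki allpages API;
apfrom is the real API parameter that restarts the listing from a
given title, while the surrounding code is illustrative:

    import requests

    def titles_from_api(api_url, namespace=0, start=''):
        # Yield titles one by one so they can be saved on the go,
        # instead of building the whole list in memory first.
        params = {'action': 'query', 'list': 'allpages',
                  'apnamespace': namespace, 'apfrom': start,
                  'aplimit': 500, 'format': 'json', 'continue': ''}
        while True:
            data = requests.get(api_url, params=params).json()
            for page in data['query']['allpages']:
                yield page['title']
            if 'continue' not in data:
                break
            params.update(data['continue'])  # carries apcontinue

And a sketch of the ed-based truncation mentioned above; the exact
invocation in dumpgenerator.py may differ:

    import subprocess

    def truncate_xml_dump(path):
        # ?pat? searches backwards, so this addresses the line after
        # the last </page>, deletes through the end of the file,
        # writes and quits. If nothing follows the last </page>, ed
        # exits non-zero and there was nothing to truncate anyway.
        script = b'?</page>?+1,$d\nw\nq\n'
        subprocess.run(['ed', '-s', path], input=script)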
* In a previous commit I had noticed that maxretries was not respected
in getXMLPageCore, but for some reason I didn't fix it. Done now.
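A sketch of the loop shape in question, with illustrative names; the
point is that the retry counter is now actually checked against the
configured maximum:

    import time

    def get_with_retries(fetch, maxretries=5, delay=10):
        # Stop after maxretries attempts instead of retrying forever;
        # previously the counter was incremented but never enforced.
        retries = 0
        while retries < maxretries:
            try:
                return fetch()
            except IOError:
                retries += 1
                time.sleep(delay * retries)  # back off a bit more
        return None  # caller treats None as giving up on this page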
* If the "Special" namespace alias doesn't work, fetch the local one.