* Still maintained and available for Python 3 as well.
* Allows raw API requests, as we need.
* Does not provide handy generators, so we have to handle continuation
  ourselves (see the sketch after this list).
* Decides on its own which protocol and exact path to use, and sometimes fails at it.
* Appears to use POST by default unless asked otherwise; what to do?
* It was just an old trick to get past some barriers which were waived with GET.
* It's not conformant and doesn't play well with some redirects.
* Some recent wikis seem not to like it at all; see also issue #311.
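Since we have to drive continuation ourselves, here is a minimal sketch of that
manual loop, using the standard MediaWiki 'continue' mechanism and plain
requests (a sketch, not dumpgenerator.py's actual code):

import requests

def api_query_all(api_url, params, session=None):
    # Yield successive API replies, following MediaWiki continuation by hand.
    session = session or requests.Session()
    base = dict(params, action='query', format='json')
    cont = {}
    while True:
        req = dict(base)
        req.update(cont)              # pass the whole 'continue' block back verbatim
        data = session.get(api_url, params=req, timeout=30).json()
        yield data
        if 'continue' not in data:
            break
        cont = data['continue']

# e.g.: for chunk in api_query_all(api, {'list': 'allpages', 'aplimit': 'max'}): ...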
Warning!: "./tdicampswikiacom-20180522-wikidump" path exists
Traceback (most recent call last):
File "./dumpgenerator.py", line 2321, in <module>
main()
File "./dumpgenerator.py", line 2283, in main
while reply.lower() not in ['yes', 'y', 'no', 'n']:
UnboundLocalError: local variable 'reply' referenced before assignment
Traceback (most recent call last):
File "./dumpgenerator.py", line 2294, in <module>
main()
File "./dumpgenerator.py", line 2239, in main
config, other = getParameters(params=params)
File "./dumpgenerator.py", line 1587, in getParameters
if api and check:
UnboundLocalError: local variable 'check' referenced before assignment
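Both tracebacks show the same pattern: a local that is only assigned inside a
branch which may never run. A minimal sketch of the shape of the fix
(hypothetical helper, not the actual dumpgenerator.py code):

def ask_resume(path, assume_yes=False):
    # Fix: give 'reply' a value unconditionally, so the loop below can never
    # hit an unassigned local (the cause of the UnboundLocalError above).
    reply = 'yes' if assume_yes else ''
    while reply.lower() not in ['yes', 'y', 'no', 'n']:
        # raw_input() on Python 2
        reply = input('Warning!: "%s" path exists. Resume? [yes/no] ' % path)
    return reply.lower() in ('yes', 'y')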
Test case:
Titles saved at... 39fanficwikiacom-20180521-titles.txt
377 page titles loaded
http://39fanfic.wikia.com/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
30 namespaces found
Exporting revisions from namespace 0
Warning. Could not use allrevisions, wiki too old.
1 more revisions exported
Traceback (most recent call last):
File "./dumpgenerator.py", line 2291, in <module>
main()
File "./dumpgenerator.py", line 2283, in main
createNewDump(config=config, other=other)
File "./dumpgenerator.py", line 1849, in createNewDump
generateXMLDump(config=config, titles=titles, session=other['session'])
File "./dumpgenerator.py", line 732, in generateXMLDump
for xml in getXMLRevisions(config=config, session=session):
File "./dumpgenerator.py", line 861, in getXMLRevisions
yield makeXmlFromPage(pages[page])
File "./dumpgenerator.py", line 880, in makeXmlFromPage
E.username(str(rev['user'])),
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-3: ordinal not in range(128)
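The crash comes from forcing a non-ASCII username through str() under Python 2.
A minimal sketch of a safer conversion (the helper name is made up):

def to_text(value):
    # Return the value as unicode text; never round-trip it through str(),
    # which dies on non-ASCII under Python 2.
    if isinstance(value, bytes):               # Python 2 str / raw bytes
        return value.decode('utf-8', 'replace')
    return value                               # already unicode text

# e.g. E.username(to_text(rev['user'])) instead of E.username(str(rev['user']))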
Apparently the initial JSON test is not enough: the JSON can be broken
or unexpected in other ways or at other points.
Fall back to the old scraper in such a case.
Fixes https://github.com/WikiTeam/wikiteam/issues/295, perhaps.
If the scraper doesn't work for the wiki, the dump will fail entirely,
even though the list of titles may have been almost complete. A different
solution may be in order.
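A minimal sketch of that check-and-fallback idea, for the kind of JSON failure
shown in the tracebacks below (the scraper call is a hypothetical stand-in):

import requests

def get_json_or_none(session, url, params):
    # Return the decoded API reply, or None when the body is not valid JSON.
    try:
        r = session.get(url, params=params, timeout=30)
        return r.json()                # raises a ValueError subclass on broken JSON
    except (requests.RequestException, ValueError):
        return None

# data = get_json_or_none(session, apiurl,
#                         {'action': 'query', 'meta': 'siteinfo', 'format': 'json'})
# if data is None:
#     titles = get_page_titles_scraper(indexurl, session)   # hypothetical fallback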
Traceback (most recent call last):
File "dumpgenerator.py", line 2214, in <module>
print 'Trying to use path "%s"...' % (config['path'])
File "dumpgenerator.py", line 2210, in main
elif reply.lower() in ['no', 'n']:
File "dumpgenerator.py", line 1977, in saveSiteInfo
File "dumpgenerator.py", line 1711, in getJSON
return False
File "/usr/lib/python2.7/site-packages/requests/models.py", line 892, in json
return complexjson.loads(self.text, **kwargs)
File "/usr/lib64/python2.7/site-packages/simplejson/__init__.py", line 516, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 374, in decode
obj, end = self.raw_decode(s)
File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 404, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Or if, for instance, the directory was renamed compared to the saved config:
Resuming previous dump process...
Traceback (most recent call last):
File "./dumpgenerator.py", line 2238, in <module>
main()
File "./dumpgenerator.py", line 2228, in main
resumePreviousDump(config=config, other=other)
File "./dumpgenerator.py", line 1829, in resumePreviousDump
if lasttitle == '--END--':
UnboundLocalError: local variable 'lasttitle' referenced before assignment
* Do not try exportnowrap first: it returns a blank page.
* Add an allpages option, which simply uses readTitles but cannot resume.
FIXME: this only exports the current revision!
$ python dumpgenerator.py --xml --index=http://meritbadge.org/wiki/index.php
fails on at least one MediaWiki 1.12 wiki:
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
Traceback (most recent call last):
File "dumpgenerator.py", line 2211, in <module>
main()
File "dumpgenerator.py", line 2203, in main
createNewDump(config=config, other=other)
File "dumpgenerator.py", line 1766, in createNewDump
getPageTitles(config=config, session=other['session'])
File "dumpgenerator.py", line 400, in getPageTitles
test = getJSON(r)
File "dumpgenerator.py", line 1708, in getJSON
return request.json()
File "/usr/lib/python2.7/site-packages/requests/models.py", line 892, in json
return complexjson.loads(self.text, **kwargs)
File "/usr/lib64/python2.7/site-packages/simplejson/__init__.py", line 516, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 374, in decode
obj, end = self.raw_decode(s)
File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 404, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
* http://biografias.bcn.cl/api.php does not like the data to be POSTed.
Just use URL parameters (see the sketch after this list). Some wikis had
anti-spam protections which made us POST everything, but for most wikis
this should be fine.
* If the index is not defined, don't fail.
* Use only the base api.php URL, not parameters, in domain2prefix.
https://github.com/WikiTeam/wikiteam/issues/314
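A minimal sketch of the GET-first behaviour described above, retrying with POST
only when the wiki refuses the GET (the retry condition here is an assumption):

import requests

def api_request(api_url, params, session=None):
    # Send the query as URL parameters; fall back to POST only if GET is refused.
    session = session or requests.Session()
    r = session.get(api_url, params=params, timeout=30)
    if r.status_code in (405, 414):    # assumption: method not allowed / URI too long
        r = session.post(api_url, data=params, timeout=30)
    return r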
File "./dumpgenerator.py", line 1212, in generateImageDump
if not re.search(r'</mediawiki>', xmlfiledesc):
UnboundLocalError: local variable 'xmlfiledesc' referenced before assignment
We work well without api.php, so this was a needless suicide, especially
as sysadmins sometimes like to disable the API for no reason and then
index.php is our only option to archive the wiki.
* Log "XML export on this wiki is broken, quitting." to the error
file so that grepping reveals which dumps were interrupted so.
* Automatically reduce export size for a page when downloading the
entire history at once results in a MemoryError.
* Truncate the file with a pythonic method (.seek and .truncate)
while reading from the end, by making reverse_readline() a weird
hybrid to avoid an actual coroutine.
Approach suggested by @makoshark, finally found the time to start
implementing it.
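A sketch of the .seek()/.truncate() truncation mentioned above, assuming the
caller has already located the byte offset of the last good line:

def truncate_at(path, offset):
    # Drop everything after byte 'offset', in place, without rewriting the file.
    with open(path, 'r+b') as fh:
        fh.seek(offset)
        fh.truncate()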
* Do not produce and save the titles list all at once. Instead, use
the scraper and API as generators and save titles on the go. Also,
try to start the generator from the appropriate title.
For now the title sorting is not implemented. Pages will be in the
order given by namespace ID, then page name.
* When resuming, read both the title list and the XML file from the
end rather than from the beginning (see the reverse_readline() sketch
below). If the correct terminator is present, only one line needs to
be read.
* In both cases, use a generator instead of a huge list in memory.
* Also truncate the resumed XML without writing it from scratch.
For now using GNU ed: very compact, though shelling out is ugly.
I gave up on using file.seek and file.truncate to avoid reading the
whole file from the beginning or complicating reverse_readline()
with more offset calculations.
This should avoid MemoryError in most cases.
Tested by running a dump over a 1.24 wiki with 11 pages: a complete
dump and a resumed dump from a dump interrupted with ctrl-c.
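For reference, a minimal sketch of the reverse_readline() idea (yielding a
file's lines from the end without loading it whole); not the exact
implementation:

import os

def reverse_readline(path, buf_size=8192):
    # Yield the lines of a file from the last one to the first.
    with open(path, 'rb') as fh:
        fh.seek(0, os.SEEK_END)
        pos = fh.tell()
        leftover = b''
        while pos > 0:
            read_size = min(buf_size, pos)
            pos -= read_size
            fh.seek(pos)
            chunk = fh.read(read_size) + leftover
            lines = chunk.split(b'\n')
            leftover = lines.pop(0)    # possibly a partial line; kept for the next chunk
            for line in reversed(lines):
                yield line.decode('utf-8', 'replace')
        if leftover:
            yield leftover.decode('utf-8', 'replace')

# e.g. find the last non-empty line of a resumed titles list:
# last = next(line for line in reverse_readline('titles.txt') if line.strip())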
* For some reason, in a previous commit I had noticed that maxretries
was not respected in getXMLPageCore, but I didn't fix it. Done now.
* If the "Special" namespace alias doesn't work, fetch the local one.
My previous code broke the detection of missing pages, with two negative
outcomes:
- missing pages were not reported in the error log
- every missing page generated an extraneous "</page>" line in the output,
which rendered the dumps invalid
This patch improves the exception code in general and fixes both of these
issues.
The Wikia API is exporting sha1 sums as part of the response for pages.
These are invalid XML and cause dump parsing code (e.g.,
MediaWiki-Utilities) to fail. Also, sha1 should be a property of revisions,
not pages, so it's not entirely clear to me what this is referring to.
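One possible cleanup, sketched here, is simply dropping the <sha1> elements
before writing the page XML (not necessarily what the actual fix does):

import re

SHA1_TAG = re.compile(r'\s*<sha1>[^<]*</sha1>')

def strip_sha1(page_xml):
    # Drop all <sha1> elements from the given XML fragment; Wikia emits them
    # at the page level, where the dump schema does not allow them.
    return SHA1_TAG.sub('', page_xml)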
The program tended to run out of memory when processing very large pages (i.e.,
pages with extremely large numbers of revisions or pages with large numbers of
very large revisions). This mitigates the problem by changing getXMLPage() into
a generator which allows us to write pages after each request to the API.
This required changes to the getXMLPage() function and to other
parts of the code that called it.
Additionally, when the function was called, its text was checked in several
ways. This required a few changes, including a running tally of revisions
instead of a post hoc check, and moving the error checking into an
Exception rather than just an if statement that looked at the final result.
Properly fixes #74.
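A minimal sketch of the generator shape described above (the fetch callable
and its signature are placeholders, not the real getXMLPage() internals):

def export_page(title, fetch_chunk, limit=1000):
    # Yield the page's revision XML chunk by chunk, so the caller can write
    # each chunk to disk immediately instead of holding the whole history
    # in memory. fetch_chunk(title, offset, limit) is assumed to return
    # (xml_fragment, next_offset), with next_offset None when exhausted.
    offset = None
    while True:
        fragment, offset = fetch_chunk(title, offset, limit)
        yield fragment
        if offset is None:
            break

The caller can then count revisions per yielded chunk, which is the running
tally mentioned above, instead of a post hoc check on the final result.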
Algorithm:
1. Try all siteinfo props. If this gives an error, continue. Otherwise, stop.
2. Try MediaWiki 1.11-1.12 siteinfo props. If this gives an error, continue. Otherwise, stop.
3. Try minimal siteinfo props. Stop.
Not using sishowalldb=1 (by default), to avoid a possible error, since this data is of little use anyway.
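A sketch of that three-step fallback, assuming requests; the exact siprop lists
are illustrative:

import requests

SIPROP_ATTEMPTS = [
    'general|namespaces|namespacealiases|statistics|dbrepllag|interwikimap',  # 1. all props
    'general|namespaces|statistics|dbrepllag|interwikimap',                   # 2. MediaWiki 1.11-1.12
    'general|namespaces',                                                     # 3. minimal
]

def get_siteinfo(api_url, session=None):
    session = session or requests.Session()
    data = None
    for siprop in SIPROP_ATTEMPTS:
        try:
            data = session.get(api_url, params={
                'action': 'query', 'meta': 'siteinfo',
                'siprop': siprop, 'format': 'json'}, timeout=30).json()
        except ValueError:
            continue                   # broken reply: try the next, smaller query
        if 'error' not in data:
            break                      # this query worked, keep its result
    return data                        # after the minimal query we stop regardless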