There was already code that looked like it was supposed to truncate files, but it calculated the index wrong and didn't properly check all lines. That turned out not to matter, because it never actually called the truncate function.
Now the file is truncated back to the last `</page>` tag. If the XML file already ends with `</page>`, nothing gets truncated. The page is then appended after that point; if nothing was truncated, this still results in the same page being listed twice (which already happened with the missing truncation), but if truncation did happen, the file should no longer be invalid.
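A minimal sketch of the new behaviour, assuming the dump is rewound in place (the function name and the byte-based handling are illustrative, not the actual dumpgenerator.py code):

```python
def truncate_to_last_page(path):
    # Work on bytes so the offset returned by rfind() matches what
    # truncate() expects.
    with open(path, 'rb+') as xmlfile:
        content = xmlfile.read()
        cut = content.rfind(b'</page>')
        if cut != -1:
            # Keep everything up to and including the last "</page>";
            # the next page is appended after this point.
            xmlfile.truncate(cut + len(b'</page>'))
```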
Before, you would get `UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal`. The `%s` versus `{}` change was needed because otherwise you would get `UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)`. There is probably a better way of solving that, but this one works.
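A minimal Python 2 illustration of the comparison problem, with made-up values rather than the actual dumpgenerator.py variables:

```python
# Python 2 only: a UTF-8 byte string cannot be decoded as ASCII, so
# comparing it with a unicode string emits UnicodeWarning and the two
# are simply treated as unequal.
title_bytes = '\xc3\x84rger'   # UTF-8 bytes for a title with an umlaut
title_unicode = u'\xc4rger'    # the same title as a unicode object

print(title_bytes != title_unicode)                  # UnicodeWarning, True
print(title_bytes.decode('utf-8') != title_unicode)  # False, as intended
```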
There are small typos in:
- dumpgenerator.py
- wikiteam/mediawiki.py
Fixes:
- Should read `inconsistencies` rather than `inconsistences`.
- Should read `partially` rather than `partialy`.
Using the API and the Special:Allpages scraper should result in the same number of titles.
Fix the detection of the next subpage links on Special:Allpages.
Change the max depth to 100 and implement an anti-loop check (could fail on non-Western wikis).
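A rough sketch of the depth limit and anti-loop guard, assuming the scraping itself is factored out into a caller-supplied function (the names here are hypothetical, not the real dumpgenerator.py code):

```python
def walk_allpages(start_url, get_next_subpage, max_depth=100):
    """Follow Special:Allpages continuation links without looping forever.

    get_next_subpage(url) is expected to scrape the page and return the
    next subpage URL, or None when there is nothing left.
    """
    seen = set()
    url = start_url
    for _ in range(max_depth):
        if url is None or url in seen:
            break          # finished, or a loop was detected
        seen.add(url)
        url = get_next_subpage(url)
```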
getXMLPage() yields on `</page>`, so xmlfiledesc cannot contain `</mediawiki>`.
Change the search to `</page>` and inject `</mediawiki>` if it is missing, to fix up the XML.
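A minimal sketch of the fix-up, under the assumption that only the tail of the finished dump needs checking (function name and chunk size are illustrative):

```python
def ensure_mediawiki_closed(path):
    with open(path, 'rb+') as xmlfile:
        xmlfile.seek(0, 2)                      # jump to end of file
        size = xmlfile.tell()
        xmlfile.seek(max(0, size - 1024))       # only the tail matters
        tail = xmlfile.read()
        if b'</mediawiki>' not in tail:
            xmlfile.write(b'\n</mediawiki>\n')  # close the root element
```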
* Make it easy to batch requests for some wikis where millions of titles
  are really just one-revision thread items and need to be gone through
  as fast as possible (see the batching sketch below).
* Add an error message with the HTTP status code.
Tested with a partial dump of over 100 MB:
https://tinyvillage.fandom.com/api.php
(grepped <title> to check that the previously downloaded ones were kept and
the new ones continued from where expected; did not validate the final XML).
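A minimal sketch of the batching idea, assuming revision IDs are collected first and then fetched in chunks through the standard API (the chunk size and rvprop values are illustrative; this is not the exact dumpgenerator.py code):

```python
import requests

def fetch_revisions_in_batches(api_url, revids, batch_size=50):
    # One API call per batch instead of one call per title/revision.
    session = requests.Session()
    for i in range(0, len(revids), batch_size):
        chunk = revids[i:i + batch_size]
        r = session.get(api_url, params={
            'action': 'query',
            'prop': 'revisions',
            'rvprop': 'ids|timestamp|user|comment|content',
            'revids': '|'.join(str(rev) for rev in chunk),
            'format': 'json',
        })
        yield r.json()
```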
Otherwise we end up using Special:Export even though the export API
would work perfectly well with --xmlrevisions.
For some reason using the general requests session always got an empty
response from the Wikia API.
May also fix images on fandom.com:
https://github.com/WikiTeam/wikiteam/issues/330
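For reference, a hedged sketch of the workaround: give the Wikia/Fandom API its own requests session instead of the shared one (the query parameters here are just a connectivity check, not the actual export call):

```python
import requests

wikia_session = requests.Session()
r = wikia_session.get('https://tinyvillage.fandom.com/api.php',
                      params={'action': 'query', 'meta': 'siteinfo',
                              'format': 'json'})
print(r.status_code, len(r.text))   # a non-empty response is expected here
```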
Avoid UnboundLocalError: local variable 'xml' referenced before assignment
If the page exists, the API returns its XML export; otherwise it returns only
the header that we were looking for.
Fixes https://github.com/WikiTeam/wikiteam/issues/355
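A minimal sketch of the failure mode and the fix, with hypothetical names (the real code parses the API's export result):

```python
def export_page(api_result):
    xml = ''   # default, so a missing page no longer leaves 'xml' unset
    query = api_result.get('query', {})
    if 'export' in query:
        # The page exists: the API returned its XML export.
        xml = query['export']['*']
    return xml
```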
* Actually proceed to the next page when there is no continuation (see the sketch below).
* Provide the same output as with the usual per-page export.
Tested on a MediaWiki 1.16 wiki with success.
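A hedged sketch of the continuation handling, assuming the modern `continue` protocol (older wikis such as the 1.16 one tested use `query-continue` instead; the parameters are illustrative):

```python
import requests

def query_pages(api_url, params):
    # Keep following API continuation until the server stops sending it;
    # only then proceed to the next page/stage.
    session = requests.Session()
    params = dict(params, format='json')
    while True:
        data = session.get(api_url, params=params).json()
        yield data
        if 'continue' not in data:
            break
        params.update(data['continue'])
```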
Traceback (most recent call last):
  File "dumpgenerator.py", line 2362, in <module>
  File "dumpgenerator.py", line 2354, in main
    resumePreviousDump(config=config, other=other)
  File "dumpgenerator.py", line 1921, in createNewDump
    getPageTitles(config=config, session=other['session'])
  File "dumpgenerator.py", line 755, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session):
  File "dumpgenerator.py", line 861, in getXMLRevisions
    revids.append(str(revision['revid']))
IndexError: list index out of range