strip <sha1> tags returned under <page>

The Wikia API is exporting sha1 sums as part of the response for pages.
These are invalid against the XML export schema and cause dump parsing code
(e.g., MediaWiki-Utilities) to fail.  Also, sha1 applies to revisions, not
pages, so it's not entirely clear to me what these values refer to.
pull/216/head
Benjamin Mako Hill 9 years ago
parent 145b2eaaf4
commit eb8b44aef0

@@ -509,6 +509,12 @@ def getXMLPage(config={}, title='', verbose=True, session=None):
    xml = getXMLPageCore(params=params, config=config, session=session)
    if not xml:
        raise PageMissingError
    else:
        # strip these sha1s sums which keep showing up in the export and
        # which are invalid for the XML schema (they only apply to
        # revisions)
        xml = re.sub(r'\n\s*<sha1>\w+</sha1>\s*\n', r'\n', xml)
        xml = re.sub(r'\n\s*<sha1/>\s*\n', r'\n', xml)
        yield xml.split("</page>")[0]
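As a standalone illustration of what the two substitutions do, here is a minimal sketch run against a hypothetical fragment of a Wikia-style export (the sample XML and its values are invented for the example; the regular expressions are the ones from the diff):

```python
import re

# Hypothetical <page> fragment with a stray <sha1> element directly under
# <page>; in the MediaWiki export schema, <sha1> belongs under <revision>.
sample = (
    "<page>\n"
    "    <title>Example</title>\n"
    "    <sha1>phoiac9h4m842xq45sp7s6u21eteeq1</sha1>\n"
    "    <revision>\n"
    "      <text>Hello</text>\n"
    "    </revision>\n"
    "  </page>\n"
)

# Same substitutions as the commit: delete whole lines holding a filled
# or self-closing <sha1> element, collapsing each to a single newline.
cleaned = re.sub(r'\n\s*<sha1>\w+</sha1>\s*\n', r'\n', sample)
cleaned = re.sub(r'\n\s*<sha1/>\s*\n', r'\n', cleaned)

print(cleaned)
```

Note the patterns anchor on surrounding newlines, so only `<sha1>` elements sitting alone on a line are removed; a sha1 nested inside a `<revision>` on its own line would also match, which is acceptable here since the export is truncated at `</page>` anyway.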
