strip <sha1> tags returned under <page>

The Wikia API is exporting sha1 sums as part of the response for pages.
These are invalid against the XML export schema and cause dump parsing code
(e.g., MediaWiki-Utilities) to fail.  Also, sha1 applies to revisions, not
pages, so it's not entirely clear to me what these values refer to.
pull/216/head
Benjamin Mako Hill 9 years ago
parent 145b2eaaf4
commit eb8b44aef0

@@ -509,6 +509,12 @@ def getXMLPage(config={}, title='', verbose=True, session=None):
    xml = getXMLPageCore(params=params, config=config, session=session)
    if not xml:
        raise PageMissingError
    else:
        # strip these sha1s sums which keep showing up in the export and
        # which are invalid for the XML schema (they only apply to
        # revisions)
        xml = re.sub(r'\n\s*<sha1>\w+</sha1>\s*\n', r'\n', xml)
        xml = re.sub(r'\n\s*<sha1/>\s*\n', r'\n', xml)
        yield xml.split("</page>")[0]
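As a standalone illustration of what the two substitutions do, here is a minimal sketch run against a hypothetical fragment of a Wikia-style export (the sample XML and its values are invented for the example; the regular expressions are the ones from the diff):

```python
import re

# Hypothetical <page> fragment with a stray <sha1> element directly under
# <page>; in the MediaWiki export schema, <sha1> belongs under <revision>.
sample = (
    "<page>\n"
    "    <title>Example</title>\n"
    "    <sha1>phoiac9h4m842xq45sp7s6u21eteeq1</sha1>\n"
    "    <revision>\n"
    "      <text>Hello</text>\n"
    "    </revision>\n"
    "  </page>\n"
)

# Same substitutions as the commit: delete whole lines holding a filled
# or self-closing <sha1> element, collapsing each to a single newline.
cleaned = re.sub(r'\n\s*<sha1>\w+</sha1>\s*\n', r'\n', sample)
cleaned = re.sub(r'\n\s*<sha1/>\s*\n', r'\n', cleaned)

print(cleaned)
```

Note the patterns anchor on surrounding newlines, so only `<sha1>` elements sitting alone on a line are removed; a sha1 nested inside a `<revision>` on its own line would also match, which is acceptable here since the export is truncated at `</page>` anyway.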
