Commit Graph

1133 Commits (master)
 

Author SHA1 Message Date
Federico Leva 7289225d2c Directly catch exception for page missing in getXMLRevisions()
The caller cannot catch the PME exception because it doesn't know about
the title. Just log the error here.
4 years ago
Federico Leva aabf3ea037 uploader.py: switch to requests, BytesIO, rights API
* Now uploads the logo again, at least in standard or Wikia skin.
* Finds license information more often.
* Translates Wikia license URL.
* More specific error reporting.
4 years ago
Federico Leva e194077e52 uploader.py: Use requests GET, handle Wikia weird URLs
POST requests with urllib were getting empty responses from Wikia.
4 years ago
nemobis e136ee5536
Merge pull request #372 from nemobis/wikia
Avoid launcher.py 7z failures
4 years ago
Federico Leva 20fe64e2dd Delete temporary 7z file if compression failed, don't preserve it
Fixes https://github.com/WikiTeam/wikiteam/issues/366
4 years ago
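
A minimal sketch of the cleanup behaviour described in the entry above. It is not the code from launcher.py: the helper name `compress_or_cleanup` and the injected `run` callable (standing in for the real 7z invocation) are hypothetical.

```python
import os
import subprocess
import tempfile


def compress_or_cleanup(command, tmp_path, final_path, run=subprocess.call):
    """Run a compression command and keep its output only on success.

    `command` is assumed to write its archive to `tmp_path` (hypothetical
    convention). `run` must return the process exit code.
    """
    returncode = run(command)
    if returncode == 0:
        # Success: move the finished archive to its final name.
        os.replace(tmp_path, final_path)
        return True
    # Failure: delete the partial archive instead of preserving it.
    if os.path.exists(tmp_path):
        os.remove(tmp_path)
    return False
```

Deleting the partial file means a later run starts from a clean state and cannot mistake a truncated archive for a finished one.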
Federico Leva 8c6f05bb54 Consider status code before content in checkIndex() and checkalive.py
Fixes https://github.com/WikiTeam/wikiteam/issues/369
4 years ago
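
The ordering described in the entry above could look roughly like this. The function name `looks_alive` and the `marker` string are assumptions for illustration; the real checks in checkIndex() and checkalive.py are not reproduced here.

```python
def looks_alive(status_code, body, marker="MediaWiki"):
    """Decide whether a response looks like a working wiki page.

    The status code is consulted first, so an error page whose body happens
    to contain the marker is still rejected.
    """
    if status_code != 200:
        return False
    return marker in body
```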
nemobis 5bde9ba4fe
Merge pull request #371 from nemobis/wikia
Update list of Wikia wikis
4 years ago
Federico Leva 8fb2b44fdb Update list of Wikia wikis with today's list from the API 4 years ago
Federico Leva ed46725a89 Sort list of Wikia wikis again
No change in content.
4 years ago
nemobis add13e2a31
Merge pull request #368 from nemobis/xmlrevisions
Recover from more crashes: oversighted revs, resume API
4 years ago
Federico Leva 9ac1e6d0f1 Implement resume in --xmlrevisions (but not yet with list=allrevisions)
Tested with a partial dump over 100 MB:
https://tinyvillage.fandom.com/api.php
(grepped <title> to see the previously downloaded ones were kept and the
new ones continued from expected; did not validate a final XML).
4 years ago
Federico Leva a664b17a9c Handle deleted contributor name in --xmlrevisions
Avoids failure in https://deployment.wikimedia.beta.wmflabs.org/w/api.php
for revision https://deployment.wikimedia.beta.wmflabs.org/?oldid=2349 .
4 years ago
nemobis 912450b606
Merge pull request #367 from nemobis/xmlrevisions
Make --xmlrevisions work on some more wikis
4 years ago
Federico Leva b162e7b14f Reduce the API limit to 50 for arvlimit, gaplimit, ailimit
Avoids crashing on errors or warnings which some wikis return for bigger
requests, like https://www.openkm.com/wiki/api.php (MediaWiki 1.27.3).
4 years ago
Federico Leva d543f7d4dd Check the API URL against mwclient too, so it doesn't fail later
Change the protocol from HTTP to HTTPS if needed. Fixes:
http://nimiarkisto.fi/w/api.php
4 years ago
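
One way to sketch the scheme switch from the entry above, using only the standard library. The helper `pick_working_url` and the injected `probe` callable (which should return True when the URL answers as an API) are hypothetical; the actual commit performs its check through mwclient.

```python
from urllib.parse import urlsplit, urlunsplit


def pick_working_url(url, probe):
    """Return `url`, or its HTTPS variant, whichever `probe` accepts first.

    Returns None when neither variant works.
    """
    if probe(url):
        return url
    parts = urlsplit(url)
    if parts.scheme == "http":
        # Retry the same host and path over HTTPS.
        candidate = urlunsplit(parts._replace(scheme="https"))
        if probe(candidate):
            return candidate
    return None
```

Checking up front lets a dump fail early with a clear message rather than partway through.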
Federico Leva d1619392f4 Force the lxml factory to pass around unicode strings
Not necessarily the most compatible with downstream XML parsers, but at
least should ensure that we manage to write the XML file. The encoding
declared in the header is not necessarily the same we get from the API.

See also:
https://lxml.de/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings
https://lxml.de/3.7/parsing.html#serialising-to-unicode-strings

Fixes https://github.com/WikiTeam/wikiteam/issues/363
4 years ago
Federico Leva 6dc86d1964 Actually use the next batch from prop=revisions in MediaWiki 1.19 4 years ago
nemobis 21bc71a751
Merge pull request #365 from nemobis/xmlrevisions
Indent the number of revisions more, consistent with page title style
4 years ago
Federico Leva 2ba69b3810 Indent the number of revisions more, consistent with page title style 4 years ago
nemobis 577389e059
Merge pull request #364 from nemobis/xmlrevisions
Implement continuation for --xmlrevisions with prop=revisions in MW 1.19
4 years ago
Federico Leva 8fef62d46e Implement continuation for --xmlrevisions with prop=revisions in MW 1.19 4 years ago
nemobis 84444bee36
Merge pull request #360 from nemobis/xmlrevisions
Wikia API fixes
4 years ago
Federico Leva 8b58599645 Merge branch 'xmlrevisions' of github.com:nemobis/wikiteam into xmlrevisions 4 years ago
Federico Leva 17283113dd Wikia: make getXMLHeader() check more lenient
Otherwise we end up using Special:Export even though the export API
would work perfectly well with --xmlrevisions.

For some reason using the general requests session always got an empty
response from the Wikia API.

May also fix images on fandom.com:
https://github.com/WikiTeam/wikiteam/issues/330
4 years ago
Federico Leva 2c21eadf7c Wikia: make getXMLHeader() check more lenient
Otherwise we end up using Special:Export even though the export API
would work perfectly well with --xmlrevisions.

May also fix images on fandom.com:
https://github.com/WikiTeam/wikiteam/issues/330
4 years ago
Federico Leva 131e19979c Use mwclient generator for allpages
Tested with MediaWiki 1.31 and 1.19.
4 years ago
nemobis 3f39a97acc
Merge pull request #359 from nemobis/xmlrevisions
Switch the --xmlrevisions option to mwclient and related changes
4 years ago
Federico Leva faf0e31b4e Don't set apfrom in initial allpages request, use suggested continuation
Helps with recent MediaWiki versions like 1.31 where variants of "!" can
give a bad title error and the continuation wants apcontinue anyway.
4 years ago
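
The continuation approach described in the entry above can be sketched as follows. This is an illustration, not the dumpgenerator.py code: `iter_allpages` and the injected `api_get` callable (which takes a dict of query parameters and returns the decoded JSON response) are hypothetical names. The first request sends no `apfrom`; every later request reuses whatever continuation parameters the wiki suggested, in either the current `continue` format or the older `query-continue` format.

```python
def iter_allpages(api_get, limit=50):
    """Yield page titles from list=allpages, following the wiki's continuation."""
    params = {
        "action": "query",
        "list": "allpages",
        "aplimit": limit,
        "format": "json",
    }
    while True:
        # Pass a copy so the caller never sees later mutations.
        data = api_get(dict(params))
        for page in data.get("query", {}).get("allpages", []):
            yield page["title"]
        if "continue" in data:
            # Current format: merge the suggested parameters as they are.
            params.update(data["continue"])
        elif "query-continue" in data:
            # Older format: parameters are nested under the module name.
            params.update(data["query-continue"]["allpages"])
        else:
            break
```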
Federico Leva 49017e3f20 Catch HTTP Error 405 and switch from POST to GET for API requests
Seen on http://wiki.ainigma.eu/index.php?title=Hlavn%C3%AD_strana:
HTTPError: HTTP Error 405: Method Not Allowed
4 years ago
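
A rough sketch of the fallback described in the entry above. `request_with_fallback` and the injected `send(method, url, params)` callable are hypothetical; the sketch assumes `send` raises `urllib.error.HTTPError` on HTTP errors, as the Python 3 standard library does.

```python
import urllib.error


def request_with_fallback(send, url, params):
    """Try POST first; on HTTP 405 (Method Not Allowed) retry with GET."""
    try:
        return send("POST", url, params)
    except urllib.error.HTTPError as err:
        if err.code != 405:
            # Any other HTTP error is not something a method switch can fix.
            raise
        return send("GET", url, params)
```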
Federico Leva 8b5378f991 Fix query prop=revisions continuation in MediaWiki 1.22
This wiki has the old query-continue format but it's not exposed here.
4 years ago
Federico Leva 92da7388b0 Avoid asking allpages API if API not available
So that it doesn't have to iterate among non-existing titles.

Fixes https://github.com/WikiTeam/wikiteam/issues/348
4 years ago
Federico Leva 1645c1d832 More robust XML header fetch for getXMLHeader()
Avoid UnboundLocalError: local variable 'xml' referenced before assignment

If the page exists, its XML export is returned by the API; otherwise only
the header that we were looking for.

Fixes https://github.com/WikiTeam/wikiteam/issues/355
4 years ago
Federico Leva 0b37b39923 Define xml header as empty first so that it can fail gracefully
Fixes https://github.com/WikiTeam/wikiteam/issues/355
4 years ago
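
The pattern named in the entry above, binding the variable before the code that may fail, might be sketched like this. `get_xml_header` and the injected `fetch_export` callable are hypothetical, and splitting on `<page>` is an assumption about where the header ends, not a copy of getXMLHeader().

```python
def get_xml_header(fetch_export):
    """Return the part of an XML export that precedes the first <page>.

    `xml` is bound to an empty string up front, so a failed fetch yields an
    empty header instead of an UnboundLocalError further down.
    """
    xml = ""
    try:
        xml = fetch_export()
    except ConnectionError as err:
        print("Could not fetch the XML header: %s" % err)
    # Works whether or not the export contains a page: without one, the
    # whole response is returned unchanged.
    return xml.split("<page>")[0]
```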
Federico Leva becd01b271 Use defined requests.exceptions.ConnectionError
Fixes https://github.com/WikiTeam/wikiteam/issues/356
4 years ago
Federico Leva f0436ee57c Make mwclient respect the provided HTTP/HTTPS scheme
Fixes https://github.com/WikiTeam/wikiteam/issues/358
4 years ago
Federico Leva 9ec6ce42d3 Finish xmlrevisions option for older wikis
* Actually proceed to the next page when no continuation.
* Provide the same output as with the usual per-page export.

Tested on a MediaWiki 1.16 wiki with success.
4 years ago
Federico Leva 0f35d03929 Remove rvlimit=max, fails in MediaWiki 1.16
For instance:
"Exception Caught: Internal error in ApiResult::setElement: Attempting to add element revisions=50, existing value is 500"
https://wiki.rabenthal.net/api.php?action=query&prop=revisions&titles=Hauptseite&rvprop=ids&rvlimit=max
4 years ago
Federico Leva 6b12e20a9d Actually convert the titles query method to mwclient too 4 years ago
Federico Leva f10adb71af Don't try to add revisions if the namespace has none
Traceback (most recent call last):
  File "dumpgenerator.py", line 2362, in <module>

  File "dumpgenerator.py", line 2354, in main
    resumePreviousDump(config=config, other=other)
  File "dumpgenerator.py", line 1921, in createNewDump
    getPageTitles(config=config, session=other['session'])
  File "dumpgenerator.py", line 755, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session):
  File "dumpgenerator.py", line 861, in getXMLRevisions
    revids.append(str(revision['revid']))
IndexError: list index out of range
4 years ago
Federico Leva 3760501f74 Add a couple comments 4 years ago
Federico Leva 11507e931e Initial switch to mwclient for the xmlrevisions option
* Still maintained and available for python 3 as well.
* Allows raw API requests as we need.
* Does not provide handy generators, we need to do continuation.
* Decides on its own which protocol and exact path to use, fails at it.
* Appears to use POST by default unless asked otherwise, what to do?
4 years ago
nemobis 353f4d90a6 Merge pull request #349 from nemobis/xmlrevisions
Use GET rather than POST for API requests
4 years ago
Federico Leva 3d04dcbf5c Use GET rather than POST for API requests
* It was just an old trick to get past some barriers which were waived with GET.
* It's not conformant and doesn't play well with some redirects.
* Some recent wikis seem to not like it at all, see also issue #311.
4 years ago
nemobis 128e23c3a4
Merge pull request #346 from nemobis/bug/334
Use GET rather than POST for allpages API query
4 years ago
Federico Leva 4cdc5a7784 Use GET rather than POST for allpages API query
POST does not follow the redirect from HTTP to HTTPS, which makes the
request (and the entire dump) fail if an API URL is passed like
http://7daystodie-de.gamepedia.com/api.php

Fixes https://github.com/WikiTeam/wikiteam/issues/334
4 years ago
nemobis 210158473e
Merge pull request #345 from nemobis/2020list
Update MediaWiki and Wikia lists
4 years ago
Federico Leva 7dad9a44cd Give up on Wikia-made dumps
There are fewer than 500 available right now, out of 400k active wikis.
4 years ago
Federico Leva accc7db019 Update list of MediaWikis
* Run checkalive.py on the "originalurl" URLs from existing items in the
  WikiTeam collection on the Internet Archive, minus dead wiki farms.
* Downloaded the list of unarchived wikis from WikiApiary.
4 years ago
Federico Leva aa0b133c1d Minimal update to list of Wikia wikis
* Change API URL to HTTPS and fandom.com.
* New output of the script (403k wikis), changed to wikia.com for diff purposes.
4 years ago
nemobis 0eeb6bfcb0
Upload all relevant wikidump.7z and history.xml.7z
Don't stop at the first 7z file found in the directory listing.
Should be fast enough for most users.

Fixes #326
4 years ago
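
The change in the entry above, collecting every matching archive rather than stopping at the first, can be illustrated with a short sketch. `find_dump_archives` is a hypothetical helper and the suffix matching is an assumption based on the file names in the commit title.

```python
def find_dump_archives(filenames):
    """Return every file name that looks like a dump archive, in input order."""
    suffixes = ("wikidump.7z", "history.xml.7z")
    # str.endswith accepts a tuple, so one pass covers both archive kinds.
    return [name for name in filenames if name.endswith(suffixes)]
```

Returning the whole list leaves the decision about what to upload to the caller.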