Commit Graph

1056 Commits (5986467b12a9eef5731142fadd98ac6ea67b3b85)
 

Author SHA1 Message Date
Nicolas SAPA 5986467b12 Cleanup of link rot
Many of the wikis in test_dumpgenerator.py no longer exist.
Remove them from the CI.
4 years ago
Nicolas SAPA b289f86243 Fix getPageTitlesScraper
Using the API and the Special:Allpages scraper should result in the same number of titles.
Fix the detection of the next subpages on Special:Allpages.
Change the max depth to 100 and implement loop detection (could fail on non-Western wikis).
4 years ago
Nicolas SAPA 1048bc3275 skilledtests.com doesn't host a MediaWiki anymore
http://skilledtests.com/wiki/ redirects to https://simcast.com,
something 'Powered by Microsoft News'
4 years ago
Nicolas SAPA 320115fe5a Try to fix CI by using current URL for archiveteam.org
In commit 966df37c54, emijrp changed http://archiveteam.org/ to https://www.archiveteam.org/
Today, https://archiveteam.org/index.php?title=Special:Version shows a canonical URL of https://archiveteam.org/
So try to fix the CI by doing a s/www.archiveteam.org/archiveteam.org/g
4 years ago
Nicolas SAPA e4b43927b9 Fixup description grab in generateImageDump
getXMLPage() yields on "</page>", so xmlfiledesc cannot contain "</mediawiki>".
Change the search to "</page>" and inject "</mediawiki>" if it is missing to fix up the XML
4 years ago
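A minimal sketch of the fix-up described in e4b43927b9, assuming the description XML arrives as a plain string. The helper name ensure_mediawiki_closed is hypothetical and is not taken from dumpgenerator.py; only the "</page>" and "</mediawiki>" markers come from the commit message.

```python
def ensure_mediawiki_closed(xmlfiledesc):
    """Return the XML text with a closing </mediawiki> tag if one is missing.

    Hypothetical helper: it only illustrates the idea in the commit message.
    """
    if "</page>" in xmlfiledesc and "</mediawiki>" not in xmlfiledesc:
        # Keep everything up to the last complete page, then close the root.
        end = xmlfiledesc.rfind("</page>") + len("</page>")
        return xmlfiledesc[:end] + "\n</mediawiki>\n"
    # Already closed, or no complete page to anchor on: leave it untouched.
    return xmlfiledesc
```

Text that already ends with "</mediawiki>" is returned unchanged, so the helper can be applied unconditionally.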
Nicolas SAPA eacaf08b2f Try to fix a broken HTTP to HTTPS redirect in generateImageDump()
Some wikis fail to perform the HTTP-to-HTTPS redirect correctly, so try it ourselves.
4 years ago
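One way to retry a request over HTTPS, as eacaf08b2f describes, is to rewrite only the URL scheme. This is a sketch under assumptions: to_https is a hypothetical name, and the code uses Python 3's urllib.parse, whereas the script targeted Python 2.7 at the time.

```python
from urllib.parse import urlsplit, urlunsplit


def to_https(url):
    """Return the same URL with an http scheme upgraded to https."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        # Only the scheme changes; host, path and query are preserved.
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)
```

URLs that are already HTTPS pass through unchanged.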
Nicolas SAPA 7675b0d17c Add exception handler for requests.exceptions.ReadTimeout in getXMLPageCore()
Treat a ReadTimeout the same as a ConnectionError (log the error & retry)
4 years ago
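The "log the error and retry" behaviour in 7675b0d17c can be sketched with a generic helper. The name fetch_with_retry and its parameters are hypothetical, not the code in getXMLPageCore(). With requests, the retriable tuple would presumably be (requests.exceptions.ConnectionError, requests.exceptions.ReadTimeout).

```python
import time


def fetch_with_retry(fetch, retriable, max_retries=5, delay=0.0):
    """Call fetch() until it succeeds or max_retries attempts have failed.

    retriable is a tuple of exception classes that are all treated the
    same way: log the error, wait, and try again. max_retries must be >= 1.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except retriable as error:
            last_error = error
            print("Attempt %d failed: %r" % (attempt, error))
            time.sleep(delay)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

Passing the exception classes in as a tuple keeps timeouts and connection errors on the same code path, which is what the commit message asks for.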
Nicolas SAPA 4a5eef97da Update the default user-agent
A ModSecurity rule blocks the old UA, so switch to the current Firefox 78 UA.
4 years ago
nemobis 9b1996d436
Merge pull request #387 from robkam/patch-1
fix typo
4 years ago
Rob Kam e6f4674b42
fix typo 4 years ago
nemobis ee39e8f85b
Merge pull request #386 from RhinosF1/patch-1
Update miraheze.org list
4 years ago
RhinosF1 3b28efab80
Update miraheze.org list
Using https://gist.github.com/RhinosF1/18c83dfbfadb84e28ee083628c029b41
4 years ago
nemobis 85ae14419f
Merge pull request #381 from robkam/patch-1
Add that the script requires Python 2.7
4 years ago
Rob Kam c563012c1c
Add that the script requires Python 2.7 4 years ago
nemobis 6e85afca82
Merge pull request #378 from nemobis/wikia
More efficient Wikia download and launcher.py
4 years ago
nemobis 4eae50b2fb
Merge pull request #377 from nemobis/uploaderurl
uploader.py: Handle protocol-relative base URL
4 years ago
Federico Leva 3ddfa85391 uploader.py: Handle protocol-relative base URL
Fixes https://github.com/WikiTeam/wikiteam/issues/376
4 years ago
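A protocol-relative base URL (one starting with "//") can be made absolute by prepending a scheme. The following is a sketch only; absolute_base_url and the default of "https" are assumptions, not the code in uploader.py.

```python
def absolute_base_url(base_url, default_scheme="https"):
    """Prepend a scheme to a protocol-relative URL such as //example.org/w."""
    if base_url.startswith("//"):
        return default_scheme + ":" + base_url
    # Already absolute (or not protocol-relative): return as-is.
    return base_url
```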
Federico Leva abd908914f Adapt to some more Wikia wikis edge cases
* Make it easy to batch requests for some wikis where millions of titles
  are really just one-revision thread items and need to be gone through
  as fast as possible.
* Status code error message.
4 years ago
Federico Leva e4524b8aec launcher.py: Avoid shell=True to consume half as many processes
No idea if "python2" will be converted to anything meaningful on Windows,
but then you're not really supposed to use the shell either in that dungeon.
https://docs.python.org/2.7/library/subprocess.html#subprocess.Popen
4 years ago
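The subprocess point in e4524b8aec can be illustrated as follows. When the command is passed as an argument list and shell is left at its default of False, Popen starts the program directly instead of starting a shell that then starts the program. The helper name run_without_shell is hypothetical.

```python
import subprocess
import sys


def run_without_shell(args):
    """Run a command given as an argument list and return its stdout as text."""
    # shell defaults to False, so no intermediate shell process is spawned.
    process = subprocess.Popen(args, stdout=subprocess.PIPE)
    output, _ = process.communicate()
    return output.decode("utf-8")


if __name__ == "__main__":
    # Example: ask the current interpreter for its version.
    print(run_without_shell([sys.executable, "--version"]))
```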
Federico Leva 0f5664028f Stricter prefix matching in launcher.py
For instance, do not skip gleefandomcom if gleefandomcom_ru is found.
4 years ago
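A sketch of the stricter matching in 0f5664028f, assuming dump files are named "<prefix>-<date>-wikidump.7z". The naming scheme and the helper dump_exists are assumptions made for illustration, not the code in launcher.py.

```python
def dump_exists(prefix, filenames):
    """Return True if some file belongs to exactly this wiki prefix.

    Matching on prefix + "-" avoids treating gleefandomcom_ru as a hit
    for gleefandomcom, which a bare startswith(prefix) would do.
    """
    return any(name.startswith(prefix + "-") for name in filenames)
```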
nemobis 573623ed16
Merge pull request #373 from nemobis/wikia
uploader.py logo and metadata improvements
4 years ago
Federico Leva 7de75012d1 Fix merge of the getXMLRevisions() loop 4 years ago
nemobis 8a2116699e
Merge branch 'master' into wikia 4 years ago
Federico Leva 7289225d2c Directly catch exception for page missing in getXMLRevisions()
The caller cannot catch the PME exception because it doesn't know about
the title. Just log the error here.
4 years ago
Federico Leva aabf3ea037 uploader.py: switch to requests, BytesIO, rights API
* Now uploads the logo again, at least in standard or Wikia skin.
* Finds license information more often.
* Translates Wikia license URL.
* More specific error reporting.
4 years ago
Federico Leva e194077e52 uploader.py: Use requests GET, handle Wikia weird URLs
POST requests with urllib were getting empty responses from Wikia.
4 years ago
nemobis e136ee5536
Merge pull request #372 from nemobis/wikia
Avoid launcher.py 7z failures
4 years ago
Federico Leva 20fe64e2dd Delete temporary 7z file if compression failed, don't preserve it
Fixes https://github.com/WikiTeam/wikiteam/issues/366
4 years ago
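The clean-up in 20fe64e2dd can be sketched as below. The sketch assumes a failed compression raises an exception and that the archive is first written to a temporary path; launcher.py may detect the failure differently, and compress_safely is a hypothetical name.

```python
import os


def compress_safely(compress, final_path):
    """Write an archive to a temporary path and keep it only on success.

    compress is any callable that takes the output path and raises on failure.
    """
    tmp_path = final_path + ".tmp"
    try:
        compress(tmp_path)
    except Exception:
        # Do not leave a partial archive behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    # Success: move the finished archive into place.
    os.rename(tmp_path, final_path)
```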
Federico Leva 8c6f05bb54 Consider status code before content in checkIndex() and checkalive.py
Fixes https://github.com/WikiTeam/wikiteam/issues/369
4 years ago
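Checking the status code before the content, as in 8c6f05bb54, could look like the following sketch. The function name index_looks_alive and the "MediaWiki" marker are assumptions; the real checkIndex() may inspect the page differently.

```python
def index_looks_alive(status_code, body):
    """Decide whether an index.php response looks like a live wiki."""
    # An error status is decisive, whatever the body happens to contain.
    if status_code >= 400:
        return False
    # Only then inspect the content (hypothetical marker string).
    return "MediaWiki" in body
```

With this ordering, an error page that happens to mention MediaWiki is no longer mistaken for a working wiki.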
nemobis 5bde9ba4fe
Merge pull request #371 from nemobis/wikia
Update list of Wikia wikis
4 years ago
Federico Leva 8fb2b44fdb Update list of Wikia wikis with today's list from the API 4 years ago
Federico Leva ed46725a89 Sort list of Wikia wikis again
No change in content.
4 years ago
nemobis add13e2a31
Merge pull request #368 from nemobis/xmlrevisions
Recover from more crashes: oversighted revs, resume API
4 years ago
Federico Leva 9ac1e6d0f1 Implement resume in --xmlrevisions (but not yet with list=allrevisions)
Tested with a partial dump over 100 MB:
https://tinyvillage.fandom.com/api.php
(grepped <title> to see the previously downloaded ones were kept and the
new ones continued from expected; did not validate a final XML).
4 years ago
Federico Leva a664b17a9c Handle deleted contributor name in --xmlrevisions
Avoids failure in https://deployment.wikimedia.beta.wmflabs.org/w/api.php
for revision https://deployment.wikimedia.beta.wmflabs.org/?oldid=2349 .
4 years ago
nemobis 912450b606
Merge pull request #367 from nemobis/xmlrevisions
Make --xmlrevisions work on some more wikis
4 years ago
Federico Leva b162e7b14f Reduce the API limit to 50 for arvlimit, gaplimit, ailimit
Avoids crashing on errors or warnings which some wikis return for bigger
requests, like https://www.openkm.com/wiki/api.php (MediaWiki 1.27.3).
4 years ago
Federico Leva d543f7d4dd Check the API URL against mwclient too, so it doesn't fail later
Change the protocol from HTTP to HTTPS if needed. Fixes:
http://nimiarkisto.fi/w/api.php
4 years ago
Federico Leva d1619392f4 Force the lxml factory to pass around unicode strings
Not necessarily the most compatible with downstream XML parsers, but at
least should ensure that we manage to write the XML file. The encoding
declared in the header is not necessarily the same we get from the API.

See also:
https://lxml.de/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings
https://lxml.de/3.7/parsing.html#serialising-to-unicode-strings

Fixes https://github.com/WikiTeam/wikiteam/issues/363
4 years ago
Federico Leva 6dc86d1964 Actually use the next batch from prop=revisions in MediaWiki 1.19 4 years ago
nemobis 21bc71a751
Merge pull request #365 from nemobis/xmlrevisions
Indent the number of revisions more, consistent with page title style
4 years ago
Federico Leva 2ba69b3810 Indent the number of revisions more, consistent with page title style 4 years ago
nemobis 577389e059
Merge pull request #364 from nemobis/xmlrevisions
Implement continuation for --xmlrevisions with prop=revisions in MW 1.19
4 years ago
Federico Leva 8fef62d46e Implement continuation for --xmlrevisions with prop=revisions in MW 1.19 4 years ago
nemobis 84444bee36
Merge pull request #360 from nemobis/xmlrevisions
Wikia API fixes
4 years ago
Federico Leva 8b58599645 Merge branch 'xmlrevisions' of github.com:nemobis/wikiteam into xmlrevisions 4 years ago
Federico Leva 17283113dd Wikia: make getXMLHeader() check more lenient
Otherwise we end up using Special:Export even though the export API
would work perfectly well with --xmlrevisions.

For some reason using the general requests session always got an empty
response from the Wikia API.

May also fix images on fandom.com:
https://github.com/WikiTeam/wikiteam/issues/330
4 years ago
Federico Leva 2c21eadf7c Wikia: make getXMLHeader() check more lenient,
Otherwise we end up using Special:Export even though the export API
would work perfectly well with --xmlrevisions.

May also fix images on fandom.com:
https://github.com/WikiTeam/wikiteam/issues/330
4 years ago
Federico Leva 131e19979c Use mwclient generator for allpages
Tested with MediaWiki 1.31 and 1.19.
4 years ago
nemobis 3f39a97acc
Merge pull request #359 from nemobis/xmlrevisions
Switch the --xmlrevisions option to mwclient and related changes
4 years ago