mirror of https://github.com/WikiTeam/wikiteam synced 2024-11-10 13:10:27 +00:00
Commit Graph

374 Commits

Author SHA1 Message Date
Pokechu22
df230a96c9 Fix exporting via prop=revisions 2023-05-07 08:39:44 -07:00
yzqzss
90a64c6a22
make requests.session use the --retries value
(default=5)
2023-01-13 23:36:32 +08:00
Pokechu22
97146c6f01 Use the same requests session for getting the wiki engine and checking API/index 2022-11-26 21:32:08 -08:00
Pokechu22
6668999658 Update User-Agent to latest Firefox 2022-11-26 21:32:08 -08:00
Pokechu22
cad7260d7c Fix crash when the image description is missing for an image containing non-ascii characters
title is already unicode, so we shouldn't need to decode it (and don't in generateXMLDump).
2022-10-23 16:29:30 -07:00
Pokechu22
5b3fc4ac7b Pass requests session to mwclient
This means it uses our configured user-agent, as well as any cookies.
2022-10-22 21:06:10 -07:00
Pokechu22
1af69ca147 Skip empty revisions when using --xmlrevisions
Before, the download would die and need to be resumed from the start.
2022-10-21 19:57:31 -07:00
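The skip-empty-revisions fix above can be sketched as a simple filter. This is a minimal illustration, not the actual dumpgenerator code; the dict shape (a `'*'` key holding the revision text, as the MediaWiki API returns for `prop=revisions`) and the function name are assumptions.

```python
def iter_nonempty_revisions(revisions):
    """Yield only revisions that actually carry text, so one empty
    revision does not abort the whole --xmlrevisions download."""
    for rev in revisions:
        # Hypothetical shape: dicts with a '*' key holding the wikitext.
        if rev.get('*'):
            yield rev

revs = [{'revid': 1, '*': 'text'}, {'revid': 2}, {'revid': 3, '*': ''}]
print([r['revid'] for r in iter_nonempty_revisions(revs)])  # → [1]
```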
Pokechu22
4a2cbd4843 Use session.get instead of requests.get in getXMLHeader
`session.get` uses our configured User-Agent, while `requests.get` uses the default one.
2022-09-17 14:30:51 -07:00
Pokechu22
9b2c6e40ae Fix truncation when resuming
There was already code that looked like it was supposed to truncate files, but it calculated the index wrong and didn't properly check all lines. It worked out, though, because it never actually called the truncate function.

Now, truncation occurs to the last `</page>` tag. If the XML file ends with a `</page>` tag, then nothing gets truncated. The page is added after that; if nothing was truncated, this will result in the same page being listed twice (which already happened with the missing truncation), but if truncation did happen then the file should no longer be invalid.
2022-09-16 22:20:27 -07:00
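The truncation described above can be sketched like this. It is a minimal sketch of the idea (cut the file back to the last complete `</page>` tag), not the actual dumpgenerator implementation, which works line by line; the function name is hypothetical.

```python
import tempfile, os

def truncate_to_last_page_tag(path):
    """Truncate an XML dump so it ends at the last '</page>' tag,
    dropping any partially written page after it. If the file already
    ends with '</page>', nothing is removed."""
    with open(path, 'rb+') as f:
        data = f.read()
        cut = data.rfind(b'</page>')
        if cut == -1:
            return  # no complete page yet; leave the file alone
        f.truncate(cut + len(b'</page>'))

# Demo: a dump interrupted mid-page gets cut back to the last full page.
with tempfile.NamedTemporaryFile(delete=False, suffix='.xml') as tmp:
    tmp.write(b'<page>a</page><page>b</page><page>half')
truncate_to_last_page_tag(tmp.name)
print(open(tmp.name, 'rb').read())  # → b'<page>a</page><page>b</page>'
os.remove(tmp.name)
```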
Pokechu22
43945c467f Work around unicode titles not working with resuming
Before, you would get UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal. The %s versus {} change was needed because otherwise you would get UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128). There is probably a better way of solving that, but this one does work.
2022-09-16 22:15:55 -07:00
nemobis
d7b6924845
Merge pull request #408 from shreyasminocha/fix-resume-images
Fix image resuming
2022-02-05 10:23:16 +02:00
Tim Gates
ecbcc6118e
docs: Fix a few typos
There are small typos in:
- dumpgenerator.py
- wikiteam/mediawiki.py

Fixes:
- Should read `inconsistencies` rather than `inconsistences`.
- Should read `partially` rather than `partialy`.
2021-12-25 16:14:05 +11:00
Shreyas Minocha
e55de36cb7
Fix image resuming 2021-06-07 02:40:08 +05:30
Nicolas SAPA
b289f86243 Fix getPageTitlesScraper
Using the API and the Special:Allpages scraper should result in the same number of titles.
Fix the detection of the next subpages on Special:Allpages.
Change the max depth to 100 and implement an anti-loop check (could fail on non-Western wikis).
2020-08-28 08:25:33 +02:00
Nicolas SAPA
e4b43927b9 Fixup description grab in generateImageDump
getXMLPage() yields on "</page>", so xmlfiledesc cannot contain "</mediawiki>".
Change the search to "</page>" and inject "</mediawiki>" if it is missing to fix up the XML.
2020-08-28 06:42:22 +02:00
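The fixup above amounts to appending the missing root close tag so the fragment parses as a complete document. A minimal sketch (the function name is hypothetical; the real generateImageDump works on the strings it gets back from getXMLPage()):

```python
def ensure_mediawiki_close(xml):
    """If the XML fragment stopped at '</page>', append the
    '</mediawiki>' root close tag so the result is well-formed."""
    if '</mediawiki>' not in xml:
        xml = xml.rstrip() + '\n</mediawiki>\n'
    return xml

print(ensure_mediawiki_close('<mediawiki><page>x</page>'))
```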
Nicolas SAPA
eacaf08b2f Try to fix a broken HTTP to HTTPS redirect in generateImageDump()
Some wikis fail to do the HTTP to HTTPS redirect correctly, so try it ourselves.
2020-08-28 06:38:01 +02:00
Nicolas SAPA
7675b0d17c Add exception handler for requests.exceptions.ReadTimeout in getXMLPageCore()
Treat a ReadTimeout the same as a ConnectionError (log the error & retry)
2020-08-28 06:12:56 +02:00
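Treating a read timeout the same as a connection error comes down to catching both exception types in the same retry loop. A sketch under stand-in names: the real code catches requests.exceptions.ConnectionError and ReadTimeout inside getXMLPageCore(); here generic OSError/TimeoutError and the function names stand in for them.

```python
import time

def fetch_with_retries(fetch, retries=5, delay=0,
                       retry_on=(OSError, TimeoutError)):
    """Call `fetch` until it succeeds, logging and retrying on any of
    the transient error types in `retry_on` (timeouts and connection
    errors are handled identically)."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except retry_on as exc:
            print('attempt %d failed: %r' % (attempt, exc))
            time.sleep(delay)
    raise RuntimeError('giving up after %d attempts' % retries)

# Demo: fail with a timeout, then a connection error, then succeed.
calls = {'n': 0}
def flaky():
    calls['n'] += 1
    if calls['n'] == 1:
        raise TimeoutError('read timed out')
    if calls['n'] == 2:
        raise OSError('connection reset')
    return 'page xml'

print(fetch_with_retries(flaky))  # → page xml
```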
Nicolas SAPA
4a5eef97da Update the default user-agent
A ModSecurity rule blocks the old UA, so switch to the current Firefox 78 UA.
2020-08-28 06:09:20 +02:00
Rob Kam
e6f4674b42
fix typo 2020-06-11 10:28:18 +01:00
Federico Leva
abd908914f Adapt to some more Wikia wikis edge cases
* Make it easy to batch requests for some wikis where millions of titles
  are really just one-revision thread items and need to be gone through
  as fast as possible.
* Status code error message.
2020-03-05 12:51:32 +02:00
Federico Leva
7de75012d1 Fix merge of the getXMLRevisions() loop 2020-02-22 14:50:04 +02:00
nemobis
8a2116699e
Merge branch 'master' into wikia 2020-02-22 14:25:37 +02:00
Federico Leva
7289225d2c Directly catch exception for page missing in getXMLRevisions()
The caller cannot catch the PME exception because it doesn't know about
the title. Just log the error here.
2020-02-22 14:15:39 +02:00
nemobis
e136ee5536
Merge pull request #372 from nemobis/wikia
Avoid launcher.py 7z failures
2020-02-17 13:14:33 +02:00
Federico Leva
8c6f05bb54 Consider status code before content in checkIndex() and checkalive.py
Fixes https://github.com/WikiTeam/wikiteam/issues/369
2020-02-16 16:46:19 +02:00
Federico Leva
9ac1e6d0f1 Implement resume in --xmlrevisions (but not yet with list=allrevisions)
Tested with a partial dump over 100 MB:
https://tinyvillage.fandom.com/api.php
(grepped <title> to see that the previously downloaded ones were kept and the
new ones continued from the expected point; did not validate the final XML).
2020-02-14 13:16:33 +02:00
Federico Leva
a664b17a9c Handle deleted contributor name in --xmlrevisions
Avoids failure in https://deployment.wikimedia.beta.wmflabs.org/w/api.php
for revision https://deployment.wikimedia.beta.wmflabs.org/?oldid=2349 .
2020-02-13 17:13:16 +02:00
Federico Leva
b162e7b14f Reduce the API limit to 50 for arvlimit, gaplimit, ailimit
Avoids crashing on errors or warnings that some wikis return for bigger
requests, like https://www.openkm.com/wiki/api.php (MediaWiki 1.27.3).
2020-02-13 15:58:39 +02:00
Federico Leva
d543f7d4dd Check the API URL against mwclient too, so it doesn't fail later
Change the protocol from HTTP to HTTPS if needed. Fixes:
http://nimiarkisto.fi/w/api.php
2020-02-13 15:45:17 +02:00
Federico Leva
d1619392f4 Force the lxml factory to pass around unicode strings
Not necessarily the most compatible with downstream XML parsers, but at
least should ensure that we manage to write the XML file. The encoding
declared in the header is not necessarily the same we get from the API.

See also:
https://lxml.de/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings
https://lxml.de/3.7/parsing.html#serialising-to-unicode-strings

Fixes https://github.com/WikiTeam/wikiteam/issues/363
2020-02-13 14:32:58 +02:00
Federico Leva
6dc86d1964 Actually use the next batch from prop=revisions in MediaWiki 1.19 2020-02-13 10:49:29 +02:00
Federico Leva
2ba69b3810 Indent the number of revisions more, consistent with page title style 2020-02-11 22:44:17 +02:00
Federico Leva
8fef62d46e Implement continuation for --xmlrevisions with prop=revisions in MW 1.19 2020-02-11 17:22:36 +02:00
Federico Leva
8b58599645 Merge branch 'xmlrevisions' of github.com:nemobis/wikiteam into xmlrevisions 2020-02-10 23:07:20 +02:00
Federico Leva
17283113dd Wikia: make getXMLHeader() check more lenient
Otherwise we end up using Special:Export even though the export API
would work perfectly well with --xmlrevisions.

For some reason using the general requests session always got an empty
response from the Wikia API.

May also fix images on fandom.com:
https://github.com/WikiTeam/wikiteam/issues/330
2020-02-10 23:05:16 +02:00
Federico Leva
2c21eadf7c Wikia: make getXMLHeader() check more lenient
Otherwise we end up using Special:Export even though the export API
would work perfectly well with --xmlrevisions.

May also fix images on fandom.com:
https://github.com/WikiTeam/wikiteam/issues/330
2020-02-10 22:32:01 +02:00
Federico Leva
131e19979c Use mwclient generator for allpages
Tested with MediaWiki 1.31 and 1.19.
2020-02-10 22:13:21 +02:00
Federico Leva
faf0e31b4e Don't set apfrom in initial allpages request, use suggested continuation
Helps with recent MediaWiki versions like 1.31 where variants of "!" can
give a bad title error and the continuation wants apcontinue anyway.
2020-02-10 21:19:01 +02:00
Federico Leva
49017e3f20 Catch HTTP Error 405 and switch from POST to GET for API requests
Seen on http://wiki.ainigma.eu/index.php?title=Hlavn%C3%AD_strana:
HTTPError: HTTP Error 405: Method Not Allowed
2020-02-10 20:52:13 +02:00
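The POST-to-GET fallback above can be sketched as follows. This is a minimal illustration with hypothetical names: `send(method, url, params)` stands in for the real HTTP transport (the actual code uses urllib2/requests and inspects the HTTP status of the error).

```python
class HTTPError(Exception):
    """Stand-in for an HTTP error carrying a status code."""
    def __init__(self, code):
        super().__init__('HTTP Error %d' % code)
        self.code = code

def api_request(send, url, params):
    """Try the API request with POST; if the server answers
    405 Method Not Allowed, retry the same request as GET."""
    try:
        return send('POST', url, params)
    except HTTPError as err:
        if err.code != 405:
            raise
        return send('GET', url, params)

# Demo: a server that rejects POST, as seen on wiki.ainigma.eu.
def picky_server(method, url, params):
    if method == 'POST':
        raise HTTPError(405)
    return '<api result for %s>' % url

print(api_request(picky_server, 'http://example.org/api.php', {}))
```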
Federico Leva
8b5378f991 Fix query prop=revisions continuation in MediaWiki 1.22
This wiki has the old query-continue format but it is not exposed here.
2020-02-10 19:33:10 +02:00
Federico Leva
92da7388b0 Avoid asking allpages API if API not available
So that it doesn't have to iterate over non-existent titles.

Fixes https://github.com/WikiTeam/wikiteam/issues/348
2020-02-10 19:32:44 +02:00
Federico Leva
1645c1d832 More robust XML header fetch for getXMLHeader()
Avoid UnboundLocalError: local variable 'xml' referenced before assignment

If the page exists, its XML export is returned by the API; otherwise only
the header that we were looking for.

Fixes https://github.com/WikiTeam/wikiteam/issues/355
2020-02-10 18:18:26 +02:00
Federico Leva
0b37b39923 Define xml header as empty first so that it can fail gracefully
Fixes https://github.com/WikiTeam/wikiteam/issues/355
2020-02-10 18:05:42 +02:00
Federico Leva
becd01b271 Use defined requests.exceptions.ConnectionError
Fixes https://github.com/WikiTeam/wikiteam/issues/356
2020-02-10 18:00:43 +02:00
Federico Leva
f0436ee57c Make mwclient respect the provided HTTP/HTTPS scheme
Fixes https://github.com/WikiTeam/wikiteam/issues/358
2020-02-10 17:59:03 +02:00
Federico Leva
9ec6ce42d3 Finish xmlrevisions option for older wikis
* Actually proceed to the next page when there is no continuation.
* Provide the same output as with the usual per-page export.

Tested on a MediaWiki 1.16 wiki with success.
2020-02-10 17:33:57 +02:00
Federico Leva
0f35d03929 Remove rvlimit=max, fails in MediaWiki 1.16
For instance:
"Exception Caught: Internal error in ApiResult::setElement: Attempting to add element revisions=50, existing value is 500"
https://wiki.rabenthal.net/api.php?action=query&prop=revisions&titles=Hauptseite&rvprop=ids&rvlimit=max
2020-02-10 16:08:04 +02:00
Federico Leva
6b12e20a9d Actually convert the titles query method to mwclient too 2020-02-10 15:46:05 +02:00
Federico Leva
f10adb71af Don't try to add revisions if the namespace has none
Traceback (most recent call last):
  File "dumpgenerator.py", line 2362, in <module>

  File "dumpgenerator.py", line 2354, in main
    resumePreviousDump(config=config, other=other)
  File "dumpgenerator.py", line 1921, in createNewDump
    getPageTitles(config=config, session=other['session'])
  File "dumpgenerator.py", line 755, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session):
  File "dumpgenerator.py", line 861, in getXMLRevisions
    revids.append(str(revision['revid']))
IndexError: list index out of range
2020-02-10 15:27:10 +02:00
Federico Leva
3760501f74 Add a couple comments 2020-02-10 15:25:40 +02:00