mirror of https://github.com/WikiTeam/wikiteam synced 2024-11-10 13:10:27 +00:00
Commit Graph

374 Commits

Author SHA1 Message Date
Pokechu22
df230a96c9 Fix exporting via prop=revisions 2023-05-07 08:39:44 -07:00
yzqzss
90a64c6a22
make requests.session use the --retries value
(default=5)
2023-01-13 23:36:32 +08:00
Pokechu22
97146c6f01 Use the same requests session for getting the wiki engine and checking API/index 2022-11-26 21:32:08 -08:00
Pokechu22
6668999658 Update User-Agent to latest Firefox 2022-11-26 21:32:08 -08:00
Pokechu22
cad7260d7c Fix crash when the image description is missing for an image containing non-ascii characters
title is already unicode, so we shouldn't need to decode it (and don't in generateXMLDump).
2022-10-23 16:29:30 -07:00
Pokechu22
5b3fc4ac7b Pass requests session to mwclient
This means it uses our configured user-agent, as well as any cookies.
2022-10-22 21:06:10 -07:00
Pokechu22
1af69ca147 Skip empty revisions when using --xmlrevisions
Before, the download would die and need to be resumed from the start.
2022-10-21 19:57:31 -07:00
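The skip-empty-revisions fix above can be sketched as a simple filter. This is a minimal illustration, not the actual dumpgenerator code; the dict shape (a `'*'` key holding the revision text, as the MediaWiki API returns for `prop=revisions`) and the function name are assumptions.

```python
def iter_nonempty_revisions(revisions):
    """Yield only revisions that actually carry text, so one empty
    revision does not abort the whole --xmlrevisions download."""
    for rev in revisions:
        # Hypothetical shape: dicts with a '*' key holding the wikitext.
        if rev.get('*'):
            yield rev

revs = [{'revid': 1, '*': 'text'}, {'revid': 2}, {'revid': 3, '*': ''}]
print([r['revid'] for r in iter_nonempty_revisions(revs)])  # → [1]
```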
Pokechu22
4a2cbd4843 Use session.get instead of requests.get in getXMLHeader
`session.get` uses our configured User-Agent, while `requests.get` uses the default one.
2022-09-17 14:30:51 -07:00
Pokechu22
9b2c6e40ae Fix truncation when resuming
There was already code that looked like it was supposed to truncate files, but it calculated the index wrong and didn't properly check all lines. It worked out, though, because it never actually called the truncate function.

Now, truncation occurs to the last `</page>` tag. If the XML file ends with a `</page>` tag, then nothing gets truncated. The page is added after that; if nothing was truncated, this will result in the same page being listed twice (which already happened with the missing truncation), but if truncation did happen then the file should no longer be invalid.
2022-09-16 22:20:27 -07:00
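The truncation described above can be sketched like this. It is a minimal sketch of the idea (cut the file back to the last complete `</page>` tag), not the actual dumpgenerator implementation, which works line by line; the function name is hypothetical.

```python
import tempfile, os

def truncate_to_last_page_tag(path):
    """Truncate an XML dump so it ends at the last '</page>' tag,
    dropping any partially written page after it. If the file already
    ends with '</page>', nothing is removed."""
    with open(path, 'rb+') as f:
        data = f.read()
        cut = data.rfind(b'</page>')
        if cut == -1:
            return  # no complete page yet; leave the file alone
        f.truncate(cut + len(b'</page>'))

# Demo: a dump interrupted mid-page gets cut back to the last full page.
with tempfile.NamedTemporaryFile(delete=False, suffix='.xml') as tmp:
    tmp.write(b'<page>a</page><page>b</page><page>half')
truncate_to_last_page_tag(tmp.name)
print(open(tmp.name, 'rb').read())  # → b'<page>a</page><page>b</page>'
os.remove(tmp.name)
```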
Pokechu22
43945c467f Work around unicode titles not working with resuming
Before, you would get UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal. The %s versus {} change was needed because otherwise you would get UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128). There is probably a better way of solving that, but this one does work.
2022-09-16 22:15:55 -07:00
nemobis
d7b6924845
Merge pull request #408 from shreyasminocha/fix-resume-images
Fix image resuming
2022-02-05 10:23:16 +02:00
Tim Gates
ecbcc6118e
docs: Fix a few typos
There are small typos in:
- dumpgenerator.py
- wikiteam/mediawiki.py

Fixes:
- Should read `inconsistencies` rather than `inconsistences`.
- Should read `partially` rather than `partialy`.
2021-12-25 16:14:05 +11:00
Shreyas Minocha
e55de36cb7
Fix image resuming 2021-06-07 02:40:08 +05:30
Nicolas SAPA
b289f86243 Fix getPageTitlesScraper
Using the API and the Special:Allpages scraper should result in the same number of titles.
Fix the detection of the next subpages on Special:Allpages.
Change the max depth to 100 and implement an anti-loop check (could fail on non-Western wikis).
2020-08-28 08:25:33 +02:00
Nicolas SAPA
e4b43927b9 Fixup description grab in generateImageDump
getXMLPage() yields on "</page>", so xmlfiledesc cannot contain "</mediawiki>".
Change the search to "</page>" and inject "</mediawiki>" if it is missing to fix up the XML.
2020-08-28 06:42:22 +02:00
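The fixup above amounts to appending the missing root close tag so the fragment parses as a complete document. A minimal sketch (the function name is hypothetical; the real generateImageDump works on the strings it gets back from getXMLPage()):

```python
def ensure_mediawiki_close(xml):
    """If the XML fragment stopped at '</page>', append the
    '</mediawiki>' root close tag so the result is well-formed."""
    if '</mediawiki>' not in xml:
        xml = xml.rstrip() + '\n</mediawiki>\n'
    return xml

print(ensure_mediawiki_close('<mediawiki><page>x</page>'))
```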
Nicolas SAPA
eacaf08b2f Try to fix a broken HTTP to HTTPS redirect in generateImageDump()
Some wikis fail to do the HTTP to HTTPS redirect correctly, so try it ourselves.
2020-08-28 06:38:01 +02:00
Nicolas SAPA
7675b0d17c Add exception handler for requests.exceptions.ReadTimeout in getXMLPageCore()
Treat a ReadTimeout the same as a ConnectionError (log the error & retry)
2020-08-28 06:12:56 +02:00
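Treating a read timeout the same as a connection error comes down to catching both exception types in the same retry loop. A sketch under stand-in names: the real code catches requests.exceptions.ConnectionError and ReadTimeout inside getXMLPageCore(); here generic OSError/TimeoutError and the function names stand in for them.

```python
import time

def fetch_with_retries(fetch, retries=5, delay=0,
                       retry_on=(OSError, TimeoutError)):
    """Call `fetch` until it succeeds, logging and retrying on any of
    the transient error types in `retry_on` (timeouts and connection
    errors are handled identically)."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except retry_on as exc:
            print('attempt %d failed: %r' % (attempt, exc))
            time.sleep(delay)
    raise RuntimeError('giving up after %d attempts' % retries)

# Demo: fail with a timeout, then a connection error, then succeed.
calls = {'n': 0}
def flaky():
    calls['n'] += 1
    if calls['n'] == 1:
        raise TimeoutError('read timed out')
    if calls['n'] == 2:
        raise OSError('connection reset')
    return 'page xml'

print(fetch_with_retries(flaky))  # → page xml
```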
Nicolas SAPA
4a5eef97da Update the default user-agent
A ModSecurity rule blocks the old UA, so switch to the current Firefox 78 UA.
2020-08-28 06:09:20 +02:00
Rob Kam
e6f4674b42
fix typo 2020-06-11 10:28:18 +01:00
Federico Leva
abd908914f Adapt to some more Wikia wikis edge cases
* Make it easy to batch requests for some wikis where millions of titles
  are really just one-revision thread items and need to be gone through
  as fast as possible.
* Status code error message.
2020-03-05 12:51:32 +02:00
Federico Leva
7de75012d1 Fix merge of the getXMLRevisions() loop 2020-02-22 14:50:04 +02:00
nemobis
8a2116699e
Merge branch 'master' into wikia 2020-02-22 14:25:37 +02:00
Federico Leva
7289225d2c Directly catch exception for page missing in getXMLRevisions()
The caller cannot catch the PME exception because it doesn't know about
the title. Just log the error here.
2020-02-22 14:15:39 +02:00
nemobis
e136ee5536
Merge pull request #372 from nemobis/wikia
Avoid launcher.py 7z failures
2020-02-17 13:14:33 +02:00
Federico Leva
8c6f05bb54 Consider status code before content in checkIndex() and checkalive.py
Fixes https://github.com/WikiTeam/wikiteam/issues/369
2020-02-16 16:46:19 +02:00
Federico Leva
9ac1e6d0f1 Implement resume in --xmlrevisions (but not yet with list=allrevisions)
Tested with a partial dump over 100 MB:
https://tinyvillage.fandom.com/api.php
(grepped <title> to see that the previously downloaded ones were kept and the
new ones continued from the expected point; did not validate the final XML).
2020-02-14 13:16:33 +02:00
Federico Leva
a664b17a9c Handle deleted contributor name in --xmlrevisions
Avoids failure in https://deployment.wikimedia.beta.wmflabs.org/w/api.php
for revision https://deployment.wikimedia.beta.wmflabs.org/?oldid=2349 .
2020-02-13 17:13:16 +02:00
Federico Leva
b162e7b14f Reduce the API limit to 50 for arvlimit, gaplimit, ailimit
Avoids crashing on errors or warnings that some wikis return for bigger
requests, like https://www.openkm.com/wiki/api.php (MediaWiki 1.27.3).
2020-02-13 15:58:39 +02:00
Federico Leva
d543f7d4dd Check the API URL against mwclient too, so it doesn't fail later
Change the protocol from HTTP to HTTPS if needed. Fixes:
http://nimiarkisto.fi/w/api.php
2020-02-13 15:45:17 +02:00
Federico Leva
d1619392f4 Force the lxml factory to pass around unicode strings
Not necessarily the most compatible with downstream XML parsers, but at
least should ensure that we manage to write the XML file. The encoding
declared in the header is not necessarily the same we get from the API.

See also:
https://lxml.de/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings
https://lxml.de/3.7/parsing.html#serialising-to-unicode-strings

Fixes https://github.com/WikiTeam/wikiteam/issues/363
2020-02-13 14:32:58 +02:00
Federico Leva
6dc86d1964 Actually use the next batch from prop=revisions in MediaWiki 1.19 2020-02-13 10:49:29 +02:00
Federico Leva
2ba69b3810 Indent the number of revisions more, consistent with page title style 2020-02-11 22:44:17 +02:00
Federico Leva
8fef62d46e Implement continuation for --xmlrevisions with prop=revisions in MW 1.19 2020-02-11 17:22:36 +02:00
Federico Leva
8b58599645 Merge branch 'xmlrevisions' of github.com:nemobis/wikiteam into xmlrevisions 2020-02-10 23:07:20 +02:00
Federico Leva
17283113dd Wikia: make getXMLHeader() check more lenient
Otherwise we end up using Special:Export even though the export API
would work perfectly well with --xmlrevisions.

For some reason using the general requests session always got an empty
response from the Wikia API.

May also fix images on fandom.com:
https://github.com/WikiTeam/wikiteam/issues/330
2020-02-10 23:05:16 +02:00
Federico Leva
2c21eadf7c Wikia: make getXMLHeader() check more lenient
Otherwise we end up using Special:Export even though the export API
would work perfectly well with --xmlrevisions.

May also fix images on fandom.com:
https://github.com/WikiTeam/wikiteam/issues/330
2020-02-10 22:32:01 +02:00
Federico Leva
131e19979c Use mwclient generator for allpages
Tested with MediaWiki 1.31 and 1.19.
2020-02-10 22:13:21 +02:00
Federico Leva
faf0e31b4e Don't set apfrom in initial allpages request, use suggested continuation
Helps with recent MediaWiki versions like 1.31 where variants of "!" can
give a bad title error and the continuation wants apcontinue anyway.
2020-02-10 21:19:01 +02:00
Federico Leva
49017e3f20 Catch HTTP Error 405 and switch from POST to GET for API requests
Seen on http://wiki.ainigma.eu/index.php?title=Hlavn%C3%AD_strana:
HTTPError: HTTP Error 405: Method Not Allowed
2020-02-10 20:52:13 +02:00
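The POST-to-GET fallback above can be sketched as follows. This is a minimal illustration with hypothetical names: `send(method, url, params)` stands in for the real HTTP transport (the actual code uses urllib2/requests and inspects the HTTP status of the error).

```python
class HTTPError(Exception):
    """Stand-in for an HTTP error carrying a status code."""
    def __init__(self, code):
        super().__init__('HTTP Error %d' % code)
        self.code = code

def api_request(send, url, params):
    """Try the API request with POST; if the server answers
    405 Method Not Allowed, retry the same request as GET."""
    try:
        return send('POST', url, params)
    except HTTPError as err:
        if err.code != 405:
            raise
        return send('GET', url, params)

# Demo: a server that rejects POST, as seen on wiki.ainigma.eu.
def picky_server(method, url, params):
    if method == 'POST':
        raise HTTPError(405)
    return '<api result for %s>' % url

print(api_request(picky_server, 'http://example.org/api.php', {}))
```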
Federico Leva
8b5378f991 Fix query prop=revisions continuation in MediaWiki 1.22
This wiki has the old query-continue format but it is not exposed here.
2020-02-10 19:33:10 +02:00
Federico Leva
92da7388b0 Avoid asking allpages API if API not available
So that it doesn't have to iterate over non-existent titles.

Fixes https://github.com/WikiTeam/wikiteam/issues/348
2020-02-10 19:32:44 +02:00
Federico Leva
1645c1d832 More robust XML header fetch for getXMLHeader()
Avoid UnboundLocalError: local variable 'xml' referenced before assignment

If the page exists, its XML export is returned by the API; otherwise only
the header that we were looking for.

Fixes https://github.com/WikiTeam/wikiteam/issues/355
2020-02-10 18:18:26 +02:00
Federico Leva
0b37b39923 Define xml header as empty first so that it can fail gracefully
Fixes https://github.com/WikiTeam/wikiteam/issues/355
2020-02-10 18:05:42 +02:00
Federico Leva
becd01b271 Use defined requests.exceptions.ConnectionError
Fixes https://github.com/WikiTeam/wikiteam/issues/356
2020-02-10 18:00:43 +02:00
Federico Leva
f0436ee57c Make mwclient respect the provided HTTP/HTTPS scheme
Fixes https://github.com/WikiTeam/wikiteam/issues/358
2020-02-10 17:59:03 +02:00
Federico Leva
9ec6ce42d3 Finish xmlrevisions option for older wikis
* Actually proceed to the next page when there is no continuation.
* Provide the same output as with the usual per-page export.

Tested on a MediaWiki 1.16 wiki with success.
2020-02-10 17:33:57 +02:00
Federico Leva
0f35d03929 Remove rvlimit=max, fails in MediaWiki 1.16
For instance:
"Exception Caught: Internal error in ApiResult::setElement: Attempting to add element revisions=50, existing value is 500"
https://wiki.rabenthal.net/api.php?action=query&prop=revisions&titles=Hauptseite&rvprop=ids&rvlimit=max
2020-02-10 16:08:04 +02:00
Federico Leva
6b12e20a9d Actually convert the titles query method to mwclient too 2020-02-10 15:46:05 +02:00
Federico Leva
f10adb71af Don't try to add revisions if the namespace has none
Traceback (most recent call last):
  File "dumpgenerator.py", line 2362, in <module>

  File "dumpgenerator.py", line 2354, in main
    resumePreviousDump(config=config, other=other)
  File "dumpgenerator.py", line 1921, in createNewDump
    getPageTitles(config=config, session=other['session'])
  File "dumpgenerator.py", line 755, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session):
  File "dumpgenerator.py", line 861, in getXMLRevisions
    revids.append(str(revision['revid']))
IndexError: list index out of range
2020-02-10 15:27:10 +02:00
Federico Leva
3760501f74 Add a couple comments 2020-02-10 15:25:40 +02:00