* Still maintained and available for Python 3 as well.
* Allows raw API requests, as we need.
* Does not provide handy generators, so we have to handle continuation
  ourselves (see the sketch after this list).
* Decides on its own which protocol and exact path to use, and sometimes fails at it.
* Appears to use POST by default unless asked otherwise; what to do?
* It was just an old trick to get past some barriers which were waived with GET.
* It's not conformant and doesn't play well with some redirects.
* Some recent wikis seem not to like it at all; see also issue #311.
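Since we have to drive continuation ourselves, here is a minimal sketch of that
manual loop, using the standard MediaWiki 'continue' mechanism and plain
requests (a sketch, not dumpgenerator.py's actual code):

import requests

def api_query_all(api_url, params, session=None):
    # Yield successive API replies, following MediaWiki continuation by hand.
    session = session or requests.Session()
    base = dict(params, action='query', format='json')
    cont = {}
    while True:
        req = dict(base)
        req.update(cont)              # pass the whole 'continue' block back verbatim
        data = session.get(api_url, params=req, timeout=30).json()
        yield data
        if 'continue' not in data:
            break
        cont = data['continue']

# e.g.: for chunk in api_query_all(api, {'list': 'allpages', 'aplimit': 'max'}): ...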
Warning!: "./tdicampswikiacom-20180522-wikidump" path exists
Traceback (most recent call last):
File "./dumpgenerator.py", line 2321, in <module>
main()
File "./dumpgenerator.py", line 2283, in main
while reply.lower() not in ['yes', 'y', 'no', 'n']:
UnboundLocalError: local variable 'reply' referenced before assignment
Traceback (most recent call last):
File "./dumpgenerator.py", line 2294, in <module>
main()
File "./dumpgenerator.py", line 2239, in main
config, other = getParameters(params=params)
File "./dumpgenerator.py", line 1587, in getParameters
if api and check:
UnboundLocalError: local variable 'check' referenced before assignment
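Both tracebacks show the same pattern: a local that is only assigned inside a
branch which may never run. A minimal sketch of the shape of the fix
(hypothetical helper, not the actual dumpgenerator.py code):

def ask_resume(path, assume_yes=False):
    # Fix: give 'reply' a value unconditionally, so the loop below can never
    # hit an unassigned local (the cause of the UnboundLocalError above).
    reply = 'yes' if assume_yes else ''
    while reply.lower() not in ['yes', 'y', 'no', 'n']:
        # raw_input() on Python 2
        reply = input('Warning!: "%s" path exists. Resume? [yes/no] ' % path)
    return reply.lower() in ('yes', 'y')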
Test case:
Titles saved at... 39fanficwikiacom-20180521-titles.txt
377 page titles loaded
http://39fanfic.wikia.com/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
30 namespaces found
Exporting revisions from namespace 0
Warning. Could not use allrevisions, wiki too old.
1 more revisions exported
Traceback (most recent call last):
File "./dumpgenerator.py", line 2291, in <module>
main()
File "./dumpgenerator.py", line 2283, in main
createNewDump(config=config, other=other)
File "./dumpgenerator.py", line 1849, in createNewDump
generateXMLDump(config=config, titles=titles, session=other['session'])
File "./dumpgenerator.py", line 732, in generateXMLDump
for xml in getXMLRevisions(config=config, session=session):
File "./dumpgenerator.py", line 861, in getXMLRevisions
yield makeXmlFromPage(pages[page])
File "./dumpgenerator.py", line 880, in makeXmlFromPage
E.username(str(rev['user'])),
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-3: ordinal not in range(128)
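The crash comes from forcing a non-ASCII username through str() under Python 2.
A minimal sketch of a safer conversion (the helper name is made up):

def to_text(value):
    # Return the value as unicode text; never round-trip it through str(),
    # which dies on non-ASCII under Python 2.
    if isinstance(value, bytes):               # Python 2 str / raw bytes
        return value.decode('utf-8', 'replace')
    return value                               # already unicode text

# e.g. E.username(to_text(rev['user'])) instead of E.username(str(rev['user']))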
Apparently the initial JSON test is not enough: the JSON can be broken
or unexpected in other ways or at other points.
Fall back to the old scraper in such a case.
Fixes https://github.com/WikiTeam/wikiteam/issues/295, perhaps.
If the scraper doesn't work for the wiki, the dump will fail entirely,
even though the list of titles may have been almost complete. A different
solution may be in order.
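A minimal sketch of that check-and-fallback idea, for the kind of JSON failure
shown in the tracebacks below (the scraper call is a hypothetical stand-in):

import requests

def get_json_or_none(session, url, params):
    # Return the decoded API reply, or None when the body is not valid JSON.
    try:
        r = session.get(url, params=params, timeout=30)
        return r.json()                # raises a ValueError subclass on broken JSON
    except (requests.RequestException, ValueError):
        return None

# data = get_json_or_none(session, apiurl,
#                         {'action': 'query', 'meta': 'siteinfo', 'format': 'json'})
# if data is None:
#     titles = get_page_titles_scraper(indexurl, session)   # hypothetical fallback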
Traceback (most recent call last):
File "dumpgenerator.py", line 2214, in <module>
print 'Trying to use path "%s"...' % (config['path'])
File "dumpgenerator.py", line 2210, in main
elif reply.lower() in ['no', 'n']:
File "dumpgenerator.py", line 1977, in saveSiteInfo
File "dumpgenerator.py", line 1711, in getJSON
return False
File "/usr/lib/python2.7/site-packages/requests/models.py", line 892, in json
return complexjson.loads(self.text, **kwargs)
File "/usr/lib64/python2.7/site-packages/simplejson/__init__.py", line 516, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 374, in decode
obj, end = self.raw_decode(s)
File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 404, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Or if, for instance, the directory was renamed compared to the saved config:
Resuming previous dump process...
Traceback (most recent call last):
File "./dumpgenerator.py", line 2238, in <module>
main()
File "./dumpgenerator.py", line 2228, in main
resumePreviousDump(config=config, other=other)
File "./dumpgenerator.py", line 1829, in resumePreviousDump
if lasttitle == '--END--':
UnboundLocalError: local variable 'lasttitle' referenced before assignment
* Do not try exportnowrap first: it returns a blank page.
* Add an allpages option, which simply uses readTitles but cannot resume.
FIXME: this only exports the current revision!
$ python dumpgenerator.py --xml --index=http://meritbadge.org/wiki/index.php
fails on at least one MediaWiki 1.12 wiki:
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
Traceback (most recent call last):
File "dumpgenerator.py", line 2211, in <module>
main()
File "dumpgenerator.py", line 2203, in main
createNewDump(config=config, other=other)
File "dumpgenerator.py", line 1766, in createNewDump
getPageTitles(config=config, session=other['session'])
File "dumpgenerator.py", line 400, in getPageTitles
test = getJSON(r)
File "dumpgenerator.py", line 1708, in getJSON
return request.json()
File "/usr/lib/python2.7/site-packages/requests/models.py", line 892, in json
return complexjson.loads(self.text, **kwargs)
File "/usr/lib64/python2.7/site-packages/simplejson/__init__.py", line 516, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 374, in decode
obj, end = self.raw_decode(s)
File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 404, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
* http://biografias.bcn.cl/api.php does not like the data to be POSTed.
Just use URL parameters (see the sketch after this list). Some wikis had
anti-spam protections which made us POST everything, but for most wikis
this should be fine.
* If the index is not defined, don't fail.
* Use only the base api.php URL, not parameters, in domain2prefix.
https://github.com/WikiTeam/wikiteam/issues/314
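A minimal sketch of the GET-first behaviour described above, retrying with POST
only when the wiki refuses the GET (the retry condition here is an assumption):

import requests

def api_request(api_url, params, session=None):
    # Send the query as URL parameters; fall back to POST only if GET is refused.
    session = session or requests.Session()
    r = session.get(api_url, params=params, timeout=30)
    if r.status_code in (405, 414):    # assumption: method not allowed / URI too long
        r = session.post(api_url, data=params, timeout=30)
    return r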
File "./dumpgenerator.py", line 1212, in generateImageDump
if not re.search(r'</mediawiki>', xmlfiledesc):
UnboundLocalError: local variable 'xmlfiledesc' referenced before assignment
We work well without api.php, so this was a needless suicide, especially
as sysadmins sometimes like to disable the API for no reason and then
index.php is our only option to archive the wiki.
* Log "XML export on this wiki is broken, quitting." to the error
file so that grepping reveals which dumps were interrupted so.
* Automatically reduce export size for a page when downloading the
entire history at once results in a MemoryError.
* Truncate the file with a pythonic method (.seek and .truncate)
while reading from the end, by making reverse_readline() a weird
hybrid to avoid an actual coroutine.
Approach suggested by @makoshark, finally found the time to start
implementing it.
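A sketch of the .seek()/.truncate() truncation mentioned above, assuming the
caller has already located the byte offset of the last good line:

def truncate_at(path, offset):
    # Drop everything after byte 'offset', in place, without rewriting the file.
    with open(path, 'r+b') as fh:
        fh.seek(offset)
        fh.truncate()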
* Do not produce and save the titles list all at once. Instead, use
the scraper and API as generators and save titles on the go. Also,
try to start the generator from the appropriate title.
For now the title sorting is not implemented. Pages will be in the
order given by namespace ID, then page name.
* When resuming, read both the title list and the XML file from the
end rather than from the beginning (see the reverse_readline() sketch
below). If the correct terminator is present, only one line needs to
be read.
* In both cases, use a generator instead of a huge list in memory.
* Also truncate the resumed XML without writing it from scratch.
For now using GNU ed: very compact, though shelling out is ugly.
I gave up on using file.seek and file.truncate to avoid reading the
whole file from the beginning or complicating reverse_readline()
with more offset calculations.
This should avoid MemoryError in most cases.
Tested by running a dump over a 1.24 wiki with 11 pages: a complete
dump and a resumed dump from a dump interrupted with ctrl-c.
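For reference, a minimal sketch of the reverse_readline() idea (yielding a
file's lines from the end without loading it whole); not the exact
implementation:

import os

def reverse_readline(path, buf_size=8192):
    # Yield the lines of a file from the last one to the first.
    with open(path, 'rb') as fh:
        fh.seek(0, os.SEEK_END)
        pos = fh.tell()
        leftover = b''
        while pos > 0:
            read_size = min(buf_size, pos)
            pos -= read_size
            fh.seek(pos)
            chunk = fh.read(read_size) + leftover
            lines = chunk.split(b'\n')
            leftover = lines.pop(0)    # possibly a partial line; kept for the next chunk
            for line in reversed(lines):
                yield line.decode('utf-8', 'replace')
        if leftover:
            yield leftover.decode('utf-8', 'replace')

# e.g. find the last non-empty line of a resumed titles list:
# last = next(line for line in reverse_readline('titles.txt') if line.strip())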
* For some reason, in a previous commit I had noticed that maxretries
was not respected in getXMLPageCore, but I didn't fix it. Done now.
* If the "Special" namespace alias doesn't work, fetch the local one.
My previous code broke the detection of missing pages, with two negative
outcomes:
- missing pages were not reported in the error log
- every missing page generated an extraneous "</page>" line in the output,
which rendered the dumps invalid
This patch improves the exception code in general and fixes both of these
issues.
The Wikia API is exporting sha1 sums as part of the response for pages.
These are invalid XML and cause dump parsing code (e.g.,
MediaWiki-Utilities) to fail. Also, sha1 should be a property of revisions,
not pages, so it's not entirely clear to me what this is referring to.
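One possible cleanup, sketched here, is simply dropping the <sha1> elements
before writing the page XML (not necessarily what the actual fix does):

import re

SHA1_TAG = re.compile(r'\s*<sha1>[^<]*</sha1>')

def strip_sha1(page_xml):
    # Drop all <sha1> elements from the given XML fragment; Wikia emits them
    # at the page level, where the dump schema does not allow them.
    return SHA1_TAG.sub('', page_xml)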
The program tended to run out of memory when processing very large pages (i.e.,
pages with extremely large numbers of revisions or pages with large numbers of
very large revisions). This mitigates the problem by changing getXMLPage() into
a generator which allows us to write pages after each request to the API.
This required changes to the getXMLPage() function and to other
parts of the code that called it.
Additionally, when the function was called, its text was checked in several
ways. This required a few changes, including a running tally of revisions
instead of a post hoc check, and moving the error checking into an
Exception rather than just an if statement that looked at the final result.
Properly fixes #74.
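A minimal sketch of the generator shape described above (the fetch callable
and its signature are placeholders, not the real getXMLPage() internals):

def export_page(title, fetch_chunk, limit=1000):
    # Yield the page's revision XML chunk by chunk, so the caller can write
    # each chunk to disk immediately instead of holding the whole history
    # in memory. fetch_chunk(title, offset, limit) is assumed to return
    # (xml_fragment, next_offset), with next_offset None when exhausted.
    offset = None
    while True:
        fragment, offset = fetch_chunk(title, offset, limit)
        yield fragment
        if offset is None:
            break

The caller can then count revisions per yielded chunk, which is the running
tally mentioned above, instead of a post hoc check on the final result.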
Algorithm:
1. Try all siteinfo props. If this gives an error, continue. Otherwise, stop.
2. Try MediaWiki 1.11-1.12 siteinfo props. If this gives an error, continue. Otherwise, stop.
3. Try minimal siteinfo props. Stop.
Not using sishowalldb=1 (by default), to avoid a possible error, since this data is of little use anyway.
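A sketch of that three-step fallback, assuming requests; the exact siprop lists
are illustrative:

import requests

SIPROP_ATTEMPTS = [
    'general|namespaces|namespacealiases|statistics|dbrepllag|interwikimap',  # 1. all props
    'general|namespaces|statistics|dbrepllag|interwikimap',                   # 2. MediaWiki 1.11-1.12
    'general|namespaces',                                                     # 3. minimal
]

def get_siteinfo(api_url, session=None):
    session = session or requests.Session()
    data = None
    for siprop in SIPROP_ATTEMPTS:
        try:
            data = session.get(api_url, params={
                'action': 'query', 'meta': 'siteinfo',
                'siprop': siprop, 'format': 'json'}, timeout=30).json()
        except ValueError:
            continue                   # broken reply: try the next, smaller query
        if 'error' not in data:
            break                      # this query worked, keep its result
    return data                        # after the minimal query we stop regardless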