Commit Graph

318 Commits (d76b4b4e018587b53ed9d51747425f82c1252b91)

Author SHA1 Message Date
Federico Leva d76b4b4e01 Raise and catch PageMissingError when revisions API result is incomplete
https://github.com/WikiTeam/wikiteam/issues/317
6 years ago
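A minimal sketch of the pattern described in the commit above, assuming the response shape of action=query&prop=revisions; the class and function names here are illustrative, not the project's exact code:

    class PageMissingError(Exception):
        """Raised when the revisions API returns no usable data for a page."""

    def revisions_of(page):
        # 'page' is one entry of the API's query.pages dict (key names assumed)
        if 'revisions' not in page:
            raise PageMissingError(page.get('title', '?'))
        return page['revisions']

    def export_pages(pages):
        for page in pages:
            try:
                yield page.get('title', '?'), revisions_of(page)
            except PageMissingError:
                continue  # incomplete API result: skip the page, keep the dump going
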
Federico Leva 7a655f0074 Check for sha1 presence in makeXmlFromPage() 6 years ago
Federico Leva 4bc41c3aa2 Actually keep track of listed titles and stop when duplicates are returned
https://github.com/WikiTeam/wikiteam/issues/309
6 years ago
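The idea behind that fix, as a hedged sketch: remember every title already listed and stop as soon as a batch contains nothing new, which is what a broken continuation looks like. fetch_batch is a hypothetical callable standing in for one allpages API request:

    def list_all_titles(fetch_batch):
        seen = set()
        while True:
            batch = fetch_batch()            # one page of allpages results
            if not batch:
                break
            new = [t for t in batch if t not in seen]
            if not new:                      # only duplicates came back: stop looping
                break
            seen.update(new)
            for title in new:
                yield title
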
Federico Leva 80288cf49e Catch allpages and namespaces API without query results 6 years ago
Federico Leva e47f638a24 Define "check" before running checkAPI()
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2294, in <module>
    main()
  File "./dumpgenerator.py", line 2239, in main
    config, other = getParameters(params=params)
  File "./dumpgenerator.py", line 1587, in getParameters
    if api and check:
UnboundLocalError: local variable 'check' referenced before assignment
6 years ago
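The usual fix for this kind of UnboundLocalError, shown as a self-contained sketch rather than the real getParameters(): give the variable a default before the branch that may skip its assignment. The checkAPI() call is replaced by a stand-in callable:

    def choose_entry_point(api, index, api_works):
        check = None                     # default added so the later test is safe
        if api:
            check = api_works(api)       # stand-in for the real checkAPI() call
        if api and check:
            return api
        return index
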
Federico Leva bad49d7916 Also default to regenerating dump in --failfast 6 years ago
Federico Leva bbcafdf869 Support Unicode usernames etc. in makeXmlFromPage()
Test case:

Titles saved at... 39fanficwikiacom-20180521-titles.txt
377 page titles loaded
http://39fanfic.wikia.com/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
30 namespaces found
Exporting revisions from namespace 0
Warning. Could not use allrevisions, wiki too old.
1 more revisions exported
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2291, in <module>
    main()
  File "./dumpgenerator.py", line 2283, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1849, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 732, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session):
  File "./dumpgenerator.py", line 861, in getXMLRevisions
    yield makeXmlFromPage(pages[page])
  File "./dumpgenerator.py", line 880, in makeXmlFromPage
    E.username(str(rev['user'])),
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-3: ordinal not in range(128)
6 years ago
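A hedged sketch of the Unicode-safety idea: avoid forcing user names through str(), which fails on non-ASCII text under Python 2, and hand lxml proper text instead. The fallback decoding below is an assumption, not necessarily the project's exact fix:

    from lxml.builder import E

    def username_element(rev):
        user = rev.get('user', '')
        if isinstance(user, bytes):          # raw UTF-8 bytes from the API
            user = user.decode('utf-8', 'replace')
        return E.username(user)              # lxml accepts text directly
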
Federico Leva 320f231d57 Handle status code > 400 in checkAPI()
Fixes https://github.com/WikiTeam/wikiteam/issues/315
6 years ago
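Roughly what "handle status code > 400" means in practice, sketched with the requests API; the query parameters and the False return value are assumptions for illustration:

    import requests

    def check_api(url, session=None):
        session = session or requests.Session()
        r = session.get(url, params={'action': 'query', 'meta': 'siteinfo',
                                     'format': 'json'}, timeout=30)
        if r.status_code > 400:
            # an error page, not an API answer: report failure instead of
            # trying to parse HTML as JSON
            return False
        return 'query' in r.json()
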
Federico Leva 845c05de1e Go back to data POSTing in checkIndex() and checkAPI() to handle redirects
Some redirects from HTTP to HTTPS otherwise end up giving 400, like
http://nimiarkisto.fi/
6 years ago
Federico Leva de752bb6a2 Also add contentmodel to the XML of --xmlrevisions 6 years ago
Federico Leva 03ba77e2f5 Build XML from the pages module when allrevisions not available 6 years ago
Federico Leva 06ad1a9fe3 Update --xmlrevisions help 6 years ago
Federico Leva 7143f7efb1 Actually export all revisions in --xmlrevisions: build XML manually! 6 years ago
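In the spirit of that commit, a sketch of building MediaWiki-export-like XML by hand from API revision dicts with lxml's element factory (the same E seen in later tracebacks). Element and key names follow the export format loosely; this is illustrative, not the real makeXmlFromPage():

    from lxml import etree
    from lxml.builder import E

    def page_to_xml(page):
        revisions = [
            E.revision(
                E.id(str(rev.get('revid', ''))),
                E.timestamp(rev.get('timestamp', '')),
                E.contributor(E.username(rev.get('user', ''))),
                E.text(rev.get('*', '')),
            )
            for rev in page.get('revisions', [])
        ]
        node = E.page(E.title(page.get('title', '')),
                      E.id(str(page.get('pageid', ''))),
                      *revisions)
        return etree.tostring(node, pretty_print=True, encoding='unicode')
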
Federico Leva 1ff5af7d44 Catch unexpected API errors in getPageTitlesAPI
Apparently the initial JSON test is not enough; the JSON can be broken
or unexpected in other ways or at other points.
Fall back to the old scraper in such a case.

Fixes https://github.com/WikiTeam/wikiteam/issues/295 , perhaps.

If the scraper doesn't work for the wiki, the dump will fail entirely,
even if the list of titles was almost complete. A different
solution may be in order.
6 years ago
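The fallback pattern this commit describes, reduced to a sketch: the two listers are hypothetical callables standing in for the API-based listing and the Special:Allpages scraper:

    def list_titles_with_fallback(api_lister, scraper_lister):
        try:
            return api_lister()
        except Exception as err:     # broken or unexpected JSON, HTTP errors, ...
            print('API listing failed (%s); falling back to the scraper' % err)
            return scraper_lister()
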
Federico Leva 59c4c5430e Catch missing titles file and JSON response
Traceback (most recent call last):
  File "dumpgenerator.py", line 2214, in <module>
    print 'Trying to use path "%s"...' % (config['path'])
  File "dumpgenerator.py", line 2210, in main
    elif reply.lower() in ['no', 'n']:
  File "dumpgenerator.py", line 1977, in saveSiteInfo

  File "dumpgenerator.py", line 1711, in getJSON
    return False
  File "/usr/lib/python2.7/site-packages/requests/models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib64/python2.7/site-packages/simplejson/__init__.py", line 516, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 374, in decode
    obj, end = self.raw_decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 404, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Or, for instance, if the directory was renamed compared to the saved config:

Resuming previous dump process...
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2238, in <module>
    main()
  File "./dumpgenerator.py", line 2228, in main
    resumePreviousDump(config=config, other=other)
  File "./dumpgenerator.py", line 1829, in resumePreviousDump
    if lasttitle == '--END--':
UnboundLocalError: local variable 'lasttitle' referenced before assignment
6 years ago
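The defensive JSON handling behind this commit, as a minimal sketch: requests raises ValueError (simplejson's JSONDecodeError is a subclass) when the body is not JSON, for example an HTML error page, so a getJSON()-style helper can report failure instead of crashing:

    def get_json(response):
        try:
            return response.json()
        except ValueError:           # includes simplejson.JSONDecodeError
            return False
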
Federico Leva b307de6cb7 Make --xmlrevisions work on Wikia
* Do not try exportnowrap first: it returns a blank page.
* Add an allpages option, which simply uses readTitles but cannot resume.

FIXME: this only exports the current revision!
6 years ago
Federico Leva 680145e6a5 Fallback for --xmlrevisions on a MediaWiki 1.12 wiki 6 years ago
Federico Leva 27cbdfd302 Circumvent API exception when trying to use index.php
$ python dumpgenerator.py --xml --index=http://meritbadge.org/wiki/index.php
fails on at least one MediaWiki 1.12 wiki:

Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
Traceback (most recent call last):
  File "dumpgenerator.py", line 2211, in <module>
    main()
  File "dumpgenerator.py", line 2203, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 1766, in createNewDump
    getPageTitles(config=config, session=other['session'])
  File "dumpgenerator.py", line 400, in getPageTitles
    test = getJSON(r)
  File "dumpgenerator.py", line 1708, in getJSON
    return request.json()
  File "/usr/lib/python2.7/site-packages/requests/models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib64/python2.7/site-packages/simplejson/__init__.py", line 516, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 374, in decode
    obj, end = self.raw_decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 404, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
6 years ago
Federico Leva d4f0869ecc Consistently use POST params instead of data
Also match URLs which end in ".php$" in domain2prefix().
6 years ago
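The ".php$" matching mentioned above, illustrated with a simplified, assumed version of domain2prefix(); the real function handles more cases:

    import re

    def domain2prefix(url):
        domain = re.sub(r'^https?://', '', url)
        domain = re.sub(r'/(api|index)\.php$', '', domain)   # strip the script name
        return re.sub(r'[^A-Za-z0-9]', '', domain).lower()

    # e.g. domain2prefix('http://example.org/w/api.php') -> 'exampleorgw'
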
Federico Leva 754027de42 xmlrevisions: actually allow index to be undefined, don't POST data
* http://biografias.bcn.cl/api.php does not like the data to be POSTed.
  Just use URL parameters. Some wikis had anti-spam protections which
  made us POST everything, but for most wikis this should be fine.
* If the index is not defined, don't fail.
* Use only the base api.php URL, not parameters, in domain2prefix.

https://github.com/WikiTeam/wikiteam/issues/314
6 years ago
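What "just use URL parameters" looks like with requests, as a hedged sketch; the parameter names follow the MediaWiki API, but the function itself is illustrative:

    import requests

    def fetch_revisions(api_url, title, session=None):
        session = session or requests.Session()
        params = {'action': 'query', 'titles': title, 'prop': 'revisions',
                  'rvlimit': 'max', 'format': 'json'}
        r = session.get(api_url, params=params, timeout=30)   # params=, not data=
        return r.json()
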
Federico Leva 7c545d05b7 Fix UnboundLocalError and catch RetryError with --xmlrevisions
File "./dumpgenerator.py", line 1212, in generateImageDump
    if not re.search(r'</mediawiki>', xmlfiledesc):

UnboundLocalError: local variable 'xmlfiledesc' referenced before assignment
6 years ago
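Both fixes in one hedged sketch: give xmlfiledesc a default so the regex test cannot hit an unbound local, and treat a RetryError from the session as "no description available". The fetch callable is an assumption standing in for the real request:

    import re
    from requests.exceptions import RetryError

    def get_file_description(fetch):
        xmlfiledesc = ''                 # default added by the fix
        try:
            xmlfiledesc = fetch()
        except RetryError:
            pass                         # too many retries: keep the empty default
        if not re.search(r'</mediawiki>', xmlfiledesc):
            xmlfiledesc = ''             # truncated or missing export
        return xmlfiledesc
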
Federico Leva 952fcc6bcf Up version to 0.4.0-alpha to signify disruption 6 years ago
Federico Leva 33bb1c1f23 Download image description from API when using --xmlrevisions
Fixes https://github.com/WikiTeam/wikiteam/issues/308

Also add --failfast option to sneak in all the hacks I use to run
the bulk downloads, so I can more easily sync the repos.
6 years ago
Federico Leva be5ca12075 Avoid generators in API-only export 6 years ago
Fedora a8cbb357ff First attempt of API-only export 6 years ago
Fedora 142b48cc69 Add timeouts and retries to increase success rate 6 years ago
Hydriz Scholz e8cb1a7a5f Add missing newline 8 years ago
Hydriz Scholz daa39616e2 Make the copyright year automatically update itself 8 years ago
emijrp 2e99d869e2 fixing Wikia images bug, issue #212 8 years ago
emijrp 3f697dbb5b restoring dumpgenerator.py code to f43b7389a0, the last stable version. I will rewrite the code in the wikiteam/ subdirectory 8 years ago
emijrp 4e939f4e98 getting ready for other wiki engines; function prefixes: MediaWiki (mw), Wikispaces (ws) 8 years ago
nemobis f43b7389a0 Merge pull request #270 from nemobis/master
When we have a working index.php, do not require api.php
8 years ago
Federico Leva 480d421d7b When we have a working index.php, do not require api.php
We work well without api.php; requiring it was needless.
Especially since sysadmins sometimes disable the API for no
reason, and then index.php is our only option to archive the wiki.
8 years ago
emijrp 15223eb75b Parsing more image names from HTML Special:Allimages 9 years ago
emijrp e138d6ce52 New API params to continue in Allimages 9 years ago
emijrp 2c0f54d73b new HTML regexp for Special:Allpages 9 years ago
emijrp 4ef665b53c In recent MediaWiki versions, API continue is a bit different 9 years ago
Daniel Oaks 376e8a11a3 Avoid out-of-memory error in two extra places 9 years ago
Tim Sheerman-Chase 877b736cd2 Merge branch 'retry' of https://github.com/TimSC/wikiteam into retry 9 years ago
Tim Sheerman-Chase 6716ceab32 Fix tests 9 years ago
Tim Sheerman-Chase 5cb2ecb6b5 Attempting to fix missing config in tests 9 years ago
Tim 93bc29f2d7 Fix syntax errors 9 years ago
Tim d5a1ed2d5a Fix indentation, use classic string formatting 9 years ago
Tim Sheerman-Chase 8380af5f24 Improve retry logic 9 years ago
PiRSquared17 fadd7134f7 What I meant to do, ugh 9 years ago
PiRSquared17 1b2e83aa8c Fix minor error with normpath call 9 years ago
PiRSquared17 5db9a1c7f3 Normalize path/foo/ to path/foo, so -2, etc. work (fixes #244) 9 years ago
Federico Leva 2b78bfb795 Merge branch '2015/iterators' of git://github.com/nemobis/wikiteam into nemobis-2015/iterators
Conflicts:
	requirements.txt
9 years ago
Federico Leva d4fd745498 Actually allow resuming huge or broken XML dumps
* Log "XML export on this wiki is broken, quitting." to the error
  file so that grepping reveals which dumps were interrupted so.
* Automatically reduce export size for a page when downloading the
  entire history at once results in a MemoryError.
* Truncate the file with a pythonic method (.seek and .truncate)
  while reading from the end, by making reverse_readline() a weird
  hybrid to avoid an actual coroutine.
9 years ago
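A rough, simplified sketch of the truncation idea from the last bullet: look for the last complete </page> near the end of the file and cut right after it, so a resumed dump can append cleanly. Unlike the reverse_readline() hybrid mentioned above, this version just reads a bounded tail into memory:

    def truncate_after_last_page(path, tail_bytes=1 << 20):
        with open(path, 'r+b') as f:
            f.seek(0, 2)                         # jump to the end of the file
            size = f.tell()
            start = max(0, size - tail_bytes)
            f.seek(start)
            tail = f.read()
            cut = tail.rfind(b'</page>')
            if cut == -1:
                return False                     # no complete page in the tail
            f.seek(start + cut + len(b'</page>'))
            f.truncate()                         # drop the trailing partial page
            return True
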
Federico Leva 9168a66a54 logerror() wants unicode, but readTitles etc. give bytes
Fixes #239.
9 years ago