Commit Graph

310 Commits (845c05de1e3c79fefc5810298d4e4f0196bd8105)

Author SHA1 Message Date
Federico Leva 845c05de1e Go back to data POSTing in checkIndex() and checkAPI() to handle redirects
Some redirects from HTTP to HTTPS otherwise end up giving 400, like
http://nimiarkisto.fi/
6 years ago
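
A minimal sketch of the POST-vs-query-string distinction this commit is about, using requests (the checkIndex() internals here are hypothetical):

    import requests

    session = requests.Session()

    def check_index(url, params):
        # Send the parameters as POST form data (data=), as this commit
        # restores, rather than in the query string (params=). requests
        # follows the HTTP -> HTTPS redirect by default; per the commit,
        # some servers answered 400 without the data POSTing.
        r = session.post(url, data=params, timeout=30)
        return r.status_code == 200
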
Federico Leva de752bb6a2 Also add contentmodel to the XML of --xmlrevisions 6 years ago
Federico Leva 03ba77e2f5 Build XML from the pages module when allrevisions not available 6 years ago
Federico Leva 06ad1a9fe3 Update --xmlrevisions help 6 years ago
Federico Leva 7143f7efb1 Actually export all revisions in --xmlrevisions: build XML manually! 6 years ago
Federico Leva 1ff5af7d44 Catch unexpected API errors in getPageTitlesAPI
Apparently the initial JSON test is not enough; the JSON can be broken
or unexpected in other ways.
Fall back to the old scraper in such cases.

Fixes https://github.com/WikiTeam/wikiteam/issues/295, perhaps.

If the scraper doesn't work for the wiki, the dump will fail entirely,
even if the list of titles was almost complete. A different solution
may be in order.
6 years ago
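
The fallback described above, as a rough sketch (the helper names mirror dumpgenerator.py's getPageTitlesAPI and getPageTitlesScraper, which are assumed to exist; the wiring is simplified):

    def get_page_titles(config, session):
        # Any error while reading the API's JSON, not just a failed
        # initial test, now sends us to the old HTML scraper.
        try:
            return get_page_titles_api(config, session)
        except Exception as error:  # broken or unexpected JSON, etc.
            print('API failed (%s); falling back to the scraper' % error)
            return get_page_titles_scraper(config, session)
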
Federico Leva 59c4c5430e Catch missing titles file and JSON response
Traceback (most recent call last):
  File "dumpgenerator.py", line 2214, in <module>
    print 'Trying to use path "%s"...' % (config['path'])
  File "dumpgenerator.py", line 2210, in main
    elif reply.lower() in ['no', 'n']:
  File "dumpgenerator.py", line 1977, in saveSiteInfo

  File "dumpgenerator.py", line 1711, in getJSON
    return False
  File "/usr/lib/python2.7/site-packages/requests/models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib64/python2.7/site-packages/simplejson/__init__.py", line 516, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 374, in decode
    obj, end = self.raw_decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 404, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Or if, for instance, the directory was renamed compared to the saved config:

Resuming previous dump process...
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2238, in <module>
    main()
  File "./dumpgenerator.py", line 2228, in main
    resumePreviousDump(config=config, other=other)
  File "./dumpgenerator.py", line 1829, in resumePreviousDump
    if lasttitle == '--END--':
UnboundLocalError: local variable 'lasttitle' referenced before assignment
6 years ago
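
A sketch of the guard for the second traceback (file name and layout hypothetical): give lasttitle a default before reading the titles file, so a missing or renamed file no longer leaves it unassigned.

    def resume_previous_dump(path):
        lasttitle = ''
        try:
            with open(path + '/titles.txt') as titles:  # hypothetical name
                for line in titles:
                    lasttitle = line.strip()
        except IOError:
            pass  # no titles file yet: restart the title listing
        if lasttitle == '--END--':
            print('Title list is complete; not redownloading it')
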
Federico Leva b307de6cb7 Make --xmlrevisions work on Wikia
* Do not try exportnowrap first: it returns a blank page.
* Add an allpages option, which simply uses readTitles but cannot resume.

FIXME: this only exports the current revision!
6 years ago
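
A hedged sketch of the Wikia workaround: request the export through the JSON API instead of exportnowrap, which returned a blank page (api_url and title are placeholders).

    import requests

    session = requests.Session()
    api_url = 'https://example.fandom.com/api.php'  # placeholder
    title = 'Example'                               # placeholder

    params = {
        'action': 'query',
        'titles': title,
        'export': 1,
        'format': 'json',
    }
    r = session.get(api_url, params=params, timeout=30)
    # Older MediaWiki wraps the Special:Export XML inside the JSON reply;
    # this carries only the current revision, hence the FIXME above.
    xml = r.json()['query']['export']['*']
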
Federico Leva 680145e6a5 Fallback for --xmlrevisions on a MediaWiki 1.12 wiki 6 years ago
Federico Leva 27cbdfd302 Circumvent API exception when trying to use index.php
$ python dumpgenerator.py --xml --index=http://meritbadge.org/wiki/index.php
fails on at least one MediaWiki 1.12 wiki:

Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
Traceback (most recent call last):
  File "dumpgenerator.py", line 2211, in <module>
    main()
  File "dumpgenerator.py", line 2203, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 1766, in createNewDump
    getPageTitles(config=config, session=other['session'])
  File "dumpgenerator.py", line 400, in getPageTitles
    test = getJSON(r)
  File "dumpgenerator.py", line 1708, in getJSON
    return request.json()
  File "/usr/lib/python2.7/site-packages/requests/models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib64/python2.7/site-packages/simplejson/__init__.py", line 516, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 374, in decode
    obj, end = self.raw_decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 404, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
6 years ago
Federico Leva d4f0869ecc Consistently use POST params instead of data
Also match URLs that end in ".php" in domain2prefix().
6 years ago
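
A simplified, hypothetical version of domain2prefix() showing where the ".php$" match fits:

    import re

    def domain2prefix(url):
        # Derive a filename prefix from the api.php/index.php URL;
        # the commit adds handling for URLs that end in ".php".
        domain = re.sub(r'^https?://', '', url)
        domain = re.sub(r'/(api|index)\.php$', '', domain)
        return re.sub(r'[^A-Za-z0-9]', '', domain.lower())
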
Federico Leva 754027de42 xmlrevisions: actually allow index to be undefined, don't POST data
* http://biografias.bcn.cl/api.php does not like the data to be POSTed.
  Just use URL parameters. Some wikis had anti-spam protections which
  made us POST everything, but for most wikis this should be fine.
* If the index is not defined, don't fail.
* Use only the base api.php URL, not parameters, in domain2prefix.

https://github.com/WikiTeam/wikiteam/issues/314
6 years ago
Federico Leva 7c545d05b7 Fix UnboundLocalError and catch RetryError with --xmlrevisions
File "./dumpgenerator.py", line 1212, in generateImageDump
    if not re.search(r'</mediawiki>', xmlfiledesc):

UnboundLocalError: local variable 'xmlfiledesc' referenced before assignment
6 years ago
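
A sketch of the fix (function name and parameters hypothetical): default xmlfiledesc and catch requests' RetryError, so a failed download no longer leaves the variable unassigned.

    import re
    import requests

    def get_image_description(session, url, params):
        xmlfiledesc = ''
        try:
            xmlfiledesc = session.get(url, params=params, timeout=30).text
        except requests.exceptions.RetryError:
            pass
        if not re.search(r'</mediawiki>', xmlfiledesc):
            xmlfiledesc = ''  # truncated or failed export
        return xmlfiledesc
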
Federico Leva 952fcc6bcf Up version to 0.4.0-alpha to signify disruption 6 years ago
Federico Leva 33bb1c1f23 Download image description from API when using --xmlrevisions
Fixes https://github.com/WikiTeam/wikiteam/issues/308

Also add --failfast option to sneak in all the hacks I use to run
the bulk downloads, so I can more easily sync the repos.
6 years ago
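
A sketch of fetching a file's description page through the API for --xmlrevisions (api_url and the title are placeholders; the response layout is the old formatversion=1 one):

    import requests

    session = requests.Session()
    api_url = 'https://example.org/w/api.php'  # placeholder

    params = {
        'action': 'query',
        'titles': 'File:Example.png',  # placeholder
        'prop': 'revisions',
        'rvprop': 'content',
        'format': 'json',
    }
    data = session.get(api_url, params=params, timeout=30).json()
    for page in data['query']['pages'].values():
        description = page['revisions'][0]['*']  # file page wikitext
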
Federico Leva be5ca12075 Avoid generators in API-only export 6 years ago
Fedora a8cbb357ff First attempt of API-only export 6 years ago
Fedora 142b48cc69 Add timeouts and retries to increase success rate 6 years ago
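
The usual requests pattern for this (retry counts and timeout here are illustrative, not necessarily the commit's values):

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    session = requests.Session()
    retries = Retry(total=5, backoff_factor=2,
                    status_forcelist=[429, 500, 502, 503, 504])
    session.mount('http://', HTTPAdapter(max_retries=retries))
    session.mount('https://', HTTPAdapter(max_retries=retries))
    r = session.get('https://example.org/w/api.php',
                    params={'action': 'query', 'meta': 'siteinfo',
                            'format': 'json'},
                    timeout=30)
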
Hydriz Scholz e8cb1a7a5f Add missing newline 8 years ago
Hydriz Scholz daa39616e2 Make the copyright year automatically update itself 8 years ago
emijrp 2e99d869e2 fixing Wikia images bug, issue #212 8 years ago
emijrp 3f697dbb5b restoring dumpgenerator.py code to last stable version f43b7389a0. I will rewrite the code in the wikiteam/ subdirectory 8 years ago
emijrp 4e939f4e98 getting ready for other wiki engines, function prefixes: MediaWiki (mw), Wikispaces (ws) 8 years ago
nemobis f43b7389a0 Merge pull request #270 from nemobis/master
When we have a working index.php, do not require api.php
8 years ago
Federico Leva 480d421d7b When we have a working index.php, do not require api.php
We work well without api.php; requiring it was a needless suicide.
Sometimes sysadmins like to disable the API for no reason, and then
index.php is our only option to archive the wiki.
8 years ago
emijrp 15223eb75b Parsing more image names from HTML Special:Allimages 9 years ago
emijrp e138d6ce52 New API params to continue in Allimages 9 years ago
emijrp 2c0f54d73b new HTML regexp for Special:Allpages 9 years ago
emijrp 4ef665b53c In recent MediaWiki versions, API continue is a bit different 9 years ago
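
The two continuation styles side by side, as a sketch (api_url is a placeholder): MediaWiki 1.21+ returns a top-level continue object, older releases query-continue.

    import requests

    session = requests.Session()
    api_url = 'https://example.org/w/api.php'  # placeholder

    params = {'action': 'query', 'list': 'allpages',
              'aplimit': 500, 'format': 'json'}
    while True:
        data = session.get(api_url, params=params, timeout=30).json()
        for page in data['query']['allpages']:
            print(page['title'])
        if 'continue' in data:            # new style (MW 1.21+)
            params.update(data['continue'])
        elif 'query-continue' in data:    # old style
            params.update(data['query-continue']['allpages'])
        else:
            break
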
Daniel Oaks 376e8a11a3 Avoid out-of-memory error in two extra places 9 years ago
Tim Sheerman-Chase 877b736cd2 Merge branch 'retry' of https://github.com/TimSC/wikiteam into retry 9 years ago
Tim Sheerman-Chase 6716ceab32 Fix tests 9 years ago
Tim Sheerman-Chase 5cb2ecb6b5 Attempting to fix missing config in tests 9 years ago
Tim 93bc29f2d7 Fix syntax errors 9 years ago
Tim d5a1ed2d5a Fix indentation, use classic string formatting 9 years ago
Tim Sheerman-Chase 8380af5f24 Improve retry logic 9 years ago
PiRSquared17 fadd7134f7 What I meant to do, ugh 9 years ago
PiRSquared17 1b2e83aa8c Fix minor error with normpath call 9 years ago
PiRSquared17 5db9a1c7f3 Normalize path/foo/ to path/foo, so -2, etc. work (fixes #244) 9 years ago
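
The whole fix fits in one call; a sketch:

    import os.path

    def normalize_path(path):
        # 'path/foo/' becomes 'path/foo', so suffixes such as '-2'
        # land on the directory name rather than after the slash.
        return os.path.normpath(path)
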
Federico Leva 2b78bfb795 Merge branch '2015/iterators' of git://github.com/nemobis/wikiteam into nemobis-2015/iterators
Conflicts:
	requirements.txt
9 years ago
Federico Leva d4fd745498 Actually allow resuming huge or broken XML dumps
* Log "XML export on this wiki is broken, quitting." to the error
  file so that grepping reveals which dumps were interrupted so.
* Automatically reduce export size for a page when downloading the
  entire history at once results in a MemoryError.
* Truncate the file with a pythonic method (.seek and .truncate)
  while reading from the end, by making reverse_readline() a weird
  hybrid to avoid an actual coroutine.
9 years ago
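
A hedged sketch of the .seek()/.truncate() idea (the real code walks the file with reverse_readline(); this version just scans the tail):

    def truncate_xml_dump(filename):
        # Cut the dump right after the last complete </page>, so the
        # XML download can resume from a known-good point.
        with open(filename, 'r+b') as xmlfile:
            xmlfile.seek(0, 2)                # jump to end of file
            size = xmlfile.tell()
            chunk = min(size, 1 << 20)        # inspect the last MiB
            xmlfile.seek(size - chunk)
            tail = xmlfile.read(chunk)
            cut = tail.rfind(b'</page>')
            if cut != -1:
                xmlfile.truncate(size - chunk + cut + len(b'</page>'))
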
Federico Leva 9168a66a54 logerror() wants unicode, but readTitles etc. give bytes
Fixes #239.
9 years ago
Federico Leva 632b99ea53 Merge branch '2015/iterators' of https://github.com/nemobis/wikiteam into nemobis-2015/iterators 9 years ago
nemobis ff2cdfa1cd Merge pull request #236 from PiRSquared17/fix-server-check-api
Catch KeyError to fix server check
9 years ago
nemobis 0b25951ab1 Merge pull request #224 from nemobis/2015/issue26
Issue #26: Local "Special" namespace, actually limit replies
9 years ago
PiRSquared17 03db166718 Catch KeyError to fix server check 9 years ago
PiRSquared17 f80ad39df0 Make filename truncation work with UTF-8 9 years ago
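
A sketch of UTF-8-safe truncation (the 100-byte limit is illustrative): slice the encoded bytes, then drop any broken trailing sequence on decode.

    def truncate_filename(name, limit=100):
        return name.encode('utf-8')[:limit].decode('utf-8', 'ignore')
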
PiRSquared17 90bfd1400e Merge pull request #229 from PiRSquared17/fix-zwnbsp-bom
Strip ZWNBSP (U+FEFF) Byte-Order Mark from JSON/XML
9 years ago
PiRSquared17 fc276d525f Allow spaces before <mediawiki> tag. 9 years ago
PiRSquared17 1c820dafb7 Strip ZWNBSP (U+FEFF) Byte-Order Mark from JSON/XML 9 years ago
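
A sketch covering this commit and the "spaces before <mediawiki>" one above: strip a leading U+FEFF byte-order mark, then whitespace, before testing the reply.

    def clean_xml_reply(text):
        # u'\ufeff  <mediawiki ...' becomes '<mediawiki ...'
        return text.lstrip(u'\ufeff').lstrip()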