Commit Graph

921 Commits (f7466850c9e835b3ab753eea1e1f99cee77a085d)

Author SHA1 Message Date
Federico Leva f7466850c9 List of wikis to archive, from not-archived.py 6 years ago
Federico Leva d07a14cbce New version of uploader.py with support for a separate directory
Also much faster than using os.walk, which lists all the images
in all wikidump directories.
6 years ago
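
The speed-up described in d07a14cbce comes down to listing one directory instead of walking all of them. A minimal sketch of that idea (the helper name is hypothetical, not the actual uploader.py code):

import os

def list_dump_files(directory):
    # List only the requested dump directory, instead of walking
    # every wikidump directory the way os.walk() does.
    return [name for name in os.listdir(directory)
            if os.path.isfile(os.path.join(directory, name))]
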
Federico Leva 03ba77e2f5 Build XML from the pages module when allrevisions not available 6 years ago
Federico Leva 06ad1a9fe3 Update --xmlrevisions help 6 years ago
Federico Leva 7143f7efb1 Actually export all revisions in --xmlrevisions: build XML manually! 6 years ago
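
Commits 03ba77e2f5 and 7143f7efb1 make --xmlrevisions assemble the export XML by hand from API results when allrevisions is unavailable. A rough sketch of the technique, assuming revisions arrive as API dicts with the stock revid/timestamp/'*' fields (the element layout is illustrative, not the exact committed code):

from xml.sax.saxutils import escape

def make_page_xml(title, revisions):
    # Build a <page> element mimicking the MediaWiki export
    # format from revision dicts returned by the API.
    parts = ['  <page>\n    <title>%s</title>\n' % escape(title)]
    for rev in revisions:
        parts.append(
            '    <revision>\n'
            '      <id>%d</id>\n'
            '      <timestamp>%s</timestamp>\n'
            '      <text xml:space="preserve">%s</text>\n'
            '    </revision>\n'
            % (rev['revid'], escape(rev['timestamp']),
               escape(rev.get('*', '')))
        )
    parts.append('  </page>\n')
    return ''.join(parts)
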
Federico Leva 50c6786f84 Move launcher.py where its imports assume it is
No reason to force users to move it to actually use it.
6 years ago
Federico Leva 1ff5af7d44 Catch unexpected API errors in getPageTitlesAPI
Apparently the initial JSON test is not enough: the JSON can be broken
or unexpected at other points as well.
Fall back to the old scraper in such a case.

Fixes https://github.com/WikiTeam/wikiteam/issues/295 , perhaps.

If the scraper doesn't work for the wiki, the dump will fail entirely,
even though the list of titles may have been almost complete. A
different solution may be in order.
6 years ago
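
The fallback in 1ff5af7d44 has a simple shape: any exception from the API path routes the call to the scraper. A hedged sketch (getPageTitlesAPI is named in the commit title; getPageTitlesScraper and the wrapper itself are assumptions for illustration):

def get_page_titles_safe(config, session):
    # Any broken or unexpected API response falls back to the
    # old HTML scraper instead of crashing the whole dump.
    try:
        return getPageTitlesAPI(config=config, session=session)
    except Exception as error:
        print('API returned an unexpected response (%s); '
              'falling back to the scraper' % error)
        return getPageTitlesScraper(config=config, session=session)
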
Federico Leva 59c4c5430e Catch missing titles file and JSON response
Traceback (most recent call last):
  File "dumpgenerator.py", line 2214, in <module>
    print 'Trying to use path "%s"...' % (config['path'])
  File "dumpgenerator.py", line 2210, in main
    elif reply.lower() in ['no', 'n']:
  File "dumpgenerator.py", line 1977, in saveSiteInfo

  File "dumpgenerator.py", line 1711, in getJSON
    return False
  File "/usr/lib/python2.7/site-packages/requests/models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib64/python2.7/site-packages/simplejson/__init__.py", line 516, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 374, in decode
    obj, end = self.raw_decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 404, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Or if, for instance, the directory was renamed compared to the saved config:

Resuming previous dump process...
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2238, in <module>
    main()
  File "./dumpgenerator.py", line 2228, in main
    resumePreviousDump(config=config, other=other)
  File "./dumpgenerator.py", line 1829, in resumePreviousDump
    if lasttitle == '--END--':
UnboundLocalError: local variable 'lasttitle' referenced before assignment
6 years ago
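
Both tracebacks in 59c4c5430e point at the same class of fix: guard the JSON parse, and make sure the variable exists before it is read. A sketch, assuming the simplejson module from the traceback (the wrapper name is illustrative):

import simplejson

def get_json_or_none(response):
    # A non-JSON reply (e.g. an HTML error page) raises
    # JSONDecodeError; return None instead of crashing.
    try:
        return response.json()
    except simplejson.scanner.JSONDecodeError:
        return None

# And in resumePreviousDump-style code: initialise the variable
# before the loop, so an empty or missing titles file can no
# longer trigger UnboundLocalError.
lasttitle = ''
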
Federico Leva b307de6cb7 Make --xmlrevisions work on Wikia
* Do not try exportnowrap first: it returns a blank page.
* Add an allpages option, which simply uses readTitles but cannot resume.

FIXME: this only exports the current revision!
6 years ago
Federico Leva 680145e6a5 Fallback for --xmlrevisions on a MediaWiki 1.12 wiki 6 years ago
Federico Leva 27cbdfd302 Circumvent API exception when trying to use index.php
$ python dumpgenerator.py --xml --index=http://meritbadge.org/wiki/index.php
fails on at least one MediaWiki 1.12 wiki:

Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
Traceback (most recent call last):
  File "dumpgenerator.py", line 2211, in <module>
    main()
  File "dumpgenerator.py", line 2203, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 1766, in createNewDump
    getPageTitles(config=config, session=other['session'])
  File "dumpgenerator.py", line 400, in getPageTitles
    test = getJSON(r)
  File "dumpgenerator.py", line 1708, in getJSON
    return request.json()
  File "/usr/lib/python2.7/site-packages/requests/models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib64/python2.7/site-packages/simplejson/__init__.py", line 516, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 374, in decode
    obj, end = self.raw_decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 404, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
6 years ago
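
The crash in 27cbdfd302 happens while merely probing whether api.php speaks JSON; the circumvention is to treat a parse failure during the probe as "no usable API" and carry on with index.php. A hedged sketch (the function name is illustrative):

def api_looks_usable(response):
    # Probe reply from api.php: an old MediaWiki (e.g. 1.12) may
    # serve HTML here, so a JSON parse failure means "use
    # index.php instead", not a fatal error.
    try:
        response.json()
        return True
    except ValueError:  # simplejson's JSONDecodeError subclasses it
        return False
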
Federico Leva d4f0869ecc Consistently use POST params instead of data
Also match URLs which end in ".php$" in domain2prefix().
6 years ago
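
For the domain2prefix() half of d4f0869ecc, the goal is that a URL ending in a PHP entry point still yields a clean prefix. A sketch of such a normalisation (the regexes are illustrative, not the committed ones):

import re

def domain2prefix(url):
    # Derive a filesystem-friendly prefix from a wiki URL,
    # stripping the scheme and a trailing api.php/index.php.
    url = re.sub(r'^https?://', '', url)
    url = re.sub(r'/(api|index)\.php$', '', url)
    return re.sub(r'[^A-Za-z0-9]', '_', url)
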
Federico Leva 754027de42 xmlrevisions: actually allow index to be undefined, don't POST data
* http://biografias.bcn.cl/api.php does not like the data to be POSTed.
  Just use URL parameters. Some wikis had anti-spam protections which
  made us POST everything, but for most wikis this should be fine.
* If the index is not defined, don't fail.
* Use only the base api.php URL, not parameters, in domain2prefix.

https://github.com/WikiTeam/wikiteam/issues/314
6 years ago
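
In requests terms, the change in 754027de42 is moving the parameters from the POST body to the query string. A minimal before/after sketch (the siteinfo query is just a representative API call):

import requests

params = {'action': 'query', 'meta': 'siteinfo', 'format': 'json'}

# Before: some wikis reject POSTed data outright.
# r = requests.post('http://biografias.bcn.cl/api.php', data=params)

# After: plain URL parameters, fine for most wikis.
r = requests.get('http://biografias.bcn.cl/api.php', params=params)
print(r.json()['query']['general']['sitename'])
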
Emilio 3a56037279 Merge pull request #310 from nemobis/master
Update Wikia list with wikia.py
6 years ago
emijrp 811a325756 update 6 years ago
emijrp aec3a14b7b update spider results (incomplete, still running); use wikispacesXY lists instead 6 years ago
emijrp 98386e0b4c encoding, wikilicense 6 years ago
emijrp 3ee22b27d0 encoding attempt 6 years ago
emijrp 51ebefa1c4 100,000 wikispaces 6 years ago
emijrp 9fb8d4be0e file check 6 years ago
emijrp 8c30b3a2b9 bug invalid content, redownload 6 years ago
emijrp 7280c89b3b duckduckgo spider 6 years ago
emijrp 83158d4506 70k wikis by spider 6 years ago
emijrp 4b483c695b Merge branch 'master' of https://github.com/WikiTeam/wikiteam 6 years ago
emijrp 60704e3303 searching wikis with duckduckgo 6 years ago
Federico Leva 7c545d05b7 Fix UnboundLocalError and catch RetryError with --xmlrevisions
File "./dumpgenerator.py", line 1212, in generateImageDump
    if not re.search(r'</mediawiki>', xmlfiledesc):

UnboundLocalError: local variable 'xmlfiledesc' referenced before assignment
6 years ago
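
Both bugs named in 7c545d05b7 take one-line fixes of the same flavour: assign xmlfiledesc before the branch that may skip it, and catch requests' RetryError around the fetch. A sketch with a hypothetical session and URL:

import requests

session = requests.Session()
url = 'https://example.org/wiki/api.php'  # hypothetical endpoint

xmlfiledesc = ''  # defined up front, so a later check like
                  # re.search(r'</mediawiki>', xmlfiledesc) can
                  # never hit UnboundLocalError
try:
    xmlfiledesc = session.get(url, timeout=30).text
except requests.exceptions.RetryError:
    # Too many retries on a flaky wiki: keep the empty value
    # instead of aborting the whole image dump.
    pass
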
Federico Leva 952fcc6bcf Up version to 0.4.0-alpha to signify disruption 6 years ago
Federico Leva 33bb1c1f23 Download image description from API when using --xmlrevisions
Fixes https://github.com/WikiTeam/wikiteam/issues/308

Also add --failfast option to sneak in all the hacks I use to run
the bulk downloads, so I can more easily sync the repos.
6 years ago
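
Fetching an image description via the API, as 33bb1c1f23 does for --xmlrevisions, amounts to asking for the wikitext of the File: page. A hedged sketch using stock MediaWiki API parameters (the function and endpoint are illustrative):

import requests

def get_image_description(api_url, filename):
    # Ask the API for the File: page wikitext instead of
    # scraping index.php for it.
    r = requests.get(api_url, params={
        'action': 'query',
        'titles': 'File:' + filename,
        'prop': 'revisions',
        'rvprop': 'content',
        'format': 'json',
    })
    pages = r.json()['query']['pages']
    page = next(iter(pages.values()))
    revisions = page.get('revisions')
    return revisions[0]['*'] if revisions else ''
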
Federico Leva b8909baa3d Update Wikia list with wikia.py 6 years ago
Federico Leva be5ca12075 Avoid generators in API-only export 6 years ago
Fedora ebc02a3b45 Merge branch 'master' of https://github.com/WikiTeam/wikiteam 6 years ago
Fedora a8cbb357ff First attempt of API-only export 6 years ago
Fedora 142b48cc69 Add timeouts and retries to increase success rate 6 years ago
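
The timeouts-and-retries change in 142b48cc69 maps naturally onto a requests Session with an urllib3 Retry policy; the counts below are illustrative, not the committed values:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=5, backoff_factor=1,
                status_forcelist=[500, 502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

# Each request now retries transient failures and gives up after
# 30 seconds instead of hanging forever.
r = session.get('https://example.org/wiki/api.php', timeout=30)
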
emijrp 60a0ba2e54 sleep 6 years ago
emijrp 061709d9e6 50,000 wikis, do not use this list, use wikispacesXY instead 6 years ago
emijrp 5002eb723a print 6 years ago
emijrp 30a6dc268b wikispaces lists 6 years ago
emijrp e01b2fb0c3 bug wikitext 6 years ago
emijrp ffff6cf568 sleep 6 years ago
emijrp 24ba4ae0ca originalurl metadata 6 years ago
emijrp cd90d30aaa ia checking 6 years ago
emijrp af680ced4a help, params 6 years ago
emijrp 2fe1c0b6b2 uploader included 6 years ago
emijrp 254486af06 param 6 years ago
emijrp 9ab9c64df2 bug in redirects; script accepts wikilist.txt now 6 years ago
emijrp 0574b5f33a second version, it downloads all, including sitemap and mainpage 6 years ago
emijrp 145b040784 update, 10000 wikis, still more arriving 6 years ago
emijrp 0b2dd6f8f8 Merge branch 'master' of https://github.com/WikiTeam/wikiteam 6 years ago
emijrp 557323d85e index path 6 years ago
emijrp cfb225ea5e first version, wikispaces downloader 6 years ago