1133 Commits (master)

Author SHA1 Message Date
Federico Leva bad49d7916 Also default to regenerating dump in --failfast 6 years ago
Federico Leva c5b71f60ad Also default to regenerating dump in --failfast 6 years ago
Federico Leva bbcafdf869 Support Unicode usernames etc. in makeXmlFromPage()
Test case:

Titles saved at... 39fanficwikiacom-20180521-titles.txt
377 page titles loaded
http://39fanfic.wikia.com/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
30 namespaces found
Exporting revisions from namespace 0
Warning. Could not use allrevisions, wiki too old.
1 more revisions exported
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2291, in <module>
    main()
  File "./dumpgenerator.py", line 2283, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1849, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 732, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session):
  File "./dumpgenerator.py", line 861, in getXMLRevisions
    yield makeXmlFromPage(pages[page])
  File "./dumpgenerator.py", line 880, in makeXmlFromPage
    E.username(str(rev['user'])),
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-3: ordinal not in range(128)
6 years ago
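The traceback above is from Python 2, where calling `str()` on a Unicode string implicitly encodes it as ASCII. As a rough illustration only (not the change made in this commit), the sketch below reproduces the same class of failure in Python 3 with an explicit ASCII encode and contrasts it with UTF-8. The `username` value and both helper names are hypothetical.

```python
# Hypothetical username containing non-ASCII characters.
username = "Jos\u00e9\u00f1\u00fc"


def ascii_encode_fails(text):
    """Return True if encoding `text` as ASCII raises UnicodeEncodeError."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False


def to_utf8_bytes(text):
    """Encode text as UTF-8, which can represent any Unicode string."""
    return text.encode("utf-8")


print(ascii_encode_fails(username))
print(to_utf8_bytes(username).decode("utf-8") == username)
```

Keeping text as Unicode internally and encoding to UTF-8 only at the output boundary is one common way to avoid this kind of error.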
Federico Leva 3df2513e67 Merge branch 'master' of github.com:WikiTeam/wikiteam 6 years ago
Federico Leva 69ec7e5015 Use os.listdir() and avoid os.walk() in launcher too
With millions of files, everything stalls otherwise.
6 years ago
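To illustrate why the switch matters: `os.walk()` descends into every subdirectory and enumerates every file, whereas `os.listdir()` reads a single directory level. The sketch below assumes a made-up layout of dump directories with an `images` folder each; the helper names are hypothetical and are not taken from launcher.py.

```python
import os
import tempfile


def top_level_dirs(path):
    """List immediate subdirectories of `path` without descending into them."""
    return sorted(
        name for name in os.listdir(path)
        if os.path.isdir(os.path.join(path, name))
    )


def count_walked_files(path):
    """Count every file under `path`; os.walk visits each subdirectory."""
    return sum(len(files) for _root, _dirs, files in os.walk(path))


# Build a small hypothetical layout: two dump directories with images inside.
base = tempfile.mkdtemp()
for dump in ("wikia-dump", "other-dump"):
    images = os.path.join(base, dump, "images")
    os.makedirs(images)
    for i in range(3):
        open(os.path.join(images, "file%d.png" % i), "w").close()

print(top_level_dirs(base))
print(count_walked_files(base))
```

With millions of image files, the work done by the second function grows with the total file count, while the first only grows with the number of entries in the top-level directory.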
emijrp a82a98a40a . 6 years ago
emijrp 9352bc9af5 comment 6 years ago
emijrp 3b0d4fef5e utf8 latin1 6 years ago
Federico Leva 4351e09d80 uploader.py: respect --admin in collection 6 years ago
Federico Leva 320f231d57 Handle status code > 400 in checkAPI()
Fixes https://github.com/WikiTeam/wikiteam/issues/315
6 years ago
Federico Leva 845c05de1e Go back to data POSTing in checkIndex() and checkAPI() to handle redirects
Some redirects from HTTP to HTTPS otherwise end up giving 400, like
http://nimiarkisto.fi/
6 years ago
Federico Leva de752bb6a2 Also add contentmodel to the XML of --xmlrevisions 6 years ago
Federico Leva f7466850c9 List of wikis to archive, from not-archived.py 6 years ago
Federico Leva d07a14cbce New version of uploader.py with possibility of separate directory
Also much faster than using os.walk, which lists all the images
in all wikidump directories.
6 years ago
Federico Leva 03ba77e2f5 Build XML from the pages module when allrevisions not available 6 years ago
Federico Leva 06ad1a9fe3 Update --xmlrevisions help 6 years ago
Federico Leva 7143f7efb1 Actually export all revisions in --xmlrevisions: build XML manually! 6 years ago
Federico Leva 50c6786f84 Move launcher.py where its imports assume it is
No reason to force users to move it to actually use it.
6 years ago
Federico Leva 1ff5af7d44 Catch unexpected API errors in getPageTitlesAPI
Apparently the initial JSON test is not enough, the JSON can be broken
or unexpected in other ways/points.
Fallback to the old scraper in such a case.

Fixes https://github.com/WikiTeam/wikiteam/issues/295 , perhaps.

If the scraper doesn't work for the wiki, the dump will fail entirely,
even if maybe the list of titles was almost complete. A different
solution may be in order.
6 years ago
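The fallback described in this commit message could be sketched roughly as follows. This is an assumption-laden illustration, not the code in getPageTitlesAPI: it assumes an `allpages`-style JSON body, and `titles_from_api_response`, `get_titles`, and the `scraper` callable are hypothetical names.

```python
import json


def titles_from_api_response(body):
    """Extract page titles from an allpages-style JSON body.

    Raises ValueError, KeyError, or TypeError if the body is not valid
    JSON or does not have the assumed shape.
    """
    data = json.loads(body)
    return [page["title"] for page in data["query"]["allpages"]]


def get_titles(body, scraper):
    """Use the API response if it parses; otherwise fall back to `scraper`."""
    try:
        return titles_from_api_response(body)
    except (ValueError, KeyError, TypeError):
        # The JSON was broken or unexpected: fall back to the old scraper.
        return scraper()


good = '{"query": {"allpages": [{"title": "Main Page"}]}}'
print(get_titles(good, lambda: ["scraped"]))
print(get_titles("<html>not json</html>", lambda: ["scraped"]))
```

As the message notes, this design discards any titles already collected from the API when the fallback is triggered, so a partial-result strategy may be preferable.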
Federico Leva 59c4c5430e Catch missing titles file and JSON response
Traceback (most recent call last):
  File "dumpgenerator.py", line 2214, in <module>
    print 'Trying to use path "%s"...' % (config['path'])
  File "dumpgenerator.py", line 2210, in main
    elif reply.lower() in ['no', 'n']:
  File "dumpgenerator.py", line 1977, in saveSiteInfo

  File "dumpgenerator.py", line 1711, in getJSON
    return False
  File "/usr/lib/python2.7/site-packages/requests/models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib64/python2.7/site-packages/simplejson/__init__.py", line 516, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 374, in decode
    obj, end = self.raw_decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 404, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Or if for instance the directory was renamed compared to the saved config:

Resuming previous dump process...
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2238, in <module>
    main()
  File "./dumpgenerator.py", line 2228, in main
    resumePreviousDump(config=config, other=other)
  File "./dumpgenerator.py", line 1829, in resumePreviousDump
    if lasttitle == '--END--':
UnboundLocalError: local variable 'lasttitle' referenced before assignment
6 years ago
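The `UnboundLocalError` in the second traceback is the typical result of assigning a variable only inside a `try` block that can fail. A minimal sketch of the usual guard, assuming the titles file ends with the `--END--` marker seen in the traceback; `last_title` and `titles_complete` are hypothetical helpers, not functions from dumpgenerator.py.

```python
import os
import tempfile


def last_title(path):
    """Return the last line of a titles file, or None if it cannot be read."""
    lasttitle = None  # bound up front so later checks never hit UnboundLocalError
    try:
        with open(path, encoding="utf-8") as f:
            for line in f:
                lasttitle = line.rstrip("\n")
    except OSError:
        # Missing or unreadable titles file: leave lasttitle as None.
        pass
    return lasttitle


def titles_complete(path):
    """True only if the titles file exists and ends with the --END-- marker."""
    return last_title(path) == "--END--"


# Hypothetical titles file for demonstration.
workdir = tempfile.mkdtemp()
titles_path = os.path.join(workdir, "example-titles.txt")
with open(titles_path, "w", encoding="utf-8") as f:
    f.write("Main Page\nHelp:Contents\n--END--\n")

print(titles_complete(titles_path))
print(titles_complete(os.path.join(workdir, "missing-titles.txt")))
```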
Federico Leva b307de6cb7 Make --xmlrevisions work on Wikia
* Do not try exportnowrap first: it returns a blank page.
* Add an allpages option, which simply uses readTitles but cannot resume.

FIXME: this only exports the current revision!
6 years ago
Federico Leva 680145e6a5 Fallback for --xmlrevisions on a MediaWiki 1.12 wiki 6 years ago
Federico Leva 27cbdfd302 Circumvent API exception when trying to use index.php
$ python dumpgenerator.py --xml --index=http://meritbadge.org/wiki/index.php
fails on at least one MediaWiki 1.12 wiki:

Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
Traceback (most recent call last):
  File "dumpgenerator.py", line 2211, in <module>
    main()
  File "dumpgenerator.py", line 2203, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 1766, in createNewDump
    getPageTitles(config=config, session=other['session'])
  File "dumpgenerator.py", line 400, in getPageTitles
    test = getJSON(r)
  File "dumpgenerator.py", line 1708, in getJSON
    return request.json()
  File "/usr/lib/python2.7/site-packages/requests/models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib64/python2.7/site-packages/simplejson/__init__.py", line 516, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 374, in decode
    obj, end = self.raw_decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 404, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
6 years ago
Federico Leva d4f0869ecc Consistently use POST params instead of data
Also match URLs which end in ".php$" in domain2prefix().
6 years ago
Federico Leva 754027de42 xmlrevisions: actually allow index to be undefined, don't POST data
* http://biografias.bcn.cl/api.php does not like the data to be POSTed.
  Just use URL parameters. Some wikis had anti-spam protections which
  made us POST everything, but for most wikis this should be fine.
* If the index is not defined, don't fail.
* Use only the base api.php URL, not parameters, in domain2prefix.

https://github.com/WikiTeam/wikiteam/issues/314
6 years ago
Emilio 3a56037279
Merge pull request #310 from nemobis/master
Update Wikia list with wikia.py
6 years ago
emijrp 811a325756 update 6 years ago
emijrp aec3a14b7b update spider incomplete results, still running; userwikispacesXY lists instead 6 years ago
emijrp 98386e0b4c codification, wikilicense 6 years ago
emijrp 3ee22b27d0 codification try 6 years ago
emijrp 51ebefa1c4 100,000 wikispaces 6 years ago
emijrp 9fb8d4be0e file check 6 years ago
emijrp 8c30b3a2b9 bug invalid content, redownload 6 years ago
emijrp 7280c89b3b duckduckgo spider 6 years ago
emijrp 83158d4506 70k wikis by spider 6 years ago
emijrp 4b483c695b Merge branch 'master' of https://github.com/WikiTeam/wikiteam 6 years ago
emijrp 60704e3303 searching wikis with duckduckgo 6 years ago
Federico Leva 7c545d05b7 Fix UnboundLocalError and catch RetryError with --xmlrevisions
File "./dumpgenerator.py", line 1212, in generateImageDump
    if not re.search(r'</mediawiki>', xmlfiledesc):

UnboundLocalError: local variable 'xmlfiledesc' referenced before assignment
6 years ago
Federico Leva 952fcc6bcf Up version to 0.4.0-alpha to signify disruption 6 years ago
Federico Leva 33bb1c1f23 Download image description from API when using --xmlrevisions
Fixes https://github.com/WikiTeam/wikiteam/issues/308

Also add --failfast option to sneak in all the hacks I use to run
the bulk downloads, so I can more easily sync the repos.
6 years ago
Federico Leva b8909baa3d Update Wikia list with wikia.py 6 years ago
Federico Leva be5ca12075 Avoid generators in API-only export 6 years ago
Fedora ebc02a3b45 Merge branch 'master' of https://github.com/WikiTeam/wikiteam 6 years ago
Fedora a8cbb357ff First attempt of API-only export 6 years ago
Fedora 142b48cc69 Add timeouts and retries to increase success rate 6 years ago
emijrp 60a0ba2e54 sleep 6 years ago
emijrp 061709d9e6 50,000 wikis, do not use this list, use wikispacesXY instead 6 years ago
emijrp 5002eb723a print 6 years ago
emijrp 30a6dc268b wikispaces lists 6 years ago
emijrp e01b2fb0c3 bug wikitext 6 years ago