mirror of https://github.com/WikiTeam/wikiteam synced 2024-11-10 13:10:27 +00:00
Commit Graph

926 Commits

Author SHA1 Message Date
Federico Leva
69ec7e5015 Use os.listdir() and avoid os.walk() in launcher too
With millions of files, everything stalls otherwise.
2018-05-21 07:33:03 +03:00
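For context, a minimal sketch of the difference (directory names are hypothetical): os.walk() recurses into every dump directory and its millions of image files before the launcher can proceed, while os.listdir() reads only the top-level entries.

    import os

    dumpdir = "."  # hypothetical: the directory holding all the wikidump folders

    # os.walk() would descend into every dump and touch millions of
    # image files just to find the top-level directories:
    # for dirpath, dirnames, filenames in os.walk(dumpdir): ...

    # os.listdir() reads only the one directory, so it stays fast:
    dumps = [d for d in os.listdir(dumpdir)
             if os.path.isdir(os.path.join(dumpdir, d))]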
Federico Leva
4351e09d80 uploader.py: respect --admin in collection 2018-05-20 01:48:17 +03:00
Federico Leva
320f231d57 Handle status code > 400 in checkAPI()
Fixes https://github.com/WikiTeam/wikiteam/issues/315
2018-05-20 01:41:01 +03:00
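A hedged sketch of the guard this implies; the function shape is illustrative, not the actual checkAPI() code:

    import requests

    def check_api(apiurl, session=None):
        session = session or requests.Session()
        r = session.get(apiurl, params={'action': 'query',
                                        'meta': 'siteinfo',
                                        'format': 'json'})
        # An error page is not a usable API answer; bail out before
        # trying to parse it as JSON.
        if r.status_code >= 400:
            return False
        return r.json()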
Federico Leva
845c05de1e Go back to data POSTing in checkIndex() and checkAPI() to handle redirects
Some redirects from HTTP to HTTPS otherwise end up giving 400, like
http://nimiarkisto.fi/
2018-05-20 01:20:32 +03:00
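Roughly, the change amounts to sending the query as form data again (the api.php path below is illustrative):

    import requests

    session = requests.Session()
    # POST the parameters as form data; on these wikis the
    # HTTP -> HTTPS redirect then resolves instead of ending in 400.
    r = session.post('http://nimiarkisto.fi/w/api.php',
                     data={'action': 'query', 'meta': 'siteinfo',
                           'format': 'json'})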
Federico Leva
de752bb6a2 Also add contentmodel to the XML of --xmlrevisions 2018-05-20 00:28:01 +03:00
Federico Leva
f7466850c9 List of wikis to archive, from not-archived.py 2018-05-20 00:21:05 +03:00
Federico Leva
d07a14cbce New version of uploader.py with possibility of separate directory
Also much faster than using os.walk, which lists all the images
in all wikidump directories.
2018-05-20 00:00:27 +03:00
Federico Leva
03ba77e2f5 Build XML from the pages module when allrevisions not available 2018-05-19 22:34:13 +03:00
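A sketch of what querying the pages module looks like; the parameter set is an assumption based on the MediaWiki prop=revisions API, not the exact dumpgenerator.py code:

    import requests

    def page_revisions(apiurl, title, session=None):
        # Page through all revisions of one title via prop=revisions,
        # following the API's continuation parameters.
        session = session or requests.Session()
        params = {'action': 'query', 'prop': 'revisions', 'titles': title,
                  'rvprop': 'ids|timestamp|user|comment|content',
                  'rvlimit': 50, 'format': 'json'}
        while True:
            data = session.get(apiurl, params=params).json()
            for page in data['query']['pages'].values():
                for rev in page.get('revisions', []):
                    yield rev
            if 'continue' not in data:
                break
            params.update(data['continue'])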
Federico Leva
06ad1a9fe3 Update --xmlrevisions help 2018-05-19 18:55:03 +03:00
Federico Leva
7143f7efb1 Actually export all revisions in --xmlrevisions: build XML manually! 2018-05-19 18:49:58 +03:00
Federico Leva
50c6786f84 Move launcher.py where its imports assume it is
No reason to force users to move it to actually use it.
2018-05-19 04:15:54 +03:00
Federico Leva
1ff5af7d44 Catch unexpected API errors in getPageTitlesAPI
Apparently the initial JSON test is not enough: the JSON can be
broken or unexpected at other points. Fall back to the old scraper
in such cases.

Fixes https://github.com/WikiTeam/wikiteam/issues/295 , perhaps.

If the scraper doesn't work for the wiki, the dump will fail entirely,
even if the list of titles was almost complete. A different
solution may be in order.
2018-05-19 04:11:50 +03:00
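The fallback pattern described above, sketched; getPageTitlesScraper is assumed here to be the old scraper's entry point:

    def get_page_titles(config, session):
        # Try the API first; any broken or unexpected JSON now falls
        # back to the old index.php scraper instead of crashing.
        try:
            return getPageTitlesAPI(config=config, session=session)
        except (ValueError, KeyError):
            print('API listing failed, falling back to the scraper')
            return getPageTitlesScraper(config=config, session=session)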
Federico Leva
59c4c5430e Catch missing titles file and JSON response
Traceback (most recent call last):
  File "dumpgenerator.py", line 2214, in <module>
    print 'Trying to use path "%s"...' % (config['path'])
  File "dumpgenerator.py", line 2210, in main
    elif reply.lower() in ['no', 'n']:
  File "dumpgenerator.py", line 1977, in saveSiteInfo

  File "dumpgenerator.py", line 1711, in getJSON
    return False
  File "/usr/lib/python2.7/site-packages/requests/models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib64/python2.7/site-packages/simplejson/__init__.py", line 516, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 374, in decode
    obj, end = self.raw_decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 404, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Or if, for instance, the directory was renamed compared to the saved config:

Resuming previous dump process...
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2238, in <module>
    main()
  File "./dumpgenerator.py", line 2228, in main
    resumePreviousDump(config=config, other=other)
  File "./dumpgenerator.py", line 1829, in resumePreviousDump
    if lasttitle == '--END--':
UnboundLocalError: local variable 'lasttitle' referenced before assignment
2018-05-19 03:31:49 +03:00
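The first traceback reduces to guarding response.json(); a minimal sketch (simplejson's JSONDecodeError is a ValueError subclass, so catching ValueError covers both JSON backends):

    def get_json(response):
        # An HTML error page or empty body raises on .json();
        # report failure instead of crashing the dump.
        try:
            return response.json()
        except ValueError:
            return False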
Federico Leva
b307de6cb7 Make --xmlrevisions work on Wikia
* Do not try exportnowrap first: it returns a blank page.
* Add an allpages option, which simply uses readTitles but cannot resume.

FIXME: this only exports the current revision!
2018-05-19 03:13:42 +03:00
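A sketch of the wrapped-export request this implies (the wiki URL and title are hypothetical); since exportnowrap comes back blank on Wikia, the XML is unwrapped from the JSON envelope instead:

    import requests

    r = requests.get('http://example.wikia.com/api.php',
                     params={'action': 'query', 'titles': 'Some_page',
                             'export': 1, 'format': 'json'})
    # The <mediawiki> export XML is nested inside the JSON response:
    xml = r.json()['query']['export']['*']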
Federico Leva
680145e6a5 Fallback for --xmlrevisions on a MediaWiki 1.12 wiki 2018-05-19 00:27:13 +03:00
Federico Leva
27cbdfd302 Circumvent API exception when trying to use index.php
$ python dumpgenerator.py --xml --index=http://meritbadge.org/wiki/index.php
fails on at least one MediaWiki 1.12 wiki:

Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
Traceback (most recent call last):
  File "dumpgenerator.py", line 2211, in <module>
    main()
  File "dumpgenerator.py", line 2203, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 1766, in createNewDump
    getPageTitles(config=config, session=other['session'])
  File "dumpgenerator.py", line 400, in getPageTitles
    test = getJSON(r)
  File "dumpgenerator.py", line 1708, in getJSON
    return request.json()
  File "/usr/lib/python2.7/site-packages/requests/models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib64/python2.7/site-packages/simplejson/__init__.py", line 516, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 374, in decode
    obj, end = self.raw_decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 404, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
2018-05-19 00:16:43 +03:00
Federico Leva
d4f0869ecc Consistently use POST params instead of data
Also match URLs which end in ".php$" in domain2prefix().
2018-05-17 09:59:07 +03:00
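Illustrative only, not the real domain2prefix(): the point of the fix is that a URL ending in ".php" is also reduced to a clean prefix.

    import re

    def domain2prefix(url):
        # Strip the scheme, then a trailing script name such as
        # api.php or index.php, then squash the rest into a prefix.
        domain = re.sub(r'^https?://', '', url)
        domain = re.sub(r'/[^/]+\.php$', '', domain)  # URLs ending in ".php"
        return re.sub(r'[^A-Za-z0-9]', '_', domain)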
Federico Leva
754027de42 xmlrevisions: actually allow index to be undefined, don't POST data
* http://biografias.bcn.cl/api.php does not like the data to be POSTed.
  Just use URL parameters. Some wikis had anti-spam protections which
  made us POST everything, but for most wikis this should be fine.
* If the index is not defined, don't fail.
* Use only the base api.php URL, not parameters, in domain2prefix.

https://github.com/WikiTeam/wikiteam/issues/314
2018-05-17 09:40:20 +03:00
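In requests terms, the change is from data= to params=; a sketch using the wiki named above:

    import requests

    session = requests.Session()
    # Send the query as URL parameters; this wiki rejects POSTed data.
    r = session.get('http://biografias.bcn.cl/api.php',
                    params={'action': 'query', 'meta': 'siteinfo',
                            'format': 'json'})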
Emilio
3a56037279 Merge pull request #310 from nemobis/master
Update Wikia list with wikia.py
2018-05-16 21:31:41 +02:00
emijrp
811a325756 update 2018-05-16 12:42:18 +02:00
emijrp
aec3a14b7b update incomplete spider results, still running; use the userwikispacesXY lists instead 2018-05-14 20:24:13 +02:00
emijrp
98386e0b4c encoding, wikilicense 2018-05-12 10:50:17 +02:00
emijrp
3ee22b27d0 encoding fix attempt 2018-05-11 22:25:41 +02:00
emijrp
51ebefa1c4 100,000 wikispaces 2018-05-11 21:33:10 +02:00
emijrp
9fb8d4be0e file check 2018-05-10 09:04:08 +02:00
emijrp
8c30b3a2b9 bug invalid content, redownload 2018-05-09 21:29:58 +02:00
emijrp
7280c89b3b duckduckgo spider 2018-05-09 13:41:13 +02:00
emijrp
83158d4506 70k wikis by spider 2018-05-09 13:40:50 +02:00
emijrp
4b483c695b Merge branch 'master' of https://github.com/WikiTeam/wikiteam 2018-05-08 22:49:48 +02:00
emijrp
60704e3303 searching wikis with duckduckgo 2018-05-08 22:49:31 +02:00
Federico Leva
7c545d05b7 Fix UnboundLocalError and catch RetryError with --xmlrevisions
File "./dumpgenerator.py", line 1212, in generateImageDump
    if not re.search(r'</mediawiki>', xmlfiledesc):

UnboundLocalError: local variable 'xmlfiledesc' referenced before assignment
2018-05-08 17:09:38 +00:00
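A sketch of both fixes (the helper shape is illustrative; xmlfiledesc and the </mediawiki> check come from the traceback above):

    import re
    import requests

    def fetch_image_description(session, url):
        # Initialize first so the check below can never hit an
        # unbound local, and catch RetryError so one bad URL does
        # not abort the whole image dump.
        xmlfiledesc = ''
        try:
            xmlfiledesc = session.get(url, timeout=10).text
        except requests.exceptions.RetryError:
            pass
        if not re.search(r'</mediawiki>', xmlfiledesc):
            return None  # description missing or truncated
        return xmlfiledesc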
Federico Leva
952fcc6bcf Up version to 0.4.0-alpha to signify disruption 2018-05-07 21:55:26 +00:00
Federico Leva
33bb1c1f23 Download image description from API when using --xmlrevisions
Fixes https://github.com/WikiTeam/wikiteam/issues/308

Also add --failfast option to sneak in all the hacks I use to run
the bulk downloads, so I can more easily sync the repos.
2018-05-07 21:53:43 +00:00
Federico Leva
b8909baa3d Update Wikia list with wikia.py 2018-05-08 00:44:27 +03:00
Federico Leva
be5ca12075 Avoid generators in API-only export 2018-05-07 20:03:22 +00:00
Fedora
ebc02a3b45 Merge branch 'master' of https://github.com/WikiTeam/wikiteam 2018-05-07 19:54:42 +00:00
Fedora
a8cbb357ff First attempt of API-only export 2018-05-07 19:05:26 +00:00
Fedora
142b48cc69 Add timeouts and retries to increase success rate 2018-05-07 19:01:50 +00:00
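A minimal sketch of the idea; the retry counts and timeout values are illustrative, not the ones committed:

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    session = requests.Session()
    retries = Retry(total=5, backoff_factor=2,
                    status_forcelist=[429, 500, 502, 503, 504])
    session.mount('http://', HTTPAdapter(max_retries=retries))
    session.mount('https://', HTTPAdapter(max_retries=retries))
    # Always pass a timeout so a dead wiki cannot hang the dump.
    r = session.get('https://example.org/w/api.php', timeout=30)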
emijrp
60a0ba2e54 sleep 2018-05-07 20:08:39 +02:00
emijrp
061709d9e6 50,000 wikis, do not use this list, use wikispacesXY instead 2018-05-07 20:08:32 +02:00
emijrp
5002eb723a print 2018-05-07 08:07:37 +02:00
emijrp
30a6dc268b wikispaces lists 2018-05-06 21:14:32 +02:00
emijrp
e01b2fb0c3 bug wikitext 2018-05-06 17:53:32 +02:00
emijrp
ffff6cf568 sleep 2018-05-06 14:27:19 +02:00
emijrp
24ba4ae0ca originalurl metadata 2018-05-06 14:17:14 +02:00
emijrp
cd90d30aaa ia checking 2018-05-06 14:15:35 +02:00
emijrp
af680ced4a help, params 2018-05-06 14:07:48 +02:00
emijrp
2fe1c0b6b2 uploader included 2018-05-06 13:19:21 +02:00
emijrp
254486af06 param 2018-05-05 20:24:07 +02:00
emijrp
9ab9c64df2 bug in redirects; script accepts wikilist.txt now 2018-05-05 20:20:06 +02:00