mirror of https://github.com/WikiTeam/wikiteam synced 2024-11-10 13:10:27 +00:00
Commit Graph

374 Commits

Author SHA1 Message Date
Federico Leva
11507e931e Initial switch to mwclient for the xmlrevisions option
* Still maintained and available for Python 3 as well.
* Allows raw API requests as we need.
* Does not provide handy generators, so we need to handle continuation ourselves.
* Decides on its own which protocol and exact path to use, and fails at it.
* Appears to use POST by default unless asked otherwise; what to do?
2020-02-10 14:20:46 +02:00
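A minimal sketch of what raw API queries with manual continuation can look like under mwclient, given the points above; the host, path and query parameters are illustrative, not the commit's actual code:

    import mwclient  # pip install mwclient

    # Hypothetical wiki host; mwclient picks the scheme and script path unless told otherwise.
    site = mwclient.Site('wiki.example.org', path='/w/')

    params = {'list': 'allpages', 'aplimit': 50}
    while True:
        reply = site.api('query', **params)
        for page in reply['query']['allpages']:
            print(page['title'])
        # No handy generator for raw queries, so continuation is done by hand.
        if 'continue' not in reply:
            break
        params.update(reply['continue'])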
Federico Leva
3d04dcbf5c Use GET rather than POST for API requests
* It was just an old trick to get past some barriers which were waived with GET.
* It's not conformant and doesn't play well with some redirects.
* Some recent wikis seem to not like it at all, see also issue #311.
2020-02-08 12:18:03 +02:00
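A minimal sketch of the change in request style, with an illustrative endpoint; the point is sending parameters in the query string with GET rather than as a POST body:

    import requests

    api = 'https://wiki.example.org/w/api.php'
    params = {'action': 'query', 'meta': 'siteinfo', 'format': 'json'}
    r = requests.get(api, params=params, timeout=30)
    r.raise_for_status()
    print(r.json()['query']['general']['sitename'])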
Federico Leva
83af47d6c0 Catch and raise PageMissingError when query() returns no pages 2018-05-25 11:00:32 +03:00
Federico Leva
73902d39c0 For old MediaWiki releases, use rawcontinue and wikitools query()
Otherwise the query continuation may fail and only the top revisions
will be exported. Tested with Wikia:
http://clubpenguin.wikia.com/api.php?action=query&prop=revisions&titles=Club_Penguin_Wiki

Also add parentid since it's available after all.

https://github.com/WikiTeam/wikiteam/issues/311#issuecomment-391957783
2018-05-25 10:55:44 +03:00
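A rough sketch of the old-style continuation the commit relies on (the wiki URL and title are placeholders, and this is not the script's actual query() call): with rawcontinue, old MediaWiki returns the continuation under query-continue instead of continue.

    import requests

    api = 'https://wiki.example.org/api.php'
    params = {'action': 'query', 'prop': 'revisions', 'titles': 'Some_page',
              'rvlimit': 50, 'rawcontinue': 1, 'format': 'json'}
    while True:
        reply = requests.get(api, params=params, timeout=30).json()
        for page in reply['query']['pages'].values():
            for rev in page.get('revisions', []):
                print(rev.get('revid'), rev.get('parentid'))
        cont = reply.get('query-continue', {}).get('revisions')
        if not cont:
            break
        params.update(cont)   # e.g. {'rvcontinue': ...} in the old continuation style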
Federico Leva
da64349a5d Avoid UnboundLocalError: local variable 'reply' referenced before assignment 2018-05-23 18:32:38 +03:00
Federico Leva
b7789751fc UnboundLocalError: local variable 'reply' referenced before assignment
Warning!: "./tdicampswikiacom-20180522-wikidump" path exists
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2321, in <module>
    main()
  File "./dumpgenerator.py", line 2283, in main
    while reply.lower() not in ['yes', 'y', 'no', 'n']:
UnboundLocalError: local variable 'reply' referenced before assignment
2018-05-22 10:30:11 +03:00
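A minimal sketch of the kind of fix involved: give 'reply' a value before the loop that reads it, so the name always exists (the prompt text here is illustrative, not the script's exact wording):

    reply = ''
    while reply.lower() not in ['yes', 'y', 'no', 'n']:
        reply = raw_input('Path exists. Resume the dump? [yes/no] ')  # Python 2, like dumpgenerator.py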
Federico Leva
d76b4b4e01 Raise and catch PageMissingError when revisions API result is incomplete
https://github.com/WikiTeam/wikiteam/issues/317
2018-05-22 10:16:52 +03:00
Federico Leva
7a655f0074 Check for sha1 presence in makeXmlFromPage() 2018-05-22 09:33:53 +03:00
Federico Leva
4bc41c3aa2 Actually keep track of listed titles and stop when duplicates are returned
https://github.com/WikiTeam/wikiteam/issues/309
2018-05-21 16:41:10 +03:00
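A short sketch of the idea, not the project's actual code: remember what the API already listed and stop as soon as it starts repeating itself, instead of looping forever.

    def iter_unique_titles(title_iter):
        """Yield each title once; stop when the wiki starts returning duplicates."""
        seen = set()
        for title in title_iter:
            if title in seen:
                break
            seen.add(title)
            yield title

    # e.g.: for title in iter_unique_titles(allpages_listing): titlesfile.write(title + '\n')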
Federico Leva
80288cf49e Catch allpages and namespaces API without query results 2018-05-21 16:41:00 +03:00
Federico Leva
e47f638a24 Define "check" before running checkAPI()
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2294, in <module>
    main()
  File "./dumpgenerator.py", line 2239, in main
    config, other = getParameters(params=params)
  File "./dumpgenerator.py", line 1587, in getParameters
    if api and check:
UnboundLocalError: local variable 'check' referenced before assignment
2018-05-21 15:53:51 +03:00
Federico Leva
bad49d7916 Also default to regenerating dump in --failfast 2018-05-21 11:44:25 +03:00
Federico Leva
bbcafdf869 Support Unicode usernames etc. in makeXmlFromPage()
Test case:

Titles saved at... 39fanficwikiacom-20180521-titles.txt
377 page titles loaded
http://39fanfic.wikia.com/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
30 namespaces found
Exporting revisions from namespace 0
Warning. Could not use allrevisions, wiki too old.
1 more revisions exported
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2291, in <module>
    main()
  File "./dumpgenerator.py", line 2283, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1849, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 732, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session):
  File "./dumpgenerator.py", line 861, in getXMLRevisions
    yield makeXmlFromPage(pages[page])
  File "./dumpgenerator.py", line 880, in makeXmlFromPage
    E.username(str(rev['user'])),
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-3: ordinal not in range(128)
2018-05-21 07:54:27 +03:00
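A Python 2 sketch of the failure and the fix: str() on a non-ASCII unicode value raises UnicodeEncodeError, so the username should stay unicode when building the XML element (the revision dict here is illustrative):

    from lxml.builder import E
    from lxml import etree

    rev = {'user': u'Jos\u00e9'}            # illustrative revision dict
    # E.username(str(rev['user']))          # -> UnicodeEncodeError on Python 2
    username = E.username(rev['user'])      # keep it unicode; Python 3 would just use str
    print(etree.tostring(username, encoding='unicode'))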
Federico Leva
320f231d57 Handle status code > 400 in checkAPI()
Fixes https://github.com/WikiTeam/wikiteam/issues/315
2018-05-20 01:41:01 +03:00
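A minimal sketch of the idea (not the project's actual checkAPI()): reject error responses outright instead of trying to parse their body.

    import requests

    def api_looks_usable(session, api_url):
        r = session.get(api_url, params={'action': 'query', 'meta': 'siteinfo',
                                         'format': 'json'}, timeout=30)
        if r.status_code > 400:   # the commit handles codes above 400
            return False
        try:
            return 'query' in r.json()
        except ValueError:
            return False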
Federico Leva
845c05de1e Go back to data POSTing in checkIndex() and checkAPI() to handle redirects
Some redirects from HTTP to HTTPS otherwise end up giving 400, like
http://nimiarkisto.fi/
2018-05-20 01:20:32 +03:00
Federico Leva
de752bb6a2 Also add contentmodel to the XML of --xmlrevisions 2018-05-20 00:28:01 +03:00
Federico Leva
03ba77e2f5 Build XML from the pages module when allrevisions not available 2018-05-19 22:34:13 +03:00
Federico Leva
06ad1a9fe3 Update --xmlrevisions help 2018-05-19 18:55:03 +03:00
Federico Leva
7143f7efb1 Actually export all revisions in --xmlrevisions: build XML manually! 2018-05-19 18:49:58 +03:00
Federico Leva
1ff5af7d44 Catch unexpected API errors in getPageTitlesAPI
Apparently the initial JSON test is not enough; the JSON can be broken
or unexpected in other ways or at other points.
Fall back to the old scraper in such a case.

Fixes https://github.com/WikiTeam/wikiteam/issues/295 , perhaps.

If the scraper doesn't work for the wiki, the dump will fail entirely,
even if the list of titles was perhaps almost complete. A different
solution may be in order.
2018-05-19 04:11:50 +03:00
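A sketch of the fallback described above, assuming hedged stand-ins for the real getPageTitlesAPI/getPageTitlesScraper signatures:

    def get_page_titles(config, session):
        """Prefer the API; if its JSON is broken or unexpected, fall back to scraping."""
        try:
            for title in getPageTitlesAPI(config, session):
                yield title
        except (ValueError, KeyError):      # broken or unexpected JSON from the API
            for title in getPageTitlesScraper(config, session):
                yield title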
Federico Leva
59c4c5430e Catch missing titles file and JSON response
Traceback (most recent call last):
  File "dumpgenerator.py", line 2214, in <module>
    print 'Trying to use path "%s"...' % (config['path'])
  File "dumpgenerator.py", line 2210, in main
    elif reply.lower() in ['no', 'n']:
  File "dumpgenerator.py", line 1977, in saveSiteInfo

  File "dumpgenerator.py", line 1711, in getJSON
    return False
  File "/usr/lib/python2.7/site-packages/requests/models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib64/python2.7/site-packages/simplejson/__init__.py", line 516, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 374, in decode
    obj, end = self.raw_decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 404, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Or, for instance, if the directory was renamed compared to the saved config:

Resuming previous dump process...
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2238, in <module>
    main()
  File "./dumpgenerator.py", line 2228, in main
    resumePreviousDump(config=config, other=other)
  File "./dumpgenerator.py", line 1829, in resumePreviousDump
    if lasttitle == '--END--':
UnboundLocalError: local variable 'lasttitle' referenced before assignment
2018-05-19 03:31:49 +03:00
Federico Leva
b307de6cb7 Make --xmlrevisions work on Wikia
* Do not try exportnowrap first: it returns a blank page.
* Add an allpages option, which simply uses readTitles but cannot resume.

FIXME: this only exports the current revision!
2018-05-19 03:13:42 +03:00
Federico Leva
680145e6a5 Fallback for --xmlrevisions on a MediaWiki 1.12 wiki 2018-05-19 00:27:13 +03:00
Federico Leva
27cbdfd302 Circumvent API exception when trying to use index.php
$ python dumpgenerator.py --xml --index=http://meritbadge.org/wiki/index.php
fails on at least one MediaWiki 1.12 wiki:

Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
Traceback (most recent call last):
  File "dumpgenerator.py", line 2211, in <module>
    main()
  File "dumpgenerator.py", line 2203, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 1766, in createNewDump
    getPageTitles(config=config, session=other['session'])
  File "dumpgenerator.py", line 400, in getPageTitles
    test = getJSON(r)
  File "dumpgenerator.py", line 1708, in getJSON
    return request.json()
  File "/usr/lib/python2.7/site-packages/requests/models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib64/python2.7/site-packages/simplejson/__init__.py", line 516, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 374, in decode
    obj, end = self.raw_decode(s)
  File "/usr/lib64/python2.7/site-packages/simplejson/decoder.py", line 404, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
2018-05-19 00:16:43 +03:00
Federico Leva
d4f0869ecc Consistently use POST params instead of data
Also match URLs which end in ".php$" in domain2prefix().
2018-05-17 09:59:07 +03:00
Federico Leva
754027de42 xmlrevisions: actually allow index to be undefined, don't POST data
* http://biografias.bcn.cl/api.php does not like the data to be POSTed.
  Just use URL parameters. Some wikis had anti-spam protections which
  made us POST everything, but for most wikis this should be fine.
* If the index is not defined, don't fail.
* Use only the base api.php URL, not parameters, in domain2prefix.

https://github.com/WikiTeam/wikiteam/issues/314
2018-05-17 09:40:20 +03:00
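A short sketch of the two points above; the helper name and config dict are illustrative, not the script's actual structures:

    import requests

    def query_api(session, api_url, **params):
        """Send everything as URL parameters (params=), not a POST body (data=)."""
        params.setdefault('format', 'json')
        return session.get(api_url, params=params, timeout=30).json()

    # And don't assume index.php is known when only api.php was given:
    config = {'api': 'http://biografias.bcn.cl/api.php', 'index': ''}
    if config.get('index'):
        pass  # index.php-based steps only run when an index URL actually exists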
Federico Leva
7c545d05b7 Fix UnboundLocalError and catch RetryError with --xmlrevisions
File "./dumpgenerator.py", line 1212, in generateImageDump
    if not re.search(r'</mediawiki>', xmlfiledesc):

UnboundLocalError: local variable 'xmlfiledesc' referenced before assignment
2018-05-08 17:09:38 +00:00
Federico Leva
952fcc6bcf Up version to 0.4.0-alpha to signify disruption 2018-05-07 21:55:26 +00:00
Federico Leva
33bb1c1f23 Download image description from API when using --xmlrevisions
Fixes https://github.com/WikiTeam/wikiteam/issues/308

Also add --failfast option to sneak in all the hacks I use to run
the bulk downloads, so I can more easily sync the repos.
2018-05-07 21:53:43 +00:00
Federico Leva
be5ca12075 Avoid generators in API-only export 2018-05-07 20:03:22 +00:00
Fedora
a8cbb357ff First attempt of API-only export 2018-05-07 19:05:26 +00:00
Fedora
142b48cc69 Add timeouts and retries to increase success rate 2018-05-07 19:01:50 +00:00
Hydriz Scholz
e8cb1a7a5f Add missing newline 2016-09-30 17:29:33 +08:00
Hydriz Scholz
daa39616e2 Make the copyright year automatically update itself 2016-09-30 13:09:41 +08:00
emijrp
2e99d869e2 fixing Wikia images bug, issue #212 2016-09-18 01:16:46 +02:00
emijrp
3f697dbb5b restoring dumpgenerator.py code to last stable version f43b7389a0. I will rewrite the code in the wikiteam/ subdirectory 2016-07-31 17:37:31 +02:00
emijrp
4e939f4e98 getting ready for other wiki engines, function prefixes: MediaWiki (mw), Wikispaces (ws) 2016-07-30 16:32:25 +02:00
nemobis
f43b7389a0 Merge pull request #270 from nemobis/master
When we have a working index.php, do not require api.php
2016-02-28 09:14:45 +01:00
Federico Leva
480d421d7b When we have a working index.php, do not require api.php
We work well without api.php; this was a needless suicide.
Especially as sometimes sysadmins like to disable the API for no
reason and then index.php is our only option to archive the wiki.
2016-02-27 21:10:02 +01:00
emijrp
15223eb75b Parsing more image names from HTML Special:Allimages 2016-01-29 17:19:15 +01:00
emijrp
e138d6ce52 New API params to continue in Allimages 2016-01-29 16:55:44 +01:00
emijrp
2c0f54d73b new HTML regexp for Special:Allpages 2016-01-29 16:38:31 +01:00
emijrp
4ef665b53c In recent MediaWiki versions, API continue is a bit different 2016-01-29 16:14:34 +01:00
Daniel Oaks
376e8a11a3 Avoid out-of-memory error in two extra places 2015-10-22 23:19:50 +10:00
Tim Sheerman-Chase
877b736cd2 Merge branch 'retry' of https://github.com/TimSC/wikiteam into retry 2015-08-07 23:15:58 +01:00
Tim Sheerman-Chase
6716ceab32 Fix tests 2015-08-07 23:07:13 +01:00
Tim Sheerman-Chase
5cb2ecb6b5 Attempting to fix missing config in tests 2015-08-07 21:33:39 +01:00
Tim
93bc29f2d7 Fix syntax errors 2015-08-06 22:37:29 +01:00
Tim
d5a1ed2d5a Fix indentation, use classic string formating 2015-08-06 22:30:49 +01:00
Tim Sheerman-Chase
8380af5f24 Improve retry logic 2015-08-05 21:24:59 +01:00
PiRSquared17
fadd7134f7 What I meant to do, ugh 2015-04-18 21:28:57 +00:00
PiRSquared17
1b2e83aa8c Fix minor error with normpath call 2015-04-18 21:27:23 +00:00
PiRSquared17
5db9a1c7f3 Normalize path/foo/ to path/foo, so -2, etc. work (fixes #244) 2015-04-10 00:21:07 +00:00
Federico Leva
2b78bfb795 Merge branch '2015/iterators' of git://github.com/nemobis/wikiteam into nemobis-2015/iterators
Conflicts:
	requirements.txt
2015-03-30 09:27:00 +02:00
Federico Leva
d4fd745498 Actually allow resuming huge or broken XML dumps
* Log "XML export on this wiki is broken, quitting." to the error
  file so that grepping reveals which dumps were interrupted so.
* Automatically reduce export size for a page when downloading the
  entire history at once results in a MemoryError.
* Truncate the file with a pythonic method (.seek and .truncate)
  while reading from the end, by making reverse_readline() a weird
  hybrid to avoid an actual coroutine.
2015-03-30 02:35:55 +02:00
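A rough sketch of the pythonic seek/truncate idea described above, assuming the goal is to cut the resumed XML right after the last complete </page>; this is illustrative, not the commit's reverse_readline() hybrid:

    def truncate_after_last_page(path, chunk=65536):
        """Scan the dump backwards and truncate just after the last complete </page>."""
        with open(path, 'r+b') as f:
            f.seek(0, 2)
            pos = f.tell()
            while pos > 0:
                step = min(chunk, pos)
                pos -= step
                f.seek(pos)
                data = f.read(step + len(b'</page>'))   # small overlap so a split tag is still seen
                i = data.rfind(b'</page>')
                if i != -1:
                    f.seek(pos + i + len(b'</page>'))
                    f.write(b'\n')
                    f.truncate()                         # drop the trailing partial page
                    return True
        return False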
Federico Leva
9168a66a54 logerror() wants unicode, but readTitles etc. give bytes
Fixes #239.
2015-03-30 00:51:53 +02:00
Federico Leva
632b99ea53 Merge branch '2015/iterators' of https://github.com/nemobis/wikiteam into nemobis-2015/iterators 2015-03-30 00:48:32 +02:00
nemobis
ff2cdfa1cd Merge pull request #236 from PiRSquared17/fix-server-check-api
Catch KeyError to fix server check
2015-03-29 13:53:26 +02:00
nemobis
0b25951ab1 Merge pull request #224 from nemobis/2015/issue26
Issue #26: Local "Special" namespace, actually limit replies
2015-03-29 13:53:20 +02:00
PiRSquared17
03db166718 Catch KeyError to fix server check 2015-03-29 04:14:43 +01:00
PiRSquared17
f80ad39df0 Make filename truncation work with UTF-8 2015-03-28 15:17:06 +00:00
PiRSquared17
90bfd1400e Merge pull request #229 from PiRSquared17/fix-zwnbsp-bom
Strip ZWNBSP (U+FEFF) Byte-Order Mark from JSON/XML
2015-03-24 21:46:33 +00:00
PiRSquared17
fc276d525f Allow spaces before <mediawiki> tag. 2015-03-24 03:44:03 +00:00
PiRSquared17
1c820dafb7 Strip ZWNBSP (U+FEFF) Byte-Order Mark from JSON/XML 2015-03-24 01:58:01 +00:00
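A minimal sketch of the stripping step, with a made-up payload for illustration:

    import json

    def strip_bom(text):
        """Drop a leading U+FEFF (ZWNBSP / byte-order mark) so JSON and XML parsers don't choke."""
        return text.lstrip(u'\ufeff')

    print(json.loads(strip_bom(u'\ufeff{"query": {}}')))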
Nemo bis
55e5888a00 Fix UnicodeDecodeError in resume: use kitchen 2015-03-10 22:26:23 +02:00
Federico Leva
14ce5f2c1b Resume and list titles without keeping everything in memory
Approach suggested by @makoshark, finally found the time to start
implementing it.
* Do not produce and save the titles list all at once. Instead, use
  the scraper and API as generators and save titles on the go. Also,
  try to start the generator from the appropriate title.
  For now the title sorting is not implemented. Pages will be in the
  order given by namespace ID, then page name.
* When resuming, read both the title list and the XML file from the
  end rather than the beginning. If the correct terminator is
  present, only one line needs to be read.
* In both cases, use a generator instead of a huge list in memory.
* Also truncate the resumed XML without writing it from scratch.
  For now using GNU ed: very compact, though shelling out is ugly.
  I gave up on using file.seek and file.truncate to avoid reading the
  whole file from the beginning or complicating reverse_readline()
  with more offset calculations.

This should avoid MemoryError in most cases.

Tested by running a dump over a 1.24 wiki with 11 pages: a complete
dump and a resumed dump from a dump interrupted with ctrl-c.
2015-03-10 11:21:44 +01:00
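A short sketch of the "read from the end" part, assuming an illustrative titles file name; only the tail of the file is read to check for the '--END--' terminator, instead of loading the whole list:

    def read_last_line(path, blocksize=4096):
        with open(path, 'rb') as f:
            f.seek(0, 2)
            size = f.tell()
            f.seek(max(0, size - blocksize))
            lines = f.read().splitlines()
            return lines[-1].decode('utf-8') if lines else ''

    if read_last_line('examplewiki-20150310-titles.txt') == '--END--':
        print('Title list is complete; no need to regenerate it.')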
Federico Leva
2537e9852e Make dumpgenerator.py 774: required by launcher.py 2015-03-08 21:33:39 +01:00
Federico Leva
79e2c5951f Fix API check if only index is passed
I forgot that the preceding point only extracts the api.php URL if
the "wiki" argument is passed to say it's a MediaWiki wiki (!).
2015-03-08 20:52:24 +01:00
Federico Leva
bdc7c9bf06 Issue 26: Local "Special" namespace, actually limit replies
* For some reason, in a previous commit I had noticed that maxretries
  was not respected in getXMLPageCore, but I didn't fix it. Done now.
* If the "Special" namespace alias doesn't work, fetch the local one.
2015-03-08 19:30:09 +01:00
Federico Leva
2f25e6b787 Make checkAPI() more readable and verbose
Also return the api URL we found.
2015-03-08 16:01:46 +01:00
Federico Leva
48ad3775fd Merge branch 'follow-redirects-api' of git://github.com/PiRSquared17/wikiteam into PiRSquared17-follow-redirects-api 2015-03-08 14:35:30 +01:00
nemobis
2284e3d55e Merge pull request #186 from PiRSquared17/update-headers
Preserve default headers, fixing openwrt test
2015-03-08 13:56:00 +01:00
PiRSquared17
5d23cb62f4 Merge pull request #219 from vadp/dir-fnames-unicode
convert images directory content to unicode when resuming download
2015-03-04 23:36:59 +00:00
PiRSquared17
d361477a46 Merge pull request #222 from vadp/img-desc-load-err
dumpgenerator: catch errors for missing image descriptions
2015-03-03 01:55:59 +00:00
Vadim Shlyakhov
4c1d104326 dumpgenerator: catch errors for missing image descriptions 2015-03-02 12:15:51 +03:00
PiRSquared17
b1ce45b170 Try using URL without index.php as index 2015-03-02 04:13:44 +00:00
PiRSquared17
9c3c992319 Follow API redirects 2015-03-02 03:13:03 +00:00
Vadim Shlyakhov
f7e83a767a convert images directory content to unicode when resuming download 2015-03-01 19:14:01 +03:00
Benjamin Mako Hill
d2adf5ce7c Merge branch 'master' of github.com:WikiTeam/wikiteam 2015-02-10 17:05:22 -08:00
Benjamin Mako Hill
f85b4a3082 fixed bug with page missing exception code
My previous code broke the page missing detection code with two negative
outcomes:

- missing pages were not reported in the error log
- every missing page generated an extraneous "</page>" line in the output,
  which rendered dumps invalid

This patch improves the exception code in general and fixes both of these
issues.
2015-02-10 16:56:14 -08:00
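A sketch of the approach, not the actual classes in dumpgenerator.py; the point is to signal a missing page with an exception the caller can log, rather than emitting a stray closing tag:

    class PageMissingError(Exception):
        """Sketch only; the real class lives in dumpgenerator.py."""
        def __init__(self, title):
            self.title = title
            super(PageMissingError, self).__init__('page missing: %s' % title)

    def export_page(title, xml, logerror):
        """Raise for missing pages so the caller logs them; never write a bogus </page>."""
        if '</page>' not in xml:
            raise PageMissingError(title)
        return xml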
PiRSquared17
9480834a37 Fix infinite images loop
Closes #205 (hopefully)
2015-02-10 02:24:50 +00:00
Benjamin Mako Hill
eb8b44aef0 strip <sha1> tags returned under <page>
The Wikia API is exporting sha1 sums as part of the response for pages.
These are invalid XML and are causing dump parsing code (e.g.,
MediaWiki-Utilities) to fail. Also, sha1 should be on revisions, not pages, so
it's not entirely clear to me what this is referring to.
2015-02-06 18:50:25 -08:00
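A rough illustration of the stripping step (not the project's exact regex, and deliberately simplified):

    import re

    def strip_page_level_sha1(page_xml):
        """Remove <sha1>...</sha1> elements that Wikia emits directly under <page>,
        which make the dump invalid for schema-based parsers."""
        return re.sub(r'\s*<sha1>[^<]*</sha1>', '', page_xml)

    print(strip_page_level_sha1('<page><title>X</title><sha1>abc</sha1></page>'))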
Benjamin Mako Hill
145b2eaaf4 changed getXMLPage() into a generator
The program tended to run out of memory when processing very large pages (i.e.,
pages with extremely large numbers of revisions or pages with large numbers of
very large revisions). This mitigates the problem by changing getXMLPage() into
a generator which allows us to write pages after each request to the API.

This required changes to the getXMLPage() function and also changes to other
parts of the code that called it.

Additionally, when the function was called, its text was checked in several
ways. This required a few changes, including a running tally of revisions
instead of a post hoc check, and it required moving error checking into an
Exception rather than just an if statement that looked at the final result.
2015-02-06 17:19:24 -08:00
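A short sketch of the generator approach described above; fetch_chunk is a hypothetical callable standing in for the per-request API call:

    def get_xml_page(title, fetch_chunk):
        """Yield each chunk of a page's history as it arrives, so the caller can write
        it to disk immediately instead of holding every revision in memory."""
        offset = None
        while True:
            xml, offset = fetch_chunk(title, offset)   # hypothetical: returns (xml, next offset or None)
            yield xml
            if offset is None:
                break

    # The caller streams the output:
    #     for chunk in get_xml_page(title, fetch_chunk):
    #         dumpfile.write(chunk)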
nemobis
b3ef165529 Merge pull request #194 from mrshu/mrshu/dumpgenerator-pep8fied
dumpgenerator: AutoPEP8-fied
2014-10-01 23:56:36 +02:00
mr.Shu
04446a40a5 dumpgenerator: AutoPEP8-fied
* Used autopep8 to make sure the code looks nice and is actually PEP8
  compliant.

Signed-off-by: mr.Shu <mr@shu.io>
2014-10-01 22:26:56 +02:00
nemobis
e0f8e36bf4 Merge pull request #190 from PiRSquared17/api-allpages-disabled
Fallback to getPageTitlesScraper() if API allpages disabled
2014-09-28 16:34:24 +02:00
PiRSquared17
757019521a Fallback to scraper if API allpages disabled 2014-09-23 15:53:51 -04:00
PiRSquared17
4b3c862a58 Comment debugging print, fix test 2014-09-23 15:10:06 -04:00
PiRSquared17
7a1db0525b Add more wiki engines to getWikiEngine 2014-09-23 15:04:36 -04:00
PiRSquared17
4ceb9ad72e Preserve default headers, fixing openwrt test 2014-09-20 20:07:49 -04:00
PiRSquared17
b4818d2985 Avoid infinite loop in getImageNamesScraper 2014-09-20 11:41:57 -04:00
nemobis
8a9b50b51d Merge pull request #183 from PiRSquared17/patch-7
Retry on ConnectionError in getXMLPageCore
2014-09-19 21:08:52 +02:00
nemobis
19c48d3dd0 Merge pull request #180 from PiRSquared17/patch-2
Get as much information from siteinfo as possible
2014-09-19 09:43:07 +02:00
Pi R. Squared
f7187b7048 Retry on ConnectionError in getXMLPageCore
Previously it just gave a fatal error.
2014-09-18 20:21:01 -04:00
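A minimal sketch of the retry idea (not getXMLPageCore itself); parameter names and the backoff are illustrative:

    import time
    import requests

    def get_with_retries(session, url, params, maxretries=5, delay=10):
        """Retry transient network errors instead of aborting the whole dump."""
        for attempt in range(maxretries):
            try:
                return session.get(url, params=params, timeout=30)
            except requests.exceptions.ConnectionError:
                time.sleep(delay * (attempt + 1))   # crude linear backoff
        raise RuntimeError('giving up on %s after %d retries' % (url, maxretries))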
Pi R. Squared
f31e4e6451 Dict not hashable, also not needed
Quick fix.
2014-09-18 17:56:22 -04:00
Pi R. Squared
399f609d70 AllPages API hack for old versions of MediaWiki
New API format: http://www.mediawiki.org/w/api.php?action=query&list=allpages&apnamespace=0&apfrom=!&format=json&aplimit=500
Old API format: http://wiki.damirsystems.com/api.php?action=query&list=allpages&apnamespace=0&apfrom=!&format=json
2014-09-18 17:53:38 -04:00
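A sketch of handling both continuation keys shown above (not dumpgenerator's exact code): newer MediaWiki returns apcontinue, old versions return apfrom.

    def next_allpages_params(reply, params):
        cont = reply.get('query-continue', {}).get('allpages', {})
        if 'apcontinue' in cont:          # newer MediaWiki
            params['apcontinue'] = cont['apcontinue']
        elif 'apfrom' in cont:            # old MediaWiki, as in the second URL above
            params['apfrom'] = cont['apfrom']
        else:
            return None                   # nothing left to fetch
        return params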
Pi R. Squared
498b64da3f Try getting index.php from siteinfo API
Fixes #49
2014-09-14 11:59:17 -04:00
Pi R. Squared
ff0d230d08 Get as much information from siteinfo as possible
Properly fixes #74.

Algorithm:
1. Try all siteinfo props. If this gives an error, continue. Otherwise, stop.
2. Try MediaWiki 1.11-1.12 siteinfo props. If this gives an error, continue. Otherwise, stop.
3. Try minimal siteinfo props. Stop.
Not using sishowalldb=1 to avoid possible error (by default), since this data is of little use anyway.
2014-09-14 11:10:43 -04:00
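A sketch of that three-step ladder; the exact siprop sets here are illustrative, not the ones the script actually requests:

    def fetch_siteinfo(session, api_url):
        attempts = [
            'general|namespaces|namespacealiases|statistics|interwikimap',  # everything we'd like
            'general|namespaces|statistics|interwikimap',                   # MediaWiki 1.11-1.12 subset
            'general|namespaces',                                           # minimal fallback
        ]
        for siprop in attempts:
            reply = session.get(api_url, params={'action': 'query', 'meta': 'siteinfo',
                                                 'siprop': siprop, 'format': 'json'},
                                timeout=30).json()
            if 'error' not in reply:
                break
        return reply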
Pi R. Squared
322604cc23 Encode title using UTF-8 before printing
This fixes #170 and closes #174.
2014-09-14 09:03:44 -04:00
nemobis
11368310ee Merge pull request #173 from nemobis/issue/131
Fix #131: ValueError: No JSON object could be decoded
2014-08-23 19:28:26 +03:00