292 Commits (9fccd3f4da2bc0af5fa590553816c7817df82d54)

Author SHA1 Message Date
Hydriz Scholz e8cb1a7a5f Add missing newline 8 years ago
Hydriz Scholz daa39616e2 Make the copyright year automatically update itself 8 years ago
emijrp 2e99d869e2 fixing Wikia images bug, issue #212 8 years ago
emijrp 3f697dbb5b restoring dumpgenerator.py code to f43b7389a0 last stable version. I will rewrite code in wikiteam/ subdirectory 8 years ago
emijrp 4e939f4e98 getting ready for other wiki engines, function prefixes: MediaWiki (mw), Wikispaces (ws) 8 years ago
nemobis f43b7389a0 Merge pull request #270 from nemobis/master
When we have a working index.php, do not require api.php
8 years ago
Federico Leva 480d421d7b When we have a working index.php, do not require api.php
We work well without api.php. This was a needless suicide.
Especially as sometimes sysadmins like to disable the API for no
reason and then index.php is our only option to archive the wiki.
8 years ago
emijrp 15223eb75b Parsing more image names from HTML Special:Allimages 8 years ago
emijrp e138d6ce52 New API params to continue in Allimages 8 years ago
emijrp 2c0f54d73b new HTML regexp for Special:Allpages 8 years ago
emijrp 4ef665b53c In recent MediaWiki versions, API continue is a bit different 8 years ago
Daniel Oaks 376e8a11a3 Avoid out-of-memory error in two extra places 9 years ago
Tim Sheerman-Chase 877b736cd2 Merge branch 'retry' of https://github.com/TimSC/wikiteam into retry 9 years ago
Tim Sheerman-Chase 6716ceab32 Fix tests 9 years ago
Tim Sheerman-Chase 5cb2ecb6b5 Attempting to fix missing config in tests 9 years ago
Tim 93bc29f2d7 Fix syntax errors 9 years ago
Tim d5a1ed2d5a Fix indentation, use classic string formatting 9 years ago
Tim Sheerman-Chase 8380af5f24 Improve retry logic 9 years ago
PiRSquared17 fadd7134f7 What I meant to do, ugh 9 years ago
PiRSquared17 1b2e83aa8c Fix minor error with normpath call 9 years ago
PiRSquared17 5db9a1c7f3 Normalize path/foo/ to path/foo, so -2, etc. work (fixes #244) 9 years ago
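A minimal sketch of the normalization described in 5db9a1c7f3, assuming it relies on os.path.normpath (the follow-up commit 1b2e83aa8c mentions a normpath call). The helper name normalize_dump_path is hypothetical and not taken from dumpgenerator.py.

```python
import os.path


def normalize_dump_path(path):
    """Drop a trailing separator so path/foo/ and path/foo name the same directory.

    Hypothetical helper for illustration only.
    """
    return os.path.normpath(path)


# Both spellings normalize to the same value.
same = normalize_dump_path("path/foo/") == normalize_dump_path("path/foo")
```

With the trailing separator gone, a suffix such as -2 is appended to the directory name itself rather than after a slash.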
Federico Leva 2b78bfb795 Merge branch '2015/iterators' of git://github.com/nemobis/wikiteam into nemobis-2015/iterators
Conflicts:
	requirements.txt
9 years ago
Federico Leva d4fd745498 Actually allow resuming huge or broken XML dumps
* Log "XML export on this wiki is broken, quitting." to the error
  file so that grepping reveals which dumps were interrupted this way.
* Automatically reduce export size for a page when downloading the
  entire history at once results in a MemoryError.
* Truncate the file with a pythonic method (.seek and .truncate)
  while reading from the end, by making reverse_readline() a weird
  hybrid to avoid an actual coroutine.
9 years ago
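The third bullet of d4fd745498 (truncating with .seek and .truncate while reading from the end) can be sketched roughly as follows. This is not the reverse_readline() hybrid the commit describes; it is a simpler stand-alone illustration, and the function name, the marker argument and the block size are all assumptions.

```python
import os


def truncate_after_last(path, marker, block_size=4096):
    """Cut the file right after the last occurrence of marker (bytes).

    Reads backwards in blocks, so the whole file is never held in memory.
    Returns the new file size, or None if the marker was not found.
    Hypothetical helper, not the code from dumpgenerator.py.
    """
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b""
        while pos > 0:
            step = min(block_size, pos)
            pos -= step
            f.seek(pos)
            # Keep the first len(marker) - 1 bytes of the previous window so
            # that a marker straddling two blocks is still found.
            tail = f.read(step) + tail[:len(marker) - 1]
            idx = tail.rfind(marker)
            if idx != -1:
                end = pos + idx + len(marker)
                f.truncate(end)
                return end
    return None
```

For a resumed XML dump, the marker could be the closing tag of the last complete page, for example b"</page>\n"; that choice is an assumption as well.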
Federico Leva 9168a66a54 logerror() wants unicode, but readTitles etc. give bytes
Fixes #239.
9 years ago
Federico Leva 632b99ea53 Merge branch '2015/iterators' of https://github.com/nemobis/wikiteam into nemobis-2015/iterators 9 years ago
nemobis ff2cdfa1cd Merge pull request #236 from PiRSquared17/fix-server-check-api
Catch KeyError to fix server check
9 years ago
nemobis 0b25951ab1 Merge pull request #224 from nemobis/2015/issue26
Issue #26: Local "Special" namespace, actually limit replies
9 years ago
PiRSquared17 03db166718 Catch KeyError to fix server check 9 years ago
PiRSquared17 f80ad39df0 Make filename truncation work with UTF-8 9 years ago
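One way to truncate a filename to a byte limit without splitting a multi-byte UTF-8 character is sketched below. Whether f80ad39df0 does it this way is not stated in the log; the helper name and the max_bytes parameter are hypothetical.

```python
def truncate_utf8(name, max_bytes):
    """Shorten name so that its UTF-8 encoding fits in max_bytes.

    A partial multi-byte sequence left at the cut point is dropped
    instead of producing an undecodable filename.
    """
    encoded = name.encode("utf-8")
    if len(encoded) <= max_bytes:
        return name
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```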
PiRSquared17 90bfd1400e Merge pull request #229 from PiRSquared17/fix-zwnbsp-bom
Strip ZWNBSP (U+FEFF) Byte-Order Mark from JSON/XML
9 years ago
PiRSquared17 fc276d525f Allow spaces before <mediawiki> tag. 9 years ago
PiRSquared17 1c820dafb7 Strip ZWNBSP (U+FEFF) Byte-Order Mark from JSON/XML 9 years ago
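A minimal sketch of stripping the U+FEFF mark before parsing, in the spirit of 1c820dafb7. The function name is hypothetical, and the commit itself may handle this differently.

```python
import json


def strip_bom(text):
    """Remove a leading ZWNBSP / byte-order mark (U+FEFF) from decoded text."""
    return text.lstrip("\ufeff")


# Decoded text that still carries the mark parses once it is stripped.
data = json.loads(strip_bom('\ufeff{"query": {}}'))

# For raw bytes, the standard "utf-8-sig" codec drops the mark while decoding.
decoded = b"\xef\xbb\xbf<mediawiki>".decode("utf-8-sig")
```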
Nemo bis 55e5888a00 Fix UnicodeDecodeError in resume: use kitchen 9 years ago
Federico Leva 14ce5f2c1b Resume and list titles without keeping everything in memory
Approach suggested by @makoshark, finally found the time to start
implementing it.
* Do not produce and save the titles list all at once. Instead, use
  the scraper and API as generators and save titles on the go. Also,
  try to start the generator from the appropriate title.
  For now the title sorting is not implemented. Pages will be in the
  order given by namespace ID, then page name.
* When resuming, read both the title list and the XML file from the
  end rather than the beginning. If the correct terminator is
  present, only one line needs to be read.
* In both cases, use a generator instead of a huge list in memory.
* Also truncate the resumed XML without writing it from scratch.
  For now using GNU ed: very compact, though shelling out is ugly.
  I gave up on using file.seek and file.truncate to avoid reading the
  whole file from the beginning or complicating reverse_readline()
  with more offset calculations.

This should avoid MemoryError in most cases.

Tested by running a dump over a 1.24 wiki with 11 pages: a complete
dump and a resumed dump from a dump interrupted with ctrl-c.
9 years ago
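The "read from the end" approach in 14ce5f2c1b can be illustrated with a generator that yields lines last to first. This is a sketch under assumptions (UTF-8 content, "\n" line endings, an arbitrary block size) and not the reverse_readline() from the repository.

```python
import os


def reverse_readline(path, block_size=8192):
    """Yield the lines of a text file from last to first.

    Only one block plus one partial line is held in memory at a time.
    """
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        leftover = b""
        at_end = True
        while pos > 0:
            step = min(block_size, pos)
            pos -= step
            f.seek(pos)
            chunk = f.read(step) + leftover
            if at_end and chunk.endswith(b"\n"):
                # Ignore the newline that terminates the file.
                chunk = chunk[:-1]
            at_end = False
            lines = chunk.split(b"\n")
            # The first piece may be the tail of a line that starts in an
            # earlier block, so carry it over instead of yielding it.
            leftover = lines.pop(0)
            for line in reversed(lines):
                yield line.decode("utf-8")
        if not at_end:
            yield leftover.decode("utf-8")
```

When resuming, next(reverse_readline(path), None) reads only the last line, which is enough to check for a terminator before deciding whether to read further back.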
Federico Leva 2537e9852e Make dumpgenerator.py 774: required by launcher.py 9 years ago
Federico Leva 79e2c5951f Fix API check if only index is passed
I forgot that the preceding point only extracts the api.php URL if
the "wiki" argument is passed to say it's a MediaWiki wiki (!).
9 years ago
Federico Leva bdc7c9bf06 Issue 26: Local "Special" namespace, actually limit replies
* For some reason, in a previous commit I had noticed that maxretries
  was not respected in getXMLPageCore, but I didn't fix it. Done now.
* If the "Special" namespace alias doesn't work, fetch the local one.
9 years ago
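What "respecting maxretries" means in bdc7c9bf06 can be sketched as a bounded retry loop. Only the maxretries name comes from the commit message; the function, the exception type and the linear back-off are assumptions, not the logic of getXMLPageCore.

```python
import time


def fetch_with_retries(fetch, maxretries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, giving up after maxretries attempts."""
    for attempt in range(1, maxretries + 1):
        try:
            return fetch()
        except OSError:
            if attempt == maxretries:
                # Out of attempts: let the caller see the last error.
                raise
            # Wait a little longer after each failed attempt.
            sleep(base_delay * attempt)
```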
Federico Leva 2f25e6b787 Make checkAPI() more readable and verbose
Also return the api URL we found.
9 years ago
Federico Leva 48ad3775fd Merge branch 'follow-redirects-api' of git://github.com/PiRSquared17/wikiteam into PiRSquared17-follow-redirects-api 9 years ago
nemobis 2284e3d55e Merge pull request #186 from PiRSquared17/update-headers
Preserve default headers, fixing openwrt test
9 years ago
PiRSquared17 5d23cb62f4 Merge pull request #219 from vadp/dir-fnames-unicode
convert images directory content to unicode when resuming download
9 years ago
PiRSquared17 d361477a46 Merge pull request #222 from vadp/img-desc-load-err
dumpgenerator: catch errors for missing image descriptions
9 years ago
Vadim Shlyakhov 4c1d104326 dumpgenerator: catch errors for missing image descriptions 9 years ago
PiRSquared17 b1ce45b170 Try using URL without index.php as index 9 years ago
PiRSquared17 9c3c992319 Follow API redirects 9 years ago
Vadim Shlyakhov f7e83a767a convert images directory content to unicode when resuming download 9 years ago
Benjamin Mako Hill d2adf5ce7c Merge branch 'master' of github.com:WikiTeam/wikiteam 9 years ago
Benjamin Mako Hill f85b4a3082 fixed bug with page missing exception code
My previous code broke the page missing detection code with two negative
outcomes:

- missing pages were not reported in the error log
- every missing page generated an extraneous "</page>" line in output which
  rendered dumps invalid

This patch improves the exception code in general and fixes both of these
issues.
9 years ago
PiRSquared17 9480834a37 Fix infinite images loop
Closes #205 (hopefully)
9 years ago
Benjamin Mako Hill eb8b44aef0 strip <sha1> tags returned under <page>
The Wikia API is exporting sha1 sums as part of the response for pages.
These are invalid XML and are causing dump parsing code (e.g.,
MediaWiki-Utilities) to fail.  Also, sha1 should belong to revisions, not pages, so
it's not entirely clear to me what this is referring to.
9 years ago
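A sketch of removing <sha1> elements that sit directly under <page> while leaving revision-level ones alone, as eb8b44aef0 describes. It assumes a small, namespace-free XML fragment parsed with ElementTree; the commit itself may well work on raw text, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def strip_page_level_sha1(xml_text):
    """Remove <sha1> children of <page>; keep <sha1> inside <revision>."""
    root = ET.fromstring(xml_text)
    # Materialize the list first so the tree is not modified mid-iteration.
    for page in list(root.iter("page")):
        for sha1 in page.findall("sha1"):  # direct children only
            page.remove(sha1)
    return ET.tostring(root, encoding="unicode")
```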