
261 Commits (632b99ea530c4253520a4f1824909dec7746132d)

Author SHA1 Message Date
Federico Leva 632b99ea53 Merge branch '2015/iterators' of https://github.com/nemobis/wikiteam into nemobis-2015/iterators 9 years ago
Nemo bis 55e5888a00 Fix UnicodeDecodeError in resume: use kitchen 9 years ago
Federico Leva 14ce5f2c1b Resume and list titles without keeping everything in memory
Approach suggested by @makoshark, finally found the time to start
implementing it.
* Do not produce and save the titles list all at once. Instead, use
  the scraper and API as generators and save titles on the go. Also,
  try to start the generator from the appropriate title.
  For now the title sorting is not implemented. Pages will be in the
  order given by namespace ID, then page name.
* When resuming, read both the title list and the XML file from the
  end rather than the beginning. If the correct terminator is
  present, only one line needs to be read.
* In both cases, use a generator instead of a huge list in memory.
* Also truncate the resumed XML without writing it from scratch.
  For now using GNU ed: very compact, though shelling out is ugly.
  I gave up on using file.seek and file.truncate to avoid reading the
  whole file from the beginning or complicating reverse_readline()
  with more offset calculations.

This should avoid MemoryError in most cases.

Tested by running a dump over a 1.24 wiki with 11 pages: a complete
dump and a resumed dump from a dump interrupted with ctrl-c.
9 years ago
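
The resume strategy described in commit 14ce5f2c1b (read the title list and the XML file from the end, through a generator) can be sketched roughly as below. This is a minimal illustration, not the code from the commit: the signature of `reverse_readline()`, the `buf_size` parameter and the `--END--` terminator used in the usage note are assumptions made for the example.

```python
import os


def reverse_readline(path, buf_size=8192):
    """Yield the lines of a text file from last to first.

    Hypothetical sketch: only buf_size bytes are read at a time, so the
    whole file never has to sit in memory.
    """
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        position = fh.tell()
        tail = b""    # partial line carried over from the chunk read before
        first = True  # True until the final line of the file has been handled
        while position > 0:
            read_size = min(buf_size, position)
            position -= read_size
            fh.seek(position)
            chunk = fh.read(read_size) + tail
            lines = chunk.split(b"\n")
            # The first piece may be an incomplete line: keep it for later.
            tail = lines.pop(0)
            for line in reversed(lines):
                if first:
                    first = False
                    if line == b"":
                        # The file ends with a newline: no empty last line.
                        continue
                yield line.decode("utf-8")
        if tail or not first:
            yield tail.decode("utf-8")
```

With a title list whose last line is a terminator (assumed here to be `--END--`), a caller would only need `next(reverse_readline(path))` to decide whether the list is complete, which matches the "only one line needs to be read" remark in the commit message.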
Federico Leva 2537e9852e Make dumpgenerator.py 774: required by launcher.py 9 years ago
Federico Leva 79e2c5951f Fix API check if only index is passed
I forgot that the preceding point only extracts the api.php URL if
the "wiki" argument is passed to say it's a MediaWiki wiki (!).
9 years ago
Federico Leva bdc7c9bf06 Issue 26: Local "Special" namespace, actually limit replies
* For some reason, in a previous commit I had noticed that maxretries
  was not respected in getXMLPageCore, but I didn't fix it. Done now.
* If the "Special" namespace alias doesn't work, fetch the local one.
9 years ago
Federico Leva 2f25e6b787 Make checkAPI() more readable and verbose
Also return the api URL we found.
9 years ago
Federico Leva 48ad3775fd Merge branch 'follow-redirects-api' of git://github.com/PiRSquared17/wikiteam into PiRSquared17-follow-redirects-api 9 years ago
nemobis 2284e3d55e Merge pull request #186 from PiRSquared17/update-headers
Preserve default headers, fixing openwrt test
9 years ago
PiRSquared17 5d23cb62f4 Merge pull request #219 from vadp/dir-fnames-unicode
convert images directory content to unicode when resuming download
9 years ago
PiRSquared17 d361477a46 Merge pull request #222 from vadp/img-desc-load-err
dumpgenerator: catch errors for missing image descriptions
9 years ago
Vadim Shlyakhov 4c1d104326 dumpgenerator: catch errors for missing image descriptions 9 years ago
PiRSquared17 b1ce45b170 Try using URL without index.php as index 9 years ago
PiRSquared17 9c3c992319 Follow API redirects 9 years ago
Vadim Shlyakhov f7e83a767a convert images directory content to unicode when resuming download 9 years ago
Benjamin Mako Hill d2adf5ce7c Merge branch 'master' of github.com:WikiTeam/wikiteam 9 years ago
Benjamin Mako Hill f85b4a3082 fixed bug with page missing exception code
My previous code broke the page missing detection code with two negative
outcomes:

- missing pages were not reported in the error log
- every missing page generated an extraneous "</page>" line in output which
  rendered dumps invalid

This patch improves the exception code in general and fixes both of these
issues.
9 years ago
PiRSquared17 9480834a37 Fix infinite images loop
Closes #205 (hopefully)
9 years ago
Benjamin Mako Hill eb8b44aef0 strip <sha1> tags returned under <page>
The Wikia API is exporting sha1 sums as part of the response for pages.
These are invalid XML and are causing dump parsing code (e.g.,
MediaWiki-Utilities) to fail.  Also, sha1 sums should belong to revisions, not
pages, so it's not entirely clear to me what this is referring to.
10 years ago
Benjamin Mako Hill 145b2eaaf4 changed getXMLPage() into a generator
The program tended to run out of memory when processing very large pages (i.e.,
pages with extremely large numbers of revisions or pages with large numbers of
very large revisions). This mitigates the problem by changing getXMLPage() into
a generator which allows us to write pages after each request to the API.

This required changes to the getXMLPage() function and also changes to other
parts of the code that called it.

Additionally, when the function was called, its text was checked in several
ways. This required a few changes, including a running tally of revisions
instead of a post hoc check, and it required error checking being moved into an
Exception rather than just an if statement that looked at the final result.
10 years ago
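
The generator approach from commit 145b2eaaf4 might look roughly like the sketch below. It is an illustration under assumptions, not the project's code: `get_xml_page`, `write_page`, `PageMissingError` and the `fetch_batch(title, offset)` callable (returning a list of revision strings and the next offset, or `None` when done) are hypothetical names introduced here.

```python
class PageMissingError(Exception):
    """Raised when the wiki returns no revisions at all for a title."""


def get_xml_page(title, fetch_batch):
    """Yield revision chunks one API request at a time.

    fetch_batch is a hypothetical callable standing in for the API request.
    """
    offset = None
    total = 0
    while True:
        revisions, offset = fetch_batch(title, offset)
        if total == 0 and not revisions:
            # Fail before anything is yielded, so the caller never writes
            # a stray closing tag for a page that does not exist.
            raise PageMissingError(title)
        for revision in revisions:
            total += 1
            yield revision
        if offset is None:
            break


def write_page(out, title, fetch_batch):
    """Write each chunk as soon as it arrives and keep a running tally."""
    count = 0
    for chunk in get_xml_page(title, fetch_batch):
        out.write(chunk + "\n")
        count += 1
    return count
```

Because each chunk is written as soon as it is produced, memory use is bounded by one batch rather than by the full page history, and the missing-page case surfaces as an exception instead of a check on the final result.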
nemobis b3ef165529 Merge pull request #194 from mrshu/mrshu/dumpgenerator-pep8fied
dumpgenerator: AutoPEP8-fied
10 years ago
mr.Shu 04446a40a5 dumpgenerator: AutoPEP8-fied
* Used autopep8 to make sure the code looks nice and is actually PEP8
  compliant.

Signed-off-by: mr.Shu <mr@shu.io>
10 years ago
nemobis e0f8e36bf4 Merge pull request #190 from PiRSquared17/api-allpages-disabled
Fallback to getPageTitlesScraper() if API allpages disabled
10 years ago
PiRSquared17 757019521a Fallback to scraper if API allpages disabled 10 years ago
PiRSquared17 4b3c862a58 Comment debugging print, fix test 10 years ago
PiRSquared17 7a1db0525b Add more wiki engines to getWikiEngine 10 years ago
PiRSquared17 4ceb9ad72e Preserve default headers, fixing openwrt test 10 years ago
PiRSquared17 b4818d2985 Avoid infinite loop in getImageNamesScraper 10 years ago
nemobis 8a9b50b51d Merge pull request #183 from PiRSquared17/patch-7
Retry on ConnectionError in getXMLPageCore
10 years ago
nemobis 19c48d3dd0 Merge pull request #180 from PiRSquared17/patch-2
Get as much information from siteinfo as possible
10 years ago
Pi R. Squared f7187b7048 Retry on ConnectionError in getXMLPageCore
Previously it just gave a fatal error.
10 years ago
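
A bounded retry of the kind described in commit f7187b7048 (and the `maxretries` fix in bdc7c9bf06) could be sketched as follows. This is only an assumption-laden illustration: `fetch_with_retries`, its parameters and the use of Python's built-in `ConnectionError` are choices made for the example, not taken from getXMLPageCore.

```python
import time


def fetch_with_retries(fetch, max_retries=5, delay=0.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_retries attempts have failed.

    fetch is a hypothetical zero-argument callable that performs the request.
    """
    attempt = 0
    while True:
        try:
            return fetch()
        except ConnectionError:
            attempt += 1
            if attempt >= max_retries:
                # Respect the limit: give up instead of looping forever.
                raise
            # Wait a little longer after each failed attempt.
            sleep(delay * attempt)
```

Passing `sleep` as a parameter is a design convenience for testing; a caller would normally leave it at the default.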
Pi R. Squared f31e4e6451 Dict not hashable, also not needed
Quick fix.
10 years ago
Pi R. Squared 399f609d70 AllPages API hack for old versions of MediaWiki
New API format: http://www.mediawiki.org/w/api.php?action=query&list=allpages&apnamespace=0&apfrom=!&format=json&aplimit=500
Old API format: http://wiki.damirsystems.com/api.php?action=query&list=allpages&apnamespace=0&apfrom=!&format=json
10 years ago
Pi R. Squared 498b64da3f Try getting index.php from siteinfo API
Fixes #49
10 years ago
Pi R. Squared ff0d230d08 Get as much information from siteinfo as possible
Properly fixes #74.

Algorithm:
1. Try all siteinfo props. If this gives an error, continue. Otherwise, stop.
2. Try MediaWiki 1.11-1.12 siteinfo props. If this gives an error, continue. Otherwise, stop.
3. Try minimal siteinfo props. Stop.
Not using sishowalldb=1 to avoid possible error (by default), since this data is of little use anyway.
10 years ago
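
The three-step fallback listed in commit ff0d230d08 can be sketched as below. The property lists are illustrative placeholders, not the lists used in the commit, and `get_siteinfo`, `SiteinfoError` and the `query(props)` callable are hypothetical names introduced for this example.

```python
class SiteinfoError(Exception):
    """Raised by the hypothetical query() when the API rejects the props."""


# Illustrative property sets, ordered from most to least demanding.
PROP_SETS = [
    ["general", "namespaces", "namespacealiases", "statistics",
     "interwikimap", "usergroups", "extensions"],           # step 1: everything
    ["general", "namespaces", "statistics", "interwikimap"],  # step 2: older wikis
    ["general", "namespaces"],                                # step 3: minimal
]


def get_siteinfo(query, prop_sets=PROP_SETS):
    """Try each property set in turn and return the first successful result."""
    for props in prop_sets[:-1]:
        try:
            return query(props)
        except SiteinfoError:
            continue
    # Last step: try the minimal set and stop, whatever the outcome.
    return query(prop_sets[-1])
```

The final set is deliberately not wrapped in a `try`, mirroring step 3 of the algorithm: if even the minimal request fails, the error is left for the caller to handle.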
Pi R. Squared 322604cc23 Encode title using UTF-8 before printing
This fixes #170 and closes #174.
10 years ago
nemobis 11368310ee Merge pull request #173 from nemobis/issue/131
Fix #131: ValueError: No JSON object could be decoded
10 years ago
Nemo bis 026c2a9a25 Issue 131: ValueError: No JSON object could be decoded 10 years ago
Sean Yeh 38e73c1cf7 Fix argument parsing to accept delay as a number 10 years ago
Emilio J. Rodríguez-Posada a2efca27b8 improving API/Index calculate 10 years ago
Emilio J. Rodríguez-Posada 4bc43a1c0f improved help messages 10 years ago
Emilio J. Rodríguez-Posada 51806f5a3d fixed #160; improved args parsing and --help; improved API/Index estimate from URL; 10 years ago
Emilio J. Rodríguez-Posada dd7df0cc01 Merge branch 'master' of https://github.com/WikiTeam/wikiteam 10 years ago
Emilio J. Rodríguez-Posada f3b388fc79 a first approach to auto-detect API/Index.php using URL to the Main_Page 10 years ago
Erkan Yilmaz 44b80ceb88 fix link for tutorial 10 years ago
balr0g 8485a5004d Pass session 10 years ago
balr0g fd6ea19b4b config['api'] is set but empty; properly handle this 10 years ago
nemobis 1ff96238eb Denote as alpha until revamp is tested
Per emijrp who asked not to run dumps with this, at https://github.com/WikiTeam/wikiteam/issues/104#issuecomment-48039143
Currently proposed things to fix or check: https://github.com/WikiTeam/wikiteam/issues?milestone=1&state=open
10 years ago
Emilio J. Rodríguez-Posada 89e3c3e462 standardize getImage* function names 10 years ago
Emilio J. Rodríguez-Posada aaa1822759 improving image list downloader 10 years ago