Commit Graph

241 Commits (145b2eaaf40fb4b422de4d0f32817c2e87ee1de9)

Author SHA1 Message Date
Benjamin Mako Hill 145b2eaaf4 changed getXMLPage() into a generator
The program tended to run out of memory when processing very large pages (i.e.,
pages with extremely large numbers of revisions or pages with large numbers of
very large revisions). This mitigates the problem by changing getXMLPage() into
a generator which allows us to write pages after each request to the API.

This required changes to the getXMLPage() function and also changes to other
parts of the code that called it.

Additionally, when the function was called, its text was checked in several
ways. This required a few changes, including a running tally of revisions
instead of a post hoc check, and it required moving the error checking into an
Exception rather than just an if statement that looked at the final result.
9 years ago
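A minimal sketch of the generator pattern this commit describes, not the actual dumpgenerator code. The names iter_page_xml, write_page and EmptyPageError are hypothetical, and the batches of revision text stand in for successive API responses. It shows the page being written chunk by chunk, with a running revision tally and the error raised as an exception instead of being checked on a finished string.

```python
import io


class EmptyPageError(Exception):
    """Hypothetical error: no revisions were returned for a page."""


def iter_page_xml(title, revision_batches):
    """Yield a page's XML in chunks instead of building one large string.

    revision_batches is an iterable of lists of revision text; each list
    stands in for the result of one API request (an assumption).
    """
    count = 0  # running tally, replaces a post hoc check on the full text
    yield "<page>\n<title>%s</title>\n" % title
    for batch in revision_batches:
        for text in batch:
            count += 1
            yield "<revision>%s</revision>\n" % text
    if count == 0:
        # Raised after the header chunk has already been yielded.
        raise EmptyPageError(title)
    yield "</page>\n"


def write_page(out, title, revision_batches):
    """Write each chunk as soon as it is produced, keeping memory use flat."""
    for chunk in iter_page_xml(title, revision_batches):
        out.write(chunk)


buf = io.StringIO()
write_page(buf, "Main_Page", [["first", "second"], ["third"]])
```

Because the caller consumes one chunk at a time, only the current batch needs to be held in memory, which is the mitigation the commit message describes.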
nemobis b3ef165529 Merge pull request #194 from mrshu/mrshu/dumpgenerator-pep8fied
dumpgenerator: AutoPEP8-fied
10 years ago
mr.Shu 04446a40a5 dumpgenerator: AutoPEP8-fied
* Used autopep8 to make sure the code looks nice and is actually PEP8
  compliant.

Signed-off-by: mr.Shu <mr@shu.io>
10 years ago
nemobis e0f8e36bf4 Merge pull request #190 from PiRSquared17/api-allpages-disabled
Fallback to getPageTitlesScraper() if API allpages disabled
10 years ago
PiRSquared17 757019521a Fallback to scraper if API allpages disabled 10 years ago
PiRSquared17 4b3c862a58 Comment debugging print, fix test 10 years ago
PiRSquared17 7a1db0525b Add more wiki engines to getWikiEngine 10 years ago
PiRSquared17 b4818d2985 Avoid infinite loop in getImageNamesScraper 10 years ago
nemobis 8a9b50b51d Merge pull request #183 from PiRSquared17/patch-7
Retry on ConnectionError in getXMLPageCore
10 years ago
nemobis 19c48d3dd0 Merge pull request #180 from PiRSquared17/patch-2
Get as much information from siteinfo as possible
10 years ago
Pi R. Squared f7187b7048 Retry on ConnectionError in getXMLPageCore
Previously it just gave a fatal error.
10 years ago
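A rough sketch of retrying on a connection failure instead of failing fatally, as this commit describes. It is not the getXMLPageCore implementation: fetch_with_retry and its parameters are hypothetical, and the built-in ConnectionError is used so the example runs without a network. With the Requests library, requests.exceptions.ConnectionError could be passed in retry_on instead.

```python
import time


def fetch_with_retry(fetch, retries=5, delay=0.0, retry_on=(ConnectionError,)):
    """Call fetch() until it succeeds or the retry budget is used up.

    fetch is any zero-argument callable (hypothetical stand-in for one
    HTTP request). The wait grows linearly with the attempt number.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return fetch()
        except retry_on as exc:
            last_error = exc
            time.sleep(delay * (attempt + 1))
    # Every attempt failed: surface the last connection error to the caller.
    raise last_error
```

The caller still sees the error if the wiki stays unreachable, but a single dropped connection no longer aborts a long dump.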
Pi R. Squared f31e4e6451 Dict not hashable, also not needed
Quick fix.
10 years ago
Pi R. Squared 399f609d70 AllPages API hack for old versions of MediaWiki
New API format: http://www.mediawiki.org/w/api.php?action=query&list=allpages&apnamespace=0&apfrom=!&format=json&aplimit=500
Old API format: http://wiki.damirsystems.com/api.php?action=query&list=allpages&apnamespace=0&apfrom=!&format=json
10 years ago
Pi R. Squared 498b64da3f Try getting index.php from siteinfo API
Fixes #49
10 years ago
Pi R. Squared ff0d230d08 Get as much information from siteinfo as possible
Properly fixes #74.

Algorithm:
1. Try all siteinfo props. If this gives an error, continue. Otherwise, stop.
2. Try MediaWiki 1.11-1.12 siteinfo props. If this gives an error, continue. Otherwise, stop.
3. Try minimal siteinfo props. Stop.
Not using sishowalldb=1 to avoid possible error (by default), since this data is of little use anyway.
10 years ago
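The three-step fallback in this commit can be sketched roughly as follows. This is an illustration under assumptions, not the committed code: get_siteinfo and the query callable are hypothetical, and the exact siprop lists in PROP_TIERS are illustrative rather than the ones the project uses.

```python
# Property sets from most to least demanding (illustrative values).
PROP_TIERS = [
    "general|namespaces|statistics|dbrepllag|interwikimap"
    "|namespacealiases|specialpagealiases|usergroups|extensions",
    "general|namespaces|statistics|dbrepllag|interwikimap",
    "general|namespaces",
]


def get_siteinfo(query):
    """Try each property set in turn and stop at the first without an error.

    query is a hypothetical callable that takes a dict of API parameters
    and returns the decoded JSON response as a dict. The result of the
    last (minimal) tier is returned as-is, matching step 3 above.
    """
    result = {}
    for siprop in PROP_TIERS:
        params = {
            "action": "query",
            "meta": "siteinfo",
            "siprop": siprop,
            "format": "json",
        }
        result = query(params)
        if "error" not in result:
            break
    return result
```

Passing the request function in as a parameter keeps the fallback logic testable without a live wiki; sishowalldb is left out, as in the commit.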
Pi R. Squared 322604cc23 Encode title using UTF-8 before printing
This fixes #170 and closes #174.
10 years ago
nemobis 11368310ee Merge pull request #173 from nemobis/issue/131
Fix #131: ValueError: No JSON object could be decoded
10 years ago
Nemo bis 026c2a9a25 Issue 131: ValueError: No JSON object could be decoded 10 years ago
Sean Yeh 38e73c1cf7 Fix argument parsing to accept delay as a number 10 years ago
Emilio J. Rodríguez-Posada a2efca27b8 improving API/Index calculate 10 years ago
Emilio J. Rodríguez-Posada 4bc43a1c0f improved help messages 10 years ago
Emilio J. Rodríguez-Posada 51806f5a3d fixed #160; improved args parsing and --help; improved API/Index estimate from URL; 10 years ago
Emilio J. Rodríguez-Posada dd7df0cc01 Merge branch 'master' of https://github.com/WikiTeam/wikiteam 10 years ago
Emilio J. Rodríguez-Posada f3b388fc79 a first approach to auto-detect API/Index.php using URL to the Main_Page 10 years ago
Erkan Yilmaz 44b80ceb88 fix link for tutorial 10 years ago
balr0g 8485a5004d Pass session 10 years ago
balr0g fd6ea19b4b config['api'] is set but empty; properly handle this 10 years ago
nemobis 1ff96238eb Denote as alpha until revamp is tested
Per emijrp who asked not to run dumps with this, at https://github.com/WikiTeam/wikiteam/issues/104#issuecomment-48039143
Currently proposed things to fix or check: https://github.com/WikiTeam/wikiteam/issues?milestone=1&state=open
10 years ago
Emilio J. Rodríguez-Posada 89e3c3e462 standardize getImage* function names 10 years ago
Emilio J. Rodríguez-Posada aaa1822759 improving image list downloader 10 years ago
Emilio J. Rodríguez-Posada 88c9468c0e improving image list downloader 10 years ago
balr0g 3929e4eb9c Cleanups and error fixes suggested by flake8 (pep8 + pyflakes) 10 years ago
Emilio J. Rodríguez-Posada c07b527e5d adding session to getWikiEngine() 10 years ago
Emilio J. Rodríguez-Posada 30c153ce1f chg: using 'with open' for files 10 years ago
balr0g 9aa3c4a0e1 Removed all traces of urllib except for encode/decode; more bugs fixed. 10 years ago
balr0g c8e11a949b Initial port to Requests 10 years ago
Emilio J. Rodríguez-Posada 9553e3550c adding wiki engine detector 10 years ago
Emilio J. Rodríguez-Posada eb97cf1adf version 0.2.2 and tiny bits in --help 10 years ago
balr0g 50b011f90d Initial port to argparse 10 years ago
Emilio J. Rodríguez-Posada 568deef081 adding comments for clarification 10 years ago
Emilio J. Rodríguez-Posada d4eed1f738 fixing #127 and #134, now works with APIs that return a 'name' field for images and those that don't (in this case we unquote over ascii); also fixing a bug that re-downloaded the image list when it was completed previously 10 years ago
Emilio J. Rodríguez-Posada 005de23c1d adding gzip to siteinfo downloader 10 years ago
Emilio J. Rodríguez-Posada d79ea64d41 fixing issue #97 pretty siteinfo json saving, indenting 4 chars 10 years ago
Emilio J. Rodríguez-Posada 3854a344fe Merge branch 'master' of https://github.com/WikiTeam/wikiteam 10 years ago
Emilio J. Rodríguez-Posada 1c1f0dbb86 replacing XML with JSON in image downloading 10 years ago
balr0g 481323c7f7 Don't try to download sites with disabled API 10 years ago
nemobis 1933db8a94 Merge pull request #124 from balr0g/scraper-unicode-title-fix
Fix scraper for sites with Unicode titles
10 years ago
balr0g 62be069026 Fix scraper for sites with Unicode titles 10 years ago
nemobis 62d961fa97 Fix typo, unused variable spotted by balrog 10 years ago
nemobis 95bc2dec38 Link GitHub issue tracker 10 years ago