wikiteam

Commit Graph

Author	SHA1	Message	Date
Daniel Oaks	376e8a11a3	Avoid out-of-memory error in two extra places	9 years ago
Tim Sheerman-Chase	877b736cd2	Merge branch 'retry' of https://github.com/TimSC/wikiteam into retry	9 years ago
Tim Sheerman-Chase	6716ceab32	Fix tests	9 years ago
Tim Sheerman-Chase	5cb2ecb6b5	Attempting to fix missing config in tests	9 years ago
Tim	93bc29f2d7	Fix syntax errors	9 years ago
Tim	d5a1ed2d5a	Fix indentation, use classic string formating	9 years ago
Tim Sheerman-Chase	8380af5f24	Improve retry logic	9 years ago
PiRSquared17	fadd7134f7	What I meant to do, ugh	10 years ago
PiRSquared17	1b2e83aa8c	Fix minor error with normpath call	10 years ago
PiRSquared17	5db9a1c7f3	Normalize path/foo/ to path/foo, so -2, etc. work (fixes #244 )	10 years ago
Federico Leva	2b78bfb795	Merge branch '2015/iterators' of git://github.com/nemobis/wikiteam into nemobis-2015/iterators Conflicts: requirements.txt	10 years ago
Federico Leva	d4fd745498	Actually allow resuming huge or broken XML dumps * Log "XML export on this wiki is broken, quitting." to the error file so that grepping reveals which dumps were interrupted so. * Automatically reduce export size for a page when downloading the entire history at once results in a MemoryError. * Truncate the file with a pythonic method (.seek and .truncate) while reading from the end, by making reverse_readline() a weird hybrid to avoid an actual coroutine.	10 years ago
Federico Leva	9168a66a54	logerror() wants unicode, but readTitles etc. give bytes Fixes #239.	10 years ago
Federico Leva	632b99ea53	Merge branch '2015/iterators' of https://github.com/nemobis/wikiteam into nemobis-2015/iterators	10 years ago
nemobis	ff2cdfa1cd	Merge pull request #236 from PiRSquared17/fix-server-check-api Catch KeyError to fix server check	10 years ago
nemobis	0b25951ab1	Merge pull request #224 from nemobis/2015/issue26 Issue #26: Local "Special" namespace, actually limit replies	10 years ago
PiRSquared17	03db166718	Catch KeyError to fix server check	10 years ago
PiRSquared17	f80ad39df0	Make filename truncation work with UTF-8	10 years ago
PiRSquared17	90bfd1400e	Merge pull request #229 from PiRSquared17/fix-zwnbsp-bom Strip ZWNBSP (U+FEFF) Byte-Order Mark from JSON/XML	10 years ago
PiRSquared17	fc276d525f	Allow spaces before <mediawiki> tag.	10 years ago
PiRSquared17	1c820dafb7	Strip ZWNBSP (U+FEFF) Byte-Order Mark from JSON/XML	10 years ago
Nemo bis	55e5888a00	Fix UnicodeDecodeError in resume: use kitchen	10 years ago
Federico Leva	14ce5f2c1b	Resume and list titles without keeping everything in memory Approach suggested by @makoshark, finally found the time to start implementing it. * Do not produce and save the titles list all at once. Instead, use the scraper and API as generators and save titles on the go. Also, try to start the generator from the appropriate title. For now the title sorting is not implemented. Pages will be in the order given by namespace ID, then page name. * When resuming, read both the title list and the XML file from the end rather than the beginning. If the correct terminator is present, only one line needs to be read. * In both cases, use a generator instead of a huge list in memory. * Also truncate the resumed XML without writing it from scratch. For now using GNU ed: very compact, though shelling out is ugly. I gave up on using file.seek and file.truncate to avoid reading the whole file from the beginning or complicating reverse_readline() with more offset calculations. This should avoid MemoryError in most cases. Tested by running a dump over a 1.24 wiki with 11 pages: a complete dump and a resumed dump from a dump interrupted with ctrl-c.	10 years ago
Federico Leva	2537e9852e	Make dumpgenerator.py 774: required by launcher.py	10 years ago
Federico Leva	79e2c5951f	Fix API check if only index is passed I forgot that the preceding point only extracts the api.php URL if the "wiki" argument is passed to say it's a MediaWiki wiki (!).	10 years ago
Federico Leva	bdc7c9bf06	Issue 26: Local "Special" namespace, actually limit replies * For some reason, in a previous commit I had noticed that maxretries was not respected in getXMLPageCore, but I didn't fix it. Done now. * If the "Special" namespace alias doesn't work, fetch the local one.	10 years ago
Federico Leva	2f25e6b787	Make checkAPI() more readable and verbose Also return the api URL we found.	10 years ago
Federico Leva	48ad3775fd	Merge branch 'follow-redirects-api' of git://github.com/PiRSquared17/wikiteam into PiRSquared17-follow-redirects-api	10 years ago
nemobis	2284e3d55e	Merge pull request #186 from PiRSquared17/update-headers Preserve default headers, fixing openwrt test	10 years ago
PiRSquared17	5d23cb62f4	Merge pull request #219 from vadp/dir-fnames-unicode convert images directory content to unicode when resuming download	10 years ago
PiRSquared17	d361477a46	Merge pull request #222 from vadp/img-desc-load-err dumpgenerator: catch errors for missing image descriptions	10 years ago
Vadim Shlyakhov	4c1d104326	dumpgenerator: catch errors for missing image descriptions	10 years ago
PiRSquared17	b1ce45b170	Try using URL without index.php as index	10 years ago
PiRSquared17	9c3c992319	Follow API redirects	10 years ago
Vadim Shlyakhov	f7e83a767a	convert images directory content to unicode when resuming download	10 years ago
Benjamin Mako Hill	d2adf5ce7c	Merge branch 'master' of github.com:WikiTeam/wikiteam	10 years ago
Benjamin Mako Hill	f85b4a3082	fixed bug with page missing exception code My previous code broke the page missing detection code with two negative outcomes: - missing pages were not reported in the error log - ever missing page generated an extraneous "</page>" line in output which rendered dumps invalid This patch improves the exception code in general and fixes both of these issues.	10 years ago
PiRSquared17	9480834a37	Fix infinite images loop Closes #205 (hopefully)	10 years ago
Benjamin Mako Hill	eb8b44aef0	strip <sha1> tags returned under <page> The Wikia API is exporting sha1 sums as part of the response for pages. These are invalid XML and are causing dump parsing code (e.g., MediaWiki-Utilities) to fail. Also, sha1 should be revisions, not pages so it's not entirely clear to me what this is referring to.	10 years ago
Benjamin Mako Hill	145b2eaaf4	changed getXMLPage() into a generator The program tended to run out of memory when processing very large pages (i.e., pages with extremely large numbers of revisions or pages with large numbers of very large revisions). This mitigates the problem by changing getXMLPage() into a generator which allows us to write pages after each request to the API. This requied changes to the getXMLPage() function and also changes to other parts of the code that called it. Additionally, when the function was called, it's text was checked in several ways. This required a few changes including a running tally of revisions instead of post hoc check and it required error checking being moved into a Exception rather than just an if statement that looked at the final result.	10 years ago
nemobis	b3ef165529	Merge pull request #194 from mrshu/mrshu/dumpgenerator-pep8fied dumpgenerator: AutoPEP8-fied	10 years ago
mr.Shu	04446a40a5	dumpgenerator: AutoPEP8-fied * Used autopep8 to made sure the code looks nice and is actually PEP8 compliant. Signed-off-by: mr.Shu <mr@shu.io>	10 years ago
nemobis	e0f8e36bf4	Merge pull request #190 from PiRSquared17/api-allpages-disabled Fallback to getPageTitlesScraper() if API allpages disabled	10 years ago
PiRSquared17	757019521a	Fallback to scraper if API allpages disabled	10 years ago
PiRSquared17	4b3c862a58	Comment debugging print, fix test	10 years ago
PiRSquared17	7a1db0525b	Add more wiki engines to getWikiEngine	10 years ago
PiRSquared17	4ceb9ad72e	Preserve default headers, fixing openwrt test	10 years ago
PiRSquared17	b4818d2985	Avoid infinite loop in getImageNamesScraper	10 years ago
nemobis	8a9b50b51d	Merge pull request #183 from PiRSquared17/patch-7 Retry on ConnectionError in getXMLPageCore	10 years ago
nemobis	19c48d3dd0	Merge pull request #180 from PiRSquared17/patch-2 Get as much information from siteinfo as possible	10 years ago

1 2 3 4 5 ...

281 Commits (bbd999da33863fe14f89137244c7af7c3ad018db)