Approach suggested by @makoshark; I finally found the time to start
implementing it.
* Do not produce and save the titles list all at once. Instead, use
the scraper and API as generators and save titles on the go. Also,
try to start the generator from the appropriate title.
For now, title sorting is not implemented. Pages will be in the
order given by namespace ID, then page name.
* When resuming, read both the title list and the XML file from the
end rather than the beginning. If the correct terminator is
present, only one line needs to be read (a rough sketch follows at
the end of this message).
* In both cases, use a generator instead of a huge list in memory.
* Also truncate the resumed XML without writing it from scratch.
For now this uses GNU ed: very compact, though shelling out is ugly.
I gave up on using file.seek and file.truncate, because avoiding a
read of the whole file from the beginning would have meant
complicating reverse_readline() with more offset calculations.
This should avoid MemoryError in most cases.
Tested by running a dump over a 1.24 wiki with 11 pages: a complete
dump, and a resumed dump after interrupting a run with ctrl-c.
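As a rough sketch of the resume idea (reverse_readline() is the helper
mentioned above, but the body below and the '--END--' terminator marker are
illustrative assumptions, not a copy of the actual code):

    import os

    def reverse_readline(path, buf_size=8192):
        """Yield the lines of a file from the end, without loading it whole."""
        with open(path, 'rb') as fh:
            fh.seek(0, os.SEEK_END)
            position = fh.tell()
            leftover = b''
            while position > 0:
                read_size = min(buf_size, position)
                position -= read_size
                fh.seek(position)
                chunk = fh.read(read_size) + leftover
                lines = chunk.split(b'\n')
                leftover = lines.pop(0)   # may be a partial line; keep for next pass
                for line in reversed(lines):
                    if line:
                        yield line.decode('utf-8')
            if leftover:
                yield leftover.decode('utf-8')

    def last_saved_title(titles_path):
        """Return the last title written, or None if nothing needs resuming."""
        for line in reverse_readline(titles_path):
            if line == '--END--':   # hypothetical terminator: list is complete
                return None
            return line             # last progress; restart the generator here
        return None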
* In a previous commit I had noticed that maxretries was not respected
in getXMLPageCore, but for some reason I didn't fix it. Done now.
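For illustration only (not the actual getXMLPageCore body), a retry loop that
really honors a maxretries limit could look like this; the names here are
assumptions:

    import time

    def fetch_with_retries(do_request, maxretries=5, delay=10):
        """Call do_request() until it succeeds or maxretries attempts are used."""
        for attempt in range(1, maxretries + 1):
            try:
                return do_request()
            except Exception:
                if attempt == maxretries:
                    raise           # stop retrying instead of looping forever
                time.sleep(delay)   # back off before the next attempt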
* If the "Special" namespace alias doesn't work, fetch the local one.
My previous code broke the missing-page detection, with two negative
outcomes:
- missing pages were not reported in the error log
- every missing page generated an extraneous "</page>" line in the
output, which rendered dumps invalid
This patch improves the exception handling in general and fixes both of
these issues.
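A sketch of the shape of that fix, with a dedicated exception for missing
pages so the caller logs the error instead of emitting a stray "</page>"; the
class and helper names below are my own, not necessarily what the patch uses:

    class PageMissingError(Exception):
        def __init__(self, title, xml):
            self.title = title
            self.xml = xml

    def dump_page(title, get_xml_page, xmlfile, errorsfile):
        """Write one page to the dump; on a missing page, log it and write
        nothing, so no unbalanced </page> ends up in the output."""
        try:
            for chunk in get_xml_page(title):   # assumed generator of XML chunks
                xmlfile.write(chunk)
        except PageMissingError:
            errorsfile.write('error while retrieving the page %s: missing\n' % title)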
Bitrot seems to have gotten the best of this script and it sounds like it
hasn't been used. This at least gets it to work by:
- finding both the .gz and the .7z dumps
- parsing the new date format in the HTML
- finding dumps in the correct place
- moving all chatter to stderr instead of stdout
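For instance, the "chatter to stderr" and "both .gz and .7z" parts boil down
to patterns like the following (illustrative only; the paths and messages are
made up):

    from __future__ import print_function
    import glob
    import os
    import sys

    dump_dir = '.'  # hypothetical location of the downloaded dumps

    # status messages go to stderr so stdout stays clean for real output
    print('Looking for existing dumps...', file=sys.stderr)

    # pick up both compression formats
    dumps = glob.glob(os.path.join(dump_dir, '*.gz')) + \
            glob.glob(os.path.join(dump_dir, '*.7z'))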
The test is failing. https://travis-ci.org/WikiTeam/wikiteam/builds/50102997#L546
Might be our fault, but they just updated their code:
Tyrian – (f313f23) 12:47, 23 January 2015 GPLv3+ Gentoo's new web theme ported to MediaWiki. Alex Legler
I don't think testing screen scraping against a theme used only by Gentoo makes much sense for us.
The Wikia API is exporting sha1 sums as part of the response for pages.
These are not valid in the export XML and are causing dump parsing code (e.g.,
MediaWiki-Utilities) to fail. Also, sha1 should belong to revisions, not pages,
so it's not entirely clear to me what this value is referring to.
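One hedged way to cope with it is to drop <sha1> elements that sit outside any
<revision> before writing the page; this is only a sketch of the idea, not a
copy of the actual change:

    def strip_page_level_sha1(page_xml):
        """Remove <sha1> lines that are not inside a <revision> element."""
        out = []
        in_revision = False
        for line in page_xml.splitlines(True):
            stripped = line.strip()
            if stripped.startswith('<revision'):
                in_revision = True
            elif stripped.startswith('</revision>'):
                in_revision = False
            if stripped.startswith('<sha1') and not in_revision:
                continue    # page-level sha1 is not valid export XML; skip it
            out.append(line)
        return ''.join(out)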
The program tended to run out of memory when processing very large pages (i.e.,
pages with extremely large numbers of revisions, or pages with large numbers of
very large revisions). This mitigates the problem by changing getXMLPage() into
a generator, which allows us to write pages to disk after each request to the API.
This required changes to the getXMLPage() function itself and to the other
parts of the code that called it.
Additionally, the text returned by the function used to be checked in several
ways after the call. This required a few changes, including keeping a running
tally of revisions instead of a post hoc check, and moving the error checking
into an exception rather than an if statement that looked at the final result.
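In outline, the generator version looks something like the sketch below. The
fetch_revisions() callable stands in for the real API request code and the XML
details are simplified; the point is that each chunk is yielded (and can be
written to disk) as soon as it arrives, with a running tally of revisions kept
along the way:

    def xml_page_chunks(title, fetch_revisions):
        """Yield the XML of one page piece by piece.

        fetch_revisions(title, offset) is a stand-in for the real request:
        it returns (list_of_revision_xml_strings, next_offset_or_None).
        """
        yield '  <page>\n    <title>%s</title>\n' % title  # title assumed escaped
        numberofedits = 0           # running tally instead of a post hoc check
        offset = None
        while True:
            revisions, offset = fetch_revisions(title, offset)
            for revision_xml in revisions:
                yield revision_xml
                numberofedits += 1
            if offset is None:      # no continuation left: we have everything
                break
        yield '  </page>\n'

The caller then simply does "for chunk in xml_page_chunks(...): xmlfile.write(chunk)",
which is what keeps memory use flat regardless of how many revisions a page has.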
The list of unarchived wikis was compared to the list of wikis that we
managed to download with dumpgenerator.py:
https://archive.org/details/wikia_dump_20141219
To allow the comparison, the naming format was aligned to the format
used by dumpgenerator.py for 7z files.
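Roughly, the alignment amounts to deriving the same kind of prefix that
dumpgenerator.py uses for its archives; the exact pattern below
('<prefix>-<date>-wikidump.7z', prefix built from the wiki URL) is my reading
of that format and may not match every detail:

    import re

    def dump_filename(wiki_url, date):
        """Build a 7z dump filename in (assumed) dumpgenerator.py style."""
        prefix = re.sub(r'^https?://', '', wiki_url)
        prefix = re.sub(r'/(api|index)\.php.*$', '', prefix)
        prefix = re.sub(r'[^A-Za-z0-9]', '', prefix.lower())
        return '%s-%s-wikidump.7z' % (prefix, date)

    # dump_filename('http://muppet.wikia.com/api.php', '20141219')
    # -> 'muppetwikiacom-20141219-wikidump.7z'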
Most people know about pep8, which enforces coding style. pyflakes
goes a step further by analyzing the code.
flake8 is basically a wrapper around both pep8 and pyflakes and comes
with some additional checks. I find it very useful since you only need
to require one package to have a lot of code issues reported to you.
This patch provides a 'flake8' tox environment to easily install
and run the utility on the code base. One simply has to:
tox -eflake8
The repository in its current state does not pass the checks. We can
later easily ensure there is no regression by adjusting the Travis
configuration to run this env.
The env has NOT been added to the default list of environments.
More information about flake8: https://pypi.python.org/pypi/flake8
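The env itself only needs a few lines in tox.ini; something along these lines
(the exact deps and options in the actual patch may differ):

    [testenv:flake8]
    deps = flake8
    commands = flake8 .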
11011 were found alive by checkalive.py (though there could be more
if one checks more subdomains and subdirectories), some thousands
more by checklive.pl (but mostly or all false positives).
Of the alive ones, about 6245 were new to WikiApiary!
https://wikiapiary.com/wiki/Category:Oct_2014_Import