mirror of https://github.com/WikiTeam/wikiteam synced 2024-11-12 07:12:41 +00:00
Commit Graph

777 Commits

Author SHA1 Message Date
PiRSquared17
fc276d525f Allow spaces before <mediawiki> tag. 2015-03-24 03:44:03 +00:00
PiRSquared17
1c820dafb7 Strip ZWNBSP (U+FEFF) Byte-Order Mark from JSON/XML 2015-03-24 01:58:01 +00:00
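The two fixes above both harden the XML/JSON handling against junk at the start of a response. A minimal sketch of that kind of cleanup, assuming a raw text payload already fetched from index.php or api.php (the helper name is hypothetical, not the function in dumpgenerator.py):

    import re

    def clean_response(raw):
        # Hypothetical helper: drop a U+FEFF byte-order mark that some
        # servers prepend to JSON/XML responses.
        if raw.startswith(u'\ufeff'):
            raw = raw[1:]
        # Tolerate spaces or newlines before the opening <mediawiki> tag.
        return re.sub(r'^\s+(?=<mediawiki)', '', raw)
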
nemobis
711a88df59 Merge pull request #226 from nemobis/master
Make dumpgenerator.py 774: required by launcher.py
2015-03-14 10:44:41 +01:00
Nemo bis
55e5888a00 Fix UnicodeDecodeError in resume: use kitchen 2015-03-10 22:26:23 +02:00
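A minimal sketch of the kitchen-based decoding mentioned above, assuming the resumed titles file may contain bytes in an unexpected encoding (the file name is illustrative):

    from kitchen.text.converters import to_unicode

    # Decode leniently instead of letting a stray byte raise
    # UnicodeDecodeError while resuming.
    with open('examplewiki-titles.txt') as f:
        titles = [to_unicode(line, encoding='utf-8').strip() for line in f]
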
Federico Leva
14ce5f2c1b Resume and list titles without keeping everything in memory
Approach suggested by @makoshark, finally found the time to start
implementing it.
* Do not produce and save the titles list all at once. Instead, use
  the scraper and API as generators and save titles on the go. Also,
  try to start the generator from the appropriate title.
  For now the title sorting is not implemented. Pages will be in the
  order given by namespace ID, then page name.
* When resuming, read both the title list and the XML file from the
  end rather than the beginning. If the correct terminator is
  present, only one line needs to be read.
* In both cases, use a generator instead of a huge list in memory.
* Also truncate the resumed XML without writing it from scratch.
  For now using GNU ed: very compact, though shelling out is ugly.
  I gave up on using file.seek and file.truncate to avoid reading the
  whole file from the beginning or complicating reverse_readline()
  with more offset calculations.

This should avoid MemoryError in most cases.

Tested by running a dump over a 1.24 wiki with 11 pages: a complete
dump and a resumed dump from a dump interrupted with ctrl-c.
2015-03-10 11:21:44 +01:00
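A minimal sketch of the read-from-the-end idea described above, not the actual reverse_readline() in dumpgenerator.py; it assumes the titles list ends with a terminator line such as '--END--':

    import os

    def last_line(path, block_size=4096):
        # Seek backwards from the end so resuming never reads the whole
        # (possibly huge) titles list or XML file.
        with open(path, 'rb') as f:
            f.seek(0, os.SEEK_END)
            size = f.tell()
            data = b''
            while size > 0 and data.count(b'\n') < 2:
                step = min(block_size, size)
                size -= step
                f.seek(size)
                data = f.read(step) + data
            lines = [l for l in data.splitlines() if l.strip()]
            return lines[-1].decode('utf-8') if lines else None

    # If the terminator is there, the titles list is complete; otherwise
    # resume from the last title that was saved.
    if last_line('examplewiki-titles.txt') == u'--END--':
        print('titles list complete')
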
Federico Leva
2537e9852e Make dumpgenerator.py 774: required by launcher.py 2015-03-08 21:33:39 +01:00
nemobis
4b81fa00f1 Merge pull request #225 from nemobis/master
Fix API check if only index is passed
2015-03-08 20:53:51 +01:00
Federico Leva
79e2c5951f Fix API check if only index is passed
I forgot that the preceding point only extracts the api.php URL if
the "wiki" argument is passed to say it's a MediaWiki wiki (!).
2015-03-08 20:52:24 +01:00
Federico Leva
bdc7c9bf06 Issue 26: Local "Special" namespace, actually limit retries
* For some reason, in a previous commit I had noticed that maxretries
  was not respected in getXMLPageCore, but I didn't fix it. Done now.
* If the "Special" namespace alias doesn't work, fetch the local one.
2015-03-08 19:30:09 +01:00
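A minimal sketch of fetching the wiki's local alias for the "Special" namespace, as the second bullet describes; it uses requests directly for brevity rather than dumpgenerator.py's own request handling:

    import requests

    def local_special_namespace(api_url):
        # Namespace -1 is "Special"; the '*' field holds its localized name.
        r = requests.get(api_url, params={
            'action': 'query',
            'meta': 'siteinfo',
            'siprop': 'namespaces',
            'format': 'json',
        })
        namespaces = r.json()['query']['namespaces']
        return namespaces.get('-1', {}).get('*', 'Special')

    # e.g. returns u'Especial' on a Spanish-language wiki
    print(local_special_namespace('https://es.wikipedia.org/w/api.php'))
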
Federico Leva
c1a5e3e0ca Merge branch 'PiRSquared17-follow-redirects-api' 2015-03-08 16:31:32 +01:00
Federico Leva
2f25e6b787 Make checkAPI() more readable and verbose
Also return the api URL we found.
2015-03-08 16:01:46 +01:00
Federico Leva
48ad3775fd Merge branch 'follow-redirects-api' of git://github.com/PiRSquared17/wikiteam into PiRSquared17-follow-redirects-api 2015-03-08 14:35:30 +01:00
nemobis
2284e3d55e Merge pull request #186 from PiRSquared17/update-headers
Preserve default headers, fixing openwrt test
2015-03-08 13:56:00 +01:00
PiRSquared17
5d23cb62f4 Merge pull request #219 from vadp/dir-fnames-unicode
convert images directory content to unicode when resuming download
2015-03-04 23:36:59 +00:00
PiRSquared17
d361477a46 Merge pull request #222 from vadp/img-desc-load-err
dumpgenerator: catch errors for missing image descriptions
2015-03-03 01:55:59 +00:00
Vadim Shlyakhov
4c1d104326 dumpgenerator: catch errors for missing image descriptions 2015-03-02 12:15:51 +03:00
nemobis
eae90b777b Merge pull request #221 from PiRSquared17/fix-index-php
Try using URL without index.php as index
2015-03-02 07:58:06 +01:00
PiRSquared17
b1ce45b170 Try using URL without index.php as index 2015-03-02 04:13:44 +00:00
PiRSquared17
9c3c992319 Follow API redirects 2015-03-02 03:13:03 +00:00
Vadim Shlyakhov
f7e83a767a convert images directory content to unicode when resuming download 2015-03-01 19:14:01 +03:00
PiRSquared17
dec0032971 Replace CitiWiki test URL 2015-02-26 12:32:51 +00:00
PiRSquared17
d248b3f3e8 Merge pull request #217 from makoshark/master
fix bug with exception handling
2015-02-11 03:50:54 +00:00
Benjamin Mako Hill
d2adf5ce7c Merge branch 'master' of github.com:WikiTeam/wikiteam 2015-02-10 17:05:22 -08:00
Benjamin Mako Hill
f85b4a3082 fixed bug with page missing exception code
My previous code broke the page missing detection code with two negative
outcomes:

- missing pages were not reported in the error log
- every missing page generated an extraneous "</page>" line in the output, which
  rendered the dumps invalid

This patch improves the exception code in general and fixes both of these
issues.
2015-02-10 16:56:14 -08:00
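A minimal sketch of the exception-based flow the patch describes, with hypothetical names rather than dumpgenerator.py's actual ones:

    class PageMissingError(Exception):
        def __init__(self, title):
            self.title = title

    def export_page(title, xml):
        # A response with no <page> element means the page is missing:
        # raise instead of silently emitting a stray "</page>".
        if '<page>' not in xml:
            raise PageMissingError(title)
        return xml

    def dump_pages(titles, fetch, errorlog, out):
        for title in titles:
            try:
                out.write(export_page(title, fetch(title)))
            except PageMissingError as e:
                # Missing pages are logged and nothing is written to the
                # XML dump, so the dump stays well-formed.
                errorlog.write(u'%s: page missing\n' % e.title)
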
Benjamin Mako Hill
f4ec129bff updated wikiadownloader.py to work with new dumps
Bitrot seems to have gotten the best of this script and it sounds like it
hasn't been used. This at least gets it to work by:

- finding both the .gz and the .7z dumps
- parsing the new date format in the HTML
- finding the dumps in the correct place
- moving all chatter to stderr instead of stdout
2015-02-10 14:20:21 -08:00
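A minimal sketch of two of the points above (finding both compression formats and keeping chatter off stdout), with illustrative names rather than wikiadownloader.py's actual code:

    import re
    import sys

    DUMP_LINK = re.compile(r'href="([^"]+\.(?:gz|7z))"')

    def find_dumps(html):
        # Collect links to both the .gz and the .7z dumps on a listing page.
        return DUMP_LINK.findall(html)

    def chatter(msg):
        # Progress messages go to stderr so stdout stays machine-readable.
        sys.stderr.write(msg + '\n')
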
PiRSquared17
0ebe4e519d Merge pull request #204 from hashar/tox-flake8
Add tox env for flake8 linter
2015-02-10 03:55:57 +00:00
PiRSquared17
9480834a37 Fix infinite images loop
Closes #205 (hopefully)
2015-02-10 02:24:50 +00:00
PiRSquared17
ac72938d40 Merge pull request #216 from makoshark/master
Issue #8: avoid fatal MemoryError on big histories, remove sha1 for Wikia
2015-02-10 00:37:57 +00:00
PiRSquared17
28fc715b28 Make tests pass (fix/remove URLs)
Remove more Gentoo URLs (see 5069119b).
Fix the WikiPapers API URL, and remove it from the API test
(it reports an incorrect API URL in its HTML output).
2015-02-09 23:59:36 +00:00
nemobis
5069119b42 Remove wiki.gentoo.org from tests
The test is failing. https://travis-ci.org/WikiTeam/wikiteam/builds/50102997#L546
Might be our fault, but they just updated code:
Tyrian (f313f23), 12:47, 23 January 2015, GPLv3+, "Gentoo's new web theme ported to MediaWiki", by Alex Legler

I don't think testing screenscraping against a theme used only by Gentoo makes much sense for us.
2015-02-09 22:13:13 +01:00
Benjamin Mako Hill
eb8b44aef0 strip <sha1> tags returned under <page>
The Wikia API is exporting sha1 sums as part of the response for pages.
These make the XML invalid and cause dump-parsing code (e.g.,
MediaWiki-Utilities) to fail.  Also, sha1 sums belong to revisions, not
pages, so it's not entirely clear to me what this value refers to.
2015-02-06 18:50:25 -08:00
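A minimal sketch of stripping the offending element before writing the XML; a plain regex suffices since the tag carries no nested markup (this simplistic version removes every <sha1> element, not only the ones directly under <page>):

    import re

    SHA1_TAG = re.compile(r'\s*<sha1>[^<]*</sha1>')

    def strip_sha1(xml):
        # Drop <sha1>...</sha1> elements that Wikia emits under <page>,
        # which standard dump parsers reject.
        return SHA1_TAG.sub('', xml)
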
Benjamin Mako Hill
145b2eaaf4 changed getXMLPage() into a generator
The program tended to run out of memory when processing very large pages (i.e.,
pages with extremely large numbers of revisions or pages with large numbers of
very large revisions). This mitigates the problem by changing getXMLPage() into
a generator which allows us to write pages after each request to the API.

This required changes to the getXMLPage() function and also changes to other
parts of the code that called it.

Additionally, when the function was called, its text was checked in several
ways. This required a few changes, including keeping a running tally of
revisions instead of a post hoc check, and moving the error checking into an
Exception rather than an if statement that looked at the final result.
2015-02-06 17:19:24 -08:00
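A minimal sketch of the generator pattern this commit describes; fetch_chunk() and next_offset() are hypothetical callables standing in for the real request handling in dumpgenerator.py:

    def get_xml_page(title, fetch_chunk, next_offset):
        # Yield one Special:Export response at a time so the caller can
        # write each batch of revisions to disk immediately instead of
        # holding the whole page history in memory.
        offset = None
        while True:
            xml = fetch_chunk(title, offset)
            yield xml
            offset = next_offset(xml)
            if offset is None:  # no continuation left: history is complete
                break

    # The caller writes each chunk as soon as it arrives:
    #   for chunk in get_xml_page(title, fetch_chunk, next_offset):
    #       dumpfile.write(chunk)
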
Federico Leva
a1921f0919 Update list of wikia.com unarchived wikis
The list of unarchived wikis was compared to the list of wikis that we
managed to download with dumpgenerator.py:
https://archive.org/details/wikia_dump_20141219
To allow the comparison, the naming format was aligned to the format
used by dumpgenerator.py for 7z files.
2015-02-06 09:17:53 +01:00
Emilio J. Rodríguez-Posada
9a6570ec5a Update README.md 2014-12-23 13:33:19 +01:00
Federico Leva
ce6fbfee55 Use curl --fail instead and other fixes; add list
Now tested and used to produce the list of some 300k Wikia wikis
which don't yet have a public dump. Will soon be archived.
2014-12-19 08:17:59 +01:00
Federico Leva
7471900e56 It's easier if the list has the actual domains 2014-12-17 22:50:53 +01:00
Federico Leva
8bd3373960 Add wikia.py, to list Wikia wikis we'll dump ourselves 2014-12-17 22:49:10 +01:00
Federico Leva
38e778faad Add 7z2bz2.sh 2014-12-17 13:35:59 +01:00
Marek Šuppa
e370257aeb tests: Updated Index endpoint for WikiPapers
* Updated the index endpoint for WikiPapers on Referata, which was previously http://wikipapers.referata.com/w/index.php and now resolves to http://wikipapers.referata.com/index.php.
2014-12-08 06:49:03 +01:00
Marek Šuppa
7b9ca8aa6b tests: Updated API endpoint for WikiPapers
* Updated the API endpoint for WikiPapers on Referata. It used to be http://wikipapers.referata.com/w/api.php; now it resolves to http://wikipapers.referata.com/api.php. This was breaking the tests.
2014-12-08 06:37:29 +01:00
Federico Leva
e26711afc9 Merge branch 'master' of github.com:WikiTeam/wikiteam 2014-12-05 15:01:32 +01:00
Federico Leva
ed2d87418c Update with some wikis done in the last batch 2014-12-05 15:00:43 +01:00
Emilio J. Rodríguez-Posada
43cda4ec01 excluding wiki-site.com farm too 2014-12-03 11:39:53 +01:00
Emilio J. Rodríguez-Posada
7463b16b36 Merge branch 'master' of https://github.com/WikiTeam/wikiteam 2014-11-27 20:12:50 +01:00
Emilio J. Rodríguez-Posada
9681fdfd14 linking to GitHub 2014-11-27 20:12:27 +01:00
Marek Šuppa
b003cf94e2 tests: Disable broken Wiki
* Disabled http://wiki.greenmuseum.org/ since it's broken and was breaking the tests with `'Unknown' != 'PhpWiki'`
2014-11-26 23:33:44 +01:00
Emilio J. Rodríguez-Posada
8d4def5885 improving duplicate filter, removing www., www1., etc.; excluding editthis.info 2014-11-26 17:32:08 +01:00
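A minimal sketch of the host normalization this commit title describes (names and the exclusion list are illustrative):

    import re
    from urlparse import urlparse  # Python 2, matching the scripts of that era

    def normalized_host(url):
        # Collapse www., www1., www2., ... so mirrors of the same wiki
        # count as duplicates.
        host = urlparse(url).netloc.lower()
        return re.sub(r'^www\d*\.', '', host)

    def deduplicate(urls, excluded=('editthis.info', 'wiki-site.com')):
        seen = set()
        for url in urls:
            host = normalized_host(url)
            if any(host.endswith(farm) for farm in excluded):
                continue  # skip excluded wiki farms
            if host not in seen:
                seen.add(host)
                yield url
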
Emilio J. Rodríguez-Posada
9ca67fa4d3 not archived wikis script 2014-11-26 16:34:14 +01:00
Antoine Musso
362309a2da Add tox env for flake8 linter
Most people know about pep8, which enforces coding style.  pyflakes
goes a step further by analyzing the code.

flake8 is basically a wrapper around both pep8 and pyflakes and comes
with some additional checks.  I find it very useful since you only need
to install one package to have a lot of code issues reported to you.

This patch provides a 'flake8' tox environment to easily install
and run the utility on the code base.  One simply has to:

     tox -eflake8

The repository in its current state does not pass the checks. We can later
easily guard against regressions by adjusting the Travis configuration
to run this env.

The env has NOT been added to the default list of environments.

More information about flake8: https://pypi.python.org/pypi/flake8
2014-11-16 23:06:27 +01:00
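A sketch of what such a tox environment could look like (illustrative; not necessarily the exact tox.ini added by this commit):

    # tox.ini (sketch): a non-default env that installs flake8 and runs it
    # over the code base; invoke it with `tox -eflake8`.
    [tox]
    skipsdist = True

    [testenv:flake8]
    deps = flake8
    commands = flake8 .
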
Federico Leva
8cf4d4e6ea Add 30k domains from another crawler
11011 were found alive by checkalive.py (though there could be more
if one checks more subdomains and subdirectories), and a few thousand
more by checklive.pl (but those are mostly or all false positives).

Of the alive ones, about 6245 were new to WikiApiary!
https://wikiapiary.com/wiki/Category:Oct_2014_Import
2014-11-01 22:23:25 +01:00