Commit Graph

390 Commits (master)

Author SHA1 Message Date
GT610 5b0e98afe6 Fix wrong urllib module call 11 months ago
nemobis 1e363f450f
Merge pull request #464 from saveweb/xml-format-py2
Adjust XML format
12 months ago
nemobis c7150784c1
Merge pull request #452 from yzqzss/patch-4
Update dumpgenerator.py
12 months ago
nemobis a977dc1a8b
Merge pull request #439 from Pokechu22/page-title-scraper-fix
Fix infinite loop on page title scraper
12 months ago
nemobis c56cbf1c12
Merge pull request #453 from yzqzss/patch-5
Speed up file scanning in `images/` dir
12 months ago
nemobis 674381c27c
Merge pull request #448 from yzqzss/patch-1
Match single quotes too when scraping namespaces
12 months ago
nemobis 8167987052
Merge pull request #451 from yzqzss/patch-2
Quote `title` to get correct file description
12 months ago
yzqzss e979adfbeb remove empty <comment> if no comment provided 12 months ago
yzqzss 522807d25d fix: incorrect xml space attr in <text> 12 months ago
Pokechu22 aac816e315 Fix broken http_method fallback
This was probably a copy/paste typo. I don't remember if I ever ran into this in practice but it is something I noticed in the past and never submitted a fix for.
1 year ago
Pokechu22 df230a96c9 Fix exporting via prop=revisions 1 year ago
yzqzss 392fbce083
speed up file scanning
use `set` instead of `list` to speed up the scanning of large numbers of files (>10000) in `images/`.
1 year ago
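The `set`-versus-`list` change described in 392fbce083 can be illustrated with a minimal sketch. The function name `find_missing_files` and the sample file names are hypothetical and not taken from `dumpgenerator.py`; the point is only that a membership test against a set is O(1) on average, while the same test against a list scans the whole list.

```python
def find_missing_files(expected_names, names_on_disk):
    """Return the expected file names that are absent from a directory listing."""
    # Building the set once costs O(n); every later "in" test is then O(1)
    # on average instead of an O(n) scan through a list.
    on_disk = set(names_on_disk)
    return [name for name in expected_names if name not in on_disk]


# Hypothetical listing standing in for a large images/ directory.
listing = ["File_%d.png" % i for i in range(20000)]
wanted = ["File_5.png", "File_19999.png", "Missing.png"]
missing = find_missing_files(wanted, listing)
print(missing)
```

With a list, the cost grows with the product of both lengths, which is why the difference only becomes noticeable for directories with many thousands of files.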
yzqzss 940d50bbac
Update dumpgenerator.py
fix typo
1 year ago
yzqzss ebac66f557
Update dumpgenerator.py 1 year ago
yzqzss 0be46c7427
quote `title` 1 year ago
yzqzss 90a64c6a22
make `requests.session` use the `--retries` value
(default=5)
1 year ago
yzqzss 331f8e122b
update regex to match `'` and `"` in <option> tag
the new versions of MediaWiki use `'`, older use `"`.
1 year ago
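A sketch of the idea behind 331f8e122b, assuming a namespace drop-down made of `<option>` tags. The pattern and the helper `scrape_namespaces` are illustrative assumptions, not the exact regex used in `dumpgenerator.py`; the character class `[\"']` is what lets one pattern accept both quote styles.

```python
import re

# Hypothetical pattern: the value attribute may be wrapped in " or '.
OPTION_RE = re.compile(
    r"<option [^>]*?value=[\"'](?P<namespaceid>\d+)[\"'][^>]*?>"
    r"(?P<namespacename>[^<]+)</option>"
)


def scrape_namespaces(html):
    """Map namespace id -> namespace name for every <option> tag found."""
    return {
        int(m.group("namespaceid")): m.group("namespacename")
        for m in OPTION_RE.finditer(html)
    }


# Older MediaWiki output uses double quotes, newer output uses single quotes.
old_style = '<option value="0">(Main)</option><option value="1">Talk</option>'
new_style = "<option value='0'>(Main)</option><option value='1' selected=''>Talk</option>"
print(scrape_namespaces(old_style))
print(scrape_namespaces(new_style))
```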
Pokechu22 97146c6f01 Use the same requests session for getting the wiki engine and checking API/index 1 year ago
Pokechu22 6668999658 Update User-Agent to latest Firefox 1 year ago
Pokechu22 cad7260d7c Fix crash when the image description is missing for an image containing non-ascii characters
title is already unicode, so we shouldn't need to decode it (and don't in generateXMLDump).
2 years ago
Pokechu22 5b3fc4ac7b Pass requests session to mwclient
This means it uses our configured user-agent, as well as any cookies.
2 years ago
Pokechu22 1af69ca147 Skip empty revisions when using --xmlrevisions
Before, the download would die, and need to be resumed from the start.
2 years ago
Pokechu22 a1bd3b0851 Fix infinite loop on page title scraper 2 years ago
Pokechu22 4a2cbd4843 Use `session.get` instead of `requests.get` in `getXMLHeader`
`session.get` uses our configured User-Agent, while `requests.get` uses the default one.
2 years ago
Pokechu22 9b2c6e40ae Fix truncation when resuming
There already was code that looks like it was supposed to truncate files, but it calculated the index wrong and didn't properly check all lines. It worked out, though, because it didn't actually call the truncate function.

Now, truncation occurs to the last `</page>` tag. If the XML file ends with a `</page>` tag, then nothing gets truncated. The page is added after that; if nothing was truncated, this will result in the same page being listed twice (which already happened with the missing truncation), but if truncation did happen then the file should no longer be invalid.
2 years ago
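The truncation rule from 9b2c6e40ae (cut the file back to its last `</page>` tag, and cut nothing if the file already ends there) can be sketched as follows. The function `truncate_after_last_page` is a hypothetical name, and the sketch reads the whole file into memory for brevity, which a real multi-gigabyte dump would not want to do.

```python
import os
import tempfile


def truncate_after_last_page(path):
    """Cut an interrupted XML dump back to just after its last </page> tag.

    Returns the number of bytes removed (0 if nothing had to be cut or if
    the file contains no </page> tag at all).
    """
    marker = b"</page>"
    with open(path, "rb+") as f:
        data = f.read()  # simplification: assumes the file fits in memory
        idx = data.rfind(marker)
        if idx == -1:
            return 0
        end = idx + len(marker)
        # Keep the newline that directly follows the tag, if there is one.
        if data[end:end + 1] == b"\n":
            end += 1
        f.truncate(end)
        return len(data) - end


# Demo on a temporary file whose second page was cut off mid-tag.
with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as tmp:
    tmp.write(b"<mediawiki>\n<page>A</page>\n<page>B</pa")
    demo_path = tmp.name

removed = truncate_after_last_page(demo_path)
with open(demo_path, "rb") as f:
    result = f.read()
os.remove(demo_path)
print(removed, result)
```

As the commit message notes, the resumed page is appended after the cut, so a file that needed no truncation ends up listing that page twice.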
Pokechu22 43945c467f Work around unicode titles not working with resuming
Before, you would get UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal. The %s versus {} change was needed because otherwise you would get UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128). There is probably a better way of solving that, but this one does work.
2 years ago
nemobis d7b6924845
Merge pull request #408 from shreyasminocha/fix-resume-images
Fix image resuming
2 years ago
Tim Gates ecbcc6118e
docs: Fix a few typos
There are small typos in:
- dumpgenerator.py
- wikiteam/mediawiki.py

Fixes:
- Should read `inconsistencies` rather than `inconsistences`.
- Should read `partially` rather than `partialy`.
2 years ago
Shreyas Minocha e55de36cb7
Fix image resuming 3 years ago
Nicolas SAPA b289f86243 Fix getPageTitlesScraper
Using the API and the Special:Allpages scraper should result in the same number of titles.
Fix the detection of the next subpages on Special:Allpages.
Change the max depth to 100 and implement an anti-loop check (could fail on non-Western wikis).
4 years ago
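The depth limit and loop protection mentioned in b289f86243 might look roughly like the sketch below. Everything here is an assumption for illustration: `collect_titles`, the `fetch_page` callback and the fake site are invented, and the real scraper parses Special:Allpages HTML rather than a dictionary.

```python
def collect_titles(fetch_page, start_url, max_depth=100):
    """Walk linked listing pages, skipping revisited URLs and deep chains.

    fetch_page(url) is a hypothetical callback returning a pair
    (titles_on_page, urls_of_next_subpages).
    """
    seen_urls = set()
    seen_titles = set()
    titles = []
    pending = [(start_url, 0)]
    while pending:
        url, depth = pending.pop()
        # Anti-loop: never fetch the same listing twice, never go too deep.
        if url in seen_urls or depth > max_depth:
            continue
        seen_urls.add(url)
        page_titles, next_urls = fetch_page(url)
        for title in page_titles:
            if title not in seen_titles:
                seen_titles.add(title)
                titles.append(title)
        pending.extend((next_url, depth + 1) for next_url in next_urls)
    return titles


# Fake two-page site whose second page links back to the first.
FAKE_SITE = {
    "allpages?from=A": (["Apple", "Avocado"], ["allpages?from=B"]),
    "allpages?from=B": (["Banana"], ["allpages?from=A"]),
}
titles = collect_titles(FAKE_SITE.__getitem__, "allpages?from=A")
print(titles)
```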
Nicolas SAPA e4b43927b9 Fixup description grab in generateImageDump
getXMLPage() yields on "</page>", so xmlfiledesc cannot contain "</mediawiki>".
Change the search to "</page>" and inject "</mediawiki>" if it is missing, to fix up the XML.
4 years ago
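A minimal sketch of the fix-up described in e4b43927b9, assuming the description text arrives as one string. The helper name `close_description_xml` is hypothetical and the actual condition in `generateImageDump` may differ.

```python
def close_description_xml(xmlfiledesc):
    """Append </mediawiki> when the export text stops at </page>."""
    # The text is only considered usable if it contains a complete page;
    # the closing root tag is added when the generator never yielded it.
    if "</page>" in xmlfiledesc and "</mediawiki>" not in xmlfiledesc:
        return xmlfiledesc + "</mediawiki>\n"
    return xmlfiledesc


partial = "<mediawiki>\n<page><title>File:A.png</title></page>\n"
fixed = close_description_xml(partial)
print(fixed)
```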
Nicolas SAPA eacaf08b2f Try to fix a broken HTTP to HTTPS redirect in generateImageDump()
Some wikis fail to do the HTTP to HTTPS redirect correctly, so try it ourselves.
4 years ago
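One way to retry a failed `http://` download over HTTPS, as eacaf08b2f describes, is to rewrite only the URL scheme. The helper `upgrade_to_https` and the example URL are hypothetical; the commit itself may build the URL differently.

```python
from urllib.parse import urlsplit


def upgrade_to_https(url):
    """Return the same URL with an http scheme replaced by https."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        return url  # already https, or some other scheme: leave untouched
    return parts._replace(scheme="https").geturl()


retry_url = upgrade_to_https("http://wiki.example.org/images/a/ab/Logo.png")
print(retry_url)
```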
Nicolas SAPA 7675b0d17c Add exception handler for requests.exceptions.ReadTimeout in getXMLPageCore()
Treat a ReadTimeout the same as a ConnectionError (log the error & retry)
4 years ago
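The "log the error and retry" handling from 7675b0d17c can be sketched without the `requests` dependency. This version uses Python's built-in `ConnectionError` and `TimeoutError` as stand-ins for `requests.exceptions.ConnectionError` and `requests.exceptions.ReadTimeout`; `fetch_with_retries` and its parameters are hypothetical names.

```python
import time


def fetch_with_retries(fetch, retries=5, delay=0.0, log=print):
    """Call fetch() until it succeeds, treating timeouts like connection errors."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except (ConnectionError, TimeoutError) as exc:
            # Both failure kinds take the same path: log, wait, try again.
            log("attempt %d failed: %r" % (attempt, exc))
            if attempt == retries:
                raise
            time.sleep(delay)


# Demo: a callable that times out twice before returning data.
calls = []


def flaky_fetch():
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("read timed out")
    return "<page/>"


fetched = fetch_with_retries(flaky_fetch, log=lambda message: None)
print(fetched, len(calls))
```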
Nicolas SAPA 4a5eef97da Update the default user-agent
A ModSecurity rule blocks the old UA, so switch to the current Firefox 78 UA.
4 years ago
Rob Kam e6f4674b42
fix typo 4 years ago
Federico Leva abd908914f Adapt to some more Wikia wikis edge cases
* Make it easy to batch requests for some wikis where millions of titles
  are really just one-revision thread items and need to be gone through
  as fast as possible.
* Status code error message.
4 years ago
Federico Leva 7de75012d1 Fix merge of the getXMLRevisions() loop 4 years ago
nemobis 8a2116699e
Merge branch 'master' into wikia 4 years ago
Federico Leva 7289225d2c Directly catch exception for page missing in getXMLRevisions()
The caller cannot catch the PME exception because it doesn't know about
the title. Just log the error here.
4 years ago
nemobis e136ee5536
Merge pull request #372 from nemobis/wikia
Avoid launcher.py 7z failures
4 years ago
Federico Leva 8c6f05bb54 Consider status code before content in checkIndex() and checkalive.py
Fixes https://github.com/WikiTeam/wikiteam/issues/369
4 years ago
Federico Leva 9ac1e6d0f1 Implement resume in --xmlrevisions (but not yet with list=allrevisions)
Tested with a partial dump over 100 MB:
https://tinyvillage.fandom.com/api.php
(grepped <title> to see the previously downloaded ones were kept and the
new ones continued from expected; did not validate a final XML).
4 years ago
Federico Leva a664b17a9c Handle deleted contributor name in --xmlrevisions
Avoids failure in https://deployment.wikimedia.beta.wmflabs.org/w/api.php
for revision https://deployment.wikimedia.beta.wmflabs.org/?oldid=2349 .
4 years ago
Federico Leva b162e7b14f Reduce the API limit to 50 for arvlimit, gaplimit, ailimit
Avoids crashing on errors or warnings which some wikis return for bigger
requests, like https://www.openkm.com/wiki/api.php (MediaWiki 1.27.3).
4 years ago
Federico Leva d543f7d4dd Check the API URL against mwclient too, so it doesn't fail later
Change the protocol from HTTP to HTTPS if needed. Fixes:
http://nimiarkisto.fi/w/api.php
4 years ago
Federico Leva d1619392f4 Force the lxml factory to pass around unicode strings
Not necessarily the most compatible with downstream XML parsers, but at
least should ensure that we manage to write the XML file. The encoding
declared in the header is not necessarily the same we get from the API.

See also:
https://lxml.de/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings
https://lxml.de/3.7/parsing.html#serialising-to-unicode-strings

Fixes https://github.com/WikiTeam/wikiteam/issues/363
4 years ago
Federico Leva 6dc86d1964 Actually use the next batch from prop=revisions in MediaWiki 1.19 4 years ago
Federico Leva 2ba69b3810 Indent the number of revisions more, consistent with page title style 4 years ago
Federico Leva 8fef62d46e Implement continuation for --xmlrevisions with prop=revisions in MW 1.19 4 years ago
Federico Leva 8b58599645 Merge branch 'xmlrevisions' of github.com:nemobis/wikiteam into xmlrevisions 4 years ago