1133 Commits (master)
 

Author SHA1 Message Date
Federico Leva 54d9d8051e Remove dead Miraheze wikis per checkalive.py
Closes issue #465
10 months ago
nemobis bf45fdec92
Merge pull request #466 from GT-610/master
Fix wrong urllib module call in Python 2
11 months ago
Federico Leva c09db669c9 Update checkalive.pl documentation 11 months ago
Federico Leva 1b02cee1d5 Revert "Update miraheze.org list with checkalive.py"
Some 70 % of the removed wikis still return an HTTP 200 although they
may be frozen or closed.

Tested with:

git show | grep ^- | cut -f3 -d/ | sed --regexp-extended 's,(.+),https://\1/wiki/,g' | sort | shuf -n 100 | xargs -I§ -P10 sh -c "curl -Is -w '%{stderr}%{http_code}\n' § > /dev/null" 2>&1 | sort | uniq -c

This reverts commit 0a3dc23f98.
11 months ago
Federico Leva 0a3dc23f98 Update miraheze.org list with checkalive.py
Addresses issue #465
11 months ago
GT610 5b0e98afe6 Fix wrong urllib module call 11 months ago
Federico Leva 40a1f35dae Update miraheze.org list of wikis 11 months ago
nemobis 1e363f450f
Merge pull request #464 from saveweb/xml-format-py2
Adjust XML format
11 months ago
nemobis c7150784c1
Merge pull request #452 from yzqzss/patch-4
Update dumpgenerator.py
11 months ago
nemobis a977dc1a8b
Merge pull request #439 from Pokechu22/page-title-scraper-fix
Fix infinite loop on page title scraper
11 months ago
nemobis c56cbf1c12
Merge pull request #453 from yzqzss/patch-5
Speed up file scanning in `images/` dir
11 months ago
nemobis 674381c27c
Merge pull request #448 from yzqzss/patch-1
Match single quotes too when scraping namespaces
11 months ago
nemobis 8167987052
Merge pull request #451 from yzqzss/patch-2
Quote `title` to get correct file description
11 months ago
yzqzss e979adfbeb remove empty <comment> if no comment provided 11 months ago
yzqzss 522807d25d fix: incorrect xml space attr in <text> 11 months ago
nemobis 0621adf0a3
Merge pull request #463 from Pokechu22/broken-http_method-fallback
Fix broken http_method fallback
12 months ago
Pokechu22 aac816e315 Fix broken http_method fallback
This was probably a copy/paste typo. I don't remember if I ever ran into this in practice but it is something I noticed in the past and never submitted a fix for.
12 months ago
nemobis e339927cc3
Merge pull request #462 from Pokechu22/fix-prop-revisions
Fix exporting via prop=revisions
12 months ago
Pokechu22 df230a96c9 Fix exporting via prop=revisions 12 months ago
emijrp dd0f4a4593
350,000 1 year ago
nemobis ef03cff447
Merge pull request #454 from yzqzss/patch-6
Fix a small syntax error in uploader.py
1 year ago
yzqzss d7153f4c60
Update uploader.py 1 year ago
yzqzss 392fbce083
speed up file scanning
use `set` instead of `list` to speed up the scanning of large numbers of files (>10000) in `images/`.
1 year ago
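
A minimal sketch of the idea behind the `set` change above, not the actual dumpgenerator.py code: the function name `missing_files` and the sample filenames are hypothetical. It lists the directory once, stores the names in a `set`, and then checks each expected name against it, so every lookup is constant time on average instead of a linear scan of a `list`.

```python
import os
import tempfile


def missing_files(expected_names, directory):
    """Return the expected filenames that are not present in `directory`.

    Hypothetical helper for illustration only.
    """
    # Build the set once. Each later `in` check is O(1) on average,
    # while `name in some_list` would rescan the whole list every time.
    present = set(os.listdir(directory))
    return [name for name in expected_names if name not in present]


# Small self-contained demo using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("A.png", "B.png"):
        open(os.path.join(tmp, name), "w").close()
    todo = missing_files(["A.png", "B.png", "C.png"], tmp)

print(todo)  # ['C.png']
```

With more than 10000 files, as the commit message mentions, the difference is between roughly n lookups and roughly n squared comparisons.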
yzqzss 940d50bbac
Update dumpgenerator.py
fix typo
1 year ago
yzqzss ebac66f557
Update dumpgenerator.py 1 year ago
yzqzss 0be46c7427
quote `title` 1 year ago
nemobis 0c4c54dc9e
Merge pull request #449 from yzqzss/patch-2
make `requests.session` to use `--retries` value
1 year ago
yzqzss 90a64c6a22
make `requests.session` to use `--retries` value
(default=5)
1 year ago
yzqzss 331f8e122b
update regex to match `'` and `"` in <option> tag
the new versions of MediaWiki use `'`, older use `"`.
1 year ago
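
To illustrate the quoting difference described above, here is a hedged sketch of a pattern that accepts both `'` and `"` around the `value` attribute of an `<option>` tag. The pattern, the `parse_namespaces` helper, and the sample HTML are assumptions made for this example; the real regex in dumpgenerator.py may differ.

```python
import re

# Hypothetical pattern: the character class ["'] accepts either quote style.
OPTION_RE = re.compile(
    r"""<option [^>]*?value=["'](?P<id>-?\d+)["'][^>]*?>(?P<name>[^<]+)</option>"""
)


def parse_namespaces(html):
    """Map namespace id -> namespace name for every matching <option> tag."""
    return {
        int(m.group("id")): m.group("name").strip()
        for m in OPTION_RE.finditer(html)
    }


# Sample markup in both quoting styles (made up for the demo).
single_quoted = "<option value='0'>(Main)</option><option value='1'>Talk</option>"
double_quoted = (
    '<option value="0" selected="selected">(Main)</option>'
    '<option value="1">Talk</option>'
)

namespaces_single = parse_namespaces(single_quoted)
namespaces_double = parse_namespaces(double_quoted)
print(namespaces_single == namespaces_double)  # True
```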
nemobis 9d614cf8ad
Merge pull request #444 from Pokechu22/wiki-engine-session
Use the same requests session for getting the wiki engine and checking API/index
1 year ago
Pokechu22 97146c6f01 Use the same requests session for getting the wiki engine and checking API/index 1 year ago
Pokechu22 6668999658 Update User-Agent to latest Firefox 1 year ago
nemobis ea5e130517
Merge pull request #442 from Pokechu22/missing-image-description
Fix crash when the image description is missing for an image containing non-ascii characters
1 year ago
Pokechu22 cad7260d7c Fix crash when the image description is missing for an image containing non-ascii characters
title is already unicode, so we shouldn't need to decode it (and don't in generateXMLDump).
2 years ago
nemobis 25329be008
Merge pull request #441 from Pokechu22/mwclient-session
Pass requests session to mwclient
2 years ago
Pokechu22 5b3fc4ac7b Pass requests session to mwclient
This means it uses our configured user-agent, as well as any cookies.
2 years ago
nemobis 52fe2d89a6
Merge pull request #440 from Pokechu22/xmlrevisions-skip-empty-revision
Skip empty revisions when using --xmlrevisions
2 years ago
Pokechu22 1af69ca147 Skip empty revisions when using --xmlrevisions
Before, the download would die, and need to be resumed from the start.
2 years ago
Pokechu22 a1bd3b0851 Fix infinite loop on page title scraper 2 years ago
nemobis 5d83703d50
Merge pull request #438 from Pokechu22/getXMLHeader-session
Use `session.get` instead of `requests.get` in `getXMLHeader`
2 years ago
Pokechu22 4a2cbd4843 Use `session.get` instead of `requests.get` in `getXMLHeader`
`session.get` uses our configured User-Agent, while `requests.get` uses the default one.
2 years ago
nemobis 9808279a6a
Merge pull request #436 from Pokechu22/unicode-resume
Work around unicode titles not working with resuming and fix truncation when resuming
2 years ago
Pokechu22 9b2c6e40ae Fix truncation when resuming
There already was code that looks like it was supposed to truncate files, but it calculated the index wrong and didn't properly check all lines. It worked out, though, because it didn't actually call the truncate function.

Now, truncation occurs to the last `</page>` tag. If the XML file ends with a `</page>` tag, then nothing gets truncated. The page is added after that; if nothing was truncated, this will result in the same page being listed twice (which already happened with the missing truncation), but if truncation did happen then the file should no longer be invalid.
2 years ago
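
The truncation rule described in the commit above can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: `truncate_after_last_page` is a hypothetical name, the whole file is read into memory for simplicity, and a newline directly after the tag is kept.

```python
import os
import tempfile


def truncate_after_last_page(path, marker=b"</page>"):
    """Cut the file right after its last `</page>`; return bytes removed."""
    with open(path, "rb+") as f:
        data = f.read()  # whole-file read: acceptable for a sketch only
        idx = data.rfind(marker)
        if idx == -1:
            return 0  # no complete page found, leave the file untouched
        end = idx + len(marker)
        if data[end:end + 1] == b"\n":
            end += 1  # keep the newline that follows the closing tag
        f.truncate(end)
        return len(data) - end


# Demo: a dump interrupted in the middle of the second page.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"<page>\n<title>A</title>\n</page>\n<page>\n<title>B</title>\n<revision>")
tmp.close()

removed_first = truncate_after_last_page(tmp.name)
removed_again = truncate_after_last_page(tmp.name)  # file now ends with </page>
with open(tmp.name, "rb") as f:
    remaining = f.read()
os.remove(tmp.name)

print(remaining.endswith(b"</page>\n"), removed_again)  # True 0
```

The second call removes nothing, which matches the case in the commit message where the XML file already ends with a `</page>` tag.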
Pokechu22 43945c467f Work around unicode titles not working with resuming
Before, you would get UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal. The %s versus {} change was needed because otherwise you would get UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128). There is probably a better way of solving that, but this one does work.
2 years ago
Federico Leva e33f14fce6 Support GiveUpGitHub 2 years ago
nemobis 269841c909
Merge pull request #431 from simonliu99/updatelists
Update mediawiki wikifarm lists
2 years ago
Liu d9885e0845 Update shoutwiki-spider to remove duplicates 2 years ago
Liu fcc4080b23 Update neoseeker.com.info instructions 2 years ago
Liu e7f7266550 Update fandom.com spider and remove duplicates 2 years ago
Liu 9c5c55342d Update miraheze.org spider and remove duplicates 2 years ago