Federico Leva
54d9d8051e
Remove dead Miraheze wikis per checkalive.py
...
Closes issue #465
10 months ago
nemobis
bf45fdec92
Merge pull request #466 from GT-610/master
...
Fix wrong urllib module call in Python 2
11 months ago
Federico Leva
c09db669c9
Update checkalive.pl documentation
11 months ago
Federico Leva
1b02cee1d5
Revert "Update miraheze.org list with checkalive.py"
...
Some 70 % of the removed wikis still return an HTTP 200 although they
may be frozen or closed.
Tested with:
git show | grep ^- | cut -f3 -d/ | sed --regexp-extended 's,(.+),https://\1/wiki/,g ' | sort | shuf -n 100 | xargs -I§ -P10 sh -c "curl -Is -w '%{stderr}%{http_code}\n' § > /dev/null" 2>&1 | sort | uniq -c
This reverts commit 0a3dc23f98.
11 months ago
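The spot check quoted in commit 1b02cee1d5 (sample 100 of the removed hosts, request `https://<host>/wiki/`, tally the status codes) can also be sketched in Python. This is only an illustration: `head_status` and `tally_statuses` are hypothetical names, not part of checkalive.py. Unlike the `curl -Is` pipeline, `urllib` follows redirects, and this sketch runs sequentially rather than with 10 parallel workers.

```python
import random
import urllib.error
import urllib.request
from collections import Counter


def head_status(url, timeout=10):
    """Send a HEAD request and return the HTTP status code (0 on failure)."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions but still carry a code.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and so on.
        return 0


def tally_statuses(hosts, sample_size=100, fetch=head_status, rng=random):
    """Count status codes for a random sample of hosts.

    `fetch` is injectable so the tallying can be exercised offline.
    """
    sample = rng.sample(list(hosts), min(sample_size, len(hosts)))
    return Counter(fetch("https://%s/wiki/" % host) for host in sample)
```

Calling `tally_statuses(removed_hosts)` on the host names removed by a list update gives a rough view of how many still answer with HTTP 200, which is the figure the revert message relies on.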
Federico Leva
0a3dc23f98
Update miraheze.org list with checkalive.py
...
Addresses issue #465
11 months ago
GT610
5b0e98afe6
Fix wrong urllib module call
11 months ago
Federico Leva
40a1f35dae
Update miraheze.org list of wikis
11 months ago
nemobis
1e363f450f
Merge pull request #464 from saveweb/xml-format-py2
...
Adjust XML format
11 months ago
nemobis
c7150784c1
Merge pull request #452 from yzqzss/patch-4
...
Update dumpgenerator.py
11 months ago
nemobis
a977dc1a8b
Merge pull request #439 from Pokechu22/page-title-scraper-fix
...
Fix infinite loop on page title scraper
11 months ago
nemobis
c56cbf1c12
Merge pull request #453 from yzqzss/patch-5
...
Speed up file scanning in `images/` dir
11 months ago
nemobis
674381c27c
Merge pull request #448 from yzqzss/patch-1
...
Match single quotes too when scraping namespaces
11 months ago
nemobis
8167987052
Merge pull request #451 from yzqzss/patch-2
...
Quote `title` to get correct file description
11 months ago
yzqzss
e979adfbeb
remove empty <comment> if no comment provided
11 months ago
yzqzss
522807d25d
fix: incorrect xml space attr in <text>
11 months ago
nemobis
0621adf0a3
Merge pull request #463 from Pokechu22/broken-http_method-fallback
...
Fix broken http_method fallback
12 months ago
Pokechu22
aac816e315
Fix broken http_method fallback
...
This was probably a copy/paste typo. I don't remember if I ever ran into this in practice but it is something I noticed in the past and never submitted a fix for.
12 months ago
nemobis
e339927cc3
Merge pull request #462 from Pokechu22/fix-prop-revisions
...
Fix exporting via prop=revisions
12 months ago
Pokechu22
df230a96c9
Fix exporting via prop=revisions
12 months ago
emijrp
dd0f4a4593
350,000
1 year ago
nemobis
ef03cff447
Merge pull request #454 from yzqzss/patch-6
...
Fix a small syntax error in uploader.py
1 year ago
yzqzss
d7153f4c60
Update uploader.py
1 year ago
yzqzss
392fbce083
speed up file scanning
...
use `set` instead of `list` to speed up the scanning of large numbers of files (>10000) in `images/`.
1 year ago
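The speed-up in commit 392fbce083 comes from membership tests: `name in some_list` scans the list linearly, while `name in some_set` is a hash lookup on average. A minimal sketch of the idea, assuming hypothetical names (`find_missing`, `expected`, `on_disk`) that are not taken from dumpgenerator.py:

```python
def find_missing(expected, on_disk):
    """Return the expected filenames that are not yet present on disk."""
    # Build the set once; each membership test below is then O(1) on
    # average instead of a linear scan through a list.
    present = set(on_disk)
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    on_disk = ["File%d.png" % i for i in range(20000)]
    expected = on_disk + ["New.png"]
    print(find_missing(expected, on_disk))
```

With more than 10000 files in `images/`, the list version performs on the order of n² comparisons, whereas the set version stays roughly linear.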
yzqzss
940d50bbac
Update dumpgenerator.py
...
fix typo
1 year ago
yzqzss
ebac66f557
Update dumpgenerator.py
1 year ago
yzqzss
0be46c7427
quote `title`
1 year ago
nemobis
0c4c54dc9e
Merge pull request #449 from yzqzss/patch-2
...
make `requests.session` to use `--retries` value
1 year ago
yzqzss
90a64c6a22
make `requests.session` to use `--retries` value
...
(default=5)
1 year ago
yzqzss
331f8e122b
update regex to match `'` and `"` in <option> tag
...
the new versions of MediaWiki use `'`, older use `"`.
1 year ago
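Commit 331f8e122b makes the namespace scraper accept both quote styles in `<option>` tags. One way to do that is sketched below. The pattern and the name `parse_namespaces` are assumptions for illustration, not the regex actually used in dumpgenerator.py.

```python
import re

# Hypothetical pattern: the quote character is captured in group 1 and
# reused through the backreference \1, so the opening and closing quotes
# must be the same character.
OPTION_RE = re.compile(
    r"""<option[^>]*?\bvalue=(["'])(\d+)\1[^>]*>([^<]*)</option>"""
)


def parse_namespaces(html):
    """Return (id, name) pairs for every numeric <option> in the HTML."""
    return [
        (int(match.group(2)), match.group(3).strip())
        for match in OPTION_RE.finditer(html)
    ]
```

Options with a non-numeric value are skipped because the pattern only accepts digits.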
nemobis
9d614cf8ad
Merge pull request #444 from Pokechu22/wiki-engine-session
...
Use the same requests session for getting the wiki engine and checking API/index
1 year ago
Pokechu22
97146c6f01
Use the same requests session for getting the wiki engine and checking API/index
1 year ago
Pokechu22
6668999658
Update User-Agent to latest Firefox
1 year ago
nemobis
ea5e130517
Merge pull request #442 from Pokechu22/missing-image-description
...
Fix crash when the image description is missing for an image containing non-ascii characters
1 year ago
Pokechu22
cad7260d7c
Fix crash when the image description is missing for an image containing non-ascii characters
...
title is already unicode, so we shouldn't need to decode it (and don't in generateXMLDump).
2 years ago
nemobis
25329be008
Merge pull request #441 from Pokechu22/mwclient-session
...
Pass requests session to mwclient
2 years ago
Pokechu22
5b3fc4ac7b
Pass requests session to mwclient
...
This means it uses our configured user-agent, as well as any cookies.
2 years ago
nemobis
52fe2d89a6
Merge pull request #440 from Pokechu22/xmlrevisions-skip-empty-revision
...
Skip empty revisions when using --xmlrevisions
2 years ago
Pokechu22
1af69ca147
Skip empty revisions when using --xmlrevisions
...
Before, the download would die, and need to be resumed from the start.
2 years ago
Pokechu22
a1bd3b0851
Fix infinite loop on page title scraper
2 years ago
nemobis
5d83703d50
Merge pull request #438 from Pokechu22/getXMLHeader-session
...
Use `session.get` instead of `requests.get` in `getXMLHeader`
2 years ago
Pokechu22
4a2cbd4843
Use `session.get` instead of `requests.get` in `getXMLHeader`
...
`session.get` uses our configured User-Agent, while `requests.get` uses the default one.
2 years ago
nemobis
9808279a6a
Merge pull request #436 from Pokechu22/unicode-resume
...
Work around unicode titles not working with resuming and fix truncation when resuming
2 years ago
Pokechu22
9b2c6e40ae
Fix truncation when resuming
...
There already was code that looks like it was supposed to truncate files, but it calculated the index wrong and didn't properly check all lines. It worked out, though, because it didn't actually call the truncate function.
Now, truncation occurs to the last `</page>` tag. If the XML file ends with a `</page>` tag, then nothing gets truncated. The page is added after that; if nothing was truncated, this will result in the same page being listed twice (which already happened with the missing truncation), but if truncation did happen then the file should no longer be invalid.
2 years ago
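The truncation rule described in commit 9b2c6e40ae (cut the dump back to its last `</page>` tag before resuming) can be illustrated as follows. This is a simplified sketch, not the code from dumpgenerator.py: `truncate_after_last_page` is a hypothetical name, and the whole file is read into memory, which would not be suitable for very large dumps.

```python
def truncate_after_last_page(path, marker=b"</page>"):
    """Cut the file right after its last closing page tag.

    Returns the number of bytes removed. If no marker is found, the file
    is left untouched and 0 is returned.
    """
    with open(path, "rb+") as dump:
        data = dump.read()
        index = data.rfind(marker)
        if index == -1:
            return 0
        end = index + len(marker)
        # Drop everything after the last complete page, for example a
        # half-written <page> element left behind by an interrupted run.
        dump.truncate(end)
        return len(data) - end
```

If the file already ends with `</page>`, nothing is removed, which matches the case in the commit message where the resumed page may end up listed twice.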
Pokechu22
43945c467f
Work around unicode titles not working with resuming
...
Before, you would get `UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal`. The `%s` versus `{}` change was needed because otherwise you would get `UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)`. There is probably a better way of solving that, but this one does work.
2 years ago
Federico Leva
e33f14fce6
Support GiveUpGitHub
2 years ago
nemobis
269841c909
Merge pull request #431 from simonliu99/updatelists
...
Update mediawiki wikifarm lists
2 years ago
Liu
d9885e0845
Update shoutwiki-spider to remove duplicates
2 years ago
Liu
fcc4080b23
Update neoseeker.com.info instructions
2 years ago
Liu
e7f7266550
Update fandom.com spider and remove duplicates
2 years ago
Liu
9c5c55342d
Update miraheze.org spider and remove duplicates
2 years ago