mirror of https://github.com/WikiTeam/wikiteam synced 2024-11-18 21:27:45 +00:00
Commit Graph

968 Commits

Author SHA1 Message Date
Emilio
287b8b88a3
250,000 wikis 2019-03-03 17:06:31 +01:00
emijrp
ffb39afd1e 800 wikidot sites 2018-07-21 09:57:07 +02:00
emijrp
28158f9b04 wikis 2018-07-20 21:22:54 +02:00
emijrp
7c72c27f2a wikidot 2018-07-20 16:33:00 +02:00
emijrp
4e8c92b6d2 Merge branch 'master' of https://github.com/WikiTeam/wikiteam 2018-07-13 14:28:57 +02:00
emijrp
0ebf86caf6 update, 1.8M users, 400K wikis 2018-07-13 14:28:44 +02:00
nemobis
bee34f4b1b
Merge pull request #319 from TyIsI/patch-1
Updated with vancouver.hackspace.ca -> vanhack.ca domain change
2018-06-21 07:15:58 +03:00
TyIsI
09fac2aeeb Updated with vancouver.hackspace.ca domain change 2018-06-20 18:23:58 -07:00
emijrp
5aac17ea03 update 2018-06-20 13:03:30 +02:00
emijrp
72b67c74f1 randomize saving 2018-06-20 13:01:01 +02:00
emijrp
ca672426bb quotes issues in titles 2018-05-31 20:44:02 +02:00
emijrp
a69f44caab ignore expired wikis 2018-05-28 22:12:15 +02:00
emijrp
a359984932 ++ 2018-05-26 11:25:53 +02:00
emijrp
5525a3cc4a ++ 2018-05-26 10:03:53 +02:00
emijrp
3361e4d09f Merge branch 'master' of https://github.com/WikiTeam/wikiteam 2018-05-25 23:04:50 +02:00
emijrp
94ebe5e1a3 skipping deactivated wikispaces 2018-05-25 23:04:38 +02:00
Federico Leva
83af47d6c0 Catch and raise PageMissingError when query() returns no pages 2018-05-25 11:00:32 +03:00
Federico Leva
73902d39c0 For old MediaWiki releases, use rawcontinue and wikitools query()
Otherwise the query continuation may fail and only the top revisions
will be exported. Tested with Wikia:
http://clubpenguin.wikia.com/api.php?action=query&prop=revisions&titles=Club_Penguin_Wiki

Also add parentid since it's available after all.

https://github.com/WikiTeam/wikiteam/issues/311#issuecomment-391957783
2018-05-25 10:55:44 +03:00
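
A minimal sketch of the continuation issue this commit works around, assuming a plain requests session rather than the wikitools query() the commit actually switches to; the MediaWiki parameter names (rawcontinue, rvcontinue) are real, the function name and session handling are illustrative only:

```python
import requests

def iter_revisions(api_url, title, session=None):
    """Page through one title's revisions using the legacy "rawcontinue"
    continuation mode that old MediaWiki releases need; with the newer
    "continue" mode those wikis may stop after the top revisions."""
    session = session or requests.Session()
    params = {
        'action': 'query',
        'prop': 'revisions',
        'titles': title,
        'rvlimit': 50,
        'rvprop': 'ids|timestamp|user|comment|content',
        'rawcontinue': '',   # ask the API for the old-style continuation block
        'format': 'json',
    }
    while True:
        reply = session.get(api_url, params=params).json()
        for page in reply['query']['pages'].values():
            for rev in page.get('revisions', []):
                yield rev
        cont = reply.get('query-continue', {}).get('revisions')
        if not cont:
            break
        params.update(cont)   # e.g. {'rvcontinue': '...'} on the next request
```
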
emijrp
d11df60516 Merge branch 'master' of https://github.com/WikiTeam/wikiteam 2018-05-24 13:28:22 +02:00
emijrp
de7822cd37 duckduckgo parser; remove .zip after upload 2018-05-24 13:28:12 +02:00
Federico Leva
bf4781eeea Merge branch 'master' of github.com:WikiTeam/wikiteam 2018-05-23 18:33:34 +03:00
Federico Leva
da64349a5d Avoid UnboundLocalError: local variable 'reply' referenced before assignment 2018-05-23 18:32:38 +03:00
emijrp
273f1b33cb Merge branch 'master' of https://github.com/WikiTeam/wikiteam 2018-05-23 14:26:07 +02:00
emijrp
70eefcc945 skipping deleted wikis 2018-05-23 14:25:51 +02:00
Federico Leva
3b74173e0f launcher.py style and minor changes 2018-05-22 21:44:18 +03:00
Federico Leva
6fbde766c4 Further reduce os.walk() in launcher.py to speed up 2018-05-22 12:41:02 +03:00
Federico Leva
b7789751fc UnboundLocalError: local variable 'reply' referenced before assignment
Warning!: "./tdicampswikiacom-20180522-wikidump" path exists
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2321, in <module>
    main()
  File "./dumpgenerator.py", line 2283, in main
    while reply.lower() not in ['yes', 'y', 'no', 'n']:
UnboundLocalError: local variable 'reply' referenced before assignment
2018-05-22 10:30:11 +03:00
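
A sketch of the fix for the traceback above, assuming the Python 2 raw_input() that dumpgenerator.py used at the time; the prompt text is illustrative:

```python
# If no earlier branch assigns "reply", the while condition reads an
# unbound local and raises UnboundLocalError. Binding the name up front
# makes the loop safe regardless of which branch ran before it.
reply = ''
while reply.lower() not in ['yes', 'y', 'no', 'n']:
    reply = raw_input('Path exists. Resume the dump? [yes/no] ')
```
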
Federico Leva
d76b4b4e01 Raise and catch PageMissingError when revisions API result is incomplete
https://github.com/WikiTeam/wikiteam/issues/317
2018-05-22 10:16:52 +03:00
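
An illustrative sketch of the raise-and-catch pattern this commit describes; the exception name matches the commit, but the helper function and the toy page dict are assumptions:

```python
class PageMissingError(Exception):
    """Raised when the revisions API result for a page is incomplete."""

def revisions_or_raise(page):
    # Illustrative: treat a page dict without a 'revisions' key as missing,
    # so the caller can skip it instead of crashing mid-dump.
    if 'revisions' not in page:
        raise PageMissingError(page.get('title', '<unknown>'))
    return page['revisions']

page = {'title': 'Example', 'missing': ''}   # toy API result with no revisions
try:
    revs = revisions_or_raise(page)
except PageMissingError as exc:
    print('Skipping page with no revisions: %s' % exc)
```
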
Federico Leva
7a655f0074 Check for sha1 presence in makeXmlFromPage() 2018-05-22 09:33:53 +03:00
Federico Leva
baae839a38 Complete update of the Wikia lists
* Reduce the offset to 100, the new limit for non-bots.
* Continue listing even when we get an empty request because all
  the wikis in a batch have become inactive and are filtered out.
* Print less from curl's requests.
* Automatically write the domain names to the files here.
2018-05-21 23:26:40 +03:00
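
A sketch of the listing loop described in the bullet points above, in Python rather than the curl-based script the commit touches; the endpoint URL, field names, and the empty-batch guard are placeholders, not the real Wikia API:

```python
import requests

LIST_URL = 'https://example.org/api/v1/Wikis/List'   # placeholder endpoint

def list_wiki_domains(limit=100):
    """Page through a wiki-listing API in batches of `limit` (lowered to 100,
    the cap for non-bots). An empty batch is not treated as the end of the
    list, since a whole window of wikis may have been filtered out as
    inactive; an arbitrary guard stops only after many empty batches."""
    offset, empty_in_a_row = 0, 0
    while empty_in_a_row < 50:                        # illustrative guard
        reply = requests.get(LIST_URL,
                             params={'limit': limit, 'offset': offset}).json()
        items = reply.get('items', [])
        if items:
            empty_in_a_row = 0
            for wiki in items:
                yield wiki.get('domain')              # written straight to the list files
        else:
            empty_in_a_row += 1
        offset += limit
```
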
Federico Leva
4bc41c3aa2 Actually keep track of listed titles and stop when duplicates are returned
https://github.com/WikiTeam/wikiteam/issues/309
2018-05-21 16:41:10 +03:00
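
A self-contained sketch of the duplicate-tracking idea in this commit; `fetch_batch` is a placeholder callable, not a function from dumpgenerator.py:

```python
def iter_unique_titles(fetch_batch):
    """Some broken wikis keep returning the same allpages batch forever.
    Remembering what was already yielded and stopping once a batch adds
    nothing new breaks that loop."""
    seen = set()
    while True:
        batch = fetch_batch()
        new = [t for t in batch if t not in seen]
        if not batch or not new:
            break                      # nothing new: stop listing
        seen.update(new)
        for title in new:
            yield title

# toy usage: a source that keeps repeating the same batch
batches = iter([['A', 'B'], ['B', 'C'], ['B', 'C'], ['B', 'C']])
print(list(iter_unique_titles(lambda: next(batches, []))))   # ['A', 'B', 'C']
```
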
Federico Leva
80288cf49e Catch allpages and namespaces API without query results 2018-05-21 16:41:00 +03:00
Federico Leva
e47f638a24 Define "check" before running checkAPI()
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2294, in <module>
    main()
  File "./dumpgenerator.py", line 2239, in main
    config, other = getParameters(params=params)
  File "./dumpgenerator.py", line 1587, in getParameters
    if api and check:
UnboundLocalError: local variable 'check' referenced before assignment
2018-05-21 15:53:51 +03:00
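
Same class of bug as the earlier 'reply' traceback; a sketch of the shape of the fix, where the checkAPI stub and its return value are assumptions standing in for the real probe in dumpgenerator.py:

```python
def checkAPI(api, session=None):
    # stand-in for the real API probe; the real return shape differs
    return bool(api), api

def resolve_api(api, session=None):
    check = None                  # bind the name even if the probe is skipped
    if api:
        check, api = checkAPI(api, session=session)
    if api and check:             # the test that previously hit an unbound "check"
        return api
    return None
```
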
Federico Leva
dd32202a55 Merge branch 'master' of github.com:WikiTeam/wikiteam 2018-05-21 11:44:57 +03:00
Federico Leva
fcdc1b5cf2 Use os.listdir('.') 2018-05-21 11:44:35 +03:00
Federico Leva
bad49d7916 Also default to regenerating dump in --failfast 2018-05-21 11:44:25 +03:00
Federico Leva
c5b71f60ad Also default to regenerating dump in --failfast 2018-05-21 07:57:07 +03:00
Federico Leva
bbcafdf869 Support Unicode usernames etc. in makeXmlFromPage()
Test case:

Titles saved at... 39fanficwikiacom-20180521-titles.txt
377 page titles loaded
http://39fanfic.wikia.com/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
30 namespaces found
Exporting revisions from namespace 0
Warning. Could not use allrevisions, wiki too old.
1 more revisions exported
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2291, in <module>
    main()
  File "./dumpgenerator.py", line 2283, in main
    createNewDump(config=config, other=other)
  File "./dumpgenerator.py", line 1849, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "./dumpgenerator.py", line 732, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session):
  File "./dumpgenerator.py", line 861, in getXMLRevisions
    yield makeXmlFromPage(pages[page])
  File "./dumpgenerator.py", line 880, in makeXmlFromPage
    E.username(str(rev['user'])),
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-3: ordinal not in range(128)
2018-05-21 07:54:27 +03:00
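
dumpgenerator.py was Python 2 at the time, so str() on a username containing non-ASCII characters raises exactly the UnicodeEncodeError in the traceback above. A minimal Python 2 sketch of the difference; the revision dict and XML string are illustrative, not the lxml builder call the script uses:

```python
# -*- coding: utf-8 -*-
# Python 2 sketch: keep the value as unicode (or encode it explicitly)
# instead of passing it through str(), which only handles ASCII.
rev = {'user': u'Пример'}          # illustrative revision dict

username = unicode(rev['user'])    # rather than str(rev['user'])
fragment = u'<contributor><username>%s</username></contributor>' % username
print(fragment.encode('utf-8'))
```
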
Federico Leva
3df2513e67 Merge branch 'master' of github.com:WikiTeam/wikiteam 2018-05-21 07:33:32 +03:00
Federico Leva
69ec7e5015 Use os.listdir() and avoid os.walk() in launcher too
With millions of files, everything stalls otherwise.
2018-05-21 07:33:03 +03:00
emijrp
a82a98a40a . 2018-05-20 20:43:11 +02:00
emijrp
9352bc9af5 comment 2018-05-20 20:37:17 +02:00
emijrp
3b0d4fef5e utf8 latin1 2018-05-20 20:36:08 +02:00
Federico Leva
4351e09d80 uploader.py: respect --admin in collection 2018-05-20 01:48:17 +03:00
Federico Leva
320f231d57 Handle status code > 400 in checkAPI()
Fixes https://github.com/WikiTeam/wikiteam/issues/315
2018-05-20 01:41:01 +03:00
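
A sketch of the status-code handling the commit adds, assuming a requests-based probe; the function name and siteinfo query are illustrative, only the "> 400" check mirrors the commit:

```python
import requests

def probe_api(api_url, session=None):
    """Treat an HTTP error status as "not a working API endpoint" instead
    of trying to parse the error page as JSON."""
    session = session or requests.Session()
    r = session.get(api_url, params={'action': 'query', 'meta': 'siteinfo',
                                     'format': 'json'})
    if r.status_code > 400:        # status check as named in the commit
        return False
    try:
        return 'query' in r.json()
    except ValueError:             # body was not JSON at all
        return False
```
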
Federico Leva
845c05de1e Go back to data POSTing in checkIndex() and checkAPI() to handle redirects
Some redirects from HTTP to HTTPS otherwise end up giving 400, like
http://nimiarkisto.fi/
2018-05-20 01:20:32 +03:00
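
The commit does not spell out why POST data survives these redirects where a query string gets a 400; the sketch below only illustrates the shape of the call it goes back to, with an illustrative function name:

```python
import requests

def check_api_post(api_url, session=None):
    """Send the parameters as POST data so a redirect such as the
    HTTP -> HTTPS one on http://nimiarkisto.fi/ can be followed without
    the target rejecting the request."""
    session = session or requests.Session()
    r = session.post(api_url,
                     data={'action': 'query', 'meta': 'siteinfo', 'format': 'json'},
                     allow_redirects=True)
    try:
        return r.status_code < 400 and 'query' in r.json()
    except ValueError:
        return False
```
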
Federico Leva
de752bb6a2 Also add contentmodel to the XML of --xmlrevisions 2018-05-20 00:28:01 +03:00
Federico Leva
f7466850c9 List of wikis to archive, from not-archived.py 2018-05-20 00:21:05 +03:00
Federico Leva
d07a14cbce New version of uploader.py with possibility of separate directory
Also much faster than using os.walk, which lists all the images
in all wikidump directories.
2018-05-20 00:00:27 +03:00
Federico Leva
03ba77e2f5 Build XML from the pages module when allrevisions not available 2018-05-19 22:34:13 +03:00
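
A sketch of the fallback path this commit adds, assuming requests and the standard MediaWiki prop=revisions ("pages" module) parameters; the function name and the way results are consumed are illustrative:

```python
import requests

def iter_pages_with_revisions(api_url, titles, session=None):
    """When list=allrevisions is unsupported (wiki too old), fetch each
    title's history through prop=revisions and yield the page dicts, which
    can then feed something like makeXmlFromPage()."""
    session = session or requests.Session()
    for title in titles:
        reply = session.get(api_url, params={
            'action': 'query', 'prop': 'revisions', 'titles': title,
            'rvlimit': 'max', 'rvprop': 'ids|timestamp|user|comment|content',
            'format': 'json'}).json()
        for page in reply['query']['pages'].values():
            yield page
```
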