Merge branch 'master' of https://github.com/WikiTeam/wikiteam

10 years ago · 084ccc6456
parent 7d00cfa0de 1933db8a94
commit 084ccc6456
2 changed files with 8 additions and 7 deletions
--- a/README.md
+++ b/README.md
@@ -3,9 +3,9 @@

 **WikiTeam software is a set of tools for archiving wikis.** They work on MediaWiki wikis, but we want to expand to other wiki engines. As of June 2014, WikiTeam has preserved more than [13,000 stand-alone wikis](https://github.com/WikiTeam/wikiteam/wiki/Available-Backups), several wikifarms, regular Wikipedia dumps and [24TB of Wikimedia Commons images](https://archive.org/details/wikimediacommons).

-There are [thousands](http://wikiindex.org) of [wikis](https://wikiapiary.com) in the Internet. Everyday some of them are no longer publicly available and, due to lack of backups, lost forever. Millions of people download tons of media files (movies, music, books, etc) from the Internet, implementing a kind of distributed backup. Wikis, most of them under free licenses, disappear from time to time because nobody grabbed a copy of them. That is a shame that we would like to solve.
+There are [thousands](http://wikiindex.org) of [wikis](https://wikiapiary.com) in the Internet. Every day some of them are no longer publicly available and, due to lack of backups, lost forever. Millions of people download tons of media files (movies, music, books, etc) from the Internet, serving as a kind of distributed backup. Wikis, most of them under free licenses, disappear from time to time because nobody grabbed a copy of them. That is a shame that we would like to solve.

-**WikiTeam** is the [Archive Team](http://www.archiveteam.org) ([GitHub](https://github.com/ArchiveTeam)) subcommittee on wikis. It was founded and originally developed by [Emilio J. Rodríguez-Posada](https://github.com/emijrp), a Wikipedia veteran editor and amateur archivist. Many people have help sending suggestions, [reporting bugs](https://github.com/WikiTeam/wikiteam/issues), writing [documentation](https://github.com/WikiTeam/wikiteam/wiki), providing help in the [mailing list](http://groups.google.com/group/wikiteam-discuss) and making [wiki backups](https://github.com/WikiTeam/wikiteam/wiki/Available-Backups). Thanks to all, especially to: [Federico Leva](https://github.com/nemobis), [Alex Buie](https://github.com/ab2525), [Scott Boyd](http://www.sdboyd56.com), [Hydriz](https://github.com/Hydriz), Platonides, Ian McEwen and [Mike Dupont](https://github.com/h4ck3rm1k3).
+**WikiTeam** is the [Archive Team](http://www.archiveteam.org) ([GitHub](https://github.com/ArchiveTeam)) subcommittee on wikis. It was founded and originally developed by [Emilio J. Rodríguez-Posada](https://github.com/emijrp), a Wikipedia veteran editor and amateur archivist. Many people have helped by sending suggestions, [reporting bugs](https://github.com/WikiTeam/wikiteam/issues), writing [documentation](https://github.com/WikiTeam/wikiteam/wiki), providing help in the [mailing list](http://groups.google.com/group/wikiteam-discuss) and making [wiki backups](https://github.com/WikiTeam/wikiteam/wiki/Available-Backups). Thanks to all, especially to: [Federico Leva](https://github.com/nemobis), [Alex Buie](https://github.com/ab2525), [Scott Boyd](http://www.sdboyd56.com), [Hydriz](https://github.com/Hydriz), Platonides, Ian McEwen, [Mike Dupont](https://github.com/h4ck3rm1k3) and [balrog](https://github.com/balr0g).

 <table border=0 cellpadding=5px>
 <tr><td>
@@ -55,4 +55,4 @@ See more options:

 ### Download Wikimedia Commons images

-There is a script for this, but we have [uploaded the tarballs](https://archive.org/details/wikimediacommons) to Internet Archive, so perhaps it is a better option download them from IA instead re-generating them with the script.
+There is a script for this, but we have [uploaded the tarballs](https://archive.org/details/wikimediacommons) to Internet Archive, so it's more useful to reseed their torrents than to re-generate old ones with the script.
--- a/dumpgenerator.py
+++ b/dumpgenerator.py
@@ -271,7 +271,7 @@ def getPageTitlesScraper(config={}):
                    checked_suballpages.append(name) #to avoid reload dupe subpages links
                    delay(config=config)
                    req2 = urllib2.Request(url=url, headers={'User-Agent': getUserAgent(), 'Accept-Encoding': 'gzip'})
-                    f = urllib2.urlopen(req)
+                    f = urllib2.urlopen(req2)
                    if f.headers.get('Content-Encoding') and 'gzip' in f.headers.get('Content-Encoding'):
                        raw2 = gzip.GzipFile(fileobj=StringIO.StringIO(f.read())).read()
                    else:
@@ -286,9 +286,10 @@ def getPageTitlesScraper(config={}):
        c = 0
        m = re.compile(r_title).finditer(rawacum)
        for i in m:
-            if not i.group('title').startswith('Special:'):
-                if not i.group('title') in titles:
-                    titles.append(undoHTMLEntities(text=i.group('title')))
+            t = undoHTMLEntities(text=unicode(i.group('title'), 'utf-8'))
+            if not t.startswith('Special:'):
+                if not t in titles:
+                    titles.append(t)
                    c += 1
        print '    %d titles retrieved in the namespace %d' % (c, namespace)
    return titles
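
Note on the last hunk: the old loop tested the raw matched title against `titles` but appended the entity-decoded form, so a title containing HTML entities could be appended twice, and a `Special:` prefix written with an entity would not be filtered. The new loop normalizes first and then filters. The sketch below illustrates that ordering in isolation. It is written for Python 3 rather than the Python 2 used by dumpgenerator.py, and it relies on assumptions: `R_TITLE` is a hypothetical pattern (the real `r_title` is not shown in this diff), `html.unescape` stands in for `undoHTMLEntities`, and `extract_titles` is an illustrative name, not a function in the project.

```python
import html
import re

# Hypothetical pattern for illustration only; the real r_title used by
# getPageTitlesScraper is not visible in this diff.
R_TITLE = re.compile(r'title="(?P<title>[^"]+)"')


def extract_titles(raw):
    """Return unique, entity-decoded page titles found in raw HTML.

    Titles are normalized before the 'Special:' filter and the duplicate
    check, mirroring the order introduced by the hunk above.
    """
    titles = []
    seen = set()  # set membership keeps the duplicate check cheap
    for match in R_TITLE.finditer(raw):
        # html.unescape is a stand-in for undoHTMLEntities (assumption).
        title = html.unescape(match.group('title'))
        if title.startswith('Special:'):
            continue
        if title in seen:
            continue
        seen.add(title)
        titles.append(title)
    return titles
```

For example, calling `extract_titles` on a page that links to `Tom &amp; Jerry` twice yields a single `Tom & Jerry` entry, because both occurrences are decoded before they are compared.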