A major update - checkalive.pl now checks for api.php and writes it to the list if found.

git-svn-id: https://wikiteam.googlecode.com/svn/trunk@965 31edc4fc-5e31-b4c4-d58b-c8bc928bcb95
scottdb56 10 years ago
parent 575c9dd3ea
commit b9be07936a

@ -1,117 +1,247 @@
# Name: checkalive.pl v1.2
# Description: This script will go thru a list of URLs & determine
# if they are online & if they are Mediawiki wikis. It should work
# with: "/index.php/Main_Page", "index.php", "api.php" and even pages
# such as: "/wiki/Pagina_principale".
# Name: checkalive.pl v2.0
# Description: This script will go thru a list of URLs & determine if they are online & if they are Mediawiki wikis.
# It should work with: "/index.php/Main_Page", "index.php", "api.php" and even pages such as: "/wiki/Pagina_principale".
# If the URL is not "api.php", it will look for it, check it, and output it if found to be a valid api.php. If not found,
# it will output the URL with "index.php" if that's available.
# Created: 12/14/2013
# Most recently updated: 01/26/2014 (It's a work-in-progress...)
# Most recently updated: 02/25/2014
# Copyright (c) 2013-2014 by Scott D. Boyd - scottdb56@gmail.com
# ====================================================================
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# GNU General Public License for more details.
# ===========================================================================================================================
# This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
# ====================================================================
# This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty
# of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
# NOTE: The following three Perl modules need to be installed on your computer.
# Search for them on cpan.org or use your Linux distro's package manager.
# You should have received a copy of the GNU General Public License along with this program. If not,
# see <http://www.gnu.org/licenses/>.
# ===========================================================================================================================
# NOTE: The following four Perl modules need to be installed on your computer.
# Search for them on cpan.org or use your Linux distro's package manager.
use LWP::Simple;
use LWP::UserAgent;
use Crypt::SSLeay;
my $slp=2; # You can change this number for seconds to sleep between requests (currently 2 seconds)
my $urllist="my-URL-list.txt";
use Mojo::URL;
my $slp=2; # You can change this number for seconds to sleep between requests (currently 2 seconds)
my $urllist="URL-list.txt";
my $alivelist="alive-wikis.txt";
my $deadlist="dead-wikis.txt";
my $pwrdby1="Powered by MediaWiki";
my $pwrdby2="poweredby_mediawiki";
my $genmw="meta name\=\"generator\" content\=\"MediaWiki";
my $mwapi="MediaWiki API documentation page";
my $lw=0; my $dw=0;
my $mwapi2="API Home Page"; # found in an older version of the api
my $indexphp="index.php";
my $apiphp="api.php";
my $wapiphp="w\/api.php";
my $wikiapiphp="wiki\/api.php";
my $apiurl="";
my $live=0; my $dead=0;
my $a=1; my $b=0; my $c=0;
my $flag=0;
my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0"); # use this user-agent to get into wikis that block spiders & robots
$ua->agent("Mozilla/5.0"); # use this user-agent to get into wikis that block spiders & robots
# Here's where most of the work takes place:
open (MYURLLIST,"<$urllist")
or die "Cannot open the URL-list file: $!";
open (ALIVEFILE,">$alivelist");
open (DEADFILE,">$deadlist");
while (<MYURLLIST>) {
if ((/\#(.*?)/) || (/^\s*$/)) { # check to see if line is a comment or a blank line
next; # if so - skip it
if ((/\#(.*?)/) || (/^\s*$/)) { # check to see if line is a comment or a blank line
next; # if so - skip it
} else {
$url=$_; # assign the current line to $url
$url=$_; # assign the current line to $url
chomp $url;
$req = HTTP::Request->new(GET => $url); # --|
$req->header('Accept' => 'text/html'); # |-- some of these lines
$res = $ua->request($req); # send request |-- were adapted from
if ($res->is_success) { # if the URL still exists |-- lwpcook.pod
print "Got it! "; # |
$doc=$res->content; # |
&ParsePage; # go to "ParsePage" sub-routine |
} else { # |
$errormsg=$res->status_line; # --|
$req = HTTP::Request->new(GET => $url); # --|
$req->header('Accept' => 'text/html'); # |-- some of these lines
$res = $ua->request($req); # send request |-- were adapted from
if ($res->is_success) { # if the URL still exists |-- lwpcook.pod
print "Got it! "; # |
$doc=$res->content; # |
print "Parsing the document... ";
if (($doc=~/$pwrdby1/i) || ($doc=~/$pwrdby2/i)) { # if the page contains: "Powered by MediaWiki"
print "It's alive and powered by Mediawiki\n"; # or: "poweredby_mediawiki"
$flag=1;$live++; # then it's a MediaWiki wiki
& Check4api;
} elsif ($doc=~/$genmw/i) { # if the content generator is MediaWiki
print "It's alive and powered by Mediawiki\n"; # then it's a MediaWiki wiki
& Check4api;
} elsif ($doc=~/$mwapi/i) { # if the api.php contains: "MediaWiki API documentation page"
print "It's alive and powered by Mediawiki\n"; # then it's a MediaWiki wiki
print ALIVEFILE "$url\n";
} elsif ($doc=~/$mwapi2/i) { # if the api.php contains: "API Home Page" (older version)
print "It's alive and powered by Mediawiki\n"; # then it's a MediaWiki wiki
print ALIVEFILE "$url\n";
unless ($flag) {
print "It's alive but NOT powered by MediaWiki\n";
print DEADFILE "$url is alive but NOT powered by Mediawiki\n"; $dead++;
} else {
if (substr($errormsg,0,3) eq "500") { # if response-code 500
print DEADFILE "$url\n"; $dw++;
print DEADFILE "$url\n"; $dead++;
} elsif (substr($errormsg,0,3) eq "401") { # if Unauthorized (code 401)
print DEADFILE "$url\n"; $dw++;
print DEADFILE "$url\n"; $dead++;
} elsif (substr($errormsg,0,3) eq "403") { # if forbidden (code 403)
print DEADFILE "$url is alive but access is denied.\n"; $dw++;
print DEADFILE "$url is alive but access is denied.\n"; $dead++;
} elsif (substr($errormsg,0,3) eq "404") { # if URL is dead (code 404)
print DEADFILE "$url\n"; $dw++;
print DEADFILE "$url\n"; $dead++;
} elsif (substr($errormsg,0,3) eq "406") { # if Not Acceptable (code 406)
print DEADFILE "$url\n"; $dw++;
print DEADFILE "$url\n"; $dead++;
$c++; $b=$c/10;
if ($b==$a) {
print "Checked $c URLs -- "; # print the progress every 10 URLs
print "Checked $c URLs -- "; # print the progress every 10 URLs
if ($slp > 0) {
print "Pausing for $slp seconds...\n\n"; sleep $slp;
} else { # don't pause - go on to the next URL
print "\nFinished! I found $lw live wikis and $dw dead or non-Mediawiki wikis.\n";
print "\nFinished! I found $live live wikis and $dead dead or non-Mediawiki wikis.\n";
# Here's the sub-routine
# ======================
sub ParsePage {
print "Parsing the document... ";
if (($doc=~/$pwrdby1/i) || ($doc=~/$pwrdby2/i)) { # if the page contains: "Powered by MediaWiki"
print "It's alive and powered by Mediawiki\n"; # or: "poweredby_mediawiki"
print ALIVEFILE "$url\n"; # then it's a MediaWiki wiki
} elsif ($doc=~/$genmw/i) { # if the content generator is MediaWiki
print "It's alive and powered by Mediawiki\n"; # then it's a MediaWiki wiki
print ALIVEFILE "$url\n";
} elsif ($doc=~/$mwapi/i) { # if the api.php contains: "MediaWiki API documentation page"
print "It's alive and powered by Mediawiki\n"; # then it's a MediaWiki wiki
print ALIVEFILE "$url\n";
# Here are the sub-routines
# =======================
sub Check4api {
$pos=rindex($url,"\/"); # $pos will contain the position of the final "/" (counting from zero)
$base_plus=substr($url,0,($pos+1)); # $base_plus will contain everything up to & including the final "/"
my $len1=length($url); my $len2=length($base_plus);
if ($len2 < 10) { # if $base_plus contains only "http://" or "https://"
$base_plus=$url; # then assign $url to $base_plus
my $tail=substr $base_plus, -1;
if (!($tail=~/\//)) { # if the last character of $base_plus is not a "/"
$base_plus=$base_plus."\/" ; # then add it
unless ($flag) {
print "It's alive but NOT powered by MediaWiki\n";
print DEADFILE "$url is alive but NOT powered by Mediawiki\n"; $dw++;
$apiurl=$base_plus.$apiphp; # $apiurl is our new URL with api.php tacked on the end
&PauseRoutine; & Fetch_api; # pause & then try to get api.php
if ($res->is_success) {
print "Found api.php... "; $doc=$res->content;
}else{ # if no api.php...
$apiurl=$base_plus.$wapiphp; # modify the URL
&PauseRoutine; & Fetch_api; # pause & then try to get /w/api.php
if ($res->is_success) {
print "Found api.php... "; $doc=$res->content;
}else{ # if no /w/api.php...
$apiurl=$base_plus.$wikiapiphp; # modify the URL
&PauseRoutine; & Fetch_api; # pause & then try to get /wiki/api.php
if ($res->is_success) {
print "Found api.php... "; $doc=$res->content;
if (/https:\/\//) {
} else {
$url = Mojo::URL->new($url);
$base = $url->host; # extract just the host from $url & assign it to $base
my $tail=substr $base, -1;
if (!($tail=~/\//)) { # if the last character of $base is not a "/"
$base=$base."\/" ; # then add it
$apiurl=$base.$apiphp; # $apiurl is our new URL with api.php tacked on the end
&PauseRoutine; & Fetch_api; # pause & then try to get api.php
if ($res->is_success) {
print "Found api.php... "; $doc=$res->content;
}else{ # if no api.php...
$apiurl=$base.$wapiphp; # modify the URL
&PauseRoutine; & Fetch_api; # pause & then try to get /w/api.php
if ($res->is_success) {
print "Found api.php... "; $doc=$res->content;
}else{ # if no /w/api.php...
$apiurl=$base.$wikiapiphp; # modify the URL
&PauseRoutine; & Fetch_api; # pause & then try to get /wiki/api.php
if ($res->is_success) {
print "Found api.php... "; $doc=$res->content;
if (!($url=~/index.php/i)) { # if the URL does not end with index.php...
print "There is no api.php -- I'll try index.php...\n";
$indexurl=$base_plus.$indexphp; # then tack on index.php...
$req = HTTP::Request->new(GET => $indexurl); # and try to get it
$req->header('Accept' => 'text/html');
$res = $ua->request($req); # send request
if ($res->is_success) {
if (($doc=~/$pwrdby1/i) || ($doc=~/$pwrdby2/i)) { # if the page contains: "Powered by MediaWiki"
print ALIVEFILE "$indexurl\n"; # or: "poweredby_mediawiki"
}elsif ($doc=~/$genmw/i) { # if the content generator is MediaWiki
print ALIVEFILE "$indexurl\n";
print "There is no api.php OR index.php for $url\n";
print ALIVEFILE "$url\n";
print ALIVEFILE "$url\n";
print "There is no api.php for $url\n";
print ALIVEFILE "$url\n";
sub Fetch_api {
$req = HTTP::Request->new(GET => $apiurl);
$req->header('Accept' => 'text/html');
$res = $ua->request($req); # send request
sub Parse_api {
print "Parsing the document...\n ";
if ($doc=~/$mwapi/i) { # if the api.php contains: "MediaWiki API documentation page"
print ALIVEFILE "$apiurl\n"; # then it's a MediaWiki wiki - print it to the list
}elsif ($doc=~/$mwapi2/i) { # if the api.php contains: "API Home Page" (older version)
print "Found a valid api.php and writing it to the list\n"; # delete this line after testing
print ALIVEFILE "$apiurl\n"; # then it's a MediaWiki wiki - print it to the list
print "This api.php is not valid.\n"; # then try to get index.php
print "Trying to get $indexurl...\n";
$req = HTTP::Request->new(GET => $indexurl);
$req->header('Accept' => 'text/html');
$res = $ua->request($req); # send request
if ($res->is_success) {
if (($doc=~/$pwrdby1/i) || ($doc=~/$pwrdby2/i)) { # if the page contains: "Powered by MediaWiki"
print ALIVEFILE "$indexurl\n"; # or: "poweredby_mediawiki"
}elsif ($doc=~/$genmw/i) { # if the content generator is MediaWiki
print ALIVEFILE "$indexurl\n";
print "There is no api.php OR index.php for $url\n";
print ALIVEFILE "$url\n";
sub PauseRoutine {
if ($slp > 0) {
print "Pausing for $slp seconds...\n\n"; sleep $slp;
} else { } # don't pause - go on to the next URL

@ -1,9 +1,11 @@
checkalive.pl is a Perl script that will go thru a list of URLs & determine if
they are online and if they are Mediawiki wikis. It should work with
"index.php/Main_Page", "index.php" and "api.php". As of 01/23/2014, I have
started using version numbers.
checkalive.pl is a Perl script that will go thru a list of URLs & determine if they are
online & if they are Mediawiki wikis. It should work with: "/index.php/Main_Page",
"index.php", "api.php" and even pages such as: "/wiki/Pagina_principale". If the URL is
not "api.php", it will look for it, check it, and output it if found to be a valid api.php.
If not found, it will output the URL with "index.php" if that's available.
As of 01/23/2014, I have started using version numbers.
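
As a rough illustration of the api.php lookup described above, the sketch below builds the candidate URLs from a page URL. The three suffixes ("api.php", "w/api.php", "wiki/api.php") are the ones the script probes; the helper name candidate_api_urls and its return format are assumptions made for this example and are not part of checkalive.pl.

```perl
use strict;
use warnings;

# Hypothetical helper (not in checkalive.pl): list the api.php URLs to try
# for a given page URL, in the order the script probes them.
sub candidate_api_urls {
    my ($url) = @_;
    my $pos  = rindex($url, "/");           # position of the final "/"
    my $base = substr($url, 0, $pos + 1);   # everything up to and including it
    # If the final "/" belongs to "http://" or "https://", keep the whole URL.
    $base = $url if length($base) < 10;
    $base .= "/" unless $base =~ m{/$};     # make sure the base ends with "/"
    return map { $base . $_ } ("api.php", "w/api.php", "wiki/api.php");
}

print "$_\n" for candidate_api_urls("http://example.org/wiki/index.php");
```

Each candidate would then be fetched in turn, with the first one that returns a valid API page written to the alive list.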
Required programs and modules
@ -12,10 +14,13 @@ be on your system. You will also need to have the following Perl modules install
The first two are contained in LWP - The World-Wide Web library for Perl
(aka: libwww-perl-6.x), available at CPAN, or through your Linux distro's package manager.
(aka: libwww-perl-6.x), available at CPAN (http://www.cpan.org), or through your Linux
distro's package manager.
Crypt::SSLeay (OpenSSL support for LWP) is also available at CPAN. This module
is needed to properly handle any URLs beginning with "https".
Mojo::URL is available at CPAN as well. It's needed to extract the domain name from a URL.
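
To show what "extracting the domain name" means here, the sketch below does it with a plain regular expression in core Perl. This is a simplified stand-in, not the script's method (the script uses the host method of Mojo::URL); the helper name host_of is an assumption made for this example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Mojo::URL->new($url)->host, core Perl only.
# Handles plain http/https URLs; a port is stripped, while userinfo
# ("user@host") and IPv6 literals are NOT handled.
sub host_of {
    my ($url) = @_;
    return $1 if $url =~ m{^https?://([^/:?#]+)}i;
    return;    # not an http(s) URL
}

print host_of("https://wiki.example.org/w/index.php"), "\n";
```

The host on its own carries no scheme, so a fetchable URL needs "http://" or "https://" put back in front of it.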
@ -31,6 +36,12 @@ There are several variables you can change, or you can just use them as-is:
and will be noted as such.
Any other variable that you want to change - you do so at your own risk.
Starting the script
If you want to use the default configuration noted above, at a command prompt, simply
type: "perl checkalive.pl" (without the quotes). You must be in the same directory (or
folder) as the script and the URL list that you want to check.
The script does NOT have a "resume" feature at this time. If you are running through a
@ -38,8 +49,3 @@ list of 1000's of URLs, and the script crashes, or you kill it, your lists of al
dead URLs will NOT BE SAVED TO DISK. I suggest breaking up your list into smaller lists
of a few hundred URLs in each list until I can implement a resume feature.
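
One way to break a long list into smaller lists, as suggested above, is sketched here in core Perl. The helper name chunk_lines, the default of 300 lines per file, and the ".001", ".002" output suffixes are all assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: split a list of lines into chunks of at most $size lines.
sub chunk_lines {
    my ($lines, $size) = @_;
    die "chunk size must be positive\n" unless $size > 0;
    my @copy = @$lines;                     # leave the caller's array untouched
    my @chunks;
    push @chunks, [ splice(@copy, 0, $size) ] while @copy;
    return @chunks;
}

# Usage (assumed file name): perl split-list.pl URL-list.txt 300
# Writes URL-list.txt.001, URL-list.txt.002, and so on.
if (@ARGV) {
    my ($file, $size) = @ARGV;
    $size ||= 300;
    open(my $in, '<', $file) or die "Cannot open $file: $!";
    my @lines = <$in>;
    close $in;
    my $n = 0;
    for my $chunk (chunk_lines(\@lines, $size)) {
        my $out_name = sprintf("%s.%03d", $file, ++$n);
        open(my $out, '>', $out_name) or die "Cannot write $out_name: $!";
        print {$out} @$chunk;
        close $out;
    }
}
```

Each output file can then be copied to URL-list.txt (or $urllist can be pointed at it) and checked in its own run.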
The LWP library does transparent redirect handling, so I can't capture the new URL that
is displayed on screen as the script is running. Therefore, any of the URLs that get
redirected to a new URL will have the original URL saved to the appropriate list (whether
it's dead or alive).
