A major update - checkalive.pl now checks for api.php and writes it to the list if found.

git-svn-id: https://wikiteam.googlecode.com/svn/trunk@965 31edc4fc-5e31-b4c4-d58b-c8bc928bcb95
2024-11-04 12:00:28 +00:00 · 2014-02-27 05:50:28 +00:00 · 2014-02-27 05:50:28 +00:00 · b9be07936a
commit b9be07936a
parent 575c9dd3ea
2 changed files with 214 additions and 78 deletions
--- a/listsofwikis/checkalive.pl
+++ b/listsofwikis/checkalive.pl
@@ -1,117 +1,247 @@
 #!/usr/bin/perl

-# Name: checkalive.pl v1.2
-# Description: This script will go thru a list of URLs & determine 
-# if they are online & if they are Mediawiki wikis. It should work
-# with: "/index.php/Main_Page", "index.php", "api.php" and even pages
-# such as: "/wiki/Pagina_principale".
+# Name: checkalive.pl v2.0
+# Description: This script will go thru a list of URLs & determine if they are online & if they are Mediawiki wikis.
+# It should work with: "/index.php/Main_Page", "index.php", "api.php" and even pages such as: "/wiki/Pagina_principale".
+# If the URL is not "api.php", it will look for it, check it, and output it if found to be a valid api.php. If not found,
+# it will output the URL with "index.php" if that's available.
 #
 # Created: 12/14/2013
-# Most recently updated: 01/26/2014 (It's a work-in-progress...)
+# Most recently updated: 02/25/2014
 # Copyright (c) 2013-2014 by Scott D. Boyd - scottdb56@gmail.com
-# ====================================================================
-# This program is free software: you can redistribute it and/or modify
-# it under the terms of the GNU General Public License as published by
-# the Free Software Foundation, either version 3 of the License, or
-# (at your option) any later version.
 #
-# This program is distributed in the hope that it will be useful,
-# but WITHOUT ANY WARRANTY; without even the implied warranty of
-# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-# GNU General Public License for more details.
+# ===========================================================================================================================
+# This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License 
+# as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
 #
-# You should have received a copy of the GNU General Public License
-# along with this program.  If not, see <http://www.gnu.org/licenses/>.
-# ====================================================================
+# This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty 
+# of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.
 #
-# NOTE: The following three Perl modules need to be installed on your computer.
-# Search for them on cpan.org or use your Linux distro's package manager.
+# You should have received a copy of the GNU General Public License along with this program.  If not, 
+# see <http://www.gnu.org/licenses/>.
+# ===========================================================================================================================
+#
+# NOTE: The following four Perl modules need to be installed on your computer.
+#       Search for them on cpan.org or use your Linux distro's package manager.
 use LWP::Simple;
 use LWP::UserAgent;
 use Crypt::SSLeay;
-my $slp=2; # You can change this number for seconds to sleep between requests (currently 2 seconds)
-my $urllist="my-URL-list.txt";
+use Mojo::URL;
+my $slp=2; 	# You can change this number for seconds to sleep between requests (currently 2 seconds)
+my $urllist="URL-list.txt";
 my $alivelist="alive-wikis.txt";
 my $deadlist="dead-wikis.txt";
 my $pwrdby1="Powered by MediaWiki";
 my $pwrdby2="poweredby_mediawiki";
 my $genmw="meta name\=\"generator\" content\=\"MediaWiki";
 my $mwapi="MediaWiki API documentation page";
-my $lw=0; my $dw=0;
+my $mwapi2="API Home Page";				# found in an older version of the api
+my $indexphp="index.php";
+my $apiphp="api.php";
+my $wapiphp="w\/api.php";
+my $wikiapiphp="wiki\/api.php";
+my $apiurl="";
+my $live=0; my $dead=0;
 my $a=1; my $b=0; my $c=0;
 my $flag=0;
 my $ua = LWP::UserAgent->new;
-$ua->agent("Mozilla/5.0");			# use this user-agent to get into wikis that block spiders & robots
+$ua->agent("Mozilla/5.0");				# use this user-agent to get into wikis that block spiders & robots
 $ua->timeout(30);
 $ua->show_progress(1);

-# Here's where most of the work takes place:
 open (MYURLLIST,"<$urllist")
  or die "Cannot open the URL-list file: $!";
 open (ALIVEFILE,">$alivelist");
 open (DEADFILE,">$deadlist");
 while (<MYURLLIST>) {
-  if ((/\#(.*?)/) || (/^\s*$/)) {		# check to see if line is a comment or a blank line
-    next;					# if so - skip it
+  if ((/\#(.*?)/) || (/^\s*$/)) {			# check to see if line is a comment or a blank line
+    next;						# if so - skip it
  } else {
-    $url=$_;					# assign the current line to $url
+    $url=$_;						# assign the current line to $url
    chomp $url;
-    $req = HTTP::Request->new(GET => $url);  	#				 --|
-    $req->header('Accept' => 'text/html');	#				   |-- some of these lines
-    $res = $ua->request($req);			# send request 			   |-- were adapted from
-    if ($res->is_success) {			# if the URL still exists	   |-- lwpcook.pod
-       print "Got it! ";			#				   |
-       $doc=$res->content;			#				   |
-       &ParsePage;				# go to "ParsePage" sub-routine	   |
-    } else {					#				   |
-      $errormsg=$res->status_line;		#				 --|
+    $req = HTTP::Request->new(GET => $url);  		#				 --|
+    $req->header('Accept' => 'text/html');		#				   |-- some of these lines
+    $res = $ua->request($req);				# send request 			   |-- were adapted from
+    if ($res->is_success) {				# if the URL still exists	   |-- lwpcook.pod
+       print "Got it! ";				#				   |
+       $doc=$res->content;				#				   |
+       print "Parsing the document... ";
+       if (($doc=~/$pwrdby1/i) || ($doc=~/$pwrdby2/i)) {	# if the page contains: "Powered by MediaWiki"
+          print "It's alive and powered by Mediawiki\n";	# or: "poweredby_mediawiki"
+          $flag=1;$live++;					# then it's a MediaWiki wiki
+          & Check4api;
+       } elsif ($doc=~/$genmw/i) {				# if the content generator is MediaWiki
+          print "It's alive and powered by Mediawiki\n";	# then it's a MediaWiki wiki
+          $flag=1;$live++;
+          & Check4api;
+       } elsif ($doc=~/$mwapi/i) {				# if the api.php contains: "MediaWiki API documentation page"
+          print "It's alive and powered by Mediawiki\n";	# then it's a MediaWiki wiki
+          print ALIVEFILE "$url\n";
+          $flag=1;$live++;
+       } elsif ($doc=~/$mwapi2/i) {				# if the api.php contains: "API Home Page" (older version)
+          print "It's alive and powered by Mediawiki\n";	# then it's a MediaWiki wiki
+          print ALIVEFILE "$url\n";
+          $flag=1;$live++;
+       }
+       unless ($flag) {
+         print "It's alive but NOT powered by MediaWiki\n";
+         print DEADFILE "$url is alive but NOT powered by Mediawiki\n"; $dead++;
+       }
+    $flag=0;
+    } else {
+      $errormsg=$res->status_line;
      if (substr($errormsg,0,3) eq "500") {		# if response-code 500
-         print DEADFILE "$url\n"; $dw++;
+         print DEADFILE "$url\n"; $dead++;
      } elsif  (substr($errormsg,0,3) eq "401") {	# if Unauthorized (code 401)
-         print DEADFILE "$url\n"; $dw++;
+         print DEADFILE "$url\n"; $dead++;
      } elsif  (substr($errormsg,0,3) eq "403") {	# if forbidden (code 403)
-         print DEADFILE "$url is alive but access is denied.\n"; $dw++;
+         print DEADFILE "$url is alive but access is denied.\n"; $dead++;
      } elsif  (substr($errormsg,0,3) eq "404") {	# if URL is dead (code 404)
-         print DEADFILE "$url\n"; $dw++;
+         print DEADFILE "$url\n"; $dead++;
      } elsif  (substr($errormsg,0,3) eq "406") {	# if Not Acceptable (code 406)
-         print DEADFILE "$url\n"; $dw++;
+         print DEADFILE "$url\n"; $dead++;
      }
    }
    $c++; $b=$c/10; 
    if ($b==$a) { 
-       print "Checked $c URLs -- ";		# print the progress every 10 URLs
+       print "Checked $c URLs -- ";			# print the progress every 10 URLs
       $a++;
    } 
-    if ($slp > 0) {
-       print "Pausing for $slp seconds...\n\n"; sleep $slp;
-    } else { 					# don't pause - go on to the next URL
-    }
+    &PauseRoutine;
  }
 }
 close DEADFILE; close ALIVEFILE; close MYURLLIST;
-print "\nFinished! I found $lw live wikis and $dw dead or non-Mediawiki wikis.\n";
+print "\nFinished! I found $live live wikis and $dead dead or non-Mediawiki wikis.\n";

-# Here's the sub-routine
-# ======================
-sub ParsePage  {
-   print "Parsing the document... ";
-   if (($doc=~/$pwrdby1/i) || ($doc=~/$pwrdby2/i)) {	# if the page contains: "Powered by MediaWiki"
-      print "It's alive and powered by Mediawiki\n";	# or: "poweredby_mediawiki"
-      print ALIVEFILE "$url\n";				# then it's a MediaWiki wiki
-      $flag=1;$lw++;
-   } elsif ($doc=~/$genmw/i) {				# if the content generator is MediaWiki
-      print "It's alive and powered by Mediawiki\n";	# then it's a MediaWiki wiki
-      print ALIVEFILE "$url\n";
-      $flag=1;$lw++;
-   } elsif ($doc=~/$mwapi/i) {				# if the api.php contains: "MediaWiki API documentation page"
-      print "It's alive and powered by Mediawiki\n";	# then it's a MediaWiki wiki
-      print ALIVEFILE "$url\n";
-      $flag=1;$lw++;
+# Here are the sub-routines
+# =========================
+sub Check4api {
+   $pos=rindex($url,"\/");				# $pos will contain the position of the final "/" (counting from zero)
+   $base_plus=substr($url,0,($pos+1)); 			# $base_plus will contain everything up to & including the final "/"
+   my $len1=length($url); my $len2=length($base_plus);
+   if ($len2 < 10) {					# if $base_plus contains only "http://" or "https://"
+      $base_plus=$url;					# then assign $url to $base_plus
+      my $tail=substr $base_plus, -1;
+      if (!($tail=~/\//)) {				# if the last character of $base_plus is not a "/"
+         $base_plus=$base_plus."\/" ;			# then add it
+      } 
   }
-   unless ($flag) {
-     print "It's alive but NOT powered by MediaWiki\n";
-     print DEADFILE "$url is alive but NOT powered by Mediawiki\n"; $dw++;
+   $apiurl=$base_plus.$apiphp;				# $apiurl is our new URL with api.php tacked on the end
+   &PauseRoutine; & Fetch_api;				# pause & then try to get api.php
+   if ($res->is_success) {
+     print "Found api.php... "; $doc=$res->content;
+     &Parse_api;
+   }else{						# if no api.php...
+     $apiurl=$base_plus.$wapiphp;			# modify the URL
+     &PauseRoutine; & Fetch_api;			# pause & then try to get /w/api.php     
+     if ($res->is_success) {
+       print "Found api.php... "; $doc=$res->content;
+       &Parse_api;
+      }else{						# if no /w/api.php...
+        $apiurl=$base_plus.$wikiapiphp;			# modify the URL
+        &PauseRoutine; & Fetch_api;			# pause & then try to get /wiki/api.php
+        if ($res->is_success) {
+          print "Found api.php... "; $doc=$res->content;
+          &Parse_api;
+        }else{
+          if (/https:\/\//) { 
+            $scheme="https://";
+          } else { 
+            $scheme="http://"; 
+          }
+          $url = Mojo::URL->new($url);
+          $base = $url->host;				# extract just the host from $url & assign it to $base
+          $base=$scheme.$base;
+          my $tail=substr $base, -1;
+          if (!($tail=~/\//)) {				# if the last character of $base is not a "/"
+          $base=$base."\/" ;				# then add it
+          } 
+          $apiurl=$base.$apiphp;			# $apiurl is our new URL with api.php tacked on the end
+          &PauseRoutine; & Fetch_api;			# pause & then try to get api.php
+          if ($res->is_success) {
+            print "Found api.php... "; $doc=$res->content;
+            &Parse_api;
+          }else{					# if no api.php...
+            $apiurl=$base.$wapiphp;			# modify the URL
+            &PauseRoutine; & Fetch_api;			# pause & then try to get /w/api.php     
+            if ($res->is_success) {
+              print "Found api.php... "; $doc=$res->content;
+              &Parse_api;
+            }else{					# if no /w/api.php...
+              $apiurl=$base.$wikiapiphp;		# modify the URL
+              &PauseRoutine; & Fetch_api;		# pause & then try to get /wiki/api.php
+              if ($res->is_success) {
+                print "Found api.php... "; $doc=$res->content;
+              &Parse_api;
+              }else{
+               if (!($url=~/index.php/i)) {		# if the URL does not contain index.php...
+                 print "There is no api.php -- I'll try index.php...\n";
+                 $indexurl=$base_plus.$indexphp;	# then tack on index.php...
+                 $req = HTTP::Request->new(GET => $indexurl); # and try to get it
+                 $req->header('Accept' => 'text/html');
+                 $res = $ua->request($req);		# send request
+                 if ($res->is_success) {
+                   $doc=$res->content;
+                   if (($doc=~/$pwrdby1/i) || ($doc=~/$pwrdby2/i)) {	# if the page contains: "Powered by MediaWiki"
+                     print ALIVEFILE "$indexurl\n";   			# or: "poweredby_mediawiki"
+                   }elsif ($doc=~/$genmw/i) {				# if the content generator is MediaWiki
+                      print ALIVEFILE "$indexurl\n";
+                   }else{
+                      print "There is no api.php OR index.php for $url\n";
+                      print ALIVEFILE "$url\n";
+                   }
+                 }else{ 
+                    print ALIVEFILE "$url\n";
+                 }
+               }else{
+                  print "There is no api.php for $url\n";
+                  print ALIVEFILE "$url\n";
+               }
+              }
+            }
+          }
+        }
+     }
   }
-   $flag=0;
+}
+
+sub Fetch_api {
+   $req = HTTP::Request->new(GET => $apiurl);
+   $req->header('Accept' => 'text/html');
+   $res = $ua->request($req);				# send request
+}
+
+sub Parse_api {
+   print "Parsing the document...\n ";
+   if ($doc=~/$mwapi/i) {				# if the api.php contains: "MediaWiki API documentation page"
+      print ALIVEFILE "$apiurl\n";			# then it's a MediaWiki wiki - print it to the list
+   }elsif ($doc=~/$mwapi2/i) {				# if the api.php contains: "API Home Page" (older version)
+      print "Found a valid api.php and writing it to the list\n"; # delete this line after testing
+      print ALIVEFILE "$apiurl\n";			# then it's a MediaWiki wiki - print it to the list
+   }else{
+      print "This api.php is not valid.\n";  		# then try to get index.php
+      $indexurl=$base_plus.$indexphp;
+      print "Trying to get $indexurl...\n";
+      $req = HTTP::Request->new(GET => $indexurl);
+      $req->header('Accept' => 'text/html');
+      $res = $ua->request($req);			# send request
+      if ($res->is_success) {
+        $doc=$res->content;
+        if (($doc=~/$pwrdby1/i) || ($doc=~/$pwrdby2/i)) { # if the page contains: "Powered by MediaWiki"
+          print ALIVEFILE "$indexurl\n";   		  # or: "poweredby_mediawiki"
+        }elsif ($doc=~/$genmw/i) {			  # if the content generator is MediaWiki
+           print ALIVEFILE "$indexurl\n";
+        }else{
+           print "There is no api.php OR index.php for $url\n"; 
+           print ALIVEFILE "$url\n";
+        }
+      }	
+   }
+}
+
+sub PauseRoutine {
+   if ($slp > 0) {
+      print "Pausing for $slp seconds...\n\n"; sleep $slp;
+   } else { }						# don't pause - go on to the next URL
 }
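
The nested if/else chain in Check4api tries api.php, w/api.php and wiki/api.php, first under the URL's directory and then under the bare host. As a rough sketch only, the same search order could be produced as a flat list and walked in a loop. `candidate_api_urls` is a hypothetical helper that does not exist in the commit, a plain regex stands in for Mojo::URL (unlike `host`, it keeps any port number), and duplicates are dropped, which the script itself does not do.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the commit): list the api.php locations that
# Check4api tries, in the same order, for a given page URL.
sub candidate_api_urls {
    my ($url) = @_;

    # Everything up to and including the final "/", as Check4api does.
    my $base_plus = substr($url, 0, rindex($url, '/') + 1);

    # A bare "http://host" has no path slash, so fall back to the URL itself.
    $base_plus = $url if length($base_plus) < 10;
    $base_plus .= '/' if substr($base_plus, -1) ne '/';

    # Scheme and host only; this regex is an assumption standing in for Mojo::URL.
    my ($root) = $url =~ m{^(https?://[^/]+)}i;
    $root = defined $root ? "$root/" : $base_plus;

    my (%seen, @candidates);
    for my $prefix ($base_plus, $root) {
        for my $path ('api.php', 'w/api.php', 'wiki/api.php') {
            my $candidate = $prefix . $path;
            push @candidates, $candidate unless $seen{$candidate}++;
        }
    }
    return @candidates;
}

print "$_\n" for candidate_api_urls('http://example.org/wiki/Main_Page');
```

A caller could then fetch each candidate in turn, pausing between requests, and stop at the first one whose body matches one of the API markers.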

--- a/listsofwikis/readme-checkalive.txt
+++ b/listsofwikis/readme-checkalive.txt
@@ -1,9 +1,11 @@
 Description
 -----------
-checkalive.pl is a Perl script that will go thru a list of URLs & determine if
-they are online and if they are Mediawiki wikis. It should work with
-"index.php/Main_Page", "index.php" and "api.php". As of 01/23/2014, I have
-started using version numbers.
+checkalive.pl is a Perl script that will go thru a list of URLs & determine if they are 
+online & if they are Mediawiki wikis. It should work with: "/index.php/Main_Page", 
+"index.php", "api.php" and even pages such as: "/wiki/Pagina_principale". If the URL is
+not "api.php", it will look for it, check it, and output it if found to be a valid api.php.
+If not found, it will output the URL with "index.php" if that's available. 
+As of 01/23/2014, I have started using version numbers.

 Required programs and modules
 -----------------------------
@@ -12,10 +14,13 @@ be on your system. You will also need to have the following Perl modules install
 LWP::Simple
 LWP::UserAgent
 Crypt::SSLeay
+Mojo::URL
 The first two are contained in LWP - The World-Wide Web library for Perl
-(aka: libwww-perl-6.x), available at CPAN, or through  your Linux distro's package manager.
+(aka: libwww-perl-6.x), available at CPAN (http://www.cpan.org) or through your Linux
+distro's package manager.
 Crypt::SSLeay (OpenSSL support for LWP) is also available at CPAN. This module
 is needed to properly handle any URLs beginning with "https".
+Mojo::URL is available at CPAN as well. It's needed to extract the domain name from a URL.

 Configuration
 -------------
@@ -31,6 +36,12 @@ There are several variables you can change, or you can just use them as-is:
   and will be noted as such.
 Any other variable that you want to change - you do so at your own risk.

+Starting the script
+-------------------
+If you want to use the default configuration noted above, at a command prompt, simply
+type: "perl checkalive.pl" (without the quotes). You must be in the same directory (or 
+folder) as the script and the URL list that you want to check.
+
 Issues
 ------
 The script does NOT have a "resume" feature at this time. If you are running through a
@@ -38,8 +49,3 @@ list of 1000's of URLs, and the script crashes, or you kill it, your lists of al
 dead URLs will NOT BE SAVED TO DISK. I suggest breaking up your list into smaller lists
 of a few hundred URLs in each list until I can implement a resume feature.

-The LWP library does transparent redirect handling, so I can't capture the new URL that
-is displayed on screen as the script is running. Therefore, any of the URLs that get
-redirected to a new URL will have the original URL saved to the appropriate list(whether
-it's dead or alive).
-
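
The readme's Issues section says the script has no resume feature and that results can be lost if a run is interrupted. One possible approach, shown here only as a sketch and not as the author's plan, is to append each finished URL to a progress file with autoflush turned on and to skip those URLs on the next run. The helper names `load_done` and `open_appending` and the file name `checked-urls.txt` are assumptions made for illustration. Only core modules are used.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Handle;                 # core; gives file handles an autoflush() method
use File::Temp qw(tempdir);     # core; keeps this demo out of the working directory

# Hypothetical helper: read the URLs recorded by an earlier run.
sub load_done {
    my ($file) = @_;
    my %done;
    open(my $in, '<', $file) or return \%done;   # no file yet means nothing is done
    while (my $line = <$in>) {
        chomp $line;
        $done{$line} = 1 if length $line;
    }
    close $in;
    return \%done;
}

# Hypothetical helper: open a file for appending and flush after every print.
sub open_appending {
    my ($file) = @_;
    open(my $out, '>>', $file) or die "Cannot open $file: $!";
    $out->autoflush(1);
    return $out;
}

my $dir           = tempdir(CLEANUP => 1);
my $progress_file = "$dir/checked-urls.txt";     # hypothetical file name
my @urls          = ('http://example.org/', 'http://example.com/wiki/');

# Two passes over the same list; the second stands in for a restarted run.
for my $run (1, 2) {
    my $done    = load_done($progress_file);
    my $log     = open_appending($progress_file);
    my $checked = 0;
    for my $url (@urls) {
        next if $done->{$url};      # already handled by an earlier run
        # ... fetch and classify $url here, printing to the alive/dead lists ...
        print {$log} "$url\n";      # written out immediately because of autoflush
        $checked++;
    }
    close $log;
    print "Run $run checked $checked URL(s)\n";
}
```

The progress file records the input URL rather than the api.php or index.php URL written to the alive list, so a restarted run can match lines from the original URL list directly. Opening the alive and dead lists in append mode with autoflush would be needed as well for their contents to survive a restart.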