Parsing Web Logs Into Calendar Text/Stats

We aren’t quite happy with the web stats software we use: its reports aren’t brief enough. Mostly, we want to see the last couple of months, with the weekends visible, along with the unique hosts and pages viewed for each day. We also serve a lot of content that isn’t what we consider a valid page request, and we don’t want to count a host that only fetched that. In short, we want to see how many content pages were served this month and the previous month, and how many different people looked at those pages.

The cal command, actually, has a format that is quite similar to what we want to see:

u-1@srv-1 u-1 $ cal -3
    February 2003          March 2003            April 2003
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
                   1                     1         1  2  3  4  5
 2  3  4  5  6  7  8   2  3  4  5  6  7  8   6  7  8  9 10 11 12
 9 10 11 12 13 14 15   9 10 11 12 13 14 15  13 14 15 16 17 18 19
16 17 18 19 20 21 22  16 17 18 19 20 21 22  20 21 22 23 24 25 26
23 24 25 26 27 28     23 24 25 26 27 28 29  27 28 29 30
                      30 31

It seemed simple enough to just replace the dates with the stats we want. First, we needed a utility that would give us the number of unique hosts and number of pages viewed. If we massage the input to the utility using grep and sort, we can simplify our task:

u-1@srv-1 log $ grep 04/Mar/2003 nlog | sort > sday

This will snag all lines for March 04, 2003, sort them (the IP address is the first field on each line, so this groups each host’s requests together), and put the result in the file sday. Now, if we use this Perl script:

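#!/usr/bin/perl
# hc.pl: count unique hosts and content pages in a log sorted by IP.
$uniquehosts=0;
$pages=0;
$lastip="";
while (<>){
    # A content page is any .html request or the site root.
    if(/GET.+\.html HTTP/ || /GET \/ HTTP/){
        $pages++;
        # The IP is the first field; compare as a string (ne), not a number.
        if(m/^(\d+\.\d+\.\d+\.\d+) -/ && $1 ne $lastip){
            $uniquehosts++;
            $lastip=$1;
        }
    }
}
# Print both counts in hundreds, with no trailing newline.
printf ("%02d-%02d",$uniquehosts/100,$pages/100);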

We can determine unique hosts and the number of content pages viewed:

u-1@srv-1 log $ grep 05/Mar/2003 nlog | sort > sday
u-1@srv-1 log $ cat sday | ./hc.pl
10-36u-1@srv-1 log $

There is no trailing newline in the output, so the result appears to the left of the u-1 prompt. The script prints both counts in hundreds, so roughly 1,000 unique hosts viewed roughly 3,600 pages on March 05.
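The counting step can also be done without sorting first, by remembering each IP address in a hash. The following is only a sketch of that idea, not the script we actually run: the function name count_hosts_and_pages and the sample log lines are made up for illustration, and it assumes each line starts with the client IP, as in Common Log Format.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical alternative to hc.pl: count content pages and unique hosts
# without requiring sorted input, by remembering each IP in a hash.
sub count_hosts_and_pages {
    my @lines = @_;
    my %seen;
    my $pages = 0;
    for my $line (@lines) {
        # Same notion of a content page: a .html request or the site root.
        next unless $line =~ m{GET \S+\.html HTTP} || $line =~ m{GET / HTTP};
        $pages++;
        # The client IP is the first field on the line.
        $seen{$1} = 1 if $line =~ /^(\d+\.\d+\.\d+\.\d+) /;
    }
    return (scalar(keys %seen), $pages);
}

# Made-up sample lines, deliberately not sorted by IP.
my @sample = (
    '10.0.0.1 - - [05/Mar/2003:09:01:07 -0800] "GET / HTTP/1.0" 200 2048',
    '10.0.0.1 - - [05/Mar/2003:09:01:08 -0800] "GET /logo.gif HTTP/1.0" 200 731',
    '10.0.0.2 - - [05/Mar/2003:10:15:32 -0800] "GET /index.html HTTP/1.0" 200 512',
    '10.0.0.1 - - [05/Mar/2003:11:45:00 -0800] "GET /about.html HTTP/1.0" 200 900',
);

my ($hosts, $pages) = count_hosts_and_pages(@sample);
printf("%d hosts, %d pages\n", $hosts, $pages);   # 2 hosts, 3 pages
```

The image request is skipped, so the sample yields three pages from two hosts. The hash costs some memory per distinct IP, which is the trade-off for dropping the sort.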

OK. That is pretty cool. Now, let’s write a script that hacks up the output of cal to create this for every day. Here is the script:

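#!/usr/bin/perl
# calc.pl: print per-day stats laid out like the output of cal.
system("cal -3 > calout");
# $m=0 is last month, $m=1 is this month.
for($m=0;$m<2;$m++){
    open(CAL, "< calout") or die "cannot open calout: $!";
    while (<CAL>){
        # Each month is 20 characters wide, with 2 spaces between months.
        $last=substr($_,0+22*$m,20);
        if($last=~/[A-Z]/){
            if($last=~/\d\d\d\d/){
                # Month/year heading, e.g. "February 2003".
                @my=split " ",$last;
                print $my[0].":\n\n";
                $mo=substr($my[0],0,3);
            }
            else{
                # Weekday names: widen each gap so a column is 9 wide.
                $last=~s/ /       /g;
                print $last."\n";
                $grabpad="yes";
            }
        }
        else{
            # String comparison (eq), so the padding runs only once a month.
            if($grabpad eq "yes"){
                # First week: skip the cells before the 1st of the month.
                if($last=~m/^( +)\d/){
                    $pad=(length($1)-1)/3;
                    for ($p=1;$p <= $pad;$p++){
                        print "         ";
                    }
                }
                $grabpad = "no";
            }
            @el=split " ", $last;
            for($i=0;$i<7;$i++){
                if (defined $el[$i] && $el[$i] ne ""){
                    printf ("%02d ",$el[$i]);
                    if ($el[$i]=~/\d\d/){
                        system("grep ".$el[$i]."/".$mo."/".$my[1]." nlog | sort | ./hc.pl");
                        print " ";
                    }
                    else{
                        # Zero-pad single digits so 2 matches only 02/Mar.
                        system("grep 0".$el[$i]."/".$mo."/".$my[1]." nlog | sort | ./hc.pl");
                        print " ";
                    }
                }
            }
            print"\n";
        }
    }
    close CAL;
}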

First, we’ll run it, and then we will explain the script:

u-1@srv-1 log $ ./calc.pl
February:
Su       Mo       Tu       We       Th       Fr       Sa
                                                      01 03-04 
02 03-06 03 05-07 04 05-09 05 07-24 06 09-35 07 09-37 08 05-30 
09 05-36 10 09-50 11 10-60 12 10-55 13 09-54 14 08-37 15 05-30 
16 06-35 17 08-35 18 09-34 19 10-43 20 10-48 21 09-43 22 05-33 
23 05-25 24 10-45 25 10-37 26 10-45 27 10-68 28 09-48 
March:
Su       Mo       Tu       We       Th       Fr       Sa
                                                      01 05-30 
02 05-31 03 10-35 04 10-33 05 10-36 06 00-00 07 00-00 08 00-00 
09 00-00 10 00-00 11 00-00 12 00-00 13 00-00 14 00-00 15 00-00 
16 00-00 17 00-00 18 00-00 19 00-00 20 00-00 21 00-00 22 00-00 
23 00-00 24 00-00 25 00-00 26 00-00 27 00-00 28 00-00 29 00-00 
30 00-00 31 00-00

We start out by calling the cal program and writing its output to calout. We only want this month and last month, so the for $m loop uses substr to grab the 20 characters starting at position 0 or at position 22 of each line. If that piece contains capital letters, it is either the month/year heading or the row of weekday names. If it is the month/year, we split both out into @my. If it is the weekday names, we stretch them out so they line up with the wider columns when printed. We also set a flag, $grabpad, which means we need to count the leading spaces on the next line, so we can pad the first week and get the data under the correct day. When the flag is set, we count the spaces, calculate the new padding, and set the flag back to no.

If the current piece is all digits, we fill an array @el with the values and iterate through them. If a value is not empty (“”), we print it. This is the day of the month. Remember, this is all iterating over the output of cal -3. If the day is two digits long (\d\d), we run the same command we ran above with hc.pl. If it is only one digit, we add a 0 to the grep pattern, since we want 2 to match only 02/Mar, and not 12/Mar or 22/Mar as well.

Agatha will probably embed the data onto her desktop using this technique.
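The two fiddly calculations described above, zero-padding the day for grep and working out how many cells to skip in the first week, can be pulled out and tried on their own. This is a minimal sketch: the helper names date_pattern and empty_cells are made up for illustration and do not appear in calc.pl, and the rows are assumed to be 20-character pieces of cal output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the date string to search for, zero-padding
# the day so that day 2 becomes "02/Mar/2003" rather than "2/Mar/2003".
sub date_pattern {
    my ($day, $mon, $year) = @_;
    return sprintf("%02d/%s/%s", $day, $mon, $year);
}

# Hypothetical helper: number of empty day cells before the first date in
# a week row from cal. Each cell is 3 columns wide, and a single-digit
# date carries one leading space of its own.
sub empty_cells {
    my ($row) = @_;
    return 0 unless $row =~ /^( +)\d/;
    return (length($1) - 1) / 3;
}

# Rows are built with the x operator so the space counts are explicit.
my $sat_start = (' ' x 19) . '1';              # the 1st falls on a Saturday
my $tue_start = (' ' x 7) . '1  2  3  4  5';   # the 1st falls on a Tuesday

print date_pattern(2, 'Mar', 2003), "\n";      # 02/Mar/2003
print date_pattern(22, 'Mar', 2003), "\n";     # 22/Mar/2003
print empty_cells($sat_start), "\n";           # 6
print empty_cells($tue_start), "\n";           # 2
print empty_cells('16 17 18 19 20 21 22'), "\n";   # 0
```

Using sprintf for the day would also let the two grep branches in calc.pl collapse into one, at the cost of straying from the script as written.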
