|
|
|
|
Perl for Web Site ManagementBy John CallenderOctober 2001 1-56592-647-1, Order Number: 6471 528 pages, $34.95 |
Chapter 8
Parsing Web Access LogsWeb server access logs are an excellent source of information about what your site's visitors are up to. The information on separate visitors is all mixed together, though, and for all but the smallest sites the raw access logs are too large to read directly. What you need is log analysis software to make the information in the log more easily accessible. You can buy commercial log analysis software to do this, but Perl makes it easy to write your own. The next three chapters describe how to build such a home-grown log-analysis tool.
This chapter focuses on the first part of the process: extracting and storing the information we're interested in. We talk about log file structure, converting IP addresses, and creating regular expressions capable of parsing web access logs. We also talk about creating a suitable data structure for storing the extracted data, so we can answer interesting questions about what our site's visitors have been doing. Along the way we discuss the difficulty of identifying those visitors in the web server's log entries and devise an approach for extracting at least an approximate version of that information.
The example continues in Chapter 9, which focuses on how to do computations involving dates and times, and finishes in Chapter 10, which covers the specifics of how we manipulate the "visit" information from our logs, as well as the actual output of the finished report.
Log File Structure
Most web servers store their access log in what is called the "common log format." Each time a user requests a file from the server, a line containing the following fields is added to the end of the log file:
- host
- This is either the IP address (like
207.71.222.231) or the corresponding hostname (likepm9-31.sba1.avtel.net) of the remote user requesting the page. For performance reasons, many web servers are configured not to do hostname lookups on the remote host. This means that all you end up with in the log file is a bunch of IP addresses. A bit later in this chapter, you'll develop a Perl script that you can use to convert those IP addresses into hostnames.
- identd result
- This is a field for logging the response returned by the remote user's identd server. Almost no one actually uses this; in every web log I've ever seen, this field is always just a dash (
-).
- authuser
- If you are using basic 'ecHTTP authentication (which we'll be talking about in Chapter 19) to restrict access to some of your web documents, this is where the username of the authenticated user for this transaction will be recorded. Otherwise, it will be just a dash (
-).
- date and time
- Next comes a date and time string inside square brackets, like:
[06/Jul/1999:00:09:12 -0700]. That's the day of the month, the abbreviated month name, and the four-digit year, all separated by slashes. Next come the time (expressed in 24-hour format, so 11:30 P.M. would be 23:30:00) and a time-zone offset (in this example, -0700, because the web server this log was from was using Pacific Daylight Time, which is seven hours behind Universal Time/Greenwich Mean Time).
- request
- This is the actual request sent by the remote user, enclosed in double quotes. Normally it will look something like:
"GET / HTTP/1.0". TheGETpart means it is a GET request (as opposed to a POST or a HEAD request). The next part is the path of the URL requested; in this case, the default page in the server's top-level directory, as indicated by a single slash (/). The last part of the request is the protocol being used, at the time of this writing typically HTTP/1.0 or HTTP/1.1.
- status code
- This is the status code returned by the server; by definition this will be a three-digit number. A status code of
200means everything was handled okay,304means the document has not changed since the client last requested it,404means the document could not be found, and500indicates that there was some sort of server-side error. (More detail on the various status codes can be found in RFC 1945, which describes the HTTP/1.0 protocol. See http://www.w3.org/Protocols/rfc1945/rfc1945.)
- bytes sent
- The amount of data returned by the server, not counting the header line.
An extended version of this log format, often referred to as the "combined" format, includes two additional fields at the end:
- referer
- The referring page, if any, as reported by the remote user's browser. Note that referer is consistently misspelled (with a single "r" in the middle) in the HTTP specification, and in the name of the corresponding environment variable.
- user agent
- The user agent reported by the remote user's browser. Typically, this is a string describing the type and version of browser software being used.
Assuming you have control over your web server's configuration, or can get your ISP to modify it for you, the combined format's extra fields can provide some very interesting information about the users visiting your site. The log analysis script described in this chapter will work with either format, however.
Converting IP Addresses
Before we jump into log-file analysis, let's return briefly to the problem of doing hostname lookups on the IP addresses that most likely comprise the "host" entries in our web access logs. Example 8-1 gives a script,
clf_lookup.plx, that does just that. (Like all the examples in this book, it is available for download from the book's web site, at http://www.elanus.net/book/.)
Example 8-1: A script to do hostname lookups on IP addresses in web access logs #!/usr/bin/perl -w# clf_lookup.plx# given common or extended-format web logs on STDIN, outputs# them with numeric IP addresses in the first (host) field converted# to hostnames (where possible).use strict;use Socket;my %hostname;while (<>) {my $line = $_;my($host, $rest) = split / /, $line, 2;if ($host =~ /^\d+\.\d+\.\d+\.\d+$/) {# looks vaguely like an IP addressunless (exists $hostname{$host}) {# no key, so haven't processed this IP before$hostname{$host} = gethostbyaddr(inet_aton($host), AF_INET);}if ($hostname{$host}) {# only processes IPs with successful lookups$line = "$hostname{$host} $rest";}}print $line;}The script itself is pretty simple, but it introduces some new concepts that are definitely worth learning about. The first new thing is this line:
use Socket;Here we are importing a module called
Socket.pm. Just as we did earlier, when we pulled in theCGI.pmmodule, we're doing this in order to let some more experienced programmers do our dirty work for us. Specifically, theuse Socketdeclaration in this script means we'll be able to do DNS lookups (converting numeric IP addresses to hostnames) using just a few lines of code.Thousands of Perl modules are available. Some are distributed as part of the Perl language itself; these are usually referred to as being in the standard distribution, or as the standard module
Walnuts.pm. (CGI.pmandSocket.pmare in the standard distribution.) Others can be found at CPAN, the Comprehensive Perl Archive Network, which we'll be learning more about in Chapter 11. If you can't wait until then, though (which I can totally understand, CPAN being something like the world's biggest toy store for a Perl programmer), see the accompanying sidebar, "Using CPAN," for details on how you can jump the gun and start exploring CPAN on your own.
Using CPAN
In any event, the following resources will help you get started with CPAN:
- http://www.cpan.org/README.html
- The top-level overview of what's in CPAN, with links to more-specific starting points
- http://www.cpan.org/modules/
- The top-level page within the modules portion of CPAN, with pointers to various views of the modules
- http://www.cpan.org/modules/00modlist.long.html
- A long, annotated list of all the modules in CPAN
- http://search.cpan.org/
- The CPAN search engine
The next thing in
clf_lookup.plxis amyvariable declaration for the%hostnamehash. This is going to be used to cache hostname lookups while the script is running. That way, each IP address will have to be looked up only once rather than every time it appears in the log. It is important to initialize the%hostnamehash out here, before thewhileloop that actually processes each line from the log file, because puttingmy %hostnameinside the loop block would make it so that a new copy of the hash was created each time through the loop.Let's get to the loop now. The beginning of the loop takes the form:
while (<>) {Here we're beginning a
whileloop, which you'll remember means we're going to run a block of code repeatedly as long as whatever is inside those parentheses evaluates to a true value. But what a weird thing we've got inside that logical test. It looks somewhat like the angle-input operator we use to read lines from a filehandle, but there's no filehandle inside it.What the
<>(which is sometimes called the diamond operator) is doing is this: it looks at the@ARGVarray (which you'll remember from Chapter 4 is the special variable holding your script's command-line arguments) and assumes that those arguments represent the names of one or more text files. The<>operator then returns the text from those files, one line at a time, so you can work with those lines in the body of yourwhileblock. It keeps feeding you lines of text until it has exhausted all of the first file mentioned in@ARGV, then goes on to the second file, and so on, until it has exhausted all the files mentioned in@ARGV. After the last line from the last file has been delivered, it returnsundef(the undefined value), which is false, ending the loop.You get an interesting extra feature with the
<>operator. If you don't give your script any command-line arguments, such that there are no files mentioned in@ARGV,<>instead will read from standard input (that is, from theSTDINfilehandle your script gets by default when it is started up). This in turn lets you do cool things like use your script in a shell pipeline to process the input or output for another program. In fact, we'll be using that feature with this script a little later.Where does the
<>operator put each line of text as it is working its way through the files mentioned in@ARGV? In the special scalar variable$_. As I mentioned previously, many of Perl's operators and functions are designed to work with$_by default, and this ends up being really handy because it lets you write certain common operations very quickly.In this case, though, we're going to go ahead and stick the contents of
$_into something a little more memorable. That happens in the next line:my $line = $_;Next comes the following:
my($host, $rest) = split / /, $line, 2;if ($host =~ /^\d+\.\d+\.\d+\.\d+$/) {# looks vaguely like an IP addressHere you are using the
splitfunction to take the current line from the log file and separate it into everything before the first space character (which goes into the scalar variable$host) and everything after the first space character (which goes into$rest). This takes advantage of an optional third argument to thesplitfunction, with that argument being a number tellingsplithow many fields to split the string into (in this case, two, because we don't need to keep splitting once we've split off the first field).Next comes an
ifstatement with a regular expression in the logical test. With your new understanding of regular expressions it should be pretty easy to decipher the meaning of/^\d+\.\d+\.\d+\.\d+$/: it matches a string consisting of four sets of one or more numbers each, separated by periods. This is not the exact same thing as an IP address (in which the component numbers can fall only within a certain range); this pattern is naïve, in that it would accept as IP addresses things like98765.1234.1.1, but it's close enough for our current purpose.Next come these two lines:
unless (exists $hostname{$host}) {# no key, so haven't processed this IP beforeAs discussed earlier, we're going to use
$hostname{$host}to keep track of the IP addresses we've already looked up. Sometimes, though, we will attempt to look up an IP address and find that it can't be resolved to a hostname. In such cases, we'll stickundefinto the value corresponding to$hostname{$host}. The hash will still have a key corresponding to that IP address, but there won't be an associated value. By testing for the existence of a particular hash key (which is what theexistsfunction lets us do), we can avoid entering thisifblock if we come across those hosts again.Next comes the actual looking up of the hostname, which is quite simple, thanks to the
Socket.pmmodule:$hostname{$host} = gethostbyaddr(inet_aton($host), AF_INET);}
gethostbyaddris a Perl function that provides an interface to the computer's underlying hostname lookup function. The two arguments ofinet_aton($host)andAF_INETare a little bit of magic provided courtesy ofSocket.pm.Next comes this:
if ($hostname{$host}) {# only processes IPs with successful lookups$line = "$hostname{$host} $rest";}}If the
gethostbyaddrreturned a false value for this particular$hostname{$host}(meaning this IP address couldn't be resolved to a hostname), the script will skip over this block. Otherwise, it will re-create the$linevariable (which corresponds to the current line from the log file) by interpolating the looked-up hostname in$hostname{$host}into a string, along with the$restvariable (which you will recall holds the rest of the line).The two closing curly braces end the two
ifblocks we were in, after which we print out just the current value of$line:print $line;}This script jumps through a number of hoops in the interest of cutting down the actual work it does -- caching the lookups and avoiding rebuilding
$lineunless it has to make the script a little more involved -- but it is worth taking that sort of care with a program like this because it may end up having to process some very big log files. Even with these tricks, because thegethostbyaddrfunction normally takes a certain amount of time to give up on an IP address that can't be resolved, this script will tend to take a long time to process large log files.One use of
clf_lookup.plxthat is kind of fun is to put it in a pipeline to convert your log file's IP addresses into hostnames on the fly. For example, if your log file is calledaccess.log, you could use thetailcommand with the-fswitch to watch that log growing in real time, piping the output throughclf_lookup.plxto convertthe hostnames, like this:[jbc@andros .logs]$ tail -f access.log | clf_lookup.plxThe Log-Analysis Script
Now that the hostname lookups are taken care of, it's time to write the log-analysis script. Example 8-2 shows the first version of that script.
Example 8-2: log_report.plx, a web log-analysis script (first version) #!/usr/bin/perl -w# log_report.plx# report on web visitorsuse strict;while (<>) {my ($host, $ident_user, $auth_user, $date, $time,$time_zone, $method, $url, $protocol, $status, $bytes) =/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)(\S+)" (\S+) (\S+)$/;print join "\n", $host, $ident_user, $auth_user, $date, $time,$time_zone, $method, $url, $protocol, $status,$bytes, "\n";}This first version of the script is simple. All it does is read in lines via the
<>operator, parse those lines into their component pieces, and then print out the parsed elements for debugging purposes. The line that does the printing out is interesting, in that it uses Perl'sjoinfunction, which you haven't seen before. Thejoinfunction is the polar opposite, so to speak, of thesplitfunction: it lets you specify a string (in its first argument) that will be used to join the list comprising the rest of its arguments into a scalar. In other words, the Perl expressionjoin '-', 'a', 'b', 'c'would return the stringa-b-c. And in this case, using\nto join the various elements parsed by our script lets us print out a newline-separated list of those parsed items.The Mammoth Regular Expression
The real juicy part of this script, though, is that giant regular expression used to parse each log file line into its component parts. Here's that part of the script again:
my ($host, $ident_user, $auth_user, $date, $time,$time_zone, $method, $url, $protocol, $status, $bytes) =/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)(\S+)" (\S+) (\S+)$/;There are a couple of important things to note here. The first is that it is actually fairly tricky to represent this regular expression, which is meant to be on a single line, within the limited width of this book's pages. It's particularly tricky in this case because the spaces between the various elements are important, but it's hard to keep track of those spaces when the expression is broken to fit onto multiple lines. If you are going to test this script yourself, be sure that your version of the expression is all on one line, with a single space character between the right parenthesis that ends the first line and the begin parenthesis that begins the second line. (Or you can just download the example from the book's web site, at http://www.elanus.net/book/, since the downloadable example doesn't feature those problematic line breaks.) You also can refer to the version of this expression created using the
/xmodifier, which is described in the accompanying sidebar, "Regular Expression Extensions," and use that version instead of the one-line version given here.
Regular Expression Extensions
my ($host, $ident_user, $auth_user, $date, $time, $time_zone, $method, $url, $protocol, $status, $bytes) = / # regexp begins ^ # beginning-of-string anchor (\S+) # assigned to $host \ # literal space (\S+) # assigned to $ident_user \ # literal space (\S+) # assigned to $auth_user \ # literal space \[([^:]+) # assigned to $date : # literal : (\d+:\d+:\d+) # assigned to $time \ # literal space ([^\]]+) # assigned to $time_zone \]\ " # literal string '] "' (\S+) # assigned to $method \ # literal space (.+?) # assigned to $url \ # literal space (\S+) # assigned to $protocol "\ # literal string '" ' (\S+) # assigned to $status \ # literal space (\S+) # assigned to $bytes $ # end-of-string anchor /x; # regexp ends, with x modifier The second interesting thing about this chunk of code is how we're taking advantage of the fact that a regular expression that contains capturing parentheses, if you place it in a list context, returns a list of all the elements captured by those parentheses. This means you can stick a list of scalar variables inside parentheses on the left side of the assignment operator, and a regular expression containing capturing parentheses on the right, and assign all the captured substrings in one fell swoop.
Let's go through the regular expression search pattern one chunk at a time:
/^(\S+) (\S+) (\S+)The first thing in the expression is the beginning-of-string anchor (
^). Next comes one or more nonwhitespace characters (\S+), which will match only if they are the first thing on the line (thanks to that beginning-of-string anchor). Because they are enclosed by parentheses, whatever matches will be captured into the$hostvariable in the corresponding list on the left side of the assignment (and into$1, too, though we're not doing anything with that).Next comes a literal space, then another sequence of one or more nonwhitespace characters, which is captured into
$ident_user. Next comes another literal space, and (again) one or more nonwhitespace characters, which are captured into$auth_user.After that comes another literal space, then the following interesting chunk:
\[([^:]+):(\d+:\d+:\d+) ([^\]]+)\]This part of the pattern starts off by matching a literal left bracket (
[). Next it captures one or more characters that are anything but colons (:). Then comes a colon, and then it captures a string consisting of three sets of one or more digits separated by colons. Next there is a space, after which the pattern captures one or more characters that are anything but a right bracket (]). Finally, the pattern matches a right bracket. In other words, it matches a date string that looks like:[06/Jul/1999:00:09:12 -0700]and in doing so captures the date, time, and time zone offset into the
$date,$time, and$time_zonevariables, respectively.After that part of the pattern comes another literal space, and then this:
"(\S+) (.+?) (\S+)"This part matches the request as sent by the web browser to the server. As mentioned earlier, that request typically looks something like this:
"GET / HTTP/1.0"The tricky thing about this part of the pattern is the stuff inside the middle set of capturing parentheses. That's where you match the path of the actual page requested. At first glance you would probably be tempted to use
(\S+)to match that part, on the theory that the requested path is unlikely to contain spaces, but occasionally a space will creep into the path, if only because a user accidentally typed one in when manually specifying the URL.You could use something like
([^"]+)to match the URL part of the request, which would match all the way out to the double quote, and count on the fact that Perl would then backtrack to match the time zone, which needs to have a double quote after it. The problem with this, though, is that it would be relatively inefficient because you'd be making Perl do a lot of backtracking on almost every line.The solution given here is better. By using
.+?to match the URL, you say "match one or more of anything, but don't be greedy." This means the expression will match only as much as it has to in order to make the rest of the expression match. Once it has matched all of the requested URL, the rest of the expression should match, meaning you'll get what you were looking for without a lot of backtracking.The last part of the regular expression is fairly simple, capturing the next two one-or-more-nonwhitespace chunks into
$statusand$bytes, with an end-of-string anchor at the end.And that's it. The only remaining part of the script is the debugging
print join "\n", $host, $ident_user, $auth_user, $date, $time,$time_zone, $method, $url, $protocol, $status,$bytes, "\n";Notice, by the way, how we've just stuck a
joinfunction. This chaining together of two or more functions, where the function on the right returns something that serves as the argument for the function on the left, is a handy shortcut you'll see experienced Perl users using all the time.Before we test the script, we should think for a minute about what will happen in the case where a line from the log file doesn't match that monster regular expression. There's a fairly good chance that all the lines will fail to match the first time we try it because that expression is big and complicated, and one typo will mess the whole thing up. Even after we have it working properly, though, there's always the chance that a screwy line will show up in the log and fail to be parsed properly. What will happen in that case?
What will happen is that the match will fail, and nothing will be assigned to all those variables. If they were global variables, and we were counting on the successful match to replace whatever was already in them from processing the previous line, our script would now go on to print out the previous line's data all over again, which would be a problem. Since we are using
myto give us a fresh batch of variables each time through the loop, though, we don't have to worry about that. Even so, we still have a problem. Since none of the variables were successfully assigned for this trip through the loop, the-wswitch will cause our script to emit a bunch of "Use ofuninitialized variable" warnings as soon as it comes to thatPerhaps the best thing to do in cases where a line doesn't match is to just bail out and go on to the next line. This behavior turns out to be very easy to add to the script: we just put
ornextat the end of the line containing the regular expression assignment:($host, $ident_user, $auth_user, $date, $time,$time_zone, $method, $url, $protocol, $status, $bytes) =/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)(\S+)" (\S+) (\S+)$/or next;How does this work? The
nextfunction tells Perl to go on to the next iteration of the loop we're in. Puttingor somethingat the end of a line of a Perl expression causes everything to the left of theorto be evaluated in a Boolean context (that is, evaluated to see if it yields a true or false value). If our regular expression fails to match, it will return an empty list. That not only means that all those variables will getundefassigned to them; it also means the whole expression will be false, which means the stuff to the right of theorwill be executed. In general, you can putor (something)on the right side of the regular expression, and whatever you put there will fire off only in cases where the expression fails to match.Different Log File Formats
It's fairly easy to modify this script to accept either the common or the extended log format. We do that by adding a configuration variable near the top of the script that looks like this:
my $log_format = 'common'; # 'common' or 'extended'Then we modify the part of the script where the regular expression parsing occurs to include some logic to check that
$log_formatvariable, along with a second version of the regular expression to be used on logs that are in the extended format:if ($log_format eq 'common') {($host, $ident_user, $auth_user, $date, $time,$time_zone, $method, $url, $protocol, $status, $bytes) =/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)(\S+)" (\S+) (\S+)$/or next;} elsif ($log_format eq 'extended') {($host, $ident_user, $auth_user, $date, $time,$time_zone, $method, $url, $protocol, $status, $bytes,$referer, $agent) =/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)(\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/or next;} else {die "unrecognized log format '$log_format'";}I think this probably qualifies as the ugliest block of code in this entire book. This is not the sort of code that anybody wants to have to make sense of more than once, but fortunately, once we get it right, we aren't likely to need to modify it.
Anyway, you'll notice that the new regular expression for extended-format logs has a couple of new chunks at the end, both of which look like
"([^"]+)". By now that should be an easy one for you: it means "match a literal double quote, then capture one or more characters that are anything but a double quote, then match another literal double quote." These two new chunks capture into the new$refererand$agentvariables that we've added at the end of the parenthetical list being assigned to.We've also added an
elseblock, which just does a quick sanity check, dying with an error message if the$log_formatvariable was inadvertently set to an unexpected value.You may have noticed that there is no
mydeclaration before the list of variables in either branch of theif-elsifconstruct. That's because declaring those variables asmyvariables here, inside the curly braces of theif-elsifblock, would limit their visibility later on in thewhileblock, where they need to be visible. As a result, we've moved themydeclaration for the variables above theif-elsifconstruct, just after thewhile(<>)line:my ($host, $ident_user, $auth_user, $date, $time,$time_zone, $method, $url, $protocol, $status, $bytes,$referer, $agent);We should also add the
$refererand$agentvariables to the list of variables that the debuggingprint join "\n", $host, $ident_user, $auth_user, $date, $time,$time_zone, $method, $url, $protocol, $status,$bytes, $referer, $agent, "\n";The entire script as it should look at this point is given in Example 8-3.
Example 8-3: Second version of the log_report.plx script #!/usr/bin/perl -w# log_report.plx# report on web visitorsuse strict;my $log_format = 'common'; # 'common' or 'extended'while (<>) {my ($host, $ident_user, $auth_user, $date, $time,$time_zone, $method, $url, $protocol, $status, $bytes,$referer, $agent);if ($log_format eq 'common') {($host, $ident_user, $auth_user, $date, $time,$time_zone, $method, $url, $protocol, $status, $bytes) =/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)(\S+)" (\S+) (\S+)$/or next;} elsif ($log_format eq 'extended') {($host, $ident_user, $auth_user, $date, $time,$time_zone, $method, $url, $protocol, $status, $bytes,$referer, $agent) =/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)(\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/or next;} else {die "unrecognized log format '$log_format'";}print join "\n", $host, $ident_user, $auth_user, $date, $time,$time_zone, $method, $url, $protocol, $status,$bytes, $referer, $agent, "\n";}Now we're ready to test the
log_report.plxscript on a real log file, to make sure the regular expression is actually parsing the way we think it should. We set the$log_formatvariable to the appropriate value for our log files, then try using something like this in the shell (substituting appropriate pathnames as needed) to redirect the log file into the script's standard input and then pipe the script's output tomore:[jbc@andros .logs]$ log_report.plx < access.log | moreIf our log file has only IP addresses and no hostnames, we can put the
clf_lookup.plxscript we created earlier at the beginning of the pipeline with something like this:[jbc@andros .logs]$ clf_lookup.plx < access.log | log_report.plx | moreThis makes use of a new shell redirection symbol that we haven't used before, the left angle bracket (
<). The<character tells the shell to redirect the contents of a file into a command's standard input.If we enter the command line given here and nothing prints out, we need to make sure we've got the log file path and filename correct. If that's not the problem, we need to double-check the configuration variable to make sure it has the appropriate value for our log file format (and take a careful look at the log file, too, to make sure it really is in the common or extended format).
If the script still doesn't output anything, the problem is probably in our regular expression. We need to make sure it is all on one line and has the appropriate spacing between the various elements. As a last resort, we can try shortening it a little bit at a time (or building it up from nothing a little bit at a time), getting it so that it successfully matches and captures something, at least, then adding additional elements until the whole thing is working.
Storing the Data
Now that we're successfully parsing out the individual elements from each line in the log file, what are we going to do with them? It's time to think about what sorts of things we want to keep track of, and how to represent them in our data structure.
One good thing to keep track of is the time of the first and last access processed. When printed out in our report, this will let us see what range of time is covered by the analyzed log file lines.
Another obvious thing to keep track of is how many raw hits are in the log file. Similarly, we can track the total amount of data (in megabytes) sent out by the server, and the number of HTML page views.
We'll begin implementing these features by adding the following to the top of the
log_report.plxscript, just before the start of thewhileloop that parses the log file lines:my($begin_time, $end_time, $total_hits, $total_mb, $total_views);This establishes a number of scalar variables that will be visible throughout the script, and will be used to store the various categories of information we're interested in tracking.
Now, at the end of the
whileloop, we'll comment out that debugging print statement and add the new lines shown here in order to store those various pieces of data:# print join "\n", $host, $ident_user, $auth_user, $date, $time,# $time_zone, $method, $url, $protocol, $status,# $bytes, $referer, $agent, "\n";unless ($begin_time) {$begin_time = "$date:$time";}$end_time = "$date:$time";++$total_hits;$total_mb += ($bytes / (1024 * 1024));next if $url =~ /\.(gif|jpg|jpeg|png|xbm)$/i;# don't care about these for visit-tracking purposes++$total_views;&store_line($host, $date, $time, $url, $referer, $agent);}We stick the assignment to
$begin_timeinside anunlessblock that checks to see if the variable has been assigned already, so it only gets assigned when the first line of the log file is processed. The$end_timevariable is just overwritten with the current values of$dateand$timefor every line, such that we end up with the date and time of the last access when we're done parsing the log file.Adding one to
$total_hitseach time through the loop using theauto-increment operator (++) is easy enough to understand.$total_mbis assigned using the interesting+=operator, which does what you would probably guess it does: it takes whatever number is on the right and adds it to the contents of the variable on the left, storing the new sum in the variable. It is thus the equivalent of:$total_mb = $total_mb + ($bytes / (1024 * 1024));except it's a bit easier to write. Dividing
$bytesby the product of1024 * 1024simply converts that number to megabytes.The next line uses that handy condensed form of an
ifstatement:do something ifsomething else. In this case, it says to bail out and go to the next cycle through thewhileloop (which in this case means going to the next line in the log file) if the contents of$urlend in.gif,.jpg,.jpeg,.png, or.xbm. This reflects the fact that we're only interested in actual "page views" at this point, and don't care about the image files whose requests also end up in the log file. We could instead have used something like:next unless $url =~ /\.html?$/;which would skip to the next line from the log file unless the current line's
$urlended in.htmor.html, but this would skip requests for CGI scripts and for directories that return a default page such asindex.html. It probably makes sense to count those requests in$total_views.Next, now that we've gotten rid of those extraneous log file entries, it's time to add one to the contents of
$total_views. And finally, we invoke a subroutine called&store_linewith the arguments$host,$time,$url,$referer, and$agent. We'll be using that subroutine in an effort to generate statistics on something more interesting: the activities of the individual visitors to our site.The "Visit" Data Structure
Trying to track individual visitors via the entries in a web server's access log is something of an exercise in futility. With things like proxy servers and client-side caching getting in the way, the series of accesses that show up in the log from a particular hostname or IP address can give only an approximate picture of what individual visitors are doing. Multiple users sharing the same IP address can have their activity merged into what looks like a single, very active visitor. Conversely, a single visitor can show up in the logs via a different IP address on each request, defying efforts to abstract those requests into a meaningful "visit." A proxy server at a major ISP can cache the site's pages, then satisfy hundreds of requests that never get recorded in the server's logs.
Even so, it's hard not to wonder what a log file would reveal if we could pluck out the requests corresponding to specific hosts and string them together to see what patterns emerge. Many users still browse from individual host addresses without intervening proxy servers; for these users, at least, the resulting "visit" tracking provides a fascinating look at the paths being followed through the site. It's also interesting to see how many incoming requests are actually being generated by robot "spider" programs, and to study the behavior of those programs as they interact with the server. Finally, it's an interesting programming exercise to see how we can assemble and present information on these "visits."
As with the data structure we used to create the SprocketExpo exhibitor directory in Chapters and , we could really benefit in this case by taking advantage of Perl's support for multilevel data structures. A hash of hashes (that is, a hash whose values are themselves hash variables) would make the task of storing and accessing information on these visits significantly easier. As it is, though, we won't be learning how to use multilevel data structures for several more chapters. That's okay; we can fake it by using the conventional variables we've been using already, just as we did for the SprocketExpo example.
For the purposes of this script, we're going to define a "visit" as a series of one or more requests received from the same host, with no more than 15 minutes elapsing between one request and the next. If we get another request from the same host but more than 15 minutes has elapsed since the last one, we will treat the new request as the start of a new "visit," counting it separately in our statistics.
We may as well make that 15-minute visit timeout a configuration variable up at the top of the script and store it in seconds to make our computations easier:
my $expire_time = 900; # seconds of inactivity to consider a# "visit" ended (0 = forever)Notice how the comment tells us we can set the
$expire_timevariable to0to make the expiration time "forever." We'll see how this works in a minute.A number of other variables, visible throughout the script and declared with
mynear the beginning, will be used to store the information on individual visits:
$total_visits- This scalar will be incremented by one for each new visit processed. Besides being used in the script's report to tell us how many visits there were in all, this count will be used to generate a unique visit number for each visit.
%visit_num- This hash will have keys consisting of hostnames or IP addresses, and values consisting of the currently "working" visit number corresponding to that host.
All of the following hash variables will have keys consisting of the visit number described previously:
%host- Key is visit number, value is the hostname or IP address corresponding to that visit number.
%first_time- Key is visit number, value is the date and time of that visit's first access.
%last_time- Key is visit number, value is the date and time of that visit's last (that is to say, most recent) access.
%last_seconds- Key is visit number, value is the number of seconds returned by the
&get_secondssubroutine for the date and time of that visit's last access.
%referer- Key is visit number, value is the
HTTP_REFERERenvironment variable supplied for that visit's first access.
%agent- Key is visit number, value is the user-agent string supplied for that visit's first access.
We'll add all these new variables to the big
mydeclaration up at the top of the script:my($begin_time, $end_time, $total_hits, $total_mb, $total_views,$total_visits, %visit_num, %host, %first_time, %last_time,%last_seconds, %page_sequence, %referer, %agent);The &store_line Subroutine
We previously saw how the
&store_linesubroutine was being invoked to process each line from the log file that didn't represent an image request. Now let's skip down to the bottom of the script and see what that&store_linesubroutine actually does:# script proper ends. subroutines follow.sub store_line {# store one line's worth of visit datamy($host, $date, $time, $url, $referer, $agent) = @_;my $seconds = &get_seconds($date, $time);if ($visit_num{$host}) {# there is a visit currently "working" for this hostmy $visit_num = $visit_num{$host};my $elapsed = $seconds - $last_seconds{$visit_num};if (($expire_time) and ($elapsed > $expire_time)) {# this visit has expired, so start a new one&new_visit($host, $date, $time, $url, $seconds,$referer, $agent);} else {# this visit has not expired, so add to existing record&add_to_visit($host, $date, $time,$url, $seconds, $elapsed);}} else {# there is no visit currently "working" for this host&new_visit($host, $date, $time, $url, $seconds,$referer, $agent);}}Most of the Perl in this subroutine should look pretty familiar by now. In essence, this routine is functioning as a traffic cop, using the host and the time of this access to figure out if this request represents a new visit, or another request in a previously started visit.
First, it checks to see if a key exists for the current line's
$hostin the%visit_numhash. If there is, it means we've previously processed a request from this host, so the script checks to see if the currently working visit for this host has "expired." That is, it looks to see if the time difference between this host's last access and the current access is greater than the value stored in the$expire_timeconfiguration variable. If it is, it means enough time has gone by that this access needs to be considered the beginning of a new visit, and the script invokes the&new_visitsubroutine. If it isn't, the script invokes the&add_to_visitsubroutine instead. Finally, if there wasn't any key for the current$hostin the%visit_numhash, if means this host hasn't been seen before at all. Accordingly, the&new_visitsubroutine is invoked to create a new entry for it.Here's where we've implemented the feature of turning off visit expiration for cases where the
$expire_timeconfiguration variable has been set to0. We've done that by making the logical test that determines whether a visit has ended actually contain two logical tests, both of which must be true for the "true" branch to be invoked:if (($expire_time) and ($elapsed > $expire_time)) {This works because joining two logical tests with
andrequires both of them to be true for the expression as a whole to be true. If$expire_timeis set to0, which is a false value, the test can never return true.So, again, three subroutines are invoked from within this
&store_linesubroutine: the&get_secondssubroutine, which accepts as arguments the date and time strings from the current log file line and converts them to something called Unix seconds. That routine, and the date arithmetic it performs, is the subject of the next chapter. The&new_visitand&add_to_visitroutines, which handle the updating of the script's data structure, will be covered in Chapter 10.
Back to: Perl for Web Site Management
© 2001, O'Reilly & Associates, Inc.
webmaster@oreilly.com