If you’ve got passable Perl skills and the desire to control your own destiny, you can use our code and build a simple page-tag analyzer.
The first hack in our “Build Your Own Web Measurement Application” series describes how the data will be collected. We’ll be using a JavaScript page tag [Hack #28] and, to the best of our knowledge, ours is the only freely available tag-based reporting application today.
There are two components in our data collection strategy. The first is a piece of JavaScript code that must be inserted into every page on your web site. When the visitor’s web browser renders the page, the script is executed, causing a request for an image to be made to the web server. For now, the image URL contains basic information about the page and the referrer, although we shall see how to augment it in [Hack #90] .
The second component is a program that runs on the server. It writes the page and referrer information into a web server logfile, and then returns the image the browser is waiting for, which is an invisible one-pixel transparent image.
The logfile we build will look something like this:
1104772080 192.168.17.32 /index.html?from=google http://www.google.com/search?q=widgets 192.168.17.32.85261104772101338
1104772091 192.168.17.32 /products.html http://www.example.com/index.html?from=google 192.168.17.32.85261104772101338
The first field on each line is the time of the request in Unix time (seconds since 1/1/1970). The second field is the client’s IP address, which the server knows. The third is the URL of the page; the fourth is the URL of the referring page (the page that linked to this one); and the fifth is the visitor’s cookie (in this case, generated by Apache’s mod_usertrack module).
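As a rough illustration of the five fields just described, here is a minimal JavaScript sketch that splits one logfile line into a record. It assumes the fields are separated by tabs, and the field names (`time`, `client`, `url`, `referrer`, `cookie`) and the function name `parseLogLine` are hypothetical labels chosen for this example.

```javascript
// Hypothetical labels for the five columns of the logfile.
const FIELDS = ['time', 'client', 'url', 'referrer', 'cookie'];

// Split one logfile line into an object. Assumes tab-separated fields,
// so an empty referrer still occupies its own column.
function parseLogLine(line) {
  const parts = line.split('\t');
  if (parts.length !== FIELDS.length) {
    return null; // malformed line: skip it rather than guess
  }
  const record = {};
  FIELDS.forEach((name, i) => { record[name] = parts[i]; });
  record.time = Number(record.time); // Unix time, in seconds
  return record;
}

// The first sample line from the logfile above, joined with tabs.
const sample = [
  '1104772080',
  '192.168.17.32',
  '/index.html?from=google',
  'http://www.google.com/search?q=widgets',
  '192.168.17.32.85261104772101338',
].join('\t');

const rec = parseLogLine(sample);
console.log(rec.url);
// JavaScript dates use milliseconds, so multiply the Unix time by 1000.
console.log(new Date(rec.time * 1000).toISOString());
```

Returning `null` for a line with the wrong number of fields is one possible design choice; a stricter analyzer might prefer to report such lines.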
It might occur to you that all this information is already present in the web server’s own logfile. Why do we want to produce a second logfile to duplicate the data? In fact, there are several advantages to this approach.
The web server that is running the data collection program need not be the same web server that is hosting the web site. Sometimes the web site may already be hosted on a server from which you can’t access the logfiles, or the web site may span more than one server.
Our logfile will record only some of the data. That will make it quicker to analyze. In particular, we ignore hits [Hack #1], which may be important for technical analysis, but are not useful for analyzing visitor behavior.
Spiders and robots do not execute JavaScript, so we automatically exclude them from the logfile [Hack #23].
When the page is rendered a second time, the JavaScript will be re-executed, so we automatically bust the cache [Hack #24].
We shall see in [Hack #90] that our approach makes it easy to add additional data fields that are not normally recorded in the web server logfile.
The disadvantages of this approach are similar to those observed with other client-side page tags. For example, it can be more difficult to set up, it won’t measure visitors who have disabled JavaScript, and you can’t go back and analyze historical data. But we believe the benefits associated with accuracy and the ability to gather customized data outweigh the problems for many people.
The JavaScript part of the code is very simple:
<script>
document.write('<img src="http://www.yourserver.com/cgi-bin/readtag.pl?url='+escape(document.location)+'&ref='+escape(document.referrer)+'">');
</script>
This script has to be inserted at the top of the <BODY> element of every page you want tracked, and the www.yourserver.com reference needs to be changed to the real location of this script on your servers. Some web servers are set up to insert a template at the top of every page, in which case you can do this once and it will appear on all of your pages. But usually you will have to edit each page.
When the browser renders the page, it will execute the script and insert into the page an HTML image tag like this one:
<img src="http://www.yourserver.com/cgi-bin/readtag.pl?url=http%3A//www.example.com/index.html%3Ffrom%3Dgoogle&ref=http%3A//www.google.com/search%3Fq%3Dwidgets">
This is just the URL of the current page and the referring page, slightly encoded to avoid ampersands and equals signs looking as if they belonged to the image URL.
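To make the encoding step concrete, the following sketch builds the image URL in the same way the tag does and then decodes it again, roughly as a query-string parser on the server would. The function names `buildTagUrl` and `readTagUrl` are hypothetical, and the URLs are the example values used above.

```javascript
// Build the image URL the way the page tag does. escape() is the legacy
// function used in the tag; it leaves "/" alone but encodes ":", "?",
// "=" and "&", so they cannot be mistaken for parts of the image URL.
function buildTagUrl(base, pageUrl, referrer) {
  return base + '?url=' + escape(pageUrl) + '&ref=' + escape(referrer);
}

// Recover the two values from the image URL by parsing its query string.
function readTagUrl(tagUrl) {
  const params = new URL(tagUrl).searchParams;
  return { url: params.get('url'), ref: params.get('ref') };
}

const tagUrl = buildTagUrl(
  'http://www.yourserver.com/cgi-bin/readtag.pl',
  'http://www.example.com/index.html?from=google',
  'http://www.google.com/search?q=widgets'
);
console.log(tagUrl);

const decoded = readTagUrl(tagUrl);
console.log(decoded.url); // the original page URL, decoded again
console.log(decoded.ref); // the original referrer, decoded again
```

Note that `escape()` is deprecated in current JavaScript. `encodeURIComponent()` is the usual modern replacement, although it also encodes the slashes, so its output would differ from the image tag shown above.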
When the browser requests that image from the server, the server will record the current page and the referrer in our web server logfile and send a one-pixel transparent GIF back to the browser.
The image tag calls a script called readtag.pl that is listening for requests. To deploy readtag.pl, you should adjust your web server configuration as follows:
The server must execute programs in the /cgi-bin/ directory.
The server must be able to write to the logfile chosen below.
You should plan to cookie your visitors [Hack #15]. The program will work without this, but the visitor tracking will be less accurate. If you are using Apache, the mod_usertrack module will produce cookies.
You should set up a P3P policy on the server using compact policy headers [Hack #27]. This is essential if the program is running on a different server from the web site; otherwise, Internet Explorer will reject the cookies. The minimum CP headers you’ll need to set to make this code function properly are:
P3P: policyref="http://www.yourserver.com/w3c/p3p.xml", CP="COR NID NOI OUR COM NAV STA"
If you are using Apache, you should use the mod_perl module. This will reduce the load on the server by starting Perl only once, instead of every time this program is run.
If you have any questions about these requirements, we recommend you consult with your web system administrator and explain what you’re trying to accomplish. Making changes to your Apache configuration is not without risk and should be attempted only by an experienced professional.
All of the following code should be saved in a file called readtag.pl in your web server’s /cgi-bin directory. The #!perl line may need to be adjusted to point to the location of Perl on your machine, for example #!/usr/bin/perl.
#!perl -w
# Remember to change the line above to your Perl location.
use strict;

# Declare the location of the logfile. The CGI program needs to be given
# permission to write to this file.
my $logfile = '/var/log/apache/page.log';

# The name of the cookie you are using. 'Apache' is the default for
# mod_usertrack cookies.
my $cookie_name = 'Apache';

# We shall use the standard CGI module. This does all the work of extracting
# the parameters from the query string and decoding them.
use CGI;
my $cgi = new CGI;
my $url = $cgi->param('url');  # Get the url= and ref= parameters
my $ref = $cgi->param('ref');

# Strip the server name off the front of the URL (we don't want to repeat it
# on every line of the file). Guard against a missing url= parameter.
$url =~ s!^https?://[^/]+!! if defined($url);

# As long as we've got a non-empty URL and a (possibly empty) referrer,
# write a line in the logfile.
if ($url && defined($ref)) {
    # Look up the current time, the client name, and the cookie.
    my $time = time();
    my $client = $cgi->remote_host();
    my $cookie_val = $cookie_name ? $cgi->cookie($cookie_name) : "";
    if (!defined($cookie_val)) { $cookie_val = ""; }

    # We need to open the logfile.
    # We also need to lock it, to make sure that we're not writing two
    # requests at the same time.
    # If we can't open it or can't lock it, write a diagnostic message to
    # STDERR, which is the server's error log.
    use Fcntl qw/:flock/;  # Import the definition of LOCK_EX
    unless (open (LF, ">>", "$logfile") && flock(LF, LOCK_EX)) {
        my $lt = localtime;
        my $progname = $0 || 'readtag.pl';
        print STDERR "[$lt] $progname: Can't open logfile\n";
    }
    # Everything worked, so jump to the end of the logfile (this is necessary
    # in case something was written between the time we opened it and the
    # time we locked it), and write the line.
    else {
        seek(LF, 0, 2);
        print LF "$time\t$client\t$url\t$ref\t$cookie_val\n";
        close LF;
    }
}

# Finally, send a one-by-one pixel transparent GIF image back to the browser
# (the long list of numbers just happens to be that GIF, byte by byte).
print "Content-Type: image/gif\n\n";
print 'GIF89a';
print v1.0.1.0.145.0.0.0.0.0.255.255.255.255.255.255.0.0.0.33.249.4.1.0.0.2.0.44.0.0.0.0.1.0.1.0.0.2.2.84.1.0.59;
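For readers more comfortable with JavaScript than Perl, the sketch below mirrors the core logic of readtag.pl: strip the server name from the page URL, and build one tab-separated logfile line if there is a non-empty URL and a (possibly empty) referrer. It is only an illustration of the logic, not a replacement for the Perl program; the function names `stripServer` and `buildLogLine` are hypothetical, and the file locking and image response are left out.

```javascript
// Strip the scheme and server name from the front of a URL,
// as the Perl substitution s!^https?://[^/]+!! does.
function stripServer(url) {
  return url.replace(/^https?:\/\/[^/]+/, '');
}

// Build one logfile line, or return null when nothing should be logged.
// A missing cookie is written as an empty field, as in the Perl code.
function buildLogLine({ url, ref, time, client, cookie }) {
  const path = stripServer(url || '');
  if (!path || ref === undefined || ref === null) {
    return null; // no usable URL, or no ref= parameter at all
  }
  return [time, client, path, ref, cookie || ''].join('\t') + '\n';
}

const line = buildLogLine({
  url: 'http://www.example.com/products.html',
  ref: 'http://www.example.com/index.html?from=google',
  time: 1104772091,
  client: '192.168.17.32',
  cookie: '192.168.17.32.85261104772101338',
});
console.log(JSON.stringify(line)); // JSON.stringify makes the tabs visible
```

One consequence of this logic is that a page URL with nothing after the server name (no trailing slash) becomes empty once the server name is stripped, so no line is written for it.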
That’s it!
Provided you’ve copied the code correctly, and set permissions and P3P policy correctly on your web server, all you need to do is add the following code near the top of the <BODY> element of each of your web pages:
<script>
document.write('<img src="http://www.yourserver.com/cgi-bin/readtag.pl?url='+escape(document.location)+'&ref='+escape(document.referrer)+'">');
</script>
The scripts handle everything else. As soon as you turn them on and traffic starts flowing on your site, the scripts start to generate a logfile like this:
1106000655 204.210.27.229 /discussion_list.asp http://www.webanalyticsdemystified.com/free_kpi_worksheet.asp 204.210.27.229.319011106000542572
1106000657 204.210.27.229 / http://www.webanalyticsdemystified.com/discussion_list.asp 204.210.27.229.319011106000542572
1106001299 207.111.202.223 / 207.111.202.223.319061106001281430
1106001303 207.111.202.223 /free_preview.asp http://www.webanalyticsdemystified.com/ 207.111.202.223.319061106001281430
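As a small taste of what can be done with such a logfile, here is a hedged sketch that counts page views per visitor, using the cookie as the visitor identifier. It assumes tab-separated fields (so the empty referrer in the third sample line still occupies a column); the function name `pageViewsByVisitor` and the fallback to the client address when no cookie is present are choices made for this example only.

```javascript
// The four sample lines above, rebuilt with tab separators.
const site = 'http://www.webanalyticsdemystified.com';
const logText = [
  ['1106000655', '204.210.27.229', '/discussion_list.asp',
    site + '/free_kpi_worksheet.asp', '204.210.27.229.319011106000542572'],
  ['1106000657', '204.210.27.229', '/',
    site + '/discussion_list.asp', '204.210.27.229.319011106000542572'],
  ['1106001299', '207.111.202.223', '/',
    '', '207.111.202.223.319061106001281430'],
  ['1106001303', '207.111.202.223', '/free_preview.asp',
    site + '/', '207.111.202.223.319061106001281430'],
].map((fields) => fields.join('\t')).join('\n') + '\n';

// Count page views per visitor. The cookie identifies the visitor;
// when it is empty we fall back to the client address (an assumption).
function pageViewsByVisitor(text) {
  const counts = new Map();
  for (const line of text.split('\n')) {
    if (!line) continue;               // skip the trailing blank line
    const fields = line.split('\t');
    if (fields.length !== 5) continue; // skip malformed lines
    const visitor = fields[4] || fields[1];
    counts.set(visitor, (counts.get(visitor) || 0) + 1);
  }
  return counts;
}

const views = pageViewsByVisitor(logText);
for (const [visitor, count] of views) {
  console.log(visitor + ': ' + count + ' page views');
}
```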
You’ll learn more about how to add variables to the script and how to generate reports based on the script in subsequent hacks.
—Dr. Stephen Turner and Eric T. Peterson