O'Reilly Hacks


APACHE HACK

Stop Email Harvesting Robots
You're fed up with the amount of spam you've been receiving, and you'd like to reduce the chance that spidering software will harvest your email address from your webpages. Alternatively, you run a community site, and you'd like a quick and easy way of ensuring that your users' email addresses are protected from the same fate. You want a server-side solution, not one involving address obfuscation, JavaScript document.write() tricks, or similar workarounds that will eventually be defeated.

Contributed by:
Morbus Iff
[03/14/03]

Prerequisites

  • A default installation of Apache and the ability to modify httpd.conf.
  • Access to create or modify an .htaccess file (optional).
  • The Rewrite module (mod_rewrite) enabled (optional).

The solution involves bringing together an understanding of how web robots work, as well as what information gets sent back and forth during a connection to your website. Here are some magical sentences to get you up to speed:

  • Robots are automated pieces of software that scan webpages for information.
  • When a search engine indexes (or "spiders") your site, it's using a robot.
  • Bad robots exist that scan webpages for email addresses to use for spam.
  • Each time a page is requested, a User-Agent HTTP header is sent.
  • Apache can block page access by User-Agent, IP, or hostname.

The conclusion is simple: block bad robots that identify themselves with a telltale User-Agent string. It's not a 100% perfect solution; nothing will stop a bad robot from identifying itself as Internet Explorer, Mozilla, or Opera, but you can still make a decent-sized dent in the number of harvesting requests.
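
To see why this works, here's roughly what a harvesting robot sends with each request (the hostname and version number here are made up for illustration); the User-Agent line is the only part Apache needs to match against:

{{{
   GET /index.html HTTP/1.1
   Host: www.example.com
   User-Agent: EmailWolf 1.00
}}}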

Say we know that "EmailWolf" is the name of a User-Agent that harvests email addresses. To block this User-Agent from accessing our site, we could use the following configuration in our httpd.conf file:

{{{
   # if the current access is from a User-Agent of "EmailWolf"
   # then set an environment variable called "bad_bots".
   SetEnvIfNoCase User-Agent "^EmailWolf" bad_bots

   # now, in our root directory, allow all access UNLESS
   # the environment variable "bad_bots" has been set (which
   # would indicate that "EmailWolf" is the requesting User-Agent).
   <Directory "/usr/local/httpd/htdocs/">
      Order Allow,Deny
      Allow from all
      Deny from env=bad_bots
   </Directory>
}}}
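
In practice you'll want to catch more than one harvester. Each additional SetEnvIfNoCase line sets the same bad_bots variable, so the <Directory> block above stays exactly as it is. The agent names below (other than EmailWolf) are hypothetical stand-ins; substitute whatever shows up on the lists you collect:

{{{
   # flag several harvesters at once; these extra names are
   # placeholders, so replace them with entries from your own lists.
   SetEnvIfNoCase User-Agent "^EmailWolf"    bad_bots
   SetEnvIfNoCase User-Agent "^EmailHoover"  bad_bots
   SetEnvIfNoCase User-Agent "^AddressMiner" bad_bots
}}}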

If we were using .htaccess files, we could accomplish the same thing with this configuration, saved as an .htaccess file in the /usr/local/httpd/htdocs/ directory:

{{{
   # flag the bad User-Agent, just as in httpd.conf above,
   SetEnvIfNoCase User-Agent "^EmailWolf" bad_bots
   # and deny it access to this directory.
   Deny from env=bad_bots
}}}

Finally, if you've got mod_rewrite enabled, you can use the following, which redirects the bad robot back to its own machine. If you're feeling particularly feisty, you could instead forward the bot to a page full of fake email addresses, log the spammer's IP address, and so on:

{{{
   RewriteEngine On
   # for .htm or .html requests from EmailWolf,
   # bounce the robot back to its own machine.
   RewriteCond %{REQUEST_FILENAME} html?$
   RewriteCond %{HTTP_USER_AGENT} ^EmailWolf
   RewriteRule ^.*$ http://localhost/ [R]
}}}
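
As a sketch of the "page full of fake emails" variation: the rules below send the harvester to a poison page instead of localhost. The /poison.html page is hypothetical (you'd create it yourself and stuff it with bogus addresses), and the second RewriteCond keeps the rule from redirecting the poison page to itself:

{{{
   RewriteEngine On
   # if the User-Agent is EmailWolf and we're not already on
   # the poison page, send the robot there instead.
   RewriteCond %{HTTP_USER_AGENT} ^EmailWolf
   RewriteCond %{REQUEST_URI} !^/poison\.html$
   RewriteRule ^.*$ /poison.html [R,L]
}}}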

There are a large number of User-Agents that could be considered "bad," and more are discovered every day. You'll find lists of them at some of the URLs in the "See also" section, and you can start collecting a list of your own by creating a honey-pot (see below).

Again, be forewarned: this solution will only work until robots start mimicking the User-Agents of well-known browsers (or until you start blocking legitimate User-Agents, like wget or libwww-perl). A smarter solution is to set up a honey-pot, whose whole purpose is to catch a robot in the act of being "bad".

The quickest way to start collecting data on the bad guys is to create or modify your site's robots.txt file (see the robotstxt.org URL for more information). We're going to "Disallow" access to a nonexistent directory:

{{{
   User-agent: *
   Disallow: /honey-pot/
}}}

In the above example, you're specifically telling robots NOT to enter the honey-pot directory. You can safely assume that any robot that does enter it is ignoring the standard and should be deemed suspicious. From there, it's a simple matter of using a Perl script to build a list of "Deny from" rules based on the offending IP addresses:

{{{
   #!/usr/bin/perl -w
   use strict;

   # your Apache access log location.
   my $logfile = "/var/log/httpd/access_log";

   # the request path to flag as suspicious. "cmd.exe" catches
   # Code Red probes; substitute "honey-pot" to match the trap above.
   my $fake_dir = "cmd.exe";

   # store IPs.
   my %ips;

   # open the logfile.
   open(LOG, "<$logfile") or die $!;

   # and loop through each line.
   while (<LOG>) {

      # skip over lines we're not interested in.
      next unless /GET/; next unless /$fake_dir/;

      # save the ip address.
      $ips{$1}++ if /^([^-]+) -/;
   }

   # and print out the data.
   foreach ( sort keys %ips ) {
      print "Deny from $_\n";
   }

   # close logfile and
   # exit the program.
   close(LOG); exit;
}}}

The above example, as written, will produce a sorted list of rules blocking the IP addresses of machines infected with the Code Red worm, which ran rampant on unpatched Windows servers running IIS a couple of years ago. It then becomes a simple matter of copying these rules into your httpd.conf or .htaccess file.
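
The output is just a series of lines like the following (the addresses here are made-up documentation addresses); paste them into the <Directory> block from earlier, after the Allow from all line:

{{{
   Deny from 192.0.2.13
   Deny from 198.51.100.7
}}}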

If you have your access_log configured to record the User-Agent string, you can also extract the offending User-Agents from that data and add them to the "bad_bots" rules we started this hack with.
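
Here's a minimal sketch of that logging setup, assuming Apache's standard "combined" log format (which records the User-Agent in its last field) and a hypothetical honeypot_log file. The conditional CustomLog keeps honey-pot hits in a file of their own, which makes the extraction trivial:

{{{
   # the stock "combined" format records the User-Agent as its last field.
   LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

   # flag any request that touches the honey-pot directory...
   SetEnvIf Request_URI "^/honey-pot/" honey_pot

   # ...and log those requests to their own file.
   CustomLog /var/log/httpd/honeypot_log combined env=honey_pot
}}}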

See also:

  • http://www.robotstxt.org/ - the Web Robots Pages, with more on robots.txt and the robot exclusion standard
