SpamAssassin
SpamAssassin By Alan Schwartz
July 2004
Pages: 222

Table of Contents

Chapter 1: Introducing SpamAssassin
The SpamAssassin system is software for analyzing email messages, determining how likely they are to be spam, and reporting its conclusions. It is a rule-based system that compares different parts of email messages with a large set of rules. Each rule adds or removes points from a message's spam score. A message with a high enough score is reported to be spam.
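The scoring idea can be sketched in a few lines of Python. This is an illustration only, not SpamAssassin's implementation (which is written in Perl): the rule names, patterns, scores, and the 5.0 threshold are hypothetical, and treating a total at or above the threshold as spam is an assumption of the sketch.

```python
import re

# Hypothetical rules: (name, pattern, score). A negative score removes
# points from the total, a positive score adds them.
RULES = [
    ("DEMO_MAKE_MONEY", re.compile(r"make money fast", re.I), 2.5),
    ("DEMO_REMOVE_ME", re.compile(r"to be removed from (this|our) list", re.I), 1.5),
    ("DEMO_IS_REPLY", re.compile(r"^In-Reply-To:", re.I | re.M), -1.0),
]

THRESHOLD = 5.0  # assumed cutoff for this sketch


def score_message(text):
    """Return (total score, names of the rules that matched)."""
    hits = [(name, score) for name, pattern, score in RULES if pattern.search(text)]
    total = sum(score for _, score in hits)
    return total, [name for name, _ in hits]


def is_spam(text, threshold=THRESHOLD):
    """Report spam when the summed score reaches the threshold."""
    total, _ = score_message(text)
    return total >= threshold
```

Lowering the threshold makes the checker more aggressive; raising it reduces the chance that a legitimate message is tagged.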
SpamAssassin was a trademark of Deersoft, and Deersoft has been acquired by Network Associates. In this book, I won't write SpamAssassin™ each time I mention it because that would be distracting, but you should assume that the trademark symbol is there.
Many spam-checking systems are available. SpamAssassin has become popular for several reasons:
  • It uses a large number of different kinds of rules and weights them according to their diagnosticity. Rules that have been demonstrated to be more effective at discriminating spam from non-spam email are given higher weightings.
  • It is easy to tune the scores associated with each rule or to add new rules based on regular expressions.
  • SpamAssassin can adapt to each system's email environment, learning to recognize which senders are to be trusted and to identify new kinds of spam.
  • It can report spam to several different spam clearinghouses and can be configured to create spam traps—email addresses that are used only to forward spam to a clearinghouse.
  • It is free software, distributed under either the GNU General Public License or the Artistic License. Either license allows users to freely modify the software and redistribute their modifications under the same terms.
Example 1-1 shows a message that has been tagged as spam by SpamAssassin. Elements added by SpamAssassin appear in bold.
Example 1-1. A message tagged by SpamAssassin
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
How SpamAssassin Works
There are several ways that SpamAssassin makes up its mind about a message:
  • The message headers can be checked for consistency and adherence to Internet standards (e.g., is the date formatted properly?).
  • The headers and body can be checked for phrases or message elements commonly found in spam (e.g., "MAKE MONEY FAST" or instructions on how to be removed from future mailings)—in several languages.
  • The headers and body can be looked up in several online databases that track message checksums of verified spam messages.
  • The sending system's IP address can be looked up in several online lists of sites that have been used by spammers or are otherwise suspicious.
  • Specific addresses, hosts, or domains can be blacklisted or whitelisted. A whitelist can be automatically constructed based on the sender's past history of messages.
  • SpamAssassin can be trained to recognize the types of spam that you receive by learning from a set of messages that you consider spam and a set that you consider non-spam. (SpamAssassin and the spam-filtering community often refer to non-spam messages as ham.)
  • The sending system's IP address can be compared to the sender's domain name using the Sender Policy Framework (SPF) protocol (http://spf.pobox.com) to determine if that system is permitted to send messages from users at that domain. This feature requires SpamAssassin 3.0.
  • SpamAssassin can privilege senders who are willing to expend some extra computational power in the form of Hashcash (http://www.hashcash.org). Spammers cannot do these computations and still send out huge amounts of mail rapidly. This feature requires SpamAssassin 3.0.
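To see why the Hashcash item in the list above is costly for a bulk sender, here is a simplified proof-of-work sketch in Python. It follows the general Hashcash idea (find a string whose SHA-1 digest starts with a given number of zero bits), but the stamp layout `resource:counter` is a hypothetical simplification; real Hashcash stamps carry more fields, and this is not the code SpamAssassin uses to check them.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def mint(resource: str, bits: int = 12) -> str:
    """Search for a stamp for `resource`; the cost roughly doubles per extra bit."""
    for counter in count():
        stamp = f"{resource}:{counter}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp


def verify(stamp: str, resource: str, bits: int = 12) -> bool:
    """Checking a stamp takes one hash, however long minting took."""
    if not stamp.startswith(resource + ":"):
        return False
    return leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits
```

The asymmetry is the point: the sender pays for many hashes per recipient, while the receiver verifies with a single hash.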
Organization of SpamAssassin
At heart, SpamAssassin is a set of modules written in the Perl programming language, along with a Perl script that accepts a message on standard input and checks it using the modules. For higher-performance applications, SpamAssassin also includes a daemonized version of the spam-checker and a client program in C that can accept a message on standard input and check it with the daemon.
Most of SpamAssassin's behavior is controlled through a systemwide configuration file and a set of per-user configuration files. The per-user configuration can also be stored in an SQL database.
For a great deal more about Perl, check out Learning Perl, by Randal L. Schwartz and Tom Phoenix, or Programming Perl, by Larry Wall, Tom Christiansen, and Jon Orwant, both from O'Reilly.
Mailers and SpamAssassin
Although it's possible to run SpamAssassin manually on a single message, SpamAssassin becomes really useful when all incoming messages are scanned automatically. There are several ways that this can be done.
Figure 1-1 shows a typical mail transmission. The sending system connects to the recipient's mail transport agent (MTA) and transmits the message. If the message is destined for a user on the MTA's system, the MTA hands the message off to the local mail delivery agent (MDA), which is responsible for storing the message in a user's mailbox. Users may log into the system and read their mail directly from their mailboxes (as is typical on multiuser Unix systems), or, if the system runs the appropriate servers, users may download their mail using a mail client that supports POP (Post Office Protocol) or IMAP (Internet Message Access Protocol).
Figure 1-1: A typical mail transmission
SpamAssassin can be run in three fundamental places: at the MTA, at the MDA, and as a POP proxy. Each has advantages and disadvantages.
Some MTAs provide a way for incoming messages to be passed through a filter during the SMTP transaction; others can pass messages through a filter after the SMTP transaction is complete. Spam-checking is one kind of filtering that can be usefully performed at the MTA; virus-checking is another. In many cases, sophisticated filtering daemons have been developed for specific MTAs, and these daemons are capable of calling SpamAssassin to perform spam checks.
Because all email destined for users on the system must pass through the MTA, it is a natural place for centralized spam-checking. If you run a gateway MTA that delivers mail to several internal systems, you can perform spam-checking at the gateway MTA to limit the amount of spam that any internal server will receive.
The Politics of Scanning
If you're an ISP that provides email service, many of your users will want—perhaps even demand—spam-tagging or spam-filtering of their incoming email. Other users, however, may not want their email tagged or filtered, either because they don't get much spam, don't perceive the spam they receive to be a problem, or are concerned about the possibility of a real message being mistakenly tagged as spam.
Before you implement systemwide or sitewide spam-checking, consider carefully the needs of your users and your responsibilities toward them. At minimum, you must inform users (and would-be users) of any unconditional spam-checking you perform on their email. Better yet is to provide spam-tagging only for those users who opt to turn it on. Best of all is to enable each user to configure their own settings and threshold for how spam is recognized. This is doubly important if you not only tag messages for users but actually filter or block spam for them.
SpamAssassin is an excellent tool for distinguishing spam and non-spam email, but only if you've determined that your users want you to distinguish the two.
Chapter 2: SpamAssassin Basics
This chapter explains how to get and install SpamAssassin and its components, perform basic configuration, test the system, and start using it for spam-checking. It covers the basics of using SpamAssassin from the shell or from procmail, and discusses the setup of the daemonized version of the spam-checker. The configuration examples in this chapter provide only the basic functionality. The following chapters cover rule-tweaking, white- and blacklisting, and learning.
Prerequisites
SpamAssassin is written for a Unix or Unix-like environment that includes Perl Version 5, preferably 5.6.1 or later. Perl is now standard on most Unix systems, but if you don't have it, the source code for Perl can be downloaded at http://www.cpan.org.
SpamAssassin requires several Perl modules to be installed. If you install SpamAssassin using CPAN (the Comprehensive Perl Archive Network), as described in the next section, these modules will be automatically downloaded and installed as well. If you install SpamAssassin manually, you'll need to be sure that you also have up-to-date versions of the Perl modules ExtUtils::MakeMaker, File::Spec, Pod::Usage, HTML::Parser, Sys::Syslog, DB_File, Digest::SHA1, and Net::DNS. You may also want Net::Ident and IO::Socket::SSL if you plan to use the daemonized checker (spamd) and its client (spamc) and you will allow remote clients to access your daemon.
SpamAssassin can consult several spam checksum clearinghouses. A spam clearinghouse is a server (or a distributed network of servers) that gathers spam messages reported by thousands of users around the world and provides a mechanism for a client to check a new message to see if it matches a message in the clearinghouse. These clearinghouses are known as checksum-based clearinghouses because rather than transmit and store complete email messages, they work with cryptographic checksums of messages. A cryptographic checksum is a much smaller data string (typically no more than 256 bits) that is, for all practical purposes, unique to the message from which it is computed.
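The checksum idea can be illustrated with a short Python sketch. It is a toy under stated assumptions: SHA-256 is chosen only because it produces a 256-bit value, the whitespace normalization is my own, and the in-memory `Clearinghouse` class is a hypothetical stand-in for a network service. Real clearinghouses define their own checksum algorithms and protocols.

```python
import hashlib


def message_checksum(body: bytes) -> str:
    """Return a 256-bit checksum (as hex) of a whitespace-normalized body."""
    normalized = b" ".join(body.split())
    return hashlib.sha256(normalized).hexdigest()


class Clearinghouse:
    """Hypothetical in-memory stand-in for a checksum clearinghouse."""

    def __init__(self):
        self._known_spam = set()

    def report(self, body: bytes) -> None:
        # Only the checksum is stored, never the message itself.
        self._known_spam.add(message_checksum(body))

    def check(self, body: bytes) -> bool:
        return message_checksum(body) in self._known_spam
```

Because only checksums are exchanged, a client can ask whether a message has been reported without sending the message text anywhere.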
As of version 3.0, SpamAssassin can consult three clearinghouses: Vipul's Razor
Building SpamAssassin
The easiest way to download and install SpamAssassin is through CPAN. Here's what a CPAN-install of SpamAssassin looks like:
$ su
Password: XXXXXXX
# perl -MCPAN -e shell

cpan shell -- CPAN exploration and modules installation (v1.61)
ReadLine support enabled

cpan> o conf prerequisites_policy ask

  prerequisites_policy ask

cpan> install Mail::SpamAssassin
CPAN: Storable loaded ok
CPAN: LWP::UserAgent loaded ok
Fetching with LWP:
  ftp://ftp.perl.org/pub/CPAN/authors/01mailrc.txt.gz
...
Running install for module Mail::SpamAssassin
Running make for J/JM/JMASON/Mail-SpamAssassin-2.60.tar.gz
Fetching with LWP:
ftp://ftp.perl.org/pub/CPAN/authors/id/J/JM/JMASON/Mail-SpamAssassin-2.60.tar.gz
CPAN: Digest::MD5 loaded ok
Fetching with LWP:
ftp://ftp.perl.org/pub/CPAN/authors/id/J/JM/JMASON/CHECKSUMS
Checksum for /root/.cpan/sources/authors/id/J/JM/JMASON/Mail-SpamAssassin-2.60.tar.gz 
ok
Scanning cache /root/.cpan/build for sizes
Mail-SpamAssassin-2.60/
Mail-SpamAssassin-2.60/ninjabutton.png
...
Mail-SpamAssassin-2.60/sample-spam.txt

  CPAN.pm: Going to build J/JM/JMASON/Mail-SpamAssassin-2.60.tar.gz

What email address or URL should be used in the suspected-spam report
text for users who want more information on your filter installation?
(In particular, ISPs should change this to a local Postmaster contact)
default text: [the administrator of that system] postmaster@example.com


Checking if your kit is complete...
Looks good
Writing Makefile for Mail::SpamAssassin
Makefile written by ExtUtils::MakeMaker 6.03
/usr/bin/perl build/preprocessor -Mconditional -Mbytes -DPERL_VERSION=5.8.0 -Mvars -
DVERSION=2.60 -DPREFIX=/usr <lib/Mail/SpamAssassin/AutoWhitelist.pm >blib/lib/Mail/
SpamAssassin/AutoWhitelist.pm
...
gcc  -g -O2 spamd/spamc.c spamd/libspamc.c spamd/utils.c \
        -o spamd/spamc   -ldl 
...
Manifying blib/man3/Mail::SpamAssassin::PerMsgLearner.3pm
  /usr/bin/make  -- OK
Running make test
PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0,
'blib/lib', 'blib/arch')" t/*.t
t/basic_lint................ok
...
t/zz_cleanup................ok
All tests successful, 1 test skipped.
Files=40, Tests=301, 426 wallclock secs (238.53 cusr + 14.19 csys = 252.72 CPU)
  /usr/bin/make test -- OK
Running make install
Installing /usr/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin.pm
Installing /usr/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/PerMsgLearner.pm
...
Installing /usr/bin/spamc
Installing /usr/bin/spamd
Installing /usr/bin/sa-learn
Installing /usr/bin/spamassassin
Writing /usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/auto/Mail/
SpamAssassin/.packlist
Appending installation info to /usr/lib/perl5/5.8.0/i586-linux-thread-multi/
perllocal.pod
/usr/bin/perl "-MExtUtils::Command" -e mkpath /etc/mail/spamassassin
...
  /usr/bin/make install  -- OK

cpan> 
Invoking SpamAssassin with procmail
Running spamassassin from a shell is a handy way to test the system, but for daily use you'd like to have it automatically run on every incoming email message that's being delivered to your system's mailboxes. One easy way to do this is to have your system's MDA program filter all messages through SpamAssassin as part of the delivery process.
procmail is a mail-processing program that accepts messages on standard input and applies a set of rules or actions (a "recipe") for the disposition of the message. Because the default message disposition is "append to the user's mailbox," and because procmail is written to be very safe in its handling of messages, it makes an excellent MDA. Indeed, many Unix systems use the procmail program as their default local MDA. If procmail is available and isn't the system MDA, it's usually easy for users to configure the message-forwarding feature of the system's MTA to filter messages through procmail. In either environment, procmail can be a good place to pass messages through SpamAssassin. Figure 2-1 illustrates this configuration.
Figure 2-1: Invoking SpamAssassin with procmail
The easiest way to use SpamAssassin with procmail is to call it in the systemwide procmail recipe file, which is usually /etc/procmailrc. Example 2-6 shows a complete /etc/procmailrc.
Example 2-6. A complete /etc/procmailrc
DROPPRIVS=yes
PATH=/bin:/usr/bin:/usr/local/bin
SHELL=/bin/sh

# Spamassassin
:0fw
* <300000
|/usr/bin/spamassassin
In this example, the SpamAssassin recipe comprises the three lines beneath the comment # Spamassassin. The first line tells procmail that the message should be filtered (f) and that procmail should wait (w) for the filter's successful exit before considering the message filtered. The second line indicates that this recipe should be applied to messages less than 300,000 bytes in length and serves to prevent a lengthy SpamAssassin invocation on a long message that is unlikely to be spam. The third line directs procmail to pipe the message to
Using spamc/spamd
If you are filtering a lot of incoming mail, the processing time required to invoke a new spamassassin script (and start the Perl interpreter) for each message can become prohibitive. An alternative approach is to run the SpamAssassin daemon, spamd. spamd is started once at system boot and loads the SpamAssassin Perl modules to perform spam-checking. Instead of running the spamassassin script on each message, you pipe messages to the spamc program. spamc is a lightweight client, written in C and compiled to an executable, that simply takes messages, relays them to spamd, and returns the results.
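From a program's point of view, spamc behaves like any other filter: the message goes in on standard input and the result comes back on standard output. The Python sketch below pipes a message through such a filter. It assumes that spamc is on the PATH and that spamd is already running; the function name `filter_through` and its parameters are hypothetical.

```python
import subprocess


def filter_through(message: bytes, command=("spamc",), timeout=60) -> bytes:
    """Pipe a raw message to a filter command and return what it prints."""
    result = subprocess.run(
        list(command),
        input=message,            # the message is written to the filter's stdin
        stdout=subprocess.PIPE,   # the (possibly tagged) message is read back
        timeout=timeout,
        check=True,               # raise CalledProcessError on a nonzero exit
    )
    return result.stdout
```

To try the function on a system without SpamAssassin, pass `command=("cat",)`, which returns the message unchanged.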
spamd has several important command-line arguments that control its operation. Once it's properly set up, however, using spamc is simple.
By default, spamd is installed in /usr/bin. It is typically started by root from a system boot script but can also be started by root from the shell for testing. The simplest invocation of spamd is:
/usr/bin/spamd --daemonize --pidfile /var/run/spamd.pid
The --daemonize command-line option directs spamd to operate as a daemon in the background. The --pidfile command-line option specifies the file to which spamd will write its process ID number. This option is important because spamd must be signaled with a HUP signal to its process ID whenever the systemwide SpamAssassin configuration is changed (you'll find an example later in this chapter).
When spamd receives a connection, it forks a child process to handle the connection. Typically, the child process reads a request to perform spam-checking from the client (including the account name of the user making the request, the message to check, and other data), performs the requested check, returns the (possibly tagged) message back to the client, and exits.
Several options are used with spamd in many environments. The most common are detailed in the following sections.
Invoking SpamAssassin in a Perl Script
Because the heart of the SpamAssassin system is a set of Perl modules, it's fairly straightforward to call SpamAssassin from a Perl script to perform spam-checking of an email message. The Mail::SpamAssassin module (and its submodules) provides an object-oriented interface to the spam-checking and message-tagging logic. Many MTA-based filtering systems are written in Perl, and use the SpamAssassin modules to perform spam-checking on messages without invoking a separate program.
Example 2-8 and Example 2-9 show Perl scripts that work like simple versions of the spamassassin script, accepting a message on standard input, checking it, and producing the (possibly rewritten) message on standard output. Example 2-8 illustrates the process for SpamAssassin 2.63.
Example 2-8. Using Mail::SpamAssassin 2.63 in Perl
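#!/usr/bin/perl

use strict;
use warnings;
use Mail::SpamAssassin;

# Read the whole message from standard input.
my @lines = <STDIN>;
# Wrap the lines in a message object (SpamAssassin 2.6x interface).
my $mail = Mail::SpamAssassin::NoMailAudit->new(data => \@lines);
# Create the spam checker and run its tests against the message.
my $spamtest = Mail::SpamAssassin->new( );
my $status = $spamtest->check($mail);
# Rewrite the message only if it was judged to be spam, then print it.
$status->rewrite_mail( ) if $status->is_spam( );
print $status->get_full_message_as_text( );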
Before any SpamAssassin objects can be created, the script must use the Mail::SpamAssassin module. The message is read from standard input and saved to the array @lines. Then, the new( ) method of Mail::SpamAssassin::NoMailAudit is called, with a reference to the array provided as the value of the data parameter. This method returns a Mail::SpamAssassin::Message object encapsulating the email message, which I call $mail in the example.
A new Mail::SpamAssassin object called $spamtest is then created, and its check( ) method is called, passing in the message as an argument. check( ) returns a Mail::SpamAssassin::PerMsgStatus object, called $status in the script, that contains a copy of the message as well as the results of the spam check. In particular, the is_spam( ) method of $status returns 1 if the message was judged to be spam, and 0 otherwise.
SpamAssassin and the End User
The discussion so far in this chapter has focused on getting SpamAssassin to analyze incoming mail and mark spam by modifying the message before delivery. For end users who read their email on the server or download it with a POP or IMAP client, the final step is to take action on messages. Messages processed through SpamAssassin fall into one of the categories described in the next four sections.
True negatives are messages that both you and SpamAssassin agree are non-spam, or ham, messages. SpamAssassin does not modify these messages much. It adds an X-Spam-Status header beginning with the word "No," and an X-Spam-Checker-Version header giving the version of SpamAssassin in use. These messages look just as they should to a user's mail reader.
True positives are messages that both you and SpamAssassin agree are spam. These messages are tagged by SpamAssassin. At minimum, SpamAssassin adds X-Spam-Level, X-Spam-Status, and X-Spam-Flag headers. If rewrite_subject is on, SpamAssassin also changes the subject of the message to begin with *****SPAM*****. Example 2-10 shows these headers.
Example 2-10. Headers added to spam by SpamAssassin
Subject: *****SPAM***** Live your dream life!!                MPNWSTU
X-Spam-Status: Yes, hits=12.9 required=5.0 tests=CLICK_BELOW,
        FORGED_MUA_EUDORA,FROM_ENDS_IN_NUMS,MISSING_OUTLOOK_NAME,
        MSGID_OUTLOOK_INVALID,MSGID_SPAM_ZEROES,NORMAL_HTTP_TO_IP,
        SUBJ_HAS_SPACES,SUBJ_HAS_UNIQ_ID autolearn=no version=2.60
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 2.60 (1.212-2003-09-23-exp)
X-Spam-Level: ************
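A mail client or script can act on these headers without knowing anything else about SpamAssassin. The Python sketch below reads the X-Spam-Status header in the layout shown in Example 2-10. The function name and the returned keys are hypothetical, and the parsing assumes the `hits=`, `required=`, and `tests=` fields appear as in that example; other SpamAssassin versions may format the header differently.

```python
import re
from email import message_from_string


def parse_spam_status(raw_message: str) -> dict:
    """Extract the verdict, score, threshold, and test names from a message."""
    value = message_from_string(raw_message)["X-Spam-Status"]
    if value is None:
        return {}
    # "Yes" or "No" comes first, followed by key=value fields.
    verdict, _, rest = value.partition(",")
    # The tests= list may be folded across lines; rejoin it at the commas.
    rest = re.sub(r",\s+", ",", rest)
    fields = dict(tok.split("=", 1) for tok in rest.split() if "=" in tok)
    return {
        "is_spam": verdict.strip().lower() == "yes",
        "hits": float(fields["hits"]) if "hits" in fields else None,
        "required": float(fields["required"]) if "required" in fields else None,
        "tests": [t for t in fields.get("tests", "").split(",") if t],
    }
```

Applied to the headers in Example 2-10, the function reports a spam verdict with 12.9 hits against a required 5.0, and a list of nine test names.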
Most people will want either to complain about spam to the spammer's ISP or to discard it. In the former case, simply being able to quickly identify spam messages on sight is usually sufficient, and the modified
Chapter 3: SpamAssassin Rules
SpamAssassin performs its spam-checking by applying a series of tests to an email message. Most tests examine the message headers or body for patterns that are suggestive of spam; others perform Internet lookups against network-based blacklists of IP addresses or checksums of spam messages. Each positive test yields a score, and the sum of the scores is the total spam score of the message.
This chapter describes the SpamAssassin pattern-based and network-based tests: how they are written and scored, and how you can modify the score of a built-in test or write your own custom tests. This chapter also covers whitelist and blacklist rules, which can override SpamAssassin's usual determination of whether or not a message is spam.
The tests described in this chapter are all static tests—they don't change over time as SpamAssassin analyzes messages. Chapter 4 explains learning tests, which use information from messages seen in the past to improve decisions in the future.
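As background for the network-based tests, here is a Python sketch of how a DNS-based blacklist is commonly queried: the octets of the IP address are reversed and prepended to the list's zone, and the address counts as listed if that name resolves. The zone `bl.example.org` and the function names are hypothetical; SpamAssassin performs its own lookups in Perl, and this sketch only illustrates the general convention.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a blacklist zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str, resolve=socket.gethostbyname) -> bool:
    """Treat an address as listed if the query name resolves at all."""
    try:
        resolve(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False  # no such name: the address is not on the list
    return True
```

For example, `dnsbl_query_name("192.0.2.10", "bl.example.org")` builds the name `10.2.0.192.bl.example.org`. The `resolve` parameter exists so the lookup can be replaced with a stub when testing without network access.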
The Anatomy of a Test
Most SpamAssassin tests consist of the same basic components:
  • A test name, consisting of up to 22 uppercase letters, numbers, or underscores. Names that begin T_ refer to rules in testing.
  • A more verbose description of the test, which is used in the reports generated by SpamAssassin. Typically, descriptions are up to 50 characters long.
  • An indication of where to look. Tests can be applied to the message headers only, the message body only, uniform resource identifiers (URIs) in the message body, or the complete message. When testing the message body, the body can be analyzed in its raw state, after MIME-decoding the text, or after MIME-decoding, stripping of HTML, and removal of all line breaks.
  • A description of what to look for. Tests can specify a header to check for existence, a Perl regular expression pattern to match, a DNS-based blacklist to query, or a SpamAssassin function to evaluate.
  • Optional test flags that control the conditions under which the test is applied or other exceptional features.
  • A score or scores for the test. Tests can have a single score that is always used, or they can have separate scores for messages that test positive under each of four conditions:
    • When the Bayesian classifier and network tests are not in use
    • When the Bayesian classifier is not in use, but network tests are
    • When the Bayesian classifier is in use, but network tests are not
    • When the Bayesian classifier and network tests are both in use
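The choice among the four scores can be expressed as a small lookup. In this Python sketch the scores are assumed to be listed in the same order as the four conditions above; the function name and the sample values are hypothetical.

```python
def pick_score(scores, bayes_enabled: bool, network_enabled: bool) -> float:
    """Select the score that applies under the current configuration.

    `scores` holds either one value (always used) or four values ordered as:
    no Bayes/no network, no Bayes/network, Bayes/no network, Bayes/network.
    """
    if len(scores) == 1:
        return scores[0]
    if len(scores) != 4:
        raise ValueError("expected one score or four scores")
    index = (2 if bayes_enabled else 0) + (1 if network_enabled else 0)
    return scores[index]
```

With the hypothetical scores `[1.0, 2.0, 3.0, 4.0]`, a site running the Bayesian classifier without network tests would use 3.0.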
Modifying the Score of a Test
You may find some tests more indicative of spam than SpamAssassin does by default. If SpamAssassin already provides a test that you value but doesn't assign it a high enough score (higher scores are more indicative of spam), you can easily modify the score of the test. Similarly, if one of SpamAssassin's tests is giving you too many false positives, you can reduce its score or disable the test entirely by setting its score to 0. SpamAssassin will not attempt to run a test with a score of 0.
Make systemwide score adjustments in the systemwide configuration file, typically /etc/mail/spamassassin/local.cf. To modify the score of a test, you must first determine its test name, either by reading the ruleset files or by examining the spam report from a message. To get a spam report on a message that doesn't score high enough for SpamAssassin to generate a report, you can use spamassassin --test-mode, as described in Chapter 2.
To change the score of a test, simply add a new score directive to the configuration file, like this:
score HTML_WIN_OPEN 2
This will enable the HTML_WIN_OPEN test and add two points to the score of messages that test positive on this test.
You can use the same approach to modify the descriptions of tests by adding new describe directives. For example, the default description for the HOT_NASTY test is "Possible porn - Hot, Nasty, Wild, Young". To shorten that to "Possible porn", add this directive to the configuration file:
describe HOT_NASTY Possible porn
Users can use the score directive in per-user preference files to change the scoring of a test for an individual user. To do so, a user edits the .spamassassin/user_prefs file in her home directory and adds score directives. This approach to customizing scores is the simplest, but it requires users to have accounts on the system and access to files in their accounts.
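The effect of layering score directives can be modeled in a few lines of Python. This is a sketch, not SpamAssassin's configuration parser: it handles only the single-score form of the directive, the default scores shown are hypothetical, and it assumes that per-user directives are applied after the systemwide ones.

```python
def apply_score_directives(scores: dict, config_text: str) -> dict:
    """Return a copy of `scores` updated by `score NAME VALUE` lines."""
    updated = dict(scores)
    for raw_line in config_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and padding
        parts = line.split()
        if len(parts) == 3 and parts[0] == "score":
            updated[parts[1]] = float(parts[2])
    return updated


def active_tests(scores: dict) -> set:
    """Tests with a score of 0 are treated as disabled."""
    return {name for name, score in scores.items() if score != 0}


# Hypothetical defaults, then systemwide and per-user adjustments.
defaults = {"HTML_WIN_OPEN": 0.0, "HOT_NASTY": 1.2}
system = apply_score_directives(defaults, "score HTML_WIN_OPEN 2\n")
user = apply_score_directives(system, "score HOT_NASTY 0  # too many false positives\n")
```

Here the systemwide file enables HTML_WIN_OPEN, and the user's preference file then disables HOT_NASTY for that user only.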
Writing Your Own Tests
When none of the existing tests does what you'd like, you can write a custom test of your own. Custom tests are just like the distributed tests, except that you install them in the systemwide configuration file or in a per-user preference file.
Users can write their own tests in their per-user preference files, but for security reasons these tests will not be used when spamd is performing spam-checking, unless the allow_user_rules option is set to 1 in the systemwide configuration. However, setting this option is dangerous because spamd runs as root and a malicious or inexperienced user can construct a custom test that causes the system to hang or to invoke an arbitrary command as nobody or as spamd's uid. Users who want their own tests on a system that uses spamd should reinvoke the spamassassin script on their incoming mail (probably in their .procmailrc). Chapter 2 illustrates this approach.
The first step in writing a custom test is to choose a symbolic test name and write a meaningful test description with the describe directive. For now, do not begin any of your names with a double underscore (__). Test names that begin with two underscores are not listed in test hit reports, nor are they added to the spam score on their own; such names are used for creating sets of subtests that should be applied in combination. SpamAssassin calls these combinations meta tests, and they are discussed later in this section.
Second, determine what part of the message you wish to test. Table 3-1 summarizes the directives used to test different portions of a message. Each is covered in greater detail in the following sections.
Table 3-1: Message portions and associated test directives
Message part
The Built-in Tests
SpamAssassin is distributed with over 700 test rules defined for English-language spam. SpamAssassin 2.63 includes another 2,900 rules for spam in other languages. (Language support in SpamAssassin 3.0 is currently available only for French and German, but language support is likely to increase as SpamAssassin gets into wider release.) Reading the rules distributed with SpamAssassin is an excellent way to learn to write your own rules.
SpamAssassin's rules are defined in a set of files typically installed in /usr/share/spamassassin:
10_misc.cf
The 10_misc.cf file defines templates for the spam report that SpamAssassin attaches to spam messages, definitions of headers that SpamAssassin adds to messages, and default settings for the most common configuration options. This file is described in more detail later in this chapter.
10_plugins.cf (SpamAssassin 3.0)
This file provides a convenient place to load SpamAssassin plug-in modules with the loadplugin directive. Plug-ins extend SpamAssassin's features.
20_fake_helo_tests.cf
This file defines a set of rules used to test for forged HELO hostnames. This file is also described in more detail later in this chapter.
20_body_tests.cf
This file defines most tests against message bodies, spam clearinghouses, message languages, and message locales. It's described in more detail later.
20_dnsbl_tests.cf
This file defines tests against many different DNS blacklists, using the check_rbl( )
Whitelists and Blacklists
Although SpamAssassin generally does a good job of avoiding false positives, you may find that some mail that you want to receive contains enough spamlike characteristics that SpamAssassin regularly tags it as spam. You may want to be sure that SpamAssassin will never mistake email from an important user, client, vendor, or other sender for spam. You may even have users who don't like spam-filtering. SpamAssassin allows you to set up systemwide or user-specific lists of senders whose mail should not be considered spam, and (systemwide) lists of users who don't want their email filtered. Such lists are called whitelists.
On the other hand, you may regularly receive unwanted mail from a particular sender that doesn't get tagged reliably by SpamAssassin. You may know ahead of time that you don't want to receive mail from certain organizations or senders. SpamAssassin also allows you to set up systemwide or user-specific lists of senders whose mail should be tagged as spam. Such lists are called blacklists.
This chapter discusses how to set up whitelists and blacklists. It begins by examining the SpamAssassin directives for systemwide whitelisting and blacklisting, and then explores two different ways to manage user-specific lists. A related feature, autowhitelists, is covered in Chapter 4.
SpamAssassin whitelists reduce the spam scores of messages when the sender or recipient appears on the whitelist. Whitelists are most commonly used to ensure that messages from important senders are not marked as spam, but they can also be used to change the spam threshold for recipients or enable recipients to effectively opt out of spam-tagging.

Section 3.5.1.1: Whitelisting senders

Use the whitelist_from directive to whitelist a sender's address. The sender's address is the address that appears in the Resent-From header, if that header exists, or in any of the headers: From, Envelope-Sender, Resent-Sender
Chapter 4: SpamAssassin as a Learning System
SpamAssassin provides many rules that have proven useful in distinguishing spam from non-spam messages, and these rules are updated at each new release. But SpamAssassin provides more than just generic rules; it has the capability of learning about your email environment and adapting its detection behavior to maximize its accuracy in that environment.
SpamAssassin includes two adaptive systems that can be used in concert: autowhitelisting and Bayesian filtering. This chapter discusses the principles, configuration, and operation of both systems.
Autowhitelisting
SpamAssassin's autowhitelisting algorithm learns each sender's history of sending spam or non-spam messages and modifies the spam score of their subsequent mailings on the basis of this history. The primary goal of autowhitelisting is to reduce false positives—to make it less likely that a non-spam message will be tagged as spam—by assuming that people who send you non-spam messages will not begin to spam you. It can also reduce false negatives if a spammer consistently sends email from the same email address, but this happens infrequently enough that autowhitelisting rarely has a significant effect on false negatives.
When autowhitelisting is enabled, SpamAssassin maintains a database keyed on message senders' email addresses and the IP addresses of their nearest untrusted relay (if any). Each time a message from a given sender is received, the message's spam score is added to the sender's total score in the database, and a count of the number of messages received from that sender is updated.
The average sender score—the total score divided by the number of messages received—is used to modify the spam score of new messages from that sender. Specifically, the difference between the average score and the new message's score is multiplied by a configurable factor, and the result is added to the new message's spam score. The effect is that when the new message has a higher spam score than average, its spam score is adjusted downward; when the new message has a lower spam score than average, its spam score is adjusted upward.
As you might expect from this explanation, the autowhitelist tests are the last ones performed by SpamAssassin. All other tests must be run first so that the message has its most accurate spam score before that score is compared to the sender's historical average. In addition, the score added to the sender's history is the new message's spam score as it stood before the autowhitelist modifier was applied, so the adjustment itself does not feed back into the average.
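The arithmetic described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not SpamAssassin's implementation: the AutoWhitelist class, its in-memory dictionaries, and the factor value of 0.5 are all chosen here for illustration, and the history is assumed to record each message's score before adjustment.

```python
from collections import defaultdict


class AutoWhitelist:
    """Toy model of the autowhitelist adjustment described in the text."""

    def __init__(self, factor=0.5):
        self.factor = factor             # the configurable factor (value assumed)
        self.total = defaultdict(float)  # sender key -> sum of past scores
        self.count = defaultdict(int)    # sender key -> number of messages seen

    def adjust(self, sender, relay_ip, score):
        """Return the adjusted score, then record the unadjusted one."""
        key = (sender, relay_ip)
        adjusted = score
        if self.count[key]:
            average = self.total[key] / self.count[key]
            # (average - score) is negative when the message scores above
            # average, so high scores are pulled down and low scores pushed up.
            adjusted = score + (average - score) * self.factor
        # The history stores the score as it was before the adjustment.
        self.total[key] += score
        self.count[key] += 1
        return adjusted
```

With a factor of 0.5, a sender whose two earlier messages each scored 1.0 and who then sends a message scoring 9.0 would see that message adjusted to 5.0, halfway between the new score and the historical average. A first message from an unknown sender passes through unchanged because there is no history to compare against.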
Bayesian Filtering
SpamAssassin's Bayesian classifier learns to distinguish the features that characterize spam from those that characterize non-spam in the messages that you receive. Properly trained, the Bayesian classifier can reduce both false positives and false negatives.
Bayesian filtering is based on Bayes' Theorem, a statement of probability theory propounded by the Reverend Thomas Bayes in 1763. Bayes' Theorem is important in many fields where classifying data is essential, including computer vision, psychophysics, and diagnostic decision-making in health care. SpamAssassin's implementation is mostly based on the work of Paul Graham (archived at http://www.paulgraham.com) and Gary Robinson (http://www.garyrobinson.net).
Conceptually, Bayes' Theorem states that the probability of some event (such as a message being spam) given a test result (such as matching a spam-checking rule) depends on the baseline probability of the event before the test result is known and on the discriminating power of the test. A corollary is that the discriminating power of a test can be measured by comparing the probability of the event given a known test result to the baseline probability before the result is known. The more the test result can increase (or decrease) the probability from baseline, the stronger the test.
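In symbols, the statement above can be written as follows. The notation is introduced here only for illustration: S stands for the event "the message is spam" and T for the test result "the rule matched."

```latex
P(S \mid T) = \frac{P(T \mid S)\,P(S)}{P(T \mid S)\,P(S) + P(T \mid \neg S)\,P(\neg S)}
```

Here P(S) is the baseline probability before the test result is known, and the discriminating power of the test shows up in how different P(T | S) and P(T | ¬S) are: the further apart they lie, the further P(S | T) moves from the baseline.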
Actually, SpamAssassin's "Bayesian" system doesn't really compute the baseline probability or frequency of spam versus non-spam messages—which some have argued means it's not strictly Bayesian at all. Instead it assumes values that seem reasonable and useful.
In the context of spam-checking, a Bayesian approach amounts to developing potential rules and asking how much each rule, if matched, should change the system's perception of the likelihood that a message is spam. Very strong rules come in two forms. Some are patterns that only occur in spam (and never in non-spam), thus yielding a high probability that a message that matches one of the patterns is spam. Others are patterns that only occur in non-spam (and never in spam), thus yielding a low probability that a message that matches the pattern is spam. Weaker rules—patterns found in both spam and non-spam messages but with different frequencies—result in less extreme probabilities.
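The difference between strong and weak rules can be seen by plugging numbers into Bayes' Theorem. The Python sketch below is an illustration only: the match frequencies are invented, the spam_probability function is hypothetical, and the baseline of 0.5 is simply an assumed value rather than a statement of what SpamAssassin uses.

```python
def spam_probability(p_match_given_spam, p_match_given_ham, p_spam=0.5):
    """Bayes' Theorem: probability a message is spam given that a pattern matched."""
    p_ham = 1.0 - p_spam
    numerator = p_match_given_spam * p_spam
    denominator = numerator + p_match_given_ham * p_ham
    if denominator == 0:
        return p_spam  # pattern never observed anywhere: stay at the baseline
    return numerator / denominator


# Hypothetical match frequencies: (fraction of spam, fraction of non-spam).
patterns = {
    "spam-only pattern": (0.20, 0.00),
    "non-spam-only pattern": (0.00, 0.30),
    "weak pattern": (0.30, 0.20),
}

for name, (in_spam, in_ham) in patterns.items():
    print(f"{name}: {spam_probability(in_spam, in_ham):.2f}")
```

Under these assumed numbers, the pattern seen only in spam yields a probability of 1.00, the pattern seen only in non-spam yields 0.00, and the pattern seen in both yields 0.60, only modestly above the 0.5 baseline.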
Chapter 5: Integrating SpamAssassin with sendmail
sendmail has long been the most widely used mail transport agent in the world. It was routing mail before the Internet existed as such and continues to form the backbone of many of the largest mail servers on the Net today. This chapter explains how to integrate SpamAssassin into a sendmail-based mail server to perform spam-checking for local recipients or to create a spam-checking mail gateway.
sendmail is a complex piece of software and can have several security implications for systems on which it runs. You should always run the most up-to-date version of sendmail and keep track of new bug reports and security advisories. This chapter assumes that you are running the latest release of sendmail—Version 8.12—and does not cover how to securely install, configure, or operate sendmail itself. For that information, see the sendmail documentation and the book sendmail by Bryan Costales and Eric Allman (O'Reilly).
Spam-Checking at Delivery
The easiest way to add SpamAssassin to a sendmail system is to configure sendmail to use procmail as its local delivery agent, and to add a procmail recipe for spam-tagging to /etc/procmailrc. The advantages of this approach are:
  • It's very easy to set up.
  • You can run spamd, and the procmail recipe can use spamc for faster spam-checking.
  • User preference files, autowhitelists, and Bayesian databases can be used.
There are also some disadvantages:
  • sendmail must complete the SMTP transaction and accept an email message for local delivery before spam-checking takes place. Accordingly, you can't save bandwidth or mailbox space by rejecting spam during the SMTP transaction.
  • sendmail only runs the local delivery agent for email destined for a local recipient. You cannot create a spam-checking gateway with this approach.
To configure sendmail to use procmail as its local delivery agent, add the following line to your sendmail.mc file (before the MAILER(`local') line) and regenerate sendmail.cf from it:
FEATURE(`local_procmail',`/path/to/procmail')dnl
When you restart sendmail, it will use procmail instead of the system's default local MDA (e.g., /bin/mail) for mail delivery.
Next, configure procmail to invoke SpamAssassin. If you want to invoke SpamAssassin on behalf of every user, do so by editing the /etc/procmailrc file. Example 5-1 shows an /etc/procmailrc that invokes SpamAssassin.
Example 5-1. A complete /etc/procmailrc
DROPPRIVS=yes
PATH=/bin:/usr/bin:/usr/local/bin
SHELL=/bin/sh

# Spamassassin
:0fw
* < 300000
|/usr/bin/spamassassin
If you run spamd, replace the call to spamassassin in procmailrc with a call to
Spam-Checking During SMTP
If you want to refuse spam before it reaches your recipients, or set up a spam-checking gateway to an internal email server, you need a way to perform spam-checking during the SMTP transaction. If a message is found to be spam, you may want to refuse it and end the SMTP session, or accept it and add headers that users can use in their mail client filters. sendmail provides a general-purpose filtering interface, called milter, for use during the SMTP transaction.
In sendmail's parlance, milter refers to several things. Milter is an application programming interface (API) for writing filters for sendmail, and a protocol for communication between sendmail and a filter. A milter is also a filter program written using this API that listens for connections from a sendmail process and defines functions to call at different points of the SMTP transaction to accept, reject, discard, temporarily refuse, or modify a message. The milter library, libmilter, provides most of the code required to set up a milter and manage the work of calling your filtering functions during an SMTP transaction.
A milter can provide functions that sendmail will call at the following points in an SMTP transaction:
  • When a mail client connects to sendmail
  • After the SMTP HELO or EHLO commands
  • After the SMTP MAIL FROM command
  • After the SMTP RCPT TO command
  • After each message header is transmitted during the DATA step
  • After all message headers are transmitted
  • After each piece of the message body is transmitted
  • At the end of the DATA step, after the entire message has been transmitted
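The order of these callback points, and the way a filter's verdict can cut a transaction short, can be modeled with a short Python simulation. This is not the libmilter API, which is a C library; the Action enum, the SubjectFilter class, its method names, and the run_transaction driver are all hypothetical names invented for this sketch.

```python
from enum import Enum


class Action(Enum):
    """Possible verdicts, mirroring the outcomes listed in the text."""
    CONTINUE = "continue"
    ACCEPT = "accept"
    REJECT = "reject"
    DISCARD = "discard"
    TEMPFAIL = "tempfail"


class SubjectFilter:
    """Hypothetical filter that rejects on a marker word in the Subject header."""

    def __init__(self, marker):
        self.marker = marker.lower()
        self.seen = []  # names of the hooks called so far, in order

    def _note(self, hook):
        self.seen.append(hook)
        return Action.CONTINUE

    def connect(self, host):
        return self._note("connect")

    def helo(self, name):
        return self._note("helo")

    def mail_from(self, sender):
        return self._note("mail_from")

    def rcpt_to(self, recipient):
        return self._note("rcpt_to")

    def header(self, name, value):
        self._note("header")
        if name.lower() == "subject" and self.marker in value.lower():
            return Action.REJECT
        return Action.CONTINUE

    def end_of_headers(self):
        return self._note("end_of_headers")

    def body(self, chunk):
        return self._note("body")

    def end_of_message(self):
        self._note("end_of_message")
        return Action.ACCEPT


def run_transaction(filt, host, helo, sender, recipients, headers, chunks):
    """Call each hook in SMTP order; stop at the first verdict other than CONTINUE."""
    steps = [(filt.connect, (host,)), (filt.helo, (helo,)), (filt.mail_from, (sender,))]
    steps += [(filt.rcpt_to, (r,)) for r in recipients]
    steps += [(filt.header, (n, v)) for n, v in headers]
    steps.append((filt.end_of_headers, ()))
    steps += [(filt.body, (c,)) for c in chunks]
    steps.append((filt.end_of_message, ()))
    for hook, args in steps:
        verdict = hook(*args)
        if verdict is not Action.CONTINUE:
            return verdict
    return Action.ACCEPT
```

In this model, a message whose Subject header contains the marker is rejected as soon as that header is seen, so the later hooks for the end of headers, the body, and the end of the message never run. A filter that needs the whole message, as a spam-checker generally does, would instead defer its decision to the final hook.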
Building a Spam-Checking Gateway
By combining sendmail, MIMEDefang, and SpamAssassin, you can build a complete spam-checking gateway. Such systems are increasingly popular as external mail exchangers, receiving messages from the Internet and relaying them to internal mail servers that don't perform their own spam-checking (either for performance reasons or because they run operating systems that don't provide cost-effective antispam solutions). I assume that users relay outgoing mail through an internal mail server, rather than through the spam-checking gateway. Figure 5-2