Chapter 4. SpamAssassin as a Learning System
SpamAssassin provides many rules that have proven useful in distinguishing spam from non-spam messages, and these rules are updated at each new release. But SpamAssassin provides more than just generic rules; it has the capability of learning about your email environment and adapting its detection behavior to maximize its accuracy in that environment.
SpamAssassin includes two adaptive systems that can be used in concert: autowhitelisting and Bayesian filtering. This chapter discusses the principles, configuration, and operation of both systems.
Autowhitelisting
SpamAssassin’s autowhitelisting algorithm learns each sender’s history of sending spam or non-spam messages and modifies the spam score of their subsequent mailings on the basis of this history. The primary goal of autowhitelisting is to reduce false positives—to make it less likely that a non-spam message will be tagged as spam—by assuming that people who send you non-spam messages will not begin to spam you. It can also reduce false negatives if a spammer consistently sends email from the same email address, but this happens infrequently enough that autowhitelisting rarely has a significant effect on false negatives.
Principles
When autowhitelisting is enabled, SpamAssassin maintains a database keyed on message senders’ email addresses and the IP addresses of their nearest untrusted relay (if any). Each time a message from a given sender is received, the message’s spam score is added to the sender’s total score in the database, and a count of the number of messages received from that sender is updated.
The average sender score—the total score divided by the number of messages received—is used to modify the spam score of new messages from that sender. Specifically, the difference between the average score and the new message’s score is multiplied by a configurable factor, and the result is added to the new message’s spam score. The effect is that when the new message has a higher spam score than average, its spam score is adjusted downward; when the new message has a lower spam score than average, its spam score is adjusted upward.
As you might expect from this explanation, the autowhitelist tests are the last ones SpamAssassin performs. All other tests must run first so that the message has its most accurate spam score before that score is compared to the sender’s historical average. The sender’s history is then updated with the message’s score as it stood before the autowhitelist modifier was applied, not with the adjusted score.
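The adjustment described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not SpamAssassin's own code: the class name `SenderHistory`, its methods, and the sample scores are assumptions made for the example.

```python
class SenderHistory:
    """Running total and count of pre-autowhitelist scores for one sender."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def mean(self):
        # No history yet means there is no average to compare against.
        return self.total / self.count if self.count else None

    def score(self, raw_score, factor=0.5):
        """Return the adjusted score, then record raw_score in the history."""
        mean = self.mean()
        if mean is None:
            adjusted = raw_score
        else:
            # (average - new score) * factor is added to the new score.
            adjusted = raw_score + (mean - raw_score) * factor
        # The unadjusted score is what feeds the sender's history.
        self.total += raw_score
        self.count += 1
        return adjusted


history = SenderHistory()
for raw in [2, 1, 1, 0, 2, 6]:  # illustrative scores from one sender
    print(raw, round(history.score(raw, factor=0.5), 2))
```

With a factor of 0.5, the final message in this sequence (raw score 6, prior average 1.2) is pulled halfway toward the sender's average.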
Configuration
The most important decisions to make in autowhitelisting are how much weight SpamAssassin should put on a sender’s history of sending spam or non-spam messages and how much weight it should put on the spam score of the message it is checking.
Use the auto_whitelist_factor directive to set the multiplier that is applied to the difference between a message’s spam score and the sender’s historical average score. It can range from 0 to 1. The default factor is 0.5, which causes the final spam score to be halfway between the message’s spam score and the sender’s average score.
To put more weight on the historical average, increase the auto_whitelist_factor. When the auto_whitelist_factor is set to 1, the historical average alone will be the new message’s spam score (recall, however, that the score before autowhitelisting is performed is fed back into the system and becomes part of the new historical average).
To put less weight on the historical average, decrease the auto_whitelist_factor. When the auto_whitelist_factor is set to 0, the historical average is ignored, and the current message’s spam score will not be modified based on the sender’s past messages.
Table 4-1 illustrates the impact of several different settings for auto_whitelist_factor. Each row of the table represents a new message from the same sender. Table columns show the spam score of each message before applying an autowhitelist modifier, the sender’s historical average score, and the spam score after applying an autowhitelist modifier. In this example, the sender sends several non-spam messages and then sends a message that looks like spam to SpamAssassin (a false positive). As you can see, with autowhitelisting using factors of 0.5, 0.75, or 1, the message will not reach the usual spam threshold of 5 because of the sender’s history of non-spam messages. Without autowhitelisting (i.e., with a factor of 0), the message receives a score of 6.
Message number | Message score (before autowhitelist) | Sender average score | Score after autowhitelist, AWF 0 | AWF .5 | AWF .75 | AWF 1
1 | 2 | (none) | 2 | 2 | 2 | 2
2 | 1 | 2 | 1 | 1.5 | 1.75 | 2
3 | 1 | 1.5 | 1 | 1.25 | 1.375 | 1.5
4 | 0 | 1.33 | 0 | 0.67 | 1.00 | 1.33
5 | 2 | 1.0 | 2 | 1.5 | 1.25 | 1.0
6 | 6 | 1.2 | 6 | 3.6 | 2.4 | 1.2
SpamAssassin stores its autowhitelist data in database files.
SpamAssassin lets Perl’s AnyDBM
module choose which database format will be used, based on which
system libraries are available. In SpamAssassin 3.0, you can control
this choice by setting the
auto_whitelist_db_modules
option to a
space-separated list of Perl database modules to be tried in order;
the first module that loads successfully will be used. For example,
the default module order is specified like
this:
auto_whitelist_db_modules DB_File GDBM_File NDBM_File SDBM_File
How you configure autowhitelisting also depends on whether you want each user to have his own whitelist database, or whether you want to use one database in common across all users.
Configuring per-user autowhitelists
By default, SpamAssassin maintains a separate autowhitelist for each user on the system. SpamAssassin stores the autowhitelist database for a user in the auto-whitelist file in the .spamassassin subdirectory of each user’s home directory. SpamAssassin uses one of several database formats for this file, depending on what database libraries are available on the system; the Berkeley DB format is chosen when it’s available.
SpamAssassin 3.0 can also store autowhitelists in an SQL database, which is useful when users don’t have accounts on the mail server. To store addresses in SQL, you must install the DBI Perl module and an appropriate driver module for your SQL server. Common choices are DBD-mysql (for the MySQL server), DBD-Pg (for the PostgreSQL server), and DBD-ODBC (for connection to an ODBC-compliant server).
You should create a database and a user with privileges to access it. You must then create a table in the database to store the user autowhitelist. The SpamAssassin source code includes schemas for MySQL and PostgreSQL tables in the sql subdirectory. Here is the MySQL schema:
CREATE TABLE awl (
  username varchar(100) NOT NULL default '',
  email varchar(200) NOT NULL default '',
  ip varchar(10) NOT NULL default '',
  count int(11) default '0',
  totscore float default '0',
  PRIMARY KEY (username,email,ip)
) TYPE=MyISAM;
Each row in this table specifies an autowhitelist entry for a single sender for an individual SpamAssassin user. SpamAssassin uses the columns to store the following information:
-
username
Stores the username or email address of the user (the latter is more useful in virtual hosting environments).
-
email
Stores the email address of a sender whose messages’ spam scores are being tracked.
-
ip
Stores the IP address of the sender.
-
count
Stores the total number of messages received from the sender.
-
totscore
Stores the total spam score of messages received from the sender.
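To make the role of the count and totscore columns concrete, here is a minimal Python sketch. It uses an in-memory SQLite database as a stand-in for the MySQL or PostgreSQL server, and the SQL statements and function names are illustrative assumptions, not the queries SpamAssassin itself issues.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real SQL server
conn.execute(
    "CREATE TABLE awl ("
    " username TEXT NOT NULL DEFAULT '', email TEXT NOT NULL DEFAULT '',"
    " ip TEXT NOT NULL DEFAULT '', count INTEGER DEFAULT 0,"
    " totscore REAL DEFAULT 0, PRIMARY KEY (username, email, ip))"
)


def record_score(conn, username, email, ip, score):
    """Add one message's score to the sender's row, creating it if needed."""
    cur = conn.execute(
        "UPDATE awl SET count = count + 1, totscore = totscore + ?"
        " WHERE username = ? AND email = ? AND ip = ?",
        (score, username, email, ip),
    )
    if cur.rowcount == 0:  # no existing row for this user/sender/IP
        conn.execute(
            "INSERT INTO awl (username, email, ip, count, totscore)"
            " VALUES (?, ?, ?, 1, ?)",
            (username, email, ip, score),
        )


def average_score(conn, username, email, ip):
    """Return totscore / count for the sender, or None if no history exists."""
    row = conn.execute(
        "SELECT count, totscore FROM awl"
        " WHERE username = ? AND email = ? AND ip = ?",
        (username, email, ip),
    ).fetchone()
    return row[1] / row[0] if row and row[0] else None
```

The average that the autowhitelist compares against is never stored directly; it is derived from the two columns whenever it is needed.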
To configure SQL support for autowhitelists, set the following configuration parameters in your systemwide configuration file (local.cf ):
-
auto_whitelist_factory Mail::SpamAssassin::SQLBasedAddrList
Configures SpamAssassin to use SQL-based autowhitelists instead of file-based autowhitelists.
-
user_awl_dsn DSN
Defines the data source name for the SQL database, telling spamd how it will connect to the database server. A typical DSN for the Perl DBI module is written like this:
DBI:databasetype:databasename:hostname:port
For example, to use a MySQL database named saawl running on a database server on the SpamAssassin host, the DSN would read:
DBI:mysql:saawl:localhost:3306
If the server were running PostgreSQL, the DSN would read:
dbi:Pg:dbname=saawl;host=localhost;port=5432;
-
user_awl_sql_username username
Defines the username that will be used to connect to the database server. This user must have permission to modify the data in the table (including inserting and deleting rows).
-
user_awl_sql_password password
Defines the password associated with the username that will be used to connect to the server.
-
user_awl_sql_table tablename
Defines the name of the table that contains autowhitelist data. The default tablename is awl.
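As a convenience, the five directives above can be generated together so that none is forgotten. The following Python sketch simply renders them as text for local.cf; the function name and the sample username and password are hypothetical placeholders.

```python
def render_awl_sql_config(dsn, username, password, table="awl"):
    """Return local.cf lines that enable SQL-based autowhitelists."""
    lines = [
        "auto_whitelist_factory Mail::SpamAssassin::SQLBasedAddrList",
        f"user_awl_dsn {dsn}",
        f"user_awl_sql_username {username}",
        f"user_awl_sql_password {password}",
        f"user_awl_sql_table {table}",
    ]
    return "\n".join(lines) + "\n"


# "sa_user" and "change-me" are placeholders, not defaults.
print(render_awl_sql_config("DBI:mysql:saawl:localhost:3306", "sa_user", "change-me"))
```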
Configuring a system-wide autowhitelist
It is often desirable to maintain a
single autowhitelist for all users of a system. When users
don’t have home directories, such an approach is not
just desirable but may be necessary if autowhitelisting is to be
used. You can configure a systemwide autowhitelist by setting the
auto_whitelist_path
directive to the full path of
the autowhitelist database file. Set
auto_whitelist_path
in the systemwide
configuration file. For example, to set up a systemwide autowhitelist
in the file
/etc/mail/spamassassin/auto-whitelist, use the
following directive:
auto_whitelist_path /etc/mail/spamassassin/auto-whitelist
If SpamAssassin encounters this directive, it checks to be sure the
database file exists. If the file does not exist, SpamAssassin
attempts to create it. You may not want to give SpamAssassin write
access to the directory you specify. One way around that is to create
the file as root, change its ownership to the
SpamAssassin user, and set the mode to allow read/write access, all
before you add the auto_whitelist_path
to your
configuration file.
However you create it, the systemwide
autowhitelist database file should be readable and writable by the
user running SpamAssassin. Depending on your configuration,
SpamAssassin may be running as root, as one of
several users on the system, or as a default unprivileged user such
as nobody
. If you let SpamAssassin create the
systemwide autowhitelist database file, you can use the
auto_whitelist_file_mode
directive to specify the
file’s mode. It defaults to 0700 but may need to be
set to 0770 or 0777 depending on your configuration, when multiple
users must access the
file.
Warning
Using a systemwide autowhitelist with mode 0777 (or 0770 and an inappropriate group) will enable a curious local user to learn the email addresses of message senders and their average spam scores or to modify those scores. A malicious user could modify the database to give legitimate senders a false history of spamming. In general, file modes other than 0700 should be avoided.
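One way to guard against the problem described in this warning is to audit the file's permissions periodically. The Python sketch below reports whether any group or other permission bits are set on a file; the function name is an assumption, and it assumes a POSIX-style filesystem.

```python
import os
import stat


def risky_permissions(path):
    """Return which of "group" and "other" have any access to path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    problems = []
    if mode & stat.S_IRWXG:  # any group read/write/execute bit
        problems.append("group")
    if mode & stat.S_IRWXO:  # any world read/write/execute bit
        problems.append("other")
    return problems


# Example use (the path is the one from the directive shown earlier):
# print(risky_permissions("/etc/mail/spamassassin/auto-whitelist"))
```

An empty list corresponds to a mode such as 0700; anything else deserves a second look at who belongs to the file's group.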
Using an Autowhitelist
Once the autowhitelisting system is
configured, you must instruct SpamAssassin to use it. In SpamAssassin
2.63, if you invoke SpamAssassin with the
spamassassin
script, add the
--auto-whitelist
option to direct the script to
consult your autowhitelist. If you invoke SpamAssassin with the
spamc
client, you should start
spamd
(the daemon) with the
--auto-whitelist
option to direct it to consult
user autowhitelists.
SpamAssassin 3.0 contains no
--auto-whitelist
command-line options. Instead,
autowhitelists are always used when the
use_auto_whitelist
configuration option is set in
a user’s (or a systemwide) configuration file.
You can use the
spamassassin
script to manipulate the contents of
your autowhitelist. The following command-line options to
spamassassin
operate on your
autowhitelist:
-
--add-addr-to-whitelist=emailaddress
Adds emailaddress to the autowhitelist with an initial score of -100. SpamAssassin will forget any past history associated with the address.
-
--add-addr-to-blacklist=emailaddress
Adds emailaddress to the autowhitelist with an initial score of 100. SpamAssassin will forget any past history associated with the address.
-
--remove-addr-from-whitelist=emailaddress
Removes emailaddress from the autowhitelist. SpamAssassin will forget any past history associated with the address.
-
--add-to-whitelist
When you pipe an email message to spamassassin --add-to-whitelist, SpamAssassin adds all email addresses found in the To, From, Cc, Reply-To, Sender, Errors-To, and Mail-Followup-To headers or in the body of the message to the autowhitelist with initial scores of -100. SpamAssassin will forget any past history associated with these addresses.
-
--add-to-blacklist
When you pipe an email message to spamassassin --add-to-blacklist, SpamAssassin adds all email addresses found in the To, From, Cc, Reply-To, Sender, Errors-To, and Mail-Followup-To headers or in the body of the message to the autowhitelist with initial scores of 100. SpamAssassin will forget any past history associated with these addresses. Because this behavior will probably result in the blacklisting of your own email address, this option is usually useless.
-
--remove-from-whitelist
When you pipe an email message to spamassassin --remove-from-whitelist, SpamAssassin removes all email addresses found in the To, From, Cc, Reply-To, Sender, Errors-To, and Mail-Followup-To headers or in the body of the message from the autowhitelist and forgets any past history associated with these addresses.
Warning
Be careful with --add-to-blacklist. A malicious spammer could send you HTML email with friendly addresses (including your own) embedded in invisible mailto: tags. Piping this message to spamassassin --add-to-blacklist causes SpamAssassin to add all of those addresses to the autowhitelist as likely spammers! Using --add-addr-to-blacklist with individual email addresses is safer.
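Before piping a message to spamassassin --add-to-blacklist, it can help to preview which addresses the message contains. The following Python sketch is a rough approximation: the regular expression is deliberately simple, it only inspects single-part messages, and it is not the extraction logic SpamAssassin uses. The sample message and its addresses are made up.

```python
import re
from email import message_from_string

ADDRESS_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
HEADERS = ["To", "From", "Cc", "Reply-To", "Sender", "Errors-To", "Mail-Followup-To"]


def addresses_in_message(raw_message):
    """Return the sorted, lowercased addresses in the headers and body."""
    msg = message_from_string(raw_message)
    found = set()
    for name in HEADERS:
        for value in msg.get_all(name, []):
            found.update(a.lower() for a in ADDRESS_RE.findall(value))
    payload = msg.get_payload()
    if isinstance(payload, str):  # multipart messages are skipped in this sketch
        found.update(a.lower() for a in ADDRESS_RE.findall(payload))
    return sorted(found)


sample = (
    "From: spammer@example.net\n"
    "To: you@example.com\n"
    "Content-Type: text/html\n"
    "\n"
    '<a href="mailto:friend@example.org"></a>Buy now!\n'
)
print(addresses_in_message(sample))
```

If the preview lists addresses you recognize as friendly, blacklist the offending sender individually instead.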
Bayesian Filtering
SpamAssassin’s Bayesian classifier learns to distinguish the features that characterize spam from those that characterize non-spam in the messages that you receive. Properly trained, the Bayesian classifier can reduce both false positives and false negatives.
Principles
Bayesian filtering is based on Bayes’ Theorem, a statement of probability theory by the Reverend Thomas Bayes that was published posthumously in 1763. Bayes’ Theorem is important in many fields where classifying data is essential, including computer vision, psychophysics, and diagnostic decision-making in health care. SpamAssassin’s implementation is mostly based on the work of Paul Graham (archived at http://www.paulgraham.com) and Gary Robinson (http://www.garyrobinson.net).
Conceptually, Bayes’ Theorem states that the probability of some event (such as a message being spam) given a test result (such as matching a spam-checking rule) depends on the baseline probability of the event before the test result is known and on the discriminating power of the test. A corollary is that the discriminating power of a test can be measured by comparing the probability of the event given a known test result to the baseline probability before the result is known. The more the test result can increase (or decrease) the probability from baseline, the stronger the test.
Tip
Actually, SpamAssassin’s “Bayesian” system doesn’t really compute the baseline probability or frequency of spam versus non-spam messages—which some have argued means it’s not strictly Bayesian at all. Instead it assumes values that seem reasonable and useful.
In the context of spam-checking, a Bayesian approach amounts to developing potential rules and asking how much each rule, if matched, should change the system’s perception of the likelihood that a message is spam. Very strong rules come in two forms. Some are patterns that only occur in spam (and never in non-spam), thus yielding a high probability that a message that matches one of the patterns is spam. Others are patterns that only occur in non-spam (and never in spam), thus yielding a low probability that a message that matches the pattern is spam. Weaker rules—patterns found in both spam and non-spam messages but with different frequencies—result in less extreme probabilities.
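A small worked example may make the difference between strong and weak rules clearer. The Python sketch below applies Bayes' Theorem directly; the function name and all of the probabilities are invented for illustration and do not come from SpamAssassin.

```python
def posterior_spam_probability(p_spam, p_match_given_spam, p_match_given_ham):
    """P(spam | pattern matched), computed with Bayes' Theorem."""
    p_ham = 1.0 - p_spam
    # Total probability that the pattern matches any message.
    evidence = p_match_given_spam * p_spam + p_match_given_ham * p_ham
    return p_match_given_spam * p_spam / evidence


# A strong rule: the pattern appears in 40% of spam but only 1% of non-spam.
print(round(posterior_spam_probability(0.5, 0.40, 0.01), 3))

# A weak rule: the pattern appears in 30% of spam and 20% of non-spam.
print(round(posterior_spam_probability(0.5, 0.30, 0.20), 3))
```

Starting from the same baseline of 0.5, the strong rule moves the probability close to 1, while the weak rule moves it only slightly.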
To use Bayesian filtering successfully, you must have a corpus of messages that you have decided are definitely spam, a corpus of messages that you have decided are definitely non-spam, and an algorithm for analyzing the two sets of messages to develop rules and test their strength. SpamAssassin provides the algorithm and a script that you can use to identify messages as spam or non-spam in order to train the filter. It also provides a mechanism for training itself with messages that are very likely to be spam or non-spam.
The results of the SpamAssassin learning process are a set of databases. One database contains tokens (strings of 3-15 characters) that have been seen, how often each has been seen in spam and non-spam messages, and the date and time that each token last proved useful in classifying a message. During learning, tokens are derived from both the message headers (with several commonly misleading headers ignored) and message body. Tokens that haven’t been useful in a long time may be removed from the database to increase efficiency. Another database keeps track of which messages have been learned, so SpamAssassin doesn’t waste time relearning old messages.
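The two databases described above can be pictured as a pair of dictionaries. This Python sketch is a simplified model only: the class and method names are assumptions, and splitting on whitespace is a stand-in for SpamAssassin's real tokenizer.

```python
import re
import time


class TokenStore:
    """Toy model of the token database and the seen-messages database."""

    def __init__(self):
        self.tokens = {}  # token -> [spam_count, ham_count, last_useful_time]
        self.seen = {}    # message id -> "spam" or "ham"

    @staticmethod
    def tokenize(text):
        # Keep whitespace-separated strings of 3 to 15 characters.
        return {w for w in re.findall(r"\S+", text) if 3 <= len(w) <= 15}

    def learn(self, message_id, text, is_spam):
        """Learn a message once; return False if it was already learned."""
        if message_id in self.seen:
            return False
        for tok in self.tokenize(text):
            entry = self.tokens.setdefault(tok, [0, 0, 0.0])
            entry[0 if is_spam else 1] += 1
        self.seen[message_id] = "spam" if is_spam else "ham"
        return True

    def mark_useful(self, token, when=None):
        """Record that a token just helped classify a message."""
        if token in self.tokens:
            self.tokens[token][2] = time.time() if when is None else when
```

Tracking when a token was last useful is what makes it possible to expire tokens that have not helped in a long time.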
During spam-checking, a message to be checked is split into tokens. SpamAssassin then looks up each token in the token database. Up to 150 of the most diagnostic tokens in the message are identified, and their associated predictive values are combined using one of two mathematical functions to yield a final prediction of the probability that the message is spam. This predicted probability is matched by special SpamAssassin rules that associate probability ranges with spam score modifiers.
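The selection and combining steps can be sketched as follows. The two combining functions follow the general approaches published by Paul Graham (a naive product) and Gary Robinson (chi-squared combining); they are written from those descriptions as an illustration and are not taken from SpamAssassin's source. All function names are assumptions.

```python
import math


def most_diagnostic(probs, limit=150):
    """Keep the token probabilities farthest from the neutral value 0.5."""
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:limit]


def naive_combine(probs):
    """Combine probabilities as if every token were independent evidence."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def chi2q(x2, dof):
    """Upper-tail chi-squared probability; dof must be even."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi2_combine(probs):
    """Chi-squared combining; each probability must lie strictly in (0, 1)."""
    n = len(probs)
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0


tokens = most_diagnostic([0.99, 0.95, 0.5, 0.52, 0.9], limit=3)
print(tokens, naive_combine(tokens), chi2_combine(tokens))
```

When the evidence is evenly balanced between spam and non-spam, both functions return a value near 0.5.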
Configuration
SpamAssassin’s Bayesian classifier is controlled by more than a dozen configuration directives, though only a few are regularly modified by system administrators. These are the most useful:
-
use_bayes
This directive controls whether the Bayesian classifier is used at all. It defaults to 1 (use Bayesian filtering). By setting it to 0, Bayesian filtering is disabled completely.
-
bayes_auto_learn, bayes_auto_learn_threshold_nonspam, bayes_auto_learn_threshold_spam
These directives configure the automatic learning system, which automatically feeds messages with very high or very low spam scores to the Bayesian classifier. The bayes_auto_learn directive enables (1) or disables (0) this feature; it is enabled by default. The threshold directives determine which messages will be automatically learned as spam or non-spam. Messages with spam scores lower than bayes_auto_learn_threshold_nonspam are learned as non-spam; this value defaults to 0.1. Messages with spam scores higher than bayes_auto_learn_threshold_spam are learned as spam; this value defaults to 12 and cannot be set lower than 6. The spam score used for making this determination does not include modifiers for the Bayesian system itself, for the autowhitelist, or for user-configured whitelists or blacklists.
-
bayes_ignore_header
headername
This directive tells the Bayesian classifier to ignore the given header when learning or classifying messages. It is most often used when another spam-tagging system adds headers before SpamAssassin receives the message, in order to prevent the classifier from learning the other spam tag instead of the features of the actual message.
-
bayes_ignore_from
address
(SpamAssassin 3.0) This directive prevents Bayesian classification and learning from being performed on messages sent from address and is a form of whitelisting. It’s most useful when you want to receive messages from a few senders and the messages may include tokens that would otherwise suggest spam.
You can use multiple bayes_ignore_from directives or multiple addresses in a single directive to whitelist several addresses. You can also use an asterisk (*) as a wildcard for zero or more characters and a question mark (?) as a wildcard for zero or one character, much as you would to specify filename patterns in a shell.
-
bayes_ignore_to
address
(SpamAssassin 3.0) This directive prevents Bayesian classification and learning from being performed on messages sent to address, and is a form of whitelisting recipients. It’s useful in sitewide Bayesian filtering to prevent any learning from being performed from messages sent to postmaster, for example, who is likely to receive forwarded spam, non-spam messages discussing spam, etc. Specify addresses as you would for the bayes_ignore_from directive discussed previously.
-
bayes_learn_during_report
When this directive is enabled (1), messages that are reported to clearinghouses as spam with the spamassassin --report command are also learned as spam by the Bayesian classifier. This saves you an extra learning step. Set the directive to 0 to disable this feature. It is enabled by default.
-
bayes_path and bayes_file_mode
By default, SpamAssassin maintains separate Bayesian databases for each user on the system. The databases for a user are stored in the .spamassassin subdirectory of the user’s home directory and their names begin bayes_, such as bayes_seen and bayes_toks. These files are kept in one of several possible database formats (Berkeley DB format is generally preferred when it’s available to SpamAssassin).
Separate databases for each user are ideal for Bayesian learning because different users may receive different kinds of spam and non-spam messages. However, it is often necessary to maintain a single Bayesian database for all users of a system, either to save on disk space or because users don’t have home directories. You can configure a systemwide Bayesian database set by setting the bayes_path directive to the full path of the Bayesian database file prefix. For example, to set up systemwide Bayesian databases in the files /etc/mail/spamassassin/bayes_*, use the following directive:
bayes_path /etc/mail/spamassassin/bayes
By default, the Bayesian databases are created with mode 0700. The
bayes_file_mode
directive can be used to set a different file mode (e.g., 0770) if you need to share the databases among a group. This might be necessary if SpamAssassin can be invoked with the privileges of different users. Care should be taken with this directive, as a malicious user with access to the Bayesian databases can cause legitimate email to be mistagged as spam.
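The automatic learning thresholds described in the list above amount to a simple three-way decision. The following Python sketch mirrors that description; the function name and return values are assumptions chosen for the example.

```python
def autolearn_decision(score, nonspam_threshold=0.1, spam_threshold=12.0):
    """Decide whether a message would be auto-learned as spam, ham, or not at all.

    `score` is assumed to exclude modifiers from the Bayesian rules, the
    autowhitelist, and user-configured whitelists or blacklists.
    """
    if spam_threshold < 6:
        raise ValueError("the spam threshold cannot be set lower than 6")
    if score > spam_threshold:
        return "spam"
    if score < nonspam_threshold:
        return "ham"
    return "no"


for s in (-1.2, 3.0, 15.5):
    print(s, autolearn_decision(s))
```

The wide gap between the two default thresholds means that most borderline messages are never learned automatically and must be trained by hand.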
The following directives influence the internal workings of the Bayesian classifier. For the most part, they can be left to the default settings.
-
bayes_min_ham_num and bayes_min_spam_num
These directives set the minimum number of ham (non-spam) and spam messages that must be learned by SpamAssassin before it will use the predictions of the Bayesian classifier to score new messages. They default to 200 each; until 200 ham and 200 spam messages have been learned, the SpamAssassin rules that rely on the Bayesian classifier will not be applied to email.
-
bayes_use_hapaxes
Hapaxes are tokens that have been seen only once during learning so far. Accordingly, SpamAssassin’s concept of whether a hapax is associated with spam or ham is based on limited data and may not be reliable. On the other hand, SpamAssassin can learn hundreds or thousands of hapaxes, and using hapaxes seems to provide better accuracy, so this setting defaults to 1 (enabled).
-
bayes_use_chi2_combining
This directive controls which of the two mathematical functions are used to combine token probabilities into an overall message probability. When enabled (1), the approach is based on the distribution of the chi-squared statistic; when disabled (0), a so-called “naïve Bayesian” function combines the probabilities using the assumption that errors in classification from each token are independent of one another. SpamAssassin’s maintainers have found the chi-squared method more useful, and it is the default.
-
bayes_auto_expire and bayes_expiry_max_db_size
When bayes_auto_expire is enabled (1), which is the default, SpamAssassin will automatically attempt to remove old tokens during learning when the token database exceeds bayes_expiry_max_db_size tokens. When disabled (0), token expiration must be performed manually. Automatic expiration occurs no more than once every 12 hours.
-
bayes_learn_to_journal and bayes_journal_max_size
When bayes_learn_to_journal is enabled (1), SpamAssassin will store newly learned data in a journal file, rather than directly into the Bayesian databases. The journal file will be synchronized into the databases at least daily, or when the journal exceeds bayes_journal_max_size bytes (102,400 by default). Using journaling reduces disk contention for the databases, which must be exclusively locked while being updated, but results in a delay between the time a message is learned and the time the learned tokens can be used to classify further messages. Journaling might be particularly useful if the journal could be kept in a different location than the databases (e.g., on a RAM disk), but a separate journal location is not supported as of SpamAssassin 3.0. bayes_learn_to_journal is disabled by default.
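Two of the directives in this list are easy to express as small functions: the minimum-message gate and the treatment of hapaxes. The Python sketch below is illustrative only; the function names and the shape of `token_counts` (a mapping from token to the number of times it has been seen) are assumptions.

```python
def bayes_rules_active(ham_learned, spam_learned, min_ham=200, min_spam=200):
    """Bayesian rules apply only after enough ham AND spam have been learned."""
    return ham_learned >= min_ham and spam_learned >= min_spam


def usable_tokens(token_counts, use_hapaxes=True):
    """Return the tokens eligible for classification."""
    if use_hapaxes:
        return set(token_counts)
    # A hapax has been seen exactly once, so drop anything with a count of 1.
    return {tok for tok, n in token_counts.items() if n > 1}


print(bayes_rules_active(ham_learned=250, spam_learned=150))
print(sorted(usable_tokens({"viagra": 12, "xq7zz": 1}, use_hapaxes=False)))
```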
Training
There are two main strategies for training a Bayesian classifier: train everything and train-on-error. In the train everything strategy, you train the classifier with every message that you receive. This strategy is highly responsive to changes in spam patterns but may change too quickly in response to unrelated variability in messages. In addition, it is resource intensive to scan every message. In the train-on-error strategy, you train the classifier only with messages that it has previously classified incorrectly (i.e., false positives and false negatives). This strategy is resource efficient but may not train the classifier as quickly when spam patterns change.
Based on experiments conducted by Greg Louis (and described at http://www.bgl.nu/bogofilter/), the train everything strategy appears to be more efficient for initial training. Once a suitable number of messages have been learned, however, switching to a train-on-error approach saves resources, because many fewer messages must be trained. Louis suggests that switching to train-on-error after 10,000 spam and 10,000 non-spam messages have been learned may be reasonable. You can train SpamAssassin’s Bayesian classifier with either strategy.
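The two strategies, and the suggested switch between them, can be summarized in a short Python sketch. The function names and strategy labels are assumptions made for illustration; the 10,000-message switch point is the figure suggested above.

```python
def pick_strategy(spam_learned, ham_learned, switch_at=10_000):
    """Train everything at first; switch to train-on-error once both corpora are large."""
    if spam_learned >= switch_at and ham_learned >= switch_at:
        return "train-on-error"
    return "train-everything"


def should_train(strategy, predicted_spam, actually_spam):
    """Decide whether a newly verified message should be fed to the classifier."""
    if strategy == "train-everything":
        return True
    # Train-on-error: learn only false positives and false negatives.
    return predicted_spam != actually_spam


strategy = pick_strategy(spam_learned=12_000, ham_learned=10_500)
print(strategy, should_train(strategy, predicted_spam=False, actually_spam=True))
```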
The sa-learn
script is your primary interface for training the Bayesian
classifier. The first step in using Bayesian filtering is collecting
a corpus of messages you’ve received that you have
verified are spam and a corpus that you’ve verified
are non-spam. The easiest and best way to do so is to simply start
saving spam you receive to one folder and any non-spam messages that
you would ordinarily delete to another. The two collections of
messages can either be in maildir format (in
which each file contains a single message) or
mbox format (in which a single file contains
multiple messages).
It’s important that the messages be from the same time period; if you train SpamAssassin with a set of spam messages from 2003 and a set of non-spam messages from 2004, it will quickly learn that an effective way to detect spam is to look for messages in 2003! Similarly, forwarded spam, or messages discussing spam in your corpus (“Hey, look at this spam I just got; it’s really strange. Here it is . . . “) can result in the classifier learning artificial rules that will degrade its accuracy with normal messages.
Next, run sa-learn
on
each corpus, using either the --spam
or
--ham
command-line options to specify what each
corpus represents. Example 4-1 shows the process for
a set of mbox files—a file of saved spam,
a file of saved (non-spam) messages related to a project, and the
user’s mail spool. The project files and mail spool
files together form a corpus of known good messages. This example
assumes that each user maintains her own Bayesian databases, so
sa-learn
is run by each user on her own messages.
$ ls -F Mail
spam myproject
$ sa-learn --mbox --spam Mail/spam
$ sa-learn --mbox --ham Mail/myproject
$ sa-learn --mbox --ham /var/spool/mail/$LOGNAME
Example 4-2 shows the process for a set of
maildirs, again assuming that each user has his
own Bayesian databases. The commands in the example are those that
would be executed by each individual user. Providing a directory as
an argument to sa-learn
causes it to learn from
every file in that directory. The example also illustrates the use of
the --no-rebuild
option to defer rebuilding of the
databases until the --rebuild
option is used. When
performing learning on a large set of small files (the very essence
of a maildir), deferring the expensive
database-rebuilding step is more efficient than rebuilding after each
file.
$ ls -F mail
INBOX/ spam/ myproject/
$ sa-learn --no-rebuild --spam mail/spam
$ sa-learn --no-rebuild --ham mail/INBOX
$ sa-learn --no-rebuild --ham mail/myproject
$ sa-learn --rebuild
If you’re the sort who likes to see the progress of
the training (or who worries when you run a command that takes longer
than a few seconds to finish), you can add the
--showdots
option to cause
sa-learn
to print a period for each message it
processes.
You can also call sa-learn on an individual file containing a mail message, or you can pipe a mail message to sa-learn’s standard input. Finally, you can put the names of mailboxes, files, or directories into a file and run sa-learn with the --folders=filename option, and it will read the file and directory names from the filename file and learn from each.
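If you script your training, it can be convenient to assemble sa-learn command lines programmatically. The Python sketch below only builds the argument list from the options discussed in this section and does not run anything; the function name is an assumption, and combining --folders with --spam or --ham in one invocation is assumed rather than confirmed here.

```python
import shlex


def sa_learn_command(kind, targets=(), mbox=False, no_rebuild=False,
                     showdots=False, folders_file=None):
    """Build (but do not run) an sa-learn command line as a list of arguments."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    cmd = ["sa-learn"]
    if mbox:
        cmd.append("--mbox")
    if no_rebuild:
        cmd.append("--no-rebuild")
    if showdots:
        cmd.append("--showdots")
    cmd.append(f"--{kind}")
    if folders_file is not None:
        cmd.append(f"--folders={folders_file}")
    cmd.extend(targets)
    return cmd


print(shlex.join(sa_learn_command("spam", ["Mail/spam"], mbox=True)))
```

The resulting list can be handed to subprocess.run once you have checked that it matches what you would type by hand.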
Tip
The Bayesian classifier is most effective when trained on large collections of both spam and non-spam messages. In particular, training using many spam messages and fewer non-spam messages is likely to produce an ineffective filter. Aim for a couple thousand messages of each type, collected prospectively from your personally received mail.
If you mistakenly train the Bayesian classifier that a message is
spam, simply direct sa-learn
to relearn it as ham;
if you mistakenly learn a message as ham, you can direct
sa-learn
to relearn it as spam. This process is
also how you later train the classifier on errors. You can also cause
SpamAssassin to forget a message entirely by running
sa-learn
--forget
on the
message.
sa-learn also accepts the same --configpath /path/to/ruleset/directory, --prefspath /path/to/user_prefs, and --siteconfigpath /path/to/sitewide/directory options that the spamassassin script does. They are described in Chapter 2.
Daily Use
When you first enable the Bayesian classifier in SpamAssassin, you will initially notice little change in the way messages are checked for spam. Once you’ve trained the classifier with enough messages, however, your spam scores for messages will begin to change substantially in two ways:
Messages will show that they are hitting SpamAssassin rules with names like BAYES_44 or BAYES_80. These rules, which can be found in the 23_bayes.cf file, are triggered when the Bayesian classifier assigns a given probability of spam to a message. For example, the BAYES_44 rule is matched when a message has a probability of spam between 0.44 and 0.4999; the BAYES_80 rule is triggered when a message has a probability of spam between 0.80 and 0.90. Rules that match on probabilities less than 0.5 lower spam scores, and those that match on probabilities greater than 0.5 raise spam scores.
Most of the non-Bayesian rules assign different scores when the classifier is trained and in use than when it is not. In many cases, non-Bayesian rules produce less extreme scores, which reflects the supposition that the Bayesian classifier should be better than static rules at distinguishing spam from non-spam.
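The relationship between a predicted probability and the BAYES_ rules can be sketched as a range lookup. The Python example below includes only the two ranges quoted above; the remaining rules in 23_bayes.cf are deliberately omitted, and treating the range endpoints as inclusive is an assumption.

```python
# Only the two ranges mentioned in the text; the real rule file defines more.
KNOWN_RANGES = {
    "BAYES_44": (0.44, 0.4999),
    "BAYES_80": (0.80, 0.90),
}


def matching_rule(probability):
    """Return the name of the known rule whose range contains probability."""
    for name, (low, high) in KNOWN_RANGES.items():
        if low <= probability <= high:
            return name
    return None  # falls in a range this sketch does not list


def score_direction(probability):
    """Rules below 0.5 lower the spam score; rules above 0.5 raise it."""
    if probability < 0.5:
        return "lowers"
    if probability > 0.5:
        return "raises"
    return "neutral"


print(matching_rule(0.85), score_direction(0.85))
print(matching_rule(0.45), score_direction(0.45))
```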
Ongoing training
Ongoing training is essential to maintaining the performance of a Bayesian filter. As in initial training, you must continue to provide examples of both spam and non-spam messages.
As you receive messages, check each message classified as spam to be sure that it is really spam and not a false positive. If the message’s spam score is higher than the threshold for automatic learning, the message should have already been fed back into the classifier to train it. You can determine if this has happened by looking at the autolearn= section of the X-Spam-Status header added by SpamAssassin. If the message’s spam score wasn’t high enough for automatic learning, submit it to sa-learn --spam yourself. If you come across a false positive, submit it to sa-learn --ham instead.
Similarly, you can submit your non-spam messages to sa-learn --ham if their spam scores are too high for the automatic learning threshold for ham. Any spam SpamAssassin misses should definitely be submitted to sa-learn --spam.
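Checking the autolearn= section by eye gets tedious, so a small script can do it. The sketch below assumes the message is stored in a file by itself; autolearn_status is a hypothetical name. It unfolds a wrapped X-Spam-Status header before looking for the value.

```shell
#!/bin/sh
# autolearn_status FILE: print the autolearn= value from X-Spam-Status, if any.
autolearn_status() {
    awk '
        /^$/ { exit }                       # blank line: end of the headers
        /^[ \t]/ {                          # folded continuation line
            if (in_status) hdr = hdr $0
            next
        }
        {                                   # start of a new header field
            in_status = (tolower($0) ~ /^x-spam-status:/)
            if (in_status) hdr = $0
        }
        END {
            # "autolearn=" is 10 characters long; print what follows it.
            if (match(hdr, /autolearn=[a-z]+/))
                print substr(hdr, RSTART + 10, RLENGTH - 10)
        }
    ' "$1"
}
```

A value such as spam or ham indicates that the message was learned automatically; a value such as no means it was not, and it is a candidate for manual submission to sa-learn.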
You can make ongoing training more convenient in one of two common ways. If you read your email with an email client that allows you to bind commands to keys, you could define keystrokes to invoke sa-learn --ham or sa-learn --spam on the current message. Another approach is to save all spam messages into a single mail folder and all non-spam messages that you plan to delete into a second folder, and then run sa-learn on each folder (and possibly on your inbox if you keep many undeleted messages there) at the end of your mail-reading session. Users or system administrators can set up cron jobs to automate this process.
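The two-folder approach might be scripted as follows. This is a sketch under stated assumptions: the folders are mbox files, train_folders and SA_LEARN are hypothetical names, and the paths in the comments are placeholders.

```shell
#!/bin/sh
# SA_LEARN may be overridden with a full path; it defaults to sa-learn.
: "${SA_LEARN:=sa-learn}"

# train_folders SPAM_MBOX HAM_MBOX
# Feed each non-empty mbox folder to sa-learn with the matching class.
train_folders() {
    spam_mbox=$1
    ham_mbox=$2
    status=0
    # --mbox tells sa-learn that the file holds many messages in mbox format.
    if [ -s "$spam_mbox" ]; then
        "$SA_LEARN" --spam --mbox "$spam_mbox" || status=1
    fi
    if [ -s "$ham_mbox" ]; then
        "$SA_LEARN" --ham --mbox "$ham_mbox" || status=1
    fi
    return $status
}

# Example call (placeholder paths):
#   train_folders "$HOME/mail/spam" "$HOME/mail/ham-to-delete"
```

A script that ends with such a call can be run from cron, for example nightly, so that training happens without any manual step beyond filing messages in the right folder.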
Expiration and importing
Expiration and importing are two other functions of sa-learn that you will use infrequently. Expiration removes old tokens from the database, and importing updates the database if a new SpamAssassin release changes database formats.
As discussed earlier in this chapter, when bayes_auto_expire is enabled (the default), SpamAssassin’s Bayesian classifier regularly reviews its database of tokens to determine if any should be expired. Expiration is always skipped when fewer than 100,000 tokens are in the database. The automatic expiration process runs no more than once every 12 hours and only when the number of tokens exceeds bayes_expiry_max_db_size.
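The three conditions above can be restated as a small predicate. This models only the prose description, not SpamAssassin's actual implementation; should_auto_expire and its arguments are hypothetical, and times are given in seconds since the epoch.

```shell
#!/bin/sh
# should_auto_expire NTOKENS MAX_DB_SIZE LAST_EXPIRE NOW
# Returns 0 (true) when the conditions described in the text would
# allow an automatic expiration run, and 1 otherwise.
should_auto_expire() {
    ntokens=$1
    max_db_size=$2      # value of bayes_expiry_max_db_size
    last_expire=$3
    now=$4
    # Databases with fewer than 100,000 tokens are never expired.
    [ "$ntokens" -ge 100000 ] || return 1
    # Expire only when the token count exceeds the configured maximum.
    [ "$ntokens" -gt "$max_db_size" ] || return 1
    # Run no more than once every 12 hours (43,200 seconds).
    [ $((now - last_expire)) -ge 43200 ] || return 1
    return 0
}
```

For example, a database of 140,000 tokens with a maximum of 150,000 would not be expired no matter how long ago the last run was.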
If you do not use bayes_auto_expire, or if you want to expire tokens manually, you can force an expiration attempt by running sa-learn --force-expire. Doing so may not actually expire any tokens; for example, when the database holds fewer than 100,000 tokens, or when all tokens have been used recently, no tokens will be expired.
The sa-learn --import command is used to update the Bayesian databases from their format in an older version of SpamAssassin to the current format. The release notes for new versions of SpamAssassin should tell you when running sa-learn --import is necessary. In many cases, SpamAssassin will perform importation when it automatically learns a new message, so this command may not be necessary.
Storing Bayesian Data in SQL
SpamAssassin 3.0 can optionally store per-user Bayesian data in an SQL database, which is useful when users don’t have accounts on the mail server. To store Bayesian data in SQL, you must install the DBI Perl module and an appropriate driver module for your SQL server. Common choices are DBD-mysql (for the MySQL server), DBD-Pg (for the PostgreSQL server), and DBD-ODBC (for connection to an ODBC-compliant server).
You should create a database and a user with privileges to access it. You must then create a set of tables in the database to store the Bayesian data. The SpamAssassin source code includes schemas for MySQL, PostgreSQL, and SQLite tables in the sql subdirectory. Here is the MySQL schema:
CREATE TABLE bayes_expire (
  username varchar(200) NOT NULL default '',
  runtime int(11) NOT NULL default '0',
  KEY bayes_expire_idx1 (username)
) TYPE=MyISAM;

CREATE TABLE bayes_global_vars (
  variable varchar(30) NOT NULL default '',
  value varchar(200) NOT NULL default '',
  PRIMARY KEY (variable)
) TYPE=MyISAM;

INSERT INTO bayes_global_vars VALUES ('VERSION','2');

CREATE TABLE bayes_seen (
  username varchar(200) NOT NULL default '',
  msgid varchar(200) binary NOT NULL default '',
  flag char(1) NOT NULL default '',
  PRIMARY KEY (username,msgid),
  KEY bayes_seen_idx1 (username,flag)
) TYPE=MyISAM;

CREATE TABLE bayes_token (
  username varchar(200) NOT NULL default '',
  token varchar(200) binary NOT NULL default '',
  spam_count int(11) NOT NULL default '0',
  ham_count int(11) NOT NULL default '0',
  atime int(11) NOT NULL default '0',
  PRIMARY KEY (username,token)
) TYPE=MyISAM;

CREATE TABLE bayes_vars (
  username varchar(200) NOT NULL default '',
  spam_count int(11) NOT NULL default '0',
  ham_count int(11) NOT NULL default '0',
  last_expire int(11) NOT NULL default '0',
  last_atime_delta int(11) NOT NULL default '0',
  last_expire_reduce int(11) NOT NULL default '0',
  PRIMARY KEY (username)
) TYPE=MyISAM;
For each user, these tables maintain information about token expiration (bayes_expire), messages seen (bayes_seen), tokens seen (bayes_token), and per-user configuration variables (bayes_vars). A table for global configuration variables (bayes_global_vars) is also available. The names of the columns in these tables are similar to the corresponding SpamAssassin configuration variables and indicate the data they store.
To configure SQL support for Bayesian data, set the following configuration parameters in your systemwide configuration file (local.cf):
- bayes_store_module Mail::SpamAssassin::BayesStore::SQL
Configures SpamAssassin to use SQL-based storage for Bayesian data instead of file-based (DBM) storage.
- bayes_sql_dsn DSN
Defines the data source name for the SQL database. See the earlier definition of user_awl_dsn for examples of how to define a DSN.
- bayes_sql_username username
Defines the username that will be used to connect to the database server. This user must have permission to modify the data in the tables (including inserting and deleting rows).
- bayes_sql_password password
Defines the password associated with the username that will be used to connect to the server.
SpamAssassin will now store Bayesian data learned from messages (either automatically or via sa-learn) in the SQL database and will look up tokens in this database when checking messages for a user.
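As an illustration of the resulting configuration, the helper below prints the four lines for local.cf from a DSN, username, and password. It is a sketch only: bayes_sql_config is a hypothetical function, the directive names are those documented for SpamAssassin 3.0 (bayes_sql_dsn, bayes_sql_username, bayes_sql_password), and the DSN in the usage note is a placeholder.

```shell
#!/bin/sh
# bayes_sql_config DSN USERNAME PASSWORD
# Print the local.cf lines that switch Bayesian storage to SQL.
bayes_sql_config() {
    cat <<EOF
bayes_store_module Mail::SpamAssassin::BayesStore::SQL
bayes_sql_dsn      $1
bayes_sql_username $2
bayes_sql_password $3
EOF
}
```

Calling bayes_sql_config 'DBI:mysql:spamassassin:localhost' sa_user secret and appending the output to local.cf would point SpamAssassin at a MySQL database named spamassassin on the local host. Because the file then contains a password, it should not be world-readable.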
SpamAssassin provides one additional configuration variable for SQL storage of Bayesian data:
- bayes_sql_override_username someusername
When this directive is set, the SQL query for Bayesian data will use someusername in place of the current user’s name when adding new message data or retrieving data for message-checking. Generally, this directive should only be used in per-user configuration files so that most users have their own personal Bayesian data. In principle, you could also use it in the sitewide configuration file to create a sitewide Bayesian database, and then use it in per-user configuration files to exclude certain users from the sitewide data.
A Sitewide Bayesian Classifier
Bayesian filtering is most effective when each user maintains his own set of token databases trained from his own email. By learning about the peculiar characteristics of spam and non-spam messages received by an individual user, the Bayesian classifier becomes an effective test for future messages to that user. For example, a pharmacist might receive a lot of legitimate email about sildenafil citrate; if a classifier trained on other users’ mail tagged all of these messages as spam (or worse), it could be a serious problem.
Many sites, however, prefer to have a single set of databases for all users at the site, either to save disk space or because users do not have home directories and setting up SpamAssassin 3.0’s SQL storage is infeasible. Setting up a sitewide Bayesian classifier is possible with SpamAssassin. Perform the following steps:
Set bayes_path and bayes_file_mode in the systemwide configuration file. Be sure the directory specified in bayes_path is readable, writable, and searchable by the user that SpamAssassin will be running as, so that it can create the proper files. The bayes_file_mode should be as strict as possible, typically 0700, which is the default setting. It’s a good idea to set it explicitly, rather than rely on the default.
Provide a mechanism for users or administrators to submit messages for training. This step is the most difficult part of a sitewide Bayesian classifier. Because the database files will be owned by the user that SpamAssassin runs as, even local users typically will not be able to run sa-learn with the proper permissions to update the databases.
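The first step might look like the following sketch, which assumes it is run as the user that SpamAssassin runs as. The function name setup_sitewide_bayes and the directory in the usage note are hypothetical.

```shell
#!/bin/sh
# setup_sitewide_bayes DIR
# Create DIR with owner-only permissions and print matching local.cf lines.
setup_sitewide_bayes() {
    dir=$1
    mkdir -p "$dir" || return 1
    # Readable, writable, and searchable by the owner only.
    chmod 700 "$dir" || return 1
    # bayes_path is a filename prefix inside the directory; SpamAssassin
    # appends suffixes such as _toks and _seen to create the databases.
    printf 'bayes_path %s/bayes\n' "$dir"
    printf 'bayes_file_mode 0700\n'
}
```

Running setup_sitewide_bayes /var/spool/spamassassin/bayes and adding its output to the systemwide configuration file would place the shared databases under that directory. If the directory is created as root instead, it must afterward be given to the SpamAssassin user with chown.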
One solution for enabling users to submit spam messages for training is to ask users to bounce any spam they receive to a central mailbox that can be processed by a privileged script. For example, set up an email alias of spamtrap on the SpamAssassin system that pipes incoming messages to a script like that shown in Example 4-3. As an extra benefit, you can publicize the spamtrap address on public web pages or in Usenet postings and actually use it as a spam trap—spammers who harvest the address and send spam to it will find their spam fed into your learning and reporting systems.
#!/bin/sh
#
# This script accepts an email message on its standard input
# and feeds it to SpamAssassin's learning and/or reporting systems.
# It is meant to be run as root or as the user who owns the
# SpamAssassin Bayesian databases.

PATH=/bin:/usr/bin:/sbin:/usr/sbin

# Three choices; leave exactly one of them uncommented:

# 1. Use spamassassin --report alone if you have
#    bayes_learn_during_report enabled (the active choice here).
spamassassin --report

# 2. Use sa-learn and spamassassin --report when you don't have
#    bayes_learn_during_report enabled. sa-learn does not pass the
#    message through on its standard output, so save a copy first
#    and feed it to each command in turn.
# tmp=`mktemp` || exit 1
# cat > "$tmp"
# sa-learn --spam < "$tmp"
# spamassassin --report < "$tmp"
# rm -f "$tmp"

# 3. Use sa-learn alone.
# sa-learn --spam
Warning
If you ask users to use a centralized spamtrap address, it is crucial that they bounce or redirect their messages, rather than forward their messages. A forwarded message’s headers will show the message as being sent by the forwarding user, which is not what you want the Bayesian classifier to learn! Most mail clients provide a function for redirecting a message to a new address so that it still appears to be coming from the original sender. If your mail clients add extra headers when they do this, these headers are good candidates for bayes_ignore_header. You have to test to determine which, if any, headers your mail clients add and to be sure SpamAssassin is ignoring them.
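One way to run that test is to save a message before and after redirecting it and compare the header names. The sketch below assumes each copy is saved in its own file; header_names and added_headers are hypothetical helper names.

```shell
#!/bin/sh
# header_names FILE: print the distinct header field names, lowercased.
header_names() {
    awk '
        /^$/ { exit }                 # stop at the end of the headers
        /^[^ \t:]+:/ {                # a line that starts a header field
            sub(/:.*/, "")            # keep only the field name
            print tolower($0)
        }
    ' "$1" | sort -u
}

# added_headers ORIGINAL REDIRECTED
# Print header names that appear only in the redirected copy.
added_headers() {
    orig_list=$(mktemp) || return 1
    new_list=$(mktemp) || { rm -f "$orig_list"; return 1; }
    header_names "$1" > "$orig_list"
    header_names "$2" > "$new_list"
    comm -13 "$orig_list" "$new_list"    # lines unique to the second list
    rm -f "$orig_list" "$new_list"
}
```

Each name printed is a candidate for a bayes_ignore_header line in the configuration. Some of the names may simply be headers added by mail servers in transit, so the list should be reviewed rather than copied blindly.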
A similar solution for non-spam messages is much more difficult, for social rather than technical reasons. Users may well be reluctant to forward their legitimate email to any central address. Unfortunately, without a good corpus of non-spam messages, the Bayesian filter will not perform well. One possible approach is to raise the bayes_auto_learn_threshold_nonspam slightly (e.g., to 0.5 or 1.0) so that much legitimate email will be auto-learned.