SpamAssassin


About SpamAssassin

 

SpamAssassin is a mature and highly sophisticated mail filter for identifying spam. It is written in Perl and distributed under the terms and conditions of the Apache License 2.0.

 

SpamAware uses SpamAssassin for Win32, which runs as a background service. You can download the original version of SpamAssassin from www.spamassassin.org or the Windows version of the daemon from http://sawin32.sourceforge.net .

 

 

 

What's the power of SpamAssassin?

 

SpamAssassin's power rests on three interrelated principles: a multi-technique approach, modularity and extensibility.

 

The multi-technique approach refers to the different kinds of analysis that can be used to filter mail.

Each technique analyses mail on a different level:

Analysis of the mail's origin, based on the mail header information,
Static analysis based on lexical criteria,
Analysis based on whitelists or blacklists,
Rule-based analysis,
Statistical analysis based on common traits of spam.

 

Although every level of analysis has its particular merits, each also has its specific flaws. Analysing mail on more than one level increases the certainty that only spam is filtered and that ham (wanted mail) is left unharmed.

 

SpamAssassin has a modular structure that reflects its multi-technique approach. Each module performs a specific analysis and returns a score for the examined email. The sum of the scores from all modules is the "spam score". The higher this score, the greater the probability that the examined email is spam. This concept combines well with SpamAware, because SpamAware can choose which scores to use for spam identification.
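The scoring scheme described above can be sketched in a few lines of Python. This is only an illustration of the idea, not SpamAssassin's actual code: the test names, the patterns, the score values and the threshold of 5.0 are all invented for this example.

```python
import re

# Hypothetical tests: (name, pattern, score). These are NOT real SpamAssassin
# rules; they only illustrate how the scores of several tests are summed.
TESTS = [
    ("EXAMPLE_FREE_MONEY", re.compile(r"free money", re.I), 2.5),
    ("EXAMPLE_CLICK_HERE", re.compile(r"click here", re.I), 1.5),
    # A test with a negative score, comparable to a whitelist entry.
    ("EXAMPLE_KNOWN_SENDER", re.compile(r"^From: .*@example\.org", re.I | re.M), -3.0),
]


def spam_score(message):
    """Return the total score and the names of all tests that matched."""
    hits = [(name, score) for name, pattern, score in TESTS if pattern.search(message)]
    total = sum(score for _, score in hits)
    return total, [name for name, _ in hits]


def is_spam(message, threshold=5.0):
    """Treat a message as spam if its total score reaches the threshold."""
    total, _ = spam_score(message)
    return total >= threshold
```

With these made-up values, a message that matches the first two tests scores 4.0, and the same message from a sender at example.org scores only 1.0, because the third test subtracts 3.0 points.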

 

The most important methods are listed here:

• Tests on the email header, such as queries to the servers through which the email was allegedly relayed, to find out whether they really exist.

• Static tests, ranging from lexical checks of the email header and body to checks for complete stock phrases.

• Analysis of character sets with regard to local usage.

• Queries to RBL (Realtime Blackhole List) servers, which list the IP addresses of known spam sources.

• Queries to URL blacklists, which list websites already recognized as being advertised in spam mail.

• Automatic whitelisting: the sender's email address is put on the whitelist if the total score stays below a certain value.

• Use of a Bayes filter: a filter that evaluates the message with statistical algorithms.
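As an illustration of the blackhole list queries mentioned in the list above: such lists are normally queried over DNS by reversing the octets of the IPv4 address and appending the zone name of the list. The following Python sketch only builds that query name and performs no lookup. The zone "dnsbl.example.net" is a placeholder, not a real list, and the function name is chosen for this example.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the DNS name used to look up an IPv4 address in a blackhole list."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError for invalid input
    reversed_octets = ".".join(reversed(str(address).split(".")))
    return f"{reversed_octets}.{zone}"


# Placeholder zone; substitute the zone of the list you actually use.
print(dnsbl_query_name("192.0.2.10", "dnsbl.example.net"))
```

If the resulting name resolves to an address, the IP address is listed; if the name does not exist, it is not listed.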

 

Some of these methods, such as the whitelist, produce negative scores. Wanted mail therefore often obtains a negative total score. Which methods are applied is defined by "rules". Rules are stored in SpamAssassin's text files, so it is simple to add your own rule collections in additional files. Whether a mail is classified as spam or not is ultimately the user's decision.

Among other things, you can set a threshold value on the Spam tab of the Options menu; a message whose score reaches this threshold is treated as spam. The score of an examined message is recorded in the header of the email.
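If you want to read the score from a saved message yourself, a small script can extract it from the header. The sketch below assumes the header layout written by stock SpamAssassin, for example "X-Spam-Status: Yes, score=7.3 required=5.0 tests=...". Whether your SpamAware installation writes exactly this header is an assumption you should check against one of your own messages.

```python
import re
from email import message_from_string

# Matches "score=<number> required=<number>" inside the header value.
STATUS_PATTERN = re.compile(r"score=(-?\d+(?:\.\d+)?)\s+required=(-?\d+(?:\.\d+)?)")


def read_spam_status(raw_message):
    """Return (score, required) from the X-Spam-Status header, or None."""
    message = message_from_string(raw_message)
    status = message.get("X-Spam-Status")
    if status is None:
        return None  # the message was not scanned, or the header was removed
    match = STATUS_PATTERN.search(status)
    if match is None:
        return None  # header present, but not in the assumed layout
    return float(match.group(1)), float(match.group(2))
```

The function returns None rather than guessing when the header is missing or has a different layout.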

 

 

What is a Bayes filter and why should I care (or not)?

A Bayes filter is a statistical filter that identifies spam by collecting the patterns common to spam mail. It has to be trained properly in order to tell spam from ham successfully. To train a statistical filter, you must supply a sample of messages that are definitely spam and a sample that is definitely ham. The filter breaks these messages down into tokens, and each pattern of tokens is rated as typical of spam to a certain degree and as typical of ham to a certain degree.

The more spam-typical patterns occur in a mail, the greater the likelihood that it is spam. Likewise, the more ham-typical patterns occur in a mail, the greater the likelihood that it is ham.

Training the filter with spam mail only fails to achieve the training goal. To be of any use, the filter must learn not only the unwanted patterns but also the wanted patterns in a message.

Under certain conditions, a Bayes filter can be a valuable means of spam identification. Especially if your mail volume is very high and you have collected a lot of training material (i.e. messages that are spam but were not recognised as spam by the rules), a Bayes filter can decrease the system load during spam identification.

The smaller your mail volume, the less beneficial a Bayes filter will be, simply because of the lack of training material. It is therefore recommended to start with the pretrained Bayes filter settings provided by SpamAware.
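The principle can be illustrated with a very small naive Bayes classifier in Python. This is a teaching sketch under simplifying assumptions (single words as tokens, add-one smoothing, unknown words ignored); SpamAssassin's own Bayes module is considerably more elaborate, and the class and method names here are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a message into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyBayes:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def learn(self, text, label):
        """Learn one message; label must be 'spam' or 'ham'."""
        self.counts[label].update(tokenize(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        """Estimate the probability that the text is spam."""
        if not self.messages["spam"] or not self.messages["ham"]:
            raise ValueError("train with both spam and ham first")
        vocabulary = set(self.counts["spam"]) | set(self.counts["ham"])
        spam_total = sum(self.counts["spam"].values()) + len(vocabulary)
        ham_total = sum(self.counts["ham"].values()) + len(vocabulary)
        # Start from the ratio of learned spam to learned ham messages.
        log_odds = math.log(self.messages["spam"] / self.messages["ham"])
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # words never seen in training carry no evidence
            p_spam = (self.counts["spam"][token] + 1) / spam_total
            p_ham = (self.counts["ham"][token] + 1) / ham_total
            log_odds += math.log(p_spam / p_ham)
        # Convert log-odds to a probability without overflowing exp().
        if log_odds >= 0:
            return 1 / (1 + math.exp(-log_odds))
        odds = math.exp(log_odds)
        return odds / (1 + odds)
```

The check at the top of spam_probability reflects the point made above: with spam samples only, the filter has nothing to compare against and cannot produce a meaningful rating.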

 

 

What can I do to increase the efficiency of the filter?

You can train the Bayes filter manually with mail that SpamAssassin did not recognize. Additionally, you may want to collect spam and ham mail, save the messages as text files and train the filter on them with the "Learn Spam messages" function and/or the "Learn 'good' messages" function.

Because the Bayes filter with the "autolearn" function activated already learns all mail with a score over 120 as spam and under 0 as ham, manual training is particularly important for spam that was not recognized. The Bayes filter works with words, phrases and structures that occur again and again in spam or ham mail. It can therefore learn nearly as much from recognized spam as from unrecognized spam (and/or ham).

 

 

Training the filter manually:

Training the filter manually serves two purposes. First, it speeds up the initialization of the filter, which only becomes active once it has learned at least 200 ham messages and 200 spam messages. Second, if spam of one type breaks through again and again, manual training helps SpamAssassin catch it. It is recommended to train the Bayes filter with new ham as well, since that likewise increases its efficiency.
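The activation condition can be written down as a simple check. The Python sketch below takes the minimum of 200 messages per kind from the text above; the function names are invented for this illustration and are not part of SpamAssassin or SpamAware.

```python
def bayes_is_active(spam_learned, ham_learned, minimum=200):
    """Return True once enough messages of BOTH kinds have been learned."""
    return spam_learned >= minimum and ham_learned >= minimum


def messages_still_needed(spam_learned, ham_learned, minimum=200):
    """Return how many more (spam, ham) messages are needed for activation."""
    return max(0, minimum - spam_learned), max(0, minimum - ham_learned)
```

For example, after learning 500 spam messages but only 150 ham messages the filter is still inactive, and 50 more ham messages are needed. Learning more spam alone does not help.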

 

 

Add Rules:

All files in the Rules directory contain rules for SpamAssassin; when spam rules are activated in SmartPOP2Exchange V6, they are used automatically. If this is not sufficient, you should try varying the threshold value, or create your own spam rules or add rules downloaded from the internet (e.g. from http://www.rulesemporium.com/rules.htm ).
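A simple rule of your own consists of three lines in SpamAssassin's rule syntax: a "body" line with a regular expression, a "describe" line and a "score" line. The Python sketch below merely assembles such a rule as text, so that you can see the layout. The rule name, the phrase and the score are made-up examples, and the helper function is not part of SpamAssassin.

```python
def format_body_rule(name, pattern, score, description):
    """Return the text of one body rule with a case-insensitive pattern.

    The pattern is inserted as-is, so any "/" in it must already be escaped.
    """
    return (
        f"body {name} /{pattern}/i\n"
        f"describe {name} {description}\n"
        f"score {name} {score}\n"
    )


# Hypothetical rule: add 2.5 points if the body mentions the given phrase.
rule_text = format_body_rule(
    "LOCAL_EXAMPLE_WATCHES",
    r"cheap replica watches",
    2.5,
    "Body mentions cheap replica watches",
)
print(rule_text)
```

The printed text could then be saved as an additional rule file. Stock SpamAssassin reads rule files with the extension .cf; whether the Rules directory of your installation follows the same convention is an assumption to verify.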