
Spam Filter


2008/12/01 first draft
2008/12/06 corrected some mistakes in analyzing /sys/log/smtpd
2008/12/11 added section "The ability"

Introduction

Two types of spam

We observe two types of spam: commercial and non-commercial. Commercial spam is bothersome but not very harmful. Most non-commercial spam, however, is sent for criminal purposes: to spread viruses, to extract personal information from victims, to direct users to malicious web pages, and so on.

How can we protect ourselves against spam? Today's MUAs (mail user agents) have spam filters. For example, I read mail with Mac's 'Mail.app', my MUA, which retrieves mail from my mail server and sorts it into folders according to string patterns found in each message. An MUA's spam filter is effective only against commercial spam; it is useless against non-commercial spam.

Then how can we protect ourselves against non-commercial spam? As shown in the next subsection, more than half of incoming mails are blocked by our smtpd, which takes countermeasures only against trivial spam. Of course, these are non-commercial spam. Is blocking half of them enough? No! I still receive a great many non-commercial spam mails every day.

A Plan 9 user may have a file "pipeto" in /mail/box/$user. If the file exists, incoming mail is piped to it, and it is then responsible for delivering the mail to mbox, the mailbox of the individual user. So I developed an additional spam filter, "pipeto", which analyzes incoming mails and passes them on to the user's mbox. The filter relies on the following observation: regular mails (including commercial spam) come from mail servers with static domain names, whereas most non-commercial spam comes from three types of machines: (1) those without a domain name, (2) those with a dynamically allocated domain name and (3) those with a faked domain name.

Spam filter in smtpd

We can see how well the smtpd filter works by inspecting /sys/log/smtpd.
An example:

ar Nov 27 09:41:24 ehlo from 72.245.206.188 as mail.mobiledealersplace.com
ar Nov 27 09:41:28 [mail.mobiledealersplace.com/72.245.206.188] mobiledealersplace.com!pr1127 sent 175 bytes to ar.aichi-u.ac.jp!arisawa
ar Nov 27 09:53:05 ehlo from 116.52.71.176 as server
ar Nov 27 09:53:05 Hung up on 116.52.71.176; claimed to be server
ar Nov 27 09:05:59 helo from 89.163.0.24 as angeleschorale.org
ar Nov 27 09:06:00 Disallowed angeleschorale.org!xsfciboxqw (angeleschorale.org/89.163.0.24) to blocked, unknown or invalid name ar.aichi-u.ac.jp!fossilarisawa

We observe two lines per mail: the SMTP request and the result. This record format is the new one, in use since Feb 19 2008.

ar Feb 19 13:55:06 ehlo from 130.203.4.6 as mail.cse.psu.edu
ar Feb 19 13:55:11 [mail.cse.psu.edu/130.203.4.6] cse.psu.edu!9fans-bounces+arisawa=ar.aichi-u.ac.jp sent 182 bytes to ar.aichi-u.ac.jp!arisawa

My /sys/log/smtpd is a mixture of the old and new formats. So I extracted only the new-format records, which range from Feb 19 2008 to Dec 5 2008, into a file named smtpd1, and got the following result:

ar% grep '(helo|ehlo) from' smtpd1 |wc
 122805 1105245 8026041
ar% grep 'sent [0-9]+ bytes to' smtpd1 |wc
  53392  590127 6960192
ar% 

which means that about 57% of mails were refused. However, as shown later, most of the mails delivered to mbox are still spam.
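
The arithmetic behind the 57% figure can be sketched in Python as follows. This is only an illustration, not part of pipeto or smtpd: the function names are made up for the example, the two regular expressions are copied from the grep commands above, and it assumes that every "helo/ehlo from" line stands for one attempted mail.

```python
import re

# Patterns copied from the grep commands above.
SESSION = re.compile(r"(helo|ehlo) from")
DELIVERED = re.compile(r"sent [0-9]+ bytes to")


def refusal_rate(lines):
    """Return (sessions, delivered, fraction refused) for smtpd log lines."""
    sessions = sum(1 for line in lines if SESSION.search(line))
    delivered = sum(1 for line in lines if DELIVERED.search(line))
    refused = (sessions - delivered) / sessions if sessions else 0.0
    return sessions, delivered, refused


def rate_from_counts(sessions, delivered):
    """Same calculation when the two line counts are already known."""
    return (sessions - delivered) / sessions


if __name__ == "__main__":
    # Line counts reported by wc above.
    print(f"{rate_from_counts(122805, 53392):.1%}")
```

Applied to a log file, refusal_rate(open("smtpd1")) would give the same two counts as the grep commands, provided the file contains only new-format records.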

Pipeto

As mentioned before, most non-commercial spam comes from clients that (1) have no DNS name, (2) have a dynamically allocated domain name or (3) fake their domain name. Pipeto tries to identify such mails and puts a "spam" tag in the Subject header.

Spam that comes from a host with a static DNS name is not tagged "spam" unless the name is explicitly put into the blacklist; such spam, however, can also be handled by the MUA.

We can detect most spam simply by inspecting the 'Received: from' header, which carries two important pieces of information: the HELO host's FQDN and the client IP address. An example is shown below.

    Received: from amnetmortgage.com ([201.240.156.32]) by ar; Mon Dec  1 12:39:27 JST 2008

The HELO host's FQDN follows 'from', and the client IP address is shown in [ ]. Note that an SMTP client can forge the HELO host but cannot forge the IP address. The forgery can be detected by looking up the IP address of the HELO host with a DNS query:

ar% ndb/dnsquery
> amnetmortgage.com
amnetmortgage.com ip    169.200.183.83
> 
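
The check described above can be sketched in Python. This is a minimal illustration under stated assumptions, not pipeto's actual code: the function names are invented, the header is assumed to have exactly the one-line form written by smtpd, and `resolve` is a hypothetical caller-supplied function that stands in for a DNS query such as ndb/dnsquery. The comparison here is exact; the tolerance pipeto applies is described later in "HELO host and client IP".

```python
import re

# Matches the form smtpd writes: "Received: from HOST ([IP]) by ..."
RECEIVED = re.compile(r"^Received: from (\S+) \(\[([0-9.]+)\]\)")


def parse_received(header):
    """Return (helo_host, client_ip), or None if the header has another shape."""
    m = RECEIVED.match(header)
    return (m.group(1), m.group(2)) if m else None


def helo_is_forged(header, resolve):
    """True when the client IP is not among the addresses of the HELO host.

    resolve maps a host name to a list of IP strings (hypothetical helper).
    """
    parsed = parse_received(header)
    if parsed is None:
        raise ValueError("not a 'Received: from' header in the expected form")
    host, ip = parsed
    return ip not in resolve(host)


if __name__ == "__main__":
    # Canned answer taken from the ndb/dnsquery transcript above.
    table = {"amnetmortgage.com": ["169.200.183.83"]}
    hdr = ("Received: from amnetmortgage.com ([201.240.156.32]) by ar; "
           "Mon Dec  1 12:39:27 JST 2008")
    print(helo_is_forged(hdr, lambda host: table.get(host, [])))
```

Passing the resolver in as an argument keeps the sketch free of network access, so the same function can be tried against recorded DNS answers.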

Note that I do not require a reverse DNS lookup to succeed, because, I believe, the SMTP-related RFCs do not require a client to have a reverse DNS record.

Files for pipeto

Pipeto uses the following three files:

Inspection items are

Pipeto puts "spam" tag into Subject field of mail from the client

We can register information below both to blacklist and white list

It is desirable not to put IP addresses into the blacklist, because an IP address might be shared among many innocent machines. Putting HELO host patterns into the blacklist should be confined to commercial spam. Although pipeto allows string patterns in the mail body to be used for detecting spam, this feature is not recommended, because you would then have to avoid those strings when corresponding with your friends.
Pipeto is sufficiently powerful in detecting spam, so you probably need not put anything into the blacklist for non-commercial spam.

Don't put well-known DNS names into the white list. Doing so will increase the amount of spam that escapes the filter.

Spam tags by pipeto

Pipeto puts several types of spam tags in Subject field.

(1) [spam:ip]   # the ip is in blacklist
(2) [spam:noname] # the client has no FQDN
(3) [spam:host] # the helo host is in blacklist
(4) [spam:suspect] # the client FQDN is suspected to be dynamic
(5) [spam:fake] # faked helo host
(6) [spam:] # spam pattern in mail headers or mail body
(7) [truncated] # the mail is normal but truncated

Users can define additional spam patterns for mail headers and the mail body.
For example, the string pattern

    ^Content-Type: text/(html)

in the blacklist will put "[spam:html]" in the Subject header if the mail contains the pattern in its headers. Without the "( )", i.e.,

    ^Content-Type: text/html

the tag in the Subject header will simply be "[spam:]".
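
How a parenthesized part of a pattern could become the tag suffix is sketched below in Python. It is an illustration only: the function name is invented, Python's regular expressions stand in for Plan 9's, and the rule "the first ( ) group supplies the suffix" is an assumption drawn from the example above. The header spelling follows the sample mail shown later in "Incoming mail to mbox".

```python
import re


def spam_tag(pattern, header_line):
    """Return the Subject tag for a matching header pattern, or None.

    Assumption: the text captured by the first "( )" group becomes the
    suffix of the tag; with no group the suffix is empty.
    """
    m = re.search(pattern, header_line)
    if m is None:
        return None
    suffix = m.group(1) if m.re.groups >= 1 and m.group(1) else ""
    return "[spam:%s]" % suffix


if __name__ == "__main__":
    line = 'Content-Type: text/html; charset="iso-8859-1"'
    print(spam_tag(r"^Content-Type: text/(html)", line))
    print(spam_tag(r"^Content-Type: text/html", line))
```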

Installation

You can get 'pipeto' from http://plan9.aichi-u.ac.jp/netlib/spamfilter/

Format of white list and blacklist

files

    /mail/box/$user/white	# white list
    /mail/box/$user/black	# blacklist

syntax and semantics

(1) '&' + ' ' + IP + '/' + number	# checks client IP using CIDR format
(2) '*' + ' ' + RE		        # checks HELO host using the RE
(3) RE that begins with '^'	        # checks headers using the RE
(4) RE that does not begin with '^'	# checks mail body using the RE

where RE means regular expression.
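
The four line types can be told apart by their first characters. The Python sketch below shows one way to do it; it is not pipeto's parser. The function name and the kind labels ("ip", "helo", "header", "body") are invented for the example, and taking the whole remainder of a "*" line as the RE is an assumption.

```python
def classify(line):
    """Return (kind, value) for one white list or blacklist line.

    kind is "ip", "helo", "header" or "body"; blank lines give None.
    """
    line = line.rstrip("\n")
    if not line.strip():
        return None
    if line.startswith("& "):
        fields = line.split()
        if len(fields) < 2:
            return None
        # Only the second field (IP/number) is used from an "&" line.
        return ("ip", fields[1])
    if line.startswith("* "):
        return ("helo", line[2:].strip())
    if line.startswith("^"):
        return ("header", line)
    return ("body", line)


if __name__ == "__main__":
    for entry in ["& 202.224.39.0/24", r"* \.lipetsk\.ru", "^From: postmaster@"]:
        print(classify(entry))
```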

example

example of white list:

^Subject: .*tip9ug:
& 219.106.227.66/32     # etour
& 202.224.39.0/24       # asahi-net
& 219.112.246.0/20  # news.mixi
* smtp\.aichi-u\.ac\.jp

The blacklist format is the same as that of the white list. Lines that begin with "&" are effective only up to the second field. Therefore you may put comments after the IP address.
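
Matching a client IP against an "&" entry can be sketched with Python's standard ipaddress module. The function name is invented, and treating an entry such as 219.112.246.0/20, whose address is not on a /20 boundary, as the enclosing /20 network is an assumption about the intended meaning, not a statement about pipeto's behaviour.

```python
import ipaddress


def ip_in_entry(client_ip, entry):
    """True when client_ip falls inside a "& IP/number" list entry."""
    fields = entry.split()
    if len(fields) < 2 or fields[0] != "&":
        raise ValueError("not an '&' entry: %r" % entry)
    # strict=False accepts entries whose host bits are not zero,
    # such as 219.112.246.0/20 in the example above.
    network = ipaddress.ip_network(fields[1], strict=False)
    return ipaddress.ip_address(client_ip) in network


if __name__ == "__main__":
    print(ip_in_entry("202.224.39.17", "& 202.224.39.0/24       # asahi-net"))
```

Only the first two fields are read, so the trailing comment is ignored just as the text above describes.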

NOTE: a pattern on the Subject header such as "^Subject: .*tip9ug:" is a bad idea, because the matching text can be hidden by base64 encoding.
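
The following Python sketch illustrates the problem. The subject text is a made-up example, and the encoded form assumes the usual RFC 2047 "encoded-word" syntax with base64; nothing here is taken from pipeto itself.

```python
import base64
import re

PATTERN = re.compile(r"^Subject: .*tip9ug:")


def encoded_subject(text):
    """Build a Subject header whose text is an RFC 2047 base64 encoded-word."""
    b64 = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return "Subject: =?UTF-8?B?%s?=" % b64


if __name__ == "__main__":
    plain = "Subject: [tip9ug: 123] hello"
    hidden = encoded_subject("[tip9ug: 123] hello")
    print(bool(PATTERN.search(plain)))   # the raw header contains the string
    print(bool(PATTERN.search(hidden)))  # the encoded header does not
```

Because the base64 alphabet contains no ':' character, the encoded header can never contain the literal text "tip9ug:", so a filter that looks only at the raw header misses it.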

The ability

2008/12/11

I fetch mail from my mail server using POP, keeping only about the last week of old mails in mbox. Here are statistics that show the ability of pipeto. There were 1141 mails in the mbox. The time stamp of the first mail is "Dec 2 19:00:03 JST 2008" and that of the last is "Dec 11 12:34:40 JST 2008". The contents of the white list and the blacklist are shown in the next subsection.

The white list and the blacklist

/mail/box/arisawa/whitelist

^Subject: .*tip9ug:
& 203.138.203.0/24  # docomo
& 59.135.39.0/24    # ezweb
& 219.106.227.66/32     # etour
& 202.224.39.0/24       # asahi-net
& 219.112.246.0/20  # news.mixi
^.*arisawa\+zzz@ar.aichi-u.ac.jp
^.*some-list@ar.aichi-u.ac.jp

where "zzz" is a magic string that is described in the next section and "some-list@ar.aichi-u.ac.jp" is a mailing list on my server. (Sorry, I must not disclose the real name.)

/mail/box/arisawa/blacklist

^Return-Path: <.+@post\.fukubiki\.com>
^Return-Path: <.+@163\.com>
* \.lipetsk\.ru
* argus\.e-dentify\.nl
* relay03\.quesse\.it
* post\.fukubiki\.com
^.*<info@freeml.com>
^.*<.*@woman.co.jp>
^From: MAILER-DAEMON@
^From: postmaster@
^From: /dev/null
^From: .*arisawa@ar.aichi-u.ac.jp
^.*To: .*[Uu]ndisclosed

See the next section to understand "^From: /dev/null" and "^From: .*arisawa@ar.aichi-u.ac.jp" in the blacklist.

The statistics

The following table is an analysis of the mbox.

type            total   tagged  not tagged
spam mails        706      644          62
regular mails     435        0         435
total            1141      644         497

The 62 spam mails that escaped filtering were commercial spam or came from ill-behaved mail servers. Although no regular mail was tagged as spam in my case, more entries will be needed in the white list to make pipeto more reliable.
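
As rates, the figures in the table work out as in the short Python calculation below. The variable names are chosen only for this illustration.

```python
# Counts from the table above.
spam_total, spam_tagged = 706, 644
regular_total, regular_tagged = 435, 0

# Share of spam mails that received a spam tag.
detection_rate = spam_tagged / spam_total
# Share of regular mails that were wrongly tagged.
false_positive_rate = regular_tagged / regular_total

if __name__ == "__main__":
    print(f"detected: {detection_rate:.1%}")
    print(f"false positives: {false_positive_rate:.1%}")
```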

Some notes

Backscattering spams

If spammers send mails with a non-existent recipient in the "To:" field to various mail servers, borrowing your mail address for the "From:" field, then a great many error notification mails from the targeted mail servers will be sent to you. The senders in this case are legitimate mail servers with static DNS names. How can we put a spam tag on these mails? To resolve this problem, we can employ a solution suggested by Russ Cox on 9fans.

/mail/lib/remotemail

#!/bin/rc
shift
sender=$1
shift
addr=$1
shift
fd=`{/bin/upas/aliasmail -f $sender}
switch($fd){
case *.*
        ;
case *
        fd=ar.aichi-u.ac.jp
}
if(~ $sender /dev/null)
        exec /bin/upas/smtp -h $fd $addr $sender $*
if not
        exec /bin/upas/smtp -h $fd $addr $sender+zzz $*

where 'zzz' is any string that is allowed in a mailbox name. The 'Return-Path' of outgoing mail will then be 'alice+zzz@aichi-u.ac.jp'.

Add rules to /mail/lib/rewrite so that you can accept mails addressed to alice+zzz@aichi-u.ac.jp.

\"(.+)\"                translate       "/bin/upas/aliasmail '\1'"
[^!@.]+                 translate       "/bin/upas/aliasmail '&'"

# deliver mail without a domain locally
local!"(.+)\+zzz"               >>              /mail/box/\1/mbox
local!(.*)\+zzz       >>              /mail/box/\1/mbox
local!"(.+)"            >>              /mail/box/\1/mbox
local!(.*)              >>              /mail/box/\1/mbox

# your local names
\l!(.*)                                 alias           \1
\l\.aichi-u\.ac\.jp!(.*)                alias           \1

# convert source domain address to a chain a@b@c@d...
...
...

We put alice+zzz@ar.aichi-u.ac.jp into the white list:
/mail/box/alice/white

    ^.*alice\+zzz@ar.aichi-u.ac.jp

and put mails from /dev/null into the blacklist:
/mail/box/alice/black

    ^From: /dev/null

Note that header patterns in the white list are examined before those in the blacklist.
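
The effect of examining the white list first can be sketched in Python. This is an illustration rather than pipeto's code: the function name is invented, and the header lines in the usage example are hypothetical, chosen to show a genuine bounce addressed to alice+zzz next to a backscattered one addressed to plain alice.

```python
import re


def classify_headers(headers, white, black):
    """Return "white", "black" or None for a list of header lines.

    White patterns are tried first, so a white match wins even when a
    black pattern would also match.
    """
    for verdict, patterns in (("white", white), ("black", black)):
        for pattern in patterns:
            if any(re.search(pattern, h) for h in headers):
                return verdict
    return None


if __name__ == "__main__":
    white = [r"^.*alice\+zzz@ar.aichi-u.ac.jp"]
    black = [r"^From: /dev/null"]
    genuine = ["From: /dev/null", "To: alice+zzz@ar.aichi-u.ac.jp"]
    backscatter = ["From: /dev/null", "To: alice@ar.aichi-u.ac.jp"]
    print(classify_headers(genuine, white, black))
    print(classify_headers(backscatter, white, black))
```

With this order, error notifications for mail you really sent pass through, while notifications caused by a forged "From:" fall through to the blacklist.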

Incoming mail to mbox

As an aid to understanding pipeto, here is an example of a mail as it arrives in mbox.

From acaoemailmarketing.com!envia Sun Nov 30 17:33:59 JST 2008 remote from ar
Received: from smtp2.braslink.com ([204.16.3.24]) by ar; Sun Nov 30 17:33:59 JST 2008
Received: (qmail 3075 invoked from network); 30 Nov 2008 04:39:32 -0000
...
Content-Transfer-Encoding: 8bit
Content-Type: text/html; charset="iso-8859-1"

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
...
...

The first line is a separator in mbox and useless for detecting spam.
The second line begins with

    Received: from

which is followed by HELO host and client IP address.

HELO host and client IP

According to RFC 2821, the HELO host name must be the FQDN (Fully Qualified Domain Name) that identifies the client.

4.1.1.1  Extended HELLO (EHLO) or HELLO (HELO)

   These commands are used to identify the SMTP client to the SMTP
   server.  The argument field contains the fully-qualified domain name
   of the SMTP client if one is available.

We can see an example of this rule in the following mail:

    Received: from userg502.nifty.com ([202.248.238.82]) by ar

which is confirmed by

ar% ndb/dnsquery
> userg502.nifty.com
userg502.nifty.com ip   202.248.238.82
>

Most mail servers follow this strict interpretation of 'identify'.

Some large sites such as Google follow a somewhat looser interpretation.

    Received: from qb-out-1314.google.com ([72.14.204.170]) by ar

then

ar% ndb/dnsquery
> qb-out-1314.google.com
qb-out-1314.google.com ip       72.14.204.168
qb-out-1314.google.com ip       72.14.204.174
qb-out-1314.google.com ip       72.14.204.173
qb-out-1314.google.com ip       72.14.204.172
qb-out-1314.google.com ip       72.14.204.175
qb-out-1314.google.com ip       72.14.204.169
qb-out-1314.google.com ip       72.14.204.170
qb-out-1314.google.com ip       72.14.204.171
> 

where the HELO host identifies a set of SMTP clients.

Some large ISPs in Japan, such as docomo, break the rule. In fact, I receive mails with

    Received: from docomo.ne.jp ([203.138.203.197]) by ar

The HELO host name does not resolve to an IP address!

ar% ndb/dnsquery
> docomo.ne.jp
!dns: resource does not exist
> 

Another problematic type is

    Received: from ezweb.ne.jp ([59.135.39.213]) by ar

In fact,

ar% ndb/dnsquery
> ezweb.ne.jp
ezweb.ne.jp ip  222.15.69.195
> 

It is difficult to find any relation between ezweb.ne.jp and 59.135.39.213.

The IP addresses of some mail servers differ slightly from the addresses obtained by DNS query. For example, I observed

    Received: from coraid.com ([12.51.113.4]) by ar; Sat Nov 22 07:55:40 JST 2008

However

ar% ndb/dnsquery
> coraid.com
coraid.com ip   12.51.113.3
> 

This difference probably comes from a misconfiguration. Pipeto is therefore designed to accept this case; i.e., the IP comparison is limited to the first 24 bits. I believe this tolerance does more good than harm.

For mail servers that do not follow the RFC 2821 rule, we are obliged to put their IP addresses into the white list. My white list is far from perfect. You should carefully improve your white list so that incoming mails from such servers are not tagged "spam".
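
A comparison limited to the first 24 bits can be sketched in Python as below. The function names are invented for the illustration, and IPv4 addresses are assumed throughout; the addresses in the usage example are the ones quoted in this section.

```python
import ipaddress


def same_24(ip_a, ip_b):
    """True when two IPv4 addresses agree in their first 24 bits."""
    a = int(ipaddress.IPv4Address(ip_a))
    b = int(ipaddress.IPv4Address(ip_b))
    return (a >> 8) == (b >> 8)


def helo_matches(client_ip, resolved_ips):
    """True when any address of the HELO host shares a /24 with the client."""
    return any(same_24(client_ip, ip) for ip in resolved_ips)


if __name__ == "__main__":
    print(helo_matches("12.51.113.4", ["12.51.113.3"]))     # coraid.com case
    print(helo_matches("59.135.39.213", ["222.15.69.195"]))  # ezweb.ne.jp case
```

Under this rule the coraid.com mail is accepted, while the ezweb.ne.jp and docomo.ne.jp cases still fail and need white list entries.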

Mails to system users

If you have users on your system, you probably want to be able to send mails to them. However, the current Plan 9 smtpd does not put an authentication mark on a mail even if the mail was authenticated in the ESMTP session. Therefore it is impossible to distinguish such a mail from spam.

This topic is discussed at http://www.fehcom.de/qmail/smtpauth.html .
The author proposes putting the mark "with ESMTPA" in the "Received: from" header, following RFC 3848 ( http://www.ietf.org/rfc/rfc3848.txt ). For example,

    Received: from [192.168.11.9] ([202.250.160.120]) by ar with ESMTPA; Tue Dec  2 16:43:47 JST 2008

Pipeto is designed not to put a "spam" tag on mails whose "Received: from" header carries the "with ESMTPA" mark supplied by its own smtpd.
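
A test for the mark can be sketched in Python as follows. It is not pipeto's code: the function name and the `me` parameter (the local system name, "ar" in the example above) are assumptions, and headers are assumed to be unfolded, one per line. Only the topmost Received header is examined, since that is the one written by the local smtpd; any Received lines below it arrive with the message and could be written by the sender.

```python
import re

# "by NAME with ESMTPA;" in the form shown in the example above.
ESMTPA = re.compile(r"^Received: from .* by (\S+) with ESMTPA;")


def is_authenticated(headers, me):
    """True when the topmost Received header carries our own ESMTPA mark."""
    for h in headers:
        if h.startswith("Received:"):
            m = ESMTPA.match(h)
            return bool(m) and m.group(1) == me
    return False


if __name__ == "__main__":
    hdr = ("Received: from [192.168.11.9] ([202.250.160.120]) by ar "
           "with ESMTPA; Tue Dec  2 16:43:47 JST 2008")
    print(is_authenticated([hdr], "ar"))
```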

To put the mark, you need a patch to smtpd.c:

int
pipemsg(int *byteswritten)
{
        ...
        ...

        //nbytes += Bprint(pp->std[0]->fp, "by %s; %s\n", me, thedate());
        /* replaced by Kenar */
        if(authenticated)
                nbytes += Bprint(pp->std[0]->fp, "by %s with ESMTPA; %s\n", me, thedate());
        else
                nbytes += Bprint(pp->std[0]->fp, "by %s; %s\n", me, thedate());
        ...
        ...
}