2008/12/01 first draft
2008/12/06 corrected some mistake in analyzing /sys/log/smtpd
2008/12/11 section "The ability"
We observe two types of spam: commercial and non-commercial. Commercial spam is bothersome but not very harmful. Most non-commercial spam, however, serves criminal purposes: spreading viruses, extracting personal information from victims, directing users to malicious web pages, etc.
How can we protect ourselves from spam? Today's MUAs (mail user agents) have spam filters. For example, I read mail with Mac's 'Mail.app', my MUA, which retrieves mail from my mail server. The MUA sorts mail into folders according to string patterns found in the mail. The spam filter of an MUA is effective only against commercial spam and is useless against non-commercial spam.
Well then, how can we protect ourselves from non-commercial spam? As shown in the next subsection, about half of incoming mail is blocked by our smtpd, which takes countermeasures only against trivial spam. Of course the blocked mail is non-commercial spam. Is blocking half of the mail enough? No! I still receive a great deal of non-commercial spam every day.
Users in Plan 9 may have a file "pipeto" in /mail/box/$user. If the file exists, incoming mail is piped to it on the way to mbox, the mail box of the individual user. So I developed an additional spam filter, "pipeto", which analyzes incoming mail and passes it to the user's mbox. The filter relies on one observation: regular mail (including commercial spam) comes from mail servers with static domain names, whereas most non-commercial spam comes from three types of machines: (1) those without a domain name, (2) those with a dynamically allocated domain name and (3) those with a faked domain name.
We can see the blocking ability of smtpd by inspecting /sys/log/smtpd.
An example:
ar Nov 27 09:41:24 ehlo from 72.245.206.188 as mail.mobiledealersplace.com
ar Nov 27 09:41:28 [mail.mobiledealersplace.com/72.245.206.188] mobiledealersplace.com!pr1127 sent 175 bytes to ar.aichi-u.ac.jp!arisawa
ar Nov 27 09:53:05 ehlo from 116.52.71.176 as server
ar Nov 27 09:53:05 Hung up on 116.52.71.176; claimed to be server
ar Nov 27 09:05:59 helo from 89.163.0.24 as angeleschorale.org
ar Nov 27 09:06:00 Disallowed angeleschorale.org!xsfciboxqw (angeleschorale.org/89.163.0.24) to blocked, unknown or invalid name ar.aichi-u.ac.jp!fossilarisawa
We observe two lines per mail: the SMTP request and the result. This record format is the new one, in use since Feb 19 2008.
ar Feb 19 13:55:06 ehlo from 130.203.4.6 as mail.cse.psu.edu
ar Feb 19 13:55:11 [mail.cse.psu.edu/130.203.4.6] cse.psu.edu!9fans-bounces+arisawa=ar.aichi-u.ac.jp sent 182 bytes to ar.aichi-u.ac.jp!arisawa
My /sys/log/smtpd is a mixture of the old and new formats. So I extracted only the new records, which range from Feb 19 2008 to Dec 5 2008, named the extracted part smtpd1, and got the following result:
ar% grep '(helo|ehlo) from' smtpd1 |wc
122805 1105245 8026041
ar% grep 'sent [0-9]+ bytes to' smtpd1 |wc
53392 590127 6960192
ar%
which means that as much as 57% of mail is refused. However, as shown later, most of the mail that is delivered to mbox is still spam.
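The 57% figure follows from the two line counts printed by wc. A minimal check in Python, assuming that every SMTP session logs one helo/ehlo line and every accepted mail logs one "sent ... bytes to" line (an assumption about the log, not something verified here):

```python
# Line counts reported by wc for the two grep commands above.
connections = 122805   # lines matching '(helo|ehlo) from'
delivered = 53392      # lines matching 'sent [0-9]+ bytes to'

refused = connections - delivered
refused_percent = 100 * refused / connections

print(f"refused: {refused} of {connections} ({refused_percent:.1f}%)")
```

The result rounds to 57%.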
As mentioned before, most non-commercial spam comes from clients (1) that do not have a DNS name, (2) that have a dynamically allocated domain name or (3) that fake their domain name. Pipeto tries to identify such mail and puts a "spam" tag in the Subject header.
Spam that comes from a host with a static DNS name will not be tagged "spam" unless the name is explicitly put into the blacklist; however, such spam can also be handled by the MUA.
We can detect most spam simply by inspecting the 'Received: from' header, which carries two important pieces of information: the HELO host's FQDN and the client IP address. An example is shown below.
Received: from amnetmortgage.com ([201.240.156.32]) by ar; Mon Dec 1 12:39:27 JST 2008
The HELO host's FQDN follows 'from' and the client IP address is shown in [ ]. Note that an SMTP client can forge the HELO host but cannot forge the IP. The forgery can be detected by looking up the IP of the HELO host with a DNS query:
ar% ndb/dnsquery
> amnetmortgage.com
amnetmortgage.com ip 169.200.183.83
>
Note that I do not require a reverse DNS lookup to succeed, because, I believe, the SMTP-related RFCs do not require a client to have a reverse DNS record.
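The check described above can be sketched in Python. This is only an illustration, not the code of pipeto: the function names are hypothetical, and the DNS lookup is replaced by a small table holding the answer that ndb/dnsquery gave for this example.

```python
import re

# Matches the form that appears in the examples on this page:
#   Received: from <helo-host> ([<client-ip>]) by ...
RECEIVED_RE = re.compile(r"^Received: from (\S+) \(\[(\d{1,3}(?:\.\d{1,3}){3})\]\)")

def parse_received(header):
    """Return (helo_host, client_ip), or None if the header has another shape."""
    m = RECEIVED_RE.match(header)
    if m is None:
        return None
    return m.group(1), m.group(2)

def helo_is_forged(helo_host, client_ip, lookup):
    """True if none of the IPs that `lookup` returns for the HELO host is the client IP.

    `lookup` is any callable mapping a host name to a list of IP strings;
    a real filter would ask DNS here.
    """
    return client_ip not in lookup(helo_host)

# Stand-in for DNS (assumption for this sketch), filled with the answer shown above.
dns_table = {"amnetmortgage.com": ["169.200.183.83"]}

header = ("Received: from amnetmortgage.com ([201.240.156.32]) "
          "by ar; Mon Dec 1 12:39:27 JST 2008")
helo_host, client_ip = parse_received(header)
forged = helo_is_forged(helo_host, client_ip, lambda host: dns_table.get(host, []))
print(helo_host, client_ip, "forged" if forged else "consistent")
```

Here the HELO host resolves to 169.200.183.83 while the client connected from 201.240.156.32, so the sketch reports the HELO name as forged.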
Pipeto uses the following three files:
/mail/box/$user/pipeto # spam filter
/mail/box/$user/white # white list
/mail/box/$user/black # blacklist
Inspection items are
Pipeto puts "spam" tag into Subject field of mail from the client
We can register information below both to blacklist and white list
It is desirable not to put IPs into the blacklist, because an IP might be shared among many innocent machines. Putting HELO host patterns into the blacklist should be confined to commercial spam. Although pipeto allows string patterns in the mail body for detecting spam, using this feature is not recommended, because you would then have to avoid these strings when communicating with your friends.
Pipeto is sufficiently powerful in detecting spam, so you probably need not put anything into the blacklist for non-commercial spam.
Don't put well-known DNS names into the white list. That will increase the amount of spam that escapes the filter.
Pipeto puts several types of spam tags in Subject field.
(1) [spam:ip] # the ip is in blacklist
(2) [spam:noname] # the client has no FQDN
(3) [spam:host] # the helo host is in blacklist
(4) [spam:suspect] # the client FQDN is suspected to be dynamic
(5) [spam:fake] # faked helo host
(6) [spam:] # spam pattern in mail headers or mail body
(7) [truncated] # the mail is normal but truncated
Users can define additional spam patterns for mail headers and the mail body.
For example, the string pattern
^Content-Type: text/(html)
in the blacklist will put "[spam:html]" in the Subject header if the mail contains the pattern in its headers. Without "( )", i.e.,
^Content-Type: text/html
the tag in the Subject header will simply be "[spam:]".
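The rule for naming the tag can be illustrated with a short Python sketch. It uses Python's re module, which is only an approximation of the Plan 9 regular expressions that pipeto uses; the function name is hypothetical, and the header is spelled `Content-Type` as in the sample mail later on this page.

```python
import re

def tag_for(pattern, text):
    """Return the Subject tag for `pattern` if it matches a line of `text`, else None."""
    m = re.search(pattern, text, re.MULTILINE)
    if m is None:
        return None
    # With a parenthesized group, the text matched by the group names the tag;
    # without one, the tag is the bare "[spam:]".
    label = m.group(1) if m.re.groups >= 1 and m.group(1) else ""
    return f"[spam:{label}]"

headers = (
    "Content-Transfer-Encoding: 8bit\n"
    'Content-Type: text/html; charset="iso-8859-1"\n'
)

print(tag_for(r"^Content-Type: text/(html)", headers))
print(tag_for(r"^Content-Type: text/html", headers))
print(tag_for(r"^Content-Type: text/(plain)", headers))
```

The first call yields the tag with the group's text, the second the bare tag, and the third yields None because the pattern does not match.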
You can get 'pipeto' from http://plan9.aichi-u.ac.jp/netlib/spamfilter/
/mail/box/$user/white # white list
/mail/box/$user/black # blacklist
(1) '&' + ' ' + IP + '/' + number # checks client IP using CIDR format
(2) '*' + ' ' + RE # checks HELO host using the RE
(3) RE that begin with '^' # checks headers using the RE
(4) RE that does not begin with '^' # checks mail body using the RE
where RE means regular expression.
example of white list:
^Subject: .*tip9ug:
& 219.106.227.66/32 # etour
& 202.224.39.0/24 # asahi-net
& 219.112.246.0/20 # news.mixi
* smtp\.aichi-u\.ac\.jp
The blacklist format is the same as that of the white list. Lines that begin with "&" are significant only up to the second field. Therefore you may put comments after the IP address.
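The four line types can be read mechanically. The following Python sketch shows one way to do it; it is not the parser of pipeto, the function names are hypothetical, and Python's re stands in for Plan 9 regular expressions.

```python
import ipaddress
import re

def parse_rule(line):
    """Classify one list line as ("ip" | "helo" | "header" | "body", value)."""
    if line.startswith("& "):
        # Only the second field counts; anything after it is a comment.
        cidr = line.split()[1]
        # strict=False accepts entries such as 219.112.246.0/20 whose host bits are set.
        return "ip", ipaddress.ip_network(cidr, strict=False)
    if line.startswith("* "):
        return "helo", re.compile(line[2:].strip())
    if line.startswith("^"):
        return "header", re.compile(line)
    return "body", re.compile(line)

def rule_matches(rule, client_ip, helo_host, headers, body):
    """Apply one parsed rule to the part of the mail it is meant for."""
    kind, value = rule
    if kind == "ip":
        return ipaddress.ip_address(client_ip) in value
    if kind == "helo":
        return value.search(helo_host) is not None
    if kind == "header":
        return any(value.search(h) for h in headers.splitlines())
    return value.search(body) is not None

WHITE_LIST = r"""
^Subject: .*tip9ug:
& 219.106.227.66/32 # etour
& 202.224.39.0/24 # asahi-net
& 219.112.246.0/20 # news.mixi
* smtp\.aichi-u\.ac\.jp
""".strip()

white = [parse_rule(line) for line in WHITE_LIST.splitlines()]
print([kind for kind, _ in white])

# 202.224.39.17 and example.invalid are made-up values for the demonstration.
hit = any(rule_matches(r, "202.224.39.17", "example.invalid", "", "") for r in white)
miss = any(rule_matches(r, "192.0.2.1", "example.invalid", "Subject: hello", "") for r in white)
print(hit, miss)
```

Passing the client IP, HELO host, headers and body separately keeps each rule type tied to the one part of the mail it inspects.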
NOTE: a pattern on the Subject header such as "^Subject: .*tip9ug:" is a bad idea, because the text it looks for can be hidden by base64 encoding.
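The problem in the note can be reproduced with Python's standard library. The subject text below is made up for the demonstration, and the sketch says nothing about whether pipeto decodes headers; it only shows why the raw header does not match.

```python
import base64
import re
from email.header import decode_header, make_header

pattern = re.compile(r"^Subject: .*tip9ug:")

# A made-up subject, sent as an RFC 2047 encoded-word with base64 ("B") encoding.
text = "[tip9ug: 123] meeting"
encoded_word = "=?UTF-8?B?" + base64.b64encode(text.encode("utf-8")).decode("ascii") + "?="
raw_line = "Subject: " + encoded_word

# The raw header contains only base64 characters, so the pattern cannot match.
print(raw_line, bool(pattern.search(raw_line)))

# After decoding the encoded-word, the same pattern matches again.
decoded_line = "Subject: " + str(make_header(decode_header(encoded_word)))
print(decoded_line, bool(pattern.search(decoded_line)))
```

A filter that wants to match on Subject text would therefore have to decode the header first.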
2008/12/11
I fetch mail from my mail server using POP, keeping old mail in mbox for only a week. Here are statistics that show the ability of pipeto. There were 1141 mails in the mbox. The time stamp of the first mail is "Dec 2 19:00:03 JST 2008" and that of the last is "Dec 11 12:34:40 JST 2008". The contents of the white list and the blacklist are shown in the next subsection.
/mail/box/arisawa/whitelist
^Subject: .*tip9ug:
& 203.138.203.0/24 # docomo
& 59.135.39.0/24 # ezweb
& 219.106.227.66/32 # etour
& 202.224.39.0/24 # asahi-net
& 219.112.246.0/20 # news.mixi
^.*arisawa\+zzz@ar.aichi-u.ac.jp
^.*some-list@ar.aichi-u.ac.jp
where "zzz" is a magic string that is described in the next section and "some-list@ar.aichi-u.ac.jp" is a mailing list on my server. (Sorry. I must not disclose the real name.)
/mail/box/arisawa/blacklist
^Return-Path: <.+@post\.fukubiki\.com>
^Return-Path: <.+@163\.com>
* \.lipetsk\.ru
* argus\.e-dentify\.nl
* relay03\.quesse\.it
* post\.fukubiki\.com
^.*<info@freeml.com>
^.*<.*@woman.co.jp>
^From: MAILER-DAEMON@
^From: postmaster@
^From: /dev/null
^From: .*arisawa@ar.aichi-u.ac.jp
^.*To: .*[Uu]ndisclosed
See the next section to understand "^From: /dev/null" and "^From: .*arisawa@ar.aichi-u.ac.jp" in the blacklist.
The following table is an analysis of the mbox.
| type | total | tagged | not tagged |
|---|---|---|---|
| spam mails | 706 | 644 | 62 |
| regular mails | 435 | 0 | 435 |
| total | 1141 | 644 | 497 |
The 62 spam mails that escaped filtering were commercial spam or came from some misbehaving mail servers. Although no regular mail was tagged as spam in my case, more entries will be needed in the white list to make pipeto more reliable.
If spammers send mail with a non-existent recipient in the "To:" field to mail servers, borrowing your mail address for the "From:" field, then a great many error notification mails from the targeted mail servers will be sent to you. The senders are then legitimate mail servers with static DNS names. How can we put a spam tag on these mails? To resolve this problem, we can employ a solution suggested by Russ Cox on 9fans.
/mail/lib/remotemail
#!/bin/rc
shift
sender=$1
shift
addr=$1
shift
fd=`{/bin/upas/aliasmail -f $sender}
switch($fd){
case *.*
	;
case *
	fd=ar.aichi-u.ac.jp
}
if(~ $sender /dev/null)
	exec /bin/upas/smtp -h $fd $addr $sender $*
if not
	exec /bin/upas/smtp -h $fd $addr $sender+zzz $*
where 'zzz' is any string that is allowed in an mbox name. Then the 'Return-Path' of outgoing mail will be 'alice+zzz@aichi-u.ac.jp'.
Add rules to /mail/lib/rewrite so that you can accept mail addressed to alice+zzz@aichi-u.ac.jp.
\"(.+)\" translate "/bin/upas/aliasmail '\1'"
[^!@.]+ translate "/bin/upas/aliasmail '&'"
# deliver mail without a domain locally
local!"(.+)\+zzz" >> /mail/box/\1/mbox
local!(.*)\+zzz >> /mail/box/\1/mbox
local!"(.+)" >> /mail/box/\1/mbox
local!(.*) >> /mail/box/\1/mbox
# your local names
\l!(.*) alias \1
\l\.aichi-u\.ac\.jp!(.*) alias \1
# convert source domain address to a chain a@b@c@d...
...
...
We put alice\+zzz@ar.aichi-u.ac.jp into the white list
/mail/box/alice/white
^.*alice\+zzz@ar.aichi-u.ac.jp
and put mail from /dev/null into the blacklist.
/mail/box/alice/black
^From: /dev/null
Note that header patterns in white list are examined prior to those in blacklist.
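The effect of this ordering can be shown with a small Python sketch. The two sample header sets are made up for illustration, the function name is hypothetical, and Python's re stands in for the regular expressions of pipeto.

```python
import re

def classify(headers, white, black):
    """Return "white", "black" or None; white patterns are tried before black ones."""
    lines = headers.splitlines()
    for name, patterns in (("white", white), ("black", black)):
        for pat in patterns:
            if any(re.search(pat, line) for line in lines):
                return name
    return None

white = [r"^.*alice\+zzz@ar.aichi-u.ac.jp"]
black = [r"^From: /dev/null"]

# Hypothetical error notification for a mail alice really sent:
# it comes back to the +zzz address that was used as Return-Path.
genuine_bounce = "From: /dev/null\nTo: alice+zzz@ar.aichi-u.ac.jp\n"

# Hypothetical error notification caused by a spammer who borrowed
# alice's plain address: no +zzz, so only the blacklist pattern matches.
backscatter = "From: /dev/null\nTo: alice@ar.aichi-u.ac.jp\n"

print(classify(genuine_bounce, white, black))
print(classify(backscatter, white, black))
```

Because the white list is consulted first, the genuine error notification is accepted even though its From line also matches the blacklist.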
We show a mail that arrived in mbox as an example for understanding pipeto.
From acaoemailmarketing.com!envia Sun Nov 30 17:33:59 JST 2008 remote from ar
Received: from smtp2.braslink.com ([204.16.3.24]) by ar; Sun Nov 30 17:33:59 JST 2008
Received: (qmail 3075 invoked from network); 30 Nov 2008 04:39:32 -0000
...
Content-Transfer-Encoding: 8bit
Content-Type: text/html; charset="iso-8859-1"
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
...
...
The first line is a separator in mbox and useless for detecting spam.
The second line begins with
Received: from
which is followed by HELO host and client IP address.
According to RFC 2821, the HELO host name must be the FQDN (Fully Qualified Domain Name) that identifies the client.
4.1.1.1 Extended HELLO (EHLO) or HELLO (HELO)
These commands are used to identify the SMTP client to the SMTP server. The argument field contains the fully-qualified domain name of the SMTP client if one is available.
We can see an example of this rule in the mail:
Received: from userg502.nifty.com ([202.248.238.82]) by ar
confirming
ar% ndb/dnsquery
> userg502.nifty.com
userg502.nifty.com ip 202.248.238.82
>
Most mail servers follow this rigorous interpretation of 'identify'.
Some large sites such as Google follow a somewhat looser interpretation.
Received: from qb-out-1314.google.com ([72.14.204.170]) by ar
then
ar% ndb/dnsquery
> qb-out-1314.google.com
qb-out-1314.google.com ip 72.14.204.168
qb-out-1314.google.com ip 72.14.204.174
qb-out-1314.google.com ip 72.14.204.173
qb-out-1314.google.com ip 72.14.204.172
qb-out-1314.google.com ip 72.14.204.175
qb-out-1314.google.com ip 72.14.204.169
qb-out-1314.google.com ip 72.14.204.170
qb-out-1314.google.com ip 72.14.204.171
>
where the HELO host identifies a set of SMTP clients.
Some large ISPs in Japan, such as docomo, break the rule. In fact, I receive mail with
Received: from docomo.ne.jp ([203.138.203.197]) by ar
The HELO host name has no IP address in DNS!
ar% ndb/dnsquery
> docomo.ne.jp
!dns: resource does not exist
>
Another problematic type is
Received: from ezweb.ne.jp ([59.135.39.213]) by ar
In fact,
ar% ndb/dnsquery
> ezweb.ne.jp
ezweb.ne.jp ip 222.15.69.195
>
It is difficult to find any relation between ezweb.ne.jp and 59.135.39.213.
The IPs of some mail servers differ slightly from the IPs obtained by DNS query. For example, I observed
Received: from coraid.com ([12.51.113.4]) by ar; Sat Nov 22 07:55:40 JST 2008
However
ar% ndb/dnsquery
> coraid.com
coraid.com ip 12.51.113.3
>
This difference probably comes from a misconfiguration. Pipeto is therefore designed to accept this case; i.e., the IP comparison is limited to the first 24 bits. I believe this tolerance brings more benefit than harm.
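The 24-bit comparison can be sketched in Python as follows. This illustrates the rule rather than the code of pipeto; the function names are hypothetical, and the addresses are the ones observed in the examples on this page.

```python
import ipaddress

def same_first_24_bits(ip_a, ip_b):
    """Compare two IPv4 addresses, ignoring the last 8 bits."""
    a = int(ipaddress.IPv4Address(ip_a))
    b = int(ipaddress.IPv4Address(ip_b))
    return (a >> 8) == (b >> 8)

def helo_consistent(client_ip, resolved_ips):
    """True if any IP that DNS returned for the HELO host shares its first 24 bits with the client."""
    return any(same_first_24_bits(client_ip, ip) for ip in resolved_ips)

# (client IP from the Received header, IPs returned by the DNS query)
cases = {
    "userg502.nifty.com": ("202.248.238.82", ["202.248.238.82"]),
    "coraid.com": ("12.51.113.4", ["12.51.113.3"]),
    "amnetmortgage.com": ("201.240.156.32", ["169.200.183.83"]),
    "ezweb.ne.jp": ("59.135.39.213", ["222.15.69.195"]),
    "docomo.ne.jp": ("203.138.203.197", []),
}

for host, (client_ip, resolved) in cases.items():
    print(host, helo_consistent(client_ip, resolved))
```

Under this rule nifty and coraid are accepted, while amnetmortgage, ezweb and docomo are not; the latter two are legitimate senders, which is why their IP ranges need white list entries.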
For mail servers that do not follow the RFC 2821 rules, we are obliged to put their IPs into the white list. My white list is far from perfect. You should carefully improve your white list so that mail coming from such servers is not tagged "spam".
If you have users on your system, you probably want to be able to send mail to the users of your system. However, the current Plan 9 smtpd does not put an authentication mark on a mail even if the mail was authenticated in the ESMTP session. Therefore it is impossible to distinguish such mail from spam.
This topic is discussed at http://www.fehcom.de/qmail/smtpauth.html .
The author proposes putting a mark "with ESMTPA" in the "Received: from" header, following RFC 3848 ( http://www.ietf.org/rfc/rfc3848.txt ). For example,
Received: from [192.168.11.9] ([202.250.160.120]) by ar with ESMTPA; Tue Dec 2 16:43:47 JST 2008
Pipeto is designed not to put a "spam" tag on mail whose header carries "with ESMTPA" supplied by its own smtpd.
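A check of this kind might look like the following Python sketch. It is not the code of pipeto, and the function name is hypothetical. It assumes unfolded (single-line) Received headers as in the examples on this page, and it trusts only the topmost Received header, because that is the one written by the receiving smtpd, while lower ones arrive with the mail and can be forged.

```python
import re

# "by <host> with ESMTPA;" as written by the patched smtpd.
ESMTPA_RE = re.compile(r"^Received: from .* by (\S+) with ESMTPA;")

def authenticated_by(message, me):
    """True if the topmost Received header says `me` accepted the mail with ESMTPA."""
    for line in message.splitlines():
        if line == "":
            break  # end of headers
        if line.startswith("Received:"):
            m = ESMTPA_RE.match(line)
            return m is not None and m.group(1) == me
    return False

authenticated = (
    "Received: from [192.168.11.9] ([202.250.160.120]) by ar with ESMTPA; "
    "Tue Dec 2 16:43:47 JST 2008\n"
)
plain = (
    "Received: from amnetmortgage.com ([201.240.156.32]) by ar; "
    "Mon Dec 1 12:39:27 JST 2008\n"
)
# A sender-supplied (lower) header claiming ESMTPA must not count.
forged_lower = plain + authenticated

print(authenticated_by(authenticated, "ar"))
print(authenticated_by(plain, "ar"))
print(authenticated_by(forged_lower, "ar"))
```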
To put the mark, you need a patch to smtpd.c:
int
pipemsg(int *byteswritten)
{
	...
	...
	//nbytes += Bprint(pp->std[0]->fp, "by %s; %s\n", me, thedate());
	/* replaced by Kenar */
	if(authenticated)
		nbytes += Bprint(pp->std[0]->fp, "by %s with ESMTPA; %s\n", me, thedate());
	else
		nbytes += Bprint(pp->std[0]->fp, "by %s; %s\n", me, thedate());
	...
	...
}