Mail System: April 2008 Archives

In this article I won't go into the details of building a mail system step by step; you can find plenty of configuration documents for popular MTAs such as Sendmail, Postfix and qmail through Google.

I was once a mail system engineer maintaining a commercial mail system that sent out more than 100 million mails a day. That number covers site mail only; campaign mail is not included, and its volume is usually even higher. So here I'd like to share some of my experience on how to build a mail system with high scalability, manageability and performance.

1. Split
Most MTAs have their own internal policy of retrying, for a few days, any mail that gets a soft bounce (4xx). A soft bounce may be due to network latency or other temporary problems, so the mail queue on a single box can grow larger and larger, which delays the 'good mail' (mail that can be delivered successfully on the first attempt). Site mail is critical to the business and needs to reach end users on time.

How do we resolve this issue? The answer is to split.
 
Here I'd like to introduce the concept of 'fallback'. It means that if a mail on the primary server is not delivered successfully on the first attempt, it is transferred to a fallback mail server for further delivery attempts. The benefit is that the mail queue on the primary server does not grow too large, so the 'good mail' reaches end users on time, and the load on the primary mail server is reduced as well.
 
Currently, most MTAs support this feature; check your MTA's official documentation for details. From my point of view, it's not difficult to add fallback to an existing mail system, and you don't need to change much.
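
To make the fallback idea concrete, here is a minimal Python sketch of the decision the primary server makes after its first delivery attempt. It is not how any particular MTA implements the feature; the names `Action` and `route_after_first_attempt` are made up for illustration, and treating "no reply at all" like a soft bounce is my own assumption.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    DELIVERED = "delivered"      # accepted by the remote side, nothing more to do
    TO_FALLBACK = "to_fallback"  # hand over to the fallback pool for later retries
    BOUNCE = "bounce"            # permanent failure, report back to the sender


def route_after_first_attempt(reply_code: Optional[int]) -> Action:
    """Decide what the primary server does after its first delivery attempt.

    reply_code is the SMTP reply code from the remote server, or None if the
    connection failed before any reply arrived (an assumption of this sketch).
    """
    if reply_code is None:
        # No answer at all: treat it like a temporary problem.
        return Action.TO_FALLBACK
    if 200 <= reply_code < 300:
        return Action.DELIVERED
    if 400 <= reply_code < 500:
        # Soft bounce: keep it out of the primary queue.
        return Action.TO_FALLBACK
    if 500 <= reply_code < 600:
        return Action.BOUNCE
    raise ValueError(f"unexpected SMTP reply code: {reply_code}")
```

The point is that the primary server only ever makes one attempt per mail, so its queue holds nothing but fresh mail.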
 
2. Load balance
For a commercial mail system, one or two servers can hardly handle a huge volume of mail effectively, so we should usually consider load balancing to spread the mail across a pool of mail servers.
 
For example, assuming I have 50 powerful servers acting as primary mail servers and 40 ordinary servers in the fallback pool, we can set up two VIP domain names, mx.vip.isoracle.com and fallback.vip.isoracle.com, for the 'primary' and 'fallback' pools. Then we configure a load balancer (e.g. F5) to distribute the mail across the servers behind mx.vip.isoracle.com or fallback.vip.isoracle.com for delivery.
 
This way, without any downtime or impact on end users, we can easily add servers to the 'primary' or 'fallback' pool, or take a server out of the pool when a single node has a problem. The only thing we need to do is add or remove entries on the load balancer or DNS server, so overall system scalability is greatly improved.
 
If you don't have the budget for a hardware load balancer such as F5 or NetScaler, Nginx or round-robin DNS is another choice.
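
As a rough illustration of what round-robin distribution over a pool looks like, and why adding or removing a node needs no downtime, here is a small Python sketch. It is a toy model, not the behavior of F5, NetScaler, Nginx or any DNS server; the class `ServerPool` and the host names in the usage example are hypothetical.

```python
from typing import Iterable, List


class ServerPool:
    """A toy round-robin pool: each pick() returns the next server in turn."""

    def __init__(self, servers: Iterable[str]) -> None:
        self._servers: List[str] = list(servers)
        self._next = 0  # index of the server to hand out next

    def add(self, server: str) -> None:
        """Put a new server into rotation; it is picked up on a later pick()."""
        if server not in self._servers:
            self._servers.append(server)

    def remove(self, server: str) -> None:
        """Take a (faulty) server out of rotation."""
        self._servers.remove(server)
        # Keep the index inside the shrunken list.
        self._next = self._next % len(self._servers) if self._servers else 0

    def pick(self) -> str:
        if not self._servers:
            raise RuntimeError("no servers left in the pool")
        server = self._servers[self._next]
        self._next = (self._next + 1) % len(self._servers)
        return server
```

For example, with hypothetical host names:

```python
primary = ServerPool(["mx1.example.internal", "mx2.example.internal"])
primary.add("mx3.example.internal")     # scale out
primary.remove("mx2.example.internal")  # pull a broken node
print(primary.pick())
```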
 
3. OS and Storage
This is a common topic, so I just want to emphasize that we'd better use storage with high I/O performance for the mail queue. SCSI hard disks are preferred, as they use fewer CPU resources.
 
BTW, if we use SSDs (solid-state disks) to store the mail queue, will the I/O performance be greatly improved? ^^
 
For the operating system, I think Linux is good enough. I have also heard that the recently released FreeBSD 7.0 performs very well, but more tests are needed to confirm that.
 
4. Monitoring
For any mail system, I think monitoring is very, very important.

Basically, we need to keep a close eye on the following items:
  • Mail queue size on each host
  • CPU, memory and load, especially load and CPU
  • Storage usage, especially for the volume that stores the mail queue
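
The checks above could be sketched in Python roughly as follows. This is only an illustration under several assumptions: the threshold values are invented, `collect_local_metrics` assumes a Unix host and a queue directory with one file per queued message, and the names `HostMetrics`, `THRESHOLDS` and `find_alerts` are hypothetical. CPU and memory checks would follow the same pattern.

```python
import os
import shutil
from dataclasses import dataclass
from typing import List


@dataclass
class HostMetrics:
    queue_size: int         # number of messages waiting in the mail queue
    load_1min: float        # 1-minute load average
    queue_disk_used: float  # fraction of the queue volume in use, 0.0 to 1.0


# Invented limits for illustration; tune them for your own hardware.
THRESHOLDS = HostMetrics(queue_size=50_000, load_1min=16.0, queue_disk_used=0.80)


def collect_local_metrics(queue_dir: str) -> HostMetrics:
    """Collect the metrics on the local host (Unix only).

    Assumes every file under queue_dir is one queued message.
    """
    queue_size = sum(len(files) for _, _, files in os.walk(queue_dir))
    usage = shutil.disk_usage(queue_dir)
    return HostMetrics(queue_size, os.getloadavg()[0], usage.used / usage.total)


def find_alerts(metrics: HostMetrics, limits: HostMetrics = THRESHOLDS) -> List[str]:
    """Return one message per metric that is above its limit."""
    alerts = []
    if metrics.queue_size > limits.queue_size:
        alerts.append(f"mail queue too large: {metrics.queue_size} messages")
    if metrics.load_1min > limits.load_1min:
        alerts.append(f"load too high: {metrics.load_1min:.1f}")
    if metrics.queue_disk_used > limits.queue_disk_used:
        alerts.append(f"queue volume {metrics.queue_disk_used:.0%} full")
    return alerts
```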

 

5. Troubleshooting and Reporting
As mail system engineers we often face many kinds of mail system issues, and usually we need to check the mail log to see what happened. How do we get useful information out of a huge mail log quickly enough for troubleshooting?

I developed a troubleshooting and reporting system when I worked on the mail system. It pulls the raw mail logs from every mail server to a central mail log server, where some Perl/shell scripts analyze and process the logs hourly. The useful data, including sender, receiver, queue ID, send time, relay IP, receiver IP, DSN and the detailed error log, is then stored in an Oracle database for analysis. I also wrote a CGI web page: you only need to enter the time range of the issue and the receiver's email address to get all the detailed error logs. To be honest, this tool helped me greatly when troubleshooting mail issues.
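
The original scripts were Perl/shell on top of Oracle and are not reproduced here. Just to show the shape of the idea, here is a small Python sketch that parses one delivery log line, stores it, and looks it up by receiver and time range. Everything specific in it is an assumption: the log format is a simplified Postfix-style line, `sqlite3` stands in for the Oracle database, the log year has to be supplied because syslog timestamps carry none, and the table and function names are made up.

```python
import re
import sqlite3
from datetime import datetime
from typing import List, Optional

# Simplified Postfix-style delivery line (an assumed format), e.g.:
# Apr 10 12:00:01 mx1 postfix/smtp[1234]: 4F1A2B3C5D: to=<user@example.com>,
#   relay=mail.example.com[192.0.2.10]:25, delay=0.52, dsn=4.4.1,
#   status=deferred (connection timed out)
LINE_RE = re.compile(
    r"^(?P<time>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) \S+: "
    r"(?P<queue_id>[0-9A-F]+): to=<(?P<rcpt>[^>]*)>, "
    r"relay=(?P<relay>[^,]+), .*?dsn=(?P<dsn>\d\.\d+\.\d+), "
    r"status=(?P<status>\w+) \((?P<detail>.*)\)$"
)


def parse_line(line: str, year: int) -> Optional[dict]:
    """Turn one log line into a record, or None if it is not a delivery line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    stamp = datetime.strptime(f"{year} {record.pop('time')}", "%Y %b %d %H:%M:%S")
    record["sent_at"] = stamp.isoformat(sep=" ")  # sortable as plain text
    return record


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE delivery (sent_at TEXT, host TEXT, queue_id TEXT, "
        "rcpt TEXT, relay TEXT, dsn TEXT, status TEXT, detail TEXT)"
    )
    return conn


def store(conn: sqlite3.Connection, records: List[dict]) -> None:
    conn.executemany(
        "INSERT INTO delivery VALUES (:sent_at, :host, :queue_id, :rcpt, "
        ":relay, :dsn, :status, :detail)",
        records,
    )
    conn.commit()


def lookup(conn: sqlite3.Connection, rcpt: str, start: str, end: str) -> list:
    """All delivery attempts for one receiver inside a time range."""
    cursor = conn.execute(
        "SELECT sent_at, queue_id, dsn, status, detail FROM delivery "
        "WHERE rcpt = ? AND sent_at BETWEEN ? AND ? ORDER BY sent_at",
        (rcpt, start, end),
    )
    return cursor.fetchall()
```

The `lookup` function plays the role of the web page's query: give it a receiver address and a time range, and it returns the matching attempts with their DSN and error text.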
 
Since the mail log information is stored in the database, we can also easily write tools to generate charts of key metrics such as delivery rate, soft bounce count and hard bounce count. RRDTool is a good choice for storing the data and generating the charts; I like it very much.
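
As a hedged sketch of how such metrics can be derived, the Python below counts delivered, soft-bounced and hard-bounced mail from a list of DSN status codes, using the first digit of the code (2 for success, 4 for a temporary failure, 5 for a permanent failure). It only computes the numbers; feeding them into RRDTool is left out, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def classify_dsn(dsn: str) -> str:
    """Map a DSN status code such as '4.4.1' to a delivery outcome."""
    first_digit = dsn.split(".", 1)[0]
    outcomes = {"2": "delivered", "4": "soft_bounce", "5": "hard_bounce"}
    return outcomes.get(first_digit, "unknown")


def summarize(dsn_codes: Iterable[str]) -> Dict[str, float]:
    """Compute the key metrics for one reporting interval."""
    counts = Counter(classify_dsn(code) for code in dsn_codes)
    total = sum(counts.values())
    return {
        "total": total,
        "soft_bounce": counts["soft_bounce"],
        "hard_bounce": counts["hard_bounce"],
        # Avoid dividing by zero for an interval with no mail at all.
        "delivery_rate": counts["delivered"] / total if total else 0.0,
    }
```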
 
A mail system is very complex, so this article cannot touch every area of it. Please share any comments or feedback you have.
 

About this Archive

This page is an archive of entries in the Mail System category from April 2008.
