Amazon S3 Availability Event: July 20, 2008

We wanted to provide some additional detail about the problem we experienced on Sunday, July 20th.

At 8:40am PDT, error rates in all Amazon S3 datacenters began to quickly climb and our alarms went off. By 8:50am PDT, error rates were significantly elevated and very few requests were completing successfully. By 8:55am PDT, we had multiple engineers engaged and investigating the issue. Our alarms pointed at problems processing customer requests in multiple places within the system and across multiple data centers. While we began investigating several possible causes, we tried to restore system health by taking several actions to reduce system load. We reduced system load in several stages, but it had no impact on restoring system health.

At 9:41am PDT, we determined that servers within Amazon S3 were having problems communicating with each other. As background information, Amazon S3 uses a gossip protocol to quickly spread server state information throughout the system. This allows Amazon S3 to quickly route around failed or unreachable servers, among other things. When one server connects to another as part of processing a customer's request, it starts by gossiping about the system state. Only after gossip is completed will the server send along the information related to the customer request. On Sunday, we saw a large number of servers that were spending almost all of their time gossiping and a disproportionate amount of servers that had failed while gossiping. With a large number of servers gossiping and failing while gossiping, Amazon S3 wasn't able to successfully process many customer requests.
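
As a rough illustration of why gossip spreads server state so quickly (and why a bad piece of state would spread just as quickly), here is a minimal simulation. It is a generic push-gossip sketch, not Amazon S3's actual protocol: the function name, the "each informed server tells one random peer per round" rule, and the server count are all assumptions made for the example.

```python
import random


def gossip_rounds(num_servers, seed=0):
    """Count push-gossip rounds until every server has heard one state update."""
    rng = random.Random(seed)
    informed = {0}  # server 0 is the first to learn the update
    rounds = 0
    while len(informed) < num_servers:
        # Each server that already knows the update tells one random peer.
        for _ in list(informed):
            informed.add(rng.randrange(num_servers))
        rounds += 1
    return rounds


# The informed set can at most double per round, so 1000 servers need at
# least 10 rounds; in practice the count stays far below the server count.
rounds_needed = gossip_rounds(1000)
```

The same property that makes gossip efficient means that one incorrect piece of state can reach every server within a handful of rounds.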

At 10:32am PDT, after exploring several options, we determined that we needed to shut down all communication between Amazon S3 servers, shut down all components used for request processing, clear the system's state, and then reactivate the request processing components. By 11:05am PDT, all server-to-server communication was stopped, request processing components shut down, and the system's state cleared. By 2:20pm PDT, we'd restored internal communication between all Amazon S3 servers and began reactivating request processing components concurrently in both the US and EU.

At 2:57pm PDT, Amazon S3's EU location began successfully completing customer requests. The EU location came back online before the US because there are fewer servers in the EU. By 3:10pm PDT, request rates and error rates in the EU had returned to normal. At 4:02pm PDT, Amazon S3's US location began successfully completing customer requests, and request rates and error rates had returned to normal by 4:58pm PDT.

We've now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers' objects. However, we didn't have the same protection in place to detect whether this particular internal state information had been corrupted. As a result, when the corruption occurred, we didn't detect it and it spread throughout the system causing the symptoms described above. We hadn't encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.

During our post-mortem analysis we've spent quite a bit of time evaluating what happened, how quickly we were able to respond and recover, and what we could do to prevent other unusual circumstances like this from having system-wide impacts. Here are the actions that we're taking: (a) we've deployed several changes to Amazon S3 that significantly reduce the amount of time required to completely restore system-wide state and restart customer request processing; (b) we've deployed a change to how Amazon S3 gossips about failed servers that reduces the amount of gossip and helps prevent the behavior we experienced on Sunday; (c) we've added additional monitoring and alarming of gossip rates and failures; and, (d) we're adding checksums to proactively detect corruption of system state messages so we can log any such messages and then reject them.
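
To make item (d) concrete, here is a minimal sketch of checksumming a state message so that a single flipped bit is detected, logged, and rejected. The message layout (MD5 hex digest, a colon, then a JSON payload) and the names `pack` and `unpack` are assumptions for illustration only; the post does not describe Amazon S3's real message format.

```python
import hashlib
import json
import logging


def pack(state):
    """Serialize state and prefix it with an MD5 digest of the payload."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    digest = hashlib.md5(payload).hexdigest().encode("ascii")
    return digest + b":" + payload


def unpack(message):
    """Return the state, or None (after logging) if the checksum fails."""
    digest, _, payload = message.partition(b":")
    if hashlib.md5(payload).hexdigest().encode("ascii") != digest:
        logging.warning("rejecting corrupted state message")
        return None
    return json.loads(payload)


# Hypothetical state: server "s1" is alive.
message = pack({"server": "s1", "alive": 1})

# Flip one bit so the digit "1" becomes "0". The payload is still valid
# JSON (still "intelligible"), but the state it carries is now wrong.
corrupted = bytearray(message)
corrupted[message.index(b": 1") + 2] ^= 0x01
corrupted = bytes(corrupted)

accepted = unpack(message)    # passes the check
rejected = unpack(corrupted)  # fails the check and is dropped
```

MD5 is used here only because the post mentions it; it guards against accidental corruption, not deliberate tampering.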

Finally, we want you to know that we are passionate about providing the best storage service at the best price so that you can spend more time thinking about your business rather than having to focus on building scalable, reliable infrastructure. Though we're proud of our operational performance in operating Amazon S3 for almost 2.5 years, we know that any downtime is unacceptable and we won't be satisfied until performance is statistically indistinguishable from perfect.

Sincerely,

The Amazon S3 Team


Original Link: http://status.aws.amazon.com/s3-20080720.html

In this article, I won't go into the step-by-step details of building a mail system; you can find plenty of configuration documents for popular MTAs such as Sendmail, Postfix, and qmail through Google.

I was once a mail system engineer maintaining a commercial mail system that sent out more than 100 million mails per day. That number covers only site mail; campaign mail is not included, and its volume is usually even higher. So here I'd like to share some of my experience in building a mail system with high scalability, manageability, and performance.

1. Split
Most MTAs have an internal policy of retrying, for a few days, any mail that gets a soft bounce (4xx). A soft bounce may be caused by network latency or other reasons, so the mail queue on a single box can grow larger and larger, which delays the 'good mail' (mail that can be delivered successfully on the first attempt). Site mail is critical to the business and needs to reach end users in a timely manner.

How do we resolve this issue? The answer is to split.
 
Here I'd like to introduce the concept of 'fallback'. It means that if a mail on the primary server is not delivered successfully on the first attempt, it is transferred to a fallback mail server for delivery. The benefit is that the mail queue on the primary server does not grow too large, so the 'good mail' reaches end users in a timely manner; it also reduces the load on the primary mail server.
 
Currently, most MTAs support this feature; check your MTA's official documentation for details. From my point of view, it's not difficult to add the fallback feature to an existing mail system, and you don't need to change much.
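
The fallback decision itself is simple. Here is a minimal sketch of what the primary server does after its first delivery attempt; the function name, the queue objects, and the return labels are hypothetical, and a real MTA implements this through its own configuration rather than code like this.

```python
from collections import deque


def route_after_first_attempt(reply_code, message, fallback_queue, bounced):
    """Decide what the primary server does after one delivery attempt."""
    if 200 <= reply_code < 300:
        return "delivered"              # 'good mail': done in one attempt
    if 400 <= reply_code < 500:
        fallback_queue.append(message)  # soft bounce: hand off for retries
        return "fallback"
    bounced.append(message)             # 5xx hard bounce: do not retry
    return "bounced"


fallback_queue, bounced = deque(), []
results = [
    route_after_first_attempt(250, "mail-1", fallback_queue, bounced),
    route_after_first_attempt(451, "mail-2", fallback_queue, bounced),
    route_after_first_attempt(550, "mail-3", fallback_queue, bounced),
]
```

The point is that the primary server never keeps a soft-bounced mail in its own queue, so retries cannot delay new mail.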
 
2. Load balance
For a commercial mail system, one or two servers can hardly handle a huge volume of mail effectively, so we should usually consider load balancing to spread the mail across a number of mail servers.
 
For example, assuming I have 50 powerful servers acting as primary mail servers and 40 ordinary servers for the fallback pool, we can set up two VIP domain names, mx.vip.isoracle.com and fallback.vip.isoracle.com, for the 'primary' and 'fallback' pools. Then we can configure a load balancer (e.g. F5) to distribute mail across the servers behind mx.vip.isoracle.com or fallback.vip.isoracle.com for delivery.
 
This way, without any downtime or impact on end users, we can easily add servers to the 'primary' or 'fallback' pool, or remove them from the pool when a single node has an issue. The only thing we need to do is add or remove entries in the load balancer or DNS server, so overall the system's scalability is greatly improved.
 
If you don't have the budget for a hardware load balancer such as F5 or NetScaler, Nginx or DNS round robin is another choice.
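
To show the idea behind a round-robin pool that can grow or shrink without downtime, here is a minimal sketch. The class name and the host names are made up for the example; a real setup would do this in the load balancer or DNS configuration, not in application code.

```python
class RoundRobinPool:
    """Hand out hosts in turn; hosts can be added or removed at any time."""

    def __init__(self, hosts):
        self.hosts = list(hosts)
        self._next = 0

    def add(self, host):
        self.hosts.append(host)

    def remove(self, host):
        self.hosts.remove(host)

    def pick(self):
        # Wrap around with modulo so pool size changes are handled safely.
        host = self.hosts[self._next % len(self.hosts)]
        self._next += 1
        return host


# Hypothetical host names behind the 'primary' VIP.
primary = RoundRobinPool(["mx1.example.com", "mx2.example.com", "mx3.example.com"])
first_four = [primary.pick() for _ in range(4)]

primary.remove("mx2.example.com")  # a single node has an issue
primary.add("mx4.example.com")     # a new server joins the pool
next_three = [primary.pick() for _ in range(3)]
```

After the change, new picks only ever land on hosts that are currently in the pool.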
 
3. OS and Storage
This is a common topic; I just want to emphasize that we'd better use storage with high I/O performance for the mail queue. SCSI hard disks are preferred, since they consume less CPU.
 
BTW, if we use SSDs (solid-state disks) to store the mail queue, will the I/O performance be greatly improved? ^^
 
For the operating system, I think Linux is good enough. Also, I've heard that the performance of the recently released FreeBSD 7.0 is excellent, but more tests are needed to confirm that.
 
4. Monitoring
For any mail system, I think monitoring is very, very important.
 
Basically, we need to keep a close eye on the following items:
  • Mail queue per single host
  • CPU, memory, and load, especially the load and CPU
  • Storage usage, especially for the volume that stores the mail queue

5. Troubleshooting and Reporting
As mail system engineers, we often face many kinds of mail system issues, and usually we need to check the mail log to see what happened. How can we get the useful information out of a huge mail system log in time for troubleshooting?

I developed a troubleshooting and reporting system when I worked on the mail system. It pulls all the raw mail logs from each mail server to a central mail log server, where Perl/Shell scripts analyze and process the huge mail log hourly. Finally, the useful data, including the sender, receiver, queue ID, send time, relay IP, receiver IP, DSN, and detailed error log, is stored in an Oracle database for analysis. I also wrote a CGI web page: you only need to enter the issue time range and the receiver's email address, and you can easily get all the detailed error logs from it. To be honest, this tool greatly helped me during mail issue troubleshooting.
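
As a rough idea of what the log-processing step looks like, here is a minimal sketch that extracts a few of those fields from one log line. It is written in Python rather than the Perl/Shell I actually used, and it assumes a Postfix-style delivery line; the sample line, host names, and field names are invented for the example, and other MTAs log in different formats.

```python
import re

# Assumed Postfix-style "postfix/smtp" delivery line.
LINE_RE = re.compile(
    r"^(?P<time>\w{3}\s+\d+ [\d:]{8}) (?P<host>\S+) postfix/smtp\[\d+\]: "
    r"(?P<queue_id>[0-9A-F]+): to=<(?P<receiver>[^>]*)>, "
    r"relay=(?P<relay>[^,]+), .*?dsn=(?P<dsn>[\d.]+), "
    r"status=(?P<status>\w+) \((?P<detail>.*)\)$"
)


def parse_line(line):
    """Return the extracted fields as a dict, or None if the line differs."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


sample = (
    "Jul 20 08:40:01 mx1 postfix/smtp[1234]: 4F2A81C0D3: "
    "to=<user@example.com>, relay=mail.example.com[192.0.2.10]:25, "
    "delay=0.52, delays=0.1/0/0.2/0.22, dsn=4.2.0, status=deferred "
    "(host mail.example.com[192.0.2.10] said: 450 4.2.0 Mailbox busy "
    "(in reply to RCPT TO command))"
)
record = parse_line(sample)
```

Each parsed record maps naturally onto one database row, which is what makes the later lookup by time range and receiver address cheap.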
 
Since the mail log information is stored in the database, we can also easily write tools to generate charts showing key metrics such as delivery rate, soft bounce count, and hard bounce count. RRDTool is a good choice for storing the data and generating the charts; I like it very much.
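
These metrics can be derived from the DSN codes alone, since the first digit of a DSN status gives its class: 2 for success, 4 for a temporary failure (soft bounce), and 5 for a permanent failure (hard bounce). A minimal sketch, where the function name and the sample codes are made up for illustration:

```python
from collections import Counter


def bounce_metrics(dsn_codes):
    """Summarize delivery results from DSN status codes such as '4.4.1'."""
    classes = Counter(code.split(".", 1)[0] for code in dsn_codes)
    total = sum(classes.values())
    return {
        "delivery_rate": classes["2"] / total if total else 0.0,
        "soft_bounces": classes["4"],
        "hard_bounces": classes["5"],
    }


# Hypothetical hour of results: two delivered, one soft and one hard bounce.
metrics = bounce_metrics(["2.0.0", "2.0.0", "4.4.1", "5.1.1"])
```

Values like these, computed per hour, are the kind of numbers that would then be fed into RRDTool for charting.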
 
A mail system is very complex, so this article cannot touch on every area of it. Please share your comments or feedback if you have any.
 

Note:
The following content is from the Internet, just FYI.

If you’ve ever wondered how microsoft.com uses our technology then read on. I recently came across some good information from the folks over at the Operations team at Microsoft.com. The thread basically talks about how we use IIS, Firewalls and Windows Server 2008. I think as we come up to launch next year it’s a really good and quick insight into what they do and how they do it. So enjoy the reading and let me know what you think. Pretend I’ve asked about how they protect our sites…

1) We don’t handle HBI data so we don’t have the need for external logging capabilities.  If  we did handle HBI, we’d have firewalls.

2) We have ~650GB/day of IIS logs just for www.microsoft.com and update.microsoft.com (not including the 6GB/hour for each download server).  Just IIS logs are a challenge without trying to parse another ~650GB of firewall logs.

3) 5+ years ago, there wasn’t a firewall solution that would scale to our needs and this forced us to focus on network, host, and application security. Based on the success of that work, we’ve not looked further at firewalls even though there are solutions that I believe (haven’t tested) would handle the traffic load (our non-download based web traffic alone can be in the 8-9 Gbps range and ~30 total for internal hosted traffic).

4) We also used NLB for load balancing exclusively up until July 2006 and the micro segmentation of networks required by that solution made firewalls an expensive and very complex solution.  Again, especially at the scalability that used to be available.

5) Application security is critical since a firewall is likely going to allow traffic on the correct port and protocol through to the web servers so IIS/ASP.NET/Applications must deal with these requests gracefully.  I realize there are other options/features of firewalls/IPS that provide other options.

In terms of how we protect the sites, we utilize (starting at the outside edge of the network and working in):
1) Cisco Guards for DoS detection and automated response

2) Router ACLs are in place to block unnecessary ports

3) NetScalers for www.microsoft.com and MSDN/TechNet (NLB still for update.microsoft.com) and those also provide DoS protection inherently as well as providing a few other knobs we can turn when required.

4) Windows and IIS…rock solid and secure!  www.microsoft.com is on Windows Server 2008/IIS7, MSDN/TechNet are migrating to Win2k8/IIS7, and update.microsoft.com is on Windows Server 2003/IIS6.  We do all the normal shut-off-unused-services practices that line up with MS published security guidance and we utilize GFS images to ensure standardized builds of systems.

5) Automated Netmon/Perfmon captures for attack analysis on NLB systems when SYN floods occur (event trigger).  We’ve not yet done this for NetScaler systems, but we are noodling on how in our copious spare time :).

6) We do run AV on our servers when we can.  At times product adoption means we don’t install it, but we do normally run AV.

7) Application security as mentioned. ACE is a very good resource for this aspect. ACE is an internal team that does threat modelling for applications.

-EOF-