Mail servers were slow today

Mail processing was slow today due to high load on the machine that checks mail for viruses and spam. The problem occurred while we were performing upgrades to the operating system.

How is mail processed?   It’s far more complex than it appears…

There are 3 machines responsible for processing mail. Two machines (named sylvio and paulie) serve as the front end: they receive incoming and outgoing mail, make a few preliminary checks to see if the recipient is valid, and store the mail to disk (a process called queuing). Once the mail is queued, a separate process sends it to a third server (tony) to be checked for spam and viruses and then (presuming no viruses were found) returns it to sylvio or paulie, where it is again queued to disk. A third process then collects the queued mail and performs final delivery to the local mailbox (for local users) or the recipient’s mail server (for non-local users).
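
The flow above can be sketched roughly in Python. This is an illustration only, not our actual configuration: the function names, the message layout, and the in-memory queues are hypothetical stand-ins for the real on-disk queues and for the scanner on tony.

```python
from collections import deque


def looks_infected(message):
    # Hypothetical stand-in for the spam/virus scanner on tony.
    return "VIRUS" in message["body"]


def process_mail(messages, valid_recipients):
    incoming_queue = deque()  # first queue on the front end (sylvio/paulie)
    scanned_queue = deque()   # second queue, filled after scanning
    delivered = []

    # Stage 1: the front end accepts mail, checks the recipient, and queues it.
    for message in messages:
        if message["to"] in valid_recipients:
            incoming_queue.append(message)

    # Stage 2: a separate process sends queued mail to the scanner and
    # re-queues whatever comes back clean.
    while incoming_queue:
        message = incoming_queue.popleft()
        if not looks_infected(message):
            scanned_queue.append(message)

    # Stage 3: a third process performs final delivery.
    while scanned_queue:
        delivered.append(scanned_queue.popleft())

    return delivered
```

In the real system each stage runs independently, so a slow scanner delays stage 2 without stopping the front end from accepting mail in stage 1.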

Why so complex?   A bunch of good reasons actually…

  • Two front end machines allow us to work on one machine without disrupting mail processing.
  • Spam filtering and virus checking are slow and difficult processes that require considerable resources (CPU, memory).   Separating the storage and processing helps prevent client timeouts.   Many mail clients (e.g. Outlook Express, Outlook, Thunderbird) will generate error messages if the mail server does not accept mail quickly.
  • Delivering mail from disk (rather than from memory) is safer.   By queuing mail to disk before acknowledging acceptance we do not lose mail in the event of a software or server crash.
  • Mail is often bursty in nature, ranging from a few messages a minute to hundreds a minute.   Since the incoming rate can exceed the rate at which messages can be checked for spam and viruses, the front end servers hold the mail until the scanner can check it.
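
To see why holding bursty mail in a queue matters, here is a small Python sketch. The arrival numbers and the scanner capacity are made up for illustration and are not measurements from our servers.

```python
def simulate_backlog(arrivals_per_minute, scan_capacity_per_minute):
    """Return the number of messages left waiting at the end of each minute."""
    backlog = 0
    history = []
    for arrivals in arrivals_per_minute:
        backlog += arrivals  # the front end queues everything it receives
        scanned = min(backlog, scan_capacity_per_minute)
        backlog -= scanned   # the scanner drains what it can this minute
        history.append(backlog)
    return history


# A quiet minute, a two-minute burst, then quiet again (hypothetical numbers).
print(simulate_backlog([5, 300, 300, 5, 5], scan_capacity_per_minute=100))
# prints [0, 200, 400, 305, 210]
```

The backlog grows during the burst and drains afterwards. Nothing is rejected; delivery is only delayed, which is what happened today when the scanner was under high load.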

For some time the servers have had an issue where they lock up when asked to make a ‘snapshot’ (backup) of the disk.  The lockup is a known problem with the operating system version we have been using.  We are in the process of upgrading the operating system, and that upgrade work caused the high load on the server today.