Best Practice: Secondary mail server

Hi All,

We ran into a situation the other day in which our entire datacenter went down, so for about 5 hours none of my clients were able to use their email service. Since then I’ve brought up a couple of servers in another provider’s cloud. I’ve recently set up a secondary DNS server using tinyDNS and a cron job that copies the data.cdb file every 15 minutes.
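
A minimal sketch of the install half of a copy job like the one described above, assuming the cron job has already fetched the primary's data.cdb into a local staging path. The transfer itself, the paths, the file mode, and the function name `install_cdb` are hypothetical and not taken from the original setup. The point it illustrates is replacing the live file with a rename, so the DNS server never reads a half-written database.

```python
import filecmp
import os
import shutil
import tempfile


def install_cdb(staged_path: str, live_path: str) -> bool:
    """Install staged_path over live_path atomically.

    Returns True if the live file was replaced, False if it was already identical.
    """
    if os.path.exists(live_path) and filecmp.cmp(staged_path, live_path, shallow=False):
        return False  # nothing changed since the last run

    live_dir = os.path.dirname(os.path.abspath(live_path))
    # Create the temp file next to the live file so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=live_dir, prefix="data.cdb.")
    os.close(fd)
    try:
        shutil.copyfile(staged_path, tmp_path)
        os.chmod(tmp_path, 0o644)  # assumed mode; adjust to whatever the DNS user needs
        os.replace(tmp_path, live_path)  # atomic rename on POSIX
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

Skipping the replace when the two files are byte-identical keeps a 15-minute schedule from touching the live file when nothing has changed.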

Does anyone know what the best practice is for having a secondary mail server for Interworx? I know Interworx has a clustering service, but because the servers will be in different geographic locations, with no option of virtual IPs or SAN access, I cannot use it.


Mail is extremely difficult to “secondary” without some really bad consequences. Mail is stateful: clients keep track of which UIDs they’ve seen, and servers keep track of which UIDs they’ve seen or sent. When that state gets out of sync, it leads to mail problems, from messages that simply can’t be popped to messages being deleted from the server without ever being read (IMAP is better than POP in this respect, but you can still run into problems). If you need geographical diversity, you’re going to need filers that can handle this sort of syncing, and a fairly stout WAN connection to keep them synced. It’s doable, but not cheap, and there will almost certainly be a little loss involved with a failover unless your WAN pipe is big enough for realtime syncing. Failback needs to be addressed, too, for any failure scenario. If failback occurs before a sync occurs, you’re in the same boat you’d be in with a bad failover.

A data center experiencing a 5 hour total outage is rare IME. Was it a network outage, a power outage, or something else? Or was the issue isolated to your colo? If it was really a total outage at the data center, and you feel this is a scenario likely to repeat itself, you should really consider moving to a different data center. If it was isolated to your own colo, you should examine what happened, why, and how to mitigate that type of failure in the future.

The outage was datacenter-wide: there was a utility power outage, and something damaged the breaker that switches between utility power and generator power. The datacenter ran off battery backup for about 10 minutes and didn’t alert us until 5 of those 10 minutes had gone by.

I don’t know that it will happen again. I’ve never actually seen a “real” datacenter go down like that before, but we’re doing everything in our power to duplicate critical services into other datacenters just in case.

Is there perhaps a way to at least have a secondary mx server in place to keep messages from bouncing and then have them relay back to the main server once it comes back up?
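
To make the question concrete: a sending server picks its target from the domain's MX records, trying the lowest preference number first and falling back to higher numbers when that host is unreachable. A rough sketch of that selection follows; the hostnames and the function name `delivery_order` are made up for illustration, and a real sender would get the records from a DNS lookup.

```python
import random
from typing import List, Tuple


def delivery_order(mx_records: List[Tuple[int, str]]) -> List[str]:
    """Return hosts in the order a sender would try them: lowest preference first."""
    shuffled = mx_records[:]
    random.shuffle(shuffled)  # hosts with equal preference end up in random order
    # sorted() is stable, so the shuffle only affects ties on preference.
    return [host for _, host in sorted(shuffled, key=lambda rec: rec[0])]


# Hypothetical records: primary at preference 10, backup in another datacenter at 20.
records = [(20, "mx2.example.net"), (10, "mail.example.com")]
print(delivery_order(records))  # ['mail.example.com', 'mx2.example.net']
```

With records like these, the backup only sees mail while the primary is unreachable; it would then need to be configured to accept mail for the domain and relay it to the primary later.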

I can’t quote all of the RFC standards for mail delivery timeouts, but I’d expect a well-behaved mail server to defer messages for longer than 5 hours. IIRC the standard warn time is 4 hours.
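
A back-of-the-envelope version of that argument, as a sketch only. It assumes a sender that retries at a fixed interval and gives up after a fixed queue lifetime. The defaults below (30 minutes and 4 days) follow the minimums RFC 5321 recommends, but they are illustrative: real servers use their own backoff schedules and limits, and the function name is hypothetical.

```python
from datetime import timedelta
from typing import Optional


def first_successful_retry(outage: timedelta,
                           retry_interval: timedelta = timedelta(minutes=30),
                           queue_lifetime: timedelta = timedelta(days=4)) -> Optional[timedelta]:
    """Time after the first failed attempt at which a retry gets through.

    Assumes the first attempt happens the moment the outage begins.
    Returns None if the sender would give up (bounce) before the outage ends.
    """
    elapsed = timedelta(0)
    while elapsed <= queue_lifetime:
        if elapsed >= outage:
            return elapsed  # the receiving server is back, so this attempt succeeds
        elapsed += retry_interval
    return None


print(first_successful_retry(timedelta(hours=5)))  # 5:00:00
print(first_successful_retry(timedelta(days=5)))   # None
```

Under these assumptions a 5 hour outage delays mail but does not bounce it; only an outage longer than the sender's queue lifetime would.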

For what you’re describing, you’ll probably want to look into redundant filers of some sort. There are lots of options depending on your storage backend preferences and budget (NAS, iSCSI, FC, etc). Deduplication may or may not be necessary, but in general it is thought to reduce the amount of data that needs to be moved between two sites. That’s really going to be the best bet in cases like these; rsyncing between sites won’t work for “realtime” needs.

Yeah, we’re using a Hyper-V cluster with an iSCSI SAN, but actively replicating that between datacenters would be very pricey in bandwidth costs. I decided to just put up a secondary qmail server to queue messages until the primary server comes back up. We’ll be fine as long as the datacenter doesn’t go down completely again, as the Hyper-V cluster can handle most failures. We have considerable redundancy inside the datacenter; it’s just a problem if the datacenter disappears from the Internet again. If it does, we’ll be switching datacenters. Thanks for your valuable input, I greatly appreciate it. It’s always nice to be able to bounce ideas off other professionals.

Have a great weekend,