Dog slow SMTP

Anyone else running into this issue?

CentOS 5.7 64-bit
iWorx v4.9

It takes a full 10 seconds or more to send a simple test email. I thought it was isolated to my particular setup (VPN + localhost forward of port 25), but clients without VPN or SSH access are reporting the same.

Just moved to our new server, a complete beast (Dell R610) compared to our old production server (SC1425), and I'm generally quite pleased with the performance, but something is totally wrong with TCP connections like SMTP and MySQL (localhost forward of 3306 for SQLyog).

Thought it was the firewall, so stopped APF and iptables, flushed, and tried to send another test email: same deal, dog slow.

I’ve got ClamAV running along with SA; could this be the cause of the huge delay in sending? Receiving email is fine, not blazing fast, but good enough.

It could also be ESXi 4.1; virtualization may be the culprit here as well.

If anyone else has TCP issues, let me know…

Thanks

What do the logs say? Anything interesting in the config? That kind of issue could be the result of a DNS resolution failure or a slow RBL if you’re RBLing outbound - I've seen both issues before, elsewhere.
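
A quick way to tell those two apart from the mail host itself (assuming dig is installed; bl.spamcop.net is just a stand-in for whatever RBL you’d be using, and 2.0.0.127 is the standard DNSBL test entry, 127.0.0.2 reversed):

time dig +short example.com
time dig +short 2.0.0.127.bl.spamcop.net

If either of those sits there for several seconds, that’s your hang.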

Our entire mail cluster is built on ESXi 4.1 currently (small ISP), and while it has its share of issues, nothing like you’re describing. I assume you’ve installed the VMware Tools.

Hi Newmind,

This sounds to me like it could be a DNS issue, like the mail server is hanging trying to do DNS lookups when the SMTP connection happens. What is in the /etc/resolv.conf file? And are the nameserver(s) listed there readily accessible?
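
A quick way to check (assuming dig is on the box) is to time a lookup that has to go through whatever is listed there, e.g.:

time dig example.com

If that hangs for several seconds, repeat it with @ pointed at each nameserver from resolv.conf in turn to see which one is the slow (or unreachable) one.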

Paul

Paul,

cat /etc/resolv.conf

nameserver 127.0.0.1
nameserver 172.16.65.1

Telnet on port 25 brings the SMTP welcome banner/response instantly.

And when I tail -f /var/log/smtp/current, I see the test mail appear instantly, but then there's a hang of 10 seconds or more until the log shows the message as being sent.
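
In case it helps anyone reading along, I'm watching that gap by converting the tai64n timestamps to something human-readable (assuming the stock daemontools/multilog setup, which ships tai64nlocal):

tail -f /var/log/smtp/current | tai64nlocal

That makes it easy to see exactly which log lines the 10 seconds sits between.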

Keep in mind that the ESXi host sits behind a Cisco ASA, so I have to NAT the name servers' public IPs to the DMZ subnet. I assume that, given NAT, I need the 172.16.65.1 address in /etc/resolv.conf in addition to 127.0.0.1.

Obviously there are more moving parts than the old bare metal setup, where it was physical server + iptables = azz-in-the-wind. Still, the current setup should be blazing fast on all fronts; I must be missing something…

@zombie, nice, same deal here, ESXi 4.1 (Update 2). Really happy with the setup overall, just puzzled by the slow SMTP. Web traffic rips and FTP is similarly fast, so why does TCP over port 25 drag to a grinding halt?

VMware Tools is the latest version, as is ESXi Update 2 (from October 27th, 2011).

I’ll dig around a bit more, must be something off in my setup.

Paul, I reversed the /etc/resolv.conf entries to:
nameserver 172.16.65.1
nameserver 127.0.0.1

This speeds things up considerably, although it's still a bit draggy compared to the old bare metal machine.

It’s a bit odd: dig @127.0.0.1 for any external domain gets an instant reply; do the same for a local domain and it times out. For dig @172.16.65.1 it’s the exact opposite, snappy for local domains and a timeout against external domains. Arghhhh, the joy :wink:

It really sounds like reverse lookups are failing, causing qmail to hang (which seems to be a fairly common complaint if Google can be trusted). On that host, how long does it take to complete a “dig -x 4.2.2.2”? Try a few other IPs, too. If there’s a big hang, try it with the +trace option: “dig +trace -x 67.67.67.67”. 4.2.2.2 is a well-known openish resolver, and I have no idea what 67.67.67.67 is, but it returns something.
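
In other words, straight from the mail host, something like:

time dig -x 4.2.2.2
time dig -x 67.67.67.67
dig +trace -x 67.67.67.67

If the plain -x lookups take anywhere near 10 seconds, you’ve found your hang.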

[QUOTE=newmind;18841]Paul, I reversed the /etc/resolv.conf entries to:
nameserver 172.16.65.1
nameserver 127.0.0.1

This speeds things up considerably, although it's still a bit draggy compared to the old bare metal machine.

It’s a bit odd: dig @127.0.0.1 for any external domain gets an instant reply; do the same for a local domain and it times out. For dig @172.16.65.1 it’s the exact opposite, snappy for local domains and a timeout against external domains. Arghhhh, the joy ;-)[/QUOTE]

Well, looks like you beat me to the punch there. That +trace will help you see where the lag occurs in “real” time, though.

@zombie,

Yes, since the ESXi host is behind a Cisco ASA, I have to NAT the public IPs to their DMZ subnet equivalents.

In the case of our primary NS, that’s the 172.16.65.1 address, which the iWorx TinyDNS is listening on. So our local domains (i.e. domains that we host) resolve on 172.16.65.1, but external domains resolve on 127.0.0.1 via the ASA, which uses SagoNet (our colo provider) for its default DNS. (I should probably set external DNS directly at the VM level, though; it doesn't make sense for the CPU/memory-weak ASA to resolve DNS.)

Anyway, the setup creates lag time one way or the other, since resolv.conf entries appear to be processed in order. So if, for example, qmail needs to resolve the bl.spamcop.net RBL, it will first try the external-domain-blind 172.16.65.1, fail, and then try 127.0.0.1, which resolves the external domain more or less instantly. The opposite is true for local domains.

Until I figure out a workaround, for the time being it seems the better call is to have our hosted domains resolve quickly and push the lag (which is not terrible) onto resolving external domains.
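
One thing I may try in the meantime (an untested guess on my part that the glibc resolver options will apply here): shortening the per-nameserver timeout in /etc/resolv.conf so that falling through from the first entry to the second costs 1 second instead of the default 5:

options timeout:1
nameserver 172.16.65.1
nameserver 127.0.0.1

That wouldn't fix the split-resolver situation, just cap how much each fallthrough hurts.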

On another note, what are you using for your vNICs? I’m currently using vmxnet3, but am going to test e1000, which seems to be fully supported in CentOS 5.7 64-bit. Since I only have a 4x gigabit NIC, vmxnet3 doesn’t really make sense (i.e. I don’t have the means to push the 10Gb bandwidth that vmxnet3 is designed for; nor do I have a SAN, just 6x local SCSI disks).

Really cool to hear of another iWorx user with a setup same/similar to mine :wink:

I’m using vmxnet3 everywhere since that’s what I get by default with “flexible.” When I was troubleshooting an issue I was seeing with input errors a while ago, I couldn’t really see a difference between e1000 or vmxnet. I’m a 32 bit luddite, though.

@zombie, it seems to make zero difference, e1000 vs. vmxnet3; I've tried both.

VMware, however, does suggest fixing endpoints to the actual network speed (i.e. not autonegotiate). I have done so: 100Mb/s full duplex for the DMZ NIC connecting to the ASA and 1000Mb/s for the LAN NIC connecting to the gigabit switch.
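
For anyone else doing this, the speed/duplex is pinned on the host's physical uplinks; from the ESXi Tech Support Mode shell it's roughly (vmnic0 standing in for whichever uplink feeds the DMZ vSwitch):

esxcfg-nics -l
esxcfg-nics -s 100 -d full vmnic0

The same setting is available in the vSphere Client under the host's network adapter properties.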

There is, of course, no difference performance-wise, but at least it feels like I’m doing the “right” thing :wink:

Considering going to a 9000 MTU on the LAN NIC and the backup server; that may help with transferring the 10-20GB vmdk files.
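
If I do go that route, the guest side is just an MTU setting on the LAN vNIC (eth1 here is a stand-in for whatever the LAN interface is on your box); the vSwitch and the physical switch have to be set for jumbo frames too or it does nothing:

echo 'MTU=9000' >> /etc/sysconfig/network-scripts/ifcfg-eth1
service network restart
ifconfig eth1 | grep -i mtu

On the ESXi side that would be something like esxcfg-vswitch -m 9000 vSwitch1 (vSwitch1 being whichever vSwitch the LAN NIC hangs off of).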

In the end, it may be that the 5-year-old bare metal machine (SC1425) is going to outperform the ESXi Dell R610 host, which is relatively a powerhouse. That's the price to pay for multiple VMs under one host, I guess. I was hoping for snappiness all around, but it's not happening just yet.

Are you getting ripping speeds on your ESXi host(s), or slightly less than bare metal?

IME you definitely pay a performance penalty, but the upsides far outweigh the downsides, IMO. I love the flexibility, and I love the ability to easily access hosts that are misbehaving and reboot them, etc., without having to resort to iLO/IPMI. And yes, there’s also the added benefit of being able to fire up several VMs on a single box.

I am far from being a vmware expert, though - I’ve never even seen any of the for-pay stuff outside of CBTs, for instance. I really wish I had vmotion, etc, available.

@zombie,

Ya, vMotion would be nice indeed, but I didn't have a mirrored backup server for the bare metal machine (just a nightly rsync), so I was flying by the seat of my pants then as well :wink:

The difference now is that I have a physical firewall (ASA 5505) in front of my servers. It's pretty amazing that my nightly logwatch is no longer full of hacking attempts, and the APF/BFD combo jails the ones that do get through. Furthermore, the Interworx control panel login is locked down to VPN users.

So, psyched, ESXi on R610 + ASA + Interworx = awesome combo.

BTW, the performance issue is fine now; not quite bare metal, but again, I don’t sit there watching my outbox clear out.

I actually see zero difference between bare metal and ESXi for general use; this R610 is a beast…