We are trying to pin down a problem with our mailserver. For a few weeks now we have occasionally been getting reports from customers that “mail isn’t working”. Most of the time we checked and found nothing. Then we encountered problems ourselves: I set up test accounts in my Outlook and had them check for new mail every 5 minutes. Sure enough, I sometimes got error messages from Outlook along the lines of “server is not answering”. But those problems always went away by themselves and we didn’t find any hints in the log files. We didn’t pursue this with full force, since it was only intermittent (but we should have).
Then a week ago another customer called and said he couldn’t send out mail. I checked, and I couldn’t either. I restarted the primary SMTP (port 25, TLS optional) to no avail, then the alternate SMTP (port 587 with TLS required), and mails went through again. So we checked more thoroughly, but still couldn’t find anything related in the logs. (Btw, how do you guys analyze the logs? Do you convert them in order to make the timestamps readable?)
Today we had another call and the same thing happened again. I took some time to look at the log files and checked whether all services were running as they should, and found nothing out of the ordinary. So I restarted the alternate SMTP again: nothing. I restarted all other MTA services: nothing. I restarted the MDA services and the firewall: still nothing. As this had already taken too much time, I rebooted and everything was fine again.
So, here I am asking for ideas on how to track this bugger down.
We think SSL/TLS might be a possible cause. Whenever this problem happened, I could still send out mails unencrypted through the primary SMTP. And the problems Outlook sometimes reports also occur when checking for new mail (i.e. IMAPS/POP3S). The mailserver has worked fine for months; might this have to do with the changes to SSL since POODLE or the upgrade to CentOS 6.6?
Any ideas are very welcome, as we can’t really pin this down…
Thanks in advance.
I hope you’re well.
I don’t think it’s connected with POODLE or the changes made to the ciphers for the mail server.
I would ask which versions of Outlook are showing issues and, if you use a different client program, whether it shows the same issue. eM Client is a good alternative to Outlook.
Also, if you access webmail whilst the issue is present, does that show any issues?
The logs, I believe, run through in order and show the correct date/time, so they are more human readable.
Lastly, do you mind giving the approximate volume of email throughput you currently have?
Also, what are your current settings for the number of connections to SMTP on either port? I’m thinking it might be a small DDoS attack on the port.
I’ll check tomorrow morning, but the above should help in trying to find a cause/resolution for you.
I am fine, thanks. And you too, I hope.
The issue is not only with Outlook but with any client, on PC and on Mac. I will check webmail the next time it happens, good idea.
The qmail logs’ timestamps are TAI64N encoded (http://cr.yp.to/libtai/tai64.html), so not readable at all.
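On reading the raw log files: daemontools ships `tai64nlocal`, which converts these labels when a log is piped through it. For searching and analysis in a script, a minimal Python sketch of the same conversion could look like this. It assumes each log line starts with `@` followed by 24 hex digits, and it assumes the fixed 10-second offset that daemontools adds when writing the labels; the function names are made up for illustration.

```python
from datetime import datetime, timezone

TAI64_BASE = 2**62   # TAI64 label that corresponds to the 1970 epoch
TAI_UTC_OFFSET = 10  # assumption: fixed offset daemontools adds when writing labels


def tai64n_to_datetime(label: str) -> datetime:
    """Convert a label such as '@4000...' (24 hex digits) to a UTC datetime."""
    hex_part = label.lstrip("@")
    if len(hex_part) != 24:
        raise ValueError(f"not a TAI64N label: {label!r}")
    seconds = int(hex_part[:16], 16) - TAI64_BASE - TAI_UTC_OFFSET
    nanos = int(hex_part[16:], 16)
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return stamp.replace(microsecond=nanos // 1000)


def convert_line(line: str) -> str:
    """Replace a leading TAI64N label with a readable UTC timestamp."""
    label, sep, rest = line.partition(" ")
    if not label.startswith("@"):
        return line
    try:
        stamp = tai64n_to_datetime(label)
    except ValueError:
        return line  # leave lines with a malformed label untouched
    return f"{stamp:%Y-%m-%d %H:%M:%S.%f}{sep}{rest}"
```

Feeding a log file through `convert_line` line by line gives timestamps in UTC that sort and grep like ordinary text; lines without a label pass through unchanged.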
The server only has to handle a few hundred mails per day at most, so no real volume there. The settings should be pretty much default, afaicr:
SMTP in: 20 connections max., timeout 1200
SMTP out: 255 connections max., response timeout 600, connect timeout 60
A DDoS would be visible in the log files.
I haven’t checked webmail yet when the problem occurs, but afaict you are referring to something else. Our problem concerns communication between all clients and the server, e.g. when sending mails.
Many thanks for the update
I’m not too sure which logs you’re trying to read; the relevant logs should already be made human readable on the log screen of IW. If not, there is, I believe, a small program which will convert them, but I’d need to look it up.
I think (but could be wrong, sorry) your issue relates directly to the number of simultaneous incoming connections, which at 20 is very low IMHO and can very quickly become fully used. Please bear in mind this setting is for all incoming connections, i.e. from clients and from other mail providers.
I would increase your MTA to 1000 simultaneous incoming connections and leave your outgoing at 255.
I hope that helps. Our MTA is set similarly to the above. I’ll have a little think further and await any further information you may post. This is why I thought of a DDoS: with a limit of 20 it may not be noticeable, and DDoS may be the wrong term to use, but anything above 20 has the same effect, even if it is only 22 simultaneous connections trying to send through an MTA limited to 20.
Yeah, I failed to mention that I am talking about the log files on the server, the actual files and not the viewer within NodeWorx, which is much too limited for real searching and analysis. I have seen programs that convert TAI64N, but I have yet to test one.
Ok, I’ll try raising the connection limit. 20 is the iworx default setting though, isn’t it? Looking at the log files, I can see from the number of entries that we are way below 20 connections at the same time. But I will try it, anything might help.
It just happened again so I could test a few things:
- Raising the SMTP connection limit didn’t change anything
- When it happens I can’t send out mails through Secure SMTP, can’t connect to secure IMAP or request IMAP folders via SSL. POP3 on 110 works fine, SMTP on 25 too. I have yet to test IMAP on 143 during the outage.
- Services seem to recover slowly, one after another. IMAPS worked before SMTP with TLS worked again.
- Everything returned to normal within 10 minutes
- I tried to log in to webmail; that took a while but worked, and I could send mails. (How does Roundcube communicate with qmail? Which protocol and port are used?)
Remember, this might have happened much more often than we know, as we only find out if we see it ourselves (mail client complaining about lost connectivity) or if a customer calls (which is bad). And we have had two cases where the services did not recover by themselves and we had to restart them or even reboot the machine (once so far).
There are a couple of points to also bear in mind. One is that some clients may also use mobiles, which can hold the connection or the stream open and not release it. It does happen; we’ve seen it on our enterprise mailers, and even restarting services does not clear it. What cleared it was a full port shutdown for about 3 to 5 minutes to time out the offending connection or device; when the ports were reopened, all was well.
Also, a good test is to telnet to the mail server.
I hope that helps a little, but I’m pretty sure setting the MTA to a much higher value would resolve your issue
Sorry our posts crossed
I’ll have a think about your further post, but it looks like it is only connected to secure connections then, and not to unsecured ones.
Have you checked your webserver log for SSL errors, and is anything unusual showing?
Sorry, when Outlook failed, what was the precise error reason?
I set up some test accounts on another mail client (Claws), which has better logging capabilities than Outlook. Here’s what happened during one of the minor outages in IMAP:
[12:26:07] IMAP4< * OK [CAPABILITY IMAP4rev1 UIDPLUS CHILDREN NAMESPACE THREAD=ORDEREDSUBJECT THREAD=REFERENCES SORT QUOTA IDLE] Courier-IMAP ready. Copyright 1998-2003 Double Precision, Inc. See COPYING for distribution information.
[12:26:07] IMAP4> 1 STARTTLS
[12:26:07] IMAP4< 1 NO Error in IMAP command received by server.
** IMAP Fehler auf mail.domain.com: STARTTLS Fehler [Translation: IMAP error on mail.domain.com: STARTTLS error]
** TLS-Sitzung konnte nicht gestartet werden. [Translation: TLS-session could not be started]
[12:26:07] IMAP4> 2 LOGOUT
[12:26:07] IMAP4< * BYE Courier-IMAP server shutting down
[12:26:07] IMAP4< 2 OK LOGOUT completed
- Konto ‘SSL-Imap’: Verbinde mit IMAP4-Server: mail.@domain.com:993… [Translation: Account ‘SSL-Imap’: connecting to IMAP4 server mail.@domain.com:993…]
** IMAP Fehler auf mail.@domain.com: Verbindung abgelehnt [Translation: IMAP error at mail.@domain.com: connection refused]
And this is the corresponding server log:
2014-12-02 12:26:06.518172500 tcpserver: pid 1187 from 123.456.789.1
2014-12-02 12:26:06.518172500 tcpserver: ok 1187 srv02.domain.com:::ffff:188.8.131.52:143 :::ffff:123.456.789.1::20276
2014-12-02 12:26:06.536892500 INFO: Connection, ip=[123.456.789.1]
2014-12-02 12:26:06.594362500 INFO: LOGOUT, ip=[123.456.789.1]
2014-12-02 12:26:06.594413500 tcpserver: end 1187 status 0
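To put a number on how close the server really gets to the connection limit, the tcpserver entries in a log like this one can be counted. A minimal Python sketch, assuming every connection writes a `tcpserver: pid N from IP` line when it opens and a `tcpserver: end N status S` line when it closes, as in the excerpt above; the function name is hypothetical:

```python
import re

# Patterns taken from the tcpserver lines in the log excerpt.
OPENED = re.compile(r"tcpserver: pid (\d+) from ")
CLOSED = re.compile(r"tcpserver: end (\d+) status ")


def peak_connections(lines):
    """Return the highest number of connections open at the same time."""
    open_pids = set()
    peak = 0
    for line in lines:
        match = OPENED.search(line)
        if match:
            open_pids.add(match.group(1))
            peak = max(peak, len(open_pids))
            continue
        match = CLOSED.search(line)
        if match:
            open_pids.discard(match.group(1))
    return peak
```

Run over the log of one service for the minutes around an outage, this shows whether the count ever approaches the configured maximum. Connections that were already open before the first line read are not counted.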
Many thanks. I believe your issue is that TLS could not start.
I believe that if a client connects securely and no security is available, the server closes the connection, but I could be wrong, sorry.
Have any of your SSL certs expired or been changed?
Have any of your ciphers been changed?
Did you stop SSLv3?
When manually restarting the mail server via ssh, are there any failures?
Is port 587 open in the firewall? It was pointed out that IW can show it as open in the cp while in reality it is closed. I don’t think this is the issue here though.
Would you mind PMing me a domain to run a check on the SSL cert?
Have you manually tried to telnet to the mail server to see if it yields any further messages not seen so far? Please remember you’ll need to base64-encode the credentials.
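For the manual telnet test, the base64 strings can be produced with a few lines of Python. This is only a sketch and assumes the server offers AUTH PLAIN or AUTH LOGIN; the user name and password in the usage note are placeholders.

```python
import base64


def _b64(text: str) -> str:
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def auth_plain_token(user: str, password: str) -> str:
    """Single token for 'AUTH PLAIN <token>': NUL, user, NUL, password."""
    return _b64(f"\0{user}\0{password}")


def auth_login_tokens(user: str, password: str) -> tuple:
    """Two tokens for AUTH LOGIN, sent one per server prompt."""
    return _b64(user), _b64(password)
```

For example, `auth_plain_token("user", "pass")` gives `AHVzZXIAcGFzcw==`, which is sent after `AUTH PLAIN` in the telnet session. Base64 is only an encoding, not encryption, so this is best done with a test account.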
Please ignore pm request, I believe I found your actual SSL domain, and if so, your port 587 is open correctly.
I might suggest you set the IMAP connections to a higher value; we have ours set above 3000.
I do believe your issue is the non starting of TLS though, and not a limiting factor of amount of connections.
When it happens again, can you ssh in and try to start TLS (STARTTLS) using OpenSSL to see if there are any errors?
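By hand this is usually done with something like `openssl s_client -connect host:143 -starttls imap`. Because the outages are short and intermittent, a scripted probe that can run unattended may catch more of them. The following Python sketch tries the three secure paths mentioned in this thread (STARTTLS on IMAP, implicit TLS on 993, STARTTLS on the 587 submission port) and reports which ones fail. The function names and default ports are assumptions for illustration, and the `timeout` argument of `imaplib.IMAP4` needs Python 3.9 or later.

```python
import imaplib
import smtplib
import socket
import ssl


def tls_context(verify: bool = True) -> ssl.SSLContext:
    """Default client context; verify=False skips certificate checks."""
    ctx = ssl.create_default_context()
    if not verify:
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def check_imap_starttls(host, port, ctx, timeout):
    conn = imaplib.IMAP4(host, port, timeout=timeout)
    try:
        conn.starttls(ctx)  # raises if STARTTLS is not offered or is refused
    finally:
        conn.shutdown()  # close the socket without sending further commands


def check_implicit_tls(host, port, ctx, timeout):
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host):
            pass  # reaching this point means the TLS handshake completed


def check_smtp_starttls(host, port, ctx, timeout):
    with smtplib.SMTP(host, port, timeout=timeout) as smtp:
        smtp.ehlo()
        smtp.starttls(context=ctx)  # raises if STARTTLS is unavailable


def run_checks(host, imap_port=143, imaps_port=993, submission_port=587,
               verify=True, timeout=10):
    """Return a list of (check name, succeeded, detail) tuples."""
    ctx = tls_context(verify)
    checks = [
        ("IMAP STARTTLS", check_imap_starttls, imap_port),
        ("IMAPS", check_implicit_tls, imaps_port),
        ("SMTP STARTTLS", check_smtp_starttls, submission_port),
    ]
    results = []
    for name, check, port in checks:
        try:
            check(host, port, ctx, timeout)
            results.append((name, True, "ok"))
        except (OSError, imaplib.IMAP4.error, smtplib.SMTPException) as exc:
            results.append((name, False, f"{type(exc).__name__}: {exc}"))
    return results
```

Calling `run_checks("mail.domain.com")` every minute from cron and logging the result with a timestamp would show when each secure service stops and recovers, which can then be lined up against the server logs. With `verify=True` a certificate problem is reported as a failure too, so an expired or mismatched cert would show up here as well.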