Does anyone else experience random server instability?

Our server with Iworx-CP has been very stable since the day we set it up, but recently it has been acting a bit under the weather. I have yet to find any log messages associated with the instability. :(

Everything will be working just fine, and suddenly several or most of the Apache virtual hosts will stop serving requests. The virtual hosts that die are all on the same IP(s), while others (all on other IPs) continue to load. The auto-restart feature of IWorx-CP can do nothing because Apache is still technically running, so a manual restart of Apache is required to fix the problem.

The MySQL daemon will do the same thing. It will be functioning just fine, and suddenly stop answering requests completely. This causes any DB-enabled pages to break. What’s worse, all blogs and e-commerce stores fail altogether – with full page messages like “WordPress could not connect to the DB.” A manual restart of MySQL is required to fix the problem.

Regarding the server specs, it is a Dual P4 2.8 GHz w/ 1 GB of RAM, mirrored SCSI RAID1 w/ hot spare, redundant PSU, and Red Hat Enterprise Linux 4. We only have ~25 domains (businesses/friends we know), and no one (except me and one other) has shell access. The server is definitely not under load, but the problem always presents itself while we are doing a lot of active work on a customer’s e-store or blog. (That means we never arrive at work to find the server failing. It only fails when we are actively working with it.)

Does anyone else experience issues like this? We have had 10+ months uptime on this server without a single “problem,” and now this sort of thing is happening at least 1-3 times per week. Very frustrating…

Have you tried “tweaking” your configuration settings for the daemons you have mentioned?

Maybe an increase in the MySQL max connections, and the equivalent setting in Apache?

By default, the Interworx servers use default Apache/MySQL settings, so some tweaking will always be required in different environments.

EverythingWeb,

Thanks for your input. After observing the server more, it did appear that the MaxClients of Apache was being reached. I have quadrupled the limits and reduced the KeepAlive timeout. (The same goes for MySQL.) Now the server appears to be holding fine.
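A common rule of thumb for sizing MaxClients is to divide the memory that can be spared for Apache by the typical size of one Apache child. Here is a minimal sketch of that arithmetic, assuming the prefork MPM (one process per client). The function name, the amount reserved for MySQL and the OS, and the per-child size are all hypothetical placeholders and should be measured on the real box, for example from `ps aux`.

```python
def estimate_max_clients(total_ram_mb, reserved_mb, avg_child_mb):
    """Rough upper bound on Apache prefork children that fit in RAM.

    total_ram_mb: physical memory on the box
    reserved_mb:  memory set aside for MySQL, the OS, etc. (assumed value)
    avg_child_mb: average resident size of one httpd child (assumed; measure it)
    """
    if avg_child_mb <= 0:
        raise ValueError("avg_child_mb must be positive")
    available = total_ram_mb - reserved_mb
    if available <= 0:
        # Nothing left for Apache once the reserve is subtracted.
        return 0
    return int(available // avg_child_mb)
```

For the 1 GB server described in this thread, `estimate_max_clients(1024, 400, 12)` gives 52, where the 400 MB reserve and the 12 MB per child are placeholders only. Raising MaxClients far past what fits in RAM tends to trade refused connections for swapping, so it is worth checking the new limit against real memory use.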

So a few questions come to mind:

  1. Why does Apache fail forever after hitting the MaxClients value?
  2. The log shows evidence of SIGTERM and child processes seg faulting when the limit is reached. Is that to be expected? (I figured MaxClients would just mean reduced service until the spike leveled off.)
  3. Why does the InterWorx-CP auto-restart feature not help? I have seen the server down for up to 30min, but the setting is enabled.
  4. Are the Apache defaults tuned for a Pentium 200? :) We apparently exceeded their expectations with an average 1-2% CPU usage and very low bandwidth consumption. :D
    1. Why does Apache fail forever after hitting the MaxClients value?
    2. The log shows evidence of SIGTERM and child processes seg faulting when the limit is reached. Is that to be expected? (I figured MaxClients would just mean reduced service until the spike leveled off.)

    Segfaults should not be expected. This could point to a larger issue, jimp, as hitting MaxClients isn’t “bad” in the sense that it makes Apache unstable. It may slow your machine down depending on the memory that’s installed, but segfaults shouldn’t be the norm for any program.
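To see how often the MaxClients warnings, segfaults, and SIGTERMs actually line up, it can help to count them in the error_log. Here is a minimal sketch that does so. The message fragments are the ones Apache 2.x commonly writes, but the exact wording may differ between versions, and the function names are hypothetical.

```python
import re
from collections import Counter

# Fragments as they commonly appear in Apache 2.x error logs.
# Wording can vary by version, so adjust these to match your log.
PATTERNS = {
    "maxclients": re.compile(r"server reached MaxClients setting"),
    "segfault": re.compile(r"exit signal Segmentation fault"),
    "sigterm": re.compile(r"caught SIGTERM"),
}


def count_events(lines):
    """Count how many lines match each pattern of interest."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def count_events_in_file(path):
    """Convenience wrapper: scan a log file on disk."""
    with open(path, errors="replace") as handle:
        return count_events(handle)
```

Running `count_events_in_file` on each saved error_log snapshot would show whether segfaults only ever appear alongside the MaxClients warning or also on their own, which would point at a separate problem.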

    3. Why does the InterWorx-CP auto-restart feature not help? I have seen the server down for up to 30min, but the setting is enabled.

    The auto-restart system, at least as it applies to Apache, is powered by SIM, which is a (great) tool from rfx-networks. The way we have it configured, I believe it polls http://127.0.0.1/watch-flush to make sure that URL is alive. If it is, then SIM assumes that Apache is alive. Are you sure Apache is 100% down (even from localhost)?
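Since that check only looks at 127.0.0.1, it can pass while virtual hosts bound to other addresses have stopped responding. Here is a minimal sketch of a complementary check that probes every configured IP. This is not how SIM works internally; the function names are hypothetical, and the list of addresses would have to come from your own configuration.

```python
import http.client


def http_alive(ip, port=80, path="/", timeout=5.0):
    """Return True if an HTTP request to ip:port gets any response in time."""
    conn = http.client.HTTPConnection(ip, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        conn.getresponse()  # any status code counts as "alive"
        return True
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and malformed replies.
        return False
    finally:
        conn.close()


def find_dead_ips(ips, probe=http_alive):
    """Return the subset of ips for which the probe fails."""
    return [ip for ip in ips if not probe(ip)]
```

Calling `find_dead_ips` from cron with the box's addresses, and restarting Apache when the result is non-empty, would be one way to cover the gap. That is a workaround rather than a fix for whatever is wedging those listeners.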

    4. Are the Apache defaults tuned for a Pentium 200? We apparently exceeded their expectations with an average 1-2% CPU usage and very low bandwidth consumption.

    The defaults are pretty conservative so while I wouldn’t say they were tuned for a Pentium 200 they will almost always need some tweaking based on your setup.

    Regarding the issues in general, I’m honestly not sure. I wouldn’t mind checking out a ‘ps aux’ when the symptoms are occurring, and also a listing of the currently running MySQL queries (viewable from nodeworx or a utility such as mytop).
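Because the failure is intermittent, it is easy to forget to grab this data in the moment. Here is a minimal sketch of a script that records `ps aux` and `mysqladmin processlist` into one timestamped file. The output directory, the file naming, and the function names are assumptions, and `mysqladmin` will typically need credentials (for example from `~/.my.cnf`).

```python
import datetime
import pathlib
import subprocess

# Commands to record; extend this list as needed.
DEFAULT_COMMANDS = [
    ["ps", "aux"],
    ["mysqladmin", "processlist"],
]


def capture(cmd, timeout=15):
    """Run cmd and return its combined output, or a note if it could not run."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"[could not run {' '.join(cmd)}: {exc}]\n"
    return result.stdout.decode(errors="replace")


def write_snapshot(out_dir, commands=DEFAULT_COMMANDS):
    """Write each command's output to one timestamped file; return its path."""
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    path = pathlib.Path(out_dir) / f"snapshot-{stamp}.txt"
    with open(path, "w", encoding="utf-8") as handle:
        for cmd in commands:
            handle.write(f"==== {' '.join(cmd)} ====\n")
            handle.write(capture(cmd))
            handle.write("\n")
    return path
```

A hung MySQL may never answer `mysqladmin`, which is why each command has a timeout and a failure is written into the file instead of aborting the snapshot.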

    Chris

    We have had a few power-related incidents with our data center, which eventually led to minor file system corruption. I suppose it is possible a component is corrupted. Would it hurt to uninstall/reinstall the Apache RPM – to freshen the installation?

    No, it is not 100% down. Some IPs work (the box has 12 configured), but most of them are dead. 127.0.0.1 must still be alive.

    I’ll instruct everyone to record this extra data if it happens again. We have been taking snapshots of error_log and recording the date and time, but the running processes and active queries have not been recorded. It is somewhat unlikely we are going to see it again anytime soon, though. At least I hope not. Increasing the limits has allowed the Apache process to continue uninterrupted for days now, and I really need to respect the e-commerce stores that are depending on that server. (Otherwise, I would rather dig until the problem is identified and fixed properly.)

    Thanks EverythingWeb and Chris for your help!