Cluster Manager Backup

I’m trying to figure out the best way to have a semi-HA cluster manager. I’d love to use pacemaker, but I’m using ESXi guests for the cluster, and there’s unfortunately no sane way of handling fencing in that scenario (that I’m aware of, at least). I’m comfortable using ssh STONITH and suicide fencing for, say, pop3 or smtp-out, but I get a little queasy thinking about a split-brain scenario with the cluster manager. I may end up going this route anyway and just crossing my fingers really hard.

Our filer is NAS-only unless we want to drop $15k on an iSCSI license, and NFS datastore performance is abysmal, so that option is out. That may be a constraint of gig-E and having no way to align my guests, but there you go. Apparently the only way to make NFS datastores viable is to buy NetApp and NetApp support and use their tools - and even then, Linux LVM apparently isn’t supported, and grub needs re-installation after the NetApp tool is run. In short, at least in my environment, it’s a non-starter.

So -

What data/dirs on the cluster manager would be critical to getting a cluster back in service semi-quickly after a crash? I understand that some databases remain on the cluster manager even after mysql has been offloaded, so I assume those need dumping.
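In the meantime, a blunt nightly dump of whatever databases still live locally on the CM seems like cheap insurance. This is just my own sketch - credentials, paths, and retention are placeholders, not anything iworx documents:

```shell
#!/bin/sh
# Hypothetical CM backup sketch: dump all local MySQL databases plus /etc.
# Destination path and credentials file (~/.my.cnf) are my own assumptions.
BACKUP_DIR=/root/cm-backup
mkdir -p "$BACKUP_DIR"

# --single-transaction avoids locking InnoDB tables during the dump
mysqldump --all-databases --single-transaction \
    > "$BACKUP_DIR/mysql-$(date +%F).sql"

# Grab /etc as well, in case the cluster setup doesn't replicate it
tar czf "$BACKUP_DIR/etc-$(date +%F).tar.gz" /etc
```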

In the iworx-ha demo, they have 3 drbd replicated mounts - home, iworx, and mysql. Is this the extent of what needs backing up? Seems like at a minimum /etc would be important as well, but I’m guessing the clustering setup replicates that info across nodes.
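For anyone following along, here’s the rough shape I’d expect one of those three replicated resources to take on an el6-era DRBD 8.3. Everything here - resource name, backing LV, node names, addresses, port - is a guess of mine, not iworx’s actual config:

```shell
# /etc/drbd.d/home.res -- hypothetical sketch of one of the three mounts
resource home {
  protocol  C;            # synchronous replication
  device    /dev/drbd0;   # placeholder DRBD device
  disk      /dev/vg0/home;# placeholder backing LV
  meta-disk internal;
  on cm1 {
    address 10.0.0.1:7788;  # placeholder replication link
  }
  on cm2 {
    address 10.0.0.2:7788;
  }
}
```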

Ideas?

Any chance at all that someone from iworx could post the output of “crm configure show”? Barring that, could someone please post the actual paths for the 3 drbd mounts?

This is turning out to be quite a PITA.

DRBD:
RHEL/CentOS/SL don’t include drbd, so you’re stuck either compiling it or using elrepo-testing to get it for CentOS 6 - I’m not planning to use CentOS 5.7 for anything forward-thinking. Not a huge issue, but to be safe this also requires the protect-base plugin, just in case something newer than the CentOS or iworx packages lands in elrepo, especially when you consider this means using “testing.” That said, it appears to work just fine, at least at first blush, though I haven’t actually gotten a fully-working setup going. I’ll admit I’m a little squeamish about this.
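For reference, the setup boils down to something like the following. The package names (kmod-drbd84 etc.) and the sed one-liner are my own assumptions and may need adjusting for your elrepo snapshot - check the repo before copying any of this:

```shell
# Assumes CentOS 6 with the elrepo-release package already installed.
yum install -y yum-plugin-protectbase

# Mark the stock repos as protected so elrepo(-testing) can never
# override a base/updates package; this appends protect=1 after each
# gpgcheck line in the standard CentOS repo file.
sed -i '/^gpgcheck/a protect=1' /etc/yum.repos.d/CentOS-Base.repo

# Pull drbd explicitly from elrepo-testing (package names may differ
# depending on which drbd branch elrepo is shipping at the time)
yum --enablerepo=elrepo-testing install -y drbd84-utils kmod-drbd84
```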

Pacemaker/Corosync:
Still working on this. Current issues, some overcome, others not:

SIM versus corosync resource monitoring:
It appears easy enough to simply use corosync (or is it pacemaker? I’m not sure whose resources they really are in HA these days) resource monitoring instead of SIM for things like apache and mysql. I haven’t run into any issues here yet, but I don’t really like taking this functionality away from iworx, since I believe it means iworx nodes won’t restart services automatically, and I don’t want the two arguing over who restarts daemons. Kind of fugly.
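Concretely, handing apache monitoring to the cluster looks something like this in crm shell. The resource name, config path, and intervals are my own choices, not anything iworx-blessed:

```shell
# Hypothetical sketch: let pacemaker monitor/restart apache instead of SIM.
# The monitor op is what replaces SIM's service checking here.
crm configure primitive p_httpd ocf:heartbeat:apache \
    params configfile="/etc/httpd/conf/httpd.conf" \
    op monitor interval="30s" timeout="20s" \
    op start timeout="40s" \
    op stop timeout="60s"
```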

Getting HA cluster IPs to show up in iworx apparently requires that they have a name/label. In other words, IPs that show up in “ip addr” but not in “ifconfig” don’t show up in the iworx IP management screen. That’s easy to fix with the IPaddr2 options “nic=” and “iflabel=” - once that’s done, they show up in iworx just fine. I’m hoping the “iworx” service will handle moving the IPs generated by iworx itself. I haven’t proved this out yet either, but hopefully will this week. TBH, I haven’t even proved out that “iworx” is LSB-compliant yet, though usually that’s workable with a little hackery (I’ve done it in the past with saslauthd and postfix-policyd).
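For anyone hitting the same thing, the labeled-IP fix looks like this. Address, netmask, NIC, and label are placeholders for your own values:

```shell
# Cluster IP with a label so it appears as eth0:ha in ifconfig
# (and therefore in the iworx IP management screen). 192.0.2.10/eth0
# are placeholders.
crm configure primitive p_clusterip ocf:heartbeat:IPaddr2 \
    params ip="192.0.2.10" cidr_netmask="24" nic="eth0" iflabel="ha" \
    op monitor interval="10s"
```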

Apparently the version of the IPsrcaddr OCF agent shipped with CentOS 6 (or perhaps it’s CentOS 6 itself - I haven’t been able to determine which, since an older version works just fine on Ubuntu) bombs out when adding a route with a source address, so I get corosync logs with stuff about “either “to” is a duplicate or “x.x.x.x/x” is a garbage” and the route add fails. Running a manual “ip route add” works, and I’ve found a few posts in various newsgroups indicating there’s a patch that fixes this, including new docs implying that newer versions of this OCF also accept a netmask directive. Bummer, since this is essential for an active/passive CM + iworx cluster, both for iworx licensing and for things like NFS mounts.

This last obstacle appears to be a show-stopper, or at least any hackish workarounds I’ve come up with so far are way too gross to put into production. I could get around the licensing issue by simply purchasing an additional license for a host that (hopefully) lies dormant 99% of the time - if I could convince management to pony up, that is - but so far I can’t figure out how to deal with the NFS issue.

CentOS doesn’t appear to be working very hard to maintain parity with RHEL at this point, and TBH I have no idea what the status of any of this is in RHEL-land these days, though I’ve heard that corosync support, at least, is pay-to-play there. I believe DRBD is in the mainline kernel in more recent kernels. I may see whether Scientific Linux, which is much closer to parity with RHEL, helps with any of these issues, and if it does, see how upset iworx would be with me running on a distro that should be 100% compatible but isn’t on their supported-distros list.

TLDR: no workee currently

Well, after putting way too much work into the source-address issue, it turns out the problem was me being stupid (or me having been lucky in the past): IPsrcaddr was trying to start before IPaddr2 and bombing because of that. Fixed it with an “order” statement, and now it works like a champ. ::shame::
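For posterity, a mandatory ordering constraint in crm shell looks roughly like this - resource names are placeholders of my own, and the source address would match whatever your IPaddr2 resource brings up:

```shell
# IPsrcaddr needs the address to exist before it can set the route's src,
# so force IPaddr2 to start first with a mandatory (inf:) ordering.
crm configure primitive p_srcaddr ocf:heartbeat:IPsrcaddr \
    params ipaddress="192.0.2.10"
crm configure order o_ip_before_src inf: p_clusterip p_srcaddr
```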

Oh, and while the new IPsrcaddr is nice, it’s not needed. I grabbed it from here:

https://raw.github.com/ClusterLabs/resource-agents/master/heartbeat/IPsrcaddr

Also, turns out I’m a fatty, fatty, fat, fat liar - drbd was in the regular elrepo repo.