Everything you think you know about clustering is wrong: Busting the myths of unreliability and complexity

Eric_Hennessey1 · 04-10-2012

In my last post I mentioned a few common misconceptions about HA clustering that I'd be debunking; namely that it's unreliable, complex, and expensive. There are others that we'll get to in later posts, but for this one I want to tackle the myths of unreliability and complexity, since they kind of go hand-in-hand.

The vast majority of our customers using Veritas Cluster Server (VCS) for high availability have been using it for quite some time and are completely happy with it. But from time to time we do hear from customers who say they've used HA clustering in the past - either VCS or Brand X - and stopped using it because it "broke". Frankly, this reaction baffles me. As an IT guy who's been in the business for - well, let's just say a long time, OK? - I learned early on that if something worked yesterday and isn't working today, something changed. You don't just throw out the thing that broke; you stop and ask troubleshooting question #1: "What changed?"

Let's take a simple and common enough scenario. A 2-node cluster hosting a single application is deployed. That application requires two file systems, which are also managed by the cluster. Everything works fine after the cluster is first configured and deployed, and the cluster proves its worth one night when the active node crashes and the app and all its required storage and network resources successfully fail over to the idle node.

But a few months later the app developers make some changes, and the app's file layout changes. The app now uses three file systems instead of two, but that third file system never gets added to the cluster configuration. Some time later, a hardware fault triggers a failover, and this time the application fails to start on the idle node because the cluster is unaware of the new file system and doesn't mount it there.
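To make that failure concrete, here's a minimal sketch in Python of the kind of check that would have caught the missing file system before the failover did. It is only an illustration of the idea, not how VCS does it: the function name and the mount points are hypothetical, and in real life both lists would come from the application team and from the cluster configuration.

```python
def find_config_drift(required_mounts, cluster_mounts):
    """Return the mount points the app needs that the cluster does not manage."""
    return sorted(set(required_mounts) - set(cluster_mounts))


# Hypothetical mount points, for illustration only.
app_needs = ["/app/data", "/app/logs", "/app/index"]  # after the developers' change
cluster_manages = ["/app/data", "/app/logs"]          # what the cluster was told about

missing = find_config_drift(app_needs, cluster_manages)
if missing:
    print("Not managed by the cluster:", ", ".join(missing))
```

Run on a schedule, or as part of change control, a comparison like this turns "the cluster broke" into "somebody forgot to update the cluster", and it does so before the night the active node goes down.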

And that's the anecdotal definition of "clustering didn't work", which brings me to Hennessey's First Law of Computing: Computers rarely do what you want them to do, but they always do what you tell them to do. Clearly, this scenario describes a shortcoming in internal processes, and not so much in the clustering solution.

While Brand X might be highly error-prone and vulnerable to (ahem) environmental challenges, we've taken great pains to harden VCS against things like the "configuration drift" described above, as well as other factors that commonly plague lesser HA solutions. Our Fire Drill and Health Check features largely negate the problems caused by configuration drift and help keep the cluster configuration consistent. To guard against administrative errors, such as starting an application manually on an idle node when it's already running elsewhere in the cluster, we added concurrency violation prevention in VCS 6.0. Those are just two examples of measures we've taken in VCS to make our HA as bullet-proof as possible.
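If "concurrency violation" sounds abstract, the idea can be sketched in a few lines of Python. This is a conceptual toy under stated assumptions, not the VCS implementation: the exception class, the function, and the node names are all hypothetical, and the cluster's view of where the app is online is reduced to a plain set.

```python
class ConcurrencyViolation(Exception):
    """Raised when an app would end up running on two nodes at once."""


def start_app(node, online_nodes):
    """Start the app on `node` unless it is already online on another node.

    `online_nodes` is a set of node names where the app is currently running.
    """
    others = sorted(n for n in online_nodes if n != node)
    if others:
        raise ConcurrencyViolation(
            f"refusing to start on {node}: already online on {', '.join(others)}"
        )
    online_nodes.add(node)
    return online_nodes


# Hypothetical two-node cluster: the app is already running on node1.
running_on = {"node1"}
try:
    start_app("node2", running_on)
except ConcurrencyViolation as err:
    print(err)
```

The point of the sketch is the ordering: the check happens before the start, so the administrator's mistake is refused rather than cleaned up after the fact.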

While HA clustering is no more inherently complex than anything else in your data center, it does add a layer of administration for which allowances must be made. With Veritas Cluster Server, we've made great strides in reducing complexity and eliminating those little things that can lead users astray.

Next up: Busting the "clustering is expensive" myth.
