At Stratus, we have been leaders in reliable computing infrastructure for decades. However, like many terms in technology, High Availability (HA) is defined very broadly. Last year we saw a survey from a highly respected analyst firm in which the majority of respondents thought that High Availability meant having a disaster recovery plan. We have also found that the definition shifts depending on which computing platform people come from (e.g., mainframe veterans and DevOps teams see this very differently). IDC has a set of Availability Levels that it has used for years, but they are quite coarse, since most of the technologies out there fall into the very broad AL3 category.
So, here are our definitions, grouped by end-user impact.
Significant End User Impact (Generally measured in hours of downtime – IDC calls this AL1 and AL2)
Unprotected – This is likely pretty easy to understand. This is a workload that has no special reliability features implemented at the application, hypervisor, or infrastructure layer. If it goes down, it’s down.
Backup – This is a workload that is periodically copied (or snapshotted) to a different node or data center. This is a nice compliance measure and can help you recover, provided you can tolerate hours or more of downtime.
Disaster Recovery – This is a more robust form of backup that is automated for quicker recovery after a major failure (for example, human error or a weather-related data center outage).
Minimal End User Impact (Generally measured in seconds to minutes of downtime – IDC calls this AL3)
Automated High Availability – This is very common in the virtualized world. When there is a failure, a new instance of the workload is redeployed to a new node or data center. A common implementation is VMware’s HA feature. This approach has minimal infrastructure impact, but the user interruption is fairly high and all in-flight data is lost. It is a good solution for load-balanced, scaled-out applications like web servers.
Instant High Availability – This is the world of clusters on bare metal, or of redundant instances and replicated storage in the virtualized world. The interruption of service is minimal (even sub-second in some cases). However, any in-flight data and/or transactions are lost. If your application is stateless but not load balanced, this is a great solution.
Zero End User Impact (No Downtime – IDC calls this AL4)
Fault Tolerance – This capability was once found only in the mainframe and minicomputer world. However, Stratus makes hardware, software, and cloud solutions that provide this level of protection to off-the-shelf operating systems and hypervisors at a price point comparable to lower protection levels. Fault tolerance is complete redundancy of the workload, including the in-flight data and application state. This means that operation is continuous and uninterrupted even in the event of a failure.
Multi-Site Fault Tolerance – This is the highest level of protection a workload can get. It provides Fault Tolerance, so there is no loss of state or data, but the redundant workloads are hosted at different sites. Naturally, this type of solution carries a higher network cost, but when only the highest level of protection will do, this is the best option.
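For readers who think in code, the groupings above can be written down as a small lookup table. This is only an illustrative sketch in Python: the names Impact, Tier, TIERS, tiers_with_impact and minimum_tier are made up for this example and are not part of any Stratus or IDC product or API. The in-flight-state flag for the first three tiers is our assumption, since the definitions above only state it explicitly for the High Availability and Fault Tolerance entries.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Tuple


class Impact(IntEnum):
    """End-user impact groups from the article; lower value = less impact."""
    ZERO = 0         # no downtime
    MINIMAL = 1      # seconds to minutes of downtime
    SIGNIFICANT = 2  # hours of downtime


@dataclass(frozen=True)
class Tier:
    name: str
    impact: Impact
    idc_levels: Tuple[str, ...]   # IDC Availability Levels, as grouped above
    keeps_in_flight_state: bool   # True only where in-flight data survives a failure


# Listed in the same order as the article, from least to most protection.
TIERS = (
    # keeps_in_flight_state=False is an assumption for these three tiers.
    Tier("Unprotected", Impact.SIGNIFICANT, ("AL1", "AL2"), False),
    Tier("Backup", Impact.SIGNIFICANT, ("AL1", "AL2"), False),
    Tier("Disaster Recovery", Impact.SIGNIFICANT, ("AL1", "AL2"), False),
    Tier("Automated High Availability", Impact.MINIMAL, ("AL3",), False),
    Tier("Instant High Availability", Impact.MINIMAL, ("AL3",), False),
    Tier("Fault Tolerance", Impact.ZERO, ("AL4",), True),
    Tier("Multi-Site Fault Tolerance", Impact.ZERO, ("AL4",), True),
)


def tiers_with_impact(impact: Impact) -> List[Tier]:
    """Return every tier that belongs to the given end-user impact group."""
    return [t for t in TIERS if t.impact == impact]


def minimum_tier(tolerated: Impact, needs_in_flight_state: bool) -> Tier:
    """Return the first (least protective) tier that meets both requirements."""
    for tier in TIERS:
        impact_ok = tier.impact <= tolerated
        state_ok = tier.keeps_in_flight_state or not needs_in_flight_state
        if impact_ok and state_ok:
            return tier
    raise ValueError("no tier satisfies the requirements")


if __name__ == "__main__":
    # A workload that can tolerate a brief interruption but not lost transactions.
    print(minimum_tier(Impact.MINIMAL, needs_in_flight_state=True).name)
```

The lookup makes one point from the definitions easy to see: as soon as in-flight data or transactions must survive a failure, both High Availability tiers drop out and the first match is Fault Tolerance, however much interruption the workload could otherwise tolerate.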
Hopefully this helps demystify the types of protection you can get. When evaluating what you need, consider not only what specifically is being protected, but also the recovery time and the infrastructure costs, mainly processing and networking.