So, what is fault tolerance?

Virtualization and fault-tolerance are decades-old technologies reinvented for the demands of modern enterprise computing … industry-standard platforms, business continuity, application availability, server consolidation, end-to-end business process reliability, flexibility and rapid response.

We’ve always had a bit of a problem describing “fault-tolerant.” Even knowledgeable people in the availability industry struggle with a concept whose very name is misleading. “Fault-tolerant” leads people to believe that a system works to manage or “tolerate” a failover, when that simply isn’t true. Since everything is duplicated and runs twice, nothing happens to the system when there is a problem. The application continues running on the working server, and performance stays the same, while the faulty one calls home for proactive service.

Analogies for fault tolerance

For an analogy, think of the Radio City Rockettes as our server, and the kicking action as our application. If, in the middle of the show, a Rockette falls off the stage, kicking still happens. However, in our model, the horror of watching the fall has probably disrupted the audience, and the show is no longer a success.

So let’s try another analogy. As it turns out, your body’s most important organs are fault-tolerant (with two notable exceptions). You have two lungs, two kidneys, two eyes, two legs, and women have two ovaries. If we think of the function of an organ (say, sight) as the application and the organ (here, the eyes) as the server, people with one working eye can still see. But, as Kevin Butler, our web guy, pointed out, one-eyed people can’t gauge distance, have a limited range of vision, and face many other problems. So, your body isn’t quite as fault-tolerant as we had hoped.

Another example we came up with was a race. Imagine two identical runners in a race, each running for the same team. Here, the race is the application and the runners are the fault-tolerant servers. If the gun goes off and one runner trips, a runner still makes it to the finish line and earns a medal for his team. And if both runners had crossed the finish line, still only one medal would have been earned. It, too, doesn’t quite fit.

The duplicated, wasted energy is, in every other facet of life, eliminated or nonexistent. Fault-tolerant servers literally do all of the work of each application twice, for just one result. Half of the work is completely superfluous unless the twin server has faulted, in which case the surviving server merely continues doing the same work alone until the faulted twin is back online.
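For readers who think better in code than in analogies, the "do everything twice, deliver one result" idea can be sketched as a toy model in Python. This is only an illustration: the names `Replica` and `DuplicatedPair` are made up for this post, and real fault-tolerant servers do the duplication in hardware, not in application code like this.

```python
class Replica:
    """One half of a hypothetical duplicated pair (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.work_done = 0  # how many tasks this replica has executed

    def run(self, task, *args):
        self.work_done += 1
        return task(*args)


class DuplicatedPair:
    """Runs every task on each healthy replica but returns a single result."""

    def __init__(self):
        self.replicas = [Replica("A"), Replica("B")]

    def execute(self, task, *args):
        # Every healthy replica does the full work, so normally it is done twice.
        results = [r.run(task, *args) for r in self.replicas if r.healthy]
        if not results:
            raise RuntimeError("both replicas have faulted")
        # Only one result is delivered; the duplicate is superfluous
        # unless the twin has faulted.
        return results[0]


pair = DuplicatedPair()
first = pair.execute(lambda x: x * 2, 21)   # both replicas compute the answer

pair.replicas[0].healthy = False            # replica A faults
second = pair.execute(lambda x: x * 2, 21)  # replica B carries on alone
```

In this sketch the caller sees the same answer before and after the fault, which is the point: nothing has to "fail over," because the surviving half was already doing the work.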

Does anyone have a better analogy? How do you explain fault tolerance? Does this ever occur in nature?


Fault-tolerant hardware vs software

While not by design, virtualization and fault tolerance are made for each other. Virtualization vendors are pushing up the availability stack and fault-tolerant solution providers are wrapping themselves around virtualization. Well, kumbaya.

Having been in the availability business for nearly three decades, Stratus knows a thing or two about supporting mission-critical computing environments. Our ftServer systems lead the x86 world for field-tested uptime reliability. Believe us when we say, it’s not easy to do.

New declarations of fault-tolerant systems today are coming from companies with solutions in software, not in hardware as Stratus does it. By the narrowest of definitions, these software solutions are fault-tolerant, and Hurricane Ike could have been described as inclement weather; both statements are correct, but neither captures the true nature of the situation.

Software-based FT comes up short in several ways. It has not conquered how to prevent transient errors from crashing a system or propagating the error to other servers or across the network; how to root-cause an outage to prevent it from happening again; and how to quash latency when applications or VMs move from one side of the cable to the other.

Most important, software-based solutions don’t support symmetric multi-processing (SMP); i.e. they cannot scale beyond a single processor core per socket. That means that if the application cannot execute on a single core, it won’t be supported in a software fault-tolerant environment.

Delivering continuous availability – mission-critical application availability – requires more than saying you have fault-tolerance. Continuous availability demands a combination of hardware, software and, as important, service without quibbling over whose problem it is.

Learn about Stratus’ full-circle including hardware, software, and service.


A brief history lesson in fault tolerance

Having been in this industry for going on 30 years now, I have seen many things “recycle.” Sometimes the terminology changes, sometimes it stays the same. What I find interesting is that when something comes back around again, it is in many ways thought of as “new and innovative.” The first thing that comes to mind for me is virtualization. Over the last 18 to 24 months, this has been the hottest topic in the IT industry, and for good reason. But in reality it’s not new. Virtualization was new in the late 1960s, when IBM put it on the System/360.

Most recently, another 35-year-old technology is making a comeback. OK, let me rephrase that, because it never really went away; it just drifted to the outer rings of the radar screen. I’m talking about fault tolerance. Fault-tolerant machines began to make their mark on the IT industry in the late 1970s. These machines were large, proprietary and very expensive, but then again, so was every other computer of the late ’70s! Check out this commentary on how and why fault tolerance is back, by Director of Product Management Denny Lane: Rediscovering FT


You might be fault-tolerant if…

First I want to go on record and apologize to Jeff Foxworthy for butchering his tag line, but I thought it was an interesting way to get a couple of points across.

  • If you built it from ‘the ground up’ with no single point of failure – “You might be fault-tolerant”
  • If the tens of thousands of machines in production at customer sites are monitored daily, and you post an uptime of 99.9999% on your company home page – “You might be fault-tolerant”
  • If you understand that being fault-tolerant is more than a piece of hardware or software, but an entire infrastructure including services – “You might be fault-tolerant”
  • If customers around the world have been trusting you with their most mission-critical applications for almost 30 years – “You might be fault-tolerant”
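
To put the 99.9999% uptime figure from the list above in perspective, here is a back-of-the-envelope calculation in Python. The function name `downtime_seconds_per_year` is just something made up for this post, and the math assumes a 365-day year with downtime spread evenly across it.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, ignoring leap years


def downtime_seconds_per_year(uptime_percent):
    """Return the yearly downtime, in seconds, implied by an uptime percentage."""
    return SECONDS_PER_YEAR * (1 - uptime_percent / 100)


six_nines = downtime_seconds_per_year(99.9999)   # the uptime quoted above
three_nines = downtime_seconds_per_year(99.9)    # for comparison

print(f"99.9999% uptime: about {six_nines:.1f} seconds of downtime per year")
print(f"99.9% uptime: about {three_nines / 3600:.2f} hours of downtime per year")
```

Six nines works out to roughly half a minute of downtime a year, while three nines allows most of a working day.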