Fault-Tolerant Systems

What is fault-tolerance?

Fault-tolerance describes a level of availability characterized by “five nines” uptime (99.999%) or better. Fault-tolerant systems deliver this level of availability because they can “tolerate,” or withstand, both hardware and software “faults,” or failures. They typically do this either by proactively monitoring critical systems and preventing them from failing in the first place, or by completely mitigating the risk of a catastrophic component or system failure.
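To put “five nines” in concrete terms, the short Python sketch below converts an availability percentage into an annual downtime budget. The function name and the 365.25-day year are illustrative assumptions, not part of any Stratus tooling.

```python
# Convert an availability fraction into allowed downtime per year.
# Assumption: a year of 365.25 days (525,960 minutes).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def annual_downtime_minutes(availability: float) -> float:
    """Return the minutes of downtime per year implied by an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for label, availability in [("three nines", 0.999),
                                ("four nines", 0.9999),
                                ("five nines", 0.99999)]:
        minutes = annual_downtime_minutes(availability)
        print(f"{label}: {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, five nines allows a little over five minutes of unplanned downtime per year, compared with nearly nine hours at three nines.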

Software-based vs. hardware-based fault-tolerance

Fault-tolerance can be achieved using both software-based and hardware-based approaches.

In a software-based approach, all data committed to disk is mirrored across redundant systems. More sophisticated software-based approaches also replicate uncommitted data, or data in memory, to a redundant system. In the event of a primary system failure, a secondary backup system resumes operation, taking over from the exact moment the primary system fails, so that no transactions or data are either duplicated or lost.

In a hardware-based approach, redundant systems run simultaneously. Parallel servers perform identical tasks, so that if one server fails, the other server continues to process transactions or deliver services. This approach relies on the probability of both systems failing at the same time being extremely low. Only one server is actually needed to deliver applications, but having two servers helps ensure that at least one will always be running.
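The statistical argument can be illustrated with a small calculation. The sketch below assumes that the two servers fail independently of each other, which is a simplification: shared power, shared software defects, or operator error can cause correlated failures in practice. The function name and the sample availability figure are hypothetical.

```python
def redundant_availability(single: float, replicas: int = 2) -> float:
    """Availability of a group in which any one healthy replica is enough.

    Assumption: replicas fail independently, so the group is down only
    when every replica is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - single) ** replicas


if __name__ == "__main__":
    # Hypothetical figure: each server alone is available 99.9% of the time.
    pair = redundant_availability(0.999, replicas=2)
    print(f"single server: 99.9000%  redundant pair: {pair * 100:.4f}%")
```

With independent failures, two servers that are each unavailable 0.1% of the time are both unavailable only 0.0001% of the time.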

How everRun® Enterprise and ztC™ Edge deliver fault-tolerant workloads

Stratus everRun Enterprise software and Stratus ztC Edge computing platforms both use software-based approaches to deliver fault-tolerant applications and protect data.

The main challenge with software-based approaches is efficiently replicating data while minimizing system overhead. Don’t replicate enough and your recovery times increase. Replicate too often and you use too much of your system resources just to ensure availability.

everRun Enterprise and Stratus Redundant Linux, the operating platform that powers Stratus’ ztC Edge solution, replicate all data written to disk (for highly available workloads) and use a unique checkpointing engine to continuously replicate data in memory and CPU states (for fault-tolerant workloads). All I/O operations are queued until checkpoints are completed and verified. Proprietary algorithms dynamically adjust checkpointing frequency based on the type and amount of data changes and on I/O throughput. If one node fails, a two-second pause is used to prevent split-brain scenarios, resulting in a recovery time of under five seconds, which is below the TCP/IP threshold for queueing and resubmitting requests.
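The idea of holding I/O until a checkpoint is verified can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Stratus’ proprietary engine: the class, its fields, and the in-process “replica” are all hypothetical, and a real system would copy memory and CPU state to a second node over a network link.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointedNode:
    """Toy primary node that queues outbound I/O until a checkpoint is verified."""

    state: dict = field(default_factory=dict)          # stands in for memory/CPU state
    replica_state: dict = field(default_factory=dict)  # stands in for the secondary node
    pending_io: list = field(default_factory=list)     # outputs not yet visible externally
    released_io: list = field(default_factory=list)    # outputs released after a checkpoint

    def apply(self, key: str, value: object, output: str) -> None:
        """Change local state and queue the resulting output instead of sending it."""
        self.state[key] = value
        self.pending_io.append(output)

    def checkpoint(self) -> bool:
        """Copy state to the replica, verify the copy, then release queued I/O."""
        self.replica_state = dict(self.state)  # assumption: stands in for a network transfer
        verified = self.replica_state == self.state
        if verified:
            self.released_io.extend(self.pending_io)
            self.pending_io.clear()
        return verified


if __name__ == "__main__":
    node = CheckpointedNode()
    node.apply("balance", 100, "ack-1")
    print("released before checkpoint:", node.released_io)
    node.checkpoint()
    print("released after checkpoint:", node.released_io)
```

In this model, no output leaves the primary until the replica holds the state that produced it, so a takeover by the secondary would not contradict anything a client has already seen. The cost is added latency on each output, which is one reason checkpoint frequency matters.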

In addition to their unique, highly efficient checkpointing engine, Stratus solutions are differentiated by their operational simplicity. No application or guest operating system modifications are required to make them cluster-aware. No additional failover scripts are needed to ensure application availability and data integrity. All that’s needed to make applications fault-tolerant is to install them in a virtual machine and launch them.

How ftServer® delivers fault-tolerant workloads

Stratus ftServer uses a hardware-based approach to deliver fault-tolerant applications and data.

The main challenge with hardware-based approaches is ensuring the precise synchronization of processes and threads – making sure that the exact same things are happening at the exact same time on both nodes of a redundant system.

Stratus ftServer uses proprietary field-programmable gate arrays (FPGAs) to ensure lock-step processing across two identical halves of an ftServer system. The two identical customer-replaceable units (CRUs) run in parallel. Each acts as the primary or secondary server as needed. Each executes the same process at the same time. With ftServer, there is no recovery time when there’s a failure in a single component or CRU. The available CRU simply takes over as the primary server until the unavailable CRU is replaced. For organizations that cannot tolerate even a second of unplanned downtime, Stratus ftServer is a viable option.
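The lock-step behavior described above can be modeled loosely in software. The Python sketch below is only an analogy: ftServer does this in hardware with FPGAs, and the `Unit` class and `lockstep_step` function are hypothetical names introduced for illustration.

```python
from typing import Callable, List


class Unit:
    """Toy stand-in for one half (CRU) of a redundant pair."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True

    def execute(self, op: Callable[[int], int], value: int) -> int:
        return op(value)


def lockstep_step(units: List[Unit], op: Callable[[int], int], value: int) -> int:
    """Run the same operation on every healthy unit and return the agreed result."""
    live = [unit for unit in units if unit.healthy]
    if not live:
        raise RuntimeError("no healthy unit available")
    results = [unit.execute(op, value) for unit in live]
    # Healthy units ran the same operation on the same input, so they must agree.
    if any(result != results[0] for result in results):
        raise RuntimeError("units diverged")
    return results[0]


if __name__ == "__main__":
    pair = [Unit("CRU-0"), Unit("CRU-1")]
    print(lockstep_step(pair, lambda v: v + 1, 41))  # both units running
    pair[0].healthy = False                          # simulate a failed unit
    print(lockstep_step(pair, lambda v: v + 1, 41))  # remaining unit continues
```

Because both units already hold identical state and perform identical work, losing one requires no state transfer or restart; the surviving unit simply keeps producing results.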

In addition to its use of FPGAs and lock-step approach, Stratus ftServer is differentiated by its operational simplicity. Applications, virtualization platforms, or guest operating systems that are installed in ftServer do not require special modification or configuration to make them fault-tolerant.