Two major methods of rating the degree to which a device can deliver continuous operation are Availability and Failure Rate. These are fundamentally different concepts. Availability (A) is simply the proportion of a total time interval that a given device will be operational. Mathematically speaking, that is:

A = tu / ttotal
where tu is total uptime, and ttotal is the total time when the device is intended to be operational. Obviously, if it never fails, tu = ttotal and everyone is happy. But what if it does? That's where Failure Rate comes in. Assuming a device follows an Exponential Failure Distribution (just a geeky way of saying that it fails randomly at a constant rate – but more on that in later posts), we use the Greek symbol λ to represent its Failure Rate in units of failures / device-hr.

A lot of times, we use Mean Time to Failure (MTTF) as a measurement of Failure Rate, and MTTF is simply the reciprocal of λ (which makes intuitive sense – if a device is expected to fail three times per hour on average, its mean uninterrupted runtime interval should be 20 minutes, or 1/3 device-hrs / failure). MTTF is a measurement of Reliability, which is frequently defined as the probability of surviving a specific runtime interval without failure. Both MTTF/Reliability and Availability are important metrics for measuring a system's ability to continuously serve its intended function; however, being highly available does not imply being highly reliable (and vice versa).
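To make the reciprocal relationship concrete, here is a minimal Python sketch of it, along with the exponential Reliability function R(t) = e^(-λt) mentioned above. The function names are my own, chosen for illustration:

```python
import math

def mttf_from_rate(lam):
    """MTTF is the reciprocal of the failure rate λ (device-hrs / failure)."""
    return 1.0 / lam

def reliability(lam, t_hrs):
    """Probability of surviving t_hrs without failure, assuming an
    Exponential Failure Distribution: R(t) = e^(-λt)."""
    return math.exp(-lam * t_hrs)

# The example from the text: a device that fails 3 times per device-hr
# has an MTTF of 1/3 device-hrs per failure, i.e. 20 minutes.
print(mttf_from_rate(3.0) * 60)  # mean uninterrupted runtime, in minutes
```

Note that reliability() depends on the interval length: the same device is far more likely to survive one minute than one hour.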
Some systems are highly available, but may endure failures frequently. For example, suppose you have a personal computer that is somewhat unstable. It seems to fail approximately every three months (~2190 hrs); however, the failure type is such that a simple reboot will restore system operation. Let us say that this reboot operation takes a minute and a quarter (0.020833 hours). This system's MTTF is 2190 device-hours per failure, which is somewhat poor (i.e. on average, it will endure 4 failures in a year). In this case, to determine its Availability, we can examine one average runtime cycle, or the mean time taken for the computer to fail and then be "repaired" to operational status (i.e. a reboot, in this case). Our MTTF, or 2,190 hours, is our average runtime interval prior to a crash, which we set as tu. ttotal is the length of the complete average runtime cycle, or the sum of tu and the average downtime following a crash (the repair time). Using the following equation, we can calculate its availability:

A = tu / ttotal = 2190 / (2190 + 0.020833) ≈ 0.9999905
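The flaky-PC arithmetic above can be sketched in a few lines of Python. This is just the uptime-over-cycle ratio from the text wrapped in a function; the name availability() is my own:

```python
def availability(mttf_hrs, repair_hrs):
    """Availability over one average runtime cycle:
    A = tu / ttotal = MTTF / (MTTF + average repair time)."""
    return mttf_hrs / (mttf_hrs + repair_hrs)

# The flaky-PC example: a crash every ~2190 hrs, fixed by a 1.25-minute reboot.
a = availability(2190.0, 1.25 / 60.0)
print(f"{a:.7f}")  # roughly five nines, despite ~4 failures per year
```

Despite its mediocre MTTF, the machine spends well over 99.999% of its intended operating time up, because the repair time is tiny relative to the runtime interval.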