What causes computer systems to go down? Is it mostly hardware failures, software crashes, or other factors like user error or environmental problems? Lots of studies have been conducted over the years, but many of them suffer from small sample sizes, both in the number of systems and in runtime. Researchers at Carnegie Mellon University wanted to dig deep into the problem and analyzed a set of system failure data from Los Alamos National Laboratory spanning nine years (although most of the data seems to be between 2001 and 2005), covering almost 5,000 systems, over 100 million system-hours of runtime, and a whopping 23,000 distinct failures! This analysis produced a paper by Bianca Schroeder and Garth A. Gibson entitled A Large-Scale Study of Failures in High-Performance Computing Systems (http://www.pdl.cmu.edu/PDL-FTP/stray/dsn06.pdf). Best of all, every single failure, along with its date, its root cause (if known), and the amount of downtime it induced, is publicly available and downloadable on the Los Alamos website (http://institutes.lanl.gov/data/fdata/).
I recently downloaded the raw data, and it is quite impressive. Their study included several different types of servers, from SMP 2-way and 4-way systems to NUMA systems containing as many as 256 processors per node. But what was really striking was the detail surrounding the failures. Not only was the amount of node downtime resulting from the failures charted to the minute, but each failure’s root cause (if known) was recorded down to the component level. For example, hardware failures were isolated to the failing component (e.g. DIMM, CPU, PCI card, disk drive) and software failures to the specific piece of software that failed (e.g. kernel/OS, file system).
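If you want to poke at the raw data yourself, a few lines of Python make it easy to get a feel for the breakdown. This is only a rough sketch: the file name and column names below are assumptions about how you might save and label the downloaded records, not the actual LANL field names, so check the real headers before running it.

```python
# Rough sketch: tally failures by root-cause category and sum the downtime.
# NOTE: "lanl_failures.csv", "root_cause_category", and "downtime_min" are
# placeholder names for illustration; match them to the real downloaded data.
import pandas as pd

failures = pd.read_csv("lanl_failures.csv")   # one row per recorded failure

# Breakdown of failure counts by root-cause category (hardware, software, etc.)
by_cause = failures.groupby("root_cause_category").size().sort_values(ascending=False)
print(by_cause)

# Total node downtime (in minutes) attributed to each category
downtime = failures.groupby("root_cause_category")["downtime_min"].sum()
print(downtime)
```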
The following graph from their report (Schroeder and Gibson, p. 3, Figure 1) shows the breakdown of all failures across different system architectures (each letter, D through H, represents a unique system architecture):
Two major methods of rating the degree to which a device can deliver continuous operation are Availability and Failure Rate. These are fundamentally different concepts. Availability (A) is simply the proportion of a total time interval that a given device will be operational. Mathematically speaking, that is:

A = tu / ttotal
where tu is total uptime, and ttotal is the total time during which the device is intended to be operational. Obviously, if it never fails, tu = ttotal and everyone is happy. But what if it does? That’s where Failure Rate comes in. Assuming a device follows an Exponential Failure Distribution (just a geeky way of saying that it fails randomly at a constant rate – but more on that in later posts), we use the Greek symbol λ to represent its Failure Rate in units of failures / device-hr. Often, we use Mean Time to Failure (MTTF) as a measurement of Failure Rate, and MTTF is simply the reciprocal of λ (which inherently makes sense – if a device is expected to fail three times per hour on average, its mean uninterrupted runtime interval should be 20 minutes, or 1/3 device-hrs / failure). MTTF is a measurement of Reliability, which is frequently defined as the probability of surviving a specific runtime interval without failure. Both MTTF/Reliability and Availability are important metrics for measuring a system’s ability to continuously serve its intended function; however, being highly available does not imply being highly reliable (and vice versa).
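To make those relationships concrete, here is a minimal Python sketch (my own illustration, not from the paper) that converts between λ and MTTF and evaluates the exponential reliability function R(t) = exp(−λ·t), the probability of surviving a runtime interval t without failure:

```python
import math

def mttf_from_rate(failure_rate):
    """MTTF (device-hours per failure) is the reciprocal of the failure rate λ."""
    return 1.0 / failure_rate

def reliability(t_hours, failure_rate):
    """Probability of surviving t_hours without failure, assuming an
    exponential failure distribution: R(t) = exp(-λ·t)."""
    return math.exp(-failure_rate * t_hours)

# The example from the text: a device that fails three times per hour on average
lam = 3.0                          # λ, in failures per device-hour
print(mttf_from_rate(lam))         # 0.333... device-hours, i.e. 20 minutes
print(reliability(1.0, lam))       # chance of surviving a full hour: ~5%
```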
Some systems are highly available, but may endure failures frequently. For example, suppose you have a personal computer that is somewhat unstable. It seems to fail approximately every three months (~2,190 hours); however, the failure type is such that a simple reboot will restore system operation. Let us say that this reboot operation takes a minute and a quarter (0.020833 hours). This system’s MTTF is 2,190 device-hours per failure, which is somewhat poor (i.e. on average, it will endure four failures in a year). To determine its Availability, we can examine one average runtime cycle, or the mean time taken for the computer to fail and then be “repaired” to operational status (in this case, a reboot). Our MTTF, or 2,190 hours, is our average runtime interval prior to a crash, which we set as tu. ttotal is the length of the complete average runtime cycle, or the sum of tu and the average downtime following a crash (the repair time). Using the following equation, we can calculate its availability:
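For the code-inclined, the same arithmetic fits in a short Python sketch (my own illustration): availability is MTTF divided by the full average runtime cycle, i.e. MTTF plus the mean repair time.

```python
def availability(mttf_hours, mttr_hours):
    """Availability = uptime / total time = MTTF / (MTTF + mean repair time)."""
    return mttf_hours / (mttf_hours + mttr_hours)

mttf = 2190.0      # average hours of uptime between crashes (~3 months)
mttr = 0.020833    # average repair time: a 1.25-minute reboot, in hours

print(availability(mttf, mttr))   # ~0.9999905, i.e. about 99.999% available
```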