What causes computer systems to go down?  Is it mostly hardware failures, software crashes, or other factors like user error or environmental problems?  Many studies have been conducted over the years, but a lot of them suffer from small samples, covering too few systems and too little runtime.  Researchers at Carnegie Mellon University wanted to really dig into the problem, so they analyzed a set of system failure data from Los Alamos National Laboratory spanning nine years (although most of the data seems to fall between 2001 and 2005), almost 5,000 systems, over 100 million system hours of runtime, and a whopping 23,000 distinct failures!  The analysis produced a paper by Bianca Schroeder and Garth A. Gibson entitled A Large-Scale Study of Failures in High-Performance Computing Systems (http://www.pdl.cmu.edu/PDL-FTP/stray/dsn06.pdf).  Best of all, every single failure, its date, its root cause (if known), and the amount of downtime it induced is publicly available and downloadable on the Los Alamos website (http://institutes.lanl.gov/data/fdata/).

I recently downloaded the raw data, and it is quite impressive.  The study included several different types of servers, everything from 2-way and 4-way SMP systems to NUMA systems containing as many as 256 processors per node.  What was really striking, though, was the detail surrounding the failures.  Not only was the node downtime resulting from each failure recorded to the minute, but each failure’s root cause (if known) was recorded down to the component level.  For example, hardware failures were isolated to the failing component (e.g. DIMM, CPU, PCI card, disk drive) and software failures to the piece of software that failed (e.g. kernel/OS, file system).

The following graph from their report (Schroeder and Gibson, pg. 3, Figure 1) shows the breakdown of all failures across different system architectures (each letter, D through H, represents a unique system architecture):

Graph (a), to the left, shows the percentage of outages caused by each failure source.  Graph (b), to the right, shows the percentage of total downtime caused by each failure source.  In both cases, the overall bars show that hardware was the largest source of both outages and downtime (~60% in each graph).  Software was the second largest (~20%), followed by unknown failures (~15%).  The other factors, including human error, seemed to be very small contributors to outages (although, as Schroeder and Gibson mention in their report, these percentages may be higher depending on how the unknown failures are actually distributed).  However, despite the somewhat large unknown factor, hardware would remain the number one source of both outages and downtime even if none of the unknown failures were hardware failures: attributing every unknown failure to software would still only bring software to roughly 35%, well short of hardware's ~60%.
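For anyone who wants to reproduce this kind of breakdown from the raw data, here is a minimal Python sketch of the two calculations behind graphs (a) and (b), plus the worst-case check on the unknown failures.  The record layout (a category name and downtime in minutes), the sample values, and the function names `breakdown` and `still_largest` are all my own assumptions for illustration; they are not the actual LANL file format or the paper's method.

```python
from collections import defaultdict

# Hypothetical records: (root-cause category, downtime in minutes).
# These values are made up for illustration; they are not LANL data.
SAMPLE_FAILURES = [
    ("hardware", 120),
    ("hardware", 45),
    ("hardware", 300),
    ("software", 60),
    ("unknown", 30),
]


def breakdown(failures):
    """Return (share of failure count, share of downtime) per category, in percent."""
    counts = defaultdict(int)
    minutes = defaultdict(int)
    for category, downtime in failures:
        counts[category] += 1
        minutes[category] += downtime
    total_count = sum(counts.values())
    total_minutes = sum(minutes.values())
    by_count = {c: 100.0 * n / total_count for c, n in counts.items()}
    by_downtime = {c: 100.0 * m / total_minutes for c, m in minutes.items()}
    return by_count, by_downtime


def still_largest(shares, leader="hardware", unknown="unknown"):
    """Worst case for the leader: every unknown failure belongs to its biggest rival."""
    rivals = [v for c, v in shares.items() if c not in (leader, unknown)]
    biggest_rival = max(rivals, default=0.0)
    return shares.get(leader, 0.0) > biggest_rival + shares.get(unknown, 0.0)


if __name__ == "__main__":
    by_count, by_downtime = breakdown(SAMPLE_FAILURES)
    print("share of outages:", by_count)
    print("share of downtime:", by_downtime)
    # Approximate overall percentages read off the paper's Figure 1.
    approx = {"hardware": 60.0, "software": 20.0, "unknown": 15.0}
    print("hardware still largest in worst case:", still_largest(approx))
```

The same failure list feeds both views: counting records gives the outage share, while summing downtime gives the downtime share, which is why a category can rank differently in graph (a) than in graph (b).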

But what caused these hardware failures?  System type E’s largest contributor of hardware failures was processor failures; however, the paper admits that system E’s processor contained a “design flaw.”  On all of the other system architectures studied, the research generally found that memory (DIMM) failures were the largest cause of hardware failures.  Below these two major categories, the other major contributors seemed to be disk drives, power supplies, expansion (PCI) cards, and interconnect/motherboard-related issues.

Overall, this data can be quite useful for system designers and IT personnel.  With better knowledge of why systems fail and how failures are distributed, IT service providers can better design their system architectures and maintenance strategies to maximize uptime and minimize system interruptions.