In 1991, twelve million people in Washington, D.C., Pittsburgh, Los Angeles and San Francisco were unable to make or receive phone calls when service was sporadically disrupted between June 26 and July 2. This was a time when second-generation cell phones were just hitting the market, so many of those without phone service had no backup option. The loss of telephone service hurt profits as workdays were disrupted, and it threatened public safety because there was no way to call emergency services.
According to the New York Times, the disruption in service caused a minor panic as “Telephone company executives and federal regulators said they were not ruling out the possibility of sabotage by computer hackers.” There was also speculation that there was a defect somewhere in the networking software.
The real cause? A single mistyped character. A typographical error that the software company, DSC Communications Corporation of Plano, Texas, had made while entering the ten million lines of code that a System 7 station required. Not a hacker, not a software design error, but a human error.
1991 may seem like a long time ago, but the risk of human error has not gone away. The more complex the system, the greater the chance of human error.
Stratus’s Downtime Prevention Buyer’s Guide talks about the six questions you should be asking to prevent downtime, including how to insure against human error. Stratus suggests asking, “Does your solution require any specialized skills to install, configure, and/or maintain?”
“In addition to a solution’s recovery times and ease of integration, it is important to understand exactly what is involved in deploying and managing various availability alternatives. Some are simple to implement and administer, while others demand specialized IT expertise and involve significant ongoing administrative effort.
For example, deployment of high availability clusters requires careful planning to eliminate single points of failure and to properly size servers. Plus, whenever you make changes to hardware or software within the cluster, best practices suggest that you update and test failover scripts — a task that can be both time consuming and resource intensive. Some planned downtime is typically required to conduct the tests and ensure that the environment is working correctly.
Other solutions provide a more plug-and-play approach to availability. Today’s fault-tolerant approaches prevent downtime without the need for failover scripting, repeated test procedures, or any extra effort required to make applications cluster-aware. With fault-tolerant solutions, your applications run seamlessly with no need for software modifications or special configuration changes. Fault-tolerant servers even provide a “single system view” that presents and manages replicated components as one system image, thereby simplifying installation, configuration, and management.
Before investing in a fault-tolerant solution to protect your critical applications against downtime, take serviceability into account, too. Ask about features like 24/7 system monitoring and automatic problem diagnosis, automated identification of failed components and replacement part ordering, customer-replaceable units with automatic system resynchronization features — all of which help ensure continuous operations and eliminate the need for specialized IT expertise.”