Dr. Michael Koriwchak wrote an article this week for Wired EMR Practice entitled “Our Disaster Recovery ‘Fire Drill.’”
Preparedness is always a good thing, but I had some comments about disaster recovery as opposed to disaster prevention.
Preparedness in the event of an outage is essential and, amidst the rush to EHR adoption, it's often done too late, after a failure. In that regard, Dr. Koriwchak's approach and insights are excellent. So too is the comment from Mr. Bletnitsky regarding the implications of downtime.
But more emphasis should be placed on reducing the risk of downtime in the first place. Call it preventive medicine for EHR. Yes, you need effective backup, recovery, and continuity measures, just as you need a doctor when you're sick. But common sense says to try to avoid getting sick at all, or, in this case, to reduce the risk of downtime up front. Ironically, healthcare reform is ultimately trying to achieve the same thing; so should the implementation of an EHR system.
Virtualization doesn't ensure uptime in the case of hardware problems. It uses resources efficiently, but it still requires an application and its associated processes to restart, and if your SQL server is one of the processes that went down, you're looking at hours, not minutes, of recovery even when everything goes perfectly. Too often it doesn't.
Ensuring uptime requires three things: resilient technologies, proactive monitoring to identify and mitigate failures before they occur, and best practices. Working with companies like Stratus, you can achieve all three at lower cost and with far less IT staff time, and keep your EHR system healthy.
Demonstrations have long been a minor annoyance to those of us in the hardware business. Oftentimes these demonstrations are contrived: they do not show the full ability of a system, or they are staged carefully enough to mask shortcomings in fault-tolerant (FT) solutions. For instance, we've seen demos where a hard disk is pulled (RAID 1 covers this) or an Ethernet cable is disconnected (NIC teaming covers that). With tests that simple, one can make almost any system appear to survive. Demos are built this way because it is difficult to demonstrate a random component failure on cue, such as a multi-bit ECC memory error.
Our solution was to come up with a short video demo of our own. Hope you enjoy it.
Two major methods of rating the degree to which a device can deliver continuous operation are Availability and Failure Rate. These are fundamentally different concepts. Availability (A) is simply the proportion of a total time interval that a given device will be operational. Mathematically speaking, that is:

$$A = \frac{t_u}{t_{total}}$$
where $t_u$ is total uptime, and $t_{total}$ is the total time when the device is intended to be operational. Obviously, if it never fails, $t_u = t_{total}$ and everyone is happy. But what if it does? That's where Failure Rate comes in. Assuming a device follows an Exponential Failure Distribution (just a geeky way of saying that it fails randomly at a constant rate – but more on that in later posts), we use the Greek symbol λ to represent its Failure Rate in units of failures / device-hr. A lot of times, we use Mean Time to Failure (MTTF) as a measurement of Failure Rate, and MTTF is simply the reciprocal of λ (which inherently makes sense – if some device is expected to fail three times per hour on average, its mean uninterrupted runtime interval should be 20 minutes, or 1/3 device-hrs / failure). MTTF is a measurement of Reliability, which is frequently defined as the probability of surviving a specific runtime interval without failure. Both MTTF/Reliability and Availability are important metrics to measure a system's continuous ability to serve its intended function; however, being highly available does not imply being highly reliable (and vice versa).
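To put numbers on that reciprocal relationship, here is a minimal Python sketch using the three-failures-per-hour example above. The reliability line uses $R(t) = e^{-\lambda t}$, the standard closed form for a constant failure rate (the exponential distribution mentioned above); the half-hour interval is just an assumed value for illustration.

```python
import math

# Failure rate (lambda) from the example above: 3 failures per device-hour.
failures_per_hour = 3.0

# MTTF is simply the reciprocal of lambda.
mttf_hours = 1.0 / failures_per_hour
print(mttf_hours)                      # 0.333... device-hours per failure, i.e. 20 minutes

# For a constant failure rate, reliability over an interval t is R(t) = exp(-lambda * t).
t_hours = 0.5                          # an assumed half-hour of uninterrupted runtime
reliability = math.exp(-failures_per_hour * t_hours)
print(round(reliability, 3))           # ~0.223: only about a 22% chance of surviving 30 minutes
```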
Some systems are highly available, but may endure failures frequently. For example, suppose you have a personal computer that is somewhat unstable. It seems to fail approximately every three months (~2190 hrs); however, the failure type is such that a simple reboot will restore system operation. Let us say that this reboot operation takes a minute and a quarter (0.020833 hours). This system's MTTF is 2190 device-hours per failure, which is somewhat poor (i.e. on average, it will endure 4 failures in a year). In this case, to determine its Availability, we can examine one average runtime cycle, or the mean time taken for the computer to fail and then be "repaired" to operational status (i.e. a reboot, in this case). Our MTTF, or 2,190 hours, is our average runtime interval prior to a crash, which we set as $t_u$. $t_{total}$ is the length of the complete average runtime cycle, or the sum of $t_u$ and the average downtime following a crash (the repair time). Using the following equation, we can calculate its availability:

$$A = \frac{t_u}{t_{total}} = \frac{2190}{2190 + 0.020833} \approx 0.99999$$

That works out to roughly "five nines" of availability, even though the machine's reliability is poor.
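To make the arithmetic concrete, here is a minimal Python sketch of that availability calculation, using the numbers from the flaky-PC example above:

```python
# Average uptime before a crash (t_u) and repair time per crash, in hours.
mttf_hours = 2190.0        # ~3 months between failures
repair_hours = 0.020833    # a 1.25-minute reboot

# One complete average runtime cycle: run until failure, then repair.
t_total = mttf_hours + repair_hours

availability = mttf_hours / t_total
print(f"{availability:.6%}")   # ~99.9990%: roughly "five nines", despite the poor MTTF
```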