Archive for October, 2010

DisasterTest1 1080P

10/26/2010

Demonstrations in our business have long been a minor annoyance to us in hardware. Oftentimes, these demonstrations are contrived: they do not show the full ability of a system, or they mask shortcomings in FT solutions. For instance, we’ve seen demos where a hard disk is pulled (RAID 1 covers this) or an Ethernet cable is disconnected (teaming covers this). With tests like that, one can make almost any system appear to survive. Demos are done this way because it is difficult to demonstrate a random component failure, such as a multi-bit ECC error.

Our solution was to come up with a short video demo of our own. Hope you enjoy it.


Redundancy without Replacement – Not quite the increase in Reliability that You’d Expect

10/25/2010

The purpose of redundant, parallel designs is to allow a system to continue performing its intended function even in the presence of a subcomponent failure or failures. Obviously, when components break, we all know we should fix them to replenish the lost redundancy that comes with reduced (or eliminated) parallelization. But, what if we don’t (or can’t) fix them? How much does the redundancy help the overall system failure rate if subcomponents are not replaced after failure? The answer is – not as much as you’d think.

To illustrate the point, let’s use a simple example. Let’s say that NASA (I know, this is the second NASA example in two posts) wants to design a small circuit to control a telescope to be mounted onto an unmanned probe for deep space exploration. Let us also assume that this probe can never be repaired in the event of failure. To enhance the reliability of the circuit, NASA designs it such that there are two parallel subcomponents, only one of which needs to be working for the circuit to function properly. Let us further assume that there is no “failover” time in the event of a subcomponent failure; the circuit continues to function until all of the subcomponents have failed. If each individual subcomponent has a Mean Time to Failure (MTTF) of 1 device-year per failure, how long will the dual-redundant parallel array last until it fails?
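One way to put a number on this question is sketched below. It rests on an assumption the post does not state: each subcomponent fails independently at a constant rate, so lifetimes are exponentially distributed. Under that assumption, the expected time until all n parallel units have failed is the single-unit MTTF multiplied by (1 + 1/2 + ... + 1/n). The function names are hypothetical, and the simulation is included only as a sanity check on the closed form.

```python
import random


def parallel_mttf(n, unit_mttf):
    """Expected life of n parallel units with no repair.

    ASSUMPTION: independent units with exponentially distributed
    lifetimes (constant failure rate). The system works until the
    last unit fails.
    """
    return unit_mttf * sum(1.0 / k for k in range(1, n + 1))


def simulated_parallel_mttf(n, unit_mttf, trials=100_000, seed=0):
    """Monte Carlo estimate of the same quantity, for comparison."""
    rng = random.Random(seed)
    rate = 1.0 / unit_mttf  # failures per unit time for one unit
    total = 0.0
    for _ in range(trials):
        # The array dies when its longest-lived unit dies.
        total += max(rng.expovariate(rate) for _ in range(n))
    return total / trials


if __name__ == "__main__":
    print("closed form:", parallel_mttf(2, 1.0))
    print("simulated:  ", simulated_parallel_mttf(2, 1.0))
```

Under these assumptions, the dual-redundant array from the example has an expected life of 1.5 years rather than 2: doubling the hardware buys only a 50% increase when failed parts are never replaced.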

Click to continue reading “Redundancy without Replacement – Not quite the increase in Reliability that You’d Expect”


A Reaction to Elden Christensen’s MSDN blog post, “Evaluating High-Availability (HA) vs. Fault Tolerant (FT) Solutions.”

10/07/2010

This morning, I read Elden Christensen’s MSDN blog post, “Evaluating High-Availability (HA) vs. Fault Tolerant (FT) Solutions.” I found it an interesting post, but I am uncomfortable with some of his facts.

First, it was unclear to me whether he was talking about software FT solutions or hardware FT. I also found it a bit misleading. For example, he stated, “In the event that there is a software fault (such as a hang or crash), both machines are affected and the entire solution goes down. There is no protection from software fault scenarios and at the same time you are doubling your hardware and maintenance costs. At the end of the day while a FT solution may promise zero downtime, it is in reality only to a small set of failure conditions.” With a Stratus ftServer, this is incorrect. Stratus works with OS vendors, like Microsoft, to harden the OS, which allows our servers to ride through these types of software faults and transient errors. We also eliminate the possibility of these errors propagating across the server, which is something that occurs quite often in HA cluster solutions.

Stratus also provides root cause analysis of the fault to find out what caused it, so that it does not recur. With Stratus ftServers there is no “doubling” of hardware or maintenance, as each unit is a single entity with total redundancy built in, licensed once and managed as a single x86/x64 server, with none of the complexities and additional skills or planning required to manage a cluster. There is no scripting, and the applications do not need to be “cluster aware”. With ftServer, it’s drop-in hardware fault tolerance. A key differentiator between Stratus’ full function hardware fault tolerant servers and software FT solutions is performance. Stratus builds and engineers its servers from the ground up to eliminate downtime while maintaining 100% of the machine’s performance, even during and after a failure. Stratus’ full function fault tolerant servers differ from HA cluster solutions in many ways, but probably the most important difference is that we eliminate failover rather than recover from it. There is no data loss and no restart or reboot, which could take several minutes, or worse.

Microsoft and Stratus have been OEM partners for well over a decade now, and there are thousands of Stratus ftServers running Windows and supporting some of the world’s most mission-critical applications. In fact, we just announced support for Hyper-V on the ftServer. Elden might be interested in the quote by Mike Neil, general manager, Windows Server & Virtualization at Microsoft, in our press release this week: “Customers can now experience Microsoft Hyper-V, its tools and features such Live Migration in the easy-to-use and familiar Microsoft User Interface with 99.999+ percent mission-critical availability running on Stratus ftServer systems. ftServer and critical application support are synonymous and Stratus now includes mission-critical support for Microsoft Hyper-V on ftServer systems. Giving customers the option of choosing the ftServer platform with Windows Server 2008 R2 and Hyper-V adds an availability dimension that hardens the entire solution against downtime and data loss.” The entire press release can be found here: Microsoft Hyper-V on Stratus ftServer Systems. There is also a quote by Claude Lorenson, director of SQL Server marketing at Microsoft Corp, from an April joint press release: “For SQL Server users that also need the highest degree of hardware availability to complete their solution, the ftServer system from Stratus Technologies has a decade-long record of uptime performance that only a fault-tolerant server architecture can deliver.” The complete press release can be found here: Microsoft SQL Server 2008 R2 with Stratus ultra-high server uptime.
