Stratus Blog

Showing archives for category Failure

Super Bowl Shines Worldwide Spotlight on Downtime

2.5.2013 | Cost of Downtime, Failure, High Availability, Mission Critical

Over 100 million people worldwide tuned in to watch Super Bowl XLVII. It could be argued, then, that the blackout that interrupted it was the most widely viewed power outage ever to wreak havoc on such a grand scale.

It just goes to show, downtime happens.

We can’t really say for sure what occurred. Early speculation placed blame on Beyoncé’s lights-out performance, but a manager at the Superdome, site of the game, said it was not the halftime show; the local energy company claims it had trouble with one of the two main lines that deliver power to the stadium from a nearby substation.

It could have been a software glitch or a hardware problem that sacked power to the stadium for 33 minutes and left the NFL with a black eye. Whatever the cause, the downtime incident powered a social media surge, as hundreds of thousands of people began tweeting about the #poweroutage.

Which brings us to Twitter itself. Having suffered its own downtime nightmare back on January 31, Twitter nonetheless handled the blitz of people tweeting about the Super Bowl’s misfortune. Twitter announced it processed just over 24 million tweets during the game, with the mini missives coming in at a rate of 231,500 per minute during the power outage.

Downtime appears in many different forms and at many different times, across all industries and business landscapes. The Twitter outage was much different from the one the NFL witnessed, but both incidents took their toll financially and in terms of brand reputation.

Within the enterprise, a certain level of downtime is accepted each year. On average, businesses suffer between three and five hours of downtime per year, far too much in our humble opinion, at an average cost of $138,888 per hour. Call it four hours a year and that works out to more than $555,000 annually. While that’s a staggering figure, the damage to the brand can be even more catastrophic.

Let’s get back to the Super Bowl and the power outage. The City of New Orleans, which hosted the game, is already worried it will lose out on hosting future games because of what happened. New Orleans is a city known for its ability to show visitors a good time, and the businesses that depend on major events like the Super Bowl to draw in tourism dollars could suffer from that 33-minute absence of electricity.

Again, downtime comes in many forms depending on the industry and the ramifications have the potential to throw their victims for a significant loss. It’s like that old saying that you need to expect the unexpected. When the unexpected does arrive you have to be prepared to come back from that downtime swiftly and with as little disruption to your business as possible. With the right technology and the right best practices in place, you can minimize the damage and decrease the chance of downtime seriously hampering your ability to do business.

Protecting Public Safety Applications from Downtime

4.5.2012 | Failure, Fault Tolerance, High Availability, Mission Critical, PSAP

Uptime for Public Safety Answering Point (PSAP) applications is critical for a multitude of reasons. Downtime slows emergency response times, impairs the ability of computer systems to capture and disseminate vital information, jeopardizes the safety of first responders when location history and fire inspection data are not available, harms the public perception and reputation of your department, and even opens the department up to potential lawsuits.

Every one of those reasons is a great one to protect your PSAP applications from downtime, especially if you are already considering a technology upgrade for legacy software or are about to publish an RFP for updating your PSAP operations.

There is a better reason, though, for making sure your 9-1-1 system is up and running 24/7/365.

Odds are, you live in the same community in which you work. Your family and your closest friends are among the people you are charged to protect. You want the quickest response possible from emergency responders who have the best information and technology behind them.

Uptime for your PSAP applications means peace of mind for you that your family and your neighbors are protected around the clock.

Stratus’ newest ebook details different strategies to improve the availability of mission-critical public safety applications and minimize the possibility of disruptions caused by server failure. It describes the advantages and disadvantages of dedicated servers, cluster technology, virtualization, high availability, and fault tolerance. Download it now to learn why 99% system uptime is not good enough for PSAP applications, which technologies can increase system uptime, and which best fit your specific environment.

Prevent Manufacturing Application Downtime

3.28.2012 | Cost of Downtime, Failure, Fault Tolerance, High Availability, Mission Critical, Technology, uptime

Downtime for manufacturing applications is getting costlier and costlier. Efficiency improvements in an increasingly competitive landscape center on resource consolidation, information technology, and automation. Manufacturers are deploying more business-critical applications on the production floor to increase and optimize product quality and output, without sacrificing the ability to respond to changing raw material quality, market conditions, and customer demands.

These changes are good: they give manufacturers access to better information on process improvements, infrastructure costs, and resource availability. They also allow some manufacturers to run continuously.

However, the consolidation of computer resources and server virtualization pushes more and more applications onto fewer pieces of infrastructure, creating a single point of failure for the plant. Even a minor loss of uptime can be catastrophic for productivity, and in some cases entire batches of product are ruined.

Manufacturers face enough pressures already, including intense global competition, government regulations, and a shortage of skilled workers. The last thing they need is a breakdown of processes due to a faulted server.

ARC Advisory Group recently conducted a survey on application downtime specific to manufacturing. Their webcast, “Application downtime, your productivity killer,” discusses the critical nature of downtime and how best-in-class manufacturing organizations are addressing the issue.


Watch the webcast, “Application downtime, your productivity killer,” to hear John Blanchard, a principal analyst at ARC Advisory Group, explain how manufacturing trends are making uptime assurance more important than ever, and how to protect your own plant from the consequences of downtime.

Datacenter Downtime: How Much Does It Really Cost?

3.24.2012 | Cost of Downtime, Failure, Fault Tolerance, Fault Tolerant Storage, High Availability, Mission Critical, Technology, uptime

Calculating the cost of downtime is perhaps the biggest hurdle for IT departments addressing availability concerns. In February 2012, Aberdeen Group conducted an in-depth analysis of a number of factors surrounding datacenter downtime. Survey respondents were asked about the average number of downtime events per year, the average length of an event, the cost per hour of downtime, and the time it took to recover 90% of business operations following a business interruption.
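For a rough sense of how those factors combine, here is a back-of-the-envelope sketch in Python (a minimal illustration with made-up placeholder numbers, not Aberdeen’s model or findings):

    # Rough annual downtime cost estimate built from the survey's four factors.
    # All numbers are hypothetical placeholders, not Aberdeen's results.
    events_per_year = 4        # average number of downtime events per year
    hours_per_event = 1.5      # average length of an event, in hours
    recovery_hours = 2.0       # time to restore 90% of business operations
    cost_per_hour = 138_888    # average cost of one hour of downtime, in dollars

    # Each event costs you the outage itself plus the recovery window.
    annual_cost = events_per_year * (hours_per_event + recovery_hours) * cost_per_hour
    print(f"Estimated annual downtime cost: ${annual_cost:,.0f}")

Plug in your own event counts, durations, and hourly cost, and the report’s per-hour figures become an annual number you can put in front of management.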

Instead of worrying about how downtime is hurting your business without being able to pinpoint the dollar amount per hour, the reputation loss, or the productivity loss, read the report and find out exactly how outages affect best-in-class companies like yours – and how they are taking steps to address the issue.

Server Crash Disrupts Medical Office and Patients

3.15.2012 | Cost of Downtime, EMR, Failure, Healthcare, High Availability

“Our server is down” is not what I wanted to hear when I called my doctor’s office to make an appointment.

“You’ll have to call back in an hour,” the receptionist said, and I could hear other phones ringing off the hook in the background.

An hour later, I tried again and got the same harried receptionist, stuck without a network or a way to schedule appointments.

For me, it was slightly annoying – calling and calling and calling again, only to make an appointment for a checkup I’m reluctant to get anyway.

Multiply my frustration by the number of patients coming in for appointments that can’t be carried out, plus the patients who can’t cancel, reschedule, or make appointments. Add to that the frustration of patients calling for any number of other reasons: to change insurance plans, get lab results, or consult their physician about a health issue.

All of this patient angst, however, pales in comparison with the headache a server fault brings upon the medical office personnel. The doctors, nurses, and administrators are all brought to a halt, helpless to do anything without access to their scheduling system, email, or their electronic health records.

Unfortunately, it happens all the time. Downtime disrupts business, costs time and money, and damages the reputation of the office. Comment below, or send us a tweet at @Stratus4uptime if this has happened to you, and tell us how you manage your business’s uptime.

Keeping Computer Aided Dispatch Software Up and Running: St. Charles County Case Study

3.9.2012 | Disaster Recovery, Failure, Fault Tolerance, High Availability, Mission Critical, PSAP

Anyone in the public safety sector will tell you that the key to a safe neighborhood and a successful first-response system is teamwork. Everyone is essential, even on one small car fire: the person who calls 9-1-1; the dispatch operator who answers the phone and sends the proper emergency personnel; the fire engine driver who navigates the truck safely through crowded streets; the firefighters who extinguish the blaze; the policemen who keep onlookers at a safe distance; the emergency medical technicians and paramedics who triage patients and get them to the hospital; and the nurses, doctors, and technicians in the hospital who treat victims. All of them are critical to keeping the public safe.

The same is true of the equipment. Every piece of the line is essential. The phone lines connect the 9-1-1 caller to the dispatcher and then the dispatcher to the fire station. All of the firefighters’ gear must work, along with the truck, the hydrant, and the hoses. The ambulance crew, similarly, must be fully equipped and ready to move. There is little room for error when lives and property are at stake.

About that equipment. First responder organizations rely on top-of-the-line tools. Have you ever seen a firefighter haul out a green garden hose, struggling to untangle the kinks, in an effort to put out a fire? Have you ever seen a policeman take control of a robbery situation using a squirt gun? Have you ever seen a lifeguard swim to a victim and instead of tossing them a buoy, fitted them with floaties? No, and you won’t. Ever.

In public safety, there is no substituting the right tools to get the job done. Every piece is essential, and it must work exactly as designed, every single time.

Or, in the case of the server that supports the public safety applications, every single second.

St. Charles County Department of Dispatch and Alarm is a great example of a department that looked beyond the fire trucks, police cars, and ambulances to find vulnerabilities that could hurt public safety performance and put their citizens in danger. They implemented a highly reliable computer-aided dispatch (CAD) system built on Stratus® ftServer® systems and TriTech Software Systems’ VisiCAD™ software to ensure uninterrupted performance of their dispatch software. Some 40,000 service calls come through dispatch each year, and every single one could be life-saving. TriTech’s VisiCAD software is flexible enough to service the county’s 16 ambulances and 34 fire stations, encompassing a total of 120 mobile units. VisiCAD is dynamic enough to locate the closest response team to an accident while monitoring backup vehicles should they be needed.

The ftServer systems running the computer-aided dispatch software, as well as storing all of the electronic information from the calls, give those systems unparalleled uptime. St. Charles County IT Manager Travis Hill said they have been running their original ftServer system for more than nine years without any server downtime. That means nine years of proactive protection for the citizens of St. Charles County.

To find out more about why St. Charles County specifically chose VisiCAD software on ftServers, click here to read the case study.

Preventing Public Safety Outages

3.7.2012 | Cost of Downtime, Disaster Recovery, Failure, Fault Tolerance, High Availability, Mission Critical, PSAP

Saturday’s 9-1-1 system outage in the District of Columbia highlights the necessity of fault-tolerant systems for running mission-critical applications. Due to a PEPCO power outage at the call site on Martin Luther King Jr. Avenue, citizens could not reach EMS personnel from 1:53 to 2:16 p.m. Although traditional and social media channels did their best to get the word out about alternate numbers, all 617,996 citizens of the District were put at risk. Perhaps nothing is more critical to a city than public safety systems like EMS, fire, and police response.

@AriAnkhNeferet from Twitter said it best, “Someone please explain to me how it’s possible that 911 is experiencing a power outage?! Come on DC. we have to do better.”

She is right – the most mission critical systems and applications shouldn’t be subject to outages, power or otherwise. Backup systems, fault tolerant servers, and disaster recovery solutions are all possible ways to make your EMS system safer for the community. Servers wired for two distinct power sources that come from separate power grids, like our ftServers, are an easy way to guard against power outages. Live data replication and split-site capabilities, two features of our Avance high availability software, are two other ways to ensure your systems are protected.

Besides power failures, server crashes, memory failures, disk drive failures, and countless other technical problems take systems down far more often. Saturday’s power outage demonstrates what can happen when a public safety system goes down for any reason, and reinforces that steps need to be taken to protect systems from these more frequent occurrences.

When lives are at stake, you cannot be too careful. However, @AriAnkhNeferet’s tweet shows that something else is at stake: reputation. What happens if the public loses trust in the EMS system’s ability to respond? A large metro area can receive 30,000 9-1-1 calls per day, roughly 21 calls per minute, so the 23-minute outage could have affected more than 400 9-1-1 calls, leaving citizens stranded and the city’s first line of defense helpless to respond.

If you run life-saving systems, it might be best to run through some worst-case scenarios on your existing architecture. What happens when the power fails in your call center? What happens when a server has a hardware failure? What is your disaster recovery plan in the case of an earthquake, fire, or flood? Are there dedicated resources available 24 hours a day in the case of a failure?

To learn how Stratus can help you with these and other public safety technology issues, click here to download more information.

Keeping Electronic Health Record (EHR) Applications Available at Alice Peck Day Memorial Hospital

3.1.2012 | Disaster Recovery, EMR, Failure, Healthcare, High Availability

Today, your Danskos are going to power over the linoleum floors, moving from patient room to patient room. In a sea of charts, beeping machines, gurneys, and meal carts, you know that one small misstep can set back your whole day.

It isn’t a large leap, then, to understand that even a small amount of downtime for a hospital’s electronic health record (EHR) system can bring the entire hospital – staff, patients, and machines alike – to a standstill.

Ten years ago, when doctors and nurses used paper charts, the risk of inaccessible data was low, as was the level of efficiency. Aside from the occasional misfile or lost folder, patient medical histories were never completely unavailable. Electronic medical records have done wonders to streamline access to patient information, but they have also created a vulnerability: a single point of failure in the server.

Click here to learn how Alice Peck Day Memorial Hospital prevents downtime.

The HITECH (Health Information Technology for Economic and Clinical Health) Act, however, demands “meaningful use” of technology in healthcare environments, with a $2 billion incentive behind it. Designed to make the exchange of healthcare information between healthcare professionals easier and more accurate while improving the level of care patients receive, the bill strongly encourages healthcare practices to adopt EMR systems.

Once the tedious process of data entry and document scanning is complete, medical practices can reap the rewards of a paperless system, but that efficiency comes with a catch: if the EHR system goes down, medical records are as good as gone. As a result, protecting servers and applications from downtime becomes paramount.

Alice Peck Day Memorial Hospital, a 25-bed hospital in the Northeast, implemented virtualization technology with high availability software to address concerns over medical records accessibility. To see their prescription for success, read the case study.

What is Fault Tolerance?

12.21.2011 | Cost of Downtime, Failure, Fault Tolerance, Fault Tolerant Storage, Mission Critical, Technology

So, What is Fault Tolerance?

Virtualization and fault tolerance are decades-old technologies reinvented for the demands of modern enterprise computing: industry-standard platforms, business continuity, application availability, server consolidation, end-to-end business process reliability, flexibility, and rapid response.

We’ve always had a bit of a problem describing “fault tolerant.” Even knowledgeable people in the availability industry struggle with a concept whose very name is misleading. “Fault tolerant” leads people to believe that a system works to manage or “tolerate” a failover, when that simply isn’t true. Since everything is duplicated and runs twice, nothing happens to the system when there is a problem. The application continues running on the working server, performance stays the same, and the faulty half calls home for proactive service.
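For a very loose software sketch of the idea in Python (illustrative only; ftServer achieves this in lockstep hardware, not in application code, and the fault probability below is invented):

    import random

    # Two replicas do identical work. If one faults, the answer still
    # arrives from the survivor, with no failover or recovery step.

    def run_replica(work, name):
        if random.random() < 0.1:   # simulate a rare hardware fault
            print(f"{name} faulted; calling home for proactive service")
            return None
        return work()

    def fault_tolerant_run(work):
        results = [run_replica(work, "replica A"), run_replica(work, "replica B")]
        survivors = [r for r in results if r is not None]
        # Both replicas compute the same answer, so one result is kept.
        # (Sketch assumes at least one replica survives; simultaneous
        # faults on both sides are vanishingly rare.)
        return survivors[0]

    print("Result:", fault_tolerant_run(lambda: 2 + 2))

Note what the sketch does not contain: no failover logic, no restart, no recovery step. The result simply arrives because both copies were already doing the work.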

Analogies for Fault Tolerance

For an analogy, think of the Radio City Rockettes as our server, and the kicking action as our application. If, in the middle of the show, a Rockette falls off the stage, kicking still happens. However, in our model, the horror of watching the fall has probably disrupted the audience, and the show is no longer a success.

So let’s try another analogy. As it turns out, your body’s most important organs are fault tolerant (with two notable exceptions). You have two lungs, two kidneys, two eyes, two legs, and women have two ovaries. If we think of an organ’s function (say, sight) as the application and the organ (here, the eyes) as the server, people with one working eye can still see. But, as Kevin Butler our web guy pointed out, one-eyed people can’t gauge distance, have a limited field of view, and face many other problems. So your body isn’t quite as fault tolerant as we had hoped.

Another example we came up with was a race. Imagine two identical runners, each running for the same team. Here, the race is the application and the runners are the fault-tolerant servers. If the gun goes off and one runner trips, a runner still makes it to the finish line to earn a medal for the team; and if both runners cross the finish line, still only one medal is earned. It, too, doesn’t quite fit.

The duplicated, seemingly wasted energy is, in every other facet of life, eliminated or nonexistent. Fault-tolerant servers literally do all of the work of each application twice, for just one result. Half of the work is completely superfluous unless its twin server has faulted, in which case the survivor merely continues doing the same work alone until the faulted twin is back online.

Does anyone have a better analogy? How do you explain fault tolerance? Does this ever occur in nature?


Fault Tolerant Hardware vs Software

While not by design, virtualization and fault tolerance are made for each other. Virtualization vendors are pushing up the availability stack and fault-tolerant solution providers are wrapping themselves around virtualization. Well, kumbaya.

Having been in the availability business for nearly three decades, Stratus knows a thing or two about supporting mission-critical computing environments. Our ftServer systems lead the x86 world for field-tested uptime reliability. Believe us when we say, it’s not easy to do.

New declarations of fault-tolerant systems today are coming from companies with solutions in software, not in hardware the way Stratus does it. By the narrowest of definitions, these software solutions are fault-tolerant, and Hurricane Ike could have been described as inclement weather; both statements are correct, but neither captures the true nature of the situation.

Software-based FT comes up short in several ways. It has not conquered how to prevent transient errors from crashing a system or propagating the error to other servers or across the network; how to root-cause an outage to prevent it from happening again; and how to quash latency when applications or VMs move from one side of the cable to the other.

Most important, software-based solutions don’t support symmetric multiprocessing (SMP); i.e., they cannot scale beyond a single processor core per socket. That means that if the application cannot execute on a single core, it won’t be supported in a software fault-tolerant environment.

Delivering continuous availability – mission-critical application availability – requires more than saying you have fault-tolerance. Continuous availability demands a combination of hardware, software and, as important, service without quibbling over whose problem it is.

Learn about Stratus’ full-circle approach, including hardware, software, and service.


A Brief History Lesson in Fault Tolerance

Having been in this industry for going on 30 years now, I have seen many things “recycle.” Sometimes the terminology changes; sometimes it stays the same. What I find interesting is how, when something comes back around again, it is in many ways thought of as “new and innovative.” The first thing that comes to mind for me is virtualization. Over the last 18 to 24 months, this has been the hottest topic in the IT industry, and for good reason. But in reality it’s not new. Virtualization was new in the late 1960s, when IBM put it on the System/360.

Most recently, another 35-year-old technology is making a comeback. OK, let me rephrase that, because it never really went away; it sort of drifted to the outer rings of the radar screen. I’m talking about fault tolerance. Fault-tolerant machines began to make their mark on the IT industry in the late 1970s. These machines were large, proprietary, and very expensive, but then again, so was every other computer of the late ’70s! Check out this commentary on how and why fault tolerance is back, by Director of Product Management Denny Lane: Rediscovering FT.


You Might Be Fault Tolerant If…

First I want to go on record and apologize to Jeff Foxworthy for butchering his tag line, but I thought it was an interesting way to get a couple of points across.

  • If you built it from the ground up with no single point of failure – “You might be fault tolerant”
  • If the tens of thousands of machines in production at customer sites are monitored daily, and you post an uptime of 99.9999% on your company home page – “You might be fault tolerant”
  • If you understand that being fault tolerant is more than a piece of hardware or software, but an entire infrastructure including services – “You might be fault tolerant”
  • If customers around the world have been trusting you with their most mission-critical applications for almost 30 years – “You might be fault tolerant”



Cluster-aware Applications

12.20.2011 | Failure, High Availability, Technology, uptime

There are many reasons clusters are difficult to manage: the number of servers, their connecting components, the application managing them (such as Microsoft’s Cluster Administrator), and the inherent problem that clusters are designed to fail over to each other. In essence, they are specifically designed to react to a problem after it happens.

There is also a requirement that your applications be “cluster-aware.” A cluster solution is not so much a product as a tool set. For the cluster to work, your applications must be written with those tools so that knowledge of the cluster solution is built into the application. Lots of commercial software, such as SQL Server, is cluster-aware; however, the cluster-aware version of such software often carries a much higher license fee.

According to Microsoft’s website, an application is capable of being cluster-aware if it uses TCP/IP as a network protocol, maintains its data in a configurable location, and supports transaction processing. Essentially, in order to use a cluster solution, you must first choose applications that are cluster-aware.
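To make the “maintains data in a configurable location” requirement concrete, here is a minimal, hypothetical Python sketch (the file names and paths are invented for illustration, not taken from Microsoft’s documentation). An application that reads its data directory from configuration can have that directory pointed at shared cluster storage, so whichever node owns the resource after a failover can still find the data:

    import json
    from pathlib import Path

    # Hypothetical example: the data directory comes from a config file
    # rather than being hard-coded, so a cluster can point it at shared
    # storage that follows the active node.
    CONFIG_FILE = Path("app_config.json")

    def data_dir():
        config = json.loads(CONFIG_FILE.read_text())
        # e.g. {"data_dir": "S:/cluster_shared/appdata"} on a shared volume
        return Path(config["data_dir"])

    def save_record(record_id, payload):
        target = data_dir()
        target.mkdir(parents=True, exist_ok=True)
        (target / f"{record_id}.json").write_text(json.dumps(payload))

An application that instead hard-codes its data path to a local disk fails this test: after a failover, the surviving node has no way to reach the data.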

As an alternative, the virtualization-ready ftServer requires no special application support and no workarounds; it is a simple plug-in-and-go solution.
