Across industries, companies agree that the cost of unplanned downtime is substantial. The vexing question is: how much?
Surprisingly, a survey of operations people found that 71% of respondents admitted their company is not tracking downtime cost with any quantifiable metrics. That means most companies won’t know what an outage costs until it occurs and by then it’s too late to prevent such an incident.
In stark contrast, Stratus customers are keenly aware of how unplanned downtime could impact their businesses. In fact, a recent TechValidate survey of 533 Stratus users identified the five biggest cost factors from unplanned downtime:
Loss of Productivity – Think about a critical production line sitting idle for hours or days. Or dozens of employees forced to revert to manual processes during an outage of operations systems. One of our manufacturing customers calculated their cost of unplanned production downtime at $33,000 per hour.
Loss of Revenue – If you can’t process and fulfill customer orders due to failed systems, revenue is inevitably reduced. For one Stratus customer, a national stock exchange handling more than one billion trade messages daily, even a few microseconds of downtime can mean revenue losses of tens of thousands of dollars.
Damage to Brand and Reputation – It’s a simple fact: When customers lose confidence in your business, they may go to a competitor. This also makes it difficult to attract new customers. In some cases, it could take years to rebuild your brand image and restore lost revenue.
Loss of Data – When critical systems fail, you could lose valuable transactional and historic data, such as intellectual property, customer records, and financial accounts. Without proper data protection, the cost to your business could be in the millions of dollars.
Non-Compliance – For highly regulated industries, such as public utilities, unplanned downtime can mean stiff fines. Regulators often require demonstrable proof of continuous data availability. The cost of non-compliance can quickly add up, and in some instances result in suspension of your operating license.
With these considerations in mind, and because we help customers size these variables every day based on their inputs, we developed an online Stratus Cost-of-Downtime Calculator. This tool helps professionals like you figure out the full financial impact of downtime on your organization. Check it out. It will help you easily determine how quickly and affordably downtime can be prevented.
And assuming your business could benefit from a solution that prevents downtime, we recommend a three-step approach that is extremely reliable and cost efficient:
- First, virtualize your critical systems to drastically reduce the number of physical systems in your environment—and the number of potential points of failure.
- Next, run your virtualized systems on Stratus always-on servers. With integrated redundancy, Stratus servers ensure continuous availability of your virtualized applications, without a single point of failure or risk of data loss.
- Finally, for maximum protection, mirror the always-on Stratus solution to a geographically remote site. That way, even if you lose your production site, your business keeps running.
The Stratus philosophy is simple—the best way to avoid the major costs of unplanned downtime is to prevent it from happening in the first place.
Organizations today expect their applications to be continuously available. Their businesses depend upon it. But what happens when unplanned downtime occurs?
Data is lost.
Brand reputations are impacted.
There is potential for regulatory fines.
Revenue is lost.
With these types of impacts organizations must be taking steps to reduce or prevent downtime, right? The reality is that conversations around unplanned downtime and strategies to prevent it are often met with skepticism. It is usually because the impact of downtime is something an organization only appreciates after they’ve experienced an outage. We at Stratus know this all too well because we have been helping customers implement solutions to prevent unplanned downtime for 36+ years.
The results of a new Stratus survey further reinforce the problem. The survey was completed by 250 IT decision makers involved in purchasing or managing high availability solutions for IT or OT platforms. It found that although IT applications are known to be unable to tolerate the average length of a downtime incident, decision makers struggle to quantify the cost of downtime and, in turn, to justify investing in the right solutions to meet business requirements.
Among the survey’s findings:
- Unplanned downtime is a huge vulnerability in today’s IT systems: 72% of applications are not intended to experience more than 60 minutes of downtime, well below the average downtime length of 87 minutes
- The cost of downtime is the primary ROI justification when assessing availability solutions: 47% of respondents said the estimated cost of downtime is the primary cost justification when considering the adoption of fault-tolerant or high availability solutions
- However, most organizations cannot quantify the impact of unplanned downtime: 71% of respondents are not tracking downtime with a quantified measure of its cost to the organization
Despite struggling to justify the need to invest in solutions that prevent unplanned downtime, IT organizations are looking for ways to address the problem. One of the common strategies is to leverage high availability features of the current infrastructure including clusters and virtualization technologies. The same study reported that 84% of responding IT organizations who look to virtualization to ensure the availability of their applications still struggle to prevent downtime. The top reasons provided include the high costs of additional operating system or application licenses, the complexity of configuring and managing the environment or the failover time of the solution not meeting SLAs.
The challenges facing IT decision makers are going to continue to grow with the increased adoption of edge-based systems, including Industrial Internet of Things (IIoT) technologies. The availability of applications will ensure that information keeps flowing in our increasingly connected world. To meet the challenges of today and prepare for the future, organizations must make efforts to eliminate the risk from the equation.
The bottom line? Unplanned downtime presents a growing risk to organizations that are increasingly reliant on their applications being always available.
Stratus helps organizations prevent application downtime, period. While there are other solutions that can achieve 99.95% availability, Stratus solutions enable the simple deployment and management of cost-effective continuously available infrastructures without changing your applications. Move the hours of downtime and complexity of other solutions aside and support your applications with operationally simple continuous availability.
Want to learn more about how to determine the right amount of investment to combat the negative impacts of downtime? Download this Aberdeen Analyst Insight.
Craig Resnick of the ARC Advisory Group shared his insights on how to eliminate unplanned downtime and future-proof automation system assets in a recent webinar. The webinar reviewed the ever-present consequences that can occur from unplanned downtime and some of the leading causes. Strategies to reduce unplanned downtime through implementing updated SCADA systems and using technologies such as virtualization and fault-tolerant computers were discussed, as well as how organizations can leverage those strategies to prepare for the coming wave of IIoT.
Here’s a summary of the key take-aways:
- Understanding the true impact of unplanned downtime can lead to a better understanding of where investments can be made in automation systems to reduce such events.
- Unplanned downtime can occur from a variety of areas, including human errors, failure of assets that are not part of the direct supervisory and control chain, and failure of the SCADA systems themselves. The result is lowered OEE, decreased efficiency and reduced profitability.
- Adopting standards-based platforms and implementing technologies such as virtualization can consolidate SCADA server infrastructure and deliver a range of benefits, such as simplified management, easy testing and upgrading of existing and new applications and preparation for the IIoT.
- When virtualizing it is important to understand that you need to protect your server assets, as moving everything to a single virtualized platform means that everything fails if the platform fails. There are various strategies to prevent this, but it is important to ensure that you don’t swap the complexity of a single server per application for a complex failure recovery mechanism in a virtualized environment.
- Fault-tolerant platforms are a key way to avoid this complexity, delivering simplicity and reliability in virtualized implementations, eliminating unplanned downtime and preventing data loss – a critical element in many automation environments, and essential for IIoT analytics. It is important to note that disaster recovery should not be confused with fault-tolerance. DR provides geographic redundancy in case of catastrophic failures, but will not prevent some downtime or data loss. In fact, fault-tolerance and DR are complementary and they are often implemented together.
- IIoT is driving OT and IT together, so it is important to understand the priorities of each organization. In fact, OT and IT share a lot of common ground when it comes to key issues, and this is a good starting point for cooperating in the move towards IIoT. Common requirements include no unscheduled downtime, cyber-security, the need for scalable and upgradeable systems and applications, as well as measurable increases in ROI, ROA and KPIs. Last but not least is future-proofing systems and preparation for future IIoT applications.
This webinar is a good way to start the process of looking into what needs to be considered for upgrading and modernizing automation compute assets, using technologies such as virtualization and fault tolerance, as the industry evolves to increased levels of efficiency and moves towards implementing IIoT.
In 2014, downtime impacted the bottom line more than ever before. Twenty-three percent of the most expensive outages reported were caused by unplanned IT equipment failures and the unfortunate truth is 52% of enterprise executives believe most of these incidents could have been prevented. So, why gamble on downtime?
An incident at any time of year can have significant repercussions on your reputation, brand, revenue, staffing, and compliance posture, but the stakes are even higher during peak revenue periods (such as the holidays) for credit card processors, retailers, manufacturers, airlines, telcos and other businesses. It is during these periods when always-on availability is mandatory.
Can you imagine the effect of one single outage during a peak time of the year that accounts for nearly 20% of total annual retail sales? Or the impact on the entire commerce ecosystem if there was an outage on Singles’ Day – China’s biggest commercial holiday – when shoppers bought more than $9 billion in goods for themselves on November 11th? To put that in perspective, that’s close to three times what Americans spent last year on Black Friday and Cyber Monday combined!
To put a dollar value to this, consider the cost implications of the Best Buy website outage just this past Black Friday – an issue reportedly brought on by the onslaught of mobile shoppers. If just one minute of IT downtime costs on average $8,023 and the website was down for 2.5 hours, the consumer electronics store would have lost an estimated $1.2 million (150 minutes x $8,023). That’s a hefty price in goods and it doesn’t account for the many disgruntled consumers either.
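The estimate above is a simple rate-times-duration calculation. Here is a minimal Python sketch of it; the function name `outage_cost` is hypothetical, and the inputs are the average per-minute cost and the outage length quoted in the text, not actual Best Buy figures.

```python
def outage_cost(cost_per_minute: float, outage_minutes: float) -> float:
    """Estimated loss: average per-minute downtime cost times outage length."""
    return cost_per_minute * outage_minutes


cost_per_minute = 8_023      # average cost of one minute of IT downtime (from the text)
outage_minutes = 2.5 * 60    # a 2.5-hour outage expressed in minutes

estimate = outage_cost(cost_per_minute, outage_minutes)
print(f"Estimated loss: ${estimate:,.0f}")
```

Because the per-minute figure is an industry average, the result is best read as an order-of-magnitude estimate rather than an actual loss.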
So, what do you do to ensure you don’t experience downtime during a peak season and the most critical times of the year? Think always-on availability to give you the peace of mind of having the highest levels of reliability when and where you need it most. We offer several flavors of reliability to suit a variety of customer needs:
- Fault Tolerance: First, we announced the Stratus ftServer, with fault-tolerant hardware to prevent downtime and data loss for mission critical applications. Fast forward to today, now we have brought this mainframe-like technology to software with everRun Enterprise, broadening our portfolio of continuous availability solutions to meet the dynamic needs of businesses that can’t afford any downtime. Imagine the chaos and damage that would occur if an airline’s security scans, baggage transport, or air traffic control stopped working at any time – never mind during the busiest travel days of the year?
- Reliability at the Edge: No matter the season, it’s critical to ensure continuous availability at the extremes of your network. You need the assurance and peace of mind your edge applications will have the reliability and processing power they need to deliver the performance required at the edge where you may have limited IT support. Consider the importance for a food manufacturer to ensure the uninterrupted production of its products at each of its manufacturing plants during peak demand periods. Companies are ensuring the highest levels of reliability both in the data center and at the network’s edge with Stratus’ everRun Enterprise and ftServer.
- Software Defined Availability: What if you could “dial up” availability as required on an application by application basis? Software Defined Availability will do exactly that. In other words we’re providing the ability to dynamically adjust availability levels – from fault tolerance during peak times (such as the holidays for retailers) to lower levels of availability during a lull. This is a significant cost and resource advantage for businesses that only require the assurance of zero downtime for portions of their fiscal year – think payroll applications for example.
- Reliability in the Cloud: Enterprises are reluctant to move business critical applications to the cloud because there is a huge gap between the levels of availability they need (less than 5 minutes of downtime per year) and the levels provided by most private cloud solutions (less than 5 minutes of downtime per week). Achieving these levels without building availability into the application itself is exactly what Stratus is working on. We recently gave the first public demo of how we run business critical workloads in an OpenStack cloud environment at the OpenStack Summit in Paris.
No matter what your situation, whether you’re a business “livin’ on the edge,” in the data center, or in the cloud, Stratus is committed to providing the reliability you need, when you need it, so that you are completely protected during the crucial times of the year.
The infographic offers lots of good statistics about the causes and costs of downtime, along with next steps companies can take to understand their risk and potential costs, so they can procure additional funds to protect against future availability issues.
Let’s take a quick look at the high level findings published in the Infographic.
91% still experience downtime
33% of all downtime is caused by IT equipment failure
IT equipment failure causes the most expensive outages (23%), twice as high as every other cause except cyber crime (21%).
Average length of downtime is still over 86 minutes
Average cost of downtime has increased 54% to $8,023 per minute.
Based on these statistics, 30% (33% of 91%) of all data centers will have downtime related to IT equipment failure. Assuming they only have one incident of the average length, they would incur $689,978 (86 x $8,023) in downtime related costs.
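The two figures above can be reproduced with a short Python sketch. The function names are hypothetical. Like the text, the sketch treats the 33% share of downtime as if it were a share of data centers, and assumes a single incident of average length.

```python
def equipment_failure_share(share_with_downtime: float, share_from_equipment: float) -> float:
    """Share of all data centers expected to see equipment-related downtime."""
    return share_with_downtime * share_from_equipment


def incident_cost(minutes: float, cost_per_minute: float) -> float:
    """Cost of a single downtime incident of the given length."""
    return minutes * cost_per_minute


share = equipment_failure_share(0.91, 0.33)   # 91% see downtime; 33% of it is equipment failure
cost = incident_cost(86, 8_023)               # one 86-minute incident at $8,023 per minute

print(f"Data centers affected: {share:.0%}")
print(f"Cost of one average incident: ${cost:,.0f}")
```

A site that experiences more than one incident per year would multiply the cost accordingly, so the single-incident figure is a lower bound under these assumptions.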
Stratus can address 33% of the most costly downtime with our fault-tolerant hardware and software solutions.
52% believe the outages could have been prevented. This makes sense, because 48% of outages are caused by accidents and human error. Only training, personnel changes or outsourcing can reduce that cause of downtime.
70% believe cloud is equal to or better than their existing availability. That’s if you don’t look too closely at the SLA details (e.g. exclusions for “emergency maintenance”, or downtime that only counts toward the SLA if it exceeds xx min per incident). Certainly most cloud providers can provide better than the 99.98% [(525,600-86)/525,600] availability these data centers are currently averaging (assuming only one incident of average length). But remember, all SLAs are limited to the cost of the service, which I assume is far less than the almost $700k in downtime-related costs most in the survey have realized.
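The availability figure in brackets above follows from the number of minutes in a year. A minimal Python sketch, with a hypothetical function name and the same single-incident assumption as the text:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability(downtime_minutes: float, period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Fraction of the period during which the system was up."""
    return (period_minutes - downtime_minutes) / period_minutes


# One incident of average length (86 minutes) in a year.
print(f"{availability(86):.2%}")
```

The same function can be used to compare an SLA’s promised percentage against measured downtime, provided both cover the same period.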
Cloud solutions are constantly improving; but we continue to hear from our customers that availability still has a long way to go, especially when it comes to stateful legacy workloads that don’t have availability built into the application like native cloud apps. Of course, this is something that we at Stratus are working on.
My advice: look into availability options and invest upfront in the best availability you can afford. It might not pay dividends right away, but an ounce of prevention is worth a pound of cure. After all, $50k spent on availability might save $700k in downtime-related costs, not to mention headaches and a tarnished reputation.
What if you had the choice of having your applications available 99% of the time versus 99.9995% of the time – would you really experience a difference?
What is your typical morning like? Perhaps it begins with breakfast, followed by the morning news and a 30-minute workout. But not everything goes as planned. Sometimes the cereal you were hoping for or the web site you frequent for current events aren’t available. For these daily decisions, the answers are easy – eat something else or try another URL. Honestly, if your favorite cereal was only available 90% of the time, you’d be fine.
A typical day at the office generally starts out with a similar pattern – turn on the computer, log-on and begin using the applications essential to your job and company’s success. For the majority of the time, most days go as planned. However, what happens when your routine goes awry? What’s the effect on your company’s productivity when the applications you and your colleagues depend upon go down and everything comes to a screeching halt? What if the application is outward facing and affects customers trying to do business with you? What if it happens at a peak time? These are all questions someone considered when deciding what type of availability solution was required for the application (or at least you hope they did). The effects are as much as the potential costs – but that is a story for another post.
This Availability Journey Infographic does a great job of representing almost every factor you should consider and classifies the probable solutions by their average yearly downtime. This average is translated into a “Downtime Index Multiplier” that can be used to help calculate your company’s “Yearly Downtime Risk”. The Downtime Index Multiplier is shown at each stage in the infographic. It is derived from the average downtime for the given solution — converting hours, minutes and seconds into a decimal format for multiplication. So, a solution with 99% availability has about 87 hours and 36 minutes of yearly downtime – converting to a Downtime Index Multiplier of 87.6 (87+(36 /60)). You use this multiplier to calculate your yearly downtime risk for the solution as shown on the Availability Journey Infographic. For example, if you calculated your application’s hourly cost of downtime at only $10,000 – your yearly downtime risk, at a 99% availability rate, would be $876,000 ($10,000 x 87.6). In comparison, a 99.9995% solution has only 2 minutes and 38 seconds of yearly downtime – or an index of only 0.04. Using the example above, the yearly downtime risk would be $400 ($10,000 x .04). Thus, if your application’s hourly downtime costs were only $10,000 an hour, the difference in yearly risk between the lowest and highest availability solutions would be $875,600.
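The Downtime Index Multiplier arithmetic above can be sketched in a few lines of Python. The function names are hypothetical and not part of any Stratus tool. The index is rounded to two decimals only to match the 0.04 figure quoted in the text; the unrounded value for 99.9995% is about 0.0438.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_index(availability_pct: float) -> float:
    """Expected yearly downtime, in decimal hours, for a given availability."""
    return (100 - availability_pct) * HOURS_PER_YEAR / 100


def yearly_downtime_risk(hourly_cost: float, index: float) -> float:
    """Hourly cost of downtime multiplied by the Downtime Index Multiplier."""
    return hourly_cost * index


hourly_cost = 10_000  # example hourly cost of downtime from the text

for availability_pct in (99.0, 99.9995):
    # Round to two decimals to match the figures quoted in the text.
    index = round(downtime_index(availability_pct), 2)
    risk = yearly_downtime_risk(hourly_cost, index)
    print(f"{availability_pct}% -> index {index}, yearly risk ${risk:,.0f}")
```

Swapping in your own hourly cost and the availability level of each candidate solution gives a quick side-by-side comparison of yearly risk.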
Today’s top-of-the-line availability solutions are not the purpose-built, exorbitantly priced mainframes of yesterday. They are industry-standard plug-and-play solutions that fit into almost any infrastructure, including virtualized and cloud. One thing I can guarantee: unless you’re a credit card company, fault-tolerant hardware or software will cost you only a fraction of the $876K at risk in the example above. Then again, if you were, you’d already be utilizing fault-tolerance because your risk is probably in the billions even without considering the hidden costs of downtime like damaged reputation, regulatory impact and lost customers.
So, what is the cost of downtime and availability goal for your company’s applications in this always-on world? Well, 67% of best-in-class organizations use fault-tolerant servers for high availability. Be careful if you’re part of the 66% who still rely on traditional backup for availability… because you are taking a huge gamble.
Over 100 million people worldwide tuned in to watch Super Bowl XLVII, which arguably made its blackout the most viewed and most infamous power outage ever.
It just goes to show, downtime happens.
We can’t really say for sure how or what occurred. Early speculation placed blame on Beyoncé’s lights-out performance, but a manager at the Superdome, site of the game, said the halftime show was not the cause and that a local energy company is claiming it had trouble with one of the two main lines that deliver power to the stadium from a local substation.
It could have been a software glitch, or a hardware problem that sacked power to the stadium for 33 minutes and left the NFL with a black eye. But the downtime incident powered a social media surge, as hundreds of thousands of people began Tweeting about the #poweroutage.
Which brings us to Twitter itself. Having suffered its own downtime nightmare back on January 31, Twitter was able to handle the blitz of people tweeting about the Super Bowl’s misfortune. Twitter announced it processed just over 24 million tweets during the game, with the mini missives coming in at a rate of 231,500 a minute during the power outage.
Downtime appears in many different forms and at many different times, across all industries and business landscapes. The Twitter downtime occurrence was very different from the one the NFL witnessed, but both incidents took their toll financially and in terms of brand reputation.
Within the enterprise there is an acceptable level of downtime that occurs each year. On average, businesses suffer between three and five hours of downtime per year, far too much in our humble opinion, at an average cost of $138,888 per hour. While that’s a staggering figure, the damage to the brand can be even more catastrophic.
Let’s get back to the Super Bowl and the power outage. The City of New Orleans, which hosted the game, is already worried it’ll lose out on hosting future games because of what happened. That’s a city known for its ability to show its visitors a good time, but those businesses that depend on major events like the Super Bowl to draw in tourism dollars could suffer from that 33-minute absence of electricity.
Again, downtime comes in many forms depending on the industry and the ramifications have the potential to throw their victims for a significant loss. It’s like that old saying that you need to expect the unexpected. When the unexpected does arrive you have to be prepared to come back from that downtime swiftly and with as little disruption to your business as possible. With the right technology and the right best practices in place, you can minimize the damage and decrease the chance of downtime seriously hampering your ability to do business.
Have you ever thought what a minute of your time is worth?
Let’s say you get paid $60 an hour – then one minute is worth $1. If you are reading this, then, my bet is you are probably willing to spend $1 waiting for an answer. Chances are you will wait much longer, especially if it’s on someone else’s dime. But, how long is too long to wait?
If you run a 911 response center (emergency phone service in the USA) then one minute of downtime is not measured in dollars but lives. Maybe you are the IT manager of a financial company, how many credit card transactions could you lose in one minute? One hundred? One thousand? Maybe many more.
In both these and many other commercial examples, the cost of downtime is both known and quantifiable. Businesses not only perform risk assessments on downtime but also make business decisions to avoid it. In 2011, eWeek reported a business could lose an average of about $5,000 per minute in an outage. As they say, “at that rate, $300,000 per hour is not something to dismiss lightly.” Given that, I think we can all agree that for critical business applications, uptime is pretty important to many businesses and now, to me too.
This week, I start a new job as chief marketing officer at Stratus Technologies, one of the world’s leaders in ensuring up-time for your applications. You will find our software and servers behind many things you use day-to-day and you would be pretty upset if they didn’t work. Examples would be supporting credit card transactions and 911 services. What makes this role interesting is not just these types of services, but also, how our solutions apply to others. Let me give you an example.
I was sitting in the Austin airport waiting to board the first of two flights that would take me to Boston, my new home. I wanted to let my friends on Facebook know that I had started my journey so I thought I would check-in on Foursquare – which automatically updates Facebook. Foursquare is down.
I wait until Dallas (I am changing through DFW – one of the downsides to Austin) and Foursquare is still down. When I arrive in Boston, hours later, Foursquare is up, so I check in. Of course, I could have given up on Foursquare and just checked-in on Facebook. In the cloud, there are often alternative ways of doing things.
This may seem like a trivial example, especially compared to a 911 service, but if you are Foursquare and in search of a business model, I suspect this is not good news. As that social site looks to monetize its platform, my guess is it will use ads. I need to be on the Foursquare service to see the ads. Another outage like this and I will not be on the service. The reality is it may not have been the site’s fault; it may be the service provider’s fault, but as a user, I don’t care.
Just a few weeks ago, in the CIO section of the Wall Street Journal, they reported “Netflix Amazon Outage Shows ‘Any Company Can Fail’.” Forrester Research analyst Rachel Dines is quoted as saying, “It’s all about timing. This was a big deal because it was one of the worst possible times it could happen as families gathered during Christmas to watch movies.” OK, so families could have talked to each other, but you get the point and there are plenty of other alternatives to Netflix.
What excites me about Stratus Technologies is not just how our technologies apply to established commercial businesses but also how they apply to these new cloud-based services. I have no doubt that as cloud applications become more important in our lives, Stratus Technologies will have a critical role to play in making them available all the time.
For now, I have a lot to learn about the business and I look forward to blogging about it as I go.
It’s no secret that system downtime is bad for business. For one thing, it’s expensive. According to a 2012 Aberdeen Group report, the average cost of an hour of downtime is now $138,888 USD — up more than 30% from 2010. Given these rising costs, it’s no wonder that ensuring high availability of business-critical applications is becoming a top priority for companies of all sizes.
When it comes to choosing the right downtime protection, there are a couple of important things to keep in mind. First, deployment of applications on hypervisor software for server virtualization is increasing at a steady pace and is expected to continue until almost all applications are implemented on virtualized servers. As a result, you need to make sure that your downtime protection is able to support virtualized as well as non-virtualized applications. Second, with IT spending and headcount on the decline, downtime protection should be easy to install and maintain since there are fewer IT resources available to manage the assets.
Available downtime protection options range from adding no additional protection other than that offered by general-purpose servers to deploying applications on fault-tolerant hardware. Which option you choose will depend on the type of application in question. If the application is mission-critical, then you’ll need higher levels of protection. A strong segment of companies are choosing to protect each of their mission critical applications with fault-tolerant servers because they provide the highest availability, require no specialized IT skills, and are now priced within reach of even small to mid-size companies. Looking for guidance in choosing the right downtime protection for your “can’t fail” applications? Download the Aberdeen Group report to learn more.
This month, we’re kicking off a new blog series that will examine availability in terms of the nines. We’ll provide you with some basic, real world examples to give you a better idea of how one little 9 can make a big impact. In this installment, we’ll look at what this means for your favorite social media sites – Twitter and Facebook.
Any discussion of server availability and downtime will inevitably lead to a conversation about the nines, or the percentage of uptime that can be expected from the server environment. For our purposes, we’ll look at how two social media sites measure up in five different levels of uptime – from the lowest at 99 percent, to the highest at 99.999 percent (which Stratus ensures with our fault-tolerant ftServers).
Twitter: According to The Social Skinny, Twitter now has more than 140 million active users, who send 340 million tweets every day. How many tweets would fail to be sent if the site experienced downtime?
| Level of Uptime | 100% | 99.999% | 99.99% | 99.95% | 99.9% | 99% |
|---|---|---|---|---|---|---|
| Tweets that would be sent | 340,000,000 | 339,996,600 | 339,966,000 | 339,830,000 | 339,660,000 | 336,600,000 |
| Tweets that would go unsent | 0 | 3,400 | 34,000 | 170,000 | 340,000 | 3,400,000 |
What would you do if the tweet announcing your new product was one of the 3 million plus that didn’t make it up due to a 99 percent availability rate?
Facebook: Over the years Facebook has evolved from a social network for college students to a platform for businesses to connect with consumers and partners, and a way for family and friends to stay connected even though they may be many miles apart. Needless to say, one of the best ways to do all of this is through sharing pictures. In fact, according to GigaOM, on average more than 300 million photos were uploaded to Facebook per day from January – March, 2012.
| Level of Uptime | 100% | 99.999% | 99.99% | 99.95% | 99.9% | 99% |
|---|---|---|---|---|---|---|
| Pictures that would be uploaded | 300,000,000 | 299,997,000 | 299,970,000 | 299,850,000 | 299,700,000 | 297,000,000 |
| Pictures that wouldn’t be uploaded | 0 | 3,000 | 30,000 | 150,000 | 300,000 | 3,000,000 |
Wouldn’t you be upset if the picture of your CEO giving his keynote speech at an industry show wasn’t able to be uploaded, due to downtime? How about missing out on some major milestone pictures of your grandchild, niece or nephew that lives across the country?
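The rows in both tables above come from multiplying daily volume by the fraction of time the service is down. Here is a minimal Python sketch of that calculation. The function name is hypothetical, and it assumes, as the tables do, that activity is spread evenly across the day so losses are proportional to downtime.

```python
UPTIME_LEVELS = [100, 99.999, 99.99, 99.95, 99.9, 99]  # percent


def lost_per_day(daily_volume: int, uptime_pct: float) -> int:
    """Items (tweets, photos, ...) that would not go through at a given uptime level."""
    return round(daily_volume * (100 - uptime_pct) / 100)


for label, daily_volume in (("Tweets", 340_000_000), ("Photos", 300_000_000)):
    for uptime_pct in UPTIME_LEVELS:
        lost = lost_per_day(daily_volume, uptime_pct)
        delivered = daily_volume - lost
        print(f"{label} at {uptime_pct}% uptime: {delivered:,} delivered, {lost:,} lost")
```

Replacing the daily volume with a count of orders or transactions turns the same table into a rough estimate for a business system.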
Now imagine that, instead of photos or tweets, these are dollars, dollars that your business loses when critical IT systems go down.
For more information on how much data center downtime costs, download our most recent report with Aberdeen Group, “Datacenter Downtime: How Much Does It Really Cost?”