Organizations today expect their applications to be continuously available. Their businesses depend upon it. But what happens when unplanned downtime occurs?
Data is lost.
Brand reputations are impacted.
There is potential for regulatory fines.
Revenue is lost.
With these types of impacts, organizations must be taking steps to reduce or prevent downtime, right? The reality is that conversations around unplanned downtime and strategies to prevent it are often met with skepticism, usually because the impact of downtime is something an organization only appreciates after it has experienced an outage. We at Stratus know this all too well because we have been helping customers implement solutions to prevent unplanned downtime for 36+ years.
The results of a new Stratus survey further reinforce the spiraling problem. The survey, completed by 250 IT decision makers involved in purchasing or managing high availability solutions for IT or OT platforms, found that while it is known that IT applications cannot tolerate the average length of a downtime incident, decision makers struggle to quantify the cost of downtime and in turn struggle to justify the investment in the right solutions to meet the business requirements.
Among the survey’s findings:
- Unplanned downtime is a huge vulnerability in today’s IT systems: 72% of applications are not intended to experience more than 60 minutes of downtime, well below the average downtime length of 87 minutes
- The cost of downtime is the primary ROI justification when assessing availability solutions: 47% of respondents said the estimated cost of downtime is the primary cost justification when considering the adoption of fault-tolerant or high availability solutions
- However, most organizations cannot quantify the impact of unplanned downtime: 71% of respondents are not tracking downtime with a quantified measure of its cost to the organization
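One way to close that measurement gap is to keep a simple incident log and put a price on each outage. The sketch below is illustrative only: the `Incident` record, the application names, and the hourly cost figures are assumptions made up for the example, and the 60-minute default tolerance simply echoes the survey figure above. It is not survey data or a Stratus tool.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One unplanned outage (hypothetical record layout)."""
    application: str
    minutes_down: float
    cost_per_hour: float  # assumed business cost of one hour of downtime


def incident_cost(incident: Incident) -> float:
    """Cost of one outage: duration in hours times the hourly cost."""
    return incident.minutes_down / 60.0 * incident.cost_per_hour


def summarize(incidents: list[Incident], tolerance_minutes: float = 60.0) -> dict:
    """Total cost, average outage length, and number of outages over tolerance."""
    if not incidents:
        raise ValueError("need at least one incident to summarize")
    total_cost = sum(incident_cost(i) for i in incidents)
    average_minutes = sum(i.minutes_down for i in incidents) / len(incidents)
    over_tolerance = sum(1 for i in incidents if i.minutes_down > tolerance_minutes)
    return {
        "total_cost": total_cost,
        "average_minutes": average_minutes,
        "over_tolerance": over_tolerance,
    }


# Made-up example log: two outages with assumed hourly costs.
log = [
    Incident("order-entry", 87, 10_000),
    Incident("scada-historian", 30, 4_000),
]
report = summarize(log)
print(report)
```

Even a log this small turns "downtime is expensive" into a number that can be set against the price of an availability solution.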
Despite struggling to justify the need to invest in solutions that prevent unplanned downtime, IT organizations are looking for ways to address the problem. One of the common strategies is to leverage high availability features of the current infrastructure including clusters and virtualization technologies. The same study reported that 84% of responding IT organizations who look to virtualization to ensure the availability of their applications still struggle to prevent downtime. The top reasons provided include the high costs of additional operating system or application licenses, the complexity of configuring and managing the environment or the failover time of the solution not meeting SLAs.
The challenges facing IT decision makers are going to continue to grow with the increased adoption of edge-based systems, including Industrial Internet of Things (IIoT) technologies. The availability of applications will ensure that information keeps flowing in our ever more connected world. To meet the challenges of today and prepare for the future, organizations must make efforts to eliminate the risk from the equation.
The bottom line? Unplanned downtime presents a growing risk to organizations that are increasingly reliant on their applications being always available.
Stratus helps organizations prevent application downtime, period. While there are other solutions that can achieve 99.95% availability, Stratus solutions enable the simple deployment and management of cost-effective continuously available infrastructures without changing your applications. Move the hours of downtime and complexity of other solutions aside and support your applications with operationally simple continuous availability.
Want to learn more about how to determine the right amount of investment to combat the negative impacts of downtime? Download this Aberdeen Analyst Insight.
Unplanned downtime has long been the nemesis of industrial operations. In recent years, we’ve seen tolerance for unplanned downtime get even lower. In fact, a recent survey by Stratus and the ARC Advisory Group reports that almost 40% of respondents said they could handle no more than 10 minutes of downtime per incident.
More than 20% said they could not tolerate downtime at all.
One reason is that industrial control systems (ICSs) produce data that’s become increasingly valuable to the business. A modern ICS can collect data down to the millisecond. When combined with analytics, this data enables initiatives like real-time automation and predictive maintenance, and accelerates adoption of the Industrial Internet of Things (IIoT), Industry 4.0, and smart factories. Simply put, the more you automate and reduce human errors through real-time system intelligence, the more you improve operational efficiency and drive higher profitability.
Recently, when I led an IndustryWeek webinar, I asked attendees what concerned them most about unplanned downtime. Not surprisingly, 54% identified potential revenue loss, 15% referenced loss of visibility resulting in a safety violation, and 13% highlighted the additional cost to run things manually.
Industry statistics support these concerns. According to ARC’s research, unplanned downtime results in 2-5% production loss in the petro-chemical industry. It costs natural gas companies about $10,000 per hour if a compression station goes down. Across the board, unplanned downtime in process industries costs ten times more than planned maintenance.
Modernizing ICS can lower these impacts and improve operating efficiency. So why don’t more organizations modernize? Many are concerned about complexity. They have numerous applications running on different machines that vary widely in age and configuration. The thought of upgrading such a jumble of systems can be a major inhibitor.
That’s why we see virtualization as the prime way forward for modernizing the ICS. Instead of needing lots of hardware, virtualization can often reduce everything to a single physical machine running multiple applications assigned to individual virtual machines. This makes it much easier to manage various elements of industrial automation, as well as add or upgrade applications.
Virtualization also takes the pain out of modernizing ICS because you can migrate systems gradually. A virtualized system can easily reside alongside your existing systems. Then you just move one application at a time from the traditional environment to the virtualized one.
Now, the infrastructure you choose for your virtualized ICS environment is critical. I asked the IndustryWeek webinar attendees what they considered the most important decision factor. Nearly 40% of respondents identified lifetime value because this is a system that could be in operation for at least seven to ten years. Another 26% of attendees referenced operational simplicity. Automation engineers don’t want to spend their valuable time on system administration; they want to focus on running the plant. And they want an infrastructure that helps minimize, if not eliminate, unplanned downtime.
Stratus fault-tolerant servers address every one of these points and more. So if you’re looking to modernize your ICS, Stratus can provide you with some compelling options.
In the world of securities trading, few things are more important than speed and availability. With electronic, algorithm-driven trading, delays of even a few microseconds can translate to financial losses of tens of thousands of dollars. Longer periods of system downtime could be ruinous, with financial losses soaring into the millions.
Tune in to a short video and learn how Stratus enables this leading international stock exchange to be always on without compromising performance.
That’s why stock exchanges around the world rely on Stratus. One of our international stock exchange customers handles more than a billion trade messages daily, making them one of the largest stock exchanges in the world by volume. It would be national, if not international, news if their systems went down. With Stratus, the exchange’s critical trading applications are always on and performing at top efficiency.
Unlike many other financial institutions, the exchange did not want to run horizontally distributed applications on traditional hardware clusters for higher availability due to the higher operational expenditures. The stock exchange also veered from building high availability into the application software because it would increase software overhead and engineering complexity.
Instead, the stock exchange avoids all that with Stratus hardware fault tolerance. An IT executive at the exchange explains, “Hardware fault tolerance gives us bullet-proof reliability without the cost of losing compute cycles.”
Those compute cycles are critical. Not only can software complexity steal cycles, but it also leads to numerous internal system events that can cause “jitter” and slow down a stock transaction. Seemingly inconsequential system events like turning on a fan to moderate operating temperature consume valuable computing resources and put a drag on performance. Complex clustering approaches only add to performance problems by introducing network latency. The combination of jitter and latency could pause a trading algorithm for hundreds of microseconds. That’s enough time for a stock price to change.
At Stratus, we’ve built controls into our hardware to control jitter and minimize latency, helping to avoid a potential negative business impact. In fact, by choosing Stratus hardware fault tolerance, this stock exchange customer’s trading application runs faster than it would had it pursued custom-coded high-availability software.
Plus, there is no need for in-house software engineers to write and maintain special code. Such “home grown” solutions just add cost and complexity, and the systems may still fail to meet the performance and availability demands of electronic trading. With Stratus, fault tolerance and performance optimization are built into the hardware, eliminating the need to modify software.
As a result, this stock exchange has saved millions of dollars by avoiding the cost of writing high-availability code into its software, not to mention the savings in floor space and administration time from choosing Stratus fault tolerance instead of a traditional hardware cluster. And the exchange maintains superior uptime levels of 99.999% with Stratus, which was not achievable with custom-coded high-availability software.
The IT executive at the exchange explains, “Stratus allows us to deploy our applications without worrying about hardware failure. That’s been a big advantage for us. It’s made our architecture and manageability a lot simpler. And we’ve been able to run without issue. It’s been a good ride for us.”
More and more, we’re seeing operations organizations virtualizing critical industrial automation (IA) applications such as Supervisory Control and Data Acquisition (SCADA) and human machine interface (HMI) historians. And whether virtualizing these systems or not, many companies choose to run their applications on fault-tolerant systems. Here are three examples of why:
Water and wastewater treatment facility
A municipal water and wastewater treatment facility struggled with declining income due to a shrinking tax base. On top of that, the EPA was tightening regulations, such as regular testing of field wells and other water sources and assurance of no data loss. Another big concern was control room “blindness” due to unplanned downtime of its SCADA systems.
By virtualizing and running their SCADA systems on always-on Stratus ftServers, the facility eliminated unplanned downtime. The facility also was able to demonstrate to regulatory auditors that continuous data availability was ensured. Plus, virtualizing reduced software licensing costs and the self-healing features of ftServers saved on staffing otherwise needed for system monitoring and support. As a result, the municipality ensures high water quality and satisfies EPA regulations all on a tight budget.
This scenario involves a manufacturing plant specializing in kraft-style paper and packaging. The company was running manufacturing execution system (MES) and sales order processing (SOP) applications on 20-year-old legacy systems. Their biggest concern was unplanned downtime, which cost the business $33,000 per hour.
The company replaced its legacy systems with Stratus always-on ftServers running state-of-the-art MES and SOP applications. Unplanned downtime is now a thing of the past. The Stratus systems are easy to operate and support, so the plant no longer needs IT staff on call. And, continuous operations without any line stoppages helped the company increase profitability.
A major natural gas transmission company operates about 80 compression stations along thousands of miles of pipeline. Most of their stations are in remote locations with limited space and power. Their existing SCADA/HMI infrastructure simply wasn’t built for those difficult conditions, and servers started to fail, causing downtime of two or three days each time.
So the company virtualized and now runs multiple IA applications on a single Stratus ftServer. This eliminated the downtime problem. It also reduced the number of servers at each compression station from eight to one, which decreased their IT expenditures significantly at the remote sites. Plus, the company eliminated data loss. This is critical to their predictive analytics systems, which tell them when equipment requires maintenance to avoid catastrophic failures.
This is a small sampling of how virtualization and fault tolerance benefit both SCADA/HMI and analytics. In fact, data availability of analytics is becoming one of the most important requirements in today’s modern operations environments—a trend that we’re seeing across all segments in the IA space.
It’s often difficult to fully understand the impact of modernizing an automation system and how an investment in fault-tolerant platforms from Stratus, along with updated Programmable Automation Controllers (PACs), can deliver rapid results. For Columbia Pipeline Group (recently purchased by TransCanada), the answer is approximately $2.3 million in 2014 alone, even with a partial pipeline upgrade. Using the Stratus ftServer platform, which delivers high availability in excess of five nines (99.999%), Columbia has achieved an overall system-wide availability level of over 99.5%. With a virtualized platform that is continuously available, the SCADA applications experience no unplanned downtime due to system failures, and the absence of data loss means that asset management systems and IIoT predictive analytics applications can effectively control access and provide accurate data to maximize maintenance effectiveness.
As building automation and security systems become increasingly reliant on server technology, ensuring the availability—or uptime—of the applications running on those servers is absolutely critical. But how much availability is “good enough”? And what’s the best way to achieve that level of availability?
To answer those questions, it’s important to understand the three basic approaches to server availability:
1. Data backups and restores:
Having basic backup, data-replication, and failover procedures in place is perhaps the most basic approach to server availability. This will help speed the restoration of an application and help preserve its data following a server failure. However, if backups are only occurring daily, significant amounts of data may be lost. At best, this approach delivers approximately 99 percent availability.
That sounds pretty good, but consider that it equates to an average of about 87.6 hours of downtime per year (1 percent of the 8,760 hours in a year), or more than 90 minutes of unplanned downtime per week. That might be good enough for a business application that is not mission critical, but it clearly falls short of the uptime requirements for building security and life-safety applications.
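The conversion from an availability percentage to expected downtime is simple arithmetic, and it can be sketched in a few lines. The function name and the 365-day year below are choices made for this illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Expected annual downtime, in hours, for a given availability percentage."""
    return (1 - availability_percent / 100.0) * HOURS_PER_YEAR


hours = downtime_hours_per_year(99.0)
print(f"{hours:.1f} hours per year")               # 87.6 hours per year
print(f"{hours * 60 / 52:.0f} minutes per week")   # 101 minutes per week
```

The same function can be applied to any other availability figure to see what it means in hours or minutes.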
2. High availability (HA)
HA includes both hardware-based and software-based approaches to reducing downtime. HA clusters are systems combining two or more servers running with an identical configuration, using software to keep application data synchronized on all servers. When one fails, another server in the cluster takes over, ideally with little or no disruption. However, HA clusters can be complex to deploy and manage. And you will need to license software on all cluster servers, increasing costs.
HA software, on the other hand, is designed to detect evolving problems proactively and prevent downtime. It uses predictive analytics to automatically identify, report and handle faults before they cause an outage. The continuous monitoring that this software offers is an advantage over the cluster approach, which only responds after a failure has occurred. Moreover, as a software-based solution, it runs on low-cost commodity hardware.
HA generally provides from 99.95 percent to 99.99 percent (or “four nines”) uptime. On average, that means from roughly 53 minutes to 4.4 hours of downtime per year, significantly better than basic backup strategies.
3. Fault-tolerance (FT)
Also called an “always-on” solution, FT’s goal is to reduce downtime to its lowest practical level. Again, this may be achieved either through sophisticated software or through specialized servers.
With a software approach, each application lives on two virtual machines with all data mirrored in real time. If one machine fails, the applications continue to run on the other machine with no interruption or data loss. If a single component fails, a healthy component from the second system takes over automatically.
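The idea of an application living on two machines with mirrored data can be illustrated with a toy model. This is a deliberately simplified sketch, not how any FT product is implemented: the `MirroredPair` class, the replica names, and the in-memory dictionaries are all hypothetical, and real systems also have to handle resynchronization when a failed machine returns.

```python
class MirroredPair:
    """Toy model of an application's state mirrored across two machines."""

    def __init__(self) -> None:
        # Each replica holds a full copy of the application's data.
        self.replicas = {"vm_a": {}, "vm_b": {}}
        self.healthy = {"vm_a": True, "vm_b": True}

    def write(self, key: str, value: str) -> None:
        """Apply the write to every healthy replica before returning."""
        targets = [name for name, ok in self.healthy.items() if ok]
        if not targets:
            raise RuntimeError("no healthy replica available")
        for name in targets:
            self.replicas[name][key] = value

    def read(self, key: str) -> str:
        """Serve the read from the first healthy replica."""
        for name, ok in self.healthy.items():
            if ok:
                return self.replicas[name][key]
        raise RuntimeError("no healthy replica available")

    def fail(self, name: str) -> None:
        """Simulate the loss of one machine."""
        self.healthy[name] = False


pair = MirroredPair()
pair.write("valve_7", "open")
pair.fail("vm_a")                 # the first machine is lost
print(pair.read("valve_7"))       # still answered, now by vm_b
pair.write("valve_7", "closed")   # writes continue on the survivor
print(pair.read("valve_7"))
```

Because every write lands on both copies before it is acknowledged, the surviving machine already holds the latest data when its partner fails, which is why no restore step is needed.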
FT software can also facilitate disaster recovery with multi-site capabilities. If, for example, one server is destroyed by fire or sprinklers, the machine at the other location will take over seamlessly. This software-based approach prevents data loss, is simple to configure and manage, requires no special IT skills, and delivers upwards of 99.999 percent availability (about five minutes of downtime a year), all on standard hardware.
FT server systems rely on specialized servers purpose-built to prevent failures from happening and integrate hardware, software and services for simplified management. They feature both redundant components and error-detection software running in a virtualized environment. This approach also delivers “five nines” availability, though the specialized hardware required does push up the capital cost.
Making server availability a cornerstone of your building security automation strategy pays dividends both in terms of day-to-day management and when situations arise that test your security. With the right strategy up front, your building’s security systems will be there when it really counts today and in the future. In today’s constantly changing, “always-on” world, that’s all the time.
Downtime prevention is becoming a top priority for organizations across all market sectors — from manufacturing, building security and telecommunications to financial services, public safety and healthcare. What’s driving this requirement for always-on applications? It’s partly due to the rapid expansion of users, environments, and devices. Increasingly, however, organizations require high application availability to compete successfully in a global economy, comply with regulations, mitigate potential disasters, and plan for business continuity. All these factors contribute to a growing demand for high-performance availability solutions to keep applications up and running.
The good news is that there are many effective availability solutions available on the market today including standard servers with backup, continuous data replication, traditional high-availability clusters, virtualization and fault-tolerant solutions. But with so many options, figuring out which approach is good enough for your organization can seem overwhelming.
Understanding the criticality of your computing environment is a good place to start. This involves assessing downtime consequences on an application-by-application basis. If you’ve virtualized applications to save costs and optimize resources, remember that your virtualized servers present a single point of failure that extends to all the virtual machines running on them, thereby increasing the potential impact of downtime. Depending on the criticality of your applications, you may be able to get by with the availability features built into your existing infrastructure or you may need to invest in a more powerful and reliable availability solution — perhaps one that proactively prevents downtime rather than just speeding and simplifying recovery.
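An application-by-application assessment can start as a simple inventory that maps each application to its host and to an estimated hourly cost of downtime. The sketch below uses made-up application names, host names, and cost figures purely for illustration. It adds up what is at risk on each physical host, which makes the single-point-of-failure effect of consolidation visible.

```python
from collections import defaultdict

# Hypothetical inventory: application -> (host, assumed downtime cost per hour)
inventory = {
    "scada": ("host-1", 12_000),
    "historian": ("host-1", 3_000),
    "mes": ("host-1", 8_000),
    "email": ("host-2", 500),
}


def exposure_per_host(apps: dict[str, tuple[str, float]]) -> dict[str, float]:
    """Sum the hourly downtime cost of every application sharing a host."""
    totals: dict[str, float] = defaultdict(float)
    for host, cost_per_hour in apps.values():
        totals[host] += cost_per_hour
    return dict(totals)


exposure = exposure_per_host(inventory)
for host, cost in sorted(exposure.items(), key=lambda item: item[1], reverse=True):
    print(f"{host}: ${cost:,.0f} per hour at risk if this host fails")
```

Hosts at the top of the list are the ones where built-in availability features are least likely to be enough.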
But availability level is not the only factor to consider when selecting a best-fit solution to protect your applications against downtime. Stratus has created a Downtime Prevention Buyer’s Guide to streamline the evaluation process and, ultimately, help you make the right choice of availability solution. The guide presents six key questions you should ask vendors along with valuable insights into the strengths and limitations of various approaches. You can use vendors’ responses to objectively compare solutions and identify those that best meet your availability requirements, recovery time objectives, IT management capabilities, and return on investment goals, while integrating seamlessly within your existing IT infrastructure.
Alaska is an environment tailor-made for Stratus Technologies’ solutions. The state is remote, access to IT expertise is challenging in many locations, and the environmental conditions can present a myriad of issues. Combine this with small, remote communities that need to run automation systems for water, wastewater and other utilities, and the requirement for a fault-tolerant solution is easy to understand. Alaska’s main industries include oil and gas, fishing, processing, and mining.
At CB Pacific’s recent Automation Symposium in Anchorage, Alaska, I had the opportunity to meet with many end users and system integrators to discuss their challenges and understand how Stratus could be of service to them. With oil and gas commodity prices under pressure and the major Alaska producers facing financial constraints, these producers are looking for cost-saving solutions and efficiencies across all aspects of their business. With only one major road up to the North Slope, and remote pumping and compression stations that are accessible only by helicopter, highly redundant systems are an absolute necessity. In addition, the weather can add another level of complexity for those who service and maintain the critical infrastructure.
Traditional solutions rely on multiple standard servers configured in a variety of redundant configurations to keep things running. Stratus, working with its partner CB Pacific and local systems integrators, is able to offer a single fault-tolerant solution. Its integrated redundancy continuously monitors and diagnoses potential problems, allowing for an environment that has no downtime and no data loss. CB Pacific and its system integrator partners handle delivery of the easy-to-replace, hot-swappable parts. This streamlined solution makes for a very cost-effective implementation and eliminates the fear of blind moments caused by unplanned downtime. Only Stratus can provide a fault-tolerant solution from which to run your critical monitoring and safety applications.
The partnership of Stratus, the global standard for fault-tolerant hardware solutions, and Solarflare, the unchallenged leader in application network acceleration for financial services, at face value seems like an odd one. Stratus ‘always on’ server technology removes all single points of failure, which eliminates the need to write and maintain costly code to ensure high availability and fast failover scenarios. But up until recently Stratus and high performance have rarely been used in the same sentence.
Let’s go back further… Throughout the 1980s and ’90s, Stratus and its proprietary VOS operating system globally dominated financial services, from exchanges to investment banks. In those days, the priority for trading infrastructures was uptime, which was provided by resilient hardware and software architectures. With the advent of electronic trading, the needs of today’s capital markets have shifted. High Frequency Trading (HFT) has resulted in an explosion in transactional volumes. Driven by the requirements of one of the largest stock exchanges in the world, Stratus realized that critical applications need to be not only highly available, but also extremely focused on performance (low latency) and deterministic (zero jitter) behavior.
Stratus provides a solution that guarantees availability in business-critical trading systems, without the costly overhead associated with today’s software-based High Availability (HA) solutions or the need for multiple physical servers. You could conceivably cut your server footprint in half by using a single Stratus server where before you’d need at least two physical servers. Stratus is also a “drop and go” solution. No custom code needs to be written; customer applications do not have to be built specifically for Stratus FT. And this isn’t just for Linux environments: Stratus also has hardened OS solutions for Windows and VMware.
Solarflare brings low-latency networking to the relationship with its custom Ethernet controller ASIC and Onload, its Linux operating system bypass communications stack. Normally, network traffic arrives at the server’s network interface card (NIC) and is passed to the operating system through the host CPU. This process involves copying the network data several times and switching the CPU’s context between kernel and user mode one or more times. All of these events take both time and CPU cycles. With over a decade of R&D, Solarflare has considerably shortened this path. With Solarflare, applications often receive data in about 20% of the time it would typically take. The saving is measured in microseconds (millionths of a second), typically several or more. In trading, speed often matters most, so a dollar value can be placed on this saving. Back in 2010, one trader valued the savings at $50,000 per microsecond for each day of trading.
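To see how a dollar value follows from those two figures, here is a back-of-the-envelope sketch. The 20% ratio and the $50,000 per microsecond per day valuation come from the paragraph above. The 5-microsecond baseline receive time is purely an assumption chosen for illustration, as are the variable and function names.

```python
BASELINE_RECEIVE_US = 5.0         # assumed receive time without OS bypass
BYPASS_FRACTION = 0.20            # data arrives in about 20% of the usual time
VALUE_PER_US_PER_DAY = 50_000     # one trader's 2010 valuation, in dollars


def daily_value_of_bypass(baseline_us: float) -> float:
    """Dollar value per trading day of the microseconds saved by OS bypass."""
    saved_us = baseline_us * (1 - BYPASS_FRACTION)
    return saved_us * VALUE_PER_US_PER_DAY


value = daily_value_of_bypass(BASELINE_RECEIVE_US)
print(f"${value:,.0f} per trading day")
```

With a different baseline the result scales linearly, so the estimate is only as good as the measured receive time that goes into it.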
Stratus and Solarflare have worked together to reduce jitter to nearly zero. Jitter is caused by those seemingly inevitable events that distract a CPU core from its primary task of electronic trading. For example, the temperature at a thermal sensor somewhere in the system may exceed a predetermined level and raise a system interrupt. A CPU core is then assigned to handle that interrupt and determine which fan needs to be turned on or sped up. While this event sounds trivial, the distraction of processing the interrupt and returning to trading often results in a delay measured in hundreds of microseconds; that delay is the jitter. Imagine your trading strategy normally executes in tens of microseconds, network latency adds 1-2 microseconds, and then all of a sudden the system pauses your trading algorithm for 250 microseconds while it does some system housekeeping. By the time control is returned to your algo, it’s very possible that the value of what you’re trading has changed. Both Stratus and Solarflare have worked exceedingly hard to remove jitter from the FT platform.
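The effect of such pauses is easiest to see in the tail of the latency distribution rather than in the typical case. The sketch below is a simple deterministic model, not a measurement: the 30-microsecond strategy time and the assumption that one order in a hundred is interrupted are invented for illustration, while the 2-microsecond network latency and 250-microsecond pause follow the example above.

```python
import statistics

STRATEGY_US = 30.0   # assumed strategy execution time (tens of microseconds)
NETWORK_US = 2.0     # network latency, top of the 1-2 microsecond range
PAUSE_US = 250.0     # housekeeping pause from the example above


def order_latencies(n_orders: int, pause_every: int) -> list[float]:
    """Latency of each order; every `pause_every`-th order hits a pause.

    A `pause_every` of 0 means no order is ever interrupted.
    """
    latencies = []
    for i in range(1, n_orders + 1):
        latency = STRATEGY_US + NETWORK_US
        if pause_every and i % pause_every == 0:
            latency += PAUSE_US
        latencies.append(latency)
    return latencies


quiet = order_latencies(1000, pause_every=0)
jittery = order_latencies(1000, pause_every=100)  # assumed: 1% of orders interrupted

for label, sample in (("no jitter", quiet), ("1% jitter", jittery)):
    print(
        f"{label}: median={statistics.median(sample):.1f} us, "
        f"mean={statistics.mean(sample):.1f} us, max={max(sample):.1f} us"
    )
```

In this model the median is identical in both runs, while the worst case is almost nine times slower with jitter. That is why deterministic behavior is judged on worst-case latency rather than on averages.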
Going forward, Solarflare and Stratus will be adding Precision Time Protocol support to a new version of Onload for the Stratus FT Platform.
Craig Resnick of the ARC Advisory Group shared his insights on how to eliminate unplanned downtime and future-proof automation system assets in a recent webinar. The webinar reviewed the ever-present consequences of unplanned downtime and some of its leading causes. It also discussed strategies to reduce unplanned downtime by implementing updated SCADA systems and using technologies such as virtualization and fault-tolerant computers, as well as how organizations can leverage those strategies to prepare for the coming wave of IIoT.
Here’s a summary of the key take-aways:
- Understanding the true impact of unplanned downtime can lead to a better understanding of where investments can be made in automation systems to reduce such events.
- Unplanned downtime can occur from a variety of areas, including human errors, failure of assets that are not part of the direct supervisory and control chain, and failure of the SCADA systems themselves. The result is lowered OEE, decreased efficiency and reduced profitability.
- Adopting standards-based platforms and implementing technologies such as virtualization can consolidate SCADA server infrastructure and deliver a range of benefits, such as simplified management, easy testing and upgrading of existing and new applications and preparation for the IIoT.
- When virtualizing it is important to understand that you need to protect your server assets, as moving everything to a single virtualized platform means that everything fails if the platform fails. There are various strategies to prevent this, but it is important to ensure that you don’t swap the complexity of a single server per application for a complex failure recovery mechanism in a virtualized environment.
- Fault-tolerant platforms are a key way to avoid this complexity, delivering simplicity and reliability in virtualized implementations, eliminating unplanned downtime and preventing data loss – a critical element in many automation environments, and essential for IIoT analytics. It is important to note that disaster recovery (DR) should not be confused with fault tolerance. DR provides geographic redundancy in case of catastrophic failures, but will not prevent some downtime or data loss. In fact, fault tolerance and DR are complementary, and they are often implemented together.
- IIoT is driving OT and IT together, so it is important to understand the priorities of each organization. In fact, OT and IT share a lot of common ground when it comes to key issues, and this is a good starting point for cooperating in the move towards IIoT. Common requirements include no unscheduled downtime, cyber-security, the need for scalable and upgradeable systems and applications, as well as measurable increases in ROI, ROA and KPIs. Last but not least is future-proofing systems and preparing for future IIoT applications.
This webinar is a good way to start the process of looking into what needs to be considered for upgrading and modernizing automation compute assets, using technologies such as virtualization and fault tolerance, as the industry evolves to increased levels of efficiency and moves towards implementing IIoT.