Stratus Blog

Showing archives for category Fault Tolerance

How Much Does Downtime Cost Your Organization?

2.8.2017 | Cost of Downtime, Fault Tolerance

Across the spectrum of industries, one thing all companies agree on is that the cost of unplanned downtime is quite substantial. The vexing question is how much?

Surprisingly, a survey of operations people found that 71% of respondents admitted their company is not tracking downtime cost with any quantifiable metrics. That means most companies won’t know what an outage costs until one occurs, and by then the damage is already done.

In stark contrast, Stratus customers are keenly aware of how unplanned downtime could impact their businesses. In fact, a recent TechValidate survey of 533 Stratus users identified the five biggest cost factors from unplanned downtime:

Loss of Productivity – Think about a critical production line sitting idle for hours or days. Or dozens of employees forced to revert to manual processes during an outage of operations systems. One of our manufacturing customers calculated their cost of unplanned production downtime at $33,000 per hour.

Loss of Revenue – If you can’t process and fulfill customer orders due to failed systems, revenue is inevitably reduced. For one Stratus customer, a national stock exchange handling more than one billion trade messages daily, even a few microseconds of downtime can mean revenue losses of tens of thousands of dollars.

Damage to Brand and Reputation – It’s a simple fact: When customers lose confidence in your business, they may go to a competitor. This also makes it difficult to attract new customers. In some cases, it could take years to rebuild your brand image and restore lost revenue.

Loss of Data – When critical systems fail, you could lose valuable transactional and historic data, such as intellectual property, customer records, and financial accounts. Without proper data protection, the cost to your business could be in the millions of dollars.

Non-Compliance – For highly regulated industries, such as public utilities, unplanned downtime can mean stiff fines. Regulators often require demonstrable proof of continuous data availability. The cost of non-compliance can quickly add up, and in some instances result in suspension of your operating license.

With these considerations in mind, and because we help customers size these variables every day, we developed the online Stratus Cost-of-Downtime Calculator. This tool helps professionals like you figure out the full financial impact of downtime on your organization. Check it out; it will help you easily determine how affordably downtime can be prevented.
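
To make the arithmetic concrete, here is a minimal sketch in Python of how such an estimate might combine the cost factors above. This is an illustration only, not the Stratus calculator itself, and every input and rate shown is hypothetical:

    # Rough downtime-cost estimate (hypothetical inputs; an illustration,
    # not the Stratus Cost-of-Downtime Calculator).

    def downtime_cost(hours_down, revenue_per_hour,
                      idle_employees, loaded_hourly_wage,
                      recovery_cost=0.0):
        """Cost of one outage: lost revenue + idle labor + recovery."""
        lost_revenue = hours_down * revenue_per_hour
        lost_productivity = hours_down * idle_employees * loaded_hourly_wage
        return lost_revenue + lost_productivity + recovery_cost

    # Example: a 4-hour production outage idling 50 employees.
    print(downtime_cost(hours_down=4, revenue_per_hour=25_000,
                        idle_employees=50, loaded_hourly_wage=60,
                        recovery_cost=10_000))   # -> 122000.0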

And assuming your business could benefit from a solution that prevents downtime, we recommend a three-step approach that is extremely reliable and cost-efficient:

  1. First, virtualize your critical systems to drastically reduce the number of physical systems in your environment—and the number of potential points of failure.
  2. Next, run your virtualized systems on Stratus always-on servers. With integrated redundancy, Stratus servers ensure continuous availability of your virtualized applications, without a single point of failure or risk of data loss.
  3. Finally, for maximum protection, mirror the always-on Stratus solution to a geographically remote site. That way, even if you lose your production site, your business keeps running.

The Stratus philosophy is simple—the best way to avoid the major costs of unplanned downtime is to prevent it from happening in the first place.

Major Gas Pipeline Company Boosts Safety, Reduces Costs

2.2.2017 | Fault Tolerance, IA

When it comes to utilities, we as consumers find interruptions to electricity, heat, water, and phone service extremely disruptive and even dangerous. For utility providers, the impact of such outages is also severe in terms of lost revenue, customer dissatisfaction, and liability risk. In the natural gas industry, downtime incidents can present even more dire consequences.

This became abundantly clear when a compressor station operated by a North American gas pipeline company suffered a catastrophic failure. The result was a fire that cost more than $550,000 in damages and lost natural gas. Because the station was in a rural location, the fire and damage were fortunately contained to the compressor, and there were no fatalities.

While the pipeline company already highly valued safety and reliability, this frightening incident was the catalyst for taking continuous operations to the next level. The company undertook a detailed analysis of its 15,000 miles of pipeline and facilities across 16 states, a network that transports over one trillion cubic feet of natural gas per year to customers.

The resulting modernization report recommended significant system upgrades to comply with the Control Room Management (CRM) regulations issued by the Pipeline and Hazardous Materials Safety Administration (PHMSA). For example, the pipeline company implemented compressor stations with fully redundant systems, such as compressor pumps, turbines, valves, and safety and control systems.

A bigger challenge was creating a continuous availability computer solution to operate the company’s SCADA, historian, HMI, and related control system applications. The pipeline company also wanted the solution to support big data analytics that would proactively predict, detect, and resolve compressor station problems before unplanned outages occurred.

Initially, the pipeline firm planned to deploy six or eight servers to support the full range of applications, but discovered this approach had several shortcomings. For example, there were significant space and power constraints and a lack of IT support at the compressor stations. If a server failed, automation staff at headquarters would need to reconfigure the server’s operating environment, physically deliver it, and perform the installation. The unacceptable outcome: two to three days of server downtime, plus data loss that would generate sub-par analytic results and decrease operational efficiency.

After considering various options, the pipeline company chose a Stratus ftServer, a virtualized continuous availability solution with integrated redundancy. This centralized, easy-to-manage solution reduced the number of servers and the associated service burdens. Automation engineers now remotely run virtualized applications from the primary control centers, without requiring trained IT staff at the compressor stations to conduct maintenance. Uninterrupted access to real-time analytics also gives the firm complete operational visibility, eliminating “blind moments” and further improving availability and efficiency.

In fact, since implementing the ftServer three years ago, the pipeline company has run its operational systems without any downtime or data loss. According to a lead automation electrical engineer at the company, Stratus provides an added benefit: “We can get a lot more flexibility by adding applications in the compressor stations without the need for IT expertise.”

Are you looking to improve the safety and reliability of your operations while reducing costs and increasing efficiency? Stratus offers a compelling solution with virtualization, continuous availability and integrated redundancy.

NEXT STEP: Watch this video and learn more about how this pipeline company added value to their critical operations.

Unplanned downtime continues to be a huge vulnerability with today’s applications

12.12.2016 | Cost of Downtime, Fault Tolerance

Organizations today expect their applications to be continuously available. Their businesses depend upon it. But what happens when unplanned downtime occurs?

Productivity declines.

Data is lost.

Brand reputations are impacted.

There is potential for regulatory fines.

Revenue is lost.

With these types of impacts, organizations must be taking steps to reduce or prevent downtime, right? The reality is that conversations about unplanned downtime and strategies to prevent it are often met with skepticism, usually because the impact of downtime is something an organization only appreciates after it has experienced an outage. We at Stratus know this all too well, because we have been helping customers implement solutions to prevent unplanned downtime for more than 36 years.

The results of a new Stratus survey further underscore the problem. The survey of 250 IT decision makers involved in purchasing or managing high availability solutions for IT or OT platforms found that while most IT applications cannot tolerate the average length of a downtime incident, decision makers struggle to quantify the cost of downtime, and in turn struggle to justify investment in the right solutions to meet their business requirements.

Among the survey’s findings:

  • Unplanned downtime is a huge vulnerability in today’s IT systems: 72% of applications are not intended to experience more than 60 minutes of downtime, well below the average downtime length of 87 minutes
  • The cost of downtime is the primary ROI justification when assessing availability solutions: 47% of respondents said the estimated cost of downtime is the primary cost justification when considering the adoption of fault-tolerant or high availability solutions
  • However, most organizations cannot quantify the impact of unplanned downtime: 71% of respondents are not tracking downtime with a quantified measure of its cost to the organization

Despite struggling to justify the need to invest in solutions that prevent unplanned downtime, IT organizations are looking for ways to address the problem. One common strategy is to leverage the high availability features of the current infrastructure, including clusters and virtualization technologies. Yet the same study reported that 84% of responding IT organizations that look to virtualization to ensure the availability of their applications still struggle to prevent downtime. The top reasons given include the high cost of additional operating system or application licenses, the complexity of configuring and managing the environment, and failover times that fail to meet SLAs.

The challenges facing IT decision makers are only going to grow with the increased adoption of edge-based systems, including Industrial Internet of Things (IIoT) technologies. The availability of applications is what will keep information flowing in our ever more connected world. To meet the challenges of today and prepare for the future, organizations must work to take this risk out of the equation.

The bottom line? Unplanned downtime presents a growing risk to organizations that are increasingly reliant on their applications being always available.

Stratus helps organizations prevent application downtime, period. While other solutions can achieve 99.95% availability, Stratus solutions enable the simple deployment and management of cost-effective, continuously available infrastructures without changing your applications. Set aside the hours of downtime and the complexity of other solutions, and support your applications with operationally simple continuous availability.

Want to learn more about how to determine the right amount of investment to combat the negative impacts of downtime? Download this Aberdeen Analyst Insight.

A Modernized ICS holds the key to reducing unplanned downtime

9.30.2016 | Fault Tolerance, Industrial Automation

Unplanned downtime has long been the nemesis of industrial operations. In recent years, we’ve seen tolerance for unplanned downtime get even lower. In fact, a recent survey by Stratus and the ARC Group reports that almost 40% of respondents said they could handle no more than 10 minutes of downtime per incident.

More than 20% said they could not tolerate downtime at all.

One reason is that industrial control systems (ICSs) produce data that’s become increasingly valuable to the business. A modern ICS can collect data down to the millisecond. When combined with analytics, this data enables initiatives like real-time automation and predictive maintenance, as well as accelerates adoption of Industrial Internet of Things (IIoT), Industry 4.0, and smart factories. Simply put, the more you automate and reduce human errors through real-time system intelligence, the more you improve operational efficiency and drive higher profitability.

Recently, when I led an IndustryWeek webinar, I asked attendees what concerned them most about unplanned downtime. Not surprisingly, 54% identified potential revenue loss, 15% referenced loss of visibility resulting in a safety violation, and 13% highlighted the additional cost to run things manually.

Industry statistics support these concerns. According to ARC’s research, unplanned downtime results in a 2-5% production loss in the petrochemical industry. It costs natural gas companies about $10,000 per hour if a compression station goes down. Across the board, unplanned downtime in process industries costs ten times more than planned maintenance.

Modernizing ICS can lower these impacts and improve operating efficiency. So why don’t more organizations modernize? Many are concerned about complexity. They have numerous applications running on different machines that vary widely in age and configuration. The thought of upgrading such a jumble of systems can be a major inhibitor.

That’s why we see virtualization as the prime way forward for modernizing the ICS. Instead of needing lots of hardware, virtualization can often reduce everything to a single physical machine running multiple applications assigned to individual virtual machines. This makes it much easier to manage various elements of industrial automation, as well as add or upgrade applications.

Virtualization also takes the pain out of modernizing ICS because you can migrate systems gradually. A virtualized system can easily reside alongside your existing systems. Then you just move one application at a time from the traditional environment to the virtualized one.

Now, the infrastructure you choose for your virtualized ICS environment is critical. I asked the IndustryWeek webinar attendees what they considered the most important decision factor. Nearly 40% of respondents identified lifetime value because this is a system that could be in operation for at least seven to ten years. Another 26% of attendees referenced operational simplicity. Automation engineers don’t want to spend their valuable time on system administration; they want to focus on running the plant. And they want an infrastructure that helps minimize, if not eliminate, unplanned downtime.

Stratus fault-tolerant servers address every one of these points and more. So if you’re looking to modernize your ICS, Stratus can provide you with some compelling options.

Always-on Trading Shines at International Stock Exchange

9.22.2016 | Fault Tolerance, Financial

In the world of securities trading, few things are more important than speed and availability. With electronic, algorithm-driven trading, delays of even a few microseconds can translate to financial losses of tens of thousands of dollars. Longer periods of system downtime could be ruinous, with financial losses soaring into the millions.

Tune in to a short video and learn how Stratus enables this leading international stock exchange to be always on without compromising performance.

That’s why stock exchanges around the world rely on Stratus. One of our international stock exchange customers handles more than a billion trade messages daily, making them one of the largest stock exchanges in the world by volume. It would be national, if not international, news if their systems went down. With Stratus, the exchange’s critical trading applications are always on and performing at top efficiency.

Unlike many other financial institutions, the exchange did not want to run horizontally distributed applications on traditional hardware clusters for higher availability, because of the higher operational expenditures. The exchange also steered away from building high availability into its application software, which would have increased software overhead and engineering complexity.

Instead, the stock exchange avoids all that with Stratus hardware fault tolerance. An IT executive at the exchange explains, “Hardware fault tolerance gives us bullet-proof reliability without the cost of losing compute cycles.”

Those compute cycles are critical. Not only can software complexity steal cycles, but it also leads to numerous internal system events that can cause “jitter” and slow down a stock transaction. Seemingly inconsequential system events like turning on a fan to moderate operating temperature consume valuable computing resources and put a drag on performance. Complex clustering approaches only add to performance problems by introducing network latency. The combination of jitter and latency could pause a trading algorithm for hundreds of microseconds. That’s enough time for a stock price to change.

At Stratus, we’ve built controls into our hardware to control jitter and minimize latency, helping to avoid a potential negative business impact. In fact, by choosing Stratus hardware fault tolerance, this stock exchange customer’s trading application runs faster than it would had it pursued custom-coded high-availability software.

Plus, there is no need for in-house software engineers to write and maintain special code. Such “home grown” solutions just add cost and complexity, and the systems may still fail to meet the performance and availability demands of electronic trading. With Stratus, fault tolerance and performance optimization are built into the hardware, eliminating the need to modify software.

As a result, this stock exchange has saved millions of dollars by avoiding the cost of writing high-availability code into its software, not to mention the savings in floor space and administration time from choosing Stratus fault tolerance instead of a traditional hardware cluster. And the exchange maintains superior uptime levels of 99.999% with Stratus, which was not achievable with custom-coded high-availability software.

The IT executive at the exchange explains, “Stratus allows us to deploy our applications without worrying about hardware failure. That’s been a big advantage for us. It’s made our architecture and manageability a lot simpler. And we’ve been able to run without issue. It’s been a good ride for us.”

Putting Fault-Tolerant HMI/SCADA to the Test: Three Industrial Examples

9.1.2016 | Fault Tolerance, Industrial Automation

More and more, we’re seeing operations organizations virtualize critical industrial automation (IA) applications such as supervisory control and data acquisition (SCADA), human-machine interface (HMI), and historian systems. And whether they virtualize these systems or not, many companies choose to run their applications on fault-tolerant platforms. Here are three examples of why:

Water and wastewater treatment facility

A municipal water and wastewater treatment facility struggled with declining income due to a shrinking tax base. On top of that, the EPA was tightening regulations, requiring such things as regular testing of field wells and other water sources, with assurance of no data loss. Another big concern was control room “blindness” due to unplanned downtime of its SCADA systems.

By virtualizing and running their SCADA systems on always-on Stratus ftServers, the facility eliminated unplanned downtime. The facility was also able to demonstrate to regulatory auditors that continuous data availability was ensured. Plus, virtualizing reduced software licensing costs, and the self-healing features of ftServers saved on staffing otherwise needed for system monitoring and support. As a result, the municipality ensures high water quality and satisfies EPA regulations, all on a tight budget.

Paper and packaging manufacturer

This scenario involves a manufacturing plant specializing in kraft-style paper and packaging. The company was running manufacturing execution system (MES) and sales order processing (SOP) applications on 20-year-old legacy systems. Their biggest concern was unplanned downtime, which cost the business $33,000 per hour.

The company replaced its legacy systems with Stratus always-on ftServers running state-of-the-art MES and SOP applications. Unplanned downtime is now a thing of the past. The Stratus systems are easy to operate and support, so the plant no longer needs IT staff on call. And continuous operations without any line stoppages helped the company increase profitability.

Natural gas transmission company

A major natural gas transmission company operates about 80 compression stations along thousands of miles of pipeline. Most of their stations are in remote locations with limited space and power. Their existing SCADA/HMI infrastructure simply wasn’t built for those difficult conditions, and servers started to fail, causing downtime of two or three days each time.

So the company virtualized and now runs multiple IA applications on a single Stratus ftServer. This eliminated the downtime problem. It also reduced the number of servers at each compression station from eight to one, which decreased their IT expenditures significantly at the remote sites. Plus, the company eliminated data loss. This is critical to their predictive analytics systems, which tell them when equipment requires maintenance to avoid catastrophic failures.

This is a small sampling of how virtualization and fault tolerance benefit both SCADA/HMI and analytics. In fact, data availability for analytics is becoming one of the most important requirements in today’s modern operations environments, a trend that we’re seeing across all segments of the IA space.

NEXT STEP: Read more about how Stratus is ensuring fault tolerance for Industrial Automation.

Columbia Pipeline Group Reaches 99.5 Percent Availability

8.23.2016 | Fault Tolerance, Industrial Automation

It’s often difficult to fully understand the impact of modernizing an automation system, and how an investment in fault-tolerant platforms from Stratus, along with updated Programmable Automation Controllers (PACs), can deliver rapid results. For Columbia Pipeline Group (recently purchased by TransCanada), the answer is approximately $2.3 million in 2014 alone, even with a partial pipeline upgrade. Using the Stratus ftServer platform, which delivers availability in excess of five nines (99.999%), Columbia has achieved an overall system-wide availability level of over 99.5%. And with a virtualized, continuously available platform, not only do the SCADA applications experience no unplanned downtime due to system failures, but the absence of data loss means that asset management systems and IIoT predictive analytics applications can effectively control access and provide accurate data to maximize maintenance effectiveness.

WATCH VIDEO: Hear Columbia Pipeline’s Steve Adams talk about his experience using Stratus ftServer.

Security System Availability: Understanding Your Options

8.12.2016 | Building Automation, Fault Tolerance, High Availability

As building automation and security systems become increasingly reliant on server technology, ensuring the availability—or uptime—of the applications running on those servers is absolutely critical. But how much availability is “good enough”? And what’s the best way to achieve that level of availability?

To answer those questions, it’s important to understand the three basic approaches to server availability:

1. Data backups and restores

Having backup, data-replication, and failover procedures in place is perhaps the most basic approach to server availability. These measures help speed the restoration of an application and preserve its data following a server failure. However, if backups occur only daily, significant amounts of data may be lost. At best, this approach delivers approximately 99 percent availability.

That sounds pretty good, but consider that it equates to an average of 87.5 hours of downtime per year—or more than 90 minutes of unplanned downtime per week. That might be good enough for a business application that is not mission critical, but it clearly falls short of the uptime requirements for building security and life-safety applications.
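
As a quick sanity check on these figures, here is a minimal Python sketch (an illustration, not a Stratus tool) that converts an availability percentage into its implied downtime, assuming an 8,760-hour year:

    # Convert an availability percentage into implied downtime.

    HOURS_PER_YEAR = 365 * 24  # 8,760 hours

    def downtime_hours_per_year(availability_pct):
        return HOURS_PER_YEAR * (1 - availability_pct / 100)

    for pct in (99.0, 99.95, 99.99, 99.999):
        h = downtime_hours_per_year(pct)
        print(f"{pct}% -> {h:.2f} hours/year ({h * 60:.0f} minutes/year)")

    # 99.0%   -> 87.6 h/year (the roughly 87.5 hours, >90 min/week above)
    # 99.95%  ->  4.38 h/year; 99.99% -> ~53 min/year (the HA range below)
    # 99.999% -> ~5 min/year ("five nines")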

2. High availability (HA)

HA includes both hardware-based and software-based approaches to reducing downtime. HA clusters are systems combining two or more servers running with an identical configuration, using software to keep application data synchronized on all servers. When one fails, another server in the cluster takes over, ideally with little or no disruption. However, HA clusters can be complex to deploy and manage. And you will need to license software on all cluster servers, increasing costs.
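
For illustration, the heartbeat-and-takeover logic at the core of a cluster can be sketched in a few lines of Python. This is a toy model, not any vendor's implementation, and the class name and timeout are invented:

    # Toy cluster failover: a standby promotes itself when the active
    # node's heartbeat goes silent for too long.

    import time

    HEARTBEAT_TIMEOUT = 3.0   # seconds of silence before failover

    class StandbyNode:
        def __init__(self):
            self.last_heartbeat = time.monotonic()
            self.active = False

        def on_heartbeat(self):
            self.last_heartbeat = time.monotonic()

        def check(self):
            # Promote this node if the active peer has gone quiet.
            silent = time.monotonic() - self.last_heartbeat
            if not self.active and silent > HEARTBEAT_TIMEOUT:
                self.active = True
                print("Peer silent; standby taking over the service")

    standby = StandbyNode()
    standby.on_heartbeat()   # heartbeats arrive while the peer is healthy
    time.sleep(3.1)          # the active server crashes; heartbeats stop
    standby.check()          # the standby promotes itself

Note that the gap between detection and takeover is a window during which the application is unavailable, which is why clusters reduce downtime rather than eliminate it.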

HA software, on the other hand, is designed to detect evolving problems proactively and prevent downtime. It uses predictive analytics to automatically identify, report and handle faults before they cause an outage. The continuous monitoring that this software offers is an advantage over the cluster approach, which only responds after a failure has occurred. Moreover, as a software-based solution, it runs on low-cost commodity hardware.

HA generally provides from 99.95 percent to 99.99 percent (or “four nines”) uptime. On average, that means from 52 minutes to 4.5 hours of downtime per year—significantly better than basic backup strategies.

3. Fault tolerance (FT)

Also called an “always-on” solution, FT’s goal is to reduce downtime to its lowest practical level. Again, this may be achieved either through sophisticated software or through specialized servers.

With a software approach, each application lives on two virtual machines with all data mirrored in real time. If one machine fails, the applications continue to run on the other machine with no interruption or data loss. If a single component fails, a healthy component from the second system takes over automatically.

FT software can also facilitate disaster recovery with multi-site capabilities. If, for example, one server is destroyed by fire or sprinklers, the machine at the other location takes over seamlessly. This software-based approach prevents data loss, is simple to configure and manage, requires no special IT skills, and delivers upwards of 99.999 percent availability (just minutes of downtime a year), all on standard hardware.
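
The mirroring idea behind this can be reduced to a toy Python sketch. It is purely conceptual, not Stratus's implementation: every write is applied to both replicas, so if one machine fails, reads continue from the survivor with nothing lost:

    # Toy illustration of mirrored state across two replicas.

    class MirroredStore:
        def __init__(self):
            self.replicas = [{}, {}]     # state of two virtual machines
            self.healthy = [True, True]

        def write(self, key, value):
            # Apply the write to every healthy replica.
            for i, ok in enumerate(self.healthy):
                if ok:
                    self.replicas[i][key] = value

        def read(self, key):
            # Any healthy replica can serve the read.
            for i, ok in enumerate(self.healthy):
                if ok:
                    return self.replicas[i][key]
            raise RuntimeError("no healthy replica")

    store = MirroredStore()
    store.write("setpoint", 72)
    store.healthy[0] = False        # simulate one VM failing
    print(store.read("setpoint"))   # -> 72, served by the survivor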

FT server systems rely on specialized servers purpose-built to prevent failures from happening and integrate hardware, software and services for simplified management. They feature both redundant components and error-detection software running in a virtualized environment. This approach also delivers “five nines” availability, though the specialized hardware required does push up the capital cost.

Making server availability a cornerstone of your building security automation strategy pays dividends both in day-to-day management and when situations arise that test your security. With the right strategy up front, your building’s security systems will be there when it really counts, today and in the future. In today’s constantly changing, “always-on” world, that’s all the time.

Buyer’s Guide: Helping You Choose the Right Availability Solution

8.10.2016 | Fault Tolerance, Technology, uptime

Downtime prevention is becoming a top priority for organizations across all market sectors — from manufacturing, building security and telecommunications to financial services, public safety and healthcare. What’s driving this requirement for always-on applications? It’s partly due to the rapid expansion of users, environments, and devices. Increasingly, however, organizations require high application availability to compete successfully in a global economy, comply with regulations, mitigate potential disasters, and plan for business continuity. All these factors contribute to a growing demand for high-performance availability solutions to keep applications up and running.

The good news is that there are many effective availability solutions on the market today, including standard servers with backup, continuous data replication, traditional high-availability clusters, virtualization, and fault-tolerant solutions. But with so many options, figuring out which approach is good enough for your organization can seem overwhelming.

Understanding the criticality of your computing environment is a good place to start. This involves assessing downtime consequences on an application-by-application basis. If you’ve virtualized applications to save costs and optimize resources, remember that your virtualized servers present a single point of failure that extends to all the virtual machines running on them, thereby increasing the potential impact of downtime. Depending on the criticality of your applications, you may be able to get by with the availability features built into your existing infrastructure or you may need to invest in a more powerful and reliable availability solution — perhaps one that proactively prevents downtime rather than just speeding and simplifying recovery.

But availability level is not the only factor to consider when selecting a best-fit solution to protect your applications against downtime. Stratus has created a Downtime Prevention Buyer’s Guide to streamline the evaluation process and, ultimately, help you make the right choice of availability solution. The guide presents six key questions you should ask vendors along with valuable insights into the strengths and limitations of various approaches. You can use vendors’ responses to objectively compare solutions and identify those that best meet your availability requirements, recovery time objectives, IT management capabilities, and return on investment goals, while integrating seamlessly within your existing IT infrastructure.

Alaska’s need for highly redundant systems at the edge

7.29.2016 | Edge, Fault Tolerance, Oil and Gas

Alaska is an environment tailor-made for Stratus Technologies’ solutions. The state is remote, access to IT expertise is challenging in many locations, and environmental conditions can present a myriad of issues. Combine this with small, remote communities that need to run automation systems for water, wastewater, and other utilities, and the requirement for a fault-tolerant solution is easy to understand. Alaska’s main industries include oil & gas, fishing, processing, and mining.

At CB Pacific’s recent Automation Symposium in Anchorage, Alaska, I had the opportunity to meet with many end users and system integrators to discuss their challenges and understand how Stratus could be of service to them. With oil and gas commodity prices under pressure, putting financial constraints on the major Alaska producers, those producers are looking for cost-saving solutions and efficiencies across every aspect of their business. With only one major road up to the North Slope, and remote pumping and compression stations that are accessible only by helicopter, highly redundant systems are an absolute necessity. In addition, the weather can add another level of complexity for those who service and maintain the critical infrastructure.

Traditional solutions rely on multiple standard servers arranged in a variety of redundant configurations to keep things running. Stratus, working with its partner CB Pacific and local systems integrators, is able to offer a single fault-tolerant solution instead. Its integrated redundancy continuously monitors and diagnoses potential problems, providing an environment with no downtime and no data loss. CB Pacific and its system integrator partners handle delivery of the easy-to-replace, hot-swappable parts. This streamlined solution makes for a very cost-effective implementation and eliminates the fear of blind moments caused by unplanned downtime. Only Stratus can provide a fault-tolerant solution on which to run your critical monitoring and safety applications.
