This week I am in Dusseldorf, Germany, showing our ETSI PoC#35, titled "Availability Management with Stateful Fault Tolerance." This Proof of Concept demonstrates how virtualized network functions (VNFs) from multiple vendors can be easily deployed in a highly resilient software infrastructure environment that provides complete and seamless fault management to achieve fault tolerance. Fault tolerance here means continuous availability with state protection (preserving the state built up by the preceding events in a given sequence of interactions) in the event of a system fault or failure.

The results were compelling in that, for the first time, we have been able to prove the following:

  • OpenStack-based VIM mechanisms alone are insufficient for supporting carrier-grade availability objectives. Baseline functionality is only adequate for supporting development scenarios and non-resilient workloads.
  • All phases of the fault management cycle (fault detection, fault localization, fault isolation, fault recovery and fault repair) can be provided as infrastructure services, using a combination of NFVI and MANO level mechanisms to deploy VNFs with varying availability and latency requirements – all without any application (i.e., VNF) level support mechanisms.
  • NFVI services can offer a sophisticated VM-based state replication mechanism (CheckPointing and I/O StateStepping) that ensures globally consistent state for stateful applications, maintaining both high service accessibility and service availability without application awareness.

We believe that this is a major step forward in proving that the vision of a carrier-grade cloud is viable and that a software infrastructure solution is beneficial to both VNF providers and network operators/service providers.

  • For network operators/service providers, it enables the deployment of any KVM/OpenStack application with transparent and instantaneous fault tolerance for service accessibility and service continuity, without requiring code changes in the VNFs.
  • For VNF providers, it reduces the time, complexity and risk associated with adding high availability and resiliency to every VNF.

While there is still much more progress to be made, the very possibility that reliable carrier-grade workloads can be maintained will help accelerate the adoption of NFV worldwide. If you’d like to see the details of our PoC, click here. Non-ETSI NFV members can download PDF versions of the PoC Proposal, which describes the testing we performed, as well as the PoC Report, which describes the findings and results of the testing. If you’d like to know more about the technology Stratus provides to enable these results, check out our Cloud Solution Brief and contact Ali_Kafel@Stratus.com for a white paper with more details.

Brief overview of the Stratus Fault Tolerant Cloud Infrastructure

The Stratus Fault Tolerant Cloud Infrastructure provides seamless fault management and automatic failover for all applications, without requiring code changes. The applications do not need to be modified to become redundant and resilient because the software infrastructure enables every virtual machine (including its application) to automatically live on two virtual machines simultaneously, generally on two physical servers. If one VM fails, processing is automatically switched to the other VM, where the application continues to run with no interruption or data loss.
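
To make the idea concrete, here is a toy Python sketch of a mirrored pair. It is not the Stratus implementation: the names `Replica` and `MirroredPair`, the per-session counter, and the "copy state to the standby before replying" step are all assumptions chosen only to illustrate why a stateful application can survive the loss of one VM.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of the protected VM; `state` stands in for all VM state."""
    name: str
    state: dict = field(default_factory=dict)
    alive: bool = True


class MirroredPair:
    """Hypothetical model: each state change is copied to the standby
    before the result is released to the outside world."""

    def __init__(self) -> None:
        self.active = Replica("vm-a")
        self.standby = Replica("vm-b")

    def handle(self, session: str) -> int:
        # Fail over first if the active replica has died.
        if not self.active.alive:
            self.active, self.standby = self.standby, self.active
        # Apply the request on the active replica.
        count = self.active.state.get(session, 0) + 1
        self.active.state[session] = count
        # Checkpoint to the standby (if it is still up) before replying.
        if self.standby.alive:
            self.standby.state = dict(self.active.state)
        return count


pair = MirroredPair()
pair.handle("call-1")
pair.handle("call-1")
pair.active.alive = False        # simulate a fault on the active server
result = pair.handle("call-1")   # served by the former standby
print(result)                    # the counter continues from its last value
```

Because the standby already holds the last checkpoint, the third request sees the state left by the first two, and the application code itself contains no failover logic.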

Two Key Benefits

Reduce the time, complexity and risk of achieving instantaneous resiliency

  • Seamless and instantaneous fault management and continuous availability for any application, without code changes – includes fault detection, localization, isolation, recovery and repair

Flexibility to deploy multiple levels of availability to suit the applications

  • Dynamically specify the availability level at deployment time based on application type – for example, some applications may require globally consistent state at all times, while others may only require an immediate and automatic restart
  • Enables mixed deployments of decomposed control plane elements (CE), which may require state protection, and forwarding plane elements (FE), which may be stateless, leveraging DPDK and SR-IOV for higher performance and lower latency processing
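
As a rough illustration of choosing an availability level per workload at deployment time, consider the Python sketch below. Everything in it is hypothetical: the `Availability` enum, the `choose_level` rule, and the workload names are invented for the example and do not correspond to any Stratus, OpenStack or MANO API.

```python
from enum import Enum


class Availability(Enum):
    NONE = "unprotected"
    HA = "automatic restart"
    FT = "stateful fault tolerance"


def choose_level(stateful: bool, needs_auto_recovery: bool) -> Availability:
    """Hypothetical rule: protect state where it exists, otherwise restart."""
    if stateful:
        return Availability.FT
    if needs_auto_recovery:
        return Availability.HA
    return Availability.NONE


# Illustrative workload traits, decided when each element is deployed.
workloads = {
    "control-element": {"stateful": True, "needs_auto_recovery": True},
    "forwarding-element": {"stateful": False, "needs_auto_recovery": True},
    "dev-sandbox": {"stateful": False, "needs_auto_recovery": False},
}

plan = {name: choose_level(**traits) for name, traits in workloads.items()}
for name, level in plan.items():
    print(f"{name}: {level.value}")
```

The point of the sketch is that the decision lives in the deployment description rather than in the application, so a stateful control element and a stateless forwarding element can sit side by side with different protection levels.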

What and how we tested

  • The Stratus Fault Tolerant Cloud Infrastructure conforms to the blue elements in the ETSI NFV reference architecture below

We showed three configurations:

  1. Unprotected server – shows that upon a system failure, the applications will go down until manually restarted
  2. Highly Available (HA) servers – stateless protection – upon a system failure, the service will go down for a short period but will automatically and immediately be restarted by the software infrastructure
  3. Fault Tolerant (FT) server – stateful protection – upon a system failure, the applications will continue to run without any interruption or loss of state, because the software infrastructure will perform all fault management, state protection (on another server) and automatic failover
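
The difference between the three configurations can be sketched with a small Python simulation. The numbers here (20 ticks, a fault at tick 5, a 3-tick restart delay) are made up for illustration and are not measurements from the PoC; the function name and mode strings are likewise assumptions.

```python
def frames_delivered(mode: str, total_ticks: int = 20, fail_at: int = 5,
                     restart_delay: int = 3) -> int:
    """Count video frames delivered (one per tick) when a fault hits at `fail_at`."""
    if mode not in ("unprotected", "ha", "ft"):
        raise ValueError(f"unknown mode: {mode}")
    delivered = 0
    down_until = None  # tick at which service resumes; None means running
    for tick in range(total_ticks):
        if tick == fail_at:
            if mode == "unprotected":
                # Stays down: nobody restarts it manually during this run.
                down_until = total_ticks
            elif mode == "ha":
                # Infrastructure restarts the service after a short delay.
                down_until = tick + restart_delay
            # mode == "ft": the mirrored VM takes over, so no downtime at all.
        if down_until is not None and tick >= down_until:
            down_until = None  # service is back
        if down_until is None:
            delivered += 1
    return delivered


results = {mode: frames_delivered(mode) for mode in ("unprotected", "ha", "ft")}
print(results)
```

In this toy run the unprotected stream stops at the fault, the HA stream loses only the frames that fall inside the restart window, and the FT stream delivers every frame.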

One of the VNFs deployed was the Cobham Wireless TeraVM virtualized IP tester, which generated and measured traffic. The traffic we showed was streaming video, because a failure in a video stream is easy to see.

TeraVM is a fully virtualized IP test and measurement solution that can emulate and measure millions of unique application flows. It provides comprehensive measurement and performance analysis on every application flow, with the ability to easily pinpoint and isolate problem flows.

While video traffic was streaming through the system (which includes the Firewall and QoS servers) and visible on each of the three laptops, we simulated a failure in each of the three sets of systems. As expected, the video stream coming from the unprotected server stopped and never recovered. The stream from the HA system stopped and restarted after a few seconds. The stream from the FT system continued without any loss of traffic!