A few weeks ago, Stratus hosted a webinar with Light Reading titled “Achieving Instantaneous Fault Tolerance for Any Application on Commodity Hardware,” aimed at telcos and communications application providers. The event was very successful, with 150 attendees dialing in live and an additional 200 people who registered but were not able to attend at that time. We received many questions during the session; some were answered at the time and others went unanswered due to time constraints. This blog post summarizes all of the questions that were asked, along with our responses.
Before we get to the Q&A, let me first define everRun in simple terms:
everRun is a Software Defined Availability (SDA) infrastructure that moves fault management and automatic failover from the applications to the software infrastructure. This provides fully automated, instantaneous fault tolerance for all applications, including fault detection, localization, isolation, service restoration, redundancy restoration and, if desired, state replication, all without changes to application code and with dynamic levels of resiliency. Any application can therefore be deployed instantly with high resiliency, multiple levels of state protection and ultra-fast service restoration on commercial off-the-shelf (COTS) hardware in any network, without the complexity, time-consuming effort and risk associated with modifying and testing every application. This is why everRun is ideal for communications applications such as video monitoring, network management, signalling gateways, firewalls, network controllers and more!
Now, on to the Q&A:
- Do I need a separate Linux distribution to run everRun?
- everRun supports multiple guest OSs, including Windows, CentOS and Red Hat Enterprise Linux (RHEL). everRun comes with its own CentOS distribution, which installs on a bare-metal commodity server, but you will need to install an OS (as a guest OS) for every VM.
- What if I have a mix of Windows and Linux applications?
- No problem. As stated above, you can install multiple guest OSs. everRun leverages the KVM hypervisor, which is where the Stratus fault-tolerant code resides, so all VMs are seamlessly protected regardless of guest OS and without changes to application code. Some VMs can run Linux and others Windows on the same everRun configuration.
- Do you have solutions for things like BGP, which is layered on top of TCP (typically called Non-Stop Routing)?
- We don’t offer applications, just the software platform that runs these applications. Essentially any application that uses any protocol that runs on TCP/IP on any guest OS will run on everRun.
- Assuming there is an MME entity I need to make fault tolerant, how will your Availability Engine maintain the internal state of the MME’s applications? There could be multiple internal states for multiple streams that are maintained by this entity.
- Unlike application-based HA solutions, which require application code changes, this solution automatically creates VM pairs between hosts in an anti-affinity configuration. The state of a VM (and all its applications) is captured regularly and asynchronously, based on a highly sophisticated Stratus StatePoint algorithm that ensures a globally consistent state for all applications deployed in stateful fault-tolerant mode. If a fault occurs on the primary server at state “n”, the system automatically switches over to the secondary server, which resumes from the most recent statepoint, “n”, without any application disruption or degradation.
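The checkpoint-and-resume behaviour described in this answer can be pictured with a toy model. This is only a sketch: the class names (`ToyVM`, `ProtectedPair`), the dictionary-based state, and the explicit `checkpoint()` call are assumptions made for illustration and do not reflect how the StatePoint algorithm or everRun is actually implemented.

```python
import copy


class ToyVM:
    """Hypothetical stand-in for a VM whose whole state is one dictionary."""

    def __init__(self):
        self.state = {"sessions": {}, "packets_seen": 0}

    def handle(self, session_id, payload):
        # Application work mutates in-memory state.
        self.state["sessions"].setdefault(session_id, []).append(payload)
        self.state["packets_seen"] += 1


class ProtectedPair:
    """Hypothetical primary/secondary pair kept in sync by checkpoints."""

    def __init__(self):
        self.primary = ToyVM()
        self.secondary = ToyVM()
        self.statepoint = 0  # number of the last replicated checkpoint

    def checkpoint(self):
        # Copy the full primary state to the secondary (state "n").
        self.secondary.state = copy.deepcopy(self.primary.state)
        self.statepoint += 1

    def fail_primary(self):
        # The secondary takes over from the most recent checkpoint.
        self.primary = self.secondary
        self.secondary = None
        return self.statepoint


pair = ProtectedPair()
pair.primary.handle("stream-a", "attach")
pair.primary.handle("stream-b", "attach")
pair.checkpoint()

resumed_from = pair.fail_primary()
print(resumed_from, pair.primary.state["packets_seen"])
```

Note that the application code in `ToyVM.handle` knows nothing about replication, which is the point of moving fault management out of the application. The toy does not model work done after the last checkpoint or any I/O barrier.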
- What service level degradation tends to be experienced when adding fault-tolerant functionality and protection within the software, such as impact on latency, state locking, or real-time processing?
- There are two major types of protection that everRun offers. An application or application component can be deployed in fault-tolerant (FT) mode, which provides the highest level of protection in terms of total state replication and fast service restoration time. In this scenario the average total “added latency” for the whole process, including checkpointing with the I/O barrier, is less than one millisecond (about 750 microseconds). The other type, high-availability (HA) mode, does not replicate state and instead restarts the application automatically after a failure, which uses far fewer system resources.
- How far apart can the active and standby be?
- It depends on the bandwidth of the link between the primary and secondary servers and on the sensitivity to latency, but generally no more than a few miles, because longer distance means longer propagation delay.
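To get a feel for why distance matters, here is a back-of-the-envelope calculation. It assumes light travels through optical fibre at roughly 200,000 km/s (about two thirds of its speed in vacuum) and ignores switching and queuing delay; the function name and the sample distances are illustrative only.

```python
FIBRE_SPEED_KM_PER_S = 200_000  # assumed: ~2/3 the speed of light in vacuum


def round_trip_delay_us(distance_km):
    """Round-trip propagation delay over fibre, in microseconds."""
    one_way_s = distance_km / FIBRE_SPEED_KM_PER_S
    return 2 * one_way_s * 1_000_000


for km in (1, 10, 100):
    print(f"{km:>4} km: {round_trip_delay_us(km):.0f} us round trip")
```

Under these assumptions, a few kilometres of fibre adds only tens of microseconds per round trip, while 100 km adds about 1,000 µs, which is already more than one millisecond before any replication work is done.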
- Can all products use the Stratus fault-tolerant system? For products that maintain a lot of state within their software, are there any challenges that we would face in integrating this solution?
- Any application can run on everRun as long as it can run on Ubuntu, SUSE, CentOS, Red Hat Enterprise Linux (RHEL), or Windows. While every application needs fault management, not all of them need state protection or the same speed of service restoration, so everRun supports multiple levels of redundancy. Applications running in FT mode have complete state redundancy and protection, while those running in HA mode have no state protection but are automatically restarted if there is a failure, which uses far fewer system resources. Even within the same application, different components may require different levels of redundancy. For example, an application with data plane forwarding elements (such as a vFirewall or vRouter) may be de-componentized into separate VMs for the control element (CE) and the data plane forwarding elements (FE). The CE could run in FT mode (state protection), while the FE could run in HA mode, which means it will be restarted quickly and automatically if it fails. As long as the CE is protected, a new FE will be restarted with no disruption or degradation in the service.
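One way to picture per-component protection levels is as a simple mapping from VM to mode. The sketch below is illustrative only: the `Protection` enum, the VM names, and the `on_host_failure` function are hypothetical and are not everRun configuration syntax or APIs.

```python
from enum import Enum


class Protection(Enum):
    FT = "fault tolerant"      # state replicated, resumes on the secondary
    HA = "high availability"   # no state replication, restarted on failure


# Hypothetical deployment of a de-componentized virtual firewall.
deployment = {
    "vfirewall-control": Protection.FT,
    "vfirewall-forwarding-1": Protection.HA,
    "vfirewall-forwarding-2": Protection.HA,
}


def on_host_failure(vm_name):
    """Describe the recovery action for a VM under the model above."""
    mode = deployment[vm_name]
    if mode is Protection.FT:
        return "resume from last statepoint with state preserved"
    return "restart automatically without preserved state"


for name in deployment:
    print(f"{name}: {on_host_failure(name)}")
```

In this model only the control element pays the cost of state replication, while the forwarding elements rely on a quick restart.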
- What changes do I need to make to my application to use everRun?
- No application changes are required. Any application that runs on any OS we support (Red Hat Enterprise Linux, Ubuntu, SUSE, CentOS or Windows) will run fine on everRun (in its guest OS), and everRun will seamlessly protect these VMs, including state replication, without application awareness.
- I understand I will need two physical servers. When one fails, how will I know and what do I need to do?
- If one fails, the system will automatically fail over to the secondary server. Applications running in FT mode will be resumed automatically and very quickly (within milliseconds), while HA applications will be automatically restarted. There are multiple ways to be alerted to system faults (SNMP, everRun Manager, email) so that faulty components can be repaired.
- How does your solution compare to VMware’s FT solution?
- Stratus is the market leader in resiliency and leverages 35 years of domain expertise in tuning our FT algorithms to maximize system performance and resource utilization, based on thousands of real deployment workloads. Generally, customers consider VMware when they are consolidating, but when they require availability and resiliency they buy everRun.