Building High Availability for industrial and embedded systems
February 02, 2017
High Availability (HA) is not just for the data center. While the principles of achieving extreme uptime have been honed by enterprise IT teams, it’s just as important for industrial and embedded applications, which are often deployed in mission-critical environments. By understanding and leveraging HA principles perfected in the enterprise environment, industrial and embedded servers can be made more robust, reliable, and resilient.
HA is an approach to systems design that seeks to reduce downtime as much as possible – or even eliminate it. Rather than focusing purely on preventing failure by increasing reliability, high availability systems are also focused on resiliency – the ability to recover quickly from failure.
Availability is defined as the fraction of time that a system is up and operating to specification, and the standard for highly available systems is 99.999 percent uptime (five-nines availability). That works out to roughly 5.3 minutes of downtime a year, or less than one second of downtime a day.
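The downtime budget implied by an availability target is simple arithmetic, and a short sketch makes it easy to check. The function name and the 365.25-day year below are assumptions made for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY  # assumption: average year length


def downtime_budget_seconds(availability: float, period_seconds: float) -> float:
    """Return the maximum downtime allowed in a period for a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_seconds


if __name__ == "__main__":
    five_nines = 0.99999
    per_year_min = downtime_budget_seconds(five_nines, SECONDS_PER_YEAR) / 60
    per_day_s = downtime_budget_seconds(five_nines, SECONDS_PER_DAY)
    print(f"per year: {per_year_min:.2f} minutes")
    print(f"per day:  {per_day_s:.3f} seconds")
```

For five nines this gives about 5.26 minutes per year and 0.864 seconds per day.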
When discussing availability, it’s helpful to understand the terms mean time to failure (MTTF) and mean time to repair (MTTR). MTTF is the average time a component operates before it fails, and is an indication of reliability. MTTR is the average time it takes for a system or component to recover from a failed state and return to a fully operational one.
Availability is defined formulaically as MTTF/(MTTF+MTTR), which indicates that availability can be improved either through increases in reliability (MTTF) or resiliency, by reducing the time it takes to recover from a failure.
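To see how the two levers compare, the formula can be evaluated for a few cases. The MTTF and MTTR figures below are hypothetical, chosen only to show the effect of each lever, and are not measurements of any real system.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR), with both values in the same unit."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR must not be negative")
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical baseline: fails every 50,000 hours, takes 4 hours to repair.
    print(f"baseline:       {availability(50_000, 4):.6f}")
    # Lever 1: double the reliability (MTTF) with better components.
    print(f"double MTTF:    {availability(100_000, 4):.6f}")
    # Lever 2: cut recovery time (MTTR) to 30 minutes through automation.
    print(f"30-minute MTTR: {availability(50_000, 0.5):.6f}")
```

In this example, doubling MTTF still falls short of five nines, while cutting MTTR from four hours to 30 minutes reaches it.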
In the real world, reliability can be improved only so much, as more reliable components are often more expensive. To address availability on a practical level, improving resilience is essential. This is done through designing fault-tolerant systems, which can operate even in the face of component failure.
A fault-tolerant system reduces or eliminates MTTR using redundancy and failover. Critical subsystems are duplicated to eliminate single points of failure. The duplicated components are then monitored, and backups are brought online as soon as possible when a failure occurs. To keep MTTR to a minimum, the mechanisms that detect failures and switch operation over to backups must be automated as much as possible.
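The monitor-and-fail-over loop can be sketched in a few lines. This is a minimal illustration, not a production design: `Unit`, `FailoverController`, the `is_healthy` probe, and the `max_missed` threshold are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Unit:
    name: str
    is_healthy: Callable[[], bool]  # hypothetical health probe (e.g. a heartbeat check)


class FailoverController:
    """Tracks one active unit and switches to a healthy backup after repeated failures."""

    def __init__(self, units: List[Unit], max_missed: int = 3):
        if not units:
            raise ValueError("at least one unit is required")
        self.units = units
        self.max_missed = max_missed  # consecutive failed checks tolerated before failover
        self.active = 0
        self.missed = 0

    def poll(self) -> str:
        """Run one monitoring cycle and return the name of the active unit."""
        if self.units[self.active].is_healthy():
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self._fail_over()
        return self.units[self.active].name

    def _fail_over(self) -> None:
        # Try each other unit in order; promote the first one that reports healthy.
        for offset in range(1, len(self.units)):
            candidate = (self.active + offset) % len(self.units)
            if self.units[candidate].is_healthy():
                self.active = candidate
                self.missed = 0
                return
        # No healthy backup found: stay on the current unit and retry next cycle.
```

Calling `poll()` on a fixed interval bounds the detection time at roughly `max_missed` intervals, which is the part of MTTR that automation controls directly.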
HA in an embedded and industrial context
HA originated in the enterprise, but not all of the techniques and design patterns that work in the data center translate directly to industrial applications.
The enterprise approach to HA combines sophisticated clustering and load balancing software with multiple layers of hardware and network redundancy to achieve HA even with relatively low-cost commercial hardware. For instance, a web application may be load balanced across multiple servers with failover – if any single server fails the remaining ones take on the load. Similarly, enterprise storage and networking involves multiple redundancies to prevent single points of failure.
While the enterprise approach is able to successfully achieve HA even with relatively unreliable hardware, it depends greatly on trained IT personnel to design, monitor, and maintain complex HA infrastructure and software.
Industrial and embedded systems operate in a significantly different context. These systems are often called on to perform with little or no maintenance, and when they’re set out into the field, they often have to “just work” without IT staff continually monitoring and configuring them. In addition, space is usually at a premium at the system level, and often in the environment as well. Space-constrained industrial systems can’t afford as many layers of redundancy as large data centers, and industrial environments often can’t accommodate the multiple servers that are typical in an enterprise environment.
All this means that industrial systems have to be designed to provide HA that works out of the box. A hardware-first approach is much more necessary than in the enterprise context, and reliable components are key for improving availability without massive redundancy. In addition, monitoring and failover processes have to be automated and foolproof as there will often be little to no staff in the field to monitor and configure the system.
Building embedded and industrial HA systems
By focusing on hardware points of failure, industrial and embedded systems can achieve HA that’s easily deployable. For industrial computers, the hardware subsystems that fail the most often are power, storage, networking, and memory. Each of these subsystems can be addressed by improving reliability, redundancy, or a combination of both.
Power
Power supplies are one of the most common failure points for any computer system. Capacitor failure, fan failures, power surges, or blackouts are some of the reasons for power supplies to fail or power to be cut to a system.
Power supply reliability can be addressed by buying power supplies from reputable manufacturers. FSP, for instance, has a low failure rate and uses quality capacitors and other components for higher reliability.
Power supplies can also be redundant. Industrial power supplies from FSP like the FSP700-70RGHBE1 draw power from two supplies under normal conditions; if one supply or its circuit fails, the system switches to the remaining supply (Figure 1). These redundant power supplies are hot swappable, which virtually eliminates MTTR since the failed supply can be replaced with a new one without the system ever going offline.
[Figure 1 | Highly available industrial systems use reliable components with fault tolerant design, such as this redundant, hot swappable PS2 power supply from FSP (FSP700-70RGHBE1).]
For the best redundancy, each power supply should be attached to a separate power circuit so that a single power failure doesn’t affect both supplies at once. Feeding the supplies through an uninterruptible power supply (UPS) can also help clean up the incoming power and improve MTTF.
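Separate circuits matter because the benefit of redundancy depends on the supplies failing independently. Under that assumption, the availability of redundant units is one minus the probability that all of them are down at once. The sketch below uses a hypothetical per-supply availability of 99.9 percent, which is not a figure for any real product.

```python
from math import prod
from typing import Iterable


def redundant_availability(unit_availabilities: Iterable[float]) -> float:
    """Availability of parallel units where any one can carry the full load.

    Assumes failures are independent (e.g. separate power circuits).
    """
    units = list(unit_availabilities)
    if not units:
        raise ValueError("at least one unit is required")
    for a in units:
        if not 0.0 <= a <= 1.0:
            raise ValueError("each availability must be between 0 and 1")
    # The system is down only when every unit is down at the same time.
    return 1.0 - prod(1.0 - a for a in units)


if __name__ == "__main__":
    single = 0.999  # hypothetical availability of one supply
    print(f"one supply:   {redundant_availability([single]):.6f}")
    print(f"two supplies: {redundant_availability([single, single]):.6f}")
```

If both supplies share one circuit, a single outage takes down both and the independence assumption no longer holds, so the real gain is smaller than this calculation suggests.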
Storage
Storage is another common failure point that can be addressed through both increasing reliability and resiliency.
Modern storage systems should ideally use solid-state drives (SSDs). Without moving parts, flash memory storage can provide much greater MTTF than traditional hard drives. However, flash memory has a write endurance limit that traditional disk-based drives do not face: flash memory cells that are written to too often can lose the ability to hold charge, corrupting the data stored in them.
Like any other component, buying a higher quality SSD can improve reliability and service life. Top tier SSD manufacturers use high-quality wear-leveling algorithms that extend the life of the drive by spreading writes around the disk. The type of flash memory can also affect long-term reliability. Higher quality single-level cell (SLC) flash can last an order of magnitude longer in terms of write cycles than lower grade multi-level cell (MLC) flash. Though SLC costs significantly more than MLC, the price may be worth it for write-intensive applications.
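A rough endurance estimate shows why the flash type matters for write-intensive workloads. The sketch assumes ideal wear leveling and a fixed write amplification factor. The program/erase cycle counts are placeholder figures chosen only to reflect the order-of-magnitude gap described above, not vendor specifications, and the function and parameter names are hypothetical.

```python
def ssd_lifetime_years(
    capacity_gb: float,
    pe_cycles: int,
    host_writes_gb_per_day: float,
    write_amplification: float = 2.0,  # assumption: flash writes per host write
) -> float:
    """Estimate drive life in years, assuming writes are spread evenly over all cells."""
    if min(capacity_gb, pe_cycles, host_writes_gb_per_day, write_amplification) <= 0:
        raise ValueError("all inputs must be positive")
    total_flash_writes_gb = capacity_gb * pe_cycles
    flash_writes_gb_per_day = host_writes_gb_per_day * write_amplification
    return total_flash_writes_gb / flash_writes_gb_per_day / 365.0


if __name__ == "__main__":
    # Hypothetical 64 GB drive logging 200 GB per day.
    for label, cycles in (("MLC", 3_000), ("SLC", 30_000)):  # placeholder cycle counts
        years = ssd_lifetime_years(64, cycles, 200)
        print(f"{label}: about {years:.1f} years")
```

With these placeholder inputs the lower-endurance drive wears out in a little over a year, while the higher-endurance one lasts more than a decade.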
The MTTR of storage subsystems can be addressed through redundancy. Redundant array of independent disks (RAID) is a storage virtualization technology that allows multiple drives to store redundant copies of data while presenting a single logical image.
Several types of RAID exist, with some focusing more on performance and others on reliability. RAID 1, which simply mirrors data across two disks, is the best choice for most industrial PCs as it provides data protection through redundancy with a minimum number of drives. RAID 1 also keeps MTTR low compared with parity-based RAID levels: the system keeps running from the surviving disk, and a replacement drive is rebuilt by a straight copy rather than by recalculating data from parity.
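The mirroring behavior can be modeled with a toy in-memory volume. This is an illustration of the concept only, not how a real RAID controller or driver is implemented. `MirroredVolume` and its methods are hypothetical names.

```python
class MirroredVolume:
    """Toy RAID 1 model: every write goes to all working disks, reads use any one."""

    def __init__(self, num_blocks: int, copies: int = 2):
        self.num_blocks = num_blocks
        self.disks = [[b""] * num_blocks for _ in range(copies)]
        self.failed = [False] * copies

    def _working(self):
        return [i for i, bad in enumerate(self.failed) if not bad]

    def write(self, block: int, data: bytes) -> None:
        working = self._working()
        if not working:
            raise IOError("all mirrors have failed")
        for i in working:
            self.disks[i][block] = data

    def read(self, block: int) -> bytes:
        working = self._working()
        if not working:
            raise IOError("all mirrors have failed")
        return self.disks[working[0]][block]

    def fail_disk(self, index: int) -> None:
        # Simulate a dead drive: its contents are lost.
        self.failed[index] = True
        self.disks[index] = [b""] * self.num_blocks

    def replace_disk(self, index: int) -> None:
        # Rebuild is a straight copy from a surviving mirror; no parity math.
        working = self._working()
        if not working:
            raise IOError("no surviving mirror to rebuild from")
        self.disks[index] = list(self.disks[working[0]])
        self.failed[index] = False
```

Reads and writes continue while one disk is failed, and the rebuild step happens in the background of normal operation from the application's point of view.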
Networking
Networking failures can happen for a variety of reasons, and are addressed by choosing reliable components and implementing redundancy.
Networking redundancy can be implemented at the link level and the card level. Network interface cards (NICs) with dual Ethernet ports provide redundancy in case one link fails, and two NICs provide an even greater level of redundancy by protecting against card failure.
Memory
Unfortunately, memory is one system component that is difficult to address with redundancy. To increase availability, therefore, reliability has to be addressed instead.
In industrial environments, electromagnetic interference (EMI) is an issue that can potentially flip bits in RAM. In these cases, error-correcting code (ECC) memory can detect and correct such errors. For additional reliability, memory modules can also be treated with conformal coatings, which are chemical dips and sprays that protect against environmental hazards such as moisture, dust, and other contaminants.
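The idea behind error correction can be shown with a Hamming(7,4) code, which stores three parity bits alongside four data bits and can repair any single flipped bit. ECC memory typically uses a wider code of the same family with additional double-error detection. The small version below is for illustration only, and the function names are hypothetical.

```python
from typing import List, Tuple


def hamming74_encode(data_bits: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword: List[int]) -> Tuple[List[int], int]:
    """Return (data bits, error position); position 0 means no error was found."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s4  # syndrome gives the 1-based bit position
    if error_position:
        c[error_position - 1] ^= 1  # flip the damaged bit back
    return [c[2], c[4], c[5], c[6]], error_position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a single bit flip caused by interference
    data, position = hamming74_correct(word)
    print(data, position)
```

The demo flips the bit at position 5 and recovers the original data `[1, 0, 1, 1]`. This small code cannot correct two simultaneous flips in the same codeword.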
Final thoughts
While industrial embedded systems are often asked to perform in mission-critical applications, the traditional approach of increasing availability by only addressing system reliability is inefficient and often expensive.
By leveraging the principles of HA from the enterprise IT market to combine benefits of reliable components with resilient system design, industrial designers can make cost-effective industrial PCs and embedded computers with five-nines availability.
FSP Group