Why ECC matters in embedded design
January 14, 2015
Imagine you’re an IT professional who has been tasked with deploying server-based applications and software on a traditional desktop PC. It’s hard to picture such a PC running mission-critical applications like corporate firewalls, CRM, accounting, or network security that are accessed simultaneously by thousands of users around the globe.
We’ve become used to the fact that our day-to-day electronics sometimes require a reset. Our phones, GPS devices, or wireless routers get stuck, so we simply give them a “hard reset” (power off) and wait for them to work again. We don’t view this as a defect in the product so long as it works again after the restart. Servers are a different matter: companies are understandably cautious in running and guarding their largest information investments, and regularly pay premiums for server-grade hardware over traditional desktop PCs.
But hardware-wise, there aren’t many differences between desktop PCs and servers: mostly just stronger power supplies, more connectors, and rack-mountable enclosures. The biggest advantage servers have over PCs is error correction, in which the CPU applies an error-correcting code (ECC) to the contents of the DRAM. Desktop CPUs are typically based on a 64-bit-wide memory bus. Server memory buses are 72 bits wide, allowing for an additional 8 parity bits for each 64-bit data word. The parity bits are generated by the CPU and written, together with the data, to server memory modules with a 72-bit bus. Upon a read, the CPU can detect and correct bit flips. Through the server’s operating system, ECC activity can be monitored to get an idea of how often a desktop PC without ECC might have sustained a critical error.
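To make the 64 + 8 bit arithmetic concrete, here is a minimal Python sketch of one way 8 check bits can protect a 64-bit word. It assumes an extended Hamming code (single-error correction, double-error detection), which is a textbook scheme and not necessarily the code any particular CPU uses; real memory controllers do this in hardware with their own codes. The names encode, decode, PARITY_POS, and DATA_POS are illustrative only.

```python
# Sketch of a (72,64) extended Hamming code (SECDED). Illustrative only:
# real memory controllers implement their own codes in hardware.

PARITY_POS = [1, 2, 4, 8, 16, 32, 64]                        # 7 Hamming check bits
DATA_POS = [p for p in range(1, 72) if p not in PARITY_POS]  # 64 data bits
# Position 0 holds the eighth check bit: overall parity of the whole word.


def encode(data):
    """Return a 72-element bit list (the codeword) for a 64-bit integer."""
    if not 0 <= data < 1 << 64:
        raise ValueError("data must fit in 64 bits")
    bits = [0] * 72
    for i, pos in enumerate(DATA_POS):
        bits[pos] = (data >> i) & 1
    for p in PARITY_POS:
        # Check bit p gives even parity over all positions whose index has bit p set.
        bits[p] = sum(bits[pos] for pos in range(1, 72) if pos & p) & 1
    bits[0] = sum(bits[1:]) & 1  # make the parity of all 72 bits even
    return bits


def decode(bits):
    """Return (data, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, 72):
        if bits[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    odd_parity = sum(bits) & 1
    if odd_parity:
        # Odd overall parity: assume a single flip at index `syndrome`
        # (index 0 means the overall parity bit itself flipped).
        if syndrome >= 72:
            return None, "uncorrectable"
        bits[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        # Even overall parity but nonzero syndrome: two bits flipped.
        return None, "uncorrectable"
    else:
        status = "ok"
    data = 0
    for i, pos in enumerate(DATA_POS):
        data |= bits[pos] << i
    return data, status


if __name__ == "__main__":
    codeword = encode(0x0123456789ABCDEF)
    codeword[37] ^= 1            # simulate a single bit flip in DRAM
    print(decode(codeword))      # the original word is recovered
```

In this sketch a single flipped bit anywhere in the 72-bit word is repaired on read, while two flipped bits are reported as uncorrectable instead of being silently passed on as valid data.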
In the embedded market, you’ll find hardware powering high-reliability solutions 24/7, including industrial electronics, medical, aerospace, or even safety-critical designs. But you’ll rarely find products that incorporate ECC. The memory chips and modules used on the majority of products are similar to those used in desktop PCs or consumer-grade electronics, and a majority of CPUs utilized in embedded designs don’t support ECC. Yet the expectation for reliability of industrial and embedded systems is higher than that of consumer electronics and desktop PCs.
By all accounts, there’s a lack of awareness in the embedded space about the importance of ECC and DRAM’s sensitivity to single-bit errors. Everyone is aware when hardware fails and the problems it can cause in the field. We see it happen every day in our electronics at home. But unlike consumers, industrial customers can’t simply “hard reset” their hardware.
DRAM devices are the “softest” parts used in electronic designs. We say “soft” because the data bits don’t function like a clear on-off switch. They’re stored by charging a capacitor with either a lower or a higher charge. The amount of charge per memory cell is a fraction of a femtocoulomb (less than 10^-15 coulombs (C)), held in a capacitor with fairly high leakage that loses its charge after a few milliseconds. This is why DRAM requires a periodic refresh from the CPU’s memory controller.
Unfortunately, DRAM bit errors are random and not reproducible. A cell might function well for a million cycles, then flip, and then work again for another million cycles. This is why DRAM manufacturers can’t possibly screen them out, even with the heaviest burn-in testing. DRAMs have neither spare blocks nor bad-block management like flash memory, and neither would help anyway, as the cells themselves aren’t actually bad. A soft error is a transient effect, not a permanent defect. The next bit might flip in a totally different cell, so the errors can’t be mapped out.
The only real way to guard against DRAM bit errors in industrial electronics is to incorporate ECC into the design. This can be done in one of two ways: by utilizing ECC-capable CPUs or FPGAs, or by utilizing DRAM with built-in ECC.
The first solution is the current “go-to” option, but it incurs a greater expense in both components and design. In some cases it’s simply impossible, as CPU-based ECC can only be achieved with nine or more DRAM components, which may not fit on smaller embedded boards.
The second solution, DRAM with built-in ECC functionality, is a relatively new option that comes with a nominal increase in component cost per chip and requires no changes on the design side. ECC DRAM shares the same footprint as existing components and works with any number of DRAM components in an application. Thus far, Intelligent Memory is the first to successfully incorporate ECC functionality onto the DRAM chip itself in DDR1, DDR2, DDR3, and Mobile DDR components.
Thorsten Wronski is the President of Sales and Technology for Memphis Electronic, based in Germany. Nicholas Urbano is the Director of Sales for the Americas for Memphis Electronic, based in the U.S.