Be very sure of your device's reliability
July 05, 2017
For an embedded system, reliability means no unexpected data loss. Looking below the application, this breaks down into two main categories:
- Whether the device remains functional, often after being shelved for a significant period of time. From the software's point of view, the concern here is usually the flash media, and it primarily involves hardware specifications and environmental conditions. This is also referred to as “data integrity.”
- Whether data just written actually resides on the media, usually after a system crash or unexpected power loss. File-system and flash-management software are integral to this version of reliability, and the best way to demonstrate that is through effective testing.
When an embedded device isn’t writing data to the media, a power interruption or system crash will lose any uncommitted file data still held in RAM. Application programmers are usually familiar with this, and tend to issue flush (and related) commands to make sure data isn’t lost. Beyond that, a power interruption causes problems only if it lands in the middle of an atomic multi-block write and the file system has no way to handle it (such as discarding the results of the partial commit).
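As a minimal sketch of the flush pattern described above, the following Python function appends a record and returns only after asking the operating system to push it to the media. The function name `durable_append` is hypothetical and chosen for illustration. Note that `os.fsync` only requests the flush; whether the data has truly reached the flash still depends on the file system, the driver, and the device firmware.

```python
import os


def durable_append(path: str, payload: bytes) -> int:
    """Append payload to path, then flush it as far down the stack as we can ask.

    `durable_append` is a hypothetical helper name, not part of any library.
    """
    with open(path, "ab") as f:
        written = f.write(payload)  # data may still sit in a user-space buffer
        f.flush()                   # hand the user-space buffer to the OS
        os.fsync(f.fileno())        # ask the OS to write its cache to the media
    return written
```

Data written without the `flush` and `fsync` steps is exactly the "uncommitted file data in RAM" that a power interruption would discard.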
It’s also safe to say that most embedded systems spend the majority of their time in this state—basically not writing data. This means that random power interruption testing will hit this state most frequently, proving only what the system designer has already planned for.
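To put a rough number on this, here is a small sketch. It assumes each interruption lands at a uniformly random, independent moment, and that the device spends a fixed fraction of its time writing. The function names and the 0.1% write duty cycle are illustrative assumptions, not measurements from any real device.

```python
import math


def p_hit_write(duty_cycle: float, trials: int) -> float:
    """Chance that at least one of `trials` random interruptions lands in a write.

    Assumes independent, uniformly timed interruptions (an idealization).
    """
    return 1.0 - (1.0 - duty_cycle) ** trials


def trials_needed(duty_cycle: float, confidence: float) -> int:
    """Smallest number of random interruptions that reaches `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - duty_cycle))


# Hypothetical device that is writing 0.1% of the time.
print(p_hit_write(0.001, 100))     # a little under 0.1
print(trials_needed(0.001, 0.95))  # roughly 3,000 interruptions
```

Under these assumptions, a hundred random power cuts have less than a one-in-ten chance of touching a write even once, and hitting a particular step within a write is rarer still.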
The more interesting failure location is at the point of media write. It’s here that we believe focused power interruption testing should be conducted. This will enable system designers to discover how the file system and flash media firmware or drivers handle an interrupted write operation, including what sorts of errors (if any) are returned to the application. Testing here will also examine how interrupted atomic (and non-atomic) writes are handled, and under what conditions files can be corrupted. We focused on both of these topics for the FAT file system in a white paper, Where does FAT fail?
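A focused test of this kind can be sketched in simulation. The code below is a generic illustration and does not represent FAT or any particular product: the `SimulatedMedia` class, the two-region layout, and all function names are hypothetical. It also assumes that a single block write is atomic, which real flash does not always guarantee. The harness cuts power before each block write in turn and records what a reader would see after restart.

```python
class PowerLoss(Exception):
    """Raised by the simulated media when power is cut."""


class SimulatedMedia:
    """Hypothetical block store that cuts power after a set number of writes."""

    def __init__(self, num_blocks, fail_after=None):
        self.blocks = [b""] * num_blocks
        self.fail_after = fail_after  # None means power never fails
        self.writes = 0

    def write_block(self, index, data):
        if self.fail_after is not None and self.writes >= self.fail_after:
            raise PowerLoss()
        self.blocks[index] = data  # assumption: one block write is atomic
        self.writes += 1


REGION_SIZE = 3  # blocks per data region; block 0 holds the commit record


def region_start(region):
    return 1 + region * REGION_SIZE


def read_live(media):
    """Return the data region that the commit record points at."""
    start = region_start(media.blocks[0][0])
    return media.blocks[start:start + REGION_SIZE]


def commit(media, new_blocks):
    """Write new data to the spare region, then flip the commit record."""
    spare = 1 - media.blocks[0][0]
    start = region_start(spare)
    for i, data in enumerate(new_blocks):
        media.write_block(start + i, data)
    media.write_block(0, bytes([spare]))  # the commit point


def interrupt_every_write(old, new):
    """Cut power before each write in turn; return the state seen on restart."""
    outcomes = []
    for fail_after in range(len(new) + 2):  # the last pass is never interrupted
        media = SimulatedMedia(1 + 2 * REGION_SIZE)
        media.blocks[0] = bytes([0])          # region 0 starts as live
        media.blocks[1:1 + REGION_SIZE] = old
        media.fail_after = fail_after
        try:
            commit(media, new)
        except PowerLoss:
            pass  # the device lost power mid-commit
        outcomes.append(read_live(media))
    return outcomes
```

Calling `interrupt_every_write` with three old blocks and three new blocks yields five recovered states. A design that handles interruption correctly should show either the complete old data or the complete new data in every one of them, never a mixture.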
If the effort isn’t spent validating every power interruption scenario, a small measure of benefit can be gained by reducing how much the embedded system writes. This is an area of rapidly diminishing returns: what’s the point of a device that can log data, usage, and even failure statistics if that data is never actually written for fear of interruption? Even systems that seldom write still write sometimes, and an interruption that wasn’t planned for at that moment could become a major expense later. It’s always better (and less expensive) to use software designed to handle power interruption.
Thom Denholm is a Technical Product Manager at Datalight, combining a strong focus on operating system and file system internals with a knowledge of modern flash devices. He holds a BS in Mathematics and Computer Science from Gonzaga University. In his spare time, he works as a professional baseball umpire and an Internet librarian. Though he has lived in and around Seattle all his life, he has never had a cup of coffee.