STOP doing error checking! Part 2: Check for errors, instead.
March 06, 2017
My last STOP error checking post alluded to “a few, ordinary laws of physics” that give us insight into how we might do true and useful error checking. I’ll cover those in a moment, but first I’d like to briefly explore the notion of the error itself as it relates to our discussion.
In psychology, human error is defined as “something done that was not intended by the actor.” The topic is often subdivided into mistakes, lapses, slips, and violations, each recognizing distinct pathways to an erroneous outcome, as noted by James Reason in Human Error. The details make for interesting reading, and they’ll help you understand the notion of “error” in all its forms, including those applicable to embedded computing.
Philosophically speaking, we’ve always conceived of computers as mere “better humans” that just do the same things we do, only faster and (hopefully) more reliably. The extent of our anthropomorphism is so complete, in fact, that we reflexively apply the human-error standard of “that’s not what I wanted,” whenever our computers surprise us with unexpected acts.
Embedded systems obviously aren’t humans, so drawing parallels with human error will get you only so far. The definition of human error incorporates intent, for example, but computers don’t ever “intend” to do something; they either do that thing or they don’t, according to the logic of their programs.
If we hold them to the human standard, and it seems that we do, can intent-free computers ever really commit errors? No. And because of this, your “error checking” is probably only making things worse.
Wait, let me finish!
When did you last write code that looks like this?
int set_gpio(int pin, int val)
{
    if (val) {
        PINCTRL |= (1 << pin);
        return 0;
    } else {
        PINCTRL &= ~(1 << pin);
        return 0;
    }
    return -1; /* error */
}
…
int main(void) {
    int ret;
    …
    ret = set_gpio(PIN, 1);
    if (ret)
        return ret;
    …
}
Probably yesterday. Maybe not something quite so obvious, but I bet that whatever you wrote had an error code that suggested whether the requested outcome was achieved or not.
Even without looking at your code, I think we can all agree that a function like set_gpio() can’t possibly know why the calling function desires to set the requested pin to the requested state (if it did, then we’d probably name it something reflecting that, like pour_beer()). All we can say for sure is that the calling function is requesting that we configure the pin’s circuitry to assert the indicated state. Thus, our error code can accurately indicate only whether we configured the circuitry as requested, or not. Communicating anything more than that means we’re trying to guess the caller’s intent.
In the strictest sense, a function like set_gpio() can’t even test that the pin itself physically changed to the desired state. It has no way of knowing, for example, that it’s connected to an external circuit that indicates the position of a slow-moving valve by electrically back-driving the same pin. A brief disagreement between the pin’s configuration and physical state indicates that the system is actually working correctly! (I2C busses indicate ACK/NAK using this same method, so this signaling mechanism isn’t nearly as unusual as it sounds).
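To make the back-driving idea concrete, here is a minimal sketch of a shared open-drain line, the wiring style that I2C uses. The struct and function names are invented for illustration; they don’t come from any real driver, and the model ignores timing entirely.

```c
#include <stdbool.h>

/* Hypothetical model of a shared open-drain line: each device can only
 * pull the line low or release it. A pull-up resistor makes the line
 * high whenever nobody is pulling it low. */
struct od_line {
    bool mcu_pulls_low;   /* what our pin is configured to do */
    bool peer_pulls_low;  /* what the external circuit is doing */
};

/* The physical level is high only when no device is pulling low. */
static bool od_line_is_high(const struct od_line *line)
{
    return !line->mcu_pulls_low && !line->peer_pulls_low;
}
```

If our side releases the line while the peer pulls it low, od_line_is_high() reports a low level even though our pin is configured for a high one. In this model that mismatch is the peer talking to us, not a broken pin.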
The logical conclusion is this: enumerations returned from assertive functions like set_gpio() can’t ever be evidence that a system is functioning correctly because those functions can’t ever understand the overall system’s intended behavior. If you don’t know the intended behavior, you can’t check the system’s actual behavior against it.
So if things like checking the result of set_gpio() are the extent of your “error checking”, then you don’t have any error checking.
As an industry, we need to stop pretending that doubling down on clever enumerations, try/catch clauses, coding standards, “goto considered harmful” mantras, and so on will improve our systems. They haven’t, and they won’t, because at best all they can confirm is that the system is actuating and communicating in the way we’ve already designed it to. Spilled beer is still spilled beer, whether you intended it or not; simply confirming that a GPIO pin is a one or a zero only makes things worse if the installer wires a replacement tap valve backwards by mistake.
The good news is that, given the above, we can stop writing set_gpio()-like functions that try to do more than just configure pin circuitry. We can also stop testing the enumerations returned from such functions for more insight than they can give us, and then back-propagating those overreaches to the calling function. Our code just got a lot simpler, and we’re no longer surprised by system failures that we “somehow missed” because we weren’t ever really checking for them in the first place.
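As a rough sketch of what that simplification might look like, here is a set_gpio() that configures the pin and reports nothing. The PINCTRL variable below is only a stand-in so the snippet compiles off-target; on real hardware it would be a memory-mapped register at an address from your MCU’s datasheet.

```c
#include <stdint.h>

/* Stand-in for the memory-mapped pin control register (an assumption
 * made so this sketch compiles on a desktop machine). */
static volatile uint32_t PINCTRL;

/* Configure the pin's driver circuitry to assert the requested state.
 * Nothing is returned: the function cannot know whether the caller's
 * larger goal was met, so it does not pretend to report on it.
 * The caller must pass a pin number in the range 0..31. */
static void set_gpio(unsigned pin, int val)
{
    if (val)
        PINCTRL |= (UINT32_C(1) << pin);
    else
        PINCTRL &= ~(UINT32_C(1) << pin);
}
```

With no return value, there is nothing for the caller to test, carry around, or propagate upward.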
(Actually, there’s even better news: by striking all those “error checks” that don’t actually check for errors, we’re likely to improve the quality of the system even if we don’t do anything else. That’s because study after study has shown a direct correlation between the number of lines of code we write and the number of bugs that get shipped with the system. But that’s a topic for another day).
So, what does true “error checking” actually look like? Exactly how it sounds: affirmative testing for conditions that, if found, indicate that an error is occurring. Let’s head in that direction now, shall we?
Consider again our simple, GPIO-actuated valve. Those pesky, aforementioned laws of physics already dictate that when the valve gate is truly, physically open, any pressure differential between the two sides will cause material to flow whether we want it to or not. Likewise, no material will flow if the valve gate is closed. Mother Nature has eliminated any possibility for errors here.
What we care somewhat more about is whether the valve opens and closes under our control. Unfortunately, we usually can’t answer this question directly unless we can physically observe the condition of the valve gate itself in a way that’s reliable enough to trust. That sounds like an expensive setup to me, one that probably also includes radioactive isotopes and/or x-rays.
Even if we could somehow observe the valve gate directly, that wouldn’t help us much from an error-checking perspective unless we could also open and close the valve at will to confirm that it moves as we want it to, when we want it to. I haven’t encountered a system like that recently. Perhaps ever.
All we’re left with, then, is the obvious: Is there flow only when intended?
It turns out, implementing that question literally is often the best way to answer it. We could install pressure sensors on either side of the valve, for example, and periodically check that the pressure differential we observe matches the valve state expressed by the input signal we’ve been sent:
for (;;) {
    /* sense */
    intent = get_input();
    /* control */
    set_gpio(ACTUATOR, intent); /* open/close valve */
    /* check for error */
    deltap = get_pressure_differential();
    switch (intent) {
    case OPEN: /* valve should be open */
        if (deltap) goto error; /* … but seems closed */
        break;
    case CLOSED: /* valve should be closed */
        if (!deltap) goto error; /* … but seems open */
        break;
    }
}
Those of you with any background in fluids (beer or otherwise) will stop me here, protesting that there are plenty of cases where pressure differentials across a valve gate aren’t usefully correlated with mass flow. Open an ordinary ball valve all the way, for example, and the restriction that causes the pressure drop disappears entirely. There’s no measurable differential until the valve starts to close again, throttling the fluid moving through it.
And then there are systems like hydraulic rams: Their valves are usually pressurized equally on both sides, whether open or closed, because the load and pump are literally pushing against each other with roughly equal force. Any observed pressure difference here means that the valve is restricting flow while the load is in motion, suggesting that the system is literally at work. That’s hardly an error.
With all due respect, objecting to my pseudocode on those grounds means you might have missed my point entirely (admittedly, I set you up for it because I wanted to discuss as many sides of this issue as possible. Please don’t take it personally that I like to be thorough whenever I can).
The main concept reflected in the above code is that we are directly, literally checking for deviations from the intended behavior of the system. That kind of “error checking” is actually useful. The fact that the GPIO pin’s driver circuitry follows the instructions of our program doesn’t mean much, given all the other ways that the valve can fail. And besides, our job is to control a valve, not a GPIO pin.
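One way to sketch that idea without committing to pressure sensors is to compare the intent against a plain “is there flow?” observation from whatever sensor suits your system. Everything named here is hypothetical, including the SETTLE_SAMPLES allowance, which is my own assumption that a valve needs a few samples to finish moving before a mismatch counts as an error.

```c
#include <stdbool.h>

enum valve_intent { CLOSED = 0, OPEN = 1 };

/* Hypothetical: consecutive mismatched samples needed before we call
 * it an error, to give a moving valve time to settle. */
#define SETTLE_SAMPLES 3

struct flow_check {
    unsigned mismatches;  /* consecutive samples that disagreed with intent */
};

/* Feed one observation per control-loop pass. Returns true once the
 * observed flow has disagreed with the intent for SETTLE_SAMPLES
 * consecutive samples; any agreeing sample resets the count. */
static bool flow_check_update(struct flow_check *fc,
                              enum valve_intent intent,
                              bool flow_observed)
{
    bool flow_wanted = (intent == OPEN);

    if (flow_observed == flow_wanted) {
        fc->mismatches = 0;
        return false;
    }
    if (fc->mismatches < SETTLE_SAMPLES)
        fc->mismatches++;
    return fc->mismatches >= SETTLE_SAMPLES;
}
```

Note that the check never asks what the GPIO pin is doing. It only asks whether there is flow when, and only when, flow was intended.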
Is the value returned by a function like set_gpio() ever useful? Not really. And in fact, paying attention to such a value anyway can affect you in ways you probably won’t like.
I’ve already mentioned the existence of signaling mechanisms that briefly override the desired state of the physical GPIO pin in ways that don’t indicate a broken circuit. If an error code returned by set_gpio() were to indicate a mismatch between the desired and actual pin state, then we’d have to add logic somewhere else to say “but that’s ok, the installer used a different valve this time.” More code, and the worst kind: unusual, conditional, and probably poorly-tested.
Even in more ordinary settings, paying such close attention to the outcome of set_gpio() risks binding our implementation to a specific polarity of valve driver. By directly testing for flow instead, we can determine on our own the voltage levels necessary to control it. We can even make this the ordinary behavior of the system, defeating the “conditional code” objection AND eliminating an entire class of installer errors.
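Here is one hedged sketch of how a system might work out the driver polarity for itself. The callback types, the enum, and the sim_* helpers are all made up for illustration; the simulated valve exists only so the sketch can be tried on a desktop. A real system would also need to wait for the valve to settle between driving the pin and sampling flow, which this sketch skips.

```c
#include <stdbool.h>

typedef void (*drive_fn)(int level);  /* hypothetical: drives the actuator pin */
typedef bool (*flow_fn)(void);        /* hypothetical: true if flow is observed */

enum polarity { POLARITY_UNKNOWN = -1, OPEN_IS_LOW = 0, OPEN_IS_HIGH = 1 };

/* Try both pin levels and see which one produces flow. On success the
 * pin is left at the level that closes the valve. */
static enum polarity discover_polarity(drive_fn drive, flow_fn flowing)
{
    drive(0);
    bool flow_at_low = flowing();
    drive(1);
    bool flow_at_high = flowing();

    if (flow_at_high && !flow_at_low) {
        drive(0);                 /* low closes this valve */
        return OPEN_IS_HIGH;
    }
    if (flow_at_low && !flow_at_high)
        return OPEN_IS_LOW;       /* pin is already high, which closes it */

    /* Flow at both levels, or at neither: the valve is not under our control. */
    return POLARITY_UNKNOWN;
}

/* Simulated valve for trying the sketch off-target. */
static int sim_level;
static bool sim_open_is_high = true;

static void sim_drive(int level) { sim_level = level; }
static bool sim_flowing(void) { return (sim_level == 1) == sim_open_is_high; }
```

Calling discover_polarity(sim_drive, sim_flowing) with sim_open_is_high set either way returns the matching polarity, with no installer-specific configuration anywhere in the code. The POLARITY_UNKNOWN result is the interesting one: it means flow doesn’t follow the pin at all, which is a true error.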
But really, the main reason we want to ignore the result of set_gpio() is that it’s just diagnostic data, useful only for ruling out the cause of a failure we’ve already found by other means. Failures that set_gpio() could detect are exceedingly unlikely (compared to, say, a ruptured valve), and at that point we’re probably scrambling to mop up the spilled beer, not to write down error codes.
Meanwhile, the code you added to religiously carry around the value returned by set_gpio() has probably created real bugs, ones that are far more likely to cause the system to fail than breaking the GPIO pin’s on-chip driver circuitry.
If you’re still reading at this point, then you’ve probably got one more reservation: I fixed the absence of error-checking in our simple GPIO-controlled valve example by just adding more hardware. Is that fair?
No, it isn’t: We’re judging a system that had no error-checking against one that actually does.
My goal for this and the next few articles is to create a space where we can all finally admit that our approach to “error checking” as an industry is, in a word, pathetic. Because admitting a problem is the first step towards solving it, right? Or, so I hear. I remain hopeful.
Our current preoccupation with carefully observing relatively useless, unlikely diagnostic data creates a lot of code, but it isn’t making our systems any better or safer. The proof is the abysmal quality of our systems themselves, which seem to be getting worse every day.
What’s the fix? Let’s start actually checking for errors, rather than pretending that observing diagnostic data can keep our systems and users safe. I’m not sure what the options are for true error-checking in a system that consists solely of a microcontroller (MCU) and GPIO-actuated driver circuitry to open and close a valve. It may come down to just admitting that it can’t check for errors on its own, but I’ll be thinking about that until next time.
Until then, code deliberately.
Bill Gatliff