Pitfalls of multicore software: Why data races are never benign
March 01, 2012
Static analysis tools build models of a program to help single out race conditions that would otherwise undermine software integrity.
Programming multicore processors to take advantage of their power means writing multithreaded code. C and C++ were not designed for concurrency, so developers working in those languages must use a library such as pthreads. Multithreaded code is more difficult to get right than single-threaded code because it introduces entirely new classes of programming defects.
In the rogue’s gallery of concurrency bugs, the race condition is a notorious repeat offender. A race condition occurs when a program checks a resource property and performs an action on the assumption that the property has not changed, even though an external actor has slipped in and changed that property.
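As a minimal sketch of the pattern (the buffer and its functions here are hypothetical), consider a thread that checks for free space in a shared buffer and then writes to it on the assumption that the space is still available:
if (buffer_free_slots() > 0) {
    // Another thread may fill the buffer between the check and the write,
    // so the write can overflow the buffer even though the check succeeded.
    buffer_put(item);
}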
A data race is a particular type of race condition that involves concurrent access to memory locations in a multithreaded program. This defect occurs when there are two or more execution threads that access a shared memory location, at least one thread is changing the data at that location, and there is no explicit mechanism for coordinating access. If a data race occurs, it can leave the program in an inconsistent state.
The insidious nature of data races
It is widely assumed that some data races are harmless and can be safely ignored. Unfortunately, this is only true in very rare circumstances. It is best to explain why by introducing an example.
The singleton pattern is a commonplace idiom in which the program maintains a reference to a single underlying object, and a Boolean variable records whether that object has been initialized. This pattern is also known as lazy initialization. The following code is an example of the pattern:
if (!initialized) {
    object = create();
    initialized = true;
}
... object ...
This code is perfectly adequate for a single-threaded program, but it is not thread-safe because it has a data race on the variable named initialized. If called by two different threads, there is a risk that both threads will observe initialized as false at essentially the same time, and both will call create(), thus violating the singleton property.
To make this thread safe, the natural approach is to protect the entire if statement with a lock. However, acquiring and releasing locks can be costly, so programmers try to avoid the expense by using the double-checked locking idiom – a check outside the scope of the lock and another inside. The inner check is there to confirm that the first check still holds after the lock has been acquired:
if (!initialized) {
    lock();
    if (!initialized) {
        object = create();
        initialized = true;
    }
    unlock();
}
... object ...
Superficially, this looks like it will suffice, and indeed, it will as long as the statements are guaranteed to execute in that order. However, an optimizing compiler might generate code that essentially switches the order of object = create() and initialized = true. After all, there is no explicit dependence between those two statements. In that case, if a second thread enters this code any time after the assignment to initialized, that thread would then use the value of object before it has been initialized.
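The effect can be seen by rewriting the snippet as the compiler is effectively permitted to arrange it; this is only a sketch of possible behavior, not the output of any particular compiler:
if (!initialized) {
    lock();
    if (!initialized) {
        initialized = true;   // flag becomes visible first...
        object = create();    // ...so another thread can pass the outer check
    }                         // and use object before it has been assigned
    unlock();
}
... object ...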
Optimizing compilers are inscrutable beasts. Those that optimize for speed will take many esoteric considerations into account, few of which are obvious to a programmer. It is common for them to generate instructions that are apparently out of order because doing so might result in fewer cache misses, or because fewer instructions are needed.
It is wrong to conclude from the previous example that the compiler is at fault for reordering the statements. The compiler is doing exactly what it is allowed to do. The language specification is clear and unambiguous on this point: the compiler is allowed to assume that there are no data races in the program.
In actuality, the specification is somewhat broader: Compilers are allowed to do anything in the presence of undefined behavior. This is sometimes facetiously referred to as catch fire semantics; the specification gives the compiler permission to set a computer on fire if the program has undefined behavior. As well as data races, many traditional bugs such as buffer overruns, dereferences of invalid addresses, and so on constitute undefined behavior. Because compilers are free to do anything, rather than burn the building down they typically do the sensible thing, which is to assume that the undefined behavior will never happen and optimize accordingly.
The consequences of this can sometimes be surprising, even to those who are experts in concurrency and compilers. It can be difficult to convince programmers that code that looks completely correct can be compiled into code that has serious errors.
Another example is worth describing. Suppose there are two threads where one reads a shared variable and the other writes to it. Let’s assume that it does not matter to the reader if it sees the value before or after it has been changed by the writer (this is not an uncommon pattern). If those accesses are not protected by a lock, then there is clearly a data race. Notwithstanding the catch fire rule, however, most programmers would conclude that this is completely benign.
As it turns out, there are at least two plausible ways in which this code could be compiled where the reader would see a value that is wrong. The first way is easy to explain: Suppose the value were a 64-bit quantity on an architecture that can only read 32-bit words. Then both the reader and the writer need two instructions, and an unlucky interleaving might mean that the reader sees the top 32 bits of the old value and the bottom 32 bits of the new, which when combined can be a value that is neither the old nor the new.
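A sketch of that first scenario follows; the variable and function names are illustrative, and the exact instruction split depends on the compiler and target:
#include <stdint.h>

uint64_t shared;                  /* written by one thread, read by another, no lock */

void writer(uint64_t v) {
    shared = v;                   /* on a 32-bit target: two separate word stores */
}

uint64_t reader(void) {
    return shared;                /* two word loads; interleaved with the stores,
                                     the result can combine halves of the old and
                                     new values into a value that was never written */
}
Making the variable atomic (_Atomic in C11, std::atomic in C++11) guarantees that each access is indivisible and removes the possibility of such tearing.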
The second way in which wrong code could be generated is more subtle. Suppose the reader did the following, where the data race is on the variable named global:
int local = global;   // Take a copy of the global
if (local == something) {
    ...
}
...                   // Some non-trivial code that does not change global or local
if (local == something) {
    ...
}
Here the reader is making a local copy of the racy variable and referring to that value twice. It is reasonable to expect that the value will be the same in both places, but again, the optimizing compiler can generate code where that expectation is unmet. If local is assigned to a register, it will have one value for the purposes of the first comparison; but if the code between the two conditionals is sufficiently nontrivial, that register may get spilled – in other words, reused for a different purpose. In that case, rather than saving the copy, the compiler may simply reload the value of local from the global variable at the second conditional, by which time the writer might have changed it to a different value.
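In effect, the generated code can behave as if the local copy had never been taken; again, this is a sketch of behavior the compiler is permitted to produce, not actual output:
if (global == something) {   // first check reads global directly
    ...
}
...                          // register holding the copy is reused here
if (global == something) {   // value reloaded from global, which the
    ...                      // writer may have changed in the meantime
}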
Programmers should be very skeptical of claims that some data races are acceptable and should strive to find and remove all of them from their code.
Techniques for finding risky defects
When it comes to finding concurrency defects, traditional dynamic testing techniques might be inadequate. A program that passes a test a hundred times is not guaranteed to pass the next time, even with the same inputs and the same environment. Whether these bugs manifest is exquisitely sensitive to timing, and the order in which operations in different threads are interleaved is essentially nondeterministic.
New dynamic testing techniques for finding data races are emerging. These work by monitoring an application as it executes, observing the locks held by each thread and the memory locations those threads access; if an anomaly is found, a diagnostic is issued. Other tools help diagnose data races suspected of causing failures, and some companies now offer tools that allow the events leading up to an anomaly to be replayed.
Static analysis tools can also be useful for finding data races and other concurrency errors. Whereas dynamic testing tools find defects that occur for particular executions of a program with a fixed set of inputs, static analysis tools check all possible executions and all possible inputs. For performance reasons, tools might place limits on how much exploration is done and thus might not be completely exhaustive; even so, they can cover much more than can ever be feasible with dynamic testing. The advantage of static analysis is that test cases are not required because the program is never actually executed.
Instead, these tools work by creating a model of the program and then exploring the model in various ways to find anomalies. GrammaTech’s CodeSonar finds data races by creating a model that represents the set of locks held by each thread and by performing a symbolic execution of the program that explores execution paths. It records the sets of variables protected by locks and uses this information to find interleavings that can result in shared variables being used without proper synchronization. Similar techniques can be used to find other concurrency defects such as deadlock and lock mismanagement.
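As a simple illustration of the kind of inconsistent locking such an analysis flags (the names here are purely illustrative), consider a shared counter that is protected by a mutex on one path but not on another:
#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
int counter;

void increment_locked(void) {
    pthread_mutex_lock(&m);
    counter++;                 /* counter is accessed with m held... */
    pthread_mutex_unlock(&m);
}

void increment_unlocked(void) {
    counter++;                 /* ...but here no lock is held, so two threads
                                  calling these functions can race on counter */
}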
Once found, data races are usually easy to fix, although doing so correctly can incur a performance penalty. In some cases there might be a temptation to use the C volatile keyword to correct the data race, but this is not recommended: volatile was not designed to solve concurrency problems, and in any case it is a poorly understood construct that is frequently miscompiled. The latest versions of C and C++ (C11 and C++11) have embraced concurrency and support atomic operations. Compiler support for these operations is slowly emerging, and until it becomes readily available, the best approach is to use locks.
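For the singleton example above, a lock-based fix might look like the following sketch using POSIX threads; the get_object() wrapper and the use of void * for the object are illustrative choices, not part of the original example:
#include <pthread.h>

extern void *create(void);                /* assumed to build the underlying object */

static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
static int initialized = 0;
static void *object;

void *get_object(void) {
    pthread_mutex_lock(&init_lock);       /* the whole check-and-create is now atomic */
    if (!initialized) {
        object = create();
        initialized = 1;
    }
    pthread_mutex_unlock(&init_lock);
    return object;
}
POSIX also provides pthread_once() for exactly this one-time initialization pattern, and it is usually the cleanest choice when it is available.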
To achieve high-quality software for multicore processors, a zero-tolerance policy for data races is recommended. Find them using a combination of both static and dynamic techniques, and take care not to rely too heavily on esoteric compiler techniques to fix them. These defects are so risky and unpredictable that eliminating them systematically is the only safe way to be sure that they do not cause harm.
GrammaTech
607-273-7340
www.grammatech.com