There are two ways in which the first code could be less efficient than the second, at least on some hypothetical architecture. On x86, my guess would be that they compile to the same code.
The first issue is that an atomic load might affect the performance of other processors. On Alpha, which is often a good "outlier" case when studying memory consistency, you would be issuing a memory barrier instruction over and over again, which could potentially lock the memory bus (on a non-NUMA machine) or do something else to force write atomicity of stores by two other CPUs.
The second issue is that a barrier affects all previous loads, not just the load of ready_. So maybe on a NUMA machine, ready_ actually hits in the cache because there is no contention and your CPU is already caching it in exclusive mode, but some previous load is waiting for the memory system. Now you have to stall the CPU to wait for that previous load instead of potentially continuing to execute instructions that don't conflict with it. Here's an example:
int a = x.load(std::memory_order_relaxed);
while (!ready_.load(std::memory_order_relaxed))
    ;
std::atomic_thread_fence(std::memory_order_acquire);
int b = y;
In this case the load of y could potentially stall waiting for x, whereas if the load of ready_ had been done with acquire semantics, the load of x could simply continue in parallel until its value is needed.
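For contrast, here is a sketch of the acquire-load version of that same wait loop (the surrounding declarations of x, y, and ready_ are my assumptions, filled in to make the snippet self-contained). Only the load of ready_ carries acquire semantics, so the earlier relaxed load of x is free to complete on its own schedule:

```cpp
#include <atomic>
#include <thread>

std::atomic<int>  x{0};
std::atomic<bool> ready_{false};
int y = 0;

int consumer() {
    // Relaxed load: free to complete whenever the memory system delivers it.
    int a = x.load(std::memory_order_relaxed);
    // Only this load imposes ordering on what follows it.
    while (!ready_.load(std::memory_order_acquire))
        ;
    int b = y;  // ordered after the successful acquire load of ready_
    return a + b;
}

int run_demo() {
    std::thread t([] {
        x.store(1, std::memory_order_relaxed);
        y = 42;  // plain store, published by the release store below
        ready_.store(true, std::memory_order_release);
    });
    int r = consumer();
    t.join();
    return r;
}
```

The release store to ready_ pairs with the acquire load in consumer(), so b is guaranteed to read 42; a may observe either 0 or 1, since the store to x is relaxed.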
Because of that second issue, you might actually want to structure your spinlock differently. Here is how Erik Rigtorp suggests implementing a spinlock on x86, which you could easily adapt to your use case:
void lock() {
    for (;;) {
        // Try to grab the lock; exchange returns the previous value,
        // so false means we just acquired it.
        if (!lock_.exchange(true, std::memory_order_acquire)) {
            break;
        }
        // Spin read-only until the lock looks free, so we don't
        // bounce the cache line between cores on every iteration.
        while (lock_.load(std::memory_order_relaxed)) {
            __builtin_ia32_pause();  // hint to the CPU that we're spinning
        }
    }
}