Recently, I’ve been working on OS kernel and have learned something about atomic operations, memory barriers, and synchronization primitives. It has been an incredibly rewarding experience.
Before this, my understanding of concurrent programming was almost nonexistent. During my undergraduate studies, I only briefly encountered Linux mutexes and semaphores. Although I wrote quite a bit of real-time networking stacks and device driver code during my Ph.D., my focus was almost entirely on implementing algorithms rather than understanding concurrency itself. I never developed a systematic understanding of when instruction reordering could become a problem or what kinds of locks should be used in different situations. Looking back now, I realize that many of my previous assumptions, and probably quite a bit of the code I wrote, were incorrect.
In this article, I’d like to organize as much as what I’ve learned so far, together with some of my own thoughts. Hopefully, it can serve as a gentle introduction to concurrent programming for others who are just getting started.
This article was written entirely by the author over the course of about two days and was not generated by AI. Claude Opus 4.8 was used only to proofread factual inaccuracies. This article is licensed under CC BY 4.0. Please provide attribution if you repost or redistribute it.
Let’s begin with a simple question: what is concurrency? Wikipedia defines it as follows [1]:
Concurrency refers to the ability of a system to execute multiple tasks through simultaneous execution or time-sharing (context switching), sharing resources and managing interactions.
In other words, concurrency means multiple tasks making progress either simultaneously or by taking turns sharing the processor.
For example, two tasks may run on the same CPU core through time slicing:
Core0: A -> B -> A -> B
Or they may execute truly in parallel on different CPU cores:
Core0: A -> A
Core1: B -> B
Once multiple CPU cores are involved, an obvious question arises. Suppose two threads, A and B, share a variable x in memory. What happens if A modifies x while B reads it before the update has become visible?
For a long time, I thought this problem was simply caused by the multi-level cache hierarchy used in modern processors. At least that’s how I remembered learning it during my undergraduate years. Since CPUs communicate with main memory through L1, L2, and L3 caches, I assumed that when A updated x, the modification first lived only in A’s cache, and it naturally took time before the updated value propagated to B’s cache.
It turns out this explanation is incomplete.
Modern processors have long solved this problem through cache coherence protocols. Hardware guarantees that caches remain coherent across CPU cores. For example, under the MESI protocol, every cache line is associated with one of four states:
Suppose A wants to modify x. The cache line containing x becomes Modified in A’s cache, while the corresponding copies in other cores are marked Invalid. Later, when B attempts to read x, the hardware automatically performs the necessary coherence operations so that both cores eventually observe a consistent value, typically transitioning the cache line into the Shared state.
Conceptually, this is somewhat similar to Rust’s borrow checker. At any given time, only one core is allowed to own a writable copy of the data, and before another core can access it, the previous writable copy must be invalidated first [2].
So if hardware already guarantees cache coherence, why do we still need all these complicated concepts like atomic operations, memory barriers, and locks?
The reason is that cache coherence only guarantees that all cores eventually agree on the value stored at a single memory location. It does not guarantee atomicity or ordering, which are separate problems altogether.
Consider the seemingly simple operation:
x++;
On AArch64, it is typically compiled into three instructions:
ldr w1, [x0] // load x
add w1, w1, #1 // w1 = w1 + 1
str w1, [x0] // store x
Cache coherence guarantees that each individual ldr and str observes a coherent view of memory. However, it does not guarantee that the entire read-modify-write sequence executes as a single indivisible operation.
This issue exists regardless of whether the program runs on one core or multiple cores.
For example, suppose x = 0, and two threads both execute x++ simultaneously.
Intuitively, we’d expect the final value of x to be 2.
However, the following execution is perfectly possible:
ldr w1, [x0] // x = 0
ldr w2, [x0] // x = 0
add w1, w1, #1 // w1 = 1
add w2, w2, #1 // w2 = 1
str w1, [x0] // x = 1
str w2, [x0] // x = 1
As a result, one increment is lost, and the final value becomes 1 instead of 2.
This is exactly the problem that atomic operations solve.
For example,
__atomic_fetch_add(&x, 1);
may compile into something like the following on AArch64 [3]:
1:
ldxr w1, [x0] // exclusive load
add w1, w1, #1
stxr w2, w1, [x0] // exclusive store
cbnz w2, 1b // retry if another core modified x
The ldxr instruction establishes an exclusive reservation on the memory location, while stxr succeeds only if no other processor has modified that location in the meantime.
If another core writes to x before stxr executes, the store fails, and the code simply retries until it succeeds.
C11 provides a rich set of atomic operations, including:
atomic_storeatomic_loadatomic_exchangeatomic_compare_exchangeatomic_fetch_add, atomic_fetch_sub, atomic_fetch_or, atomic_fetch_xor, and many others.One question confused me for quite a while.
If load and store are already single machine instructions, why do we even need atomic_load() and atomic_store()?
There are two reasons.
First, the compiler is not required to emit a single load or store instruction for an ordinary memory access. Under the C11 memory model, concurrent accesses to a non-atomic variable constitute a data race, which results in undefined behavior. Once a program has undefined behavior, the compiler is free to perform optimizations that would otherwise seem surprising.
Second, atomic loads and stores are not just about atomicity. They can also carry memory ordering semantics, allowing them to act as synchronization points.
We’ll talk about memory ordering in the next section.
Besides cache coherence and atomicity, another essential ingredient of correct concurrent programs is memory ordering. In other words, operations must occur in an order that preserves the intended semantics of the program.
Unfortunately, modern systems do not execute instructions strictly in the order they appear in the source code.
Instruction reordering comes from two different places:
This means that the order written in your program is not necessarily the order observed by other threads.
Consider the following example [4]:
data = None;
has_data = False;
// Thread 1
write(&data, "hello");
atomic_store(&has_data, True, Relaxed);
// Thread 2
if (atomic_load(&has_data, Relaxed)) {
d = read(&data);
assert(d == "hello");
}
At first glance, this looks perfectly reasonable. Thread 1 writes the data first, then sets has_data to True. Thread 2 waits until has_data becomes True before reading the data.
However, the assertion may still fail.
The reason is that memory_order_relaxed provides atomicity but does not impose any ordering guarantees. Both the compiler and the CPU are free to reorder independent memory operations.
Compilers only preserve ordering when there is an explicit dependency between operations. For example:
int a = 5;
b = a + 1;
Here, b depends on the value of a, so the compiler cannot move the assignment to b before the initialization of a.
In contrast, the following two operations are completely independent:
write(&data, "hello");
atomic_store(&has_data, True, Relaxed);
Since the compiler sees no dependency between them, it is free to reorder them if doing so improves optimization.
The standard solution is to use atomic operations with memory ordering semantics.
data = None;
has_data = False;
// Thread 1
write(&data, "hello");
atomic_store(&has_data, True, Release);
// Thread 2
if (atomic_load(&has_data, Acquire)) {
d = read(&data);
assert(d == "hello");
}
Here, the release store ensures that every memory write before it becomes visible before the store itself.
Conversely, the acquire load guarantees that once it observes True, every memory operation that happened before the corresponding release store is also visible.
As a result, if Thread 2 sees has_data == True, it is guaranteed to read the updated contents of data.
This release-acquire pair forms one of the most fundamental synchronization patterns in concurrent programming.
Although memory ordering is often attached directly to atomic operations, another common approach is to use memory fences explicitly.
C11 provides several fence operations, among which the following three are used most frequently:
atomic_thread_fence(memory_order_acquire);
atomic_thread_fence(memory_order_release);
atomic_thread_fence(memory_order_seq_cst);
These correspond to different classes of ordering guarantees [4].
An acquire fence prevents later memory operations from moving before the fence.
/* load/load ordering */
r1 = x;
fence(acquire);
r2 = y;
The load of x cannot be reordered after the load of y.
Likewise,
flag = ready;
fence(acquire);
result = 1;
the read of ready is guaranteed to happen before the write to result.
In other words, an acquire fence enforces load-load and load-store ordering.
A release fence prevents earlier memory operations from moving past the fence.
/* store/store ordering */
data = value;
fence(release);
ready = 1;
The write to data must become visible before ready is updated.
Similarly,
flag = ready;
fence(release);
result = 1;
the load of ready cannot be reordered after the store to result.
A release fence therefore enforces store-store and load-store ordering.
A sequentially consistent fence, usually abbreviated as seq_cst, is the strongest kind of fence.
x = 1;
fence(seq_cst);
r = y;
It prevents a store from being reordered with a later load and also participates in a single global ordering shared by all sequentially consistent operations.
Although it is the easiest model to reason about, it is also the most restrictive and may incur additional performance costs.
We have already seen that atomic read-modify-write operations are implemented using instructions such as LDXR and STXR.
Standalone memory fences, on the other hand, are typically implemented using ARM’s DMB (Data Memory Barrier) instruction [5].
ARM provides several variants with different scopes and ordering guarantees, including:
DMB ST // Wait only for stores.
DMB LD // Wait only for loads.
DMB ISH // Apply within the inner-shareable domain.
DMB ISHST // Inner-shareable stores only.
DMB ISHLD // Inner-shareable loads only.
These variants allow the hardware to provide exactly the ordering required, without imposing stronger constraints than necessary.
Besides memory barriers, there is another kind of barrier known as a compiler barrier.
A common example is
asm volatile("" ::: "memory");
Unlike a memory fence, a compiler barrier emits no machine instructions.
Its only purpose is to prevent the compiler from reordering memory accesses across that point.
Compiler barriers are particularly useful for benchmarking, where you want to ensure that timestamp instructions are not moved by the optimizer.
They are also useful on single-core systems or in situations where only compiler reordering needs to be prevented.
For example:
while (flag) {
asm volatile("" ::: "memory");
}
Without the compiler barrier, the compiler might assume that flag never changes inside the loop and optimize it into an infinite loop by loading flag only once.
The compiler barrier forces the compiler to reload flag from memory on every iteration.
One thing I found particularly interesting is that memory barriers exist at several different abstraction levels.
| Layer | Example | Purpose |
|---|---|---|
| Language standard | atomic_thread_fence(memory_order_release) |
Portable memory model |
| Compiler built-ins | __atomic_thread_fence(), __sync_synchronize() |
Compiler-specific interfaces |
| Kernel or library | Linux smp_mb(), rmb(), wmb() |
Platform-specific abstractions |
| Inline assembly | asm volatile("dmb ish" ::: "memory") |
Directly emit hardware instructions |
| Hardware instruction | DMB, DSB, ISB |
Actual CPU instructions |
My own rule of thumb is fairly simple.
Whenever possible, prefer standard C/C++ atomic operations with explicit memory ordering. If an explicit fence is needed, use the standard library interfaces and let the compiler choose the appropriate instructions for the target architecture.
Only drop down to inline assembly when you’re writing very low-level code and the language can no longer express the synchronization semantics you need.
Atomic operations guarantee that a single memory location can be updated atomically. However, many real-world operations involve multiple variables and cannot be expressed as a single atomic instruction.
In these cases, we need locks to make a sequence of operations appear atomic.
A classic example is inserting a node into a doubly linked list:
void insert(Node* a, int pos) {
/* search for the insertion point */
Node* curr = root;
for (int i = 0; i != pos; i++) {
if (!curr->next) {
assert("position does not exist");
}
curr = curr->next;
}
Node* next = curr->next;
/* insert a between curr and next */
if (next) {
next->prev = a;
}
curr->next = a;
a->next = next;
a->prev = curr;
}
Neither the traversal nor the insertion can be implemented with a single atomic operation. More importantly, we need to preserve the invariants of the linked list throughout the update.
If another thread observes the list halfway through the insertion, it may see an inconsistent state and corrupt the data structure.
The simplest solution is to protect the entire operation with a single spin lock:
void insert(Node* a, int pos) {
spin_lock(global);
...
spin_unlock(global);
}
void delete(int pos) {
spin_lock(global);
...
spin_unlock(global);
}
void get(int pos) {
spin_lock(global);
...
spin_unlock(global);
}
A spin lock works by repeatedly polling a shared flag.
If the flag is 0, the lock is free.
If it is 1, another thread owns the lock.
A minimal implementation looks like this:
void spin_lock(uint64_t* global) {
while (atomic_cmpxchg(global, 0, 1) != 0)
asm volatile("" ::: "memory");
}
void spin_unlock(uint64_t* global) {
atomic_store(global, 0);
}
Although simple, this design has several major drawbacks.
The first problem is that the lock protects far more code than necessary.
If every linked-list operation acquires one global lock, then all operations become serialized, even when they access completely different parts of the list. This defeats much of the benefit of parallel execution.
A general rule in concurrent programming is that critical sections should be kept as small as possible.
Wikipedia defines a critical section as follows:
The parts of the program where the shared resource is accessed need to be protected in ways that avoid concurrent access.
For a linked list, we usually only need to protect the nodes currently being modified, not the entire list.
One possible implementation is hand-over-hand locking (also known as lock coupling):
lock(curr);
while (i++ < pos) {
lock(curr->next);
unlock(curr);
curr = curr->next;
}
lock(curr->next);
... // insert
unlock(next);
unlock(curr);
Compared with using a single global lock, this approach allows multiple threads to operate on different parts of the list concurrently.
The downside is that the implementation becomes significantly more complicated.
For doubly linked lists in particular, lock acquisition order must be carefully designed. Otherwise, it is easy to introduce deadlocks.
The second problem is choosing the wrong kind of lock.
A spin lock continuously polls the lock variable while waiting for it to become available.
This is perfectly reasonable if the critical section is very short, since the lock is likely to be released before putting the thread to sleep would pay off.
However, if the critical section is long, spinning simply wastes CPU time that could otherwise be used to execute other threads.
In these situations, a mutex is usually the better choice.
Instead of spinning, a mutex blocks the waiting thread.
If the lock is already held, the current thread is placed into a waiting queue and put to sleep. When the lock is released, one waiting thread is removed from the queue and woken up to continue execution [7].
Conceptually, a mutex behaves like this:
void mutex_lock(mutex_t* global) {
while (atomic_cmpxchg(global->guard, 0, 1) != 0)
asm volatile("" ::: "memory");
if (global->lock) {
enqueue(current_thread);
block(current_thread);
global->guard = 0; // must be atomic with enqueue + block
} else {
global->lock = 1;
global->guard = 0;
}
}
void mutex_unlock(mutex_t* global) {
while (atomic_cmpxchg(global->guard, 0, 1) != 0)
asm volatile("" ::: "memory");
if (queue_not_empty) {
thread = dequeue();
wakeup(thread);
} else {
global->lock = 0;
global->guard = 0;
}
}
Recently, I’ve been exposed to many different kinds of locks, including seqlocks, reader-writer locks, hash locks, and many others.
One thing I’ve realized is that choosing the right lock is almost an art in itself. Different workloads have very different access patterns, and the “best” lock often depends entirely on those patterns.
Rather than trying to cover every synchronization primitive here, I’d like to dedicate a future article to discussing them in detail.
The third problem goes beyond the locking algorithm itself.
Even if two locking algorithms have similar theoretical complexity, they can behave very differently on real hardware.
For example, suppose several CPU cores repeatedly spin on the same lock variable.
Every failed attempt to acquire the lock generates coherence traffic, causing the cache line containing the lock to bounce from one core to another.
As contention increases, the processor spends more and more time maintaining cache coherence instead of making useful progress.
This phenomenon is commonly referred to as cache-line bouncing. It is one of the primary reasons why a simple test-and-set spin lock scales poorly on many-core systems.
To address this issue, researchers have developed a family of scalable locks [8].
One classic example is the MCS queue lock, which allows each waiting thread to spin on a private cache line instead of a shared lock variable. This dramatically reduces coherence traffic and enables much better scalability under contention.
I’ve been spending quite a bit of time learning about these scalable locking algorithms recently. They’re fascinating, but they also deserve an article of their own.
At this point, we’ve covered the three building blocks that appear almost everywhere in concurrent programming:
The deeper I dive into concurrent programming, the more I realize how subtle it is. Even concepts that initially seem straightforward often hide surprising complexity once you start looking beneath the surface.
Concurrent programming is hard… but that’s also what makes it so interesting.
[1] Concurrency Wiki: https://en.wikipedia.org/wiki/Concurrency_(computer_science)
[2] Cache coherency primer (2014): https://fgiesen.wordpress.com/2014/07/07/cache-coherency/
[3] Atomics in AArch64 (2021): https://cpufun.substack.com/p/atomics-in-aarch64
[4] Fences are Memory Barriers (2016): https://www.modernescpp.com/index.php/fences-as-memory-barriers/
[5] ARM v8 Manual: https://developer.arm.com/documentation/dui0489/e/arm-and-thumb-instructions/miscellaneous-instructions/dmb–dsb–and-isb
[6] xv6 locking: https://pekopeko11.sakura.ne.jp/unix_v6/xv6-book/en/Locking.html
[7] mutex implementation https://www.cs.princeton.edu/courses/archive/fall08/cos318/lectures/Lec6-MutexImplementation.pdf
[8] Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors https://www.cs.rochester.edu/research/synchronization/pseudocode/ss.html
skewcy@gmail.com