Atomic Operations, Memory Barriers, and Locks

Recently, I’ve been working on OS kernel and have learned something about atomic operations, memory barriers, and synchronization primitives. It has been an incredibly rewarding experience.

Before this, my understanding of concurrent programming was almost nonexistent. During my undergraduate studies, I only briefly encountered Linux mutexes and semaphores. Although I wrote quite a bit of real-time networking stacks and device driver code during my Ph.D., my focus was almost entirely on implementing algorithms rather than understanding concurrency itself. I never developed a systematic understanding of when instruction reordering could become a problem or what kinds of locks should be used in different situations. Looking back now, I realize that many of my previous assumptions, and probably quite a bit of the code I wrote, were incorrect.

In this article, I’d like to organize as much as what I’ve learned so far, together with some of my own thoughts. Hopefully, it can serve as a gentle introduction to concurrent programming for others who are just getting started.

Concurrent Programming

Let’s begin with a simple question: what is concurrency? Wikipedia defines it as follows [1]:

In other words, concurrency means multiple tasks making progress either simultaneously or by taking turns sharing the processor.

Core0: A -> B -> A -> B

Core0: A -> A
Core1: B -> B

Once multiple CPU cores are involved, an obvious question arises. Suppose two threads, A and B, share a variable x in memory. What happens if A modifies x while B reads it before the update has become visible?

For a long time, I thought this problem was simply caused by the multi-level cache hierarchy used in modern processors. At least that’s how I remembered learning it during my undergraduate years. Since CPUs communicate with main memory through L1, L2, and L3 caches, I assumed that when A updated x, the modification first lived only in A’s cache, and it naturally took time before the updated value propagated to B’s cache.

Modern processors have long solved this problem through cache coherence protocols. Hardware guarantees that caches remain coherent across CPU cores. For example, under the MESI protocol, every cache line is associated with one of four states:

Suppose A wants to modify x. The cache line containing x becomes Modified in A’s cache, while the corresponding copies in other cores are marked Invalid. Later, when B attempts to read x, the hardware automatically performs the necessary coherence operations so that both cores eventually observe a consistent value, typically transitioning the cache line into the Shared state.

Conceptually, this is somewhat similar to Rust’s borrow checker. At any given time, only one core is allowed to own a writable copy of the data, and before another core can access it, the previous writable copy must be invalidated first [2].

So if hardware already guarantees cache coherence, why do we still need all these complicated concepts like atomic operations, memory barriers, and locks?

The reason is that cache coherence only guarantees that all cores eventually agree on the value stored at a single memory location. It does not guarantee atomicity or ordering, which are separate problems altogether.

Atomic Operations

x++;

Cache coherence guarantees that each individual ldr and str observes a coherent view of memory. However, it does not guarantee that the entire read-modify-write sequence executes as a single indivisible operation.

This issue exists regardless of whether the program runs on one core or multiple cores.

For example, suppose x = 0, and two threads both execute x++ simultaneously.

As a result, one increment is lost, and the final value becomes 1 instead of 2.

__atomic_fetch_add(&x, 1);

The ldxr instruction establishes an exclusive reservation on the memory location, while stxr succeeds only if no other processor has modified that location in the meantime.

If another core writes to x before stxr executes, the store fails, and the code simply retries until it succeeds.

If load and store are already single machine instructions, why do we even need atomic_load() and atomic_store()?

First, the compiler is not required to emit a single load or store instruction for an ordinary memory access. Under the C11 memory model, concurrent accesses to a non-atomic variable constitute a data race, which results in undefined behavior. Once a program has undefined behavior, the compiler is free to perform optimizations that would otherwise seem surprising.

Second, atomic loads and stores are not just about atomicity. They can also carry memory ordering semantics, allowing them to act as synchronization points.

Memory Barriers

Besides cache coherence and atomicity, another essential ingredient of correct concurrent programs is memory ordering. In other words, operations must occur in an order that preserves the intended semantics of the program.

Unfortunately, modern systems do not execute instructions strictly in the order they appear in the source code.

This means that the order written in your program is not necessarily the order observed by other threads.

data = None;
has_data = False;

// Thread 1
write(&data, "hello");
atomic_store(&has_data, True, Relaxed);

// Thread 2
if (atomic_load(&has_data, Relaxed)) {
    d = read(&data);
    assert(d == "hello");
}

At first glance, this looks perfectly reasonable. Thread 1 writes the data first, then sets has_data to True. Thread 2 waits until has_data becomes True before reading the data.

The reason is that memory_order_relaxed provides atomicity but does not impose any ordering guarantees. Both the compiler and the CPU are free to reorder independent memory operations.

Compilers only preserve ordering when there is an explicit dependency between operations. For example:

int a = 5;
b = a + 1;

Here, b depends on the value of a, so the compiler cannot move the assignment to b before the initialization of a.

write(&data, "hello");
atomic_store(&has_data, True, Relaxed);

Since the compiler sees no dependency between them, it is free to reorder them if doing so improves optimization.

Acquire and Release

The standard solution is to use atomic operations with memory ordering semantics.

data = None;
has_data = False;

// Thread 1
write(&data, "hello");
atomic_store(&has_data, True, Release);

// Thread 2
if (atomic_load(&has_data, Acquire)) {
    d = read(&data);
    assert(d == "hello");
}

Here, the release store ensures that every memory write before it becomes visible before the store itself.

Conversely, the acquire load guarantees that once it observes True, every memory operation that happened before the corresponding release store is also visible.

As a result, if Thread 2 sees has_data == True, it is guaranteed to read the updated contents of data.

This release-acquire pair forms one of the most fundamental synchronization patterns in concurrent programming.

Memory Fences

Although memory ordering is often attached directly to atomic operations, another common approach is to use memory fences explicitly.

C11 provides several fence operations, among which the following three are used most frequently:

atomic_thread_fence(memory_order_acquire);
atomic_thread_fence(memory_order_release);
atomic_thread_fence(memory_order_seq_cst);

Acquire Fence

/* load/load ordering */
r1 = x;
fence(acquire);
r2 = y;

flag = ready;
fence(acquire);
result = 1;

In other words, an acquire fence enforces load-load and load-store ordering.

Release Fence

/* store/store ordering */
data = value;
fence(release);
ready = 1;

flag = ready;
fence(release);
result = 1;

Sequentially Consistent Fence

A sequentially consistent fence, usually abbreviated as seq_cst, is the strongest kind of fence.

x = 1;
fence(seq_cst);
r = y;

It prevents a store from being reordered with a later load and also participates in a single global ordering shared by all sequentially consistent operations.

Although it is the easiest model to reason about, it is also the most restrictive and may incur additional performance costs.

How Are Memory Fences Implemented?

We have already seen that atomic read-modify-write operations are implemented using instructions such as LDXR and STXR.

Standalone memory fences, on the other hand, are typically implemented using ARM’s DMB (Data Memory Barrier) instruction [5].

ARM provides several variants with different scopes and ordering guarantees, including:

DMB ST      // Wait only for stores.
DMB LD      // Wait only for loads.
DMB ISH     // Apply within the inner-shareable domain.
DMB ISHST   // Inner-shareable stores only.
DMB ISHLD   // Inner-shareable loads only.

These variants allow the hardware to provide exactly the ordering required, without imposing stronger constraints than necessary.

Compiler Barriers

Besides memory barriers, there is another kind of barrier known as a compiler barrier.

asm volatile("" ::: "memory");

Its only purpose is to prevent the compiler from reordering memory accesses across that point.

Compiler barriers are particularly useful for benchmarking, where you want to ensure that timestamp instructions are not moved by the optimizer.

They are also useful on single-core systems or in situations where only compiler reordering needs to be prevented.

while (flag) {
    asm volatile("" ::: "memory");
}

Without the compiler barrier, the compiler might assume that flag never changes inside the loop and optimize it into an infinite loop by loading flag only once.

The compiler barrier forces the compiler to reload flag from memory on every iteration.

Memory Barriers Exist at Multiple Layers

One thing I found particularly interesting is that memory barriers exist at several different abstraction levels.

Whenever possible, prefer standard C/C++ atomic operations with explicit memory ordering. If an explicit fence is needed, use the standard library interfaces and let the compiler choose the appropriate instructions for the target architecture.

Only drop down to inline assembly when you’re writing very low-level code and the language can no longer express the synchronization semantics you need.

Locks

Layer	Example	Purpose
Language standard	`atomic_thread_fence(memory_order_release)`	Portable memory model
Compiler built-ins	`__atomic_thread_fence()`, `__sync_synchronize()`	Compiler-specific interfaces
Kernel or library	Linux `smp_mb()`, `rmb()`, `wmb()`	Platform-specific abstractions
Inline assembly	`asm volatile("dmb ish" ::: "memory")`	Directly emit hardware instructions
Hardware instruction	`DMB`, `DSB`, `ISB`	Actual CPU instructions

Atomic operations guarantee that a single memory location can be updated atomically. However, many real-world operations involve multiple variables and cannot be expressed as a single atomic instruction.

void insert(Node* a, int pos) {
    /* search for the insertion point */
    Node* curr = root;
    for (int i = 0; i != pos; i++) {
        if (!curr->next) {
            assert("position does not exist");
        }
        curr = curr->next;
    }

    Node* next = curr->next;

    /* insert a between curr and next */
    if (next) {
        next->prev = a;
    }

    curr->next = a;
    a->next = next;
    a->prev = curr;
}

Neither the traversal nor the insertion can be implemented with a single atomic operation. More importantly, we need to preserve the invariants of the linked list throughout the update.

If another thread observes the list halfway through the insertion, it may see an inconsistent state and corrupt the data structure.

The simplest solution is to protect the entire operation with a single spin lock:

void insert(Node* a, int pos) {
    spin_lock(global);
    ...
    spin_unlock(global);
}

void delete(int pos) {
    spin_lock(global);
    ...
    spin_unlock(global);
}

void get(int pos) {
    spin_lock(global);
    ...
    spin_unlock(global);
}

void spin_lock(uint64_t* global) {
    while (atomic_cmpxchg(global, 0, 1) != 0)
        asm volatile("" ::: "memory");
}

void spin_unlock(uint64_t* global) {
    atomic_store(global, 0);
}

Problem 1: The Lock Granularity Is Too Large

If every linked-list operation acquires one global lock, then all operations become serialized, even when they access completely different parts of the list. This defeats much of the benefit of parallel execution.

A general rule in concurrent programming is that critical sections should be kept as small as possible.

The parts of the program where the shared resource is accessed need to be protected in ways that avoid concurrent access.

For a linked list, we usually only need to protect the nodes currently being modified, not the entire list.

One possible implementation is hand-over-hand locking (also known as lock coupling):

lock(curr);

while (i++ < pos) {
    lock(curr->next);
    unlock(curr);
    curr = curr->next;
}

lock(curr->next);
... // insert
unlock(next);
unlock(curr);

Compared with using a single global lock, this approach allows multiple threads to operate on different parts of the list concurrently.

For doubly linked lists in particular, lock acquisition order must be carefully designed. Otherwise, it is easy to introduce deadlocks.

Problem 2: Spin Locks Waste CPU Cycles

A spin lock continuously polls the lock variable while waiting for it to become available.

This is perfectly reasonable if the critical section is very short, since the lock is likely to be released before putting the thread to sleep would pay off.

However, if the critical section is long, spinning simply wastes CPU time that could otherwise be used to execute other threads.

If the lock is already held, the current thread is placed into a waiting queue and put to sleep. When the lock is released, one waiting thread is removed from the queue and woken up to continue execution [7].

void mutex_lock(mutex_t* global) {
    while (atomic_cmpxchg(global->guard, 0, 1) != 0)
        asm volatile("" ::: "memory");

    if (global->lock) {
        enqueue(current_thread);
        block(current_thread);
        global->guard = 0;   // must be atomic with enqueue + block
    } else {
        global->lock = 1;
        global->guard = 0;
    }
}

void mutex_unlock(mutex_t* global) {
    while (atomic_cmpxchg(global->guard, 0, 1) != 0)
        asm volatile("" ::: "memory");

    if (queue_not_empty) {
        thread = dequeue();
        wakeup(thread);
    } else {
        global->lock = 0;
        global->guard = 0;
    }
}

Recently, I’ve been exposed to many different kinds of locks, including seqlocks, reader-writer locks, hash locks, and many others.

One thing I’ve realized is that choosing the right lock is almost an art in itself. Different workloads have very different access patterns, and the “best” lock often depends entirely on those patterns.

Rather than trying to cover every synchronization primitive here, I’d like to dedicate a future article to discussing them in detail.

Problem 3: Hardware Matters Too

Even if two locking algorithms have similar theoretical complexity, they can behave very differently on real hardware.

For example, suppose several CPU cores repeatedly spin on the same lock variable.

Every failed attempt to acquire the lock generates coherence traffic, causing the cache line containing the lock to bounce from one core to another.

As contention increases, the processor spends more and more time maintaining cache coherence instead of making useful progress.

This phenomenon is commonly referred to as cache-line bouncing. It is one of the primary reasons why a simple test-and-set spin lock scales poorly on many-core systems.

To address this issue, researchers have developed a family of scalable locks [8].

One classic example is the MCS queue lock, which allows each waiting thread to spin on a private cache line instead of a shared lock variable. This dramatically reduces coherence traffic and enables much better scalability under contention.

I’ve been spending quite a bit of time learning about these scalable locking algorithms recently. They’re fascinating, but they also deserve an article of their own.

At this point, we’ve covered the three building blocks that appear almost everywhere in concurrent programming:

The deeper I dive into concurrent programming, the more I realize how subtle it is. Even concepts that initially seem straightforward often hide surprising complexity once you start looking beneath the surface.

Concurrent programming is hard… but that’s also what makes it so interesting.

Reference

[1] Concurrency Wiki: https://en.wikipedia.org/wiki/Concurrency_(computer_science)

[2] Cache coherency primer (2014): https://fgiesen.wordpress.com/2014/07/07/cache-coherency/

[4] Fences are Memory Barriers (2016): https://www.modernescpp.com/index.php/fences-as-memory-barriers/

[5] ARM v8 Manual: https://developer.arm.com/documentation/dui0489/e/arm-and-thumb-instructions/miscellaneous-instructions/dmb–dsb–and-isb

[6] xv6 locking: https://pekopeko11.sakura.ne.jp/unix_v6/xv6-book/en/Locking.html

[7] mutex implementation https://www.cs.princeton.edu/courses/archive/fall08/cos318/lectures/Lec6-MutexImplementation.pdf

[8] Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors https://www.cs.rochester.edu/research/synchronization/pseudocode/ss.html