Memory Barriers
===============

Modern CPUs make extensive use of pipe-lining and out-of-order execution,
meaning that the CPU is often executing more than one instruction at a
time, and not necessarily in the order that the source code would suggest.
Furthermore, even before the CPU gets a chance to reorder operations, the
compiler may (and often does) reorganize the code for greater efficiency,
particularly at higher optimization levels. Optimizing compilers and
out-of-order execution are both critical for good performance, but they
can lead to surprising results when multiple processes access the same
memory space.

Example
=======

Suppose x is a pointer to a structure stored in shared memory, and that the
entire structure has been initialized to zero bytes. One backend executes
the following code fragment:

    x->foo = 1;
    x->bar = 1;

Meanwhile, at approximately the same time, another backend executes this
code fragment:

    bar = x->bar;
    foo = x->foo;

The second backend might end up with foo = 1 and bar = 1 (if it executes
both statements after the first backend), or with foo = 0 and bar = 0 (if
it executes both statements before the first backend), or with foo = 1 and
bar = 0 (if the first backend executes the first statement, the second
backend executes both statements, and then the first backend executes the
second statement).

Surprisingly, however, the second backend could also end up with foo = 0
and bar = 1. The compiler might swap the order of the two stores performed
by the first backend, or the two loads performed by the second backend.
Even if it doesn't, on a machine with weak memory ordering (such as PowerPC
or Itanium) the CPU might choose to execute either the loads or the stores
out of order. This surprising result can lead to bugs.

A common pattern where this actually does result in a bug is when adding
items onto a queue.
The writer does this:

    q->items[q->num_items] = new_item;
    ++q->num_items;

The reader does this:

    num_items = q->num_items;
    for (i = 0; i < num_items; ++i)
        /* do something with q->items[i] */

This code turns out to be unsafe, because the writer might increment
q->num_items before it finishes storing the new item into the appropriate
slot. More subtly, the reader might prefetch the contents of the q->items
array before reading q->num_items. Thus, there's still a bug here *even if
the writer does everything in the order we expect*. We need the writer to
update the array before bumping the item counter, and the reader to examine
the item counter before examining the array.

Note that these types of highly counterintuitive bugs can *only* occur when
multiple processes are interacting with the same memory segment. A given
process always perceives its *own* writes to memory in program order.

Avoiding Memory Ordering Bugs
=============================

The simplest (and often best) way to avoid memory ordering bugs is to
protect the data structures involved with an lwlock. For more details, see
src/backend/storage/lmgr/README. For instance, in the above example, the
writer could acquire an lwlock in exclusive mode before appending to the
queue, and each reader could acquire the same lock in shared mode before
reading it. If the data structure is not heavily trafficked, this solution
is generally entirely adequate.

However, in some cases, it is desirable to avoid the overhead of acquiring
and releasing locks. In this case, memory barriers may be used to ensure
that the apparent order of execution is as the programmer desires. In
PostgreSQL backend code, the pg_memory_barrier() macro may be used to
achieve this result.
In the example above, we can prevent the reader from seeing a garbage value
by having the writer do this:

    q->items[q->num_items] = new_item;
    pg_memory_barrier();
    ++q->num_items;

And by having the reader do this:

    num_items = q->num_items;
    pg_memory_barrier();
    for (i = 0; i < num_items; ++i)
        /* do something with q->items[i] */

The pg_memory_barrier() macro will (1) prevent the compiler from rearranging
the code in such a way as to allow the memory accesses to occur out of order
and (2) generate any code (often, inline assembly) that is needed to prevent
the CPU from executing the memory accesses out of order. Specifically, the
barrier prevents loads and stores written after the barrier from being
performed before the barrier, and vice-versa.

Although this code will work, it is needlessly inefficient. On systems with
strong memory ordering (such as x86), the CPU never reorders loads with
other loads, nor stores with other stores. It can, however, allow a load to
be performed ahead of a store that precedes it in program order. To avoid
emitting unnecessary memory instructions, we provide two additional
primitives: pg_read_barrier() and pg_write_barrier(). When a memory barrier
is being used to separate two loads, use pg_read_barrier(); when it is
separating two stores, use pg_write_barrier(); when it is separating a load
and a store (in either order), use pg_memory_barrier().
pg_memory_barrier() can always substitute for either a read or a write
barrier, but is typically more expensive, and therefore should be used only
when needed.

With these guidelines in mind, the writer can do this:

    q->items[q->num_items] = new_item;
    pg_write_barrier();
    ++q->num_items;

And the reader can do this:

    num_items = q->num_items;
    pg_read_barrier();
    for (i = 0; i < num_items; ++i)
        /* do something with q->items[i] */

On machines with strong memory ordering, these weaker barriers will simply
prevent compiler rearrangement, without emitting any actual machine code.
On machines with weak memory ordering, they will prevent compiler
reordering and also emit whatever hardware barrier may be required. Even
on machines with weak memory ordering, a read or write barrier may be able
to use a less expensive instruction than a full barrier.

Weaknesses of Memory Barriers
=============================

While memory barriers are a powerful tool, and much cheaper than locks, they
are also much less capable than locks. Here are some of the problems.

1. Concurrent writers are unsafe. In the above example of a queue, using
memory barriers doesn't make it safe for two processes to add items to the
same queue at the same time. If more than one process can write to the
queue, a spinlock or lwlock must be used to synchronize access. The readers
can perhaps proceed without any lock, but the writers may not.

Even very simple write operations often require additional synchronization.
For example, it's not safe for multiple writers to simultaneously execute
this code (supposing x is a pointer into shared memory):

    x->foo++;

Although this may compile down to a single machine-language instruction,
the CPU will execute that instruction by reading the current value of foo,
adding one to it, and then storing the result back to the original address.
If two CPUs try to do this simultaneously, both may do their reads before
either one does its write, so that one of the two increments is lost.
Eventually we might be able to use an atomic fetch-and-add instruction for
this specific case on architectures that support it, but we can't rely on
that being available everywhere, and we currently have no support for it at
all. Use a lock.

2. Eight-byte loads and stores aren't necessarily atomic. We assume in
various places in the source code that an aligned four-byte load or store is
atomic, and that other processes therefore won't see a half-set value.
Sadly, the same can't be said for eight-byte values: on some platforms, an
aligned eight-byte load or store will generate two four-byte operations.
If you need an atomic eight-byte read or write, you must make it atomic with
a lock.

3. No ordering guarantees. While memory barriers ensure that any given
process performs loads and stores to shared memory in order, they don't
guarantee synchronization. In the queue example above, we can use memory
barriers to be sure that readers won't see garbage, but there's nothing to
say whether a given reader will run before or after a given writer. If this
matters in a given situation, some other mechanism must be used instead of
or in addition to memory barriers.

4. Barrier proliferation. Many algorithms that at first seem appealing
require multiple barriers. If the number of barriers required is more than
one or two, you may be better off just using a lock. Keep in mind that, on
some platforms, a barrier may be implemented by acquiring and releasing a
backend-private spinlock. This may be better than a centralized lock under
contention, but it may also be slower in the uncontended case.

Further Reading
===============

Much of the documentation about memory barriers appears to be quite
Linux-specific. The following papers may be helpful:

Memory Ordering in Modern Microprocessors, by Paul E. McKenney
* http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf

Memory Barriers: a Hardware View for Software Hackers, by Paul E. McKenney
* http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf

The Linux kernel also has some useful documentation on this topic. Start
with Documentation/memory-barriers.txt
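
For readers who want to experiment with the single-writer queue pattern
from "Avoiding Memory Ordering Bugs" outside the PostgreSQL tree, the
following is a minimal sketch in standard C11, assuming <stdatomic.h> is
available. DemoQueue, demo_queue_init(), demo_queue_append() and
demo_queue_sum() are hypothetical names invented for this illustration.
The C11 release and acquire fences only stand in for pg_write_barrier() and
pg_read_barrier(); they are not how those macros are implemented, and they
are somewhat stronger than pure store-store and load-load barriers. The
counter is declared atomic and accessed with relaxed ordering because the
C11 memory model requires that for the fences to take effect.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_QUEUE_CAPACITY 16      /* hypothetical fixed capacity */

/* Hypothetical stand-in for the queue "q" used in the examples above. */
typedef struct DemoQueue
{
    atomic_int  num_items;          /* published item count */
    int         items[DEMO_QUEUE_CAPACITY];
} DemoQueue;

void
demo_queue_init(DemoQueue *q)
{
    assert(q != NULL);
    atomic_init(&q->num_items, 0);
}

/*
 * Append one item.  Safe only when there is exactly ONE writer; see
 * "Concurrent writers are unsafe" above.  Returns 1 on success, 0 if full.
 */
int
demo_queue_append(DemoQueue *q, int new_item)
{
    int         n;

    assert(q != NULL);
    n = atomic_load_explicit(&q->num_items, memory_order_relaxed);
    if (n >= DEMO_QUEUE_CAPACITY)
        return 0;

    q->items[n] = new_item;

    /* Assumed analog of pg_write_barrier(): item store before count store. */
    atomic_thread_fence(memory_order_release);

    atomic_store_explicit(&q->num_items, n + 1, memory_order_relaxed);
    return 1;
}

/* Read the count first, then sum only the items that were published. */
long
demo_queue_sum(DemoQueue *q)
{
    int         num_items;
    long        sum = 0;

    assert(q != NULL);
    num_items = atomic_load_explicit(&q->num_items, memory_order_relaxed);

    /* Assumed analog of pg_read_barrier(): count load before item loads. */
    atomic_thread_fence(memory_order_acquire);

    for (int i = 0; i < num_items; ++i)
        sum += q->items[i];
    return sum;
}
```

As in the README's own example, the fences only keep a reader from seeing
an unpublished slot. They say nothing about whether a reader runs before or
after a given append, and they do not make multiple concurrent writers safe.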