mirror of https://git.postgresql.org/git/postgresql.git
synced 2024-09-30 07:01:17 +02:00
Memory Barriers
===============

Modern CPUs make extensive use of pipe-lining and out-of-order execution,
meaning that the CPU is often executing more than one instruction at a
time, and not necessarily in the order that the source code would suggest.
Furthermore, even before the CPU gets a chance to reorder operations, the
compiler may (and often does) reorganize the code for greater efficiency,
particularly at higher optimization levels. Optimizing compilers and
out-of-order execution are both critical for good performance, but they
can lead to surprising results when multiple processes access the same
memory space.

Example
=======

Suppose x is a pointer to a structure stored in shared memory, and that the
entire structure has been initialized to zero bytes. One backend executes
the following code fragment:

	x->foo = 1;
	x->bar = 1;

Meanwhile, at approximately the same time, another backend executes this
code fragment:

	bar = x->bar;
	foo = x->foo;

The second backend might end up with foo = 1 and bar = 1 (if it executes
both statements after the first backend), or with foo = 0 and bar = 0 (if
it executes both statements before the first backend), or with foo = 1 and
bar = 0 (if the first backend executes the first statement, the second
backend executes both statements, and then the first backend executes the
second statement).

Surprisingly, however, the second backend could also end up with foo = 0
and bar = 1. The compiler might swap the order of the two stores performed
by the first backend, or the two loads performed by the second backend.
Even if it doesn't, on a machine with weak memory ordering (such as PowerPC
or Itanium) the CPU might choose to execute either the loads or the stores
out of order. This surprising result can lead to bugs.
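For readers outside the backend tree, the same discipline can be sketched in
standard C11 with explicit fences. The names here (writer, reader, struct
shared) are illustrative, not PostgreSQL APIs; a full fence between each pair
of accesses rules out the foo = 0, bar = 1 outcome (real cross-process code
would also need atomic accesses on the fields to avoid data races):

```c
#include <assert.h>
#include <stdatomic.h>

struct shared
{
	int		foo;
	int		bar;
};

/* Writer: the fence keeps the store to foo ordered before the store to bar. */
static void
writer(struct shared *x)
{
	x->foo = 1;
	atomic_thread_fence(memory_order_seq_cst);
	x->bar = 1;
}

/*
 * Reader: the fence keeps the load of bar ordered before the load of foo,
 * so observing bar = 1 implies foo = 1.
 */
static void
reader(struct shared *x, int *foo, int *bar)
{
	*bar = x->bar;
	atomic_thread_fence(memory_order_seq_cst);
	*foo = x->foo;
}
```

Single-threaded, of course, the fences change nothing observable; their whole
point is to constrain what other processes may see.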
A common pattern where this actually does result in a bug is when adding items
onto a queue. The writer does this:

	q->items[q->num_items] = new_item;
	++q->num_items;

The reader does this:

	num_items = q->num_items;
	for (i = 0; i < num_items; ++i)
		/* do something with q->items[i] */

This code turns out to be unsafe, because the writer might increment
q->num_items before it finishes storing the new item into the appropriate slot.
More subtly, the reader might prefetch the contents of the q->items array
before reading q->num_items. Thus, there's still a bug here *even if the
writer does everything in the order we expect*. We need the writer to update
the array before bumping the item counter, and the reader to examine the item
counter before examining the array.

Note that these types of highly counterintuitive bugs can *only* occur when
multiple processes are interacting with the same memory segment. A given
process always perceives its *own* writes to memory in program order.

Avoiding Memory Ordering Bugs
=============================

The simplest (and often best) way to avoid memory ordering bugs is to
protect the data structures involved with an lwlock. For more details, see
src/backend/storage/lmgr/README. For instance, in the above example, the
writer could acquire an lwlock in exclusive mode before appending to the
queue, and each reader could acquire the same lock in shared mode before
reading it. If the data structure is not heavily trafficked, this solution is
generally entirely adequate.
However, in some cases, it is desirable to avoid the overhead of acquiring
and releasing locks. In this case, memory barriers may be used to ensure
that the apparent order of execution is as the programmer desires. In
PostgreSQL backend code, the pg_memory_barrier() macro may be used to achieve
this result. In the example above, we can prevent the reader from seeing a
garbage value by having the writer do this:

	q->items[q->num_items] = new_item;
	pg_memory_barrier();
	++q->num_items;

And by having the reader do this:

	num_items = q->num_items;
	pg_memory_barrier();
	for (i = 0; i < num_items; ++i)
		/* do something with q->items[i] */

The pg_memory_barrier() macro will (1) prevent the compiler from rearranging
the code in such a way as to allow the memory accesses to occur out of order
and (2) generate any code (often, inline assembly) that is needed to prevent
the CPU from executing the memory accesses out of order. Specifically, the
barrier prevents loads and stores written after the barrier from being
performed before the barrier, and vice-versa.

Although this code will work, it is needlessly inefficient. On systems with
strong memory ordering (such as x86), the CPU never reorders loads with other
loads, nor stores with other stores. It can, however, allow a load to be
performed before a subsequent store. To avoid emitting unnecessary memory
instructions, we provide two additional primitives: pg_read_barrier() and
pg_write_barrier(). When a memory barrier is being used to separate two
loads, use pg_read_barrier(); when it is separating two stores, use
pg_write_barrier(); when it is separating a load and a store (in either
order), use pg_memory_barrier(). pg_memory_barrier() can always substitute
for either a read or a write barrier, but is typically more expensive, and
therefore should be used only when needed.

With these guidelines in mind, the writer can do this:

	q->items[q->num_items] = new_item;
	pg_write_barrier();
	++q->num_items;

And the reader can do this:

	num_items = q->num_items;
	pg_read_barrier();
	for (i = 0; i < num_items; ++i)
		/* do something with q->items[i] */

On machines with strong memory ordering, these weaker barriers will simply
prevent compiler rearrangement, without emitting any actual machine code.
On machines with weak memory ordering, they will prevent compiler
reordering and also emit whatever hardware barrier may be required. Even
on machines with weak memory ordering, a read or write barrier may be able
to use a less expensive instruction than a full barrier.
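As an illustration for readers outside the backend, here is the same
writer/reader pairing sketched in standard C11, with
atomic_thread_fence(memory_order_release) standing in for pg_write_barrier()
and memory_order_acquire for pg_read_barrier(). The queue layout and names
are illustrative; real cross-process code would also need atomic accesses on
num_items:

```c
#include <assert.h>
#include <stdatomic.h>

#define MAX_ITEMS 8

struct queue
{
	int		items[MAX_ITEMS];
	int		num_items;
};

/* Writer: store the item, then fence, then publish the new count. */
static void
queue_push(struct queue *q, int new_item)
{
	q->items[q->num_items] = new_item;
	atomic_thread_fence(memory_order_release);	/* cf. pg_write_barrier() */
	++q->num_items;
}

/* Reader: read the count, then fence, then examine the array. */
static int
queue_sum(struct queue *q)
{
	int		num_items = q->num_items;
	int		sum = 0;

	atomic_thread_fence(memory_order_acquire);	/* cf. pg_read_barrier() */
	for (int i = 0; i < num_items; ++i)
		sum += q->items[i];
	return sum;
}
```

The release fence guarantees the item's contents are visible before the count
that publishes it; the acquire fence guarantees the reader's array accesses
happen after its read of the count.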
Weaknesses of Memory Barriers
=============================

While memory barriers are a powerful tool, and much cheaper than locks, they
are also much less capable than locks. Here are some of the problems.

1. Concurrent writers are unsafe. In the above example of a queue, using
memory barriers doesn't make it safe for two processes to add items to the
same queue at the same time. If more than one process can write to the queue,
a spinlock or lwlock must be used to synchronize access. The readers can
perhaps proceed without any lock, but the writers may not.

Even very simple write operations often require additional synchronization.
For example, it's not safe for multiple writers to simultaneously execute
this code (supposing x is a pointer into shared memory):

	x->foo++;

Although this may compile down to a single machine-language instruction,
the CPU will execute that instruction by reading the current value of foo,
adding one to it, and then storing the result back to the original address.
If two CPUs try to do this simultaneously, both may do their reads before
either one does their writes. Eventually we might be able to use an atomic
fetch-and-add instruction for this specific case on architectures that support
it, but we can't rely on that being available everywhere, and we currently
have no support for it at all. Use a lock.
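For illustration, standard C11 does expose such a fetch-and-add primitive
(whether the hardware implements it natively is another matter). Each call is
a single indivisible read-modify-write, so concurrent callers cannot lose
updates the way plain x->foo++ on an ordinary int can. This is a sketch of
the standard-library facility, not backend code:

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Atomically add one to *foo and return the previous value.  Two
 * concurrent callers can never both read the same old value and
 * thereby lose an increment.
 */
static unsigned
bump(atomic_uint *foo)
{
	return atomic_fetch_add(foo, 1);
}
```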
2. Eight-byte loads and stores aren't necessarily atomic. We assume in
various places in the source code that an aligned four-byte load or store is
atomic, and that other processes therefore won't see a half-set value.
Sadly, the same can't be said for an eight-byte value: on some platforms, an
aligned eight-byte load or store will generate two four-byte operations. If
you need an atomic eight-byte read or write, you must make it atomic with a
lock.

3. No ordering guarantees. While memory barriers ensure that any given
process performs loads and stores to shared memory in order, they don't
guarantee synchronization. In the queue example above, we can use memory
barriers to be sure that readers won't see garbage, but there's nothing to
say whether a given reader will run before or after a given writer. If this
matters in a given situation, some other mechanism must be used instead of
or in addition to memory barriers.

4. Barrier proliferation. Many algorithms that at first seem appealing
require multiple barriers. If the number of barriers required is more than
one or two, you may be better off just using a lock. Keep in mind that, on
some platforms, a barrier may be implemented by acquiring and releasing a
backend-private spinlock. This may be better than a centralized lock under
contention, but it may also be slower in the uncontended case.

Further Reading
===============

Much of the documentation about memory barriers appears to be quite
Linux-specific. The following papers may be helpful:

Memory Ordering in Modern Microprocessors, by Paul E. McKenney
* http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf

Memory Barriers: a Hardware View for Software Hackers, by Paul E. McKenney
* http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf

The Linux kernel also has some useful documentation on this topic. Start
with Documentation/memory-barriers.txt