Improve the documentation about commit_delay.

Clarify the docs explaining what commit_delay does, and add a
recommendation about a useful value for it, namely half of the single-page
fsync time reported by pg_test_fsync.  This is informed by testing of
the new-in-9.3 implementation of commit_delay; in prior versions it
was far harder to arrive at a useful setting.

In passing, do some wordsmithing and markup-fixing in the same general
area.

Also, change pg_test_fsync's default time-per-test from 2 seconds to 5.
The old value was about the minimum at which the results could be taken
seriously at all, and so seems a tad optimistic as a default.

Peter Geoghegan, reviewed by Noah Misch; some additional editing by me
Tom Lane 2013-03-15 17:41:47 -04:00
parent dcafdbcde1
commit 70ec2f8f43
4 changed files with 118 additions and 70 deletions


@@ -60,7 +60,7 @@ do { \
static const char *progname;
static int secs_per_test = 2;
static int secs_per_test = 5;
static int needs_unlink = 0;
static char full_buf[XLOG_SEG_SIZE],
*buf,


@@ -1603,8 +1603,8 @@ include 'filename'
<title>Write Ahead Log</title>
<para>
See also <xref linkend="wal-configuration"> for details on WAL
and checkpoint tuning.
For additional information on tuning these settings,
see <xref linkend="wal-configuration">.
</para>
<sect2 id="runtime-config-wal-settings">
@@ -1957,7 +1957,7 @@ include 'filename'
given interval. However, it also increases latency by up to
<varname>commit_delay</varname> microseconds for each WAL
flush. Because the delay is just wasted if no other transactions
become ready to commit, it is only performed if at least
become ready to commit, a delay is only performed if at least
<varname>commit_siblings</varname> other transactions are active
immediately before a flush would otherwise have been initiated.
In <productname>PostgreSQL</> releases prior to 9.3,
@@ -1968,7 +1968,8 @@ include 'filename'
the first process that becomes ready to flush waits for the configured
interval, while subsequent processes wait only until the leader
completes the flush. The default <varname>commit_delay</> is zero
(no delay), and only honored if <varname>fsync</varname> is enabled.
(no delay). No delays are performed unless <varname>fsync</varname>
is enabled.
</para>
</listitem>
</varlistentry>
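
For illustration, a minimal Python sketch of the gating rule this entry describes, using hypothetical inputs rather than PostgreSQL's internal code:

    # Illustration only: when a commit_delay sleep is taken, per the rule
    # documented above. The function and its arguments are hypothetical.
    def should_sleep_before_flush(fsync_enabled, commit_delay_us,
                                  active_sibling_transactions, commit_siblings):
        if not fsync_enabled:            # no delays unless fsync is enabled
            return False
        if commit_delay_us <= 0:         # zero (the default) means no delay
            return False
        # The delay is wasted if nobody else is about to commit, so it is
        # only taken when enough other transactions are active.
        return active_sibling_transactions >= commit_siblings

    print(should_sleep_before_flush(True, 1000, 7, 5))    # True
    print(should_sleep_before_flush(True, 1000, 2, 5))    # False
    print(should_sleep_before_flush(False, 1000, 7, 5))   # False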


@@ -36,8 +36,8 @@
difference in real database throughput, especially since many database servers
are not speed-limited by their transaction logs.
<application>pg_test_fsync</application> reports average file sync operation
time in microseconds for each wal_sync_method, which can be used to inform
efforts to optimize the value of <varname>commit_delay</varname>.
time in microseconds for each wal_sync_method, which can also be used to
inform efforts to optimize the value of <xref linkend="guc-commit-delay">.
</para>
</refsect1>
@@ -72,8 +72,8 @@
<para>
Specifies the number of seconds for each test. The more time
per test, the greater the test's accuracy, but the longer it takes
to run. The default is 2 seconds, which allows the program to
complete in about 30 seconds.
to run. The default is 5 seconds, which allows the program to
complete in under 2 minutes.
</para>
</listitem>
</varlistentry>
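
As a rough cross-check of the runtime figures quoted above, the old estimate (about 30 seconds at 2 seconds per test) implies roughly 15 timed operations, which at 5 seconds per test gives about 75 seconds, consistent with "under 2 minutes". A small sketch of that arithmetic (the per-test count is an inference from the quoted estimate, not taken from the program's source):

    # Back-of-the-envelope check; the number of timed operations is
    # inferred from the old estimate, not a measured value.
    old_secs_per_test, old_total_secs = 2, 30
    approx_tests = old_total_secs / old_secs_per_test        # ~15 operations
    new_total_secs = approx_tests * 5                        # ~75 seconds
    print(f"~{approx_tests:.0f} timed operations -> ~{new_total_secs:.0f} s total")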


@@ -133,7 +133,7 @@
(<acronym>BBU</>) disk controllers. In such setups, the synchronize
command forces all data from the controller cache to the disks,
eliminating much of the benefit of the BBU. You can run the
<xref linkend="pgtestfsync"> module to see
<xref linkend="pgtestfsync"> program to see
if you are affected. If you are affected, the performance benefits
of the BBU can be regained by turning off write barriers in
the file system or reconfiguring the disk controller, if that is
@@ -372,11 +372,12 @@
asynchronous commit, but it is actually a synchronous commit method
(in fact, <varname>commit_delay</varname> is ignored during an
asynchronous commit). <varname>commit_delay</varname> causes a delay
just before a synchronous commit attempts to flush
<acronym>WAL</acronym> to disk, in the hope that a single flush
executed by one such transaction can also serve other transactions
committing at about the same time. Setting <varname>commit_delay</varname>
can only help when there are many concurrently committing transactions.
just before a transaction flushes <acronym>WAL</acronym> to disk, in
the hope that a single flush executed by one such transaction can also
serve other transactions committing at about the same time. The
setting can be thought of as a way of increasing the time window in
which transactions can join a group about to participate in a single
flush, to amortize the cost of the flush among multiple transactions.
</para>
</sect1>
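
A worked example of the amortization idea in the new wording (the flush cost and group sizes are assumed numbers, not measurements):

    # Assumed cost: one WAL flush takes 3000 microseconds on a hypothetical
    # disk. Sharing a single flush spreads that cost over the group.
    flush_cost_us = 3000.0
    for group_size in (1, 2, 5, 10):
        print(f"{group_size:>2} transactions per flush -> "
              f"{flush_cost_us / group_size:.0f} us of flush time each")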
@@ -394,15 +395,16 @@
<para>
<firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></>
are points in the sequence of transactions at which it is guaranteed
that the heap and index data files have been updated with all information written before
the checkpoint. At checkpoint time, all dirty data pages are flushed to
disk and a special checkpoint record is written to the log file.
(The changes were previously flushed to the <acronym>WAL</acronym> files.)
that the heap and index data files have been updated with all
information written before that checkpoint. At checkpoint time, all
dirty data pages are flushed to disk and a special checkpoint record is
written to the log file. (The change records were previously flushed
to the <acronym>WAL</acronym> files.)
In the event of a crash, the crash recovery procedure looks at the latest
checkpoint record to determine the point in the log (known as the redo
record) from which it should start the REDO operation. Any changes made to
data files before that point are guaranteed to be already on disk. Hence, after
a checkpoint, log segments preceding the one containing
data files before that point are guaranteed to be already on disk.
Hence, after a checkpoint, log segments preceding the one containing
the redo record are no longer needed and can be recycled or removed. (When
<acronym>WAL</acronym> archiving is being done, the log segments must be
archived before being recycled or removed.)
@@ -411,31 +413,32 @@
<para>
The checkpoint requirement of flushing all dirty data pages to disk
can cause a significant I/O load. For this reason, checkpoint
activity is throttled so I/O begins at checkpoint start and completes
before the next checkpoint starts; this minimizes performance
activity is throttled so that I/O begins at checkpoint start and completes
before the next checkpoint is due to start; this minimizes performance
degradation during checkpoints.
</para>
<para>
The server's checkpointer process automatically performs
a checkpoint every so often. A checkpoint is created every <xref
a checkpoint every so often. A checkpoint is begun every <xref
linkend="guc-checkpoint-segments"> log segments, or every <xref
linkend="guc-checkpoint-timeout"> seconds, whichever comes first.
The default settings are 3 segments and 300 seconds (5 minutes), respectively.
In cases where no WAL has been written since the previous checkpoint, new
checkpoints will be skipped even if checkpoint_timeout has passed.
If WAL archiving is being used and you want to put a lower limit on
how often files are archived in order to bound potential data
loss, you should adjust archive_timeout parameter rather than the checkpoint
parameters. It is also possible to force a checkpoint by using the SQL
If no WAL has been written since the previous checkpoint, new checkpoints
will be skipped even if <varname>checkpoint_timeout</> has passed.
(If WAL archiving is being used and you want to put a lower limit on how
often files are archived in order to bound potential data loss, you should
adjust the <xref linkend="guc-archive-timeout"> parameter rather than the
checkpoint parameters.)
It is also possible to force a checkpoint by using the SQL
command <command>CHECKPOINT</command>.
</para>
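
A sketch of the triggering rule described in this paragraph, using the default values mentioned in the text (illustrative only, not the server's actual scheduling code):

    # Illustration of the documented rule: a checkpoint begins after
    # checkpoint_segments segments or checkpoint_timeout seconds, whichever
    # comes first, and is skipped if no WAL has been written since the last one.
    def checkpoint_due(segments_since_last, seconds_since_last,
                       wal_written_since_last,
                       checkpoint_segments=3, checkpoint_timeout=300.0):
        if not wal_written_since_last:
            return False
        return (segments_since_last >= checkpoint_segments
                or seconds_since_last >= checkpoint_timeout)

    print(checkpoint_due(3, 120.0, True))    # True: segment limit hit first
    print(checkpoint_due(1, 301.0, True))    # True: timeout hit first
    print(checkpoint_due(0, 400.0, False))   # False: no WAL written, skipped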
<para>
Reducing <varname>checkpoint_segments</varname> and/or
<varname>checkpoint_timeout</varname> causes checkpoints to occur
more often. This allows faster after-crash recovery (since less work
will need to be redone). However, one must balance this against the
more often. This allows faster after-crash recovery, since less work
will need to be redone. However, one must balance this against the
increased cost of flushing dirty data pages more often. If
<xref linkend="guc-full-page-writes"> is set (as is the default), there is
another factor to consider. To ensure data page consistency,
@@ -450,7 +453,7 @@
Checkpoints are fairly expensive, first because they require writing
out all currently dirty buffers, and second because they result in
extra subsequent WAL traffic as discussed above. It is therefore
wise to set the checkpointing parameters high enough that checkpoints
wise to set the checkpointing parameters high enough so that checkpoints
don't happen too often. As a simple sanity check on your checkpointing
parameters, you can set the <xref linkend="guc-checkpoint-warning">
parameter. If checkpoints happen closer together than
@@ -498,7 +501,7 @@
altered when building the server). You can use this to estimate space
requirements for <acronym>WAL</acronym>.
Ordinarily, when old log segment files are no longer needed, they
are recycled (renamed to become the next segments in the numbered
are recycled (that is, renamed to become future segments in the numbered
sequence). If, due to a short-term peak of log output rate, there
are more than 3 * <varname>checkpoint_segments</varname> + 1
segment files, the unneeded segment files will be deleted instead
@@ -507,64 +510,108 @@
<para>
In archive recovery or standby mode, the server periodically performs
<firstterm>restartpoints</><indexterm><primary>restartpoint</></>
<firstterm>restartpoints</>,<indexterm><primary>restartpoint</></>
which are similar to checkpoints in normal operation: the server forces
all its state to disk, updates the <filename>pg_control</> file to
indicate that the already-processed WAL data need not be scanned again,
and then recycles any old log segment files in <filename>pg_xlog</>
directory. A restartpoint is triggered if at least one checkpoint record
has been replayed and <varname>checkpoint_timeout</> seconds have passed
since last restartpoint. In standby mode, a restartpoint is also triggered
if <varname>checkpoint_segments</> log segments have been replayed since
last restartpoint and at least one checkpoint record has been replayed.
and then recycles any old log segment files in the <filename>pg_xlog</>
directory.
Restartpoints can't be performed more frequently than checkpoints in the
master because restartpoints can only be performed at checkpoint records.
A restartpoint is triggered when a checkpoint record is reached if at
least <varname>checkpoint_timeout</> seconds have passed since the last
restartpoint. In standby mode, a restartpoint is also triggered if at
least <varname>checkpoint_segments</> log segments have been replayed
since the last restartpoint.
</para>
<para>
There are two commonly used internal <acronym>WAL</acronym> functions:
<function>LogInsert</function> and <function>LogFlush</function>.
<function>LogInsert</function> is used to place a new record into
<function>XLogInsert</function> and <function>XLogFlush</function>.
<function>XLogInsert</function> is used to place a new record into
the <acronym>WAL</acronym> buffers in shared memory. If there is no
space for the new record, <function>LogInsert</function> will have
space for the new record, <function>XLogInsert</function> will have
to write (move to kernel cache) a few filled <acronym>WAL</acronym>
buffers. This is undesirable because <function>LogInsert</function>
buffers. This is undesirable because <function>XLogInsert</function>
is used on every database low level modification (for example, row
insertion) at a time when an exclusive lock is held on affected
data pages, so the operation needs to be as fast as possible. What
is worse, writing <acronym>WAL</acronym> buffers might also force the
creation of a new log segment, which takes even more
time. Normally, <acronym>WAL</acronym> buffers should be written
and flushed by a <function>LogFlush</function> request, which is
and flushed by an <function>XLogFlush</function> request, which is
made, for the most part, at transaction commit time to ensure that
transaction records are flushed to permanent storage. On systems
with high log output, <function>LogFlush</function> requests might
not occur often enough to prevent <function>LogInsert</function>
with high log output, <function>XLogFlush</function> requests might
not occur often enough to prevent <function>XLogInsert</function>
from having to do writes. On such systems
one should increase the number of <acronym>WAL</acronym> buffers by
modifying the configuration parameter <xref
linkend="guc-wal-buffers">. When
modifying the <xref linkend="guc-wal-buffers"> parameter. When
<xref linkend="guc-full-page-writes"> is set and the system is very busy,
setting this value higher will help smooth response times during the
period immediately following each checkpoint.
setting <varname>wal_buffers</> higher will help smooth response times
during the period immediately following each checkpoint.
</para>
<para>
The <xref linkend="guc-commit-delay"> parameter defines for how many
microseconds the server process will sleep after writing a commit
record to the log with <function>LogInsert</function> but before
performing a <function>LogFlush</function>. This delay allows other
server processes to add their commit records to the log so as to have all
of them flushed with a single log sync. No sleep will occur if
<xref linkend="guc-fsync">
is not enabled, or if fewer than <xref linkend="guc-commit-siblings">
other sessions are currently in active transactions; this avoids
sleeping when it's unlikely that any other session will commit soon.
Note that on most platforms, the resolution of a sleep request is
ten milliseconds, so that any nonzero <varname>commit_delay</varname>
setting between 1 and 10000 microseconds would have the same effect.
Good values for these parameters are not yet clear; experimentation
is encouraged.
microseconds a group commit leader process will sleep after acquiring a
lock within <function>XLogFlush</function>, while group commit
followers queue up behind the leader. This delay allows other server
processes to add their commit records to the WAL buffers so that all of
them will be flushed by the leader's eventual sync operation. No sleep
will occur if <xref linkend="guc-fsync"> is not enabled, or if fewer
than <xref linkend="guc-commit-siblings"> other sessions are currently
in active transactions; this avoids sleeping when it's unlikely that
any other session will commit soon. Note that on some platforms, the
resolution of a sleep request is ten milliseconds, so that any nonzero
<varname>commit_delay</varname> setting between 1 and 10000
microseconds would have the same effect. Note also that on some
platforms, sleep operations may take slightly longer than requested by
the parameter.
</para>
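
To make the resolution point concrete, the effect of a coarse sleep granularity can be sketched as rounding the request up to the next timer tick (the 10 ms figure is the one cited above; actual platform behavior varies):

    import math

    # On a platform whose sleep granularity is 10 ms (10000 us), any nonzero
    # request up to 10000 us is rounded up to one full tick.
    def effective_sleep_us(requested_us, granularity_us=10000):
        if requested_us <= 0:
            return 0
        return math.ceil(requested_us / granularity_us) * granularity_us

    for requested in (1, 500, 9999, 10000, 10001):
        print(f"requested {requested:>5} us -> sleeps ~{effective_sleep_us(requested)} us")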
<para>
Since the purpose of <varname>commit_delay</varname> is to allow the
cost of each flush operation to be amortized across concurrently
committing transactions (potentially at the expense of transaction
latency), it is necessary to quantify that cost before the setting can
be chosen intelligently. The higher that cost is, the more effective
<varname>commit_delay</varname> is expected to be in increasing
transaction throughput, up to a point. The <xref
linkend="pgtestfsync"> program can be used to measure the average time
in microseconds that a single WAL flush operation takes. A value of
half of the average time the program reports it takes to flush after a
single 8kB write operation is often the most effective setting for
<varname>commit_delay</varname>, so this value is recommended as the
starting point to use when optimizing for a particular workload. While
tuning <varname>commit_delay</varname> is particularly useful when the
WAL log is stored on high-latency rotating disks, benefits can be
significant even on storage media with very fast sync times, such as
solid-state drives or RAID arrays with a battery-backed write cache;
but this should definitely be tested against a representative workload.
Higher values of <varname>commit_siblings</varname> should be used in
such cases, whereas smaller <varname>commit_siblings</varname> values
are often helpful on higher latency media. Note that it is quite
possible that a setting of <varname>commit_delay</varname> that is too
high can increase transaction latency by so much that total transaction
throughput suffers.
</para>
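
A worked example of the recommended starting point (the flushes-per-second figure is a made-up pg_test_fsync result used only to show the arithmetic):

    # Hypothetical pg_test_fsync result for a single 8kB write with the
    # chosen wal_sync_method: 250 flushes per second.
    flushes_per_second = 250.0
    avg_flush_time_us = 1_000_000 / flushes_per_second     # 4000 us per flush
    suggested_commit_delay = round(avg_flush_time_us / 2)  # half the flush time
    print(f"average flush {avg_flush_time_us:.0f} us -> "
          f"commit_delay starting point {suggested_commit_delay} us")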
<para>
When <varname>commit_delay</varname> is set to zero (the default), it
is still possible for a form of group commit to occur, but each group
will consist only of sessions that reach the point where they need to
flush their commit records during the window in which the previous
flush operation (if any) is occurring. At higher client counts a
<quote>gangway effect</> tends to occur, so that the effects of group
commit become significant even when <varname>commit_delay</varname> is
zero, and thus explicitly setting <varname>commit_delay</varname> tends
to help less. Setting <varname>commit_delay</varname> can only help
when (1) there are some concurrently committing transactions, and (2)
throughput is limited to some degree by commit rate; but with high
rotational latency this setting can be effective in increasing
transaction throughput with as few as two clients (that is, a single
committing client with one sibling transaction).
</para>
<para>
@@ -574,9 +621,9 @@
All the options should be the same in terms of reliability, with
the exception of <literal>fsync_writethrough</>, which can sometimes
force a flush of the disk cache even when other options do not do so.
However, it's quite platform-specific which one will be the fastest;
you can test option speeds using the <xref
linkend="pgtestfsync"> module.
However, it's quite platform-specific which one will be the fastest.
You can test the speeds of different options using the <xref
linkend="pgtestfsync"> program.
Note that this parameter is irrelevant if <varname>fsync</varname>
has been turned off.
</para>
@@ -585,7 +632,7 @@
Enabling the <xref linkend="guc-wal-debug"> configuration parameter
(provided that <productname>PostgreSQL</productname> has been
compiled with support for it) will result in each
<function>LogInsert</function> and <function>LogFlush</function>
<function>XLogInsert</function> and <function>XLogFlush</function>
<acronym>WAL</acronym> call being logged to the server log. This
option might be replaced by a more general mechanism in the future.
</para>