mirror of
https://git.postgresql.org/git/postgresql.git
synced 2024-07-15 21:41:09 +02:00
352 lines
16 KiB
Plaintext
<!-- $Header: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v 1.23 2003/03/24 14:32:51 petere Exp $ -->
|
|
|
|
<chapter id="wal">
|
|
<title>Write-Ahead Logging (<acronym>WAL</acronym>)</title>
|
|
|
|
<para>
|
|
<firstterm>Write-Ahead Logging</firstterm> (<acronym>WAL</acronym>)
|
|
is a standard approach to transaction logging. Its detailed
|
|
description may be found in most (if not all) books about
|
|
transaction processing. Briefly, <acronym>WAL</acronym>'s central
|
|
concept is that changes to data files (where tables and indexes
|
|
reside) must be written only after those changes have been logged,
|
|
that is, when log records have been flushed to permanent
|
|
storage. If we follow this procedure, we do not need to flush
|
|
data pages to disk on every transaction commit, because we know
|
|
that in the event of a crash we will be able to recover the
|
|
database using the log: any changes that have not been applied to
|
|
the data pages will first be redone from the log records (this is
|
|
roll-forward recovery, also known as REDO) and then changes made by
|
|
uncommitted transactions will be removed from the data pages
|
|
(roll-backward recovery, UNDO).
|
|
</para>
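<para>
 The ordering rule described above can be illustrated with a toy model.
 The sketch below is not how <productname>PostgreSQL</productname>
 implements <acronym>WAL</acronym>; the class <literal>ToyWal</literal>
 and all of its method names are hypothetical. It assumes that log
 records reach permanent storage only at commit, so recovery needs only
 the REDO pass. In a real system, log records of uncommitted
 transactions can also reach disk.
</para>

```python
class ToyWal:
    """Toy model of write-ahead logging (hypothetical, for illustration only)."""

    def __init__(self):
        self.durable_log = []   # log records already flushed to permanent storage
        self.pending_log = []   # log records still held in memory buffers
        self.disk_pages = {}    # data pages as they exist on disk
        self.cache_pages = {}   # modified data pages not yet written to disk

    def update(self, page, value):
        # Record the change in the log first; the data page changes only in memory.
        self.pending_log.append((page, value))
        self.cache_pages[page] = value

    def commit(self):
        # Only the log is flushed at commit time; data pages stay in the cache.
        self.durable_log.extend(self.pending_log)
        self.pending_log.clear()

    def crash(self):
        # Anything that was not flushed to permanent storage is lost.
        self.pending_log.clear()
        self.cache_pages.clear()

    def recover(self):
        # REDO: replay every durable log record onto the on-disk pages.
        for page, value in self.durable_log:
            self.disk_pages[page] = value
        return dict(self.disk_pages)


wal = ToyWal()
wal.update("page-1", "row A")
wal.commit()
wal.update("page-2", "row B")   # this change is never committed
wal.crash()
recovered = wal.recover()
```

<para>
 After the simulated crash, <literal>recovered</literal> contains only
 the committed change to <literal>page-1</literal>, even though that
 data page was never written to disk before the crash.
</para>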
|
|
|
|
<sect1 id="wal-benefits-now">
|
|
<title>Benefits of <acronym>WAL</acronym></title>
|
|
|
|
<para>
|
|
The first obvious benefit of using <acronym>WAL</acronym> is a
|
|
significantly reduced number of disk writes, since only the log
|
|
file needs to be flushed to disk at the time of transaction
|
|
commit; in multiuser environments, commits of many transactions
|
|
may be accomplished with a single <function>fsync()</function> of
|
|
the log file. Furthermore, the log file is written sequentially,
|
|
and so the cost of syncing the log is much less than the cost of
|
|
flushing the data pages.
|
|
</para>
|
|
|
|
<para>
|
|
The next benefit is consistency of the data pages. The truth is
|
|
that, before <acronym>WAL</acronym>,
|
|
<productname>PostgreSQL</productname> was never able to guarantee
|
|
consistency in the case of a crash. Before
|
|
<acronym>WAL</acronym>, any crash during writing could result in:
|
|
|
|
<orderedlist>
|
|
<listitem>
|
|
<simpara>index rows pointing to nonexistent table rows</simpara>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<simpara>index rows lost in split operations</simpara>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<simpara>totally corrupted table or index page content, because
|
|
of partially written data pages</simpara>
|
|
</listitem>
|
|
</orderedlist>
|
|
|
|
Problems with indexes (problems 1 and 2) could possibly have been
|
|
fixed by additional <function>fsync()</function> calls, but it is
|
|
not obvious how to handle the last case without
|
|
<acronym>WAL</acronym>; <acronym>WAL</acronym> saves the entire data
|
|
page content in the log if that is required to ensure page
|
|
consistency for after-crash recovery.
|
|
</para>
|
|
</sect1>
|
|
|
|
<sect1 id="wal-benefits-later">
|
|
<title>Future Benefits</title>
|
|
|
|
<para>
|
|
The UNDO operation is not implemented. This means that changes
|
|
made by aborted transactions will still occupy disk space and that
|
|
a permanent <filename>pg_clog</filename> file to hold
|
|
the status of transactions is still needed, since
|
|
transaction identifiers cannot be reused. Once UNDO is implemented,
|
|
<filename>pg_clog</filename> will no longer be required to be
|
|
permanent; it will be possible to remove
|
|
<filename>pg_clog</filename> at shutdown. (However, the urgency of
|
|
this concern has decreased greatly with the adoption of a segmented
|
|
storage method for <filename>pg_clog</filename>: it is no longer
|
|
necessary to keep old <filename>pg_clog</filename> entries around
|
|
forever.)
|
|
</para>
|
|
|
|
<para>
|
|
With UNDO, it will also be possible to implement
|
|
<firstterm>savepoints</firstterm> to allow partial rollback of
|
|
invalid transaction operations (parser errors caused by mistyping
|
|
commands, insertion of duplicate primary/unique keys and so on)
|
|
with the ability to continue or commit valid operations made by
|
|
the transaction before the error. At present, any error will
|
|
invalidate the whole transaction and require a transaction abort.
|
|
</para>
|
|
|
|
<para>
|
|
<acronym>WAL</acronym> offers the opportunity for a new method for
|
|
database on-line backup and restore (<acronym>BAR</acronym>). To
|
|
use this method, one would have to make periodic saves of data
|
|
files to another disk, a tape or another host and also archive the
|
|
<acronym>WAL</acronym> log files. The database file copy and the
|
|
archived log files could be used to restore just as if one were
|
|
restoring after a crash. Each time a new database file copy was
|
|
made the old log files could be removed. Implementing this
|
|
facility will require the logging of data file and index creation
|
|
and deletion; it will also require development of a method for
|
|
copying the data files (operating system copy commands are not
|
|
suitable).
|
|
</para>
|
|
|
|
<para>
|
|
A difficulty standing in the way of realizing these benefits is that
|
|
they require saving <acronym>WAL</acronym> entries for considerable
|
|
periods of time (e.g., as long as the longest possible transaction if
|
|
transaction UNDO is wanted). The present <acronym>WAL</acronym>
|
|
format is extremely bulky since it includes many disk page
|
|
snapshots. This is not a serious concern at present, since the
|
|
entries only need to be kept for one or two checkpoint intervals;
|
|
but to achieve these future benefits some sort of compressed
|
|
<acronym>WAL</acronym> format will be needed.
|
|
</para>
|
|
</sect1>
|
|
|
|
<sect1 id="wal-configuration">
|
|
<title><acronym>WAL</acronym> Configuration</title>
|
|
|
|
<para>
|
|
There are several <acronym>WAL</acronym>-related configuration parameters that
|
|
affect database performance. This section explains their use.
|
|
Consult <xref linkend="runtime-config"> for details about setting
|
|
configuration parameters.
|
|
</para>
|
|
|
|
<para>
|
|
<firstterm>Checkpoints</firstterm> are points in the sequence of
|
|
transactions at which it is guaranteed that the data files have
|
|
been updated with all information logged before the checkpoint. At
|
|
checkpoint time, all dirty data pages are flushed to disk and a
|
|
special checkpoint record is written to the log file. As a result, in
|
|
the event of a crash, the recoverer knows from what record in the
|
|
log (known as the redo record) it should start the REDO operation,
|
|
since any changes made to data files before that record are already
|
|
on disk. After a checkpoint has been made, any log segments written
|
|
before the redo record are no longer needed and can be recycled or
|
|
removed. (When <acronym>WAL</acronym>-based <acronym>BAR</acronym> is
|
|
implemented, the log segments would be archived before being recycled
|
|
or removed.)
|
|
</para>
|
|
|
|
<para>
|
|
The server spawns a special process every so often
|
|
to create the next checkpoint. A checkpoint is created every
|
|
<varname>checkpoint_segments</varname> log segments, or every
|
|
<varname>checkpoint_timeout</varname> seconds, whichever comes first.
|
|
The default settings are 3 segments and 300 seconds respectively.
|
|
It is also possible to force a checkpoint by using the SQL command
|
|
<command>CHECKPOINT</command>.
|
|
</para>
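<para>
 The <quote>whichever comes first</quote> rule can be written as a simple
 predicate. This is a minimal sketch: the function name
 <literal>checkpoint_due</literal> and its first two arguments are
 hypothetical, and the defaults are the ones quoted above.
</para>

```python
def checkpoint_due(segments_since_last, seconds_since_last,
                   checkpoint_segments=3, checkpoint_timeout=300):
    """Return True when either checkpoint trigger described above has fired.

    segments_since_last and seconds_since_last are hypothetical counters
    measured from the previous checkpoint.
    """
    return (segments_since_last >= checkpoint_segments
            or seconds_since_last >= checkpoint_timeout)


busy = checkpoint_due(segments_since_last=3, seconds_since_last=10)
idle = checkpoint_due(segments_since_last=0, seconds_since_last=300)
neither = checkpoint_due(segments_since_last=1, seconds_since_last=299)
```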
|
|
|
|
<para>
|
|
Reducing <varname>checkpoint_segments</varname> and/or
|
|
<varname>checkpoint_timeout</varname> causes checkpoints to be done
|
|
more often. This allows faster after-crash recovery (since less work
|
|
will need to be redone). However, one must balance this against the
|
|
increased cost of flushing dirty data pages more often. In addition,
|
|
to ensure data page consistency, the first modification of a data
|
|
page after each checkpoint results in logging the entire page
|
|
content. Thus a smaller checkpoint interval increases the volume of
|
|
output to the log, partially negating the goal of using a smaller
|
|
interval, and in any case causing more disk I/O.
|
|
</para>
|
|
|
|
<para>
|
|
There will be at least one 16 MB segment file, and there will normally
|
|
not be more than 2 * <varname>checkpoint_segments</varname> + 1
|
|
files. You can use this to estimate space requirements for <acronym>WAL</acronym>.
|
|
Ordinarily, when old log segment files are no longer needed, they
|
|
are recycled (renamed to become the next segments in the numbered
|
|
sequence). If, due to a short-term peak of log output rate, there
|
|
are more than 2 * <varname>checkpoint_segments</varname> + 1
|
|
segment files, the unneeded segment files will be deleted instead
|
|
of recycled until the system gets back under this limit.
|
|
</para>
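<para>
 As a worked example of the estimate above, the following sketch computes
 the normal upper bound on segment files and the disk space they occupy.
 The function names are hypothetical; the formula and the 16 MB segment
 size are taken from the text. Short-term peaks can exceed this bound, as
 noted above.
</para>

```python
SEGMENT_SIZE_MB = 16  # size of one WAL segment file, per the text above


def normal_max_segment_files(checkpoint_segments):
    """Normal upper bound on segment files: 2 * checkpoint_segments + 1."""
    return 2 * checkpoint_segments + 1


def normal_max_wal_space_mb(checkpoint_segments):
    """Disk space in MB occupied by that many 16 MB segment files."""
    return normal_max_segment_files(checkpoint_segments) * SEGMENT_SIZE_MB


files_default = normal_max_segment_files(3)    # default checkpoint_segments
space_default = normal_max_wal_space_mb(3)
```

<para>
 With the default of 3 segments, this gives 7 files, or 112 MB.
</para>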
|
|
|
|
<para>
|
|
There are two commonly used <acronym>WAL</acronym> functions:
|
|
<function>LogInsert</function> and <function>LogFlush</function>.
|
|
<function>LogInsert</function> is used to place a new record into
|
|
the <acronym>WAL</acronym> buffers in shared memory. If there is no
|
|
space for the new record, <function>LogInsert</function> will have
|
|
to write (move to kernel cache) a few filled <acronym>WAL</acronym>
|
|
buffers. This is undesirable because <function>LogInsert</function>
|
|
is used on every database low level modification (for example,
|
|
row insertion) at a time when an exclusive lock is held on
|
|
affected data pages, so the operation needs to be as fast as
|
|
possible. What is worse, writing <acronym>WAL</acronym> buffers may
|
|
also force the creation of a new log segment, which takes even more
|
|
time. Normally, <acronym>WAL</acronym> buffers should be written
|
|
and flushed by a <function>LogFlush</function> request, which is
|
|
made, for the most part, at transaction commit time to ensure that
|
|
transaction records are flushed to permanent storage. On systems
|
|
with high log output, <function>LogFlush</function> requests may
|
|
not occur often enough to prevent <acronym>WAL</acronym> buffers
|
|
being written by <function>LogInsert</function>. On such systems
|
|
one should increase the number of <acronym>WAL</acronym> buffers by
|
|
modifying the configuration parameter <varname>wal_buffers</varname>.
|
|
The default number of <acronym>WAL</acronym> buffers is 8. Increasing this value will
|
|
correspondingly increase shared memory usage.
|
|
</para>
|
|
|
|
<para>
|
|
Checkpoints are fairly expensive because they force all dirty kernel
|
|
buffers to disk using the operating system <literal>sync()</literal> call.
|
|
Busy servers may fill checkpoint segment files too quickly,
|
|
causing excessive checkpointing. If such forced checkpoints happen
|
|
more frequently than <varname>checkpoint_warning</varname> seconds,
|
|
a message will be output to the server logs recommending an increase in
|
|
<varname>checkpoint_segments</varname>.
|
|
</para>
|
|
|
|
<para>
|
|
The <varname>commit_delay</varname> parameter defines for how many
|
|
microseconds the server process will sleep after writing a commit
|
|
record to the log with <function>LogInsert</function> but before
|
|
performing a <function>LogFlush</function>. This delay allows other
|
|
server processes to add their commit records to the log so as to have all
|
|
of them flushed with a single log sync. No sleep will occur if <varname>fsync</varname>
|
|
is not enabled or if fewer than <varname>commit_siblings</varname>
|
|
other sessions are currently in active transactions; this avoids
|
|
sleeping when it's unlikely that any other session will commit soon.
|
|
Note that on most platforms, the resolution of a sleep request is
|
|
ten milliseconds, so that any nonzero <varname>commit_delay</varname>
|
|
setting between 1 and 10000 microseconds would have the same effect.
|
|
Good values for these parameters are not yet clear; experimentation
|
|
is encouraged.
|
|
</para>
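<para>
 The conditions above can be summarized in a small sketch. The function
 <literal>commit_sleep_us</literal> and its arguments are hypothetical
 names, not server code. Rounding requests longer than ten milliseconds
 up to the next multiple of the sleep resolution is an assumption; the
 text above only covers settings between 1 and 10000 microseconds.
</para>

```python
import math

SLEEP_RESOLUTION_US = 10_000  # ten milliseconds, the typical resolution cited above


def commit_sleep_us(commit_delay, fsync, active_siblings, commit_siblings):
    """Estimate the effective sleep before the log flush, in microseconds.

    active_siblings is a hypothetical count of other sessions that are
    currently in active transactions.
    """
    if not fsync or commit_delay <= 0 or active_siblings < commit_siblings:
        return 0  # no sleep occurs in these cases
    # Assumption: the platform rounds a sleep request up to its resolution.
    return math.ceil(commit_delay / SLEEP_RESOLUTION_US) * SLEEP_RESOLUTION_US


short_delay = commit_sleep_us(1, fsync=True, active_siblings=5, commit_siblings=5)
long_delay = commit_sleep_us(10_000, fsync=True, active_siblings=5, commit_siblings=5)
```

<para>
 Under these assumptions, settings of 1 and 10000 microseconds both
 produce the same ten-millisecond sleep.
</para>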
|
|
|
|
<para>
|
|
The <varname>wal_sync_method</varname> parameter determines how
|
|
<productname>PostgreSQL</productname> will ask the kernel to force
|
|
<acronym>WAL</acronym> updates out to disk.
|
|
All the options should be the same as far as reliability goes,
|
|
but it's quite platform-specific which one will be the fastest.
|
|
Note that this parameter is irrelevant if <varname>fsync</varname>
|
|
has been turned off.
|
|
</para>
|
|
|
|
<para>
|
|
Setting the <varname>wal_debug</varname> parameter to any nonzero
|
|
value will result in each <function>LogInsert</function> and
|
|
<function>LogFlush</function> <acronym>WAL</acronym> call being
|
|
logged to the server log. At present, it makes no difference what
|
|
the nonzero value is. This option may be replaced by a more
|
|
general mechanism in the future.
|
|
</para>
|
|
</sect1>
|
|
|
|
<sect1 id="wal-internals">
|
|
<title>Internals</title>
|
|
|
|
<para>
|
|
<acronym>WAL</acronym> is automatically enabled; no action is
|
|
required from the administrator except ensuring that the additional
|
|
disk-space requirements of the <acronym>WAL</acronym> logs are met,
|
|
and that any necessary tuning is done (see <xref
|
|
linkend="wal-configuration">).
|
|
</para>
|
|
|
|
<para>
|
|
<acronym>WAL</acronym> logs are stored in the directory
|
|
<filename>pg_xlog</filename> under the data directory, as a set of
|
|
segment files, each 16 MB in size. Each segment is divided into 8
|
|
kB pages. The log record headers are described in
|
|
<filename>access/xlog.h</filename>; the record content is dependent
|
|
on the type of event that is being logged. Segment files are given
|
|
ever-increasing numbers as names, starting at
|
|
<filename>0000000000000000</filename>. The numbers do not wrap, at
|
|
present, but it should take a very long time to exhaust the
|
|
available stock of numbers.
|
|
</para>
|
|
|
|
<para>
|
|
The <acronym>WAL</acronym> buffers and control structure are in
|
|
shared memory and are handled by the server child processes; they
|
|
are protected by lightweight locks. The demand on shared memory is
|
|
dependent on the number of buffers. The default size of the
|
|
<acronym>WAL</acronym> buffers is 8 buffers of 8 kB each, or 64 kB
|
|
total.
|
|
</para>
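<para>
 The arithmetic above generalizes to other settings of
 <varname>wal_buffers</varname>. The helper below is a hypothetical
 sketch that counts only the buffers themselves, at 8 kB each, and
 ignores the control structure.
</para>

```python
WAL_BUFFER_SIZE_KB = 8  # size of one WAL buffer, per the text above


def wal_buffer_memory_kb(wal_buffers=8):
    """Shared memory taken by the WAL buffers alone, in kB."""
    return wal_buffers * WAL_BUFFER_SIZE_KB


default_kb = wal_buffer_memory_kb()      # 8 buffers
larger_kb = wal_buffer_memory_kb(64)     # a larger, illustrative setting
```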
|
|
|
|
<para>
|
|
It is advantageous if the log is located on a different disk from the
|
|
main database files. This may be achieved by moving the directory
|
|
<filename>pg_xlog</filename> to another location (while the server
|
|
is shut down, of course) and creating a symbolic link from the
|
|
original location in the main data directory to the new location.
|
|
</para>
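<para>
 The relocation described above might be scripted as in the following
 sketch. The function name and its arguments are hypothetical, and the
 sketch does not check that the server is shut down, which remains the
 administrator's responsibility. It assumes a platform that supports
 symbolic links.
</para>

```python
import os
import shutil


def relocate_wal_directory(data_dir, new_parent):
    """Move data_dir/pg_xlog under new_parent and leave a symbolic link behind.

    Run this only while the server is shut down; nothing here verifies that.
    """
    old_path = os.path.join(data_dir, "pg_xlog")
    new_path = os.path.abspath(os.path.join(new_parent, "pg_xlog"))
    if os.path.islink(old_path):
        raise RuntimeError("pg_xlog is already a symbolic link")
    if os.path.exists(new_path):
        raise RuntimeError("target already exists: " + new_path)
    # Move the directory and its segment files to the new location.
    shutil.move(old_path, new_path)
    # Link the original location in the data directory to the new one.
    os.symlink(new_path, old_path, target_is_directory=True)
    return new_path
```

<para>
 Refusing to proceed when the target already exists, or when
 <filename>pg_xlog</filename> is already a link, is a design choice of
 this sketch so that a second run cannot move files into an unexpected
 place.
</para>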
|
|
|
|
<para>
|
|
The aim of <acronym>WAL</acronym>, to ensure that the log is
|
|
written before database records are altered, may be subverted by
|
|
disk drives that falsely report a successful write to the kernel,
|
|
when, in fact, they have only cached the data and not yet stored it
|
|
on the disk. A power failure in such a situation may still lead to
|
|
irrecoverable data corruption. Administrators should try to ensure
|
|
that disks holding <productname>PostgreSQL</productname>'s
|
|
<acronym>WAL</acronym> log files do not make such false reports.
|
|
</para>
|
|
|
|
<para>
|
|
After a checkpoint has been made and the log flushed, the
|
|
checkpoint's position is saved in the file
|
|
<filename>pg_control</filename>. Therefore, when recovery is to be
|
|
done, the server first reads <filename>pg_control</filename> and
|
|
then the checkpoint record; then it performs the REDO operation by
|
|
scanning forward from the log position indicated in the checkpoint
|
|
record. Because the entire content of data pages is saved in the
|
|
log on the first page modification after a checkpoint, all pages
|
|
changed since the checkpoint will be restored to a consistent
|
|
state.
|
|
</para>
|
|
|
|
<para>
|
|
Using <filename>pg_control</filename> to get the checkpoint
|
|
position speeds up the recovery process, but to handle possible
|
|
corruption of <filename>pg_control</filename>, we should actually
|
|
implement the reading of existing log segments in reverse order --
|
|
newest to oldest -- in order to find the last checkpoint. This has
|
|
not been implemented yet.
|
|
</para>
|
|
</sect1>
|
|
</chapter>
|
|
|
|
<!-- Keep this comment at the end of the file
|
|
Local variables:
|
|
mode:sgml
|
|
sgml-omittag:nil
|
|
sgml-shorttag:t
|
|
sgml-minimize-attributes:nil
|
|
sgml-always-quote-attributes:t
|
|
sgml-indent-step:1
|
|
sgml-indent-data:t
|
|
sgml-parent-document:nil
|
|
sgml-default-dtd-file:"./reference.ced"
|
|
sgml-exposed-tags:nil
|
|
sgml-local-catalogs:("/usr/lib/sgml/catalog")
|
|
sgml-local-ecat-files:nil
|
|
End:
|
|
-->
|