416 lines
19 KiB
Plaintext
416 lines
19 KiB
Plaintext
<!-- $PostgreSQL: pgsql/doc/src/sgml/wal.sgml,v 1.36 2005/10/13 17:32:42 momjian Exp $ -->
|
|
|
|
<chapter id="reliability">
|
|
<title>Reliability</title>
|
|
|
|
<para>
|
|
Reliability is a major feature of any serious database system, and
|
|
<productname>PostgreSQL</> does everything possible to guarantee
|
|
reliable operation. One aspect of reliable operation is that all data
|
|
recorded by a transaction should be stored in a non-volatile area
|
|
that is safe from power loss, operating system failure, and hardware
|
|
failure (unrelated to the non-volatile area itself). To accomplish
|
|
this, <productname>PostgreSQL</> uses the magnetic platters of modern
|
|
disk drives for permanent storage that is immune to the failures
|
|
listed above. In fact, even if a computer is fatally damaged, if
|
|
the disk drives survive they can be moved to another computer with
|
|
similar hardware and all committed transactions will remain intact.
|
|
</para>
|
|
|
|
<para>
|
|
While forcing data periodically to the disk platters might seem like
|
|
a simple operation, it is not. Because disk drives are dramatically
|
|
slower than main memory and CPUs, several layers of caching exist
|
|
between the computer's main memory and the disk drive platters.
|
|
First, there is the operating system kernel cache, which caches
|
|
frequently requested disk blocks and delays disk writes. Fortunately,
|
|
all operating systems give applications a way to force writes from
|
|
the kernel cache to disk, and <productname>PostgreSQL</> uses those
|
|
features. In fact, the <xref linkend="guc-wal-sync-method"> parameter
|
|
controls how this is done.
|
|
</para>
|
|
<para>
|
|
Secondly, there is an optional disk drive controller cache,
|
|
particularly popular on <acronym>RAID</> controller cards. Some of
|
|
these caches are <literal>write-through</>, meaning writes are passed
|
|
along to the drive as soon as they arrive. Others are
|
|
<literal>write-back</>, meaning data is passed on to the drive at
|
|
some later time. Such caches can be a reliability problem because the
|
|
disk controller card cache is volatile, unlike the disk driver
|
|
platters, unless the disk drive controller has a battery-backed
|
|
cache, meaning the card has a battery that maintains power to the
|
|
cache in case of server power loss. When the disk drives are later
|
|
accessible, the data is written to the drives.
|
|
</para>
|
|
|
|
<para>
|
|
And finally, most disk drives have caches. Some are write-through
|
|
(typically SCSI), and some are write-back(typically IDE), and the
|
|
same concerns about data loss exist for write-back drive caches as
|
|
exist for disk controller caches. To have reliability, all
|
|
storage subsystems must be reliable in their storage characteristics.
|
|
When the operating system sends a write request to the drive platters,
|
|
there is little it can do to make sure the data has arrived at a
|
|
non-volatile store area on the system. Rather, it is the
|
|
administrator's responsibility to be sure that all storage components
|
|
have reliable characteristics.
|
|
</para>
|
|
|
|
<para>
|
|
One other area of potential data loss are the disk platter writes
|
|
themselves. Disk platters are internally made up of 512-byte sectors.
|
|
When a write request arrives at the drive, it might be for 512 bytes,
|
|
1024 bytes, or 8192 bytes, and the process of writing could fail due
|
|
to power loss at any time, meaning some of the 512-byte sectors were
|
|
written, and others were not, or the first half of a 512-byte sector
|
|
has new data, and the remainder has the original data. Obviously, on
|
|
startup, <productname>PostgreSQL</> would not be able to deal with
|
|
these partially written cases. To guard against that,
|
|
<productname>PostgreSQL</> periodically writes full page images to
|
|
permanent storage <emphasis>before</> modifying the actual page on
|
|
disk. By doing this, during crash recovery <productname>PostgreSQL</> can
|
|
restore partially-written pages. If you have a battery-backed disk
|
|
controller or filesystem (e.g. Reiser4) that prevents partial page writes,
|
|
you can turn off this page imaging by using the
|
|
<xref linkend="guc-full-page-writes"> parameter. This parameter has no
|
|
effect on the successful use of Point in Time Recovery (PITR),
|
|
described in <xref linkend="backup-online">.
|
|
</para>
|
|
|
|
<para>
|
|
The following sections explain how the Write-Ahead Log is used to
|
|
obtain efficient, reliable operation.
|
|
</para>
|
|
|
|
<sect1 id="wal">
|
|
<title>Write-Ahead Logging (<acronym>WAL</acronym>)</title>
|
|
|
|
<indexterm zone="wal">
|
|
<primary>WAL</primary>
|
|
</indexterm>
|
|
|
|
<indexterm>
|
|
<primary>transaction log</primary>
|
|
<see>WAL</see>
|
|
</indexterm>
|
|
|
|
<para>
|
|
<firstterm>Write-Ahead Logging</firstterm> (<acronym>WAL</acronym>)
|
|
is a standard approach to transaction logging. Its detailed
|
|
description may be found in most (if not all) books about
|
|
transaction processing. Briefly, <acronym>WAL</acronym>'s central
|
|
concept is that changes to data files (where tables and indexes
|
|
reside) must be written only after those changes have been logged,
|
|
that is, when log records describing the changes have been flushed
|
|
to permanent storage. If we follow this procedure, we do not need
|
|
to flush data pages to disk on every transaction commit, because we
|
|
know that in the event of a crash we will be able to recover the
|
|
database using the log: any changes that have not been applied to
|
|
the data pages can be redone from the log records. (This is
|
|
roll-forward recovery, also known as REDO.)
|
|
</para>
|
|
|
|
<para>
|
|
WAL brings three major benefits:
|
|
</para>
|
|
|
|
<para>
|
|
The first major benefit of using <acronym>WAL</acronym> is a
|
|
significantly reduced number of disk writes, because only the log
|
|
file needs to be flushed to disk at the time of transaction
|
|
commit, rather than every data file changed by the transaction.
|
|
In multiuser environments, commits of many transactions
|
|
may be accomplished with a single <function>fsync</function> of
|
|
the log file. Furthermore, the log file is written sequentially,
|
|
and so the cost of syncing the log is much less than the cost of
|
|
flushing the data pages. This is especially true for servers
|
|
handling many small transactions touching different parts of the data
|
|
store.
|
|
</para>
|
|
|
|
<para>
|
|
The next benefit is crash recovery protection. The truth is
|
|
that, before <acronym>WAL</acronym> was introduced back in release 7.1,
|
|
<productname>PostgreSQL</productname> was never able to guarantee
|
|
consistency in the case of a crash. Now,
|
|
<acronym>WAL</acronym> protects fully against the following problems:
|
|
|
|
<orderedlist>
|
|
<listitem>
|
|
<simpara>index rows pointing to nonexistent table rows</simpara>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<simpara>index rows lost in split operations</simpara>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<simpara>totally corrupted table or index page content, because
|
|
of partially written data pages</simpara>
|
|
</listitem>
|
|
</orderedlist>
|
|
</para>
|
|
|
|
<para>
|
|
Finally, <acronym>WAL</acronym> makes it possible to support on-line
|
|
backup and point-in-time recovery, as described in <xref
|
|
linkend="backup-online">. By archiving the WAL data we can support
|
|
reverting to any time instant covered by the available WAL data:
|
|
we simply install a prior physical backup of the database, and
|
|
replay the WAL log just as far as the desired time. What's more,
|
|
the physical backup doesn't have to be an instantaneous snapshot
|
|
of the database state — if it is made over some period of time,
|
|
then replaying the WAL log for that period will fix any internal
|
|
inconsistencies.
|
|
</para>
|
|
</sect1>
|
|
|
|
<sect1 id="wal-configuration">
|
|
<title><acronym>WAL</acronym> Configuration</title>
|
|
|
|
<para>
|
|
There are several <acronym>WAL</acronym>-related configuration parameters that
|
|
affect database performance. This section explains their use.
|
|
Consult <xref linkend="runtime-config"> for general information about
|
|
setting server configuration parameters.
|
|
</para>
|
|
|
|
<para>
|
|
<firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></>
|
|
are points in the sequence of transactions at which it is guaranteed
|
|
that the data files have been updated with all information logged before
|
|
the checkpoint. At checkpoint time, all dirty data pages are flushed to
|
|
disk and a special checkpoint record is written to the log file. As a
|
|
result, in the event of a crash, the crash recovery procedure knows from
|
|
what point in the log (known as the redo record) it should start the
|
|
REDO operation, since any changes made to data files before that point
|
|
are already on disk. After a checkpoint has been made, any log segments
|
|
written before the redo record are no longer needed and can be recycled
|
|
or removed. (When <acronym>WAL</acronym> archiving is being done, the
|
|
log segments must be archived before being recycled or removed.)
|
|
</para>
|
|
|
|
<para>
|
|
The server's background writer process will automatically perform
|
|
a checkpoint every so often. A checkpoint is created every <xref
|
|
linkend="guc-checkpoint-segments"> log segments, or every <xref
|
|
linkend="guc-checkpoint-timeout"> seconds, whichever comes first.
|
|
The default settings are 3 segments and 300 seconds respectively.
|
|
It is also possible to force a checkpoint by using the SQL command
|
|
<command>CHECKPOINT</command>.
|
|
</para>
|
|
|
|
<para>
|
|
Reducing <varname>checkpoint_segments</varname> and/or
|
|
<varname>checkpoint_timeout</varname> causes checkpoints to be done
|
|
more often. This allows faster after-crash recovery (since less work
|
|
will need to be redone). However, one must balance this against the
|
|
increased cost of flushing dirty data pages more often. If
|
|
<xref linkend="guc-full-page-writes"> is set (the default), there is
|
|
another factor to consider. To ensure data page consistency,
|
|
the first modification of a data page after each checkpoint results in
|
|
logging the entire page content. In that case,
|
|
a smaller checkpoint interval increases the volume of output to the WAL log,
|
|
partially negating the goal of using a smaller interval,
|
|
and in any case causing more disk I/O.
|
|
</para>
|
|
|
|
<para>
|
|
Checkpoints are fairly expensive, first because they require writing
|
|
out all currently dirty buffers, and second because they result in
|
|
extra subsequent WAL traffic as discussed above. It is therefore
|
|
wise to set the checkpointing parameters high enough that checkpoints
|
|
don't happen too often. As a simple sanity check on your checkpointing
|
|
parameters, you can set the <xref linkend="guc-checkpoint-warning">
|
|
parameter. If checkpoints happen closer together than
|
|
<varname>checkpoint_warning</> seconds,
|
|
a message will be output to the server log recommending increasing
|
|
<varname>checkpoint_segments</varname>. Occasional appearance of such
|
|
a message is not cause for alarm, but if it appears often then the
|
|
checkpoint control parameters should be increased. Bulk operations such
|
|
as a COPY, INSERT SELECT etc. may cause a number of such warnings if you
|
|
do not set <xref linkend="guc-checkpoint-segments"> high enough.
|
|
</para>
|
|
|
|
<para>
|
|
There will be at least one WAL segment file, and will normally
|
|
not be more than 2 * <varname>checkpoint_segments</varname> + 1
|
|
files. Each segment file is normally 16 MB (though this size can be
|
|
altered when building the server). You can use this to estimate space
|
|
requirements for <acronym>WAL</acronym>.
|
|
Ordinarily, when old log segment files are no longer needed, they
|
|
are recycled (renamed to become the next segments in the numbered
|
|
sequence). If, due to a short-term peak of log output rate, there
|
|
are more than 2 * <varname>checkpoint_segments</varname> + 1
|
|
segment files, the unneeded segment files will be deleted instead
|
|
of recycled until the system gets back under this limit.
|
|
</para>
|
|
|
|
<para>
|
|
There are two commonly used internal <acronym>WAL</acronym> functions:
|
|
<function>LogInsert</function> and <function>LogFlush</function>.
|
|
<function>LogInsert</function> is used to place a new record into
|
|
the <acronym>WAL</acronym> buffers in shared memory. If there is no
|
|
space for the new record, <function>LogInsert</function> will have
|
|
to write (move to kernel cache) a few filled <acronym>WAL</acronym>
|
|
buffers. This is undesirable because <function>LogInsert</function>
|
|
is used on every database low level modification (for example, row
|
|
insertion) at a time when an exclusive lock is held on affected
|
|
data pages, so the operation needs to be as fast as possible. What
|
|
is worse, writing <acronym>WAL</acronym> buffers may also force the
|
|
creation of a new log segment, which takes even more
|
|
time. Normally, <acronym>WAL</acronym> buffers should be written
|
|
and flushed by a <function>LogFlush</function> request, which is
|
|
made, for the most part, at transaction commit time to ensure that
|
|
transaction records are flushed to permanent storage. On systems
|
|
with high log output, <function>LogFlush</function> requests may
|
|
not occur often enough to prevent <function>LogInsert</function>
|
|
from having to do writes. On such systems
|
|
one should increase the number of <acronym>WAL</acronym> buffers by
|
|
modifying the configuration parameter <xref
|
|
linkend="guc-wal-buffers">. The default number of <acronym>WAL</acronym>
|
|
buffers is 8. Increasing this value will
|
|
correspondingly increase shared memory usage. When
|
|
<xref linkend="guc-full-page-writes"> is set and the system is very busy,
|
|
setting this value higher will help smooth response times during the
|
|
period immediately following each checkpoint. As a guide, a setting of 1024
|
|
would be considered to be high.
|
|
</para>
|
|
|
|
<para>
|
|
The <xref linkend="guc-commit-delay"> parameter defines for how many
|
|
microseconds the server process will sleep after writing a commit
|
|
record to the log with <function>LogInsert</function> but before
|
|
performing a <function>LogFlush</function>. This delay allows other
|
|
server processes to add their commit records to the log so as to have all
|
|
of them flushed with a single log sync. No sleep will occur if
|
|
<xref linkend="guc-fsync">
|
|
is not enabled, nor if fewer than <xref linkend="guc-commit-siblings">
|
|
other sessions are currently in active transactions; this avoids
|
|
sleeping when it's unlikely that any other session will commit soon.
|
|
Note that on most platforms, the resolution of a sleep request is
|
|
ten milliseconds, so that any nonzero <varname>commit_delay</varname>
|
|
setting between 1 and 10000 microseconds would have the same effect.
|
|
Good values for these parameters are not yet clear; experimentation
|
|
is encouraged.
|
|
</para>
|
|
|
|
<para>
|
|
The <xref linkend="guc-wal-sync-method"> parameter determines how
|
|
<productname>PostgreSQL</productname> will ask the kernel to force
|
|
<acronym>WAL</acronym> updates out to disk.
|
|
All the options should be the same as far as reliability goes,
|
|
but it's quite platform-specific which one will be the fastest.
|
|
Note that this parameter is irrelevant if <varname>fsync</varname>
|
|
has been turned off.
|
|
</para>
|
|
|
|
<para>
|
|
Enabling the <xref linkend="guc-wal-debug"> configuration parameter
|
|
(provided that <productname>PostgreSQL</productname> has been
|
|
compiled with support for it) will result in each
|
|
<function>LogInsert</function> and <function>LogFlush</function>
|
|
<acronym>WAL</acronym> call being logged to the server log. The output
|
|
is too verbose for use as a guide to performance tuning. This
|
|
option may be replaced by a more general mechanism in the future.
|
|
</para>
|
|
</sect1>
|
|
|
|
<sect1 id="wal-internals">
|
|
<title>WAL Internals</title>
|
|
|
|
<para>
|
|
<acronym>WAL</acronym> is automatically enabled; no action is
|
|
required from the administrator except ensuring that the
|
|
disk-space requirements for the <acronym>WAL</acronym> logs are met,
|
|
and that any necessary tuning is done (see <xref
|
|
linkend="wal-configuration">).
|
|
</para>
|
|
|
|
<para>
|
|
<acronym>WAL</acronym> logs are stored in the directory
|
|
<filename>pg_xlog</filename> under the data directory, as a set of
|
|
segment files, normally each 16 MB in size. Each segment is divided into
|
|
pages, normally 8 KB each. The log record headers are described in
|
|
<filename>access/xlog.h</filename>; the record content is dependent
|
|
on the type of event that is being logged. Segment files are given
|
|
ever-increasing numbers as names, starting at
|
|
<filename>000000010000000000000000</filename>. The numbers do not wrap, at
|
|
present, but it should take a very very long time to exhaust the
|
|
available stock of numbers.
|
|
</para>
|
|
|
|
<para>
|
|
The <acronym>WAL</acronym> buffers and control structure are in
|
|
shared memory and are handled by the server child processes; they
|
|
are protected by lightweight locks. The demand on shared memory is
|
|
dependent on the number of buffers. The default size of the
|
|
<acronym>WAL</acronym> buffers is 8 buffers of 8 kB each, or 64 kB
|
|
total.
|
|
</para>
|
|
|
|
<para>
|
|
It is of advantage if the log is located on another disk than the
|
|
main database files. This may be achieved by moving the directory
|
|
<filename>pg_xlog</filename> to another location (while the server
|
|
is shut down, of course) and creating a symbolic link from the
|
|
original location in the main data directory to the new location.
|
|
</para>
|
|
|
|
<para>
|
|
The aim of <acronym>WAL</acronym>, to ensure that the log is
|
|
written before database records are altered, may be subverted by
|
|
disk drives<indexterm><primary>disk drive</></> that falsely report a
|
|
successful write to the kernel,
|
|
when in fact they have only cached the data and not yet stored it
|
|
on the disk. A power failure in such a situation may still lead to
|
|
irrecoverable data corruption. Administrators should try to ensure
|
|
that disks holding <productname>PostgreSQL</productname>'s
|
|
<acronym>WAL</acronym> log files do not make such false reports.
|
|
</para>
|
|
|
|
<para>
|
|
After a checkpoint has been made and the log flushed, the
|
|
checkpoint's position is saved in the file
|
|
<filename>pg_control</filename>. Therefore, when recovery is to be
|
|
done, the server first reads <filename>pg_control</filename> and
|
|
then the checkpoint record; then it performs the REDO operation by
|
|
scanning forward from the log position indicated in the checkpoint
|
|
record. Because the entire content of data pages is saved in the
|
|
log on the first page modification after a checkpoint, all pages
|
|
changed since the checkpoint will be restored to a consistent
|
|
state.
|
|
</para>
|
|
|
|
<para>
|
|
To deal with the case where <filename>pg_control</filename> is
|
|
corrupted, we should support the possibility of scanning existing log
|
|
segments in reverse order — newest to oldest — in order to find the
|
|
latest checkpoint. This has not been implemented yet.
|
|
<filename>pg_control</filename> is small enough (less than one disk page)
|
|
that it is not subject to partial-write problems, and as of this writing
|
|
there have been no reports of database failures due solely to inability
|
|
to read <filename>pg_control</filename> itself. So while it is
|
|
theoretically a weak spot, <filename>pg_control</filename> does not
|
|
seem to be a problem in practice.
|
|
</para>
|
|
</sect1>
|
|
</chapter>
|
|
|
|
<!-- Keep this comment at the end of the file
|
|
Local variables:
|
|
mode:sgml
|
|
sgml-omittag:nil
|
|
sgml-shorttag:t
|
|
sgml-minimize-attributes:nil
|
|
sgml-always-quote-attributes:t
|
|
sgml-indent-step:1
|
|
sgml-indent-data:t
|
|
sgml-parent-document:nil
|
|
sgml-default-dtd-file:"./reference.ced"
|
|
sgml-exposed-tags:nil
|
|
sgml-local-catalogs:("/usr/lib/sgml/catalog")
|
|
sgml-local-ecat-files:nil
|
|
End:
|
|
-->
|