mirror of
https://git.postgresql.org/git/postgresql.git
synced 2024-09-12 19:07:59 +02:00
667 lines
32 KiB
Plaintext
667 lines
32 KiB
Plaintext
<!-- doc/src/sgml/wal.sgml -->
|
|
|
|
<chapter id="wal">
|
|
<title>Reliability and the Write-Ahead Log</title>
|
|
|
|
<para>
|
|
This chapter explains how the Write-Ahead Log is used to obtain
|
|
efficient, reliable operation.
|
|
</para>
|
|
|
|
<sect1 id="wal-reliability">
|
|
<title>Reliability</title>
|
|
|
|
<para>
|
|
Reliability is an important property of any serious database
|
|
system, and <productname>PostgreSQL</> does everything possible to
|
|
guarantee reliable operation. One aspect of reliable operation is
|
|
that all data recorded by a committed transaction should be stored
|
|
in a nonvolatile area that is safe from power loss, operating
|
|
system failure, and hardware failure (except failure of the
|
|
nonvolatile area itself, of course). Successfully writing the data
|
|
to the computer's permanent storage (disk drive or equivalent)
|
|
ordinarily meets this requirement. In fact, even if a computer is
|
|
fatally damaged, if the disk drives survive they can be moved to
|
|
another computer with similar hardware and all committed
|
|
transactions will remain intact.
|
|
</para>
|
|
|
|
<para>
|
|
While forcing data to the disk platters periodically might seem like
|
|
a simple operation, it is not. Because disk drives are dramatically
|
|
slower than main memory and CPUs, several layers of caching exist
|
|
between the computer's main memory and the disk platters.
|
|
First, there is the operating system's buffer cache, which caches
|
|
frequently requested disk blocks and combines disk writes. Fortunately,
|
|
all operating systems give applications a way to force writes from
|
|
the buffer cache to disk, and <productname>PostgreSQL</> uses those
|
|
features. (See the <xref linkend="guc-wal-sync-method"> parameter
|
|
to adjust how this is done.)
|
|
</para>
|
|
|
|
<para>
|
|
Next, there might be a cache in the disk drive controller; this is
|
|
particularly common on <acronym>RAID</> controller cards. Some of
|
|
these caches are <firstterm>write-through</>, meaning writes are sent
|
|
to the drive as soon as they arrive. Others are
|
|
<firstterm>write-back</>, meaning data is sent to the drive at
|
|
some later time. Such caches can be a reliability hazard because the
|
|
memory in the disk controller cache is volatile, and will lose its
|
|
contents in a power failure. Better controller cards have
|
|
<firstterm>battery-backup units</> (<acronym>BBU</>s), meaning
|
|
the card has a battery that
|
|
maintains power to the cache in case of system power loss. After power
|
|
is restored the data will be written to the disk drives.
|
|
</para>
|
|
|
|
<para>
|
|
And finally, most disk drives have caches. Some are write-through
|
|
while some are write-back, and the same concerns about data loss
|
|
exist for write-back drive caches as for disk controller
|
|
caches. Consumer-grade IDE and SATA drives are particularly likely
|
|
to have write-back caches that will not survive a power failure. Many
|
|
solid-state drives (SSD) also have volatile write-back caches.
|
|
</para>
|
|
|
|
<para>
|
|
These caches can typically be disabled; however, the method for doing
|
|
this varies by operating system and drive type:
|
|
</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
On <productname>Linux</>, IDE drives can be queried using
|
|
<command>hdparm -I</command>; write caching is enabled if there is
|
|
a <literal>*</> next to <literal>Write cache</>. <command>hdparm -W</>
|
|
can be used to turn off write caching. SCSI drives can be queried
|
|
using <ulink url="http://sg.danny.cz/sg/sdparm.html"><application>sdparm</></ulink>.
|
|
Use <command>sdparm --get=WCE</command> to check
|
|
whether the write cache is enabled and <command>sdparm --clear=WCE</>
|
|
to disable it.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
On <productname>FreeBSD</>, IDE drives can be queried using
|
|
<command>atacontrol</command> and write caching turned off using
|
|
<literal>hw.ata.wc=0</> in <filename>/boot/loader.conf</>;
|
|
SCSI drives use <command>sdparm</command>.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
On <productname>Solaris</>, the disk write cache is controlled by
|
|
<ulink url="http://www.sun.com/bigadmin/content/submitted/format_utility.jsp"><literal>format -e</></ulink>.
|
|
(The Solaris <acronym>ZFS</> file system is safe with disk write-cache
|
|
enabled because it issues its own disk cache flush commands.)
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
On <productname>Windows</>, if <varname>wal_sync_method</> is
|
|
<literal>open_datasync</> (the default), write caching can be disabled
|
|
by unchecking <literal>My Computer\Open\<replaceable>disk drive</>\Properties\Hardware\Properties\Policies\Enable write caching on the disk</>.
|
|
Alternatively, set <varname>wal_sync_method</varname> to
|
|
<literal>fsync</> or <literal>fsync_writethrough</>, which prevent
|
|
write caching.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
On <productname>Mac OS X</productname>, write caching can be prevented by
|
|
setting <varname>wal_sync_method</> to <literal>fsync_writethrough</>.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>
|
|
Recent SATA drives (those following <acronym>ATAPI-6</> or later)
|
|
offer a drive cache flush command (<command>FLUSH CACHE EXT</>),
|
|
while SCSI drives have long supported a similar command
|
|
<command>SYNCHRONIZE CACHE</>. These commands are not directly
|
|
accessible to <productname>PostgreSQL</>, but some file systems
|
|
(e.g., <acronym>ZFS</>, <acronym>ext4</>) can use them to flush
|
|
data to the platters on write-back-enabled drives. Unfortunately, such
|
|
file systems behave suboptimally when combined with battery-backup unit
|
|
(<acronym>BBU</>) disk controllers. In such setups, the synchronize
|
|
command forces all data from the controller cache to the disks,
|
|
eliminating much of the benefit of the BBU. You can run the
|
|
<xref linkend="pgtestfsync"> module to see
|
|
if you are affected. If you are affected, the performance benefits
|
|
of the BBU can be regained by turning off write barriers in
|
|
the file system or reconfiguring the disk controller, if that is
|
|
an option. If write barriers are turned off, make sure the battery
|
|
remains functional; a faulty battery can potentially lead to data loss.
|
|
Hopefully file system and disk controller designers will eventually
|
|
address this suboptimal behavior.
|
|
</para>
|
|
|
|
<para>
|
|
When the operating system sends a write request to the storage hardware,
|
|
there is little it can do to make sure the data has arrived at a truly
|
|
non-volatile storage area. Rather, it is the
|
|
administrator's responsibility to make certain that all storage components
|
|
ensure data integrity. Avoid disk controllers that have non-battery-backed
|
|
write caches. At the drive level, disable write-back caching if the
|
|
drive cannot guarantee the data will be written before shutdown.
|
|
If you use SSDs, be aware that many of these do not honor cache flush
|
|
commands by default.
|
|
You can test for reliable I/O subsystem behavior using <ulink
|
|
url="http://brad.livejournal.com/2116715.html"><filename>diskchecker.pl</filename></ulink>.
|
|
</para>
|
|
|
|
<para>
|
|
Another risk of data loss is posed by the disk platter write
|
|
operations themselves. Disk platters are divided into sectors,
|
|
commonly 512 bytes each. Every physical read or write operation
|
|
processes a whole sector.
|
|
When a write request arrives at the drive, it might be for some multiple
|
|
of 512 bytes (<productname>PostgreSQL</> typically writes 8192 bytes, or
|
|
16 sectors, at a time), and the process of writing could fail due
|
|
to power loss at any time, meaning some of the 512-byte sectors were
|
|
written while others were not. To guard against such failures,
|
|
<productname>PostgreSQL</> periodically writes full page images to
|
|
permanent WAL storage <emphasis>before</> modifying the actual page on
|
|
disk. By doing this, during crash recovery <productname>PostgreSQL</> can
|
|
restore partially-written pages from WAL. If you have file-system software
|
|
that prevents partial page writes (e.g., ZFS), you can turn off
|
|
this page imaging by turning off the <xref
|
|
linkend="guc-full-page-writes"> parameter. Battery-Backed Unit
|
|
(BBU) disk controllers do not prevent partial page writes unless
|
|
they guarantee that data is written to the BBU as full (8kB) pages.
|
|
</para>
|
|
</sect1>
|
|
|
|
<sect1 id="wal-intro">
|
|
<title>Write-Ahead Logging (<acronym>WAL</acronym>)</title>
|
|
|
|
<indexterm zone="wal">
|
|
<primary>WAL</primary>
|
|
</indexterm>
|
|
|
|
<indexterm>
|
|
<primary>transaction log</primary>
|
|
<see>WAL</see>
|
|
</indexterm>
|
|
|
|
<para>
|
|
<firstterm>Write-Ahead Logging</firstterm> (<acronym>WAL</acronym>)
|
|
is a standard method for ensuring data integrity. A detailed
|
|
description can be found in most (if not all) books about
|
|
transaction processing. Briefly, <acronym>WAL</acronym>'s central
|
|
concept is that changes to data files (where tables and indexes
|
|
reside) must be written only after those changes have been logged,
|
|
that is, after log records describing the changes have been flushed
|
|
to permanent storage. If we follow this procedure, we do not need
|
|
to flush data pages to disk on every transaction commit, because we
|
|
know that in the event of a crash we will be able to recover the
|
|
database using the log: any changes that have not been applied to
|
|
the data pages can be redone from the log records. (This is
|
|
roll-forward recovery, also known as REDO.)
|
|
</para>
|
|
|
|
<tip>
|
|
<para>
|
|
Because <acronym>WAL</acronym> restores database file
|
|
contents after a crash, journaled file systems are not necessary for
|
|
reliable storage of the data files or WAL files. In fact, journaling
|
|
overhead can reduce performance, especially if journaling
|
|
causes file system <emphasis>data</emphasis> to be flushed
|
|
to disk. Fortunately, data flushing during journaling can
|
|
often be disabled with a file system mount option, e.g.
|
|
<literal>data=writeback</> on a Linux ext3 file system.
|
|
Journaled file systems do improve boot speed after a crash.
|
|
</para>
|
|
</tip>
|
|
|
|
|
|
<para>
|
|
Using <acronym>WAL</acronym> results in a
|
|
significantly reduced number of disk writes, because only the log
|
|
file needs to be flushed to disk to guarantee that a transaction is
|
|
committed, rather than every data file changed by the transaction.
|
|
The log file is written sequentially,
|
|
and so the cost of syncing the log is much less than the cost of
|
|
flushing the data pages. This is especially true for servers
|
|
handling many small transactions touching different parts of the data
|
|
store. Furthermore, when the server is processing many small concurrent
|
|
transactions, one <function>fsync</function> of the log file may
|
|
suffice to commit many transactions.
|
|
</para>
|
|
|
|
<para>
|
|
<acronym>WAL</acronym> also makes it possible to support on-line
|
|
backup and point-in-time recovery, as described in <xref
|
|
linkend="continuous-archiving">. By archiving the WAL data we can support
|
|
reverting to any time instant covered by the available WAL data:
|
|
we simply install a prior physical backup of the database, and
|
|
replay the WAL log just as far as the desired time. What's more,
|
|
the physical backup doesn't have to be an instantaneous snapshot
|
|
of the database state — if it is made over some period of time,
|
|
then replaying the WAL log for that period will fix any internal
|
|
inconsistencies.
|
|
</para>
|
|
</sect1>
|
|
|
|
<sect1 id="wal-async-commit">
|
|
<title>Asynchronous Commit</title>
|
|
|
|
<indexterm>
|
|
<primary>synchronous commit</primary>
|
|
</indexterm>
|
|
|
|
<indexterm>
|
|
<primary>asynchronous commit</primary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
<firstterm>Asynchronous commit</> is an option that allows transactions
|
|
to complete more quickly, at the cost that the most recent transactions may
|
|
be lost if the database should crash. In many applications this is an
|
|
acceptable trade-off.
|
|
</para>
|
|
|
|
<para>
|
|
As described in the previous section, transaction commit is normally
|
|
<firstterm>synchronous</>: the server waits for the transaction's
|
|
<acronym>WAL</acronym> records to be flushed to permanent storage
|
|
before returning a success indication to the client. The client is
|
|
therefore guaranteed that a transaction reported to be committed will
|
|
be preserved, even in the event of a server crash immediately after.
|
|
However, for short transactions this delay is a major component of the
|
|
total transaction time. Selecting asynchronous commit mode means that
|
|
the server returns success as soon as the transaction is logically
|
|
completed, before the <acronym>WAL</acronym> records it generated have
|
|
actually made their way to disk. This can provide a significant boost
|
|
in throughput for small transactions.
|
|
</para>
|
|
|
|
<para>
|
|
Asynchronous commit introduces the risk of data loss. There is a short
|
|
time window between the report of transaction completion to the client
|
|
and the time that the transaction is truly committed (that is, it is
|
|
guaranteed not to be lost if the server crashes). Thus asynchronous
|
|
commit should not be used if the client will take external actions
|
|
relying on the assumption that the transaction will be remembered.
|
|
As an example, a bank would certainly not use asynchronous commit for
|
|
a transaction recording an ATM's dispensing of cash. But in many
|
|
scenarios, such as event logging, there is no need for a strong
|
|
guarantee of this kind.
|
|
</para>
|
|
|
|
<para>
|
|
The risk that is taken by using asynchronous commit is of data loss,
|
|
not data corruption. If the database should crash, it will recover
|
|
by replaying <acronym>WAL</acronym> up to the last record that was
|
|
flushed. The database will therefore be restored to a self-consistent
|
|
state, but any transactions that were not yet flushed to disk will
|
|
not be reflected in that state. The net effect is therefore loss of
|
|
the last few transactions. Because the transactions are replayed in
|
|
commit order, no inconsistency can be introduced — for example,
|
|
if transaction B made changes relying on the effects of a previous
|
|
transaction A, it is not possible for A's effects to be lost while B's
|
|
effects are preserved.
|
|
</para>
|
|
|
|
<para>
|
|
The user can select the commit mode of each transaction, so that
|
|
it is possible to have both synchronous and asynchronous commit
|
|
transactions running concurrently. This allows flexible trade-offs
|
|
between performance and certainty of transaction durability.
|
|
The commit mode is controlled by the user-settable parameter
|
|
<xref linkend="guc-synchronous-commit">, which can be changed in any of
|
|
the ways that a configuration parameter can be set. The mode used for
|
|
any one transaction depends on the value of
|
|
<varname>synchronous_commit</varname> when transaction commit begins.
|
|
</para>
|
|
|
|
<para>
|
|
Certain utility commands, for instance <command>DROP TABLE</>, are
|
|
forced to commit synchronously regardless of the setting of
|
|
<varname>synchronous_commit</varname>. This is to ensure consistency
|
|
between the server's file system and the logical state of the database.
|
|
The commands supporting two-phase commit, such as <command>PREPARE
|
|
TRANSACTION</>, are also always synchronous.
|
|
</para>
|
|
|
|
<para>
|
|
If the database crashes during the risk window between an
|
|
asynchronous commit and the writing of the transaction's
|
|
<acronym>WAL</acronym> records,
|
|
then changes made during that transaction <emphasis>will</> be lost.
|
|
The duration of the
|
|
risk window is limited because a background process (the <quote>WAL
|
|
writer</>) flushes unwritten <acronym>WAL</acronym> records to disk
|
|
every <xref linkend="guc-wal-writer-delay"> milliseconds.
|
|
The actual maximum duration of the risk window is three times
|
|
<varname>wal_writer_delay</varname> because the WAL writer is
|
|
designed to favor writing whole pages at a time during busy periods.
|
|
</para>
|
|
|
|
<caution>
|
|
<para>
|
|
An immediate-mode shutdown is equivalent to a server crash, and will
|
|
therefore cause loss of any unflushed asynchronous commits.
|
|
</para>
|
|
</caution>
|
|
|
|
<para>
|
|
Asynchronous commit provides behavior different from setting
|
|
<xref linkend="guc-fsync"> = off.
|
|
<varname>fsync</varname> is a server-wide
|
|
setting that will alter the behavior of all transactions. It disables
|
|
all logic within <productname>PostgreSQL</> that attempts to synchronize
|
|
writes to different portions of the database, and therefore a system
|
|
crash (that is, a hardware or operating system crash, not a failure of
|
|
<productname>PostgreSQL</> itself) could result in arbitrarily bad
|
|
corruption of the database state. In many scenarios, asynchronous
|
|
commit provides most of the performance improvement that could be
|
|
obtained by turning off <varname>fsync</varname>, but without the risk
|
|
of data corruption.
|
|
</para>
|
|
|
|
<para>
|
|
<xref linkend="guc-commit-delay"> also sounds very similar to
|
|
asynchronous commit, but it is actually a synchronous commit method
|
|
(in fact, <varname>commit_delay</varname> is ignored during an
|
|
asynchronous commit). <varname>commit_delay</varname> causes a delay
|
|
just before a synchronous commit attempts to flush
|
|
<acronym>WAL</acronym> to disk, in the hope that a single flush
|
|
executed by one such transaction can also serve other transactions
|
|
committing at about the same time. Setting <varname>commit_delay</varname>
|
|
can only help when there are many concurrently committing transactions,
|
|
and it is difficult to tune it to a value that actually helps rather
|
|
than hurt throughput.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="wal-configuration">
|
|
<title><acronym>WAL</acronym> Configuration</title>
|
|
|
|
<para>
|
|
There are several <acronym>WAL</>-related configuration parameters that
|
|
affect database performance. This section explains their use.
|
|
Consult <xref linkend="runtime-config"> for general information about
|
|
setting server configuration parameters.
|
|
</para>
|
|
|
|
<para>
|
|
<firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></>
|
|
are points in the sequence of transactions at which it is guaranteed
|
|
that the heap and index data files have been updated with all information written before
|
|
the checkpoint. At checkpoint time, all dirty data pages are flushed to
|
|
disk and a special checkpoint record is written to the log file.
|
|
(The changes were previously flushed to the <acronym>WAL</acronym> files.)
|
|
In the event of a crash, the crash recovery procedure looks at the latest
|
|
checkpoint record to determine the point in the log (known as the redo
|
|
record) from which it should start the REDO operation. Any changes made to
|
|
data files before that point are guaranteed to be already on disk. Hence, after
|
|
a checkpoint, log segments preceding the one containing
|
|
the redo record are no longer needed and can be recycled or removed. (When
|
|
<acronym>WAL</acronym> archiving is being done, the log segments must be
|
|
archived before being recycled or removed.)
|
|
</para>
|
|
|
|
<para>
|
|
The checkpoint requirement of flushing all dirty data pages to disk
|
|
can cause a significant I/O load. For this reason, checkpoint
|
|
activity is throttled so I/O begins at checkpoint start and completes
|
|
before the next checkpoint starts; this minimizes performance
|
|
degradation during checkpoints.
|
|
</para>
|
|
|
|
<para>
|
|
The server's background writer process automatically performs
|
|
a checkpoint every so often. A checkpoint is created every <xref
|
|
linkend="guc-checkpoint-segments"> log segments, or every <xref
|
|
linkend="guc-checkpoint-timeout"> seconds, whichever comes first.
|
|
The default settings are 3 segments and 300 seconds (5 minutes), respectively.
|
|
It is also possible to force a checkpoint by using the SQL command
|
|
<command>CHECKPOINT</command>.
|
|
</para>
|
|
|
|
<para>
|
|
Reducing <varname>checkpoint_segments</varname> and/or
|
|
<varname>checkpoint_timeout</varname> causes checkpoints to occur
|
|
more often. This allows faster after-crash recovery (since less work
|
|
will need to be redone). However, one must balance this against the
|
|
increased cost of flushing dirty data pages more often. If
|
|
<xref linkend="guc-full-page-writes"> is set (as is the default), there is
|
|
another factor to consider. To ensure data page consistency,
|
|
the first modification of a data page after each checkpoint results in
|
|
logging the entire page content. In that case,
|
|
a smaller checkpoint interval increases the volume of output to the WAL log,
|
|
partially negating the goal of using a smaller interval,
|
|
and in any case causing more disk I/O.
|
|
</para>
|
|
|
|
<para>
|
|
Checkpoints are fairly expensive, first because they require writing
|
|
out all currently dirty buffers, and second because they result in
|
|
extra subsequent WAL traffic as discussed above. It is therefore
|
|
wise to set the checkpointing parameters high enough that checkpoints
|
|
don't happen too often. As a simple sanity check on your checkpointing
|
|
parameters, you can set the <xref linkend="guc-checkpoint-warning">
|
|
parameter. If checkpoints happen closer together than
|
|
<varname>checkpoint_warning</> seconds,
|
|
a message will be output to the server log recommending increasing
|
|
<varname>checkpoint_segments</varname>. Occasional appearance of such
|
|
a message is not cause for alarm, but if it appears often then the
|
|
checkpoint control parameters should be increased. Bulk operations such
|
|
as large <command>COPY</> transfers might cause a number of such warnings
|
|
to appear if you have not set <varname>checkpoint_segments</> high
|
|
enough.
|
|
</para>
|
|
|
|
<para>
|
|
To avoid flooding the I/O system with a burst of page writes,
|
|
writing dirty buffers during a checkpoint is spread over a period of time.
|
|
That period is controlled by
|
|
<xref linkend="guc-checkpoint-completion-target">, which is
|
|
given as a fraction of the checkpoint interval.
|
|
The I/O rate is adjusted so that the checkpoint finishes when the
|
|
given fraction of <varname>checkpoint_segments</varname> WAL segments
|
|
have been consumed since checkpoint start, or the given fraction of
|
|
<varname>checkpoint_timeout</varname> seconds have elapsed,
|
|
whichever is sooner. With the default value of 0.5,
|
|
<productname>PostgreSQL</> can be expected to complete each checkpoint
|
|
in about half the time before the next checkpoint starts. On a system
|
|
that's very close to maximum I/O throughput during normal operation,
|
|
you might want to increase <varname>checkpoint_completion_target</varname>
|
|
to reduce the I/O load from checkpoints. The disadvantage of this is that
|
|
prolonging checkpoints affects recovery time, because more WAL segments
|
|
will need to be kept around for possible use in recovery. Although
|
|
<varname>checkpoint_completion_target</varname> can be set as high as 1.0,
|
|
it is best to keep it less than that (perhaps 0.9 at most) since
|
|
checkpoints include some other activities besides writing dirty buffers.
|
|
A setting of 1.0 is quite likely to result in checkpoints not being
|
|
completed on time, which would result in performance loss due to
|
|
unexpected variation in the number of WAL segments needed.
|
|
</para>
|
|
|
|
<para>
|
|
There will always be at least one WAL segment file, and will normally
|
|
not be more than (2 + <varname>checkpoint_completion_target</varname>) * <varname>checkpoint_segments</varname> + 1
|
|
or <varname>checkpoint_segments</> + <xref linkend="guc-wal-keep-segments"> + 1
|
|
files. Each segment file is normally 16 MB (though this size can be
|
|
altered when building the server). You can use this to estimate space
|
|
requirements for <acronym>WAL</acronym>.
|
|
Ordinarily, when old log segment files are no longer needed, they
|
|
are recycled (renamed to become the next segments in the numbered
|
|
sequence). If, due to a short-term peak of log output rate, there
|
|
are more than 3 * <varname>checkpoint_segments</varname> + 1
|
|
segment files, the unneeded segment files will be deleted instead
|
|
of recycled until the system gets back under this limit.
|
|
</para>
|
|
|
|
<para>
|
|
In archive recovery or standby mode, the server periodically performs
|
|
<firstterm>restartpoints</><indexterm><primary>restartpoint</></>
|
|
which are similar to checkpoints in normal operation: the server forces
|
|
all its state to disk, updates the <filename>pg_control</> file to
|
|
indicate that the already-processed WAL data need not be scanned again,
|
|
and then recycles any old log segment files in <filename>pg_xlog</>
|
|
directory. A restartpoint is triggered if at least one checkpoint record
|
|
has been replayed and <varname>checkpoint_timeout</> seconds have passed
|
|
since last restartpoint. In standby mode, a restartpoint is also triggered
|
|
if <varname>checkpoint_segments</> log segments have been replayed since
|
|
last restartpoint and at least one checkpoint record has been replayed.
|
|
Restartpoints can't be performed more frequently than checkpoints in the
|
|
master because restartpoints can only be performed at checkpoint records.
|
|
</para>
|
|
|
|
<para>
|
|
There are two commonly used internal <acronym>WAL</acronym> functions:
|
|
<function>LogInsert</function> and <function>LogFlush</function>.
|
|
<function>LogInsert</function> is used to place a new record into
|
|
the <acronym>WAL</acronym> buffers in shared memory. If there is no
|
|
space for the new record, <function>LogInsert</function> will have
|
|
to write (move to kernel cache) a few filled <acronym>WAL</acronym>
|
|
buffers. This is undesirable because <function>LogInsert</function>
|
|
is used on every database low level modification (for example, row
|
|
insertion) at a time when an exclusive lock is held on affected
|
|
data pages, so the operation needs to be as fast as possible. What
|
|
is worse, writing <acronym>WAL</acronym> buffers might also force the
|
|
creation of a new log segment, which takes even more
|
|
time. Normally, <acronym>WAL</acronym> buffers should be written
|
|
and flushed by a <function>LogFlush</function> request, which is
|
|
made, for the most part, at transaction commit time to ensure that
|
|
transaction records are flushed to permanent storage. On systems
|
|
with high log output, <function>LogFlush</function> requests might
|
|
not occur often enough to prevent <function>LogInsert</function>
|
|
from having to do writes. On such systems
|
|
one should increase the number of <acronym>WAL</acronym> buffers by
|
|
modifying the configuration parameter <xref
|
|
linkend="guc-wal-buffers">. The default number of <acronym>WAL</acronym>
|
|
buffers is 8. Increasing this value will
|
|
correspondingly increase shared memory usage. When
|
|
<xref linkend="guc-full-page-writes"> is set and the system is very busy,
|
|
setting this value higher will help smooth response times during the
|
|
period immediately following each checkpoint.
|
|
</para>
|
|
|
|
<para>
|
|
The <xref linkend="guc-commit-delay"> parameter defines for how many
|
|
microseconds the server process will sleep after writing a commit
|
|
record to the log with <function>LogInsert</function> but before
|
|
performing a <function>LogFlush</function>. This delay allows other
|
|
server processes to add their commit records to the log so as to have all
|
|
of them flushed with a single log sync. No sleep will occur if
|
|
<xref linkend="guc-fsync">
|
|
is not enabled, or if fewer than <xref linkend="guc-commit-siblings">
|
|
other sessions are currently in active transactions; this avoids
|
|
sleeping when it's unlikely that any other session will commit soon.
|
|
Note that on most platforms, the resolution of a sleep request is
|
|
ten milliseconds, so that any nonzero <varname>commit_delay</varname>
|
|
setting between 1 and 10000 microseconds would have the same effect.
|
|
Good values for these parameters are not yet clear; experimentation
|
|
is encouraged.
|
|
</para>
|
|
|
|
<para>
|
|
The <xref linkend="guc-wal-sync-method"> parameter determines how
|
|
<productname>PostgreSQL</productname> will ask the kernel to force
|
|
<acronym>WAL</acronym> updates out to disk.
|
|
All the options should be the same in terms of reliability, with
|
|
the exception of <literal>fsync_writethrough</>, which can sometimes
|
|
force a flush of the disk cache even when other options do not do so.
|
|
However, it's quite platform-specific which one will be the fastest;
|
|
you can test option speeds using the <xref
|
|
linkend="pgtestfsync"> module.
|
|
Note that this parameter is irrelevant if <varname>fsync</varname>
|
|
has been turned off.
|
|
</para>
|
|
|
|
<para>
|
|
Enabling the <xref linkend="guc-wal-debug"> configuration parameter
|
|
(provided that <productname>PostgreSQL</productname> has been
|
|
compiled with support for it) will result in each
|
|
<function>LogInsert</function> and <function>LogFlush</function>
|
|
<acronym>WAL</acronym> call being logged to the server log. This
|
|
option might be replaced by a more general mechanism in the future.
|
|
</para>
|
|
</sect1>
|
|
|
|
<sect1 id="wal-internals">
|
|
<title>WAL Internals</title>
|
|
|
|
<para>
|
|
<acronym>WAL</acronym> is automatically enabled; no action is
|
|
required from the administrator except ensuring that the
|
|
disk-space requirements for the <acronym>WAL</acronym> logs are met,
|
|
and that any necessary tuning is done (see <xref
|
|
linkend="wal-configuration">).
|
|
</para>
|
|
|
|
<para>
|
|
<acronym>WAL</acronym> logs are stored in the directory
|
|
<filename>pg_xlog</filename> under the data directory, as a set of
|
|
segment files, normally each 16 MB in size (but the size can be changed
|
|
by altering the <option>--with-wal-segsize</> configure option when
|
|
building the server). Each segment is divided into pages, normally
|
|
8 kB each (this size can be changed via the <option>--with-wal-blocksize</>
|
|
configure option). The log record headers are described in
|
|
<filename>access/xlog.h</filename>; the record content is dependent
|
|
on the type of event that is being logged. Segment files are given
|
|
ever-increasing numbers as names, starting at
|
|
<filename>000000010000000000000000</filename>. The numbers do not wrap,
|
|
but it will take a very, very long time to exhaust the
|
|
available stock of numbers.
|
|
</para>
|
|
|
|
<para>
|
|
It is advantageous if the log is located on a different disk from the
|
|
main database files. This can be achieved by moving the
|
|
<filename>pg_xlog</filename> directory to another location (while the server
|
|
is shut down, of course) and creating a symbolic link from the
|
|
original location in the main data directory to the new location.
|
|
</para>
|
|
|
|
<para>
|
|
The aim of <acronym>WAL</acronym> is to ensure that the log is
|
|
written before database records are altered, but this can be subverted by
|
|
disk drives<indexterm><primary>disk drive</></> that falsely report a
|
|
successful write to the kernel,
|
|
when in fact they have only cached the data and not yet stored it
|
|
on the disk. A power failure in such a situation might lead to
|
|
irrecoverable data corruption. Administrators should try to ensure
|
|
that disks holding <productname>PostgreSQL</productname>'s
|
|
<acronym>WAL</acronym> log files do not make such false reports.
|
|
(See <xref linkend="wal-reliability">.)
|
|
</para>
|
|
|
|
<para>
|
|
After a checkpoint has been made and the log flushed, the
|
|
checkpoint's position is saved in the file
|
|
<filename>pg_control</filename>. Therefore, at the start of recovery,
|
|
the server first reads <filename>pg_control</filename> and
|
|
then the checkpoint record; then it performs the REDO operation by
|
|
scanning forward from the log position indicated in the checkpoint
|
|
record. Because the entire content of data pages is saved in the
|
|
log on the first page modification after a checkpoint (assuming
|
|
<xref linkend="guc-full-page-writes"> is not disabled), all pages
|
|
changed since the checkpoint will be restored to a consistent
|
|
state.
|
|
</para>
|
|
|
|
<para>
|
|
To deal with the case where <filename>pg_control</filename> is
|
|
corrupt, we should support the possibility of scanning existing log
|
|
segments in reverse order — newest to oldest — in order to find the
|
|
latest checkpoint. This has not been implemented yet.
|
|
<filename>pg_control</filename> is small enough (less than one disk page)
|
|
that it is not subject to partial-write problems, and as of this writing
|
|
there have been no reports of database failures due solely to the inability
|
|
to read <filename>pg_control</filename> itself. So while it is
|
|
theoretically a weak spot, <filename>pg_control</filename> does not
|
|
seem to be a problem in practice.
|
|
</para>
|
|
</sect1>
|
|
</chapter>
|