postgresql/doc/src/sgml/high-availability.sgml

<!-- $PostgreSQL: pgsql/doc/src/sgml/high-availability.sgml,v 1.70 2010/05/29 09:01:10 heikki Exp $ -->

<chapter id="high-availability">
 <title>High Availability, Load Balancing, and Replication</title>

 <indexterm><primary>high availability</></>
 <indexterm><primary>failover</></>
 <indexterm><primary>replication</></>
 <indexterm><primary>load balancing</></>
 <indexterm><primary>clustering</></>
 <indexterm><primary>data partitioning</></>

 <para>
  Database servers can work together to allow a second server to
  take over quickly if the primary server fails (high
  availability), or to allow several computers to serve the same
  data (load balancing).  Ideally, database servers could work
  together seamlessly.  Web servers serving static web pages can
  be combined quite easily by merely load-balancing web requests
  to multiple machines.  In fact, read-only database servers can
  be combined relatively easily too.  Unfortunately, most database
  servers have a read/write mix of requests, and read/write servers
  are much harder to combine.  This is because though read-only
  data needs to be placed on each server only once, a write to any
  server has to be propagated to all servers so that future read
  requests to those servers return consistent results.
 </para>

 <para>
  This synchronization problem is the fundamental difficulty for
  servers working together.  Because there is no single solution
  that eliminates the impact of the sync problem for all use cases,
  there are multiple solutions.  Each solution addresses this
  problem in a different way, and minimizes its impact for a specific
  workload.
 </para>

 <para>
  Some solutions deal with synchronization by allowing only one
  server to modify the data.  Servers that can modify data are
  called read/write, <firstterm>master</> or <firstterm>primary</> servers.
  Servers that track changes in the master are called <firstterm>standby</>
  or <firstterm>slave</> servers. A standby server that cannot be connected
  to until it is promoted to a master server is called a <firstterm>warm
  standby</> server, and one that can accept connections and serves read-only
  queries is called a <firstterm>hot standby</> server.
 </para>

 <para>
  Some solutions are synchronous,
  meaning that a data-modifying transaction is not considered
  committed until all servers have committed the transaction.  This
  guarantees that a failover will not lose any data and that all
  load-balanced servers will return consistent results no matter
  which server is queried. In contrast, asynchronous solutions allow some
  delay between the time of a commit and its propagation to the other servers,
  opening the possibility that some transactions might be lost in
  the switch to a backup server, and that load balanced servers
  might return slightly stale results.  Asynchronous communication
  is used when synchronous would be too slow.
 </para>

 <para>
  Solutions can also be categorized by their granularity.  Some solutions
  can deal only with an entire database server, while others allow control
  at the per-table or per-database level.
 </para>

 <para>
  Performance must be considered in any choice.  There is usually a
  trade-off between functionality and
  performance.  For example, a fully synchronous solution over a slow
  network might cut performance by more than half, while an asynchronous
  one might have a minimal performance impact.
 </para>

 <para>
  The remainder of this section outlines various failover, replication,
  and load balancing solutions.  A <ulink
  url="http://www.postgres-r.org/documentation/terms">glossary</ulink> is
  also available.
 </para>

 <sect1 id="different-replication-solutions">
 <title>Comparison of different solutions</title>

 <variablelist>

  <varlistentry>
   <term>Shared Disk Failover</term>
   <listitem>

    <para>
     Shared disk failover avoids synchronization overhead by having only one
     copy of the database.  It uses a single disk array that is shared by
     multiple servers.  If the main database server fails, the standby server
     is able to mount and start the database as though it were recovering from
     a database crash.  This allows rapid failover with no data loss.
    </para>

    <para>
     Shared hardware functionality is common in network storage devices.
     Using a network file system is also possible, though care must be
     taken that the file system has full <acronym>POSIX</> behavior (see <xref
     linkend="creating-cluster-nfs">).  One significant limitation of this
     method is that if the shared disk array fails or becomes corrupt, the
     primary and standby servers are both nonfunctional.  Another issue is
     that the standby server should never access the shared storage while
     the primary server is running.
    </para>

   </listitem>
  </varlistentry>

  <varlistentry>
   <term>File System (Block-Device) Replication</term>
   <listitem>

    <para>
     A modified version of shared hardware functionality is file system
     replication, where all changes to a file system are mirrored to a file
     system residing on another computer.  The only restriction is that
     the mirroring must be done in a way that ensures the standby server
     has a consistent copy of the file system &mdash; specifically, writes
     to the standby must be done in the same order as those on the master.
     <productname>DRBD</> is a popular file system replication solution
     for Linux.
    </para>

<!--
https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html

Oracle RAC is a shared disk approach and just send cache invalidations
to other nodes but not actual data. As the disk is shared, data is
only committed once to disk and there is a distributed locking
protocol to make nodes agree on a serializable transactional order.
-->

   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Warm and Hot Standby Using Point-In-Time Recovery (<acronym>PITR</>)</term>
   <listitem>

    <para>
     Warm and hot standby servers can be kept current by reading a
     stream of write-ahead log (<acronym>WAL</>)
     records.  If the main server fails, the standby contains
     almost all of the data of the main server, and can be quickly
     made the new master database server.  This is asynchronous and
     can only be done for the entire database server.
    </para>
    <para>
     A PITR standby server can be implemented using file-based log shipping
     (<xref linkend="warm-standby">) or streaming replication (see
     <xref linkend="streaming-replication">), or a combination of both. For
     information on hot standby, see <xref linkend="hot-standby">.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Trigger-Based Master-Slave Replication</term>
   <listitem>

    <para>
     A master-slave replication setup sends all data modification
     queries to the master server.  The master server asynchronously
     sends data changes to the slave server.  The slave can answer
     read-only queries while the master server is running.  The
     slave server is ideal for data warehouse queries.
    </para>

    <para>
     <productname>Slony-I</> is an example of this type of replication, with per-table
     granularity, and support for multiple slaves.  Because it
     updates the slave server asynchronously (in batches), there is
     possible data loss during fail over.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Statement-Based Replication Middleware</term>
   <listitem>

    <para>
     With statement-based replication middleware, a program intercepts
     every SQL query and sends it to one or all servers.  Each server
     operates independently.  Read-write queries are sent to all servers,
     while read-only queries can be sent to just one server, allowing
     the read workload to be distributed.
    </para>

    <para>
     If queries are simply broadcast unmodified, functions like
     <function>random()</>, <function>CURRENT_TIMESTAMP</>, and
     sequences can have different values on different servers.
     This is because each server operates independently, and because
     SQL queries are broadcast (and not actual modified rows).  If
     this is unacceptable, either the middleware or the application
     must query such values from a single server and then use those
     values in write queries.  Another option is to use this replication
     option with a traditional master-slave setup, i.e. data modification
     queries are sent only to the master and are propagated to the
     slaves via master-slave replication, not by the replication
     middleware.  Care must also be taken that all
     transactions either commit or abort on all servers, perhaps
     using two-phase commit (<xref linkend="sql-prepare-transaction">
     and <xref linkend="sql-commit-prepared">.
     <productname>Pgpool-II</> and <productname>Sequoia</> are examples of
     this type of replication.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Asynchronous Multimaster Replication</term>
   <listitem>

    <para>
     For servers that are not regularly connected, like laptops or
     remote servers, keeping data consistent among servers is a
     challenge.  Using asynchronous multimaster replication, each
     server works independently, and periodically communicates with
     the other servers to identify conflicting transactions.  The
     conflicts can be resolved by users or conflict resolution rules.
     Bucardo is an example of this type of replication.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Synchronous Multimaster Replication</term>
   <listitem>

    <para>
     In synchronous multimaster replication, each server can accept
     write requests, and modified data is transmitted from the
     original server to every other server before each transaction
     commits.  Heavy write activity can cause excessive locking,
     leading to poor performance.  In fact, write performance is
     often worse than that of a single server.  Read requests can
     be sent to any server.  Some implementations use shared disk
     to reduce the communication overhead.  Synchronous multimaster
     replication is best for mostly read workloads, though its big
     advantage is that any server can accept write requests &mdash;
     there is no need to partition workloads between master and
     slave servers, and because the data changes are sent from one
     server to another, there is no problem with non-deterministic
     functions like <function>random()</>.
    </para>

    <para>
     <productname>PostgreSQL</> does not offer this type of replication,
     though <productname>PostgreSQL</> two-phase commit (<xref
     linkend="sql-prepare-transaction"> and <xref
     linkend="sql-commit-prepared">)
     can be used to implement this in application code or middleware.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Commercial Solutions</term>
   <listitem>

    <para>
     Because <productname>PostgreSQL</> is open source and easily
     extended, a number of companies have taken <productname>PostgreSQL</>
     and created commercial closed-source solutions with unique
     failover, replication, and load balancing capabilities.
    </para>
   </listitem>
  </varlistentry>

 </variablelist>

 <para>
  <xref linkend="high-availability-matrix"> summarizes
  the capabilities of the various solutions listed above.
 </para>

 <table id="high-availability-matrix">
  <title>High Availability, Load Balancing, and Replication Feature Matrix</title>
  <tgroup cols="8">
   <thead>
    <row>
     <entry>Feature</entry>
     <entry>Shared Disk Failover</entry>
     <entry>File System Replication</entry>
     <entry>Hot/Warm Standby Using PITR</entry>
     <entry>Trigger-Based Master-Slave Replication</entry>
     <entry>Statement-Based Replication Middleware</entry>
     <entry>Asynchronous Multimaster Replication</entry>
     <entry>Synchronous Multimaster Replication</entry>
    </row>
   </thead>

   <tbody>

    <row>
     <entry>Most Common Implementation</entry>
     <entry align="center">NAS</entry>
     <entry align="center">DRBD</entry>
     <entry align="center">PITR</entry>
     <entry align="center">Slony</entry>
     <entry align="center">pgpool-II</entry>
     <entry align="center">Bucardo</entry>
     <entry align="center"></entry>
    </row>

    <row>
     <entry>Communication Method</entry>
     <entry align="center">shared disk</entry>
     <entry align="center">disk blocks</entry>
     <entry align="center">WAL</entry>
     <entry align="center">table rows</entry>
     <entry align="center">SQL</entry>
     <entry align="center">table rows</entry>
     <entry align="center">table rows and row locks</entry>
    </row>

    <row>
     <entry>No special hardware required</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>Allows multiple master servers</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>No master server overhead</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
    </row>

    <row>
     <entry>No waiting for multiple servers</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
    </row>

    <row>
     <entry>Master failure will never lose data</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>Slaves accept read-only queries</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">Hot only</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>Per-table granularity</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>No conflict resolution necessary</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
    </row>

   </tbody>
  </tgroup>
 </table>

 <para>
  There are a few solutions that do not fit into the above categories:
 </para>

 <variablelist>

  <varlistentry>
   <term>Data Partitioning</term>
   <listitem>

    <para>
     Data partitioning splits tables into data sets.  Each set can
     be modified by only one server.  For example, data can be
     partitioned by offices, e.g., London and Paris, with a server
     in each office.  If queries combining London and Paris data
     are necessary, an application can query both servers, or
     master/slave replication can be used to keep a read-only copy
     of the other office's data on each server.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Multiple-Server Parallel Query Execution</term>
   <listitem>

    <para>
     Many of the above solutions allow multiple servers to handle multiple
     queries, but none allow a single query to use multiple servers to
     complete faster.  This solution allows multiple servers to work
     concurrently on a single query.  It is usually accomplished by
     splitting the data among servers and having each server execute its
     part of the query and return results to a central server where they
     are combined and returned to the user.  <productname>Pgpool-II</>
     has this capability.  Also, this can be implemented using the
     <productname>PL/Proxy</> toolset.
    </para>

   </listitem>
  </varlistentry>

 </variablelist>

 </sect1>


 <sect1 id="warm-standby">
 <title>Log-Shipping Standby Servers</title>


  <para>
   Continuous archiving can be used to create a <firstterm>high
   availability</> (HA) cluster configuration with one or more
   <firstterm>standby servers</> ready to take over operations if the
   primary server fails. This capability is widely referred to as
   <firstterm>warm standby</> or <firstterm>log shipping</>.
  </para>

  <para>
   The primary and standby server work together to provide this capability,
   though the servers are only loosely coupled. The primary server operates
   in continuous archiving mode, while each standby server operates in
   continuous recovery mode, reading the WAL files from the primary. No
   changes to the database tables are required to enable this capability,
   so it offers low administration overhead compared to some other
   replication solutions. This configuration also has relatively low
   performance impact on the primary server.
  </para>

  <para>
   Directly moving WAL records from one database server to another
   is typically described as log shipping. <productname>PostgreSQL</>
   implements file-based log shipping, which means that WAL records are
   transferred one file (WAL segment) at a time. WAL files (16MB) can be
   shipped easily and cheaply over any distance, whether it be to an
   adjacent system, another system at the same site, or another system on
   the far side of the globe. The bandwidth required for this technique
   varies according to the transaction rate of the primary server.
   Record-based log shipping is also possible with streaming replication
   (see <xref linkend="streaming-replication">).
  </para>

  <para>
   It should be noted that the log shipping is asynchronous, i.e., the WAL
   records are shipped after transaction commit. As a result, there is a
   window for data loss should the primary server suffer a catastrophic
   failure; transactions not yet shipped will be lost.  The size of the
   data loss window in file-based log shipping can be limited by use of the
   <varname>archive_timeout</varname> parameter, which can be set as low
   as a few seconds.  However such a low setting will
   substantially increase the bandwidth required for file shipping.
   If you need a window of less than a minute or so, consider using
   streaming replication (see <xref linkend="streaming-replication">).
  </para>

  <para>
   Recovery performance is sufficiently good that the standby will
   typically be only moments away from full
   availability once it has been activated. As a result, this is called
   a warm standby configuration which offers high
   availability. Restoring a server from an archived base backup and
   rollforward will take considerably longer, so that technique only
   offers a solution for disaster recovery, not high availability.
   A standby server can also be used for read-only queries, in which case
   it is called a Hot Standby server. See <xref linkend="hot-standby"> for
   more information.
  </para>

  <indexterm zone="high-availability">
   <primary>warm standby</primary>
  </indexterm>

  <indexterm zone="high-availability">
   <primary>PITR standby</primary>
  </indexterm>

  <indexterm zone="high-availability">
   <primary>standby server</primary>
  </indexterm>

  <indexterm zone="high-availability">
   <primary>log shipping</primary>
  </indexterm>

  <indexterm zone="high-availability">
   <primary>witness server</primary>
  </indexterm>

  <indexterm zone="high-availability">
   <primary>STONITH</primary>
  </indexterm>

  <sect2 id="standby-planning">
   <title>Planning</title>

   <para>
    It is usually wise to create the primary and standby servers
    so that they are as similar as possible, at least from the
    perspective of the database server.  In particular, the path names
    associated with tablespaces will be passed across unmodified, so both
    primary and standby servers must have the same mount paths for
    tablespaces if that feature is used.  Keep in mind that if
    <xref linkend="sql-createtablespace">
    is executed on the primary, any new mount point needed for it must
    be created on the primary and all standby servers before the command
    is executed. Hardware need not be exactly the same, but experience shows
    that maintaining two identical systems is easier than maintaining two
    dissimilar ones over the lifetime of the application and system.
    In any case the hardware architecture must be the same &mdash; shipping
    from, say, a 32-bit to a 64-bit system will not work.
   </para>

   <para>
    In general, log shipping between servers running different major
    <productname>PostgreSQL</> release
    levels is not possible. It is the policy of the PostgreSQL Global
    Development Group not to make changes to disk formats during minor release
    upgrades, so it is likely that running different minor release levels
    on primary and standby servers will work successfully. However, no
    formal support for that is offered and you are advised to keep primary
    and standby servers at the same release level as much as possible.
    When updating to a new minor release, the safest policy is to update
    the standby servers first &mdash; a new minor release is more likely
    to be able to read WAL files from a previous minor release than vice
    versa.
   </para>

  </sect2>

  <sect2 id="standby-server-operation">
   <title>Standby Server Operation</title>

   <para>
    In standby mode, the server continuously applies WAL received from the
    master server. The standby server can read WAL from a WAL archive
    (see <varname>restore_command</>) or directly from the master
    over a TCP connection (streaming replication). The standby server will
    also attempt to restore any WAL found in the standby cluster's
    <filename>pg_xlog</> directory. That typically happens after a server
    restart, when the standby replays again WAL that was streamed from the
    master before the restart, but you can also manually copy files to
    <filename>pg_xlog</> at any time to have them replayed.
   </para>

   <para>
    At startup, the standby begins by restoring all WAL available in the
    archive location, calling <varname>restore_command</>. Once it
    reaches the end of WAL available there and <varname>restore_command</>
    fails, it tries to restore any WAL available in the pg_xlog directory.
    If that fails, and streaming replication has been configured, the
    standby tries to connect to the primary server and start streaming WAL
    from the last valid record found in archive or pg_xlog. If that fails
    or streaming replication is not configured, or if the connection is
    later disconnected, the standby goes back to step 1 and tries to
    restore the file from the archive again. This loop of retries from the
    archive, pg_xlog, and via streaming replication goes on until the server
    is stopped or failover is triggered by a trigger file.
   </para>

   <para>
    Standby mode is exited and the server switches to normal operation,
    when a trigger file is found (<varname>trigger_file</>). Before failover,
    any WAL immediately available in the archive or in pg_xlog will be
    restored, but no attempt is made to connect to the master.
   </para>
  </sect2>

  <sect2 id="preparing-master-for-standby">
   <title>Preparing the Master for Standby Servers</title>

   <para>
    Set up continuous archiving on the primary to an archive directory
    accessible from the standby, as described
    in <xref linkend="continuous-archiving">. The archive location should be
    accessible from the standby even when the master is down, i.e. it should
    reside on the standby server itself or another trusted server, not on
    the master server.
   </para>

   <para>
    If you want to use streaming replication, set up authentication on the
    primary server to allow replication connections from the standby
    server(s); that is, provide a suitable entry or entries in
    <filename>pg_hba.conf</> with the database field set to
    <literal>replication</>.  Also ensure <varname>max_wal_senders</> is set
    to a sufficiently large value in the configuration file of the primary
    server.
   </para>

   <para>
    Take a base backup as described in <xref linkend="backup-base-backup">
    to bootstrap the standby server.
   </para>
  </sect2>

  <sect2 id="standby-server-setup">
   <title>Setting Up a Standby Server</title>

   <para>
    To set up the standby server, restore the base backup taken from primary
    server (see <xref linkend="backup-pitr-recovery">). Create a recovery
    command file <filename>recovery.conf</> in the standby's cluster data
    directory, and turn on <varname>standby_mode</>. Set
    <varname>restore_command</> to a simple command to copy files from
    the WAL archive.
   </para>

   <note>
     <para>
     Do not use pg_standby or similar tools with the built-in standby mode
     described here. <varname>restore_command</> should return immediately
     if the file does not exist; the server will retry the command again if
     necessary. See <xref linkend="log-shipping-alternative">
     for using tools like pg_standby.
    </para>
   </note>

   <para>
     If you want to use streaming replication, fill in
     <varname>primary_conninfo</> with a libpq connection string, including
     the host name (or IP address) and any additional details needed to
     connect to the primary server. If the primary needs a password for
     authentication, the password needs to be specified in
     <varname>primary_conninfo</> as well.
   </para>

   <para>
    You can use <varname>restartpoint_command</> to prune the archive of
    files no longer needed by the standby.
   </para>

   <para>
    If you're setting up the standby server for high availability purposes,
    set up WAL archiving, connections and authentication like the primary
    server, because the standby server will work as a primary server after
    failover. You will also need to set <varname>trigger_file</> to make
    it possible to fail over.
    If you're setting up the standby server for reporting
    purposes, with no plans to fail over to it, <varname>trigger_file</>
    is not required.
   </para>

   <para>
    A simple example of a <filename>recovery.conf</> is:
<programlisting>
standby_mode = 'on'
primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
restore_command = 'cp /path/to/archive/%f %p'
trigger_file = '/path/to/trigger_file'
</programlisting>
   </para>

   <para>
    You can have any number of standby servers, but if you use streaming
    replication, make sure you set <varname>max_wal_senders</> high enough in
    the primary to allow them to be connected simultaneously.
   </para>

   <para>
    If you're using a WAL archive, its size can be minimized using
    the <varname>restartpoint_command</> option to remove files that are
    no longer required by the standby server. Note however, that if you're
    using the archive for backup purposes, you need to retain files needed
    to recover from at least the latest base backup, even if they're no
    longer needed by the standby.
   </para>
  </sect2>

  <sect2 id="streaming-replication">
   <title>Streaming Replication</title>

   <indexterm zone="high-availability">
    <primary>Streaming Replication</primary>
   </indexterm>

   <para>
    Streaming replication allows a standby server to stay more up-to-date
    than is possible with file-based log shipping. The standby connects
    to the primary, which streams WAL records to the standby as they're
    generated, without waiting for the WAL file to be filled.
   </para>

   <para>
    Streaming replication is asynchronous, so there is still a small delay
    between committing a transaction in the primary and for the changes to
    become visible in the standby. The delay is however much smaller than with
    file-based log shipping, typically under one second assuming the standby
    is powerful enough to keep up with the load. With streaming replication,
    <varname>archive_timeout</> is not required to reduce the data loss
    window.
   </para>

   <para>
    If you use streaming replication without file-based continuous
    archiving, you have to set <varname>wal_keep_segments</> in the master
    to a value high enough to ensure that old WAL segments are not recycled
    too early, while the standby might still need them to catch up. If the
    standby falls behind too much, it needs to be reinitialized from a new
    base backup. If you set up a WAL archive that's accessible from the
    standby, wal_keep_segments is not required as the standby can always
    use the archive to catch up.
   </para>

   <para>
    To use streaming replication, set up a file-based log-shipping standby
    server as described in <xref linkend="warm-standby">. The step that
    turns a file-based log-shipping standby into streaming replication
    standby is setting <varname>primary_conninfo</> setting in the
    <filename>recovery.conf</> file to point to the primary server. Set
    <xref linkend="guc-listen-addresses"> and authentication options
    (see <filename>pg_hba.conf</>) on the primary so that the standby server
    can connect to the <literal>replication</> pseudo-database on the primary
    server (see <xref linkend="streaming-replication-authentication">).
   </para>

   <para>
    On systems that support the keepalive socket option, setting
    <xref linkend="guc-tcp-keepalives-idle">,
    <xref linkend="guc-tcp-keepalives-interval"> and
    <xref linkend="guc-tcp-keepalives-count"> helps the primary promptly
    notice a broken connection.
   </para>

   <para>
    Set the maximum number of concurrent connections from the standby servers
    (see <xref linkend="guc-max-wal-senders"> for details).
   </para>

   <para>
    When the standby is started and <varname>primary_conninfo</> is set
    correctly, the standby will connect to the primary after replaying all
    WAL files available in the archive. If the connection is established
    successfully, you will see a walreceiver process in the standby, and
    a corresponding walsender process in the primary.
   </para>

   <sect3 id="streaming-replication-authentication">
    <title>Authentication</title>
    <para>
     It is very important that the access privileges for replication be set up
     so that only trusted users can read the WAL stream, because it is
     easy to extract privileged information from it.  Standby servers must
     authenticate to the primary as a superuser account.
     So a role with the <literal>SUPERUSER</> and <literal>LOGIN</>
     privileges needs to be created on the primary.
    </para>
    <para>
     Client authentication for replication is controlled by a
     <filename>pg_hba.conf</> record specifying <literal>replication</> in the
     <replaceable>database</> field. For example, if the standby is running on
     host IP <literal>192.168.1.100</> and the superuser's name for replication
     is <literal>foo</>, the administrator can add the following line to the
     <filename>pg_hba.conf</> file on the primary:

<programlisting>
# Allow the user "foo" from host 192.168.1.100 to connect to the primary
# as a replication standby if the user's password is correctly supplied.
#
# TYPE  DATABASE        USER            CIDR-ADDRESS            METHOD
host    replication     foo             192.168.1.100/32        md5
</programlisting>
    </para>
    <para>
     The host name and port number of the primary, connection user name,
     and password are specified in the <filename>recovery.conf</> file or
     the corresponding environment variable on the standby.
     For example, if the primary is running on host IP <literal>192.168.1.50</>,
     port <literal>5432</literal>, the superuser's name for replication is
     <literal>foo</>, and the password is <literal>foopass</>, the administrator
     can add the following line to the <filename>recovery.conf</> file on the
     standby:

<programlisting>
# The standby connects to the primary that is running on host 192.168.1.50
# and port 5432 as the user "foo" whose password is "foopass".
primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</programlisting>
    </para>
   </sect3>

   <sect3 id="streaming-replication-monitoring">
    <title>Monitoring</title>
    <para>
     An important health indicator of streaming replication is the amount
     of WAL records generated in the primary, but not yet applied in the
     standby. You can calculate this lag by comparing the current WAL write
     location on the primary with the last WAL location received by the
     standby. They can be retrieved using
     <function>pg_current_xlog_location</> on the primary and the
     <function>pg_last_xlog_receive_location</> on the standby,
     respectively (see <xref linkend="functions-admin-backup-table"> and
     <xref linkend="functions-recovery-info-table"> for details).
     The last WAL receive location in the standby is also displayed in the
     process status of the WAL receiver process, displayed using the
     <command>ps</> command (see <xref linkend="monitoring-ps"> for details).
    </para>
   </sect3>

  </sect2>
  </sect1>

  <sect1 id="warm-standby-failover">
   <title>Failover</title>

   <para>
    If the primary server fails then the standby server should begin
    failover procedures.
   </para>

   <para>
    If the standby server fails then no failover need take place. If the
    standby server can be restarted, even some time later, then the recovery
    process can also be restarted immediately, taking advantage of
    restartable recovery. If the standby server cannot be restarted, then a
    full new standby server instance should be created.
   </para>

   <para>
    If the primary server fails and the standby server becomes the
    new primary, and then the old primary restarts, you must have
    a mechanism for informing the old primary that it is no longer the primary. This is
    sometimes known as <acronym>STONITH</> (Shoot The Other Node In The Head), which is
    necessary to avoid situations where both systems think they are the
    primary, which will lead to confusion and ultimately data loss.
   </para>

   <para>
    Many failover systems use just two systems, the primary and the standby,
    connected by some kind of heartbeat mechanism to continually verify the
    connectivity between the two and the viability of the primary. It is
    also possible to use a third system (called a witness server) to prevent
    some cases of inappropriate failover, but the additional complexity
    might not be worthwhile unless it is set up with sufficient care and
    rigorous testing.
   </para>

   <para>
    <productname>PostgreSQL</productname> does not provide the system
    software required to identify a failure on the primary and notify
    the standby database server.  Many such tools exist and are well
    integrated with the operating system facilities required for
    successful failover, such as IP address migration.
   </para>

   <para>
    Once failover to the standby occurs, there is only a
    single server in operation. This is known as a degenerate state.
    The former standby is now the primary, but the former primary is down
    and might stay down.  To return to normal operation, a standby server
    must be recreated,
    either on the former primary system when it comes up, or on a third,
    possibly new, system. Once complete the primary and standby can be
    considered to have switched roles. Some people choose to use a third
    server to provide backup for the new primary until the new standby
    server is recreated,
    though clearly this complicates the system configuration and
    operational processes.
   </para>

   <para>
    So, switching from primary to standby server can be fast but requires
    some time to re-prepare the failover cluster. Regular switching from
    primary to standby is useful, since it allows regular downtime on
    each system for maintenance. This also serves as a test of the
    failover mechanism to ensure that it will really work when you need it.
    Written administration procedures are advised.
   </para>

   <para>
    To trigger failover of a log-shipping standby server, create a trigger
    file with the filename and path specified by the <varname>trigger_file</>
    setting in <filename>recovery.conf</>. If <varname>trigger_file</> is
    not given, there is no way to exit recovery in the standby and promote
    it to a master. That can be useful for e.g reporting servers that are
    only used to offload read-only queries from the primary, not for high
    availability purposes.
   </para>
  </sect1>

  <sect1 id="log-shipping-alternative">
   <title>Alternative method for log shipping</title>

   <para>
    An alternative to the built-in standby mode described in the previous
    sections is to use a <varname>restore_command</> that polls the archive location.
    This was the only option available in versions 8.4 and below. In this
    setup, set <varname>standby_mode</> off, because you are implementing
    the polling required for standby operation yourself. See
    contrib/pg_standby (<xref linkend="pgstandby">) for a reference
    implementation of this.
   </para>

   <para>
    Note that in this mode, the server will apply WAL one file at a
    time, so if you use the standby server for queries (see Hot Standby),
    there is a delay between an action in the master and when the
    action becomes visible in the standby, corresponding the time it takes
    to fill up the WAL file. <varname>archive_timeout</> can be used to make that delay
    shorter. Also note that you can't combine streaming replication with
    this method.
   </para>

   <para>
    The operations that occur on both primary and standby servers are
    normal continuous archiving and recovery tasks. The only point of
    contact between the two database servers is the archive of WAL files
    that both share: primary writing to the archive, standby reading from
    the archive. Care must be taken to ensure that WAL archives from separate
    primary servers do not become mixed together or confused. The archive
    need not be large if it is only required for standby operation.
   </para>

   <para>
    The magic that makes the two loosely coupled servers work together is
    simply a <varname>restore_command</> used on the standby that,
    when asked for the next WAL file, waits for it to become available from
    the primary. The <varname>restore_command</> is specified in the
    <filename>recovery.conf</> file on the standby server. Normal recovery
    processing would request a file from the WAL archive, reporting failure
    if the file was unavailable.  For standby processing it is normal for
    the next WAL file to be unavailable, so the standby must wait for
    it to appear. For files ending in <literal>.backup</> or
    <literal>.history</> there is no need to wait, and a non-zero return
    code must be returned. A waiting <varname>restore_command</> can be
    written as a custom script that loops after polling for the existence of
    the next WAL file. There must also be some way to trigger failover, which
    should interrupt the <varname>restore_command</>, break the loop and
    return a file-not-found error to the standby server. This ends recovery
    and the standby will then come up as a normal server.
   </para>

   <para>
    Pseudocode for a suitable <varname>restore_command</> is:
<programlisting>
triggered = false;
while (!NextWALFileReady() &amp;&amp; !triggered)
{
    sleep(100000L);         /* wait for ~0.1 sec */
    if (CheckForExternalTrigger())
        triggered = true;
}
if (!triggered)
        CopyWALFileForRecovery();
</programlisting>
   </para>

   <para>
    A working example of a waiting <varname>restore_command</> is provided
    as a <filename>contrib</> module named <application>pg_standby</>. It
    should be used as a reference on how to correctly implement the logic
    described above. It can also be extended as needed to support specific
    configurations and environments.
   </para>

   <para>
    The method for triggering failover is an important part of planning
    and design. One potential option is the <varname>restore_command</>
    command.  It is executed once for each WAL file, but the process
    running the <varname>restore_command</> is created and dies for
    each file, so there is no daemon or server process, and
    signals or a signal handler cannot be used. Therefore, the
    <varname>restore_command</> is not suitable to trigger failover.
    It is possible to use a simple timeout facility, especially if
    used in conjunction with a known <varname>archive_timeout</>
    setting on the primary. However, this is somewhat error prone
    since a network problem or busy primary server might be sufficient
    to initiate failover. A notification mechanism such as the explicit
    creation of a trigger file is ideal, if this can be arranged.
   </para>

  <sect2 id="warm-standby-config">
   <title>Implementation</title>

   <para>
    The short procedure for configuring a standby server using this alternative
    method is as follows. For
    full details of each step, refer to previous sections as noted.
    <orderedlist>
     <listitem>
      <para>
       Set up primary and standby systems as nearly identical as
       possible, including two identical copies of
       <productname>PostgreSQL</> at the same release level.
      </para>
     </listitem>
     <listitem>
      <para>
       Set up continuous archiving from the primary to a WAL archive
       directory on the standby server. Ensure that
       <xref linkend="guc-archive-mode">,
       <xref linkend="guc-archive-command"> and
       <xref linkend="guc-archive-timeout">
       are set appropriately on the primary
       (see <xref linkend="backup-archiving-wal">).
      </para>
     </listitem>
     <listitem>
      <para>
       Make a base backup of the primary server (see <xref
       linkend="backup-base-backup">), and load this data onto the standby.
      </para>
     </listitem>
     <listitem>
      <para>
       Begin recovery on the standby server from the local WAL
       archive, using a <filename>recovery.conf</> that specifies a
       <varname>restore_command</> that waits as described
       previously (see <xref linkend="backup-pitr-recovery">).
      </para>
     </listitem>
    </orderedlist>
   </para>

   <para>
    Recovery treats the WAL archive as read-only, so once a WAL file has
    been copied to the standby system it can be copied to tape at the same
    time as it is being read by the standby database server.
    Thus, running a standby server for high availability can be performed at
    the same time as files are stored for longer term disaster recovery
    purposes.
   </para>

   <para>
    For testing purposes, it is possible to run both primary and standby
    servers on the same system. This does not provide any worthwhile
    improvement in server robustness, nor would it be described as HA.
   </para>
  </sect2>

  <sect2 id="warm-standby-record">
   <title>Record-based Log Shipping</title>

   <para>
    It is also possible to implement record-based log shipping using this
    alternative method, though this requires custom development, and changes
    will still only become visible to hot standby queries after a full WAL
    file has been shipped.
   </para>

   <para>
    An external program can call the <function>pg_xlogfile_name_offset()</>
    function (see <xref linkend="functions-admin">)
    to find out the file name and the exact byte offset within it of
    the current end of WAL.  It can then access the WAL file directly
    and copy the data from the last known end of WAL through the current end
    over to the standby servers.  With this approach, the window for data
    loss is the polling cycle time of the copying program, which can be very
    small, and there is no wasted bandwidth from forcing partially-used
    segment files to be archived.  Note that the standby servers'
    <varname>restore_command</> scripts can only deal with whole WAL files,
    so the incrementally copied data is not ordinarily made available to
    the standby servers.  It is of use only when the primary dies &mdash;
    then the last partial WAL file is fed to the standby before allowing
    it to come up.  The correct implementation of this process requires
    cooperation of the <varname>restore_command</> script with the data
    copying program.
   </para>

   <para>
    Starting with <productname>PostgreSQL</> version 9.0, you can use
    streaming replication (see <xref linkend="streaming-replication">) to
    achieve the same benefits with less effort.
   </para>
  </sect2>
 </sect1>

 <sect1 id="hot-standby">
  <title>Hot Standby</title>

  <indexterm zone="high-availability">
   <primary>Hot Standby</primary>
  </indexterm>

   <para>
    Hot Standby is the term used to describe the ability to connect to
    the server and run read-only queries while the server is in archive
    recovery. This
    is useful for both log shipping replication and for restoring a backup
    to an exact state with great precision.
    The term Hot Standby also refers to the ability of the server to move
    from recovery through to normal operation while users continue running
    queries and/or keep their connections open.
   </para>

   <para>
    Running queries in recovery mode is similar to normal query operation,
    though there are several usage and administrative differences
    noted below.
   </para>

  <sect2 id="hot-standby-users">
   <title>User's Overview</title>

   <para>
    Users can connect to the database server while it is in recovery
    mode and perform read-only queries. Read-only access to system
    catalogs and views will also occur as normal.
   </para>

   <para>
    The data on the standby takes some time to arrive from the primary server
    so there will be a measurable delay between primary and standby. Running the
    same query nearly simultaneously on both primary and standby might therefore
    return differing results. We say that data on the standby is
    <firstterm>eventually consistent</firstterm> with the primary.
    Queries executed on the standby will be correct with regard to the transactions
    that had been recovered at the start of the query, or start of first statement
    in the case of serializable transactions. In comparison with the primary,
    the standby returns query results that could have been obtained on the primary
    at some moment in the past.
   </para>

   <para>
    When a transaction is started in recovery, the parameter
    <varname>transaction_read_only</> will be forced to be true, regardless of the
    <varname>default_transaction_read_only</> setting in <filename>postgresql.conf</>.
    It can't be manually set to false either. As a result, all transactions
    started during recovery will be limited to read-only actions. In all
    other ways, connected sessions will appear identical to sessions
    initiated during normal processing mode. There are no special commands
    required to initiate a connection so all interfaces
    work unchanged. After recovery finishes, the session
    will allow normal read-write transactions at the start of the next
    transaction, if these are requested.
   </para>

   <para>
    "Read-only" above means no writes to the permanent or temporary database
    tables.  There are no problems with queries that use transient sort and
    work files.
   </para>

   <para>
    The following actions are allowed:

    <itemizedlist>
     <listitem>
      <para>
       Query access - <command>SELECT</>, <command>COPY TO</> including views and
       <command>SELECT</> rules
      </para>
     </listitem>
     <listitem>
      <para>
       Cursor commands - <command>DECLARE</>, <command>FETCH</>, <command>CLOSE</>
      </para>
     </listitem>
     <listitem>
      <para>
       Parameters - <command>SHOW</>, <command>SET</>, <command>RESET</>
      </para>
     </listitem>
     <listitem>
      <para>
       Transaction management commands
        <itemizedlist>
         <listitem>
          <para>
           <command>BEGIN</>, <command>END</>, <command>ABORT</>, <command>START TRANSACTION</>
          </para>
         </listitem>
         <listitem>
          <para>
           <command>SAVEPOINT</>, <command>RELEASE</>, <command>ROLLBACK TO SAVEPOINT</>
          </para>
         </listitem>
         <listitem>
          <para>
           <command>EXCEPTION</> blocks and other internal subtransactions
          </para>
         </listitem>
        </itemizedlist>
      </para>
     </listitem>
     <listitem>
      <para>
       <command>LOCK TABLE</>, though only when explicitly in one of these modes:
       <literal>ACCESS SHARE</>, <literal>ROW SHARE</> or <literal>ROW EXCLUSIVE</>.
      </para>
     </listitem>
     <listitem>
      <para>
       Plans and resources - <command>PREPARE</>, <command>EXECUTE</>,
       <command>DEALLOCATE</>, <command>DISCARD</>
      </para>
     </listitem>
     <listitem>
      <para>
       Plugins and extensions - <command>LOAD</>
      </para>
     </listitem>
    </itemizedlist>
   </para>

   <para>
    These actions produce error messages:

    <itemizedlist>
     <listitem>
      <para>
       Data Manipulation Language (DML) - <command>INSERT</>,
       <command>UPDATE</>, <command>DELETE</>, <command>COPY FROM</>,
       <command>TRUNCATE</>.
       Note that there are no allowed actions that result in a trigger
       being executed during recovery.
      </para>
     </listitem>
     <listitem>
      <para>
       Data Definition Language (DDL) - <command>CREATE</>,
       <command>DROP</>, <command>ALTER</>, <command>COMMENT</>.
       This also applies to temporary tables also because currently their
       definition causes writes to catalog tables.
      </para>
     </listitem>
     <listitem>
      <para>
       <command>SELECT ... FOR SHARE | UPDATE</> which cause row locks to be written
      </para>
     </listitem>
     <listitem>
      <para>
       Rules on <command>SELECT</> statements that generate DML commands.
      </para>
     </listitem>
     <listitem>
      <para>
       <command>LOCK</> that explicitly requests a mode higher than <literal>ROW EXCLUSIVE MODE</>.
      </para>
     </listitem>
     <listitem>
      <para>
       <command>LOCK</> in short default form, since it requests <literal>ACCESS EXCLUSIVE MODE</>.
      </para>
     </listitem>
     <listitem>
      <para>
       Transaction management commands that explicitly set non-read-only state:
        <itemizedlist>
         <listitem>
          <para>
            <command>BEGIN READ WRITE</>,
            <command>START TRANSACTION READ WRITE</>
          </para>
         </listitem>
         <listitem>
          <para>
            <command>SET TRANSACTION READ WRITE</>,
            <command>SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE</>
          </para>
         </listitem>
         <listitem>
          <para>
           <command>SET transaction_read_only = off</>
          </para>
         </listitem>
        </itemizedlist>
      </para>
     </listitem>
     <listitem>
      <para>
       Two-phase commit commands - <command>PREPARE TRANSACTION</>,
       <command>COMMIT PREPARED</>, <command>ROLLBACK PREPARED</>
       because even read-only transactions need to write WAL in the
       prepare phase (the first phase of two phase commit).
      </para>
     </listitem>
     <listitem>
      <para>
       Sequence updates - <function>nextval()</>, <function>setval()</>
      </para>
     </listitem>
     <listitem>
      <para>
       LISTEN, UNLISTEN, NOTIFY
      </para>
     </listitem>
    </itemizedlist>
   </para>

   <para>
    Note that the current behavior of read only transactions when not in
    recovery is to allow the last two actions, so there are small and
    subtle differences in behavior between read-only transactions
    run on a standby and run during normal operation.
    It is possible that <command>LISTEN</>, <command>UNLISTEN</>,
    and temporary tables might be allowed in a future release.
   </para>

   <para>
    If failover or switchover occurs the database will switch to normal
    processing mode. Sessions will remain connected while the server
    changes mode. Current transactions will continue, though will remain
    read-only. After recovery is complete, it will be possible to initiate
    read-write transactions.
   </para>

   <para>
    Users will be able to tell whether their session is read-only by
    issuing <command>SHOW transaction_read_only</>.  In addition, a set of
    functions (<xref linkend="functions-recovery-info-table">) allow users to
    access information about the standby server. These allow you to write
    programs that are aware of the current state of the database. These
    can be used to monitor the progress of recovery, or to allow you to
    write complex programs that restore the database to particular states.
   </para>

   <para>
    In recovery, transactions will not be permitted to take any table lock
    higher than <literal>RowExclusiveLock</>. In addition, transactions may never assign
    a TransactionId and may never write WAL.
    Any <command>LOCK TABLE</> command that runs on the standby and requests
    a specific lock mode higher than <literal>ROW EXCLUSIVE MODE</> will be rejected.
   </para>

   <para>
    In general queries will not experience lock conflicts from the database
    changes made by recovery. This is because recovery follows normal
    concurrency control mechanisms, known as <acronym>MVCC</>. There are
    some types of change that will cause conflicts, covered in the following
    section.
   </para>
  </sect2>

  <sect2 id="hot-standby-conflict">
   <title>Handling query conflicts</title>

   <para>
    The primary and standby nodes are in many ways loosely connected. Actions
    on the primary will have an effect on the standby. As a result, there is
    potential for negative interactions or conflicts between them. The easiest
    conflict to understand is performance: if a huge data load is taking place
    on the primary then this will generate a similar stream of WAL records on the
    standby, so standby queries may contend for system resources, such as I/O.
   </para>

   <para>
    There are also additional types of conflict that can occur with Hot Standby.
    These conflicts are <emphasis>hard conflicts</> in the sense that queries
    might need to be cancelled and, in some cases, sessions disconnected to resolve them.
    The user is provided with several ways to handle these
    conflicts, though it is important to first understand the possible causes
    of conflicts:

      <itemizedlist>
       <listitem>
        <para>
         Access Exclusive Locks from primary node, including both explicit
         <command>LOCK</> commands and various <acronym>DDL</> actions
        </para>
       </listitem>
       <listitem>
        <para>
         Dropping tablespaces on the primary while standby queries are using
         those tablespaces for temporary work files (<varname>work_mem</> overflow)
        </para>
       </listitem>
       <listitem>
        <para>
         Dropping databases on the primary while users are connected to that
         database on the standby.
        </para>
       </listitem>
       <listitem>
        <para>
         The standby waiting longer than <varname>max_standby_delay</>
         to acquire a buffer cleanup lock.
        </para>
       </listitem>
       <listitem>
        <para>
         Early cleanup of data still visible to the current query's snapshot
        </para>
       </listitem>
      </itemizedlist>
   </para>

   <para>
    Some WAL redo actions will be for <acronym>DDL</> execution. These DDL
    actions are replaying changes that have already committed on the primary
    node, so they must not fail on the standby node. These DDL locks take
    priority and will automatically *cancel* any read-only transactions that
    get in their way, after a grace period. This is similar to the possibility
    of being canceled by the deadlock detector.  But in this case, the standby
    recovery process always wins, since the replayed actions must not fail.
    This also ensures that replication does not fall behind while waiting for a
    query to complete. This prioritization presumes that the standby exists
    primarily for high availability, and that adjusting the grace period
    will allow a sufficient guard against unexpected cancellation.
   </para>

   <para>
    An example of the above would be an administrator on the primary server
    running <command>DROP TABLE</> on a table that is currently being queried
    on the standby server.
    Clearly the query cannot continue if <command>DROP TABLE</>
    proceeds. If this situation occurred on the primary, the <command>DROP TABLE</>
    would wait until the query had finished. When <command>DROP TABLE</> is
    run on the primary, the primary doesn't have
    information about which queries are running on the standby, so it
    cannot wait for any of the standby queries. The WAL change records come through to the
    standby while the standby query is still running, causing a conflict.
   </para>

   <para>
    The most common reason for conflict between standby queries and WAL redo is
    "early cleanup". Normally, <productname>PostgreSQL</> allows cleanup of old
    row versions when there are no users who need to see them to ensure correct
    visibility of data (the heart of MVCC). If there is a standby query that has
    been running for longer than any query on the primary then it is possible
    for old row versions to be removed by either a vacuum or HOT. This will
    then generate WAL records that, if applied, would remove data on the
    standby that might <emphasis>potentially</> be required by the standby query.
    In more technical language, the primary's xmin horizon is later than
    the standby's xmin horizon, allowing dead rows to be removed.
   </para>

   <para>
    Experienced users should note that both row version cleanup and row version
    freezing will potentially conflict with recovery queries. Running a
    manual <command>VACUUM FREEZE</> is likely to cause conflicts even on tables
    with no updated or deleted rows.
   </para>

   <para>
    There are a number of choices for resolving query conflicts.  The default
    is to wait and hope the query finishes. The server will wait
    automatically until the lag between primary and standby is at most
    <xref linkend="guc-max-standby-delay"> (30 seconds by default).
    Once that grace period expires,
    one of the following actions is taken:

      <itemizedlist>
       <listitem>
        <para>
         If the conflict is caused by a lock, the conflicting standby
         transaction is cancelled immediately. If the transaction is
         idle-in-transaction, then the session is aborted instead.
         This behavior might change in the future.
        </para>
       </listitem>

       <listitem>
        <para>
         If the conflict is caused by cleanup records, the standby query is informed
         a conflict has occurred and that it must cancel itself to avoid the
         risk that it silently fails to read relevant data because
         that data has been removed. (This is regrettably similar to the
         much feared and iconic error message "snapshot too old"). Some cleanup
         records only conflict with older queries, while others
         can affect all queries.
        </para>

        <para>
         If cancellation does occur, the query and/or transaction can always
         be re-executed. The error is dynamic and will not necessarily reoccur
         if the query is executed again.
        </para>
       </listitem>
      </itemizedlist>
   </para>

   <para>
    Keep in mind that <varname>max_standby_delay</> is compared to the
    difference between the standby server's clock and the transaction
    commit timestamps read from the WAL log.  Thus, the grace period
    allowed to any one query on the standby is never more than
    <varname>max_standby_delay</>, and could be considerably less if the
    standby has already fallen behind as a result of waiting for previous
    queries to complete, or as a result of being unable to keep up with a
    heavy update load.
   </para>

   <caution>
    <para>
     Be sure that the primary and standby servers' clocks are kept in sync;
     otherwise the values compared to <varname>max_standby_delay</> will be
     erroneous, possibly leading to undesirable query cancellations.
     If the clocks are intentionally not in sync, or if there is a large
     propagation delay from primary to standby, it is advisable to set
     <varname>max_standby_delay</> to -1.  In any case the value should be
     larger than the largest expected clock skew between primary and standby.
    </para>
   </caution>

   <para>
    Users should be clear that tables that are regularly and heavily updated on the
    primary server will quickly cause cancellation of longer running queries on
    the standby. In those cases <varname>max_standby_delay</> can be
    considered similar to setting
    <varname>statement_timeout</>.
    </para>

   <para>
    Other remedial actions exist if the number of cancellations is unacceptable.
    The first option is to connect to the primary server and keep a query active
    for as long as needed to run queries on the standby. This guarantees that
    a WAL cleanup record is never generated and query conflicts do not occur,
    as described above. This could be done using <filename>contrib/dblink</>
    and <function>pg_sleep()</>, or via other mechanisms. If you do this, you
    should note that this will delay cleanup of dead rows on the primary by
    vacuum or HOT, and people might find this undesirable. However, remember
    that the primary and standby nodes are linked via the WAL, so the cleanup
    situation is no different from the case where the query ran on the primary
    node itself.  And you are still getting the benefit of off-loading the
    execution onto the standby. <varname>max_standby_delay</> should
    not be used in this case because delayed WAL files might already
    contain entries that invalidate the current snapshot.
   </para>

   <para>
    It is also possible to set <varname>vacuum_defer_cleanup_age</> on the primary
    to defer the cleanup of records by autovacuum, <command>VACUUM</>
    and HOT. This might allow
    more time for queries to execute before they are cancelled on the standby,
    without the need for setting a high <varname>max_standby_delay</>.
   </para>

   <para>
    Three-way deadlocks are possible between <literal>AccessExclusiveLocks</> arriving from
    the primary, cleanup WAL records that require buffer cleanup locks, and
    user requests that are waiting behind replayed <literal>AccessExclusiveLocks</>.
    Deadlocks are resolved automatically after <varname>deadlock_timeout</>
    seconds, though they are thought to be rare in practice.
   </para>

   <para>
    Dropping tablespaces or databases is discussed in the administrator's
    section since they are not typical user situations.
   </para>
  </sect2>

  <sect2 id="hot-standby-admin">
   <title>Administrator's Overview</title>

   <para>
    If <varname>hot_standby</> is turned <literal>on</> in
    <filename>postgresql.conf</> and there is a <filename>recovery.conf</>
    file present, the server will run in Hot Standby mode.
    However, it may take some time for Hot Standby connections to be allowed,
    because the server will not accept connections until it has completed
    sufficient recovery to provide a consistent state against which queries
    can run.  During this period,
    clients that attempt to connect will be refused with an error message.
    To confirm the server has come up, either loop trying to connect from
    the application, or look for these messages in the server logs:

<programlisting>
LOG:  entering standby mode

... then some time later ...

LOG:  consistent recovery state reached
LOG:  database system is ready to accept read only connections
</programlisting>

    Consistency information is recorded once per checkpoint on the primary.
    It is not possible to enable hot standby when reading WAL
    written during a period when <varname>wal_level</> was not set to
    <literal>hot_standby</> on the primary.  Reaching a consistent state can
    also be delayed in the presence of both of these conditions:

      <itemizedlist>
       <listitem>
        <para>
         A write transaction has more than 64 subtransactions
        </para>
       </listitem>
       <listitem>
        <para>
         Very long-lived write transactions
        </para>
       </listitem>
      </itemizedlist>

    If you are running file-based log shipping ("warm standby"), you might need
    to wait until the next WAL file arrives, which could be as long as the
    <varname>archive_timeout</> setting on the primary.
   </para>

   <para>
    The setting of some parameters on the standby will need reconfiguration
    if they have been changed on the primary. For these parameters,
    the value on the standby must
    be equal to or greater than the value on the primary. If these parameters
    are not set high enough then the standby will refuse to start.
    Higher values can then be supplied and the server
    restarted to begin recovery again.  These parameters are:

      <itemizedlist>
       <listitem>
        <para>
         <varname>max_connections</>
        </para>
       </listitem>
       <listitem>
        <para>
         <varname>max_prepared_transactions</>
        </para>
       </listitem>
       <listitem>
        <para>
         <varname>max_locks_per_transaction</>
        </para>
       </listitem>
      </itemizedlist>
   </para>

   <para>
    It is important that the administrator consider the appropriate setting
    of <varname>max_standby_delay</>,
    which can be set in <filename>postgresql.conf</>.
    There is no optimal setting, so it should be set according to business
    priorities. For example if the server is primarily tasked as a High
    Availability server, then you may wish to lower
    <varname>max_standby_delay</> or even set it to zero, though that is a
    very aggressive setting. If the standby server is tasked as an additional
    server for decision support queries then it might be acceptable to set this
    to a value of many hours.  It is also possible to set
    <varname>max_standby_delay</> to -1 which means wait forever for queries
    to complete; this will be useful when performing
    an archive recovery from a backup.
   </para>

   <para>
    Transaction status "hint bits" written on the primary are not WAL-logged,
    so data on the standby will likely re-write the hints again on the standby.
    Thus, the standby server will still perform disk writes even though
    all users are read-only; no changes occur to the data values
    themselves.  Users will still write large sort temporary files and
    re-generate relcache info files, so no part of the database
    is truly read-only during hot standby mode.
    Note also that writes to remote databases using
    <application>dblink</application> module, and other operations outside the
    database using PL functions will still be possible, even though the
    transaction is read-only locally.
   </para>

   <para>
    The following types of administration commands are not accepted
    during recovery mode:

      <itemizedlist>
       <listitem>
        <para>
         Data Definition Language (DDL) - e.g. <command>CREATE INDEX</>
        </para>
       </listitem>
       <listitem>
        <para>
         Privilege and Ownership - <command>GRANT</>, <command>REVOKE</>,
         <command>REASSIGN</>
        </para>
       </listitem>
       <listitem>
        <para>
         Maintenance commands - <command>ANALYZE</>, <command>VACUUM</>,
         <command>CLUSTER</>, <command>REINDEX</>
        </para>
       </listitem>
      </itemizedlist>
   </para>

   <para>
    Again, note that some of these commands are actually allowed during
    "read only" mode transactions on the primary.
   </para>

   <para>
    As a result, you cannot create additional indexes that exist solely
    on the standby, nor statistics that exist solely on the standby.
    If these administration commands are needed, they should be executed
    on the primary, and eventually those changes will propagate to the
    standby.
   </para>

   <para>
    <function>pg_cancel_backend()</> will work on user backends, but not the
    Startup process, which performs recovery. <structname>pg_stat_activity</structname> does not
    show an entry for the Startup process, nor do recovering transactions
    show as active. As a result, <structname>pg_prepared_xacts</structname> is always empty during
    recovery. If you wish to resolve in-doubt prepared transactions,
    view <literal>pg_prepared_xacts</> on the primary and issue commands to
    resolve transactions there.
   </para>

   <para>
    <structname>pg_locks</structname> will show locks held by backends,
    as normal. <structname>pg_locks</structname> also shows
    a virtual transaction managed by the Startup process that owns all
    <literal>AccessExclusiveLocks</> held by transactions being replayed by recovery.
    Note that the Startup process does not acquire locks to
    make database changes, and thus locks other than <literal>AccessExclusiveLocks</>
    do not show in <structname>pg_locks</structname> for the Startup
    process; they are just presumed to exist.
   </para>

   <para>
    The <productname>Nagios</> plugin <productname>check_pgsql</> will
    work, because the simple information it checks for exists.
    The <productname>check_postgres</> monitoring script will also work,
    though some reported values could give different or confusing results.
    For example, last vacuum time will not be maintained, since no
    vacuum occurs on the standby.  Vacuums running on the primary
    do still send their changes to the standby.
   </para>

   <para>
    WAL file control commands will not work during recovery,
    e.g. <function>pg_start_backup</>, <function>pg_switch_xlog</> etc.
   </para>

   <para>
    Dynamically loadable modules work, including <structname>pg_stat_statements</>.
   </para>

   <para>
    Advisory locks work normally in recovery, including deadlock detection.
    Note that advisory locks are never WAL logged, so it is impossible for
    an advisory lock on either the primary or the standby to conflict with WAL
    replay. Nor is it possible to acquire an advisory lock on the primary
    and have it initiate a similar advisory lock on the standby. Advisory
    locks relate only to the server on which they are acquired.
   </para>

   <para>
    Trigger-based replication systems such as <productname>Slony</>,
    <productname>Londiste</> and <productname>Bucardo</> won't run on the
    standby at all, though they will run happily on the primary server as
    long as the changes are not sent to standby servers to be applied.
    WAL replay is not trigger-based so you cannot relay from the
    standby to any system that requires additional database writes or
    relies on the use of triggers.
   </para>

   <para>
    New oids cannot be assigned, though some <acronym>UUID</> generators may still
    work as long as they do not rely on writing new status to the database.
   </para>

   <para>
    Currently, temporary table creation is not allowed during read only
    transactions, so in some cases existing scripts will not run correctly.
    This restriction might be relaxed in a later release. This is
    both a SQL Standard compliance issue and a technical issue.
   </para>

   <para>
    <command>DROP TABLESPACE</> can only succeed if the tablespace is empty.
    Some standby users may be actively using the tablespace via their
    <varname>temp_tablespaces</> parameter. If there are temporary files in the
    tablespace, all active queries are cancelled to ensure that temporary
    files are removed, so the tablespace can be removed and WAL replay
    can continue.
   </para>

   <para>
    Running <command>DROP DATABASE</>, <command>ALTER DATABASE ... SET TABLESPACE</>,
    or <command>ALTER DATABASE ... RENAME</> on primary will generate a log message
    that will cause all users connected to that database on the standby to be
    forcibly disconnected. This action occurs immediately, whatever the setting of
    <varname>max_standby_delay</>.
   </para>

   <para>
    In normal (non-recovery) mode, if you issue <command>DROP USER</> or <command>DROP ROLE</>
    for a role with login capability while that user is still connected then
    nothing happens to the connected user - they remain connected. The user cannot
    reconnect however. This behavior applies in recovery also, so a
    <command>DROP USER</> on the primary does not disconnect that user on the standby.
   </para>

   <para>
    The statistics collector is active during recovery. All scans, reads, blocks,
    index usage, etc., will be recorded normally on the standby. Replayed
    actions will not duplicate their effects on primary, so replaying an
    insert will not increment the Inserts column of pg_stat_user_tables.
    The stats file is deleted at the start of recovery, so stats from primary
    and standby will differ; this is considered a feature, not a bug.
   </para>

   <para>
    Autovacuum is not active during recovery, it will start normally at the
    end of recovery.
   </para>

   <para>
    The background writer is active during recovery and will perform
    restartpoints (similar to checkpoints on the primary) and normal block
    cleaning activities. This can include updates of the hint bit
    information stored on the standby server.
    The <command>CHECKPOINT</> command is accepted during recovery,
    though it performs a restartpoint rather than a new checkpoint.
   </para>
  </sect2>

  <sect2 id="hot-standby-parameters">
   <title>Hot Standby Parameter Reference</title>

   <para>
    Various parameters have been mentioned above in
    <xref linkend="hot-standby-admin">
    and <xref linkend="hot-standby-conflict">.
   </para>

   <para>
    On the primary, parameters <xref linkend="guc-wal-level"> and
    <xref linkend="guc-vacuum-defer-cleanup-age"> can be used.
    <xref linkend="guc-max-standby-delay"> has no effect if set on the primary.
   </para>

   <para>
    On the standby, parameters <xref linkend="guc-hot-standby"> and
    <xref linkend="guc-max-standby-delay"> can be used.
    <xref linkend="guc-vacuum-defer-cleanup-age"> has no effect during
    recovery.
   </para>
  </sect2>

  <sect2 id="hot-standby-caveats">
   <title>Caveats</title>

   <para>
    There are several limitations of Hot Standby.
    These can and probably will be fixed in future releases:

  <itemizedlist>
   <listitem>
    <para>
     Operations on hash indexes are not presently WAL-logged, so
     replay will not update these indexes.  Hash indexes will not be
     used for query plans during recovery.
    </para>
   </listitem>
   <listitem>
    <para>
     Full knowledge of running transactions is required before snapshots
     can be taken. Transactions that use large numbers of subtransactions
     (currently greater than 64) will delay the start of read only
     connections until the completion of the longest running write transaction.
     If this situation occurs, explanatory messages will be sent to the server log.
    </para>
   </listitem>
   <listitem>
    <para>
     Valid starting points for standby queries are generated at each
     checkpoint on the master. If the standby is shut down while the master
     is in a shutdown state, it might not be possible to re-enter Hot Standby
     until the primary is started up, so that it generates further starting
     points in the WAL logs.  This situation isn't a problem in the most
     common situations where it might happen. Generally, if the primary is
     shut down and not available anymore, that's likely due to a serious
     failure that requires the standby being converted to operate as
     the new primary anyway.  And in situations where the primary is
     being intentionally taken down, coordinating to make sure the standby
     becomes the new primary smoothly is also standard procedure.
    </para>
   </listitem>
   <listitem>
    <para>
     At the end of recovery, <literal>AccessExclusiveLocks</> held by prepared transactions
     will require twice the normal number of lock table entries. If you plan
     on running either a large number of concurrent prepared transactions
     that normally take <literal>AccessExclusiveLocks</>, or you plan on having one
     large transaction that takes many <literal>AccessExclusiveLocks</>, you are
     advised to select a larger value of <varname>max_locks_per_transaction</>,
     perhaps as much as twice the value of the parameter on
     the primary server. You need not consider this at all if
     your setting of <varname>max_prepared_transactions</> is <literal>0</>.
    </para>
   </listitem>
  </itemizedlist>

   </para>
  </sect2>

 </sect1>

  <sect1 id="backup-incremental-updated">
   <title>Incrementally Updated Backups</title>

  <indexterm zone="high-availability">
   <primary>incrementally updated backups</primary>
  </indexterm>

  <indexterm zone="high-availability">
   <primary>change accumulation</primary>
  </indexterm>

   <para>
    In a standby configuration, it is possible to offload the expense of
    taking periodic base backups from the primary server; instead base backups
    can be made by backing
    up a standby server's files.  This concept is generally known as
    incrementally updated backups, log change accumulation, or more simply,
    change accumulation.
   </para>

   <para>
    If we take a file system backup of the standby server's data
    directory while it is processing
    logs shipped from the primary, we will be able to reload that backup and
    restart the standby's recovery process from the last restart point.
    We no longer need to keep WAL files from before the standby's restart point.
    If recovery is needed, it will be faster to recover from the incrementally
    updated backup than from the original base backup.
   </para>

   <para>
    The procedure for taking a file system backup of the standby server's
    data directory while it's processing logs shipped from the primary is:
   <orderedlist>
    <listitem>
     <para>
      Perform the backup, without using <function>pg_start_backup</> and
      <function>pg_stop_backup</>. Note that the <filename>pg_control</>
      file must be backed up <emphasis>first</>, as in:
<programlisting>
cp /var/lib/pgsql/data/global/pg_control /tmp
cp -r /var/lib/pgsql/data /path/to/backup
mv /tmp/pg_control /path/to/backup/data/global
</programlisting>
      <filename>pg_control</> contains the location where WAL replay will
      begin after restoring from the backup; backing it up first ensures
      that it points to the last restartpoint when the backup started, not
      some later restartpoint that happened while files were copied to the
      backup.
     </para>
    </listitem>
    <listitem>
     <para>
      Make note of the backup ending WAL location by calling the <function>
      pg_last_xlog_replay_location</> function at the end of the backup,
      and keep it with the backup.
<programlisting>
psql -c "select pg_last_xlog_replay_location();" > /path/to/backup/end_location
</programlisting>
      When recovering from the incrementally updated backup, the server
      can begin accepting connections and complete the recovery successfully
      before the database has become consistent. To avoid that, you must
      ensure the database is consistent before users try to connect to the
      server and when the recovery ends. You can do that by comparing the
      progress of the recovery with the stored backup ending WAL location:
      the server is not consistent until recovery has reached the backup end
      location. The progress of the recovery can also be observed with the
      <function>pg_last_xlog_replay_location</> function, but that required
      connecting to the server while it might not be consistent yet, so
      care should be taken with that method.
     </para>
     <para>
     </para>
    </listitem>
   </orderedlist>
   </para>

   <para>
    Since the standby server is not <quote>live</>, it is not possible to
    use <function>pg_start_backup()</> and <function>pg_stop_backup()</>
    to manage the backup process; it will be up to you to determine how
    far back you need to keep WAL segment files to have a recoverable
    backup. That is determined by the last restartpoint when the backup
    was taken, any WAL older than that can be deleted from the archive
    once the backup is complete. You can determine the last restartpoint
    by running <application>pg_controldata</> on the standby server before
    taking the backup, or by using the <varname>log_checkpoints</> option
    to print values to the standby's server log.
   </para>
  </sect1>

</chapter>