From 5e550acbc4d125a9bc484bfeb204c8db4ff29c59 Mon Sep 17 00:00:00 2001
From: Bruce Momjian <bruce@momjian.us>
Date: Fri, 15 Sep 2006 22:02:21 +0000
Subject: [PATCH] Document Warm Standby for High Availability

Includes sample standby script.

Simon Riggs
---
 doc/src/sgml/backup.sgml | 308 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 307 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 01c8ed7d6e..e5d7597dc3 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/backup.sgml,v 2.84 2006/09/15 21:55:07 momjian Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/backup.sgml,v 2.85 2006/09/15 22:02:21 momjian Exp $ -->
 
 <chapter id="backup">
  <title>Backup and Restore</title>
@@ -1203,6 +1203,312 @@ restore_command = 'copy /mnt/server/archivedir/%f "%p"'  # Windows
   </sect2>
  </sect1>
 
+ <sect1 id="warm-standby">
+  <title>Warm Standby Servers for High Availability</title>
+
+  <indexterm zone="backup">
+   <primary>Warm Standby</primary>
+  </indexterm>
+
+  <indexterm zone="backup">
+   <primary>PITR Standby</primary>
+  </indexterm>
+
+  <indexterm zone="backup">
+   <primary>Standby Server</primary>
+  </indexterm>
+
+  <indexterm zone="backup">
+   <primary>Log Shipping</primary>
+  </indexterm>
+
+  <indexterm zone="backup">
+   <primary>Witness Server</primary>
+  </indexterm>
+
+  <indexterm zone="backup">
+   <primary>STONITH</primary>
+  </indexterm>
+
+  <indexterm zone="backup">
+   <primary>High Availability</primary>
+  </indexterm>
+
+  <para>
+   Continuous Archiving can be used to create a High Availability (HA)
+   cluster configuration with one or more Standby Servers ready to take
+   over operations in the case that the Primary Server fails. This
+   capability is more widely known as Warm Standby Log Shipping.
+  </para>
+
+  <para>
+   The Primary and Standby Server work together to provide this capability,
+   though the servers are only loosely coupled. The Primary Server operates
+   in Continuous Archiving mode, while the Standby Server operates in a
+   continuous Recovery mode, reading the WAL files from the Primary. No
+   changes to the database tables are required to enable this capability,
+   so it offers a low administration overhead in comparison with other
+   replication approaches. This configuration also has a very low
+   performance impact on the Primary server.
+  </para>
+
+  <para>
+   Directly moving WAL or "log" records from one database server to another
+   is typically described as Log Shipping. PostgreSQL implements file-based
+   Log Shipping, meaning WAL records are batched one file at a time. WAL
+   files can be shipped easily and cheaply over any distance, whether it be
+   to an adjacent system, another system on the same site or another system
+   on the far side of the globe. The bandwidth required for this technique
+   varies according to the transaction rate of the Primary Server.
+   Record-based Log Shipping is also possible with custom-developed
+   procedures, discussed in a later section. Future developments are likely
+   to include options for synchronous and/or integrated record-based log
+   shipping.
+  </para>
+
+  <para>
+   It should be noted that the log shipping is asynchronous, i.e. the WAL
+   records are shipped after transaction commit. As a result there can be a
+   small window of data loss, should the Primary Server suffer a
+   catastrophic failure. The window of data loss is minimised by the use of
+   the archive_timeout parameter, which can be set as low as a few seconds
+   if required. A very low setting can increase the bandwidth requirements
+   for file shipping.
+  </para>
+
+  <para>
+   The Standby server is not available for access, since it is continually
+   performing recovery processing. Recovery performance is sufficiently
+   good that the Standby will typically be only minutes away from full
+   availability once it has been activated. As a result, we refer to this
+   capability as a Warm Standby configuration that offers High
+   Availability. Restoring a server from an archived base backup and
+   rollforward can take considerably longer and so that technique only
+   really offers a solution for Disaster Recovery, not HA.
+  </para>
+
+  <para>
+   Other mechanisms for High Availability replication are available, both
+   commercially and as open-source software.
+  </para>
+
+  <para>
+   In general, log shipping between servers running different release
+   levels will not be possible. It is the policy of the PostgreSQL Global
+   Development Group not to make changes to disk formats during minor release
+   upgrades, so it is likely that running different minor release levels
+   on Primary and Standby servers will work successfully. However, no
+   formal support for that is offered and you are advised not to allow this
+   to occur over long periods.
+  </para>
+
+  <sect2 id="warm-standby-planning">
+   <title>Planning</title>
+
+   <para>
+    On the Standby server all tablespaces and paths will refer to similarly
+    named mount points, so it is important to create the Primary and Standby
+    servers so that they are as similar as possible, at least from the
+    perspective of the database server. Furthermore, any CREATE TABLESPACE
+    commands will be passed across as-is, so any new mount points must be
+    created on both servers before they are used on the Primary. Hardware
+    need not be the same, but experience shows that maintaining two
+    identical systems is easier than maintaining two dissimilar ones over
+    the whole lifetime of the application and system.
+   </para>
+
+   <para>
+    There is no special mode required to enable a Standby server. The
+    operations that occur on both Primary and Standby servers are entirely
+    normal continuous archiving and recovery tasks. The primary point of
+    contact between the two database servers is the archive of WAL files
+    that both share: Primary writing to the archive, Standby reading from
+    the archive. Care must be taken to ensure that WAL archives for separate
+    servers do not become mixed together or confused.
+   </para>
+
+   <para>
+    The magic that makes the two loosely coupled servers work together is
+    simply a restore_command that waits for the next WAL file to be archived
+    from the Primary. The restore_command is specified in the recovery.conf
+    file on the Standby Server. Normal recovery processing would request a
+    file from the WAL archive, causing an error if the file was unavailable.
+    For Standby processing it is normal for the next file to be unavailable,
+    so we must be patient and wait for it to appear. A waiting
+    restore_command can be written as a custom script that loops after
+    polling for the existence of the next WAL file. There must also be some
+    way to trigger failover, which should interrupt the restore_command,
+    break the loop and return a file not found error to the Standby Server.
+    This then ends recovery and the Standby will then come up as a normal
+    server.
+   </para>
+
+   <para>
+    Sample code for the C version of the restore_command would be:
+<programlisting>
+triggered = false;
+while (!NextWALFileReady() &amp;&amp; !triggered)
+{
+    usleep(100000L);        // wait for ~0.1 sec
+    if (CheckForExternalTrigger())
+        triggered = true;
+}
+if (!triggered)
+    CopyWALFileForRecovery();
+</programlisting>
+   </para>
+
+   <para>
+    PostgreSQL does not provide the system software required to identify a
+    failure on the Primary and notify the Standby system and then the
+    Standby database server. Many such tools exist and are well integrated
+    with other aspects of a system failover, such as IP address migration.
+   </para>
+
+   <para>
+    Triggering failover is an important part of planning and design. The
+    restore_command is executed in full once for each WAL file. The process
+    running the restore_command is therefore created and dies for each file,
+    so there is no daemon or server process and so we cannot use signals and
+    a signal handler. A more permanent notification is required to trigger
+    the failover. It is possible to use a simple timeout facility,
+    especially if used in conjunction with a known archive_timeout setting
+    on the Primary. This is somewhat error prone, since a network problem or
+    a busy Primary server might be sufficient to initiate failover. A
+    notification mechanism such as the explicit creation of a trigger file
+    is less error prone, if this can be arranged.
+   </para>
+  </sect2>
+
+  <sect2 id="warm-standby-config">
+   <title>Implementation</title>
+
+   <para>
+    The short procedure for configuring a Standby Server is as follows. For
+    full details of each step, refer to previous sections as noted.
+    <orderedlist>
+     <listitem>
+      <para>
+       Set up Primary and Standby systems as near identically as possible,
+       including two identical copies of PostgreSQL at the same release level.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Set up Continuous Archiving from the Primary to a WAL archive located
+       in a directory on the Standby Server. Ensure that both <xref
+       linkend="guc-archive-command"> and <xref linkend="guc-archive-timeout">
+       are set. (See <xref linkend="backup-archiving-wal">)
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Make a Base Backup of the Primary Server. (See <xref
+       linkend="backup-base-backup">)
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Begin recovery on the Standby Server from the local WAL archive,
+       using a recovery.conf that specifies a restore_command that waits as
+       described previously. (See <xref linkend="backup-pitr-recovery">)
+      </para>
+     </listitem>
+    </orderedlist>
+   </para>
+
+   <para>
+    Recovery treats the WAL Archive as read-only, so once a WAL file has
+    been copied to the Standby system it can be copied to tape at the same
+    time as it is being used by the Standby database server to recover.
+    Thus, running a Standby Server for High Availability can be performed at
+    the same time as files are stored for longer term Disaster Recovery
+    purposes.
+   </para>
+
+   <para>
+    For testing purposes, it is possible to run both Primary and Standby
+    servers on the same system. This does not provide any worthwhile
+    improvement on server robustness, nor would it be described as HA.
+   </para>
+  </sect2>
+
+  <sect2 id="warm-standby-failover">
+   <title>Failover</title>
+
+   <para>
+    If the Primary Server fails then the Standby Server should begin
+    failover procedures.
+   </para>
+
+   <para>
+    If the Standby Server fails then no failover need take place. If the
+    Standby Server can be restarted, then the recovery process can also be
+    immediately restarted, taking advantage of Restartable Recovery.
+   </para>
+
+   <para>
+    If the Primary Server fails and then immediately restarts, you must have
+    a mechanism for informing it that it is no longer the Primary. This is
+    sometimes known as STONITH (Shoot The Other Node In The Head), which is
+    necessary to avoid situations where both systems think they are the
+    Primary, which can lead to confusion and ultimately data loss.
+   </para>
+
+   <para>
+    Many failover systems use just two systems, the Primary and the Standby,
+    connected by some kind of heartbeat mechanism to continually verify the
+    connectivity between the two and the viability of the Primary. It is
+    also possible to use a third system, known as a Witness Server, to avoid
+    some problems of inappropriate failover, but the additional complexity
+    may not be worthwhile unless it is set up with sufficient care and
+    rigorous testing.
+   </para>
+
+   <para>
+    At the instant that failover takes place to the Standby, we have only a
+    single server in operation. This is known as a degenerate state.
+    The former Standby is now the Primary, but the former Primary is down
+    and may stay down. We must now fully re-create a Standby server,
+    either on the former Primary system when it comes up, or on a third,
+    possibly new, system. Once complete, the Primary and Standby can be
+    considered to have switched roles. Some people choose to use a third
+    server to provide additional protection across the failover interval,
+    though clearly this complicates the system configuration and
+    operational processes (and this can also act as a Witness Server).
+   </para>
+
+   <para>
+    So, switching from Primary to Standby Server can be fast, but requires
+    some time to re-prepare the failover cluster. Regular switching from
+    Primary to Standby is encouraged, since it allows the regular downtime
+    on each system required to maintain HA. This also acts as a test of the
+    failover so that it definitely works when you really need it. Written
+    administration procedures are advised.
+   </para>
+  </sect2>
+
+  <sect2 id="warm-standby-record">
+   <title>Implementing Record-based Log Shipping</title>
+
+   <para>
+    The main features for Log Shipping in this release are based around the
+    file-based Log Shipping described above. It is also possible to
+    implement record-based Log Shipping using the pg_xlogfile_name_offset()
+    function, though this requires custom development.
+   </para>
+
+   <para>
+    An external program can call pg_xlogfile_name_offset() to find out the
+    filename and the exact byte offset within it of the latest WAL pointer.
+    If the external program regularly polls the server it can find out how
+    far forward the pointer has moved. It can then access the WAL file
+    directly and copy those bytes across to a less up-to-date copy on a
+    Standby Server.
+   </para>
+  </sect2>
+ </sect1>
+
  <sect1 id="migration">
   <title>Migration Between Releases</title>