High Availability, Load Balancing, and Replication high availability failover replication load balancing clustering data partitioning Database servers can work together to allow a second server to take over quickly if the primary server fails (high availability), or to allow several computers to serve the same data (load balancing). Ideally, database servers could work together seamlessly. Web servers serving static web pages can be combined quite easily by merely load-balancing web requests to multiple machines. In fact, read-only database servers can be combined relatively easily too. Unfortunately, most database servers have a read/write mix of requests, and read/write servers are much harder to combine. This is because though read-only data needs to be placed on each server only once, a write to any server has to be propagated to all servers so that future read requests to those servers return consistent results. This synchronization problem is the fundamental difficulty for servers working together. Because there is no single solution that eliminates the impact of the sync problem for all use cases, there are multiple solutions. Each solution addresses this problem in a different way, and minimizes its impact for a specific workload. Some solutions deal with synchronization by allowing only one server to modify the data. Servers that can modify data are called read/write or "master" servers. Servers that can reply to read-only queries are called "slave" servers. Servers that cannot be accessed until they are changed to master servers are called "standby" servers. Some solutions are synchronous, meaning that a data-modifying transaction is not considered committed until all servers have committed the transaction. This guarantees that a failover will not lose any data and that all load-balanced servers will return consistent results no matter which server is queried. In contrast, asynchronous solutions allow some delay between the time of a commit and its propagation to the other servers, opening the possibility that some transactions might be lost in the switch to a backup server, and that load balanced servers might return slightly stale results. Asynchronous communication is used when synchronous would be too slow. Solutions can also be categorized by their granularity. Some solutions can deal only with an entire database server, while others allow control at the per-table or per-database level. Performance must be considered in any choice. There is usually a trade-off between functionality and performance. For example, a fully synchronous solution over a slow network might cut performance by more than half, while an asynchronous one might have a minimal performance impact. The remainder of this section outlines various failover, replication, and load balancing solutions. A glossary is also available. Comparison of different solutions Shared Disk Failover Shared disk failover avoids synchronization overhead by having only one copy of the database. It uses a single disk array that is shared by multiple servers. If the main database server fails, the standby server is able to mount and start the database as though it were recovering from a database crash. This allows rapid failover with no data loss. Shared hardware functionality is common in network storage devices. Using a network file system is also possible, though care must be taken that the file system has full POSIX behavior (see ). One significant limitation of this method is that if the shared disk array fails or becomes corrupt, the primary and standby servers are both nonfunctional. Another issue is that the standby server should never access the shared storage while the primary server is running. File System (Block-Device) Replication A modified version of shared hardware functionality is file system replication, where all changes to a file system are mirrored to a file system residing on another computer. The only restriction is that the mirroring must be done in a way that ensures the standby server has a consistent copy of the file system — specifically, writes to the standby must be done in the same order as those on the master. DRBD is a popular file system replication solution for Linux. Warm and Hot Standby Using Point-In-Time Recovery (PITR) Warm and hot standby servers can be kept current by reading a stream of write-ahead log (WAL) records. If the main server fails, the warm standby contains almost all of the data of the main server, and can be quickly made the new master database server. This is asynchronous and can only be done for the entire database server. A PITR standby server can be kept more up-to-date using streaming replication.; see . For warm standby information, see , and for hot standby, see . Trigger-Based Master-Slave Replication A master-slave replication setup sends all data modification queries to the master server. The master server asynchronously sends data changes to the slave server. The slave can answer read-only queries while the master server is running. The slave server is ideal for data warehouse queries. Slony-I is an example of this type of replication, with per-table granularity, and support for multiple slaves. Because it updates the slave server asynchronously (in batches), there is possible data loss during fail over. Statement-Based Replication Middleware With statement-based replication middleware, a program intercepts every SQL query and sends it to one or all servers. Each server operates independently. Read-write queries are sent to all servers, while read-only queries can be sent to just one server, allowing the read workload to be distributed. If queries are simply broadcast unmodified, functions like random(), CURRENT_TIMESTAMP, and sequences can have different values on different servers. This is because each server operates independently, and because SQL queries are broadcast (and not actual modified rows). If this is unacceptable, either the middleware or the application must query such values from a single server and then use those values in write queries. Another option is to use this replication option with a traditional master-slave setup, i.e. data modification queries are sent only to the master and are propogated to the slaves via master-slave replication, not by the replication middleware. Care must also be taken that all transactions either commit or abort on all servers, perhaps using two-phase commit ( and . Pgpool-II and Sequoia are examples of this type of replication. Asynchronous Multimaster Replication For servers that are not regularly connected, like laptops or remote servers, keeping data consistent among servers is a challenge. Using asynchronous multimaster replication, each server works independently, and periodically communicates with the other servers to identify conflicting transactions. The conflicts can be resolved by users or conflict resolution rules. Bucardo is an example of this type of replication. Synchronous Multimaster Replication In synchronous multimaster replication, each server can accept write requests, and modified data is transmitted from the original server to every other server before each transaction commits. Heavy write activity can cause excessive locking, leading to poor performance. In fact, write performance is often worse than that of a single server. Read requests can be sent to any server. Some implementations use shared disk to reduce the communication overhead. Synchronous multimaster replication is best for mostly read workloads, though its big advantage is that any server can accept write requests — there is no need to partition workloads between master and slave servers, and because the data changes are sent from one server to another, there is no problem with non-deterministic functions like random(). PostgreSQL does not offer this type of replication, though PostgreSQL two-phase commit ( and ) can be used to implement this in application code or middleware. Commercial Solutions Because PostgreSQL is open source and easily extended, a number of companies have taken PostgreSQL and created commercial closed-source solutions with unique failover, replication, and load balancing capabilities. summarizes the capabilities of the various solutions listed above. High Availability, Load Balancing, and Replication Feature Matrix Feature Shared Disk Failover File System Replication Hot/Warm Standby Using PITR Trigger-Based Master-Slave Replication Statement-Based Replication Middleware Asynchronous Multimaster Replication Synchronous Multimaster Replication Most Common Implementation NAS DRBD PITR Slony pgpool-II Bucardo Communication Method shared disk disk blocks WAL table rows SQL table rows table rows and row locks No special hardware required Allows multiple master servers No master server overhead No waiting for multiple servers Master failure will never lose data Slaves accept read-only queries Hot only Per-table granularity No conflict resolution necessary
There are a few solutions that do not fit into the above categories: Data Partitioning Data partitioning splits tables into data sets. Each set can be modified by only one server. For example, data can be partitioned by offices, e.g., London and Paris, with a server in each office. If queries combining London and Paris data are necessary, an application can query both servers, or master/slave replication can be used to keep a read-only copy of the other office's data on each server. Multiple-Server Parallel Query Execution Many of the above solutions allow multiple servers to handle multiple queries, but none allow a single query to use multiple servers to complete faster. This solution allows multiple servers to work concurrently on a single query. It is usually accomplished by splitting the data among servers and having each server execute its part of the query and return results to a central server where they are combined and returned to the user. Pgpool-II has this capability. Also, this can be implemented using the PL/Proxy toolset.
Log-Shipping Standby Servers Continuous archiving can be used to create a high availability (HA) cluster configuration with one or more standby servers ready to take over operations if the primary server fails. This capability is widely referred to as warm standby or log shipping. The primary and standby server work together to provide this capability, though the servers are only loosely coupled. The primary server operates in continuous archiving mode, while each standby server operates in continuous recovery mode, reading the WAL files from the primary. No changes to the database tables are required to enable this capability, so it offers low administration overhead compared to some other replication solutions. This configuration also has relatively low performance impact on the primary server. Directly moving WAL records from one database server to another is typically described as log shipping. PostgreSQL implements file-based log shipping, which means that WAL records are transferred one file (WAL segment) at a time. WAL files (16MB) can be shipped easily and cheaply over any distance, whether it be to an adjacent system, another system at the same site, or another system on the far side of the globe. The bandwidth required for this technique varies according to the transaction rate of the primary server. Record-based log shipping is also possible with streaming replication (see ). It should be noted that the log shipping is asynchronous, i.e., the WAL records are shipped after transaction commit. As a result, there is a window for data loss should the primary server suffer a catastrophic failure; transactions not yet shipped will be lost. The size of the data loss window in file-based log shipping can be limited by use of the archive_timeout parameter, which can be set as low as a few seconds. However such a low setting will substantially increase the bandwidth required for file shipping. If you need a window of less than a minute or so, consider using streaming replication (see ). Recovery performance is sufficiently good that the standby will typically be only moments away from full availability once it has been activated. As a result, this is called a warm standby configuration which offers high availability. Restoring a server from an archived base backup and rollforward will take considerably longer, so that technique only offers a solution for disaster recovery, not high availability. A standby server can also be used for read-only queries, in which case it is called a Hot Standby server. See for more information. warm standby PITR standby standby server log shipping witness server STONITH Planning It is usually wise to create the primary and standby servers so that they are as similar as possible, at least from the perspective of the database server. In particular, the path names associated with tablespaces will be passed across unmodified, so both primary and standby servers must have the same mount paths for tablespaces if that feature is used. Keep in mind that if is executed on the primary, any new mount point needed for it must be created on the primary and all standby servers before the command is executed. Hardware need not be exactly the same, but experience shows that maintaining two identical systems is easier than maintaining two dissimilar ones over the lifetime of the application and system. In any case the hardware architecture must be the same — shipping from, say, a 32-bit to a 64-bit system will not work. In general, log shipping between servers running different major PostgreSQL release levels is not possible. It is the policy of the PostgreSQL Global Development Group not to make changes to disk formats during minor release upgrades, so it is likely that running different minor release levels on primary and standby servers will work successfully. However, no formal support for that is offered and you are advised to keep primary and standby servers at the same release level as much as possible. When updating to a new minor release, the safest policy is to update the standby servers first — a new minor release is more likely to be able to read WAL files from a previous minor release than vice versa. Standby Server Operation In standby mode, the server continuously applies WAL received from the master server. The standby server can read WAL from a WAL archive (see restore_command) or directly from the master over a TCP connection (streaming replication). The standby server will also attempt to restore any WAL found in the standby cluster's pg_xlog directory. That typically happens after a server restart, when the standby replays again WAL that was streamed from the master before the restart, but you can also manually copy files to pg_xlog at any time to have them replayed. At startup, the standby begins by restoring all WAL available in the archive location, calling restore_command. Once it reaches the end of WAL available there and restore_command fails, it tries to restore any WAL available in the pg_xlog directory. If that fails, and streaming replication has been configured, the standby tries to connect to the primary server and start streaming WAL from the last valid record found in archive or pg_xlog. If that fails or streaming replication is not configured, or if the connection is later disconnected, the standby goes back to step 1 and tries to restore the file from the archive again. This loop of retries from the archive, pg_xlog, and via streaming replication goes on until the server is stopped or failover is triggered by a trigger file. Standby mode is exited and the server switches to normal operation, when a trigger file is found (trigger_file). Before failover, it will restore any WAL available in the archive or in pg_xlog, but won't try to connect to the master or wait for files to become available in the archive. Preparing the Master for Standby Servers Set up continuous archiving on the primary to an archive directory accessible from the standby, as described in . The archive location should be accessible from the standby even when the master is down, i.e. it should reside on the standby server itself or another trusted server, not on the master server. If you want to use streaming replication, set up authentication on the primary server to allow replication connections from the standby server(s); that is, provide a suitable entry or entries in pg_hba.conf with the database field set to replication. Also ensure max_wal_senders is set to a sufficiently large value in the configuration file of the primary server. Take a base backup as described in to bootstrap the standby server. Setting Up a Standby Server To set up the standby server, restore the base backup taken from primary server (see ). Create a recovery command file recovery.conf in the standby's cluster data directory, and turn on standby_mode. Set restore_command to a simple command to copy files from the WAL archive. Do not use pg_standby or similar tools with the built-in standby mode described here. restore_command should return immediately if the file does not exist; the server will retry the command again if necessary. See for using tools like pg_standby. If you want to use streaming replication, fill in primary_conninfo with a libpq connection string, including the host name (or IP address) and any additional details needed to connect to the primary server. If the primary needs a password for authentication, the password needs to be specified in primary_conninfo as well. You can use restartpoint_command to prune the archive of files no longer needed by the standby. If you're setting up the standby server for high availability purposes, set up WAL archiving, connections and authentication like the primary server, because the standby server will work as a primary server after failover. You will also need to set trigger_file to make it possible to fail over. If you're setting up the standby server for reporting purposes, with no plans to fail over to it, trigger_file is not required. A simple example of a recovery.conf is: standby_mode = 'on' primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' restore_command = 'cp /path/to/archive/%f %p' trigger_file = '/path/to/trigger_file' You can have any number of standby servers, but if you use streaming replication, make sure you set max_wal_senders high enough in the primary to allow them to be connected simultaneously. Streaming Replication Streaming Replication Streaming replication allows a standby server to stay more up-to-date than is possible with file-based log shipping. The standby connects to the primary, which streams WAL records to the standby as they're generated, without waiting for the WAL file to be filled. Streaming replication is asynchronous, so there is still a small delay between committing a transaction in the primary and for the changes to become visible in the standby. The delay is however much smaller than with file-based log shipping, typically under one second assuming the standby is powerful enough to keep up with the load. With streaming replication, archive_timeout is not required to reduce the data loss window. Streaming replication relies on file-based continuous archiving for making the base backup and for allowing the standby to catch up if it is disconnected from the primary for long enough for the primary to delete old WAL files still required by the standby. It is possible to use streaming replication without WAL archiving, but if a standby falls behind too much, the primary will delete old WAL files still needed by the standby, and the standby will have to be manually restored from a base backup. You can control how long the primary retains old WAL segments using the wal_keep_segments setting. To use streaming replication, set up a file-based log-shipping standby server as described in . The step that turns a file-based log-shipping standby into streaming replication standby is setting primary_conninfo setting in the recovery.conf file to point to the primary server. Set and authentication options (see pg_hba.conf) on the primary so that the standby server can connect to the replication pseudo-database on the primary server (see ). On systems that support the keepalive socket option, setting , and helps the primary promptly notice a broken connection. Set the maximum number of concurrent connections from the standby servers (see for details). When the standby is started and primary_conninfo is set correctly, the standby will connect to the primary after replaying all WAL files available in the archive. If the connection is established successfully, you will see a walreceiver process in the standby, and a corresponding walsender process in the primary. Authentication It is very important that the access privileges for replication be set up so that only trusted users can read the WAL stream, because it is easy to extract privileged information from it. Standby servers must authenticate to the primary as a superuser account. So a role with the SUPERUSER and LOGIN privileges needs to be created on the primary. Client authentication for replication is controlled by a pg_hba.conf record specifying replication in the database field. For example, if the standby is running on host IP 192.168.1.100 and the superuser's name for replication is foo, the administrator can add the following line to the pg_hba.conf file on the primary: # Allow the user "foo" from host 192.168.1.100 to connect to the primary # as a replication standby if the user's password is correctly supplied. # # TYPE DATABASE USER CIDR-ADDRESS METHOD host replication foo 192.168.1.100/32 md5 The host name and port number of the primary, connection user name, and password are specified in the recovery.conf file or the corresponding environment variable on the standby. For example, if the primary is running on host IP 192.168.1.50, port 5432, the superuser's name for replication is foo, and the password is foopass, the administrator can add the following line to the recovery.conf file on the standby: # The standby connects to the primary that is running on host 192.168.1.50 # and port 5432 as the user "foo" whose password is "foopass". primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' Monitoring An important health indicator of streaming replication is the amount of WAL records generated in the primary, but not yet applied in the standby. You can calculate this lag by comparing the current WAL write location on the primary with the last WAL location received by the standby. They can be retrieved using pg_current_xlog_location on the primary and the pg_last_xlog_receive_location on the standby, respectively (see and for details). The last WAL receive location in the standby is also displayed in the process status of the WAL receiver process, displayed using the ps command (see for details). Failover If the primary server fails then the standby server should begin failover procedures. If the standby server fails then no failover need take place. If the standby server can be restarted, even some time later, then the recovery process can also be restarted immediately, taking advantage of restartable recovery. If the standby server cannot be restarted, then a full new standby server instance should be created. If the primary server fails and the standby server becomes the new primary, and then the old primary restarts, you must have a mechanism for informing the old primary that it is no longer the primary. This is sometimes known as STONITH (Shoot The Other Node In The Head), which is necessary to avoid situations where both systems think they are the primary, which will lead to confusion and ultimately data loss. Many failover systems use just two systems, the primary and the standby, connected by some kind of heartbeat mechanism to continually verify the connectivity between the two and the viability of the primary. It is also possible to use a third system (called a witness server) to prevent some cases of inappropriate failover, but the additional complexity might not be worthwhile unless it is set up with sufficient care and rigorous testing. Once failover to the standby occurs, there is only a single server in operation. This is known as a degenerate state. The former standby is now the primary, but the former primary is down and might stay down. To return to normal operation, a standby server must be recreated, either on the former primary system when it comes up, or on a third, possibly new, system. Once complete the primary and standby can be considered to have switched roles. Some people choose to use a third server to provide backup for the new primary until the new standby server is recreated, though clearly this complicates the system configuration and operational processes. So, switching from primary to standby server can be fast but requires some time to re-prepare the failover cluster. Regular switching from primary to standby is useful, since it allows regular downtime on each system for maintenance. This also serves as a test of the failover mechanism to ensure that it will really work when you need it. Written administration procedures are advised. To trigger failover of a log-shipping standby server, create a trigger file with the filename and path specified by the trigger_file setting in recovery.conf. If trigger_file is not given, there is no way to exit recovery in the standby and promote it to a master. That can be useful for e.g reporting servers that are only used to offload read-only queries from the primary, not for high availability purposes. Alternative method for log shipping An alternative to the built-in standby mode described in the previous sections is to use a restore_command that polls the archive location. This was the only option available in versions 8.4 and below. In this setup, set standby_mode off, because you are implementing the polling required for standby operation yourself. See contrib/pg_standby () for a reference implementation of this. Note that in this mode, the server will apply WAL one file at a time, so if you use the standby server for queries (see Hot Standby), there is a bigger delay between an action in the master and when the action becomes visible in the standby, corresponding the time it takes to fill up the WAL file. archive_timeout can be used to make that delay shorter. Also note that you can't combine streaming replication with this method. The operations that occur on both primary and standby servers are normal continuous archiving and recovery tasks. The only point of contact between the two database servers is the archive of WAL files that both share: primary writing to the archive, standby reading from the archive. Care must be taken to ensure that WAL archives from separate primary servers do not become mixed together or confused. The archive need not be large if it is only required for standby operation. The magic that makes the two loosely coupled servers work together is simply a restore_command used on the standby that, when asked for the next WAL file, waits for it to become available from the primary. The restore_command is specified in the recovery.conf file on the standby server. Normal recovery processing would request a file from the WAL archive, reporting failure if the file was unavailable. For standby processing it is normal for the next WAL file to be unavailable, so the standby must wait for it to appear. For files ending in .backup or .history there is no need to wait, and a non-zero return code must be returned. A waiting restore_command can be written as a custom script that loops after polling for the existence of the next WAL file. There must also be some way to trigger failover, which should interrupt the restore_command, break the loop and return a file-not-found error to the standby server. This ends recovery and the standby will then come up as a normal server. Pseudocode for a suitable restore_command is: triggered = false; while (!NextWALFileReady() && !triggered) { sleep(100000L); /* wait for ~0.1 sec */ if (CheckForExternalTrigger()) triggered = true; } if (!triggered) CopyWALFileForRecovery(); A working example of a waiting restore_command is provided as a contrib module named pg_standby. It should be used as a reference on how to correctly implement the logic described above. It can also be extended as needed to support specific configurations and environments. PostgreSQL does not provide the system software required to identify a failure on the primary and notify the standby database server. Many such tools exist and are well integrated with the operating system facilities required for successful failover, such as IP address migration. The method for triggering failover is an important part of planning and design. One potential option is the restore_command command. It is executed once for each WAL file, but the process running the restore_command is created and dies for each file, so there is no daemon or server process, and signals or a signal handler cannot be used. Therefore, the restore_command is not suitable to trigger failover. It is possible to use a simple timeout facility, especially if used in conjunction with a known archive_timeout setting on the primary. However, this is somewhat error prone since a network problem or busy primary server might be sufficient to initiate failover. A notification mechanism such as the explicit creation of a trigger file is ideal, if this can be arranged. The size of the WAL archive can be minimized by using the %r option of the restore_command. This option specifies the last archive file name that needs to be kept to allow the recovery to restart correctly. This can be used to truncate the archive once files are no longer required, assuming the archive is writable from the standby server. Implementation The short procedure for configuring a standby server is as follows. For full details of each step, refer to previous sections as noted. Set up primary and standby systems as nearly identical as possible, including two identical copies of PostgreSQL at the same release level. Set up continuous archiving from the primary to a WAL archive directory on the standby server. Ensure that , and are set appropriately on the primary (see ). Make a base backup of the primary server (see ), and load this data onto the standby. Begin recovery on the standby server from the local WAL archive, using a recovery.conf that specifies a restore_command that waits as described previously (see ). Recovery treats the WAL archive as read-only, so once a WAL file has been copied to the standby system it can be copied to tape at the same time as it is being read by the standby database server. Thus, running a standby server for high availability can be performed at the same time as files are stored for longer term disaster recovery purposes. For testing purposes, it is possible to run both primary and standby servers on the same system. This does not provide any worthwhile improvement in server robustness, nor would it be described as HA. Record-based Log Shipping PostgreSQL directly supports file-based log shipping as described above. It is also possible to implement record-based log shipping, though this requires custom development. An external program can call the pg_xlogfile_name_offset() function (see ) to find out the file name and the exact byte offset within it of the current end of WAL. It can then access the WAL file directly and copy the data from the last known end of WAL through the current end over to the standby servers. With this approach, the window for data loss is the polling cycle time of the copying program, which can be very small, and there is no wasted bandwidth from forcing partially-used segment files to be archived. Note that the standby servers' restore_command scripts can only deal with whole WAL files, so the incrementally copied data is not ordinarily made available to the standby servers. It is of use only when the primary dies — then the last partial WAL file is fed to the standby before allowing it to come up. The correct implementation of this process requires cooperation of the restore_command script with the data copying program. Starting with PostgreSQL version 9.0, you can use streaming replication (see ) to achieve the same benefits with less effort. Hot Standby Hot Standby Hot Standby is the term used to describe the ability to connect to the server and run queries while the server is in archive recovery. This is useful for both log shipping replication and for restoring a backup to an exact state with great precision. The term Hot Standby also refers to the ability of the server to move from recovery through to normal operation while users continue running queries and/or keep their connections open. Running queries in recovery mode is similar to normal query operation, though there are a several usage and administrative differences noted below. User's Overview Users can connect to the database server while it is in recovery mode and perform read-only queries. Read-only access to system catalogs and views will also occur as normal. The data on the standby takes some time to arrive from the primary server so there will be a measurable delay between primary and standby. Running the same query nearly simultaneously on both primary and standby might therefore return differing results. We say that data on the standby is eventually consistent with the primary. Queries executed on the standby will be correct with regard to the transactions that had been recovered at the start of the query, or start of first statement in the case of serializable transactions. In comparison with the primary, the standby returns query results that could have been obtained on the primary at some moment in the past. When a transaction is started in recovery, the parameter transaction_read_only will be forced to be true, regardless of the default_transaction_read_only setting in postgresql.conf. It can't be manually set to false either. As a result, all transactions started during recovery will be limited to read-only actions. In all other ways, connected sessions will appear identical to sessions initiated during normal processing mode. There are no special commands required to initiate a connection so all interfaces work unchanged. After recovery finishes, the session will allow normal read-write transactions at the start of the next transaction, if these are requested. "Read-only" above means no writes to the permanent or temporary database tables. There are no problems with queries that use transient sort and work files. The following actions are allowed: Query access - SELECT, COPY TO including views and SELECT rules Cursor commands - DECLARE, FETCH, CLOSE Parameters - SHOW, SET, RESET Transaction management commands BEGIN, END, ABORT, START TRANSACTION SAVEPOINT, RELEASE, ROLLBACK TO SAVEPOINT EXCEPTION blocks and other internal subtransactions LOCK TABLE, though only when explicitly in one of these modes: ACCESS SHARE, ROW SHARE or ROW EXCLUSIVE. Plans and resources - PREPARE, EXECUTE, DEALLOCATE, DISCARD Plugins and extensions - LOAD These actions produce error messages: Data Manipulation Language (DML) - INSERT, UPDATE, DELETE, COPY FROM, TRUNCATE. Note that there are no allowed actions that result in a trigger being executed during recovery. Data Definition Language (DDL) - CREATE, DROP, ALTER, COMMENT. This also applies to temporary tables also because currently their definition causes writes to catalog tables. SELECT ... FOR SHARE | UPDATE which cause row locks to be written Rules on SELECT statements that generate DML commands. LOCK that explicitly requests a mode higher than ROW EXCLUSIVE MODE. LOCK in short default form, since it requests ACCESS EXCLUSIVE MODE. Transaction management commands that explicitly set non-read-only state: BEGIN READ WRITE, START TRANSACTION READ WRITE SET TRANSACTION READ WRITE, SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE SET transaction_read_only = off Two-phase commit commands - PREPARE TRANSACTION, COMMIT PREPARED, ROLLBACK PREPARED because even read-only transactions need to write WAL in the prepare phase (the first phase of two phase commit). Sequence updates - nextval(), setval() LISTEN, UNLISTEN, NOTIFY Note that the current behavior of read only transactions when not in recovery is to allow the last two actions, so there are small and subtle differences in behavior between read-only transactions run on a standby and run during normal operation. It is possible that LISTEN, UNLISTEN, and temporary tables might be allowed in a future release. If failover or switchover occurs the database will switch to normal processing mode. Sessions will remain connected while the server changes mode. Current transactions will continue, though will remain read-only. After recovery is complete, it will be possible to initiate read-write transactions. Users will be able to tell whether their session is read-only by issuing SHOW transaction_read_only. In addition, a set of functions () allow users to access information about the standby server. These allow you to write programs that are aware of the current state of the database. These can be used to monitor the progress of recovery, or to allow you to write complex programs that restore the database to particular states. In recovery, transactions will not be permitted to take any table lock higher than RowExclusiveLock. In addition, transactions may never assign a TransactionId and may never write WAL. Any LOCK TABLE command that runs on the standby and requests a specific lock mode higher than ROW EXCLUSIVE MODE will be rejected. In general queries will not experience lock conflicts from the database changes made by recovery. This is because recovery follows normal concurrency control mechanisms, known as MVCC. There are some types of change that will cause conflicts, covered in the following section. Handling query conflicts The primary and standby nodes are in many ways loosely connected. Actions on the primary will have an effect on the standby. As a result, there is potential for negative interactions or conflicts between them. The easiest conflict to understand is performance: if a huge data load is taking place on the primary then this will generate a similar stream of WAL records on the standby, so standby queries may contend for system resources, such as I/O. There are also additional types of conflict that can occur with Hot Standby. These conflicts are hard conflicts in the sense that queries might need to be cancelled and, in some cases, sessions disconnected to resolve them. The user is provided with several ways to handle these conflicts, though it is important to first understand the possible causes of conflicts: Access Exclusive Locks from primary node, including both explicit LOCK commands and various DDL actions Dropping tablespaces on the primary while standby queries are using those tablespaces for temporary work files (work_mem overflow) Dropping databases on the primary while users are connected to that database on the standby. The standby waiting longer than max_standby_delay to acquire a buffer cleanup lock. Early cleanup of data still visible to the current query's snapshot Some WAL redo actions will be for DDL execution. These DDL actions are replaying changes that have already committed on the primary node, so they must not fail on the standby node. These DDL locks take priority and will automatically *cancel* any read-only transactions that get in their way, after a grace period. This is similar to the possibility of being canceled by the deadlock detector. But in this case, the standby recovery process always wins, since the replayed actions must not fail. This also ensures that replication does not fall behind while waiting for a query to complete. This prioritization presumes that the standby exists primarily for high availability, and that adjusting the grace period will allow a sufficient guard against unexpected cancellation. An example of the above would be an administrator on the primary server running DROP TABLE on a table that is currently being queried on the standby server. Clearly the query cannot continue if DROP TABLE proceeds. If this situation occurred on the primary, the DROP TABLE would wait until the query had finished. When DROP TABLE is run on the primary, the primary doesn't have information about which queries are running on the standby, so it cannot wait for any of the standby queries. The WAL change records come through to the standby while the standby query is still running, causing a conflict. The most common reason for conflict between standby queries and WAL redo is "early cleanup". Normally, PostgreSQL allows cleanup of old row versions when there are no users who need to see them to ensure correct visibility of data (the heart of MVCC). If there is a standby query that has been running for longer than any query on the primary then it is possible for old row versions to be removed by either a vacuum or HOT. This will then generate WAL records that, if applied, would remove data on the standby that might potentially be required by the standby query. In more technical language, the primary's xmin horizon is later than the standby's xmin horizon, allowing dead rows to be removed. Experienced users should note that both row version cleanup and row version freezing will potentially conflict with recovery queries. Running a manual VACUUM FREEZE is likely to cause conflicts even on tables with no updated or deleted rows. There are a number of choices for resolving query conflicts. The default is to wait and hope the query finishes. The server will wait automatically until the lag between primary and standby is at most max_standby_delay seconds. Once that grace period expires, one of the following actions is taken: If the conflict is caused by a lock, the conflicting standby transaction is cancelled immediately. If the transaction is idle-in-transaction, then the session is aborted instead. This behavior might change in the future. If the conflict is caused by cleanup records, the standby query is informed a conflict has occurred and that it must cancel itself to avoid the risk that it silently fails to read relevant data because that data has been removed. (This is regrettably similar to the much feared and iconic error message "snapshot too old"). Some cleanup records only conflict with older queries, while others can affect all queries. If cancellation does occur, the query and/or transaction can always be re-executed. The error is dynamic and will not necessarily reoccur if the query is executed again. max_standby_delay is set in postgresql.conf. The parameter applies to the server as a whole, so if the delay is consumed by a single query then there may be little or no waiting for queries that follow, though they will have benefited equally from the initial waiting period. The server may take time to catch up again before the grace period is available again, though if there is a heavy and constant stream of conflicts it may seldom catch up fully. Users should be clear that tables that are regularly and heavily updated on the primary server will quickly cause cancellation of longer running queries on the standby. In those cases max_standby_delay can be considered similar to setting statement_timeout. Other remedial actions exist if the number of cancellations is unacceptable. The first option is to connect to the primary server and keep a query active for as long as needed to run queries on the standby. This guarantees that a WAL cleanup record is never generated and query conflicts do not occur, as described above. This could be done using contrib/dblink and pg_sleep(), or via other mechanisms. If you do this, you should note that this will delay cleanup of dead rows on the primary by vacuum or HOT, and people might find this undesirable. However, remember that the primary and standby nodes are linked via the WAL, so the cleanup situation is no different from the case where the query ran on the primary node itself. And you are still getting the benefit of off-loading the execution onto the standby. max_standby_delay should not be used in this case because delayed WAL files might already contain entries that invalidate the current shapshot. It is also possible to set vacuum_defer_cleanup_age on the primary to defer the cleanup of records by autovacuum, VACUUM and HOT. This might allow more time for queries to execute before they are cancelled on the standby, without the need for setting a high max_standby_delay. Three-way deadlocks are possible between AccessExclusiveLocks arriving from the primary, cleanup WAL records that require buffer cleanup locks, and user requests that are waiting behind replayed AccessExclusiveLocks. Deadlocks are resolved immediately, should they occur, though they are thought to be rare in practice. Dropping tablespaces or databases is discussed in the administrator's section since they are not typical user situations. Administrator's Overview If there is a recovery.conf file present, the server will start in Hot Standby mode by default, though recovery_connections can be disabled via postgresql.conf. The server might take some time to enable recovery connections since the server must first complete sufficient recovery to provide a consistent state against which queries can run before enabling read only connections. During this period, clients that attempt to connect will be refused with an error message. To confirm the server has come up, either loop retrying to connect from the application, or look for these messages in the server logs: LOG: entering standby mode ... then some time later ... LOG: consistent recovery state reached LOG: database system is ready to accept read only connections Consistency information is recorded once per checkpoint on the primary, as long as wal_level is set to hot_standby on the primary. It is not possible to enable recovery connections on the standby when reading WAL written during the period that wal_level was not set to hot_standby on the primary. Reaching a consistent state can also be delayed in the presence of both of these conditions: A write transaction has more than 64 subtransactions Very long-lived write transactions If you are running file-based log shipping ("warm standby"), you might need to wait until the next WAL file arrives, which could be as long as the archive_timeout setting on the primary. The setting of some parameters on the standby will need reconfiguration if they have been changed on the primary. For these parameters, the value on the standby must be equal to or greater than the value on the primary. If these parameters are not set high enough then the standby will not be able to process recovering transactions properly. If these values are set too low the server will halt. Higher values can then be supplied and the server restarted to begin recovery again. The parameters are: max_connections max_prepared_transactions max_locks_per_transaction It is important that the administrator consider the appropriate setting of max_standby_delay, set in postgresql.conf. There is no optimal setting, so it should be set according to business priorities. For example if the server is primarily tasked as a High Availability server, then you may wish to lower max_standby_delay or even set it to zero, though that is a very aggressive setting. If the standby server is tasked as an additional server for decision support queries then it might be acceptable to set this to a value of many hours (in seconds). It is also possible to set max_standby_delay to -1 which means wait forever for queries to complete, if there are conflicts; this will be useful when performing an archive recovery from a backup. Transaction status "hint bits" written on the primary are not WAL-logged, so data on the standby will likely re-write the hints again on the standby. Thus, the standby server will still perform disk writes even though all users are read-only; no changes occur to the data values themselves. Users will still write large sort temporary files and re-generate relcache info files, so no part of the database is truly read-only during hot standby mode. There is no restriction on the use of set returning functions, or other users of tuplestore/tuplesort code. Note also that writes to remote databases will still be possible, even though the transaction is read-only locally. The following types of administration commands are not accepted during recovery mode: Data Definition Language (DDL) - e.g. CREATE INDEX Privilege and Ownership - GRANT, REVOKE, REASSIGN Maintenance commands - ANALYZE, VACUUM, CLUSTER, REINDEX Again, note that some of these commands are actually allowed during "read only" mode transactions on the primary. As a result, you cannot create additional indexes that exist solely on the standby, nor statistics that exist solely on the standby. If these administration commands are needed, they should be executed on the primary, and eventually those changes will propagate to the standby. pg_cancel_backend() will work on user backends, but not the Startup process, which performs recovery. pg_stat_activity does not show an entry for the Startup process, nor do recovering transactions show as active. As a result, pg_prepared_xacts is always empty during recovery. If you wish to resolve in-doubt prepared transactions, view pg_prepared_xacts on the primary and issue commands to resolve transactions there. pg_locks will show locks held by backends, as normal. pg_locks also shows a virtual transaction managed by the Startup process that owns all AccessExclusiveLocks held by transactions being replayed by recovery. Note that the Startup process does not acquire locks to make database changes, and thus locks other than AccessExclusiveLocks do not show in pg_locks for the Startup process; they are just presumed to exist. The Nagios plugin check_pgsql will work, because the simple information it checks for exists. The check_postgres monitoring script will also work, though some reported values could give different or confusing results. For example, last vacuum time will not be maintained, since no vacuum occurs on the standby. Vacuums running on the primary do still send their changes to the standby. WAL file control commands will not work during recovery, e.g. pg_start_backup, pg_switch_xlog etc. Dynamically loadable modules work, including pg_stat_statements. Advisory locks work normally in recovery, including deadlock detection. Note that advisory locks are never WAL logged, so it is impossible for an advisory lock on either the primary or the standby to conflict with WAL replay. Nor is it possible to acquire an advisory lock on the primary and have it initiate a similar advisory lock on the standby. Advisory locks relate only to the server on which they are acquired. Trigger-based replication systems such as Slony, Londiste and Bucardo won't run on the standby at all, though they will run happily on the primary server as long as the changes are not sent to standby servers to be applied. WAL replay is not trigger-based so you cannot relay from the standby to any system that requires additional database writes or relies on the use of triggers. New oids cannot be assigned, though some UUID generators may still work as long as they do not rely on writing new status to the database. Currently, temporary table creation is not allowed during read only transactions, so in some cases existing scripts will not run correctly. This restriction might be relaxed in a later release. This is both a SQL Standard compliance issue and a technical issue. DROP TABLESPACE can only succeed if the tablespace is empty. Some standby users may be actively using the tablespace via their temp_tablespaces parameter. If there are temporary files in the tablespace, all active queries are cancelled to ensure that temporary files are removed, so the tablespace can be removed and WAL replay can continue. Running DROP DATABASE, ALTER DATABASE ... SET TABLESPACE, or ALTER DATABASE ... RENAME on primary will generate a log message that will cause all users connected to that database on the standby to be forcibly disconnected. This action occurs immediately, whatever the setting of max_standby_delay. In normal (non-recovery) mode, if you issue DROP USER or DROP ROLE for a role with login capability while that user is still connected then nothing happens to the connected user - they remain connected. The user cannot reconnect however. This behavior applies in recovery also, so a DROP USER on the primary does not disconnect that user on the standby. The statististics collector is active during recovery. All scans, reads, blocks, index usage, etc., will be recorded normally on the standby. Replayed actions will not duplicate their effects on primary, so replaying an insert will not increment the Inserts column of pg_stat_user_tables. The stats file is deleted at the start of recovery, so stats from primary and standby will differ; this is considered a feature, not a bug. Autovacuum is not active during recovery, it will start normally at the end of recovery. The background writer is active during recovery and will perform restartpoints (similar to checkpoints on the primary) and normal block cleaning activities. This can include updates of the hint bit information stored on the standby server. The CHECKPOINT command is accepted during recovery, though it performs a restartpoint rather than a new checkpoint. Hot Standby Parameter Reference Various parameters have been mentioned above in and . On the primary, parameters wal_level and vacuum_defer_cleanup_age can be used. max_standby_delay has no effect if set on the primary. On the standby, parameters recovery_connections and max_standby_delay can be used. vacuum_defer_cleanup_age has no effect during recovery. Caveats There are several limitations of Hot Standby. These can and probably will be fixed in future releases: Operations on hash indexes are not presently WAL-logged, so replay will not update these indexes. Hash indexes will not be used for query plans during recovery. Full knowledge of running transactions is required before snapshots can be taken. Transactions that use large numbers of subtransactions (currently greater than 64) will delay the start of read only connections until the completion of the longest running write transaction. If this situation occurs, explanatory messages will be sent to the server log. Valid starting points for recovery connections are generated at each checkpoint on the master. If the standby is shut down while the master is in a shutdown state, it might not be possible to re-enter Hot Standby until the primary is started up, so that it generates further starting points in the WAL logs. This situation isn't a problem in the most common situations where it might happen. Generally, if the primary is shut down and not available anymore, that's likely due to a serious failure that requires the standby being converted to operate as the new primary anyway. And in situations where the primary is being intentionally taken down, coordinating to make sure the standby becomes the new primary smoothly is also standard procedure. At the end of recovery, AccessExclusiveLocks held by prepared transactions will require twice the normal number of lock table entries. If you plan on running either a large number of concurrent prepared transactions that normally take AccessExclusiveLocks, or you plan on having one large transaction that takes many AccessExclusiveLocks, you are advised to select a larger value of max_locks_per_transaction, up to, but never more than twice the value of the parameter setting on the primary server. You need not consider this at all if your setting of max_prepared_transactions is 0. Incrementally Updated Backups incrementally updated backups change accumulation In a warm standby configuration, it is possible to offload the expense of taking periodic base backups from the primary server; instead base backups can be made by backing up a standby server's files. This concept is generally known as incrementally updated backups, log change accumulation, or more simply, change accumulation. If we take a file system backup of the standby server's data directory while it is processing logs shipped from the primary, we will be able to reload that backup and restart the standby's recovery process from the last restart point. We no longer need to keep WAL files from before the standby's restart point. If recovery is needed, it will be faster to recover from the incrementally updated backup than from the original base backup. The procedure for taking a file system backup of the standby server's data directory while it's processing logs shipped from the primary is: Perform the backup, without using pg_start_backup and pg_stop_backup. Note that the pg_control file must be backed up first, as in: cp /var/lib/pgsql/data/global/pg_control /tmp cp -r /var/lib/pgsql/data /path/to/backup mv /tmp/pg_control /path/to/backup/data/global pg_control contains the location where WAL replay will begin after restoring from the backup; backing it up first ensures that it points to the last restartpoint when the backup started, not some later restartpoint that happened while files were copied to the backup. Make note of the backup ending WAL location by calling the pg_last_xlog_replay_location function at the end of the backup, and keep it with the backup. psql -c "select pg_last_xlog_replay_location();" > /path/to/backup/end_location When recovering from the incrementally updated backup, the server can begin accepting connections and complete the recovery successfully before the database has become consistent. To avoid that, you must ensure the database is consistent before users try to connect to the server and when the recovery ends. You can do that by comparing the progress of the recovery with the stored backup ending WAL location: the server is not consistent until recovery has reached the backup end location. The progress of the recovery can also be observed with the pg_last_xlog_replay_location function, but that required connecting to the server while it might not be consistent yet, so care should be taken with that method. Since the standby server is not live, it is not possible to use pg_start_backup() and pg_stop_backup() to manage the backup process; it will be up to you to determine how far back you need to keep WAL segment files to have a recoverable backup. That is determined by the last restartpoint when the backup was taken, any WAL older than that can be deleted from the archive once the backup is complete. You can determine the last restartpoint by running pg_controldata on the standby server before taking the backup, or by using the log_checkpoints option to print values to the standby's server log.