High Availability, Load Balancing, and Replication

High Availability, Load Balancing, and Replication high availability failover replication load balancing clustering data partitioning Database servers can work together to allow a second server to take over quickly if the primary server fails (high availability), or to allow several computers to serve the same data (load balancing). Ideally, database servers could work together seamlessly. Web servers serving static web pages can be combined quite easily by merely load-balancing web requests to multiple machines. In fact, read-only database servers can be combined relatively easily too. Unfortunately, most database servers have a read/write mix of requests, and read/write servers are much harder to combine. This is because though read-only data needs to be placed on each server only once, a write to any server has to be propagated to all servers so that future read requests to those servers return consistent results. This synchronization problem is the fundamental difficulty for servers working together. Because there is no single solution that eliminates the impact of the sync problem for all use cases, there are multiple solutions. Each solution addresses this problem in a different way, and minimizes its impact for a specific workload. Some solutions deal with synchronization by allowing only one server to modify the data. Servers that can modify data are called read/write, master or primary servers. Servers that track changes in the master are called standby or slave servers. A standby server that cannot be connected to until it is promoted to a master server is called a warm standby server, and one that can accept connections and serves read-only queries is called a hot standby server. Some solutions are synchronous, meaning that a data-modifying transaction is not considered committed until all servers have committed the transaction. This guarantees that a failover will not lose any data and that all load-balanced servers will return consistent results no matter which server is queried. In contrast, asynchronous solutions allow some delay between the time of a commit and its propagation to the other servers, opening the possibility that some transactions might be lost in the switch to a backup server, and that load balanced servers might return slightly stale results. Asynchronous communication is used when synchronous would be too slow. Solutions can also be categorized by their granularity. Some solutions can deal only with an entire database server, while others allow control at the per-table or per-database level. Performance must be considered in any choice. There is usually a trade-off between functionality and performance. For example, a fully synchronous solution over a slow network might cut performance by more than half, while an asynchronous one might have a minimal performance impact. The remainder of this section outlines various failover, replication, and load balancing solutions. A glossary is also available. Comparison of different solutions Shared Disk Failover Shared disk failover avoids synchronization overhead by having only one copy of the database. It uses a single disk array that is shared by multiple servers. If the main database server fails, the standby server is able to mount and start the database as though it were recovering from a database crash. This allows rapid failover with no data loss. Shared hardware functionality is common in network storage devices. Using a network file system is also possible, though care must be taken that the file system has full POSIX behavior (see ). One significant limitation of this method is that if the shared disk array fails or becomes corrupt, the primary and standby servers are both nonfunctional. Another issue is that the standby server should never access the shared storage while the primary server is running. File System (Block-Device) Replication A modified version of shared hardware functionality is file system replication, where all changes to a file system are mirrored to a file system residing on another computer. The only restriction is that the mirroring must be done in a way that ensures the standby server has a consistent copy of the file system — specifically, writes to the standby must be done in the same order as those on the master. DRBD is a popular file system replication solution for Linux. Warm and Hot Standby Using Point-In-Time Recovery (PITR) Warm and hot standby servers can be kept current by reading a stream of write-ahead log (WAL) records. If the main server fails, the standby contains almost all of the data of the main server, and can be quickly made the new master database server. This is asynchronous and can only be done for the entire database server. A PITR standby server can be implemented using file-based log shipping () or streaming replication (see ), or a combination of both. For information on hot standby, see . Trigger-Based Master-Standby Replication A master-standby replication setup sends all data modification queries to the master server. The master server asynchronously sends data changes to the standby server. The standby can answer read-only queries while the master server is running. The standby server is ideal for data warehouse queries. Slony-I is an example of this type of replication, with per-table granularity, and support for multiple standby servers. Because it updates the standby server asynchronously (in batches), there is possible data loss during fail over. Statement-Based Replication Middleware With statement-based replication middleware, a program intercepts every SQL query and sends it to one or all servers. Each server operates independently. Read-write queries are sent to all servers, while read-only queries can be sent to just one server, allowing the read workload to be distributed. If queries are simply broadcast unmodified, functions like random(), CURRENT_TIMESTAMP, and sequences can have different values on different servers. This is because each server operates independently, and because SQL queries are broadcast (and not actual modified rows). If this is unacceptable, either the middleware or the application must query such values from a single server and then use those values in write queries. Another option is to use this replication option with a traditional master-standby setup, i.e. data modification queries are sent only to the master and are propagated to the standby servers via master-standby replication, not by the replication middleware. Care must also be taken that all transactions either commit or abort on all servers, perhaps using two-phase commit ( and . Pgpool-II and Sequoia are examples of this type of replication. Asynchronous Multimaster Replication For servers that are not regularly connected, like laptops or remote servers, keeping data consistent among servers is a challenge. Using asynchronous multimaster replication, each server works independently, and periodically communicates with the other servers to identify conflicting transactions. The conflicts can be resolved by users or conflict resolution rules. Bucardo is an example of this type of replication. Synchronous Multimaster Replication In synchronous multimaster replication, each server can accept write requests, and modified data is transmitted from the original server to every other server before each transaction commits. Heavy write activity can cause excessive locking, leading to poor performance. In fact, write performance is often worse than that of a single server. Read requests can be sent to any server. Some implementations use shared disk to reduce the communication overhead. Synchronous multimaster replication is best for mostly read workloads, though its big advantage is that any server can accept write requests — there is no need to partition workloads between master and standby servers, and because the data changes are sent from one server to another, there is no problem with non-deterministic functions like random(). PostgreSQL does not offer this type of replication, though PostgreSQL two-phase commit ( and ) can be used to implement this in application code or middleware. Commercial Solutions Because PostgreSQL is open source and easily extended, a number of companies have taken PostgreSQL and created commercial closed-source solutions with unique failover, replication, and load balancing capabilities. summarizes the capabilities of the various solutions listed above. High Availability, Load Balancing, and Replication Feature Matrix Feature Shared Disk Failover File System Replication Hot/Warm Standby Using PITR Trigger-Based Master-Standby Replication Statement-Based Replication Middleware Asynchronous Multimaster Replication Synchronous Multimaster Replication Most Common Implementation NAS DRBD PITR Slony pgpool-II Bucardo Communication Method shared disk disk blocks WAL table rows SQL table rows table rows and row locks No special hardware required • • • • • • Allows multiple master servers • • • No master server overhead • • • No waiting for multiple servers • • • • Master failure will never lose data • • • • Standby accept read-only queries Hot only • • • • Per-table granularity • • • No conflict resolution necessary • • • • •

There are a few solutions that do not fit into the above categories: Data Partitioning Data partitioning splits tables into data sets. Each set can be modified by only one server. For example, data can be partitioned by offices, e.g., London and Paris, with a server in each office. If queries combining London and Paris data are necessary, an application can query both servers, or master/standby replication can be used to keep a read-only copy of the other office's data on each server. Multiple-Server Parallel Query Execution Many of the above solutions allow multiple servers to handle multiple queries, but none allow a single query to use multiple servers to complete faster. This solution allows multiple servers to work concurrently on a single query. It is usually accomplished by splitting the data among servers and having each server execute its part of the query and return results to a central server where they are combined and returned to the user. Pgpool-II has this capability. Also, this can be implemented using the PL/Proxy tool set. Log-Shipping Standby Servers Continuous archiving can be used to create a high availability (HA) cluster configuration with one or more standby servers ready to take over operations if the primary server fails. This capability is widely referred to as warm standby or log shipping. The primary and standby server work together to provide this capability, though the servers are only loosely coupled. The primary server operates in continuous archiving mode, while each standby server operates in continuous recovery mode, reading the WAL files from the primary. No changes to the database tables are required to enable this capability, so it offers low administration overhead compared to some other replication solutions. This configuration also has relatively low performance impact on the primary server. Directly moving WAL records from one database server to another is typically described as log shipping. PostgreSQL implements file-based log shipping, which means that WAL records are transferred one file (WAL segment) at a time. WAL files (16MB) can be shipped easily and cheaply over any distance, whether it be to an adjacent system, another system at the same site, or another system on the far side of the globe. The bandwidth required for this technique varies according to the transaction rate of the primary server. Record-based log shipping is also possible with streaming replication (see ). It should be noted that the log shipping is asynchronous, i.e., the WAL records are shipped after transaction commit. As a result, there is a window for data loss should the primary server suffer a catastrophic failure; transactions not yet shipped will be lost. The size of the data loss window in file-based log shipping can be limited by use of the archive_timeout parameter, which can be set as low as a few seconds. However such a low setting will substantially increase the bandwidth required for file shipping. If you need a window of less than a minute or so, consider using streaming replication (see ). Recovery performance is sufficiently good that the standby will typically be only moments away from full availability once it has been activated. As a result, this is called a warm standby configuration which offers high availability. Restoring a server from an archived base backup and rollforward will take considerably longer, so that technique only offers a solution for disaster recovery, not high availability. A standby server can also be used for read-only queries, in which case it is called a Hot Standby server. See for more information. warm standby PITR standby standby server log shipping witness server STONITH Planning It is usually wise to create the primary and standby servers so that they are as similar as possible, at least from the perspective of the database server. In particular, the path names associated with tablespaces will be passed across unmodified, so both primary and standby servers must have the same mount paths for tablespaces if that feature is used. Keep in mind that if is executed on the primary, any new mount point needed for it must be created on the primary and all standby servers before the command is executed. Hardware need not be exactly the same, but experience shows that maintaining two identical systems is easier than maintaining two dissimilar ones over the lifetime of the application and system. In any case the hardware architecture must be the same — shipping from, say, a 32-bit to a 64-bit system will not work. In general, log shipping between servers running different major PostgreSQL release levels is not possible. It is the policy of the PostgreSQL Global Development Group not to make changes to disk formats during minor release upgrades, so it is likely that running different minor release levels on primary and standby servers will work successfully. However, no formal support for that is offered and you are advised to keep primary and standby servers at the same release level as much as possible. When updating to a new minor release, the safest policy is to update the standby servers first — a new minor release is more likely to be able to read WAL files from a previous minor release than vice versa. Standby Server Operation In standby mode, the server continuously applies WAL received from the master server. The standby server can read WAL from a WAL archive (see ) or directly from the master over a TCP connection (streaming replication). The standby server will also attempt to restore any WAL found in the standby cluster's pg_xlog directory. That typically happens after a server restart, when the standby replays again WAL that was streamed from the master before the restart, but you can also manually copy files to pg_xlog at any time to have them replayed. At startup, the standby begins by restoring all WAL available in the archive location, calling restore_command. Once it reaches the end of WAL available there and restore_command fails, it tries to restore any WAL available in the pg_xlog directory. If that fails, and streaming replication has been configured, the standby tries to connect to the primary server and start streaming WAL from the last valid record found in archive or pg_xlog. If that fails or streaming replication is not configured, or if the connection is later disconnected, the standby goes back to step 1 and tries to restore the file from the archive again. This loop of retries from the archive, pg_xlog, and via streaming replication goes on until the server is stopped or failover is triggered by a trigger file. Standby mode is exited and the server switches to normal operation, when a trigger file is found (trigger_file). Before failover, any WAL immediately available in the archive or in pg_xlog will be restored, but no attempt is made to connect to the master. Preparing the Master for Standby Servers Set up continuous archiving on the primary to an archive directory accessible from the standby, as described in . The archive location should be accessible from the standby even when the master is down, i.e. it should reside on the standby server itself or another trusted server, not on the master server. If you want to use streaming replication, set up authentication on the primary server to allow replication connections from the standby server(s); that is, provide a suitable entry or entries in pg_hba.conf with the database field set to replication. Also ensure max_wal_senders is set to a sufficiently large value in the configuration file of the primary server. Take a base backup as described in to bootstrap the standby server. Setting Up a Standby Server To set up the standby server, restore the base backup taken from primary server (see ). Create a recovery command file recovery.conf in the standby's cluster data directory, and turn on standby_mode. Set restore_command to a simple command to copy files from the WAL archive. Do not use pg_standby or similar tools with the built-in standby mode described here. restore_command should return immediately if the file does not exist; the server will retry the command again if necessary. See for using tools like pg_standby. If you want to use streaming replication, fill in primary_conninfo with a libpq connection string, including the host name (or IP address) and any additional details needed to connect to the primary server. If the primary needs a password for authentication, the password needs to be specified in primary_conninfo as well. You can use archive_cleanup_command to prune the archive of files no longer needed by the standby. If you're setting up the standby server for high availability purposes, set up WAL archiving, connections and authentication like the primary server, because the standby server will work as a primary server after failover. You will also need to set trigger_file to make it possible to fail over. If you're setting up the standby server for reporting purposes, with no plans to fail over to it, trigger_file is not required. A simple example of a recovery.conf is: standby_mode = 'on' primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' restore_command = 'cp /path/to/archive/%f %p' trigger_file = '/path/to/trigger_file' You can have any number of standby servers, but if you use streaming replication, make sure you set max_wal_senders high enough in the primary to allow them to be connected simultaneously. If you're using a WAL archive, its size can be minimized using the archive_cleanup_command option to remove files that are no longer required by the standby server. Note however, that if you're using the archive for backup purposes, you need to retain files needed to recover from at least the latest base backup, even if they're no longer needed by the standby. Streaming Replication Streaming Replication Streaming replication allows a standby server to stay more up-to-date than is possible with file-based log shipping. The standby connects to the primary, which streams WAL records to the standby as they're generated, without waiting for the WAL file to be filled. Streaming replication is asynchronous, so there is still a small delay between committing a transaction in the primary and for the changes to become visible in the standby. The delay is however much smaller than with file-based log shipping, typically under one second assuming the standby is powerful enough to keep up with the load. With streaming replication, archive_timeout is not required to reduce the data loss window. If you use streaming replication without file-based continuous archiving, you have to set wal_keep_segments in the master to a value high enough to ensure that old WAL segments are not recycled too early, while the standby might still need them to catch up. If the standby falls behind too much, it needs to be reinitialized from a new base backup. If you set up a WAL archive that's accessible from the standby, wal_keep_segments is not required as the standby can always use the archive to catch up. To use streaming replication, set up a file-based log-shipping standby server as described in . The step that turns a file-based log-shipping standby into streaming replication standby is setting primary_conninfo setting in the recovery.conf file to point to the primary server. Set and authentication options (see pg_hba.conf) on the primary so that the standby server can connect to the replication pseudo-database on the primary server (see ). On systems that support the keepalive socket option, setting , and helps the primary promptly notice a broken connection. Set the maximum number of concurrent connections from the standby servers (see for details). When the standby is started and primary_conninfo is set correctly, the standby will connect to the primary after replaying all WAL files available in the archive. If the connection is established successfully, you will see a walreceiver process in the standby, and a corresponding walsender process in the primary. Authentication It is very important that the access privileges for replication be set up so that only trusted users can read the WAL stream, because it is easy to extract privileged information from it. Standby servers must authenticate to the primary as a superuser account. So a role with the SUPERUSER and LOGIN privileges needs to be created on the primary. Client authentication for replication is controlled by a pg_hba.conf record specifying replication in the database field. For example, if the standby is running on host IP 192.168.1.100 and the superuser's name for replication is foo, the administrator can add the following line to the pg_hba.conf file on the primary: # Allow the user "foo" from host 192.168.1.100 to connect to the primary # as a replication standby if the user's password is correctly supplied. # # TYPE DATABASE USER CIDR-ADDRESS METHOD host replication foo 192.168.1.100/32 md5 The host name and port number of the primary, connection user name, and password are specified in the recovery.conf file. The password can also be set in the ~/.pgpass file on the standby (specify replication in the database field). For example, if the primary is running on host IP 192.168.1.50, port 5432, the superuser's name for replication is foo, and the password is foopass, the administrator can add the following line to the recovery.conf file on the standby: # The standby connects to the primary that is running on host 192.168.1.50 # and port 5432 as the user "foo" whose password is "foopass". primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass' Monitoring An important health indicator of streaming replication is the amount of WAL records generated in the primary, but not yet applied in the standby. You can calculate this lag by comparing the current WAL write location on the primary with the last WAL location received by the standby. They can be retrieved using pg_current_xlog_location on the primary and the pg_last_xlog_receive_location on the standby, respectively (see and for details). The last WAL receive location in the standby is also displayed in the process status of the WAL receiver process, displayed using the ps command (see for details). Failover If the primary server fails then the standby server should begin failover procedures. If the standby server fails then no failover need take place. If the standby server can be restarted, even some time later, then the recovery process can also be restarted immediately, taking advantage of restartable recovery. If the standby server cannot be restarted, then a full new standby server instance should be created. If the primary server fails and the standby server becomes the new primary, and then the old primary restarts, you must have a mechanism for informing the old primary that it is no longer the primary. This is sometimes known as STONITH (Shoot The Other Node In The Head), which is necessary to avoid situations where both systems think they are the primary, which will lead to confusion and ultimately data loss. Many failover systems use just two systems, the primary and the standby, connected by some kind of heartbeat mechanism to continually verify the connectivity between the two and the viability of the primary. It is also possible to use a third system (called a witness server) to prevent some cases of inappropriate failover, but the additional complexity might not be worthwhile unless it is set up with sufficient care and rigorous testing. PostgreSQL does not provide the system software required to identify a failure on the primary and notify the standby database server. Many such tools exist and are well integrated with the operating system facilities required for successful failover, such as IP address migration. Once failover to the standby occurs, there is only a single server in operation. This is known as a degenerate state. The former standby is now the primary, but the former primary is down and might stay down. To return to normal operation, a standby server must be recreated, either on the former primary system when it comes up, or on a third, possibly new, system. Once complete the primary and standby can be considered to have switched roles. Some people choose to use a third server to provide backup for the new primary until the new standby server is recreated, though clearly this complicates the system configuration and operational processes. So, switching from primary to standby server can be fast but requires some time to re-prepare the failover cluster. Regular switching from primary to standby is useful, since it allows regular downtime on each system for maintenance. This also serves as a test of the failover mechanism to ensure that it will really work when you need it. Written administration procedures are advised. To trigger failover of a log-shipping standby server, create a trigger file with the filename and path specified by the trigger_file setting in recovery.conf. If trigger_file is not given, there is no way to exit recovery in the standby and promote it to a master. That can be useful for e.g reporting servers that are only used to offload read-only queries from the primary, not for high availability purposes. Alternative method for log shipping An alternative to the built-in standby mode described in the previous sections is to use a restore_command that polls the archive location. This was the only option available in versions 8.4 and below. In this setup, set standby_mode off, because you are implementing the polling required for standby operation yourself. See contrib/pg_standby () for a reference implementation of this. Note that in this mode, the server will apply WAL one file at a time, so if you use the standby server for queries (see Hot Standby), there is a delay between an action in the master and when the action becomes visible in the standby, corresponding the time it takes to fill up the WAL file. archive_timeout can be used to make that delay shorter. Also note that you can't combine streaming replication with this method. The operations that occur on both primary and standby servers are normal continuous archiving and recovery tasks. The only point of contact between the two database servers is the archive of WAL files that both share: primary writing to the archive, standby reading from the archive. Care must be taken to ensure that WAL archives from separate primary servers do not become mixed together or confused. The archive need not be large if it is only required for standby operation. The magic that makes the two loosely coupled servers work together is simply a restore_command used on the standby that, when asked for the next WAL file, waits for it to become available from the primary. The restore_command is specified in the recovery.conf file on the standby server. Normal recovery processing would request a file from the WAL archive, reporting failure if the file was unavailable. For standby processing it is normal for the next WAL file to be unavailable, so the standby must wait for it to appear. For files ending in .backup or .history there is no need to wait, and a non-zero return code must be returned. A waiting restore_command can be written as a custom script that loops after polling for the existence of the next WAL file. There must also be some way to trigger failover, which should interrupt the restore_command, break the loop and return a file-not-found error to the standby server. This ends recovery and the standby will then come up as a normal server. Pseudocode for a suitable restore_command is: triggered = false; while (!NextWALFileReady() && !triggered) { sleep(100000L); /* wait for ~0.1 sec */ if (CheckForExternalTrigger()) triggered = true; } if (!triggered) CopyWALFileForRecovery(); A working example of a waiting restore_command is provided as a contrib module named pg_standby. It should be used as a reference on how to correctly implement the logic described above. It can also be extended as needed to support specific configurations and environments. The method for triggering failover is an important part of planning and design. One potential option is the restore_command command. It is executed once for each WAL file, but the process running the restore_command is created and dies for each file, so there is no daemon or server process, and signals or a signal handler cannot be used. Therefore, the restore_command is not suitable to trigger failover. It is possible to use a simple timeout facility, especially if used in conjunction with a known archive_timeout setting on the primary. However, this is somewhat error prone since a network problem or busy primary server might be sufficient to initiate failover. A notification mechanism such as the explicit creation of a trigger file is ideal, if this can be arranged. Implementation The short procedure for configuring a standby server using this alternative method is as follows. For full details of each step, refer to previous sections as noted. Set up primary and standby systems as nearly identical as possible, including two identical copies of PostgreSQL at the same release level. Set up continuous archiving from the primary to a WAL archive directory on the standby server. Ensure that , and are set appropriately on the primary (see ). Make a base backup of the primary server (see ), and load this data onto the standby. Begin recovery on the standby server from the local WAL archive, using a recovery.conf that specifies a restore_command that waits as described previously (see ). Recovery treats the WAL archive as read-only, so once a WAL file has been copied to the standby system it can be copied to tape at the same time as it is being read by the standby database server. Thus, running a standby server for high availability can be performed at the same time as files are stored for longer term disaster recovery purposes. For testing purposes, it is possible to run both primary and standby servers on the same system. This does not provide any worthwhile improvement in server robustness, nor would it be described as HA. Record-based Log Shipping It is also possible to implement record-based log shipping using this alternative method, though this requires custom development, and changes will still only become visible to hot standby queries after a full WAL file has been shipped. An external program can call the pg_xlogfile_name_offset() function (see ) to find out the file name and the exact byte offset within it of the current end of WAL. It can then access the WAL file directly and copy the data from the last known end of WAL through the current end over to the standby servers. With this approach, the window for data loss is the polling cycle time of the copying program, which can be very small, and there is no wasted bandwidth from forcing partially-used segment files to be archived. Note that the standby servers' restore_command scripts can only deal with whole WAL files, so the incrementally copied data is not ordinarily made available to the standby servers. It is of use only when the primary dies — then the last partial WAL file is fed to the standby before allowing it to come up. The correct implementation of this process requires cooperation of the restore_command script with the data copying program. Starting with PostgreSQL version 9.0, you can use streaming replication (see ) to achieve the same benefits with less effort. Hot Standby Hot Standby Hot Standby is the term used to describe the ability to connect to the server and run read-only queries while the server is in archive recovery or standby mode. This is useful both for replication purposes and for restoring a backup to a desired state with great precision. The term Hot Standby also refers to the ability of the server to move from recovery through to normal operation while users continue running queries and/or keep their connections open. Running queries in hot standby mode is similar to normal query operation, though there are several usage and administrative differences explained below. User's Overview When the parameter is set to true on a standby server, it will begin accepting connections once the recovery has brought the system to a consistent state. All such connections are strictly read-only; not even temporary tables may be written. The data on the standby takes some time to arrive from the primary server so there will be a measurable delay between primary and standby. Running the same query nearly simultaneously on both primary and standby might therefore return differing results. We say that data on the standby is eventually consistent with the primary. Once the commit record for a transaction is replayed on the standby, the changes made by that transaction will be visible to any new snapshots taken on the standby. Snapshots may be taken at the start of each query or at the start of each transaction, depending on the current transaction isolation level. For more details, see . Transactions started during hot standby may issue the following commands: Query access - SELECT, COPY TO Cursor commands - DECLARE, FETCH, CLOSE Parameters - SHOW, SET, RESET Transaction management commands BEGIN, END, ABORT, START TRANSACTION SAVEPOINT, RELEASE, ROLLBACK TO SAVEPOINT EXCEPTION blocks and other internal subtransactions LOCK TABLE, though only when explicitly in one of these modes: ACCESS SHARE, ROW SHARE or ROW EXCLUSIVE. Plans and resources - PREPARE, EXECUTE, DEALLOCATE, DISCARD Plugins and extensions - LOAD Transactions started during hot standby will never be assigned a transaction ID and cannot write to the system write-ahead log. Therefore, the following actions will produce error messages: Data Manipulation Language (DML) - INSERT, UPDATE, DELETE, COPY FROM, TRUNCATE. Note that there are no allowed actions that result in a trigger being executed during recovery. This restriction applies even to temporary tables, because table rows cannot be read or written without assigning a transaction ID, which is currently not possible in a Hot Standby environment. Data Definition Language (DDL) - CREATE, DROP, ALTER, COMMENT. This restriction applies even to temporary tables, because carrying out these operations would require updating the system catalog tables. SELECT ... FOR SHARE | UPDATE, because row locks cannot be taken without updating the underlying data files. Rules on SELECT statements that generate DML commands. LOCK that explicitly requests a mode higher than ROW EXCLUSIVE MODE. LOCK in short default form, since it requests ACCESS EXCLUSIVE MODE. Transaction management commands that explicitly set non-read-only state: BEGIN READ WRITE, START TRANSACTION READ WRITE SET TRANSACTION READ WRITE, SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE SET transaction_read_only = off Two-phase commit commands - PREPARE TRANSACTION, COMMIT PREPARED, ROLLBACK PREPARED because even read-only transactions need to write WAL in the prepare phase (the first phase of two phase commit). Sequence updates - nextval(), setval() LISTEN, UNLISTEN, NOTIFY In normal operation, read-only transactions are allowed to update sequences and to use LISTEN, UNLISTEN, and NOTIFY, so Hot Standby sessions operate under slightly tighter restrictions than ordinary read-only sessions. It is possible that some of these restrictions might be loosened in a future release. During hot standby, the parameter transaction_read_only is always true and may not be changed. But as long as no attempt is made to modify the database, connections during hot standby will act much like any other database connection. If failover or switchover occurs, the database will switch to normal processing mode. Sessions will remain connected while the server changes mode. Once hot standby finishes, it will be possible to initiate read-write transactions (even from a session begun during hot standby). Users will be able to tell whether their session is read-only by issuing SHOW transaction_read_only. In addition, a set of functions () allow users to access information about the standby server. These allow you to write programs that are aware of the current state of the database. These can be used to monitor the progress of recovery, or to allow you to write complex programs that restore the database to particular states. Handling query conflicts The primary and standby servers are in many ways loosely connected. Actions on the primary will have an effect on the standby. As a result, there is potential for negative interactions or conflicts between them. The easiest conflict to understand is performance: if a huge data load is taking place on the primary then this will generate a similar stream of WAL records on the standby, so standby queries may contend for system resources, such as I/O. There are also additional types of conflict that can occur with Hot Standby. These conflicts are hard conflicts in the sense that queries might need to be cancelled and, in some cases, sessions disconnected to resolve them. The user is provided with several ways to handle these conflicts. Conflict cases include: Access Exclusive locks taken on the primary server, including both explicit LOCK commands and various DDL actions, conflict with table accesses in standby queries. Dropping a tablespace on the primary conflicts with standby queries using that tablespace for temporary work files. Dropping a database on the primary conflicts with sessions connected to that database on the standby. Application of a vacuum cleanup record from WAL conflicts with standby transactions whose snapshots can still see any of the rows to be removed. Application of a vacuum cleanup record from WAL conflicts with queries accessing the target page on the standby, whether or not the data to be removed is visible. On the primary server, these cases simply result in waiting; and the user might choose to cancel either of the conflicting actions. However, on the standby there is no choice: the WAL-logged action already occurred on the primary so the standby must not fail to apply it. Furthermore, allowing WAL application to wait indefinitely may be very undesirable, because the standby's state will become increasingly far behind the primary's. Therefore, a mechanism is provided to forcibly cancel standby queries that conflict with to-be-applied WAL records. An example of the problem situation is an administrator on the primary server running DROP TABLE on a table that is currently being queried on the standby server. Clearly the standby query cannot continue if the DROP TABLE is applied on the standby. If this situation occurred on the primary, the DROP TABLE would wait until the other query had finished. But when DROP TABLE is run on the primary, the primary doesn't have information about what queries are running on the standby, so it will not wait for any such standby queries. The WAL change records come through to the standby while the standby query is still running, causing a conflict. The standby server must either delay application of the WAL records (and everything after them, too) or else cancel the conflicting query so that the DROP TABLE can be applied. When a conflicting query is short, it's typically desirable to allow it to complete by delaying WAL application for a little bit; but a long delay in WAL application is usually not desirable. So the cancel mechanism has parameters, and , that define the maximum allowed delay in WAL application. Conflicting queries will be canceled once it has taken longer than the relevant delay setting to apply any newly-received WAL data. There are two parameters so that different delay values can be specified for the case of reading WAL data from an archive (i.e., initial recovery from a base backup or catching up a standby server that has fallen far behind) versus reading WAL data via streaming replication. In a standby server that exists primarily for high availability, it's best to set the delay parameters relatively short, so that the server cannot fall far behind the primary due to delays caused by standby queries. However, if the standby server is meant for executing long-running queries, then a high or even infinite delay value may be preferable. Keep in mind however that a long-running query could cause other sessions on the standby server to not see recent changes on the primary, if it delays application of WAL records. The most common reason for conflict between standby queries and WAL replay is early cleanup. Normally, PostgreSQL allows cleanup of old row versions when there are no transactions that need to see them to ensure correct visibility of data according to MVCC rules. However, this rule can only be applied for transactions executing on the master. So it is possible that cleanup on the master will remove row versions that are still visible to a transaction on the standby. Experienced users should note that both row version cleanup and row version freezing will potentially conflict with standby queries. Running a manual VACUUM FREEZE is likely to cause conflicts even on tables with no updated or deleted rows. Once the delay specified by max_standby_archive_delay or max_standby_streaming_delay has been exceeded, conflicting queries will be cancelled. This usually results just in a cancellation error, although in the case of replaying a DROP DATABASE the entire conflicting session will be terminated. Also, if the conflict is over a lock held by an idle transaction, the conflicting session is terminated (this behavior might change in the future). Cancelled queries may be retried immediately (after beginning a new transaction, of course). Since query cancellation depends on the nature of the WAL records being replayed, a query that was cancelled may well succeed if it is executed again. Keep in mind that the delay parameters are compared to the elapsed time since the WAL data was received by the standby server. Thus, the grace period allowed to any one query on the standby is never more than the delay parameter, and could be considerably less if the standby has already fallen behind as a result of waiting for previous queries to complete, or as a result of being unable to keep up with a heavy update load. Users should be clear that tables that are regularly and heavily updated on the primary server will quickly cause cancellation of longer running queries on the standby. In such cases the setting of a finite value for max_standby_archive_delay or max_standby_streaming_delay can be considered similar to setting statement_timeout. Remedial possibilities exist if the number of standby-query cancellations is found to be unacceptable. The first option is to connect to the primary server and keep a query active for as long as needed to run queries on the standby. This prevents VACUUM from removing recently-dead rows and so cleanup conflicts do not occur. This could be done using contrib/dblink and pg_sleep(), or via other mechanisms. If you do this, you should note that this will delay cleanup of dead rows on the primary, which may result in undesirable table bloat. However, the cleanup situation will be no worse than if the standby queries were running directly on the primary server, and you are still getting the benefit of off-loading execution onto the standby. max_standby_archive_delay must be kept large in this case, because delayed WAL files might already contain entries that conflict with the desired standby queries. Another option is to increase on the primary server, so that dead rows will not be cleaned up as quickly as they normally would be. This will allow more time for queries to execute before they are cancelled on the standby, without having to set a high max_standby_streaming_delay. However it is difficult to guarantee any specific execution-time window with this approach, since vacuum_defer_cleanup_age is measured in transactions executed on the primary server. Administrator's Overview If hot_standby is turned on in postgresql.conf and there is a recovery.conf file present, the server will run in Hot Standby mode. However, it may take some time for Hot Standby connections to be allowed, because the server will not accept connections until it has completed sufficient recovery to provide a consistent state against which queries can run. During this period, clients that attempt to connect will be refused with an error message. To confirm the server has come up, either loop trying to connect from the application, or look for these messages in the server logs: LOG: entering standby mode ... then some time later ... LOG: consistent recovery state reached LOG: database system is ready to accept read only connections Consistency information is recorded once per checkpoint on the primary. It is not possible to enable hot standby when reading WAL written during a period when wal_level was not set to hot_standby on the primary. Reaching a consistent state can also be delayed in the presence of both of these conditions: A write transaction has more than 64 subtransactions Very long-lived write transactions If you are running file-based log shipping ("warm standby"), you might need to wait until the next WAL file arrives, which could be as long as the archive_timeout setting on the primary. The setting of some parameters on the standby will need reconfiguration if they have been changed on the primary. For these parameters, the value on the standby must be equal to or greater than the value on the primary. If these parameters are not set high enough then the standby will refuse to start. Higher values can then be supplied and the server restarted to begin recovery again. These parameters are: max_connections max_prepared_transactions max_locks_per_transaction It is important that the administrator select appropriate settings for and . The best choices vary depending on business priorities. For example if the server is primarily tasked as a High Availability server, then you will want low delay settings, perhaps even zero, though that is a very aggressive setting. If the standby server is tasked as an additional server for decision support queries then it might be acceptable to set the maximum delay values to many hours, or even -1 which means wait forever for queries to complete. Transaction status "hint bits" written on the primary are not WAL-logged, so data on the standby will likely re-write the hints again on the standby. Thus, the standby server will still perform disk writes even though all users are read-only; no changes occur to the data values themselves. Users will still write large sort temporary files and re-generate relcache info files, so no part of the database is truly read-only during hot standby mode. Note also that writes to remote databases using dblink module, and other operations outside the database using PL functions will still be possible, even though the transaction is read-only locally. The following types of administration commands are not accepted during recovery mode: Data Definition Language (DDL) - e.g. CREATE INDEX Privilege and Ownership - GRANT, REVOKE, REASSIGN Maintenance commands - ANALYZE, VACUUM, CLUSTER, REINDEX Again, note that some of these commands are actually allowed during "read only" mode transactions on the primary. As a result, you cannot create additional indexes that exist solely on the standby, nor statistics that exist solely on the standby. If these administration commands are needed, they should be executed on the primary, and eventually those changes will propagate to the standby. pg_cancel_backend() will work on user backends, but not the Startup process, which performs recovery. pg_stat_activity does not show an entry for the Startup process, nor do recovering transactions show as active. As a result, pg_prepared_xacts is always empty during recovery. If you wish to resolve in-doubt prepared transactions, view pg_prepared_xacts on the primary and issue commands to resolve transactions there. pg_locks will show locks held by backends, as normal. pg_locks also shows a virtual transaction managed by the Startup process that owns all AccessExclusiveLocks held by transactions being replayed by recovery. Note that the Startup process does not acquire locks to make database changes, and thus locks other than AccessExclusiveLocks do not show in pg_locks for the Startup process; they are just presumed to exist. The Nagios plugin check_pgsql will work, because the simple information it checks for exists. The check_postgres monitoring script will also work, though some reported values could give different or confusing results. For example, last vacuum time will not be maintained, since no vacuum occurs on the standby. Vacuums running on the primary do still send their changes to the standby. WAL file control commands will not work during recovery, e.g. pg_start_backup, pg_switch_xlog etc. Dynamically loadable modules work, including pg_stat_statements. Advisory locks work normally in recovery, including deadlock detection. Note that advisory locks are never WAL logged, so it is impossible for an advisory lock on either the primary or the standby to conflict with WAL replay. Nor is it possible to acquire an advisory lock on the primary and have it initiate a similar advisory lock on the standby. Advisory locks relate only to the server on which they are acquired. Trigger-based replication systems such as Slony, Londiste and Bucardo won't run on the standby at all, though they will run happily on the primary server as long as the changes are not sent to standby servers to be applied. WAL replay is not trigger-based so you cannot relay from the standby to any system that requires additional database writes or relies on the use of triggers. New OIDs cannot be assigned, though some UUID generators may still work as long as they do not rely on writing new status to the database. Currently, temporary table creation is not allowed during read only transactions, so in some cases existing scripts will not run correctly. This restriction might be relaxed in a later release. This is both a SQL Standard compliance issue and a technical issue. DROP TABLESPACE can only succeed if the tablespace is empty. Some standby users may be actively using the tablespace via their temp_tablespaces parameter. If there are temporary files in the tablespace, all active queries are cancelled to ensure that temporary files are removed, so the tablespace can be removed and WAL replay can continue. Running DROP DATABASE, ALTER DATABASE ... SET TABLESPACE, or ALTER DATABASE ... RENAME on the primary will generate a WAL entry that will cause all users connected to that database on the standby to be forcibly disconnected. This action occurs immediately, whatever the setting of max_standby_streaming_delay. In normal (non-recovery) mode, if you issue DROP USER or DROP ROLE for a role with login capability while that user is still connected then nothing happens to the connected user - they remain connected. The user cannot reconnect however. This behavior applies in recovery also, so a DROP USER on the primary does not disconnect that user on the standby. The statistics collector is active during recovery. All scans, reads, blocks, index usage, etc., will be recorded normally on the standby. Replayed actions will not duplicate their effects on primary, so replaying an insert will not increment the Inserts column of pg_stat_user_tables. The stats file is deleted at the start of recovery, so stats from primary and standby will differ; this is considered a feature, not a bug. Autovacuum is not active during recovery. It will start normally at the end of recovery. The background writer is active during recovery and will perform restartpoints (similar to checkpoints on the primary) and normal block cleaning activities. This can include updates of the hint bit information stored on the standby server. The CHECKPOINT command is accepted during recovery, though it performs a restartpoint rather than a new checkpoint. Hot Standby Parameter Reference Various parameters have been mentioned above in and . On the primary, parameters and can be used. and have no effect if set on the primary. On the standby, parameters , and can be used. has no effect as long as the server remains in standby mode, though it will become relevant if the standby becomes primary. Caveats There are several limitations of Hot Standby. These can and probably will be fixed in future releases: Operations on hash indexes are not presently WAL-logged, so replay will not update these indexes. Hash indexes will not be used for query plans during recovery. Full knowledge of running transactions is required before snapshots can be taken. Transactions that use large numbers of subtransactions (currently greater than 64) will delay the start of read only connections until the completion of the longest running write transaction. If this situation occurs, explanatory messages will be sent to the server log. Valid starting points for standby queries are generated at each checkpoint on the master. If the standby is shut down while the master is in a shutdown state, it might not be possible to re-enter Hot Standby until the primary is started up, so that it generates further starting points in the WAL logs. This situation isn't a problem in the most common situations where it might happen. Generally, if the primary is shut down and not available anymore, that's likely due to a serious failure that requires the standby being converted to operate as the new primary anyway. And in situations where the primary is being intentionally taken down, coordinating to make sure the standby becomes the new primary smoothly is also standard procedure. At the end of recovery, AccessExclusiveLocks held by prepared transactions will require twice the normal number of lock table entries. If you plan on running either a large number of concurrent prepared transactions that normally take AccessExclusiveLocks, or you plan on having one large transaction that takes many AccessExclusiveLocks, you are advised to select a larger value of max_locks_per_transaction, perhaps as much as twice the value of the parameter on the primary server. You need not consider this at all if your setting of max_prepared_transactions is 0.