Failover, Replication, Load Balancing, and Clustering Options

Database servers can work together to allow a backup server to quickly take over if the primary server fails (failover), or to allow several computers to serve the same data (load balancing). Ideally, database servers could work together seamlessly. Web servers serving static web pages can be combined quite easily by merely load-balancing web requests to multiple machines. In fact, read-only database servers can be combined relatively easily too. Unfortunately, most database servers have a read/write mix of requests, and read/write servers are much harder to combine. This is because, although read-only data needs to be placed on each server only once, a write to any server must be propagated to all servers so that future read requests to those servers return consistent results.

This synchronization problem is the fundamental difficulty for servers working together. Because there is no single solution that eliminates the impact of the sync problem for all use cases, there are multiple solutions. Each solution addresses this problem in a different way, and minimizes its impact for a specific workload.

Some solutions deal with synchronization by allowing only one server to modify the data. Servers that can modify data are called read/write or "master" servers. Servers with read-only data are called backup or "slave" servers. As you will see below, these terms cover a variety of implementations. Some servers are masters of some data sets, and slaves of others. Some slaves cannot be accessed until they are changed to master servers, while other slaves can reply to read-only queries while they are slaves.

Some failover and load balancing solutions are synchronous, meaning that a data-modifying transaction is not considered committed until all servers have committed the transaction.
This guarantees that a failover will not lose any data and that all load-balanced servers will return consistent results with no propagation delay. Asynchronous updating has a small delay between the time of commit and its propagation to the other servers, opening the possibility that some transactions might be lost in the switch to a backup server, and that load-balanced servers might return slightly stale results. Asynchronous communication is used when synchronous would be too slow.

Solutions can also be categorized by their granularity. Some solutions can deal only with an entire database server, while others allow control at the per-table or per-database level.

Performance must be considered in any failover or load balancing choice. There is usually a tradeoff between functionality and performance. For example, a full synchronous solution over a slow network might cut performance by more than half, while an asynchronous one might have a minimal performance impact.

The remainder of this section outlines various failover, replication, and load balancing solutions.

Shared Disk Failover

Shared disk failover avoids synchronization overhead by having only one copy of the database. It uses a single disk array that is shared by multiple servers. If the main database server fails, the backup server is able to mount and start the database as though it were recovering from a database crash. This allows rapid failover with no data loss.

Shared hardware functionality is common in network storage devices. One significant limitation of this method is that if the shared disk array fails or becomes corrupt, the primary and backup servers are both nonfunctional.

Warm Standby Using Point-In-Time Recovery

A warm standby server (see ) can be kept current by reading a stream of write-ahead log (WAL) records. If the main server fails, the warm standby contains almost all of the data of the main server, and can be quickly made the new master database server.
This is asynchronous and can only be done for the entire database server.

Continuously Running Replication Server

A continuously running replication server allows the backup server to answer read-only queries while the master server is running. It receives a continuous stream of write activity from the master server. Because the backup server can be used for read-only database requests, it is ideal for data warehouse queries.

Slony-I is an example of this type of replication, with per-table granularity. It updates the backup server in batches, so the replication is asynchronous and might lose data during a failover.

Data Partitioning

Data partitioning splits tables into data sets. Each set can be modified by only one server. For example, data can be partitioned by office, e.g., London and Paris. While the London and Paris servers both hold all data records, only London can modify London records, and only Paris can modify Paris records. This is similar to the "Continuously Running Replication Server" item above, except that instead of having a read/write server and a read-only server, each server has a read/write data set and a read-only data set.

Such partitioning provides both failover and load balancing. Failover is achieved because the data resides on both servers, and this is an ideal way to enable failover if the servers share a slow communication channel. Load balancing is possible because read requests can go to any of the servers, and write requests are split among the servers. Of course, the communication needed to keep all the servers up to date adds overhead, so ideally the write load should be low, or localized as in the London/Paris example above.

Data partitioning is usually handled by application code, though rules and triggers can be used to keep the read-only data sets current. Slony-I can also be used in such a setup.
While Slony-I replicates only entire tables, London and Paris can be placed in separate tables, and inheritance can be used to access both tables using a single table name.

Query Broadcast Load Balancing

Query broadcast load balancing is accomplished by having a program intercept every SQL query and send it to all servers. This is unique because most replication solutions have the write server propagate its changes to the other servers. With query broadcasting, each server operates independently. Read-only queries can be sent to a single server because there is no need for all servers to process them.

One limitation of this solution is that functions like random(), CURRENT_TIMESTAMP, and sequences can have different values on different servers. This is because each server operates independently, and because SQL queries are broadcast (and not actual modified rows). If this is unacceptable, applications must query such values from a single server and then use those values in write queries. Also, care must be taken that all transactions either commit or abort on all servers, perhaps using two-phase commit ( and ). Pgpool is an example of this type of replication.

Clustering For Load Balancing

In clustering, each server can accept write requests, and modified data is transmitted from the original server to every other server before each transaction commits. Heavy write activity can cause excessive locking, leading to poor performance. In fact, write performance is often worse than that of a single server. Read requests can be sent to any server. Clustering is best for mostly-read workloads, though its big advantage is that any server can accept write requests; there is no need to partition workloads between read/write and read-only servers.

Clustering is implemented by Oracle in their RAC product. PostgreSQL does not offer this type of load balancing, though PostgreSQL two-phase commit ( and ) can be used to implement this in application code or middleware.
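
The all-or-nothing requirement mentioned for query broadcasting and clustering can be illustrated with a minimal two-phase commit sketch. It is an in-memory simulation only, not PostgreSQL code: the Participant class, its methods, and the two_phase_commit function are hypothetical names invented for this illustration. In a real deployment, the coordinator would presumably issue PostgreSQL's PREPARE TRANSACTION, COMMIT PREPARED, and ROLLBACK PREPARED commands on each server instead, and would also have to handle coordinator crashes, which this sketch ignores.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Hypothetical stand-in for one database server (not a real API)."""
    name: str
    fail_on_prepare: bool = False                   # simulate a server that cannot commit
    committed: dict = field(default_factory=dict)   # data visible to readers
    prepared: dict = field(default_factory=dict)    # txid -> staged writes

    def prepare(self, txid, writes):
        """Phase 1: stage the writes and vote; False means 'cannot commit'."""
        if self.fail_on_prepare:
            return False
        self.prepared[txid] = dict(writes)
        return True

    def commit_prepared(self, txid):
        """Phase 2 on success: make the staged writes visible."""
        self.committed.update(self.prepared.pop(txid))

    def rollback_prepared(self, txid):
        """Phase 2 on failure: discard the staged writes, if any."""
        self.prepared.pop(txid, None)


def two_phase_commit(participants, txid, writes):
    """Apply `writes` on every participant, or on none of them."""
    voted_yes = []
    for p in participants:
        if p.prepare(txid, writes):
            voted_yes.append(p)
        else:
            # A single "no" vote aborts the transaction everywhere.
            for q in voted_yes:
                q.rollback_prepared(txid)
            return False
    # Every server voted yes, so the commit is applied on all of them.
    for p in participants:
        p.commit_prepared(txid)
    return True


london = Participant("london")
paris = Participant("paris")
ok = two_phase_commit([london, paris], "tx1", {"balance": 100})

broken = Participant("broken", fail_on_prepare=True)
aborted = two_phase_commit([london, broken], "tx2", {"balance": 0})
print(ok, aborted, london.committed)
```

In this sketch the first transaction is applied on both servers, while the second leaves London's data untouched because the other participant refused to prepare. The price is an extra round trip to every server before each commit, which is consistent with the write-performance cost of synchronous solutions described earlier in this section.
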
Clustering For Parallel Query Execution

This allows multiple servers to work concurrently on a single query. One possible way this could work is for the data to be split among servers, for each server to execute its part of the query, and for the results to be sent to a central server to be combined and returned to the user. There currently is no PostgreSQL open source solution for this.

Commercial Solutions

Because PostgreSQL is open source and easily extended, a number of companies have taken PostgreSQL and created commercial closed-source solutions with unique failover, replication, and load balancing capabilities.