diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml index f9e45ec3d8..77ecc4f04b 100644 --- a/doc/src/sgml/backup.sgml +++ b/doc/src/sgml/backup.sgml @@ -1,4 +1,4 @@ - + Backup and Restore @@ -1429,8 +1429,12 @@ archive_command = 'local_backup_script.sh' Operations on hash indexes are not presently WAL-logged, so - replay will not update these indexes. The recommended workaround - is to manually + replay will not update these indexes. This means that any new inserts + will be ignored by the index, updated rows will apparently disappear and + deleted rows will still retain pointers. In other words, if you modify a + table with a hash index on it then you will get incorrect query results + on a standby server. When recovery completes, it is recommended that you + manually REINDEX each such index after completing a recovery operation. @@ -1883,6 +1887,772 @@ if (!triggered) + + Hot Standby + + + Hot Standby + + + + Hot Standby is the term used to describe the ability to connect to + the server and run queries while the server is in archive recovery. This + is useful both for log shipping replication and for restoring a backup + to an exact state with great precision. + The term Hot Standby also refers to the ability of the server to move + from recovery through to normal running while users continue running + queries and/or continue their connections. + + + + Running queries in recovery is in many ways the same as normal running, + though there are a large number of usage and administrative points + to note. + + + + User's Overview + + + Users can connect to the database while the server is in recovery + and perform read-only queries. Read-only access to catalogs and views + will also occur as normal. + + + + The data on the standby takes some time to arrive from the primary server + so there will be a measurable delay between primary and standby.
Running the + same query nearly simultaneously on both primary and standby might therefore + return differing results. We say that data on the standby is eventually + consistent with the primary. + Queries executed on the standby will be correct with regard to the transactions + that had been recovered at the start of the query, or start of first statement, + in the case of serializable transactions. In comparison with the primary, + the standby returns query results that could have been obtained on the primary + at some exact moment in the past. + + + + When a transaction is started in recovery, the parameter + transaction_read_only will be forced to be true, regardless of the + default_transaction_read_only setting in postgresql.conf. + It can't be manually set to false either. As a result, all transactions + started during recovery will be limited to read-only actions only. In all + other ways, connected sessions will appear identical to sessions + initiated during normal processing mode. There are no special commands + required to initiate a connection at this time, so all interfaces + work normally without change. After recovery finishes, the session + will allow normal read-write transactions at the start of the next + transaction, if these are requested. + + + + Read-only here means "no writes to the permanent database tables". + There are no problems with queries that make use of transient sort and + work files. + + + + The following actions are allowed + + + + + Query access - SELECT, COPY TO including views and SELECT RULEs + + + + + Cursor commands - DECLARE, FETCH, CLOSE, + + + + + Parameters - SHOW, SET, RESET + + + + + Transaction management commands + + + + BEGIN, END, ABORT, START TRANSACTION + + + + + SAVEPOINT, RELEASE, ROLLBACK TO SAVEPOINT + + + + + EXCEPTION blocks and other internal subtransactions + + + + + + + + LOCK TABLE, though only when explicitly in one of these modes: + ACCESS SHARE, ROW SHARE or ROW EXCLUSIVE. 
+ + + + + Plans and resources - PREPARE, EXECUTE, DEALLOCATE, DISCARD + + + + + Plugins and extensions - LOAD + + + + + + + These actions produce error messages + + + + + Data Manipulation Language (DML) - INSERT, UPDATE, DELETE, COPY FROM, TRUNCATE. + Note that there are no allowed actions that result in a trigger + being executed during recovery. + + + + + Data Definition Language (DDL) - CREATE, DROP, ALTER, COMMENT. + This currently applies to temporary tables as well, because their + definition causes writes to catalog tables. + + + + + SELECT ... FOR SHARE | UPDATE, which cause row locks to be written + + + + + RULEs on SELECT statements that generate DML commands. + + + + + LOCK TABLE, in short default form, since it requests ACCESS EXCLUSIVE MODE. + LOCK TABLE that explicitly requests a mode higher than ROW EXCLUSIVE MODE. + + + + + Transaction management commands that explicitly set non-read-only state + + + + BEGIN READ WRITE, + START TRANSACTION READ WRITE + + + + + SET TRANSACTION READ WRITE, + SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE + + + + + SET transaction_read_only = off + + + + + + + + Two-phase commit commands - PREPARE TRANSACTION, COMMIT PREPARED, + ROLLBACK PREPARED, because even read-only transactions need to write + WAL in the prepare phase (the first phase of two-phase commit). + + + + + Sequence updates - nextval() + + + + + LISTEN, UNLISTEN, NOTIFY, since they currently write to system tables + + + + + + + Note that the current behaviour of read-only transactions when not in + recovery is to allow the last two actions, so there are small and + subtle differences in behaviour between read-only transactions + run on the standby and during normal running. + It is possible that the restrictions on LISTEN, UNLISTEN, NOTIFY and + temporary tables may be lifted in a future release, if their internal + implementation is altered to make this possible.
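+ The read-only enforcement described above can be seen directly in a
+ standby session. A sketch of what this might look like (the table name
+ accounts is hypothetical):
+
+postgres=# SHOW transaction_read_only;
+ transaction_read_only
+-----------------------
+ on
+(1 row)
+
+postgres=# SELECT count(*) FROM accounts;        -- allowed
+postgres=# INSERT INTO accounts DEFAULT VALUES;  -- rejected
+ERROR:  cannot execute INSERT in a read-only transaction
+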
+ + + + If failover or switchover occurs, the database will switch to normal + processing mode. Sessions will remain connected while the server + changes mode. Current transactions will continue, though they will remain + read-only. After recovery is complete, it will be possible to initiate + read-write transactions. + + + + Users will be able to tell whether their session is read-only by + issuing SHOW transaction_read_only. In addition, a set of + functions allows users to + access information about Hot Standby. These allow you to write + functions that are aware of the current state of the database. These + can be used to monitor the progress of recovery, or to allow you to + write complex programs that restore the database to particular states. + + + + In recovery, transactions will not be permitted to take any table lock + higher than RowExclusiveLock. In addition, transactions may never assign + a TransactionId and may never write WAL. + Any LOCK TABLE command that runs on the standby and requests + a specific lock mode higher than ROW EXCLUSIVE MODE will be rejected. + + + + In general, queries will not experience lock conflicts with the database + changes made by recovery. This is because recovery follows the normal + concurrency control mechanism, known as MVCC. There are + some types of change that will cause conflicts, covered in the following + section. + + + + + Handling query conflicts + + + The primary and standby nodes are in many ways loosely connected. Actions + on the primary will have an effect on the standby. As a result, there is + potential for negative interactions or conflicts between them. The easiest + conflict to understand is performance: if a huge data load is taking place + on the primary then this will generate a similar stream of WAL records on the + standby, so standby queries may contend for system resources, such as I/O. + + + + There are also additional types of conflict that can occur with Hot Standby.
+ These conflicts are hard conflicts in the sense that we may + need to cancel queries and in some cases disconnect sessions to resolve them. + The user is provided with a number of optional ways to handle these + conflicts, though we must first understand the possible reasons behind a conflict. + + + + + Access Exclusive Locks from the primary node, including both explicit + LOCK commands and various kinds of DDL action + + + + + Dropping tablespaces on the primary while standby queries are using + those tablespaces for temporary work files (work_mem overflow) + + + + + Dropping databases on the primary while users are connected to that + database on the standby. + + + + + Waiting to acquire buffer cleanup locks (for which there is no timeout) + + + + + Early cleanup of data still visible to the current query's snapshot + + + + + + + Some WAL redo actions will be for DDL actions. These DDL actions repeat + actions that have already committed on the primary node, so + they must not fail on the standby node. These DDL locks take priority + and will automatically *cancel* any read-only transactions that get in + their way, after a grace period. This is similar to the possibility of + being canceled by the deadlock detector, but in this case the standby + process always wins, since the replayed actions must not fail. This + also ensures that replication doesn't fall behind while we wait for a + query to complete. Again, we assume that the standby is there primarily + for high availability purposes. + + + + An example of the above would be an administrator on the primary server + running DROP TABLE on a table that is currently being queried + on the standby server. + Clearly the query cannot continue if we let the DROP TABLE + proceed. If this situation occurred on the primary, the DROP TABLE + would wait until the query has finished.
When the query is on the standby + and the DROP TABLE is on the primary, the primary doesn't have + information about which queries are running on the standby and so the query + does not wait on the primary. The WAL change records come through to the + standby while the standby query is still running, causing a conflict. + + + + The most common reason for conflict between standby queries and WAL redo is + "early cleanup". Normally, PostgreSQL allows cleanup of old + row versions when there are no users who may need to see them to ensure correct + visibility of data (the heart of MVCC). If there is a standby query that has + been running for longer than any query on the primary then it is possible + for old row versions to be removed by either a vacuum or HOT. This will + then generate WAL records that, if applied, would remove data on the + standby that might *potentially* be required by the standby query. + In more technical language, the primary's xmin horizon is later than + the standby's xmin horizon, allowing dead rows to be removed. + + + + Experienced users should note that both row version cleanup and row version + freezing will potentially conflict with recovery queries. Running a + manual VACUUM FREEZE is likely to cause conflicts even on tables + with no updated or deleted rows. + + + + We have a number of choices for resolving query conflicts. The default + is that we wait and hope the query completes. The server will wait + automatically until the lag between primary and standby is at most + max_standby_delay seconds. Once that grace period expires, + we take one of the following actions: + + + + + If the conflict is caused by a lock, we cancel the conflicting standby + transaction immediately. If the transaction is idle-in-transaction + then currently we abort the session instead, though this may change + in the future. 
+ + + + + + If the conflict is caused by cleanup records we tell the standby query + that a conflict has occurred and that it must cancel itself to avoid the + risk that it silently fails to read relevant data because + that data has been removed. (This is regrettably very similar to the + much feared and iconic error message "snapshot too old"). Some cleanup + records only cause conflict with older queries, though some types of + cleanup record affect all queries. + + + + If cancellation does occur, the query and/or transaction can always + be re-executed. The error is dynamic and will not necessarily occur + the same way if the query is executed again. + + + + + + + max_standby_delay is set in postgresql.conf. + The parameter applies to the server as a whole so if the delay is all used + up by a single query then there may be little or no waiting for queries that + follow immediately, though they will have benefited equally from the initial + waiting period. The server may take time to catch up again before the grace + period is available again, though if there is a heavy and constant stream + of conflicts it may seldom catch up fully. + + + + Users should be clear that tables that are regularly and heavily updated on + primary server will quickly cause cancellation of longer running queries on + the standby. In those cases max_standby_delay can be + considered somewhat but not exactly the same as setting + statement_timeout. + + + + Other remedial actions exist if the number of cancellations is unacceptable. + The first option is to connect to primary server and keep a query active + for as long as we need to run queries on the standby. This guarantees that + a WAL cleanup record is never generated and we don't ever get query + conflicts as described above. This could be done using contrib/dblink + and pg_sleep(), or via other mechanisms. 
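+ For example, using contrib/dblink from the standby to hold a snapshot
+ open on the primary might look like this (the connection string, connection
+ name and one-hour duration are illustrative only):
+
+-- Open a connection to the primary and hold a query (and therefore a
+-- snapshot) open there, preventing cleanup of row versions that standby
+-- queries may still need.
+SELECT dblink_connect('hold_open', 'host=primary.example.com dbname=app');
+SELECT * FROM dblink('hold_open', 'SELECT pg_sleep(3600)') AS t(unused text);
+SELECT dblink_disconnect('hold_open');
+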
If you do this, you should note + that this will delay cleanup of dead rows by vacuum or HOT and many + people may find this undesirable. However, we should remember that + primary and standby nodes are linked via the WAL, so this situation is no + different to the case where we ran the query on the primary node itself + except we have the benefit of off-loading the execution onto the standby. + + + + It is also possible to set vacuum_defer_cleanup_age on the primary + to defer the cleanup of records by autovacuum, vacuum and HOT. This may allow + more time for queries to execute before they are cancelled on the standby, + without the need for setting a high max_standby_delay. + + + + Three-way deadlocks are possible between AccessExclusiveLocks arriving from + the primary, cleanup WAL records that require buffer cleanup locks and + user requests that are waiting behind replayed AccessExclusiveLocks. Deadlocks + are currently resolved by the cancellation of user processes that would + need to wait on a lock. This is heavy-handed and generates more query + cancellations than we need to, though does remove the possibility of deadlock. + This behaviour is expected to improve substantially for the main release + version of 8.5. + + + + Dropping tablespaces or databases is discussed in the administrator's + section since they are not typical user situations. + + + + + Administrator's Overview + + + If there is a recovery.conf file present the server will start + in Hot Standby mode by default, though recovery_connections can + be disabled via postgresql.conf, if required. The server may take + some time to enable recovery connections since the server must first complete + sufficient recovery to provide a consistent state against which queries + can run before enabling read only connections. Look for these messages + in the server logs + + +LOG: initializing recovery connections + +... then some time later ... 
+ +LOG: consistent recovery state reached +LOG: database system is ready to accept read only connections + + + Consistency information is recorded once per checkpoint on the primary, as long + as recovery_connections is enabled (on the primary). If this parameter + is disabled, it will not be possible to enable recovery connections on the standby. + The consistent state can also be delayed in the presence of both of these conditions + + + + + a write transaction has more than 64 subtransactions + + + + + very long-lived write transactions + + + + + If you are running file-based log shipping ("warm standby"), you may need + to wait until the next WAL file arrives, which could be as long as the + archive_timeout setting on the primary. + + + + The setting of some parameters on the standby will need reconfiguration + if they have been changed on the primary. The value on the standby must + be equal to or greater than the value on the primary. If these parameters + are not set high enough then the standby will not be able to track work + correctly from recovering transactions. If these values are set too low, + the server will halt. Higher values can then be supplied and the server + restarted to begin recovery again. + + + + + max_connections + + + + + max_prepared_transactions + + + + + max_locks_per_transaction + + + + + + + It is important that the administrator consider the appropriate setting + of max_standby_delay, set in postgresql.conf. + There is no optimal setting; it should be set according to business + priorities. For example, if the server is primarily tasked as a High + Availability server, then you may wish to lower + max_standby_delay or even set it to zero, though that is a + very aggressive setting. If the standby server is tasked as an additional + server for decision support queries then it may be acceptable to set this + to a value of many hours (in seconds).
It is also possible to set + max_standby_delay to -1, which means wait forever for queries + to complete if there are conflicts; this will be useful when performing + an archive recovery from a backup. + + + + Transaction status "hint bits" written on the primary are not WAL-logged, + so queries running on the standby will likely re-write the hints locally. + Thus the main database blocks will produce write I/Os even though + all users are read-only; no changes have occurred to the data values + themselves. Users will be able to write large sort temp files and + re-generate relcache info files, so there is no part of the database + that is truly read-only during hot standby mode. There is no restriction + on the use of set-returning functions, or other users of tuplestore/tuplesort + code. Note also that writes to remote databases will still be possible, + even though the transaction is read-only locally. + + + + The following types of administrator command are not accepted + during recovery mode + + + + + Data Definition Language (DDL) - e.g. CREATE INDEX + + + + + Privilege and Ownership - GRANT, REVOKE, REASSIGN + + + + + Maintenance commands - ANALYZE, VACUUM, CLUSTER, REINDEX + + + + + + + Note again that some of these commands are actually allowed during + "read only" mode transactions on the primary. + + + + As a result, you cannot create additional indexes that exist solely + on the standby, nor can you create statistics that exist solely on the + standby. If these administrator commands are needed they should be executed + on the primary so that the changes will propagate through to the + standby. + + + + pg_cancel_backend() will work on user backends, but not the + Startup process, which performs recovery. pg_stat_activity does not + show an entry for the Startup process, nor do recovering transactions + show as active. As a result, pg_prepared_xacts is always empty during + recovery.
If you wish to resolve in-doubt prepared transactions + then look at pg_prepared_xacts on the primary and issue commands to + resolve those transactions there. + + + + pg_locks will show locks held by backends as normal. pg_locks also shows + a virtual transaction managed by the Startup process that owns all + AccessExclusiveLocks held by transactions being replayed by recovery. + Note that the Startup process does not acquire locks to + make database changes, and thus locks other than AccessExclusiveLocks + do not show in pg_locks for the Startup process; they are just presumed + to exist. + + + + check_pgsql will work, but it is very simple. + check_postgres will also work, though some actions + could give different or confusing results. + For example, last vacuum time will not be maintained, since no + vacuum occurs on the standby (though vacuums running on the primary do + send their changes to the standby). + + + + WAL file control commands will not work during recovery, + e.g. pg_start_backup, pg_switch_xlog, etc. + + + + Dynamically loadable modules work, including pg_stat_statements. + + + + Advisory locks work normally in recovery, including deadlock detection. + Note that advisory locks are never WAL-logged, so it is not possible for + an advisory lock on either the primary or the standby to conflict with WAL + replay. Nor is it possible to acquire an advisory lock on the primary + and have it initiate a similar advisory lock on the standby. Advisory + locks relate only to the single server on which they are acquired. + + + + Trigger-based replication systems such as Slony, + Londiste and Bucardo won't run on the + standby at all, though they will run happily on the primary server as + long as the changes are not sent to standby servers to be applied. + WAL replay is not trigger-based, so you cannot relay from the + standby to any system that requires additional database writes or + relies on the use of triggers.
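+ The AccessExclusiveLocks being replayed by the Startup process, mentioned
+ above, can be observed in pg_locks. One possible query (a sketch using
+ standard pg_locks columns):
+
+SELECT locktype, database, relation::regclass AS relation,
+       mode, virtualtransaction
+  FROM pg_locks
+ WHERE mode = 'AccessExclusiveLock';
+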
+ + + + New oids cannot be assigned, though some UUID generators may still + work as long as they do not rely on writing new status to the database. + + + + Currently, temp table creation is not allowed during read only + transactions, so in some cases existing scripts will not run correctly. + It is possible we may relax that restriction in a later release. This is + both a SQL Standard compliance issue and a technical issue. + + + + DROP TABLESPACE can only succeed if the tablespace is empty. + Some standby users may be actively using the tablespace via their + temp_tablespaces parameter. If there are temp files in the + tablespace we currently cancel all active queries to ensure that temp + files are removed, so that we can remove the tablespace and continue with + WAL replay. + + + + Running DROP DATABASE, ALTER DATABASE ... SET TABLESPACE, + or ALTER DATABASE ... RENAME on primary will generate a log message + that will cause all users connected to that database on the standby to be + forcibly disconnected, once max_standby_delay has been reached. + + + + In normal running, if you issue DROP USER or DROP ROLE + for a role with login capability while that user is still connected then + nothing happens to the connected user - they remain connected. The user cannot + reconnect however. This behaviour applies in recovery also, so a + DROP USER on the primary does not disconnect that user on the standby. + + + + Stats collector is active during recovery. All scans, reads, blocks, + index usage etc will all be recorded normally on the standby. Replayed + actions will not duplicate their effects on primary, so replaying an + insert will not increment the Inserts column of pg_stat_user_tables. + The stats file is deleted at start of recovery, so stats from primary + and standby will differ; this is considered a feature not a bug. + + + + Autovacuum is not active during recovery, though will start normally + at the end of recovery. 
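+ Because the stats collector runs normally during recovery, read activity
+ on the standby can be observed in the usual views. For example, a sketch
+ of a query against pg_stat_user_tables (note that n_tup_ins reflects only
+ local activity, not replayed inserts, as described above):
+
+SELECT relname, seq_scan, idx_scan, n_tup_ins
+  FROM pg_stat_user_tables
+ ORDER BY seq_scan DESC
+ LIMIT 5;
+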
+ + + + Background writer is active during recovery and will perform + restartpoints (similar to checkpoints on the primary) and normal block + cleaning activities. The CHECKPOINT command is accepted during recovery, + though it performs a restartpoint rather than a new checkpoint. + + + + + Hot Standby Parameter Reference + + + Various parameters have been mentioned above in the + and sections. + + + + On the primary, the parameters recovery_connections and + vacuum_defer_cleanup_age can be set on the + primary server to assist the successful configuration of Hot Standby servers. + max_standby_delay has no effect if set on the primary. + + + + On the standby, the parameters recovery_connections and + max_standby_delay can be used to enable and control Hot Standby. + vacuum_defer_cleanup_age has no effect during recovery. + + + + + Caveats + + + At this writing, there are several limitations of Hot Standby. + These can and probably will be fixed in future releases: + + + + + Operations on hash indexes are not presently WAL-logged, so + replay will not update these indexes. Hash indexes will not be + used for query plans during recovery. + + + + + Full knowledge of running transactions is required before snapshots + may be taken. Transactions that use large numbers of subtransactions + (currently greater than 64) will delay the start of read-only + connections until the completion of the longest running write transaction. + If this situation occurs, explanatory messages will be sent to the server log. + + + + + Valid starting points for recovery connections are generated at each + checkpoint on the master. If the standby is shut down while the master + is also in a shutdown state, it may not be possible to re-enter Hot Standby + until the primary is started up so that it generates further starting + points in the WAL logs.
This is not considered a serious issue + because the standby is usually switched into the primary role while + the first node is taken down. + + + + + At the end of recovery, AccessExclusiveLocks held by prepared transactions + will require twice the normal number of lock table entries. If you plan + on running either a large number of concurrent prepared transactions + that normally take AccessExclusiveLocks, or you plan on having one + large transaction that takes many AccessExclusiveLocks then you are + advised to select a larger value of max_locks_per_transaction, + up to, but never more than twice the value of the parameter setting on + the primary server in rare extremes. You need not consider this at all if + your setting of max_prepared_transactions is 0. + + + + + + + + + Migration Between Releases diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index d13e6d151f..4554cb614a 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1,4 +1,4 @@ - + Server Configuration @@ -376,6 +376,12 @@ SET ENABLE_SEQSCAN TO OFF; allows. See for information on how to adjust those parameters, if necessary. + + + When running a standby server, you must set this parameter to the + same or higher value than on the master server. Otherwise, queries + will not be allowed in the standby server. + @@ -826,6 +832,12 @@ SET ENABLE_SEQSCAN TO OFF; allows. See for information on how to adjust those parameters, if necessary. + + + When running a standby server, you must set this parameter to the + same or higher value than on the master server. Otherwise, queries + will not be allowed in the standby server. + @@ -1733,6 +1745,51 @@ archive_command = 'copy "%p" "C:\\server\\archivedir\\%f"' # Windows + + + Standby Servers + + + + + recovery_connections (boolean) + + + Parameter has two roles. During recovery, specifies whether or not + you can connect and run queries to enable . 
+ During normal running, specifies whether additional information is written + to WAL to allow recovery connections on a standby server that reads + WAL data generated by this server. The default value is + on. It is thought that there is little + measurable difference in performance from using this feature, so + feedback is welcome if any production impacts are noticeable. + It is likely that this parameter will be removed in later releases. + This parameter can only be set at server start. + + + + + + max_standby_delay (string) + + + When server acts as a standby, this parameter specifies a wait policy + for queries that conflict with incoming data changes. Valid settings + are -1, meaning wait forever, or a wait time of 0 or more seconds. + If a conflict should occur the server will delay up to this + amount before it begins trying to resolve things less amicably, as + described in . Typically, + this parameter makes sense only during replication, so when + performing an archive recovery to recover from data loss a + parameter setting of 0 is recommended. The default is 30 seconds. + This parameter can only be set in the postgresql.conf + file or on the server command line. + + + + + + @@ -4161,6 +4218,29 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; + + vacuum_defer_cleanup_age (integer) + + vacuum_defer_cleanup_age configuration parameter + + + + Specifies the number of transactions by which VACUUM and + HOT updates will defer cleanup of dead row versions. The + default is 0 transactions, meaning that dead row versions will be + removed as soon as possible. You may wish to set this to a non-zero + value when planning or maintaining a + configuration. The recommended value is 0 unless you have + clear reason to increase it. The purpose of the parameter is to + allow the user to specify an approximate time delay before cleanup + occurs. 
However, it should be noted that there is no direct link with + any specific time delay and so the results will be application and + installation specific, as well as variable over time, depending upon + the transaction rate (of writes only). + + + + bytea_output (enum) @@ -4689,6 +4769,12 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir' allows. See for information on how to adjust those parameters, if necessary. + + + When running a standby server, you must set this parameter to the + same or higher value than on the master server. Otherwise, queries + will not be allowed in the standby server. + @@ -5546,6 +5632,32 @@ plruby.use_strict = true # generates error: unknown class name + + trace_recovery_messages (string) + + trace_recovery_messages configuration parameter + + + + Controls which message levels are written to the server log + for system modules needed for recovery processing. This allows + the user to override the normal setting of log_min_messages, + but only for specific messages. This is intended for use in + debugging Hot Standby. + Valid values are DEBUG5, DEBUG4, + DEBUG3, DEBUG2, DEBUG1, + INFO, NOTICE, WARNING, + ERROR, LOG, FATAL, and + PANIC. Each level includes all the levels that + follow it. The later the level, the fewer messages are sent + to the log. The default is WARNING. Note that + LOG has a different rank here than in + client_min_messages. + Parameter should be set in the postgresql.conf only. + + + + zero_damaged_pages (boolean) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 7d6125c97e..5094727403 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -1,4 +1,4 @@ - + Functions and Operators @@ -13132,6 +13132,38 @@ postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup()); . + + pg_is_in_recovery + + + + The functions shown in provide information + about the current status of Hot Standby. 
+ These functions may be executed during both recovery and in normal running. + + + + Recovery Information Functions + + + Name Return Type Description + + + + + + + pg_is_in_recovery() + + bool + True if recovery is still in progress. + + + + +
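+ For example, a monitoring script can use this function to distinguish a
+ standby from a primary:
+
+postgres=# SELECT pg_is_in_recovery();
+ pg_is_in_recovery
+-------------------
+ t
+(1 row)
+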
+ The functions shown in calculate the disk space usage of database objects. diff --git a/doc/src/sgml/ref/checkpoint.sgml b/doc/src/sgml/ref/checkpoint.sgml index 76eb273dea..31f1b0fe19 100644 --- a/doc/src/sgml/ref/checkpoint.sgml +++ b/doc/src/sgml/ref/checkpoint.sgml @@ -1,4 +1,4 @@ - + @@ -42,6 +42,11 @@ CHECKPOINT for more information about the WAL system. + + If executed during recovery, the CHECKPOINT command + will force a restartpoint rather than writing a new checkpoint. + + Only superusers can call CHECKPOINT. The command is not intended for use during normal operation. diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c index 1f008b727f..186805b124 100644 --- a/src/backend/access/gin/ginxlog.c +++ b/src/backend/access/gin/ginxlog.c @@ -8,7 +8,7 @@ * Portions Copyright (c) 1994, Regents of the University of California * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/access/gin/ginxlog.c,v 1.19 2009/06/11 14:48:53 momjian Exp $ + * $PostgreSQL: pgsql/src/backend/access/gin/ginxlog.c,v 1.20 2009/12/19 01:32:31 sriggs Exp $ *------------------------------------------------------------------------- */ #include "postgres.h" @@ -621,6 +621,10 @@ gin_redo(XLogRecPtr lsn, XLogRecord *record) { uint8 info = record->xl_info & ~XLR_INFO_MASK; + /* + * GIN indexes do not require any conflict processing. 
+ */ + RestoreBkpBlocks(lsn, record, false); topCtx = MemoryContextSwitchTo(opCtx); diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c index 672d714e01..7a9f8934cf 100644 --- a/src/backend/access/gist/gistxlog.c +++ b/src/backend/access/gist/gistxlog.c @@ -8,7 +8,7 @@ * Portions Copyright (c) 1994, Regents of the University of California * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/access/gist/gistxlog.c,v 1.32 2009/01/20 18:59:36 heikki Exp $ + * $PostgreSQL: pgsql/src/backend/access/gist/gistxlog.c,v 1.33 2009/12/19 01:32:32 sriggs Exp $ *------------------------------------------------------------------------- */ #include "postgres.h" @@ -396,6 +396,12 @@ gist_redo(XLogRecPtr lsn, XLogRecord *record) uint8 info = record->xl_info & ~XLR_INFO_MASK; MemoryContext oldCxt; + /* + * GIST indexes do not require any conflict processing. NB: If we ever + * implement a similar optimization we have in b-tree, and remove killed + * tuples outside VACUUM, we'll need to handle that here. + */ + RestoreBkpBlocks(lsn, record, false); oldCxt = MemoryContextSwitchTo(opCtx); diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 148d88ba27..4b85b127a7 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -8,7 +8,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/access/heap/heapam.c,v 1.278 2009/08/24 02:18:31 tgl Exp $ + * $PostgreSQL: pgsql/src/backend/access/heap/heapam.c,v 1.279 2009/12/19 01:32:32 sriggs Exp $ * * * INTERFACE ROUTINES @@ -59,6 +59,7 @@ #include "storage/lmgr.h" #include "storage/procarray.h" #include "storage/smgr.h" +#include "storage/standby.h" #include "utils/datum.h" #include "utils/inval.h" #include "utils/lsyscache.h" @@ -248,8 +249,11 @@ heapgetpage(HeapScanDesc scan, BlockNumber page) /* * If the all-visible flag indicates that all tuples on the page are * visible to everyone, we can skip the per-tuple visibility tests. 
+ * But not in hot standby mode. A tuple that's already visible to all + * transactions in the master might still be invisible to a read-only + * transaction in the standby. */ - all_visible = PageIsAllVisible(dp); + all_visible = PageIsAllVisible(dp) && !snapshot->takenDuringRecovery; for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff); lineoff <= lines; @@ -3769,6 +3773,60 @@ heap_restrpos(HeapScanDesc scan) } } +/* + * If 'tuple' contains any XID greater than latestRemovedXid, update + * latestRemovedXid to the greatest one found. + */ +void +HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple, + TransactionId *latestRemovedXid) +{ + TransactionId xmin = HeapTupleHeaderGetXmin(tuple); + TransactionId xmax = HeapTupleHeaderGetXmax(tuple); + TransactionId xvac = HeapTupleHeaderGetXvac(tuple); + + if (tuple->t_infomask & HEAP_MOVED_OFF || + tuple->t_infomask & HEAP_MOVED_IN) + { + if (TransactionIdPrecedes(*latestRemovedXid, xvac)) + *latestRemovedXid = xvac; + } + + if (TransactionIdPrecedes(*latestRemovedXid, xmax)) + *latestRemovedXid = xmax; + + if (TransactionIdPrecedes(*latestRemovedXid, xmin)) + *latestRemovedXid = xmin; + + Assert(TransactionIdIsValid(*latestRemovedXid)); +} + +/* + * Perform XLogInsert to register a heap cleanup info message. These + * messages are sent once per VACUUM and are required because + * of the phasing of removal operations during a lazy VACUUM. + * see comments for vacuum_log_cleanup_info(). + */ +XLogRecPtr +log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid) +{ + xl_heap_cleanup_info xlrec; + XLogRecPtr recptr; + XLogRecData rdata; + + xlrec.node = rnode; + xlrec.latestRemovedXid = latestRemovedXid; + + rdata.data = (char *) &xlrec; + rdata.len = SizeOfHeapCleanupInfo; + rdata.buffer = InvalidBuffer; + rdata.next = NULL; + + recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEANUP_INFO, &rdata); + + return recptr; +} + /* * Perform XLogInsert for a heap-clean operation. 
Caller must already * have modified the buffer and marked it dirty. @@ -3776,13 +3834,17 @@ heap_restrpos(HeapScanDesc scan) * Note: prior to Postgres 8.3, the entries in the nowunused[] array were * zero-based tuple indexes. Now they are one-based like other uses * of OffsetNumber. + * + * We also include latestRemovedXid, which is the greatest XID present in + * the removed tuples. That allows recovery processing to cancel or wait + * for long standby queries that can still see these tuples. */ XLogRecPtr log_heap_clean(Relation reln, Buffer buffer, OffsetNumber *redirected, int nredirected, OffsetNumber *nowdead, int ndead, OffsetNumber *nowunused, int nunused, - bool redirect_move) + TransactionId latestRemovedXid, bool redirect_move) { xl_heap_clean xlrec; uint8 info; @@ -3794,6 +3856,7 @@ log_heap_clean(Relation reln, Buffer buffer, xlrec.node = reln->rd_node; xlrec.block = BufferGetBlockNumber(buffer); + xlrec.latestRemovedXid = latestRemovedXid; xlrec.nredirected = nredirected; xlrec.ndead = ndead; @@ -4067,6 +4130,33 @@ log_newpage(RelFileNode *rnode, ForkNumber forkNum, BlockNumber blkno, return recptr; } +/* + * Handles CLEANUP_INFO + */ +static void +heap_xlog_cleanup_info(XLogRecPtr lsn, XLogRecord *record) +{ + xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record); + + if (InHotStandby) + { + VirtualTransactionId *backends; + + backends = GetConflictingVirtualXIDs(xlrec->latestRemovedXid, + InvalidOid, + true); + ResolveRecoveryConflictWithVirtualXIDs(backends, + "VACUUM index cleanup", + CONFLICT_MODE_ERROR); + } + + /* + * Actual operation is a no-op. Record type exists to provide a means + * for conflict processing to occur before we begin index vacuum actions. 
+ * See vacuumlazy.c and also comments in btvacuumpage(). + */ +} + /* * Handles CLEAN and CLEAN_MOVE record types */ @@ -4085,12 +4175,31 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record, bool clean_move) int nunused; Size freespace; + /* + * We're about to remove tuples. In Hot Standby mode, ensure that there are + * no queries running for which the removed tuples are still visible. + */ + if (InHotStandby) + { + VirtualTransactionId *backends; + + backends = GetConflictingVirtualXIDs(xlrec->latestRemovedXid, + InvalidOid, + true); + ResolveRecoveryConflictWithVirtualXIDs(backends, + "VACUUM heap cleanup", + CONFLICT_MODE_ERROR); + } + + RestoreBkpBlocks(lsn, record, true); + if (record->xl_info & XLR_BKP_BLOCK_1) return; - buffer = XLogReadBuffer(xlrec->node, xlrec->block, false); + buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL); if (!BufferIsValid(buffer)) return; + LockBufferForCleanup(buffer); page = (Page) BufferGetPage(buffer); if (XLByteLE(lsn, PageGetLSN(page))) @@ -4145,12 +4254,40 @@ heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record) Buffer buffer; Page page; + /* + * In Hot Standby mode, ensure that there are no queries running that still + * consider the frozen xids as running. + */ + if (InHotStandby) + { + VirtualTransactionId *backends; + + /* + * XXX: Using cutoff_xid is overly conservative. Even if cutoff_xid + * is recent enough to conflict with a backend, the actual values + * being frozen might not be. With a typical vacuum_freeze_min_age + * setting in the ballpark of millions of transactions, it won't make + * a difference, but it might if you run a manual VACUUM FREEZE. + * Typically the cutoff is much earlier than any recently deceased + * tuple versions removed by this vacuum, so don't worry too much.
+ */ + backends = GetConflictingVirtualXIDs(cutoff_xid, + InvalidOid, + true); + ResolveRecoveryConflictWithVirtualXIDs(backends, + "VACUUM heap freeze", + CONFLICT_MODE_ERROR); + } + + RestoreBkpBlocks(lsn, record, false); + if (record->xl_info & XLR_BKP_BLOCK_1) return; - buffer = XLogReadBuffer(xlrec->node, xlrec->block, false); + buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL); if (!BufferIsValid(buffer)) return; + LockBufferForCleanup(buffer); page = (Page) BufferGetPage(buffer); if (XLByteLE(lsn, PageGetLSN(page))) @@ -4740,6 +4877,11 @@ heap_redo(XLogRecPtr lsn, XLogRecord *record) { uint8 info = record->xl_info & ~XLR_INFO_MASK; + /* + * These operations don't overwrite MVCC data so no conflict + * processing is required. The ones in heap2 rmgr do. + */ + RestoreBkpBlocks(lsn, record, false); switch (info & XLOG_HEAP_OPMASK) @@ -4778,20 +4920,25 @@ heap2_redo(XLogRecPtr lsn, XLogRecord *record) { uint8 info = record->xl_info & ~XLR_INFO_MASK; + /* + * Note that RestoreBkpBlocks() is called after conflict processing + * within each record type handling function. 
+ */ + switch (info & XLOG_HEAP_OPMASK) { case XLOG_HEAP2_FREEZE: - RestoreBkpBlocks(lsn, record, false); heap_xlog_freeze(lsn, record); break; case XLOG_HEAP2_CLEAN: - RestoreBkpBlocks(lsn, record, true); heap_xlog_clean(lsn, record, false); break; case XLOG_HEAP2_CLEAN_MOVE: - RestoreBkpBlocks(lsn, record, true); heap_xlog_clean(lsn, record, true); break; + case XLOG_HEAP2_CLEANUP_INFO: + heap_xlog_cleanup_info(lsn, record); + break; default: elog(PANIC, "heap2_redo: unknown op code %u", info); } @@ -4921,17 +5068,26 @@ heap2_desc(StringInfo buf, uint8 xl_info, char *rec) { xl_heap_clean *xlrec = (xl_heap_clean *) rec; - appendStringInfo(buf, "clean: rel %u/%u/%u; blk %u", + appendStringInfo(buf, "clean: rel %u/%u/%u; blk %u remxid %u", xlrec->node.spcNode, xlrec->node.dbNode, - xlrec->node.relNode, xlrec->block); + xlrec->node.relNode, xlrec->block, + xlrec->latestRemovedXid); } else if (info == XLOG_HEAP2_CLEAN_MOVE) { xl_heap_clean *xlrec = (xl_heap_clean *) rec; - appendStringInfo(buf, "clean_move: rel %u/%u/%u; blk %u", + appendStringInfo(buf, "clean_move: rel %u/%u/%u; blk %u remxid %u", xlrec->node.spcNode, xlrec->node.dbNode, - xlrec->node.relNode, xlrec->block); + xlrec->node.relNode, xlrec->block, + xlrec->latestRemovedXid); + } + else if (info == XLOG_HEAP2_CLEANUP_INFO) + { + xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) rec; + + appendStringInfo(buf, "cleanup info: remxid %u", + xlrec->latestRemovedXid); } else appendStringInfo(buf, "UNKNOWN"); diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c index 71ea689d0e..1ea0899acc 100644 --- a/src/backend/access/heap/pruneheap.c +++ b/src/backend/access/heap/pruneheap.c @@ -8,7 +8,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/access/heap/pruneheap.c,v 1.18 2009/06/11 14:48:53 momjian Exp $ + * $PostgreSQL: pgsql/src/backend/access/heap/pruneheap.c,v 1.19 2009/12/19 01:32:32 sriggs Exp $ * 
*------------------------------------------------------------------------- */ @@ -30,7 +30,8 @@ typedef struct { TransactionId new_prune_xid; /* new prune hint value for page */ - int nredirected; /* numbers of entries in arrays below */ + TransactionId latestRemovedXid; /* latest xid to be removed by this prune */ + int nredirected; /* numbers of entries in arrays below */ int ndead; int nunused; /* arrays that accumulate indexes of items to be changed */ @@ -84,6 +85,14 @@ heap_page_prune_opt(Relation relation, Buffer buffer, TransactionId OldestXmin) if (!PageIsPrunable(page, OldestXmin)) return; + /* + * We can't write WAL in recovery mode, so there's no point trying to + * clean the page. The master will likely issue a cleaning WAL record + * soon anyway, so this is no particular loss. + */ + if (RecoveryInProgress()) + return; + /* * We prune when a previous UPDATE failed to find enough space on the page * for a new tuple version, or when free space falls below the relation's @@ -176,6 +185,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin, * of our working state. 
*/ prstate.new_prune_xid = InvalidTransactionId; + prstate.latestRemovedXid = InvalidTransactionId; prstate.nredirected = prstate.ndead = prstate.nunused = 0; memset(prstate.marked, 0, sizeof(prstate.marked)); @@ -257,7 +267,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin, prstate.redirected, prstate.nredirected, prstate.nowdead, prstate.ndead, prstate.nowunused, prstate.nunused, - redirect_move); + prstate.latestRemovedXid, redirect_move); PageSetLSN(BufferGetPage(buffer), recptr); PageSetTLI(BufferGetPage(buffer), ThisTimeLineID); @@ -395,6 +405,8 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum, == HEAPTUPLE_DEAD && !HeapTupleHeaderIsHotUpdated(htup)) { heap_prune_record_unused(prstate, rootoffnum); + HeapTupleHeaderAdvanceLatestRemovedXid(htup, + &prstate->latestRemovedXid); ndeleted++; } @@ -520,7 +532,11 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum, * find another DEAD tuple is a fairly unusual corner case.) */ if (tupdead) + { latestdead = offnum; + HeapTupleHeaderAdvanceLatestRemovedXid(htup, + &prstate->latestRemovedXid); + } else if (!recent_dead) break; diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c index f07996a3d4..3bbbf3b06d 100644 --- a/src/backend/access/index/genam.c +++ b/src/backend/access/index/genam.c @@ -8,7 +8,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/access/index/genam.c,v 1.77 2009/12/07 05:22:21 tgl Exp $ + * $PostgreSQL: pgsql/src/backend/access/index/genam.c,v 1.78 2009/12/19 01:32:32 sriggs Exp $ * * NOTES * many of the old access method routines have been turned into @@ -91,8 +91,19 @@ RelationGetIndexScan(Relation indexRelation, else scan->keyData = NULL; + /* + * During recovery we ignore killed tuples and don't bother to kill them + * either. 
We do this because the xmin on the primary node could easily + * be later than the xmin on the standby node, so that what the primary + * thinks is killed is supposed to be visible on standby. So for correct + * MVCC for queries during recovery we must ignore these hints and check + * all tuples. Do *not* set ignore_killed_tuples to true when running + * in a transaction that was started during recovery. + * xactStartedInRecovery should not be altered by index AMs. + */ scan->kill_prior_tuple = false; - scan->ignore_killed_tuples = true; /* default setting */ + scan->xactStartedInRecovery = TransactionStartedDuringRecovery(); + scan->ignore_killed_tuples = !scan->xactStartedInRecovery; scan->opaque = NULL; diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c index f4ffeccd32..d71b26a554 100644 --- a/src/backend/access/index/indexam.c +++ b/src/backend/access/index/indexam.c @@ -8,7 +8,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/access/index/indexam.c,v 1.115 2009/07/29 20:56:18 tgl Exp $ + * $PostgreSQL: pgsql/src/backend/access/index/indexam.c,v 1.116 2009/12/19 01:32:32 sriggs Exp $ * * INTERFACE ROUTINES * index_open - open an index relation by relation OID @@ -455,9 +455,12 @@ index_getnext(IndexScanDesc scan, ScanDirection direction) /* * If we scanned a whole HOT chain and found only dead tuples, - * tell index AM to kill its entry for that TID. + * tell index AM to kill its entry for that TID. We do not do + * this when in recovery because it may violate MVCC to do so. + * see comments in RelationGetIndexScan(). 
*/ - scan->kill_prior_tuple = scan->xs_hot_dead; + if (!scan->xactStartedInRecovery) + scan->kill_prior_tuple = scan->xs_hot_dead; /* * The AM's gettuple proc finds the next index entry matching the diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README index 9fe84e320e..e53315a83f 100644 --- a/src/backend/access/nbtree/README +++ b/src/backend/access/nbtree/README @@ -1,4 +1,4 @@ -$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.20 2008/03/21 13:23:27 momjian Exp $ +$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.21 2009/12/19 01:32:32 sriggs Exp $ Btree Indexing ============== @@ -401,6 +401,33 @@ of the WAL entry.) If the parent page becomes half-dead but is not immediately deleted due to a subsequent crash, there is no loss of consistency, and the empty page will be picked up by the next VACUUM. +Scans during Recovery +--------------------- + +The btree index type can be safely used during recovery. During recovery +we have at most one writer and potentially many readers. In that +situation the locking requirements can be relaxed and we do not need +double locking during block splits. Each WAL record makes changes to a +single level of the btree using the correct locking sequence and so +is safe for concurrent readers. Some readers may observe a block split +in progress as they descend the tree, but they will simply move right +onto the correct page. + +During recovery all index scans start with ignore_killed_tuples = false +and we never set kill_prior_tuple. We do this because the oldest xmin +on the standby server can be older than the oldest xmin on the master +server, which means tuples can be marked as killed even when they are +still visible on the standby. We don't WAL log tuple killed bits, but +they can still appear in the standby because of full page writes. So +we must always ignore them in standby, and that means it's not worth +setting them either. 
+ +Note that we talk about scans that are started during recovery. We go to +a little trouble to allow a scan to start during recovery and end during +normal running after recovery has completed. This is a key capability +because it allows running applications to continue while the standby +changes state into a normally running server. + Other Things That Are Handy to Know ----------------------------------- diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c index a1dadfb692..3263d5846a 100644 --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -8,7 +8,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.174 2009/10/02 21:14:04 tgl Exp $ + * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.175 2009/12/19 01:32:32 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -2025,7 +2025,7 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer) } if (ndeletable > 0) - _bt_delitems(rel, buffer, deletable, ndeletable); + _bt_delitems(rel, buffer, deletable, ndeletable, false, 0); /* * Note: if we didn't find any LP_DEAD items, then the page's diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c index 0dd4fdae79..85f352d343 100644 --- a/src/backend/access/nbtree/nbtpage.c +++ b/src/backend/access/nbtree/nbtpage.c @@ -9,7 +9,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtpage.c,v 1.113 2009/05/05 19:02:22 tgl Exp $ + * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtpage.c,v 1.114 2009/12/19 01:32:33 sriggs Exp $ * * NOTES * Postgres btree pages look like ordinary relation pages. The opaque @@ -653,19 +653,33 @@ _bt_page_recyclable(Page page) * * This routine assumes that the caller has pinned and locked the buffer. * Also, the given itemnos *must* appear in increasing order in the array. 
+ * + * We record VACUUMs and b-tree deletes differently in WAL. InHotStandby + * we need to be able to pin all of the blocks in the btree in physical + * order when replaying the effects of a VACUUM, just as we do for the + * original VACUUM itself. lastBlockVacuumed allows us to tell whether an + * intermediate range of blocks has had no changes at all by VACUUM, + * and so must be scanned anyway during replay. We always write a WAL record + * for the last block in the index, whether or not it contained any items + * to be removed. This allows us to scan right up to end of index to + * ensure correct locking. */ void _bt_delitems(Relation rel, Buffer buf, - OffsetNumber *itemnos, int nitems) + OffsetNumber *itemnos, int nitems, bool isVacuum, + BlockNumber lastBlockVacuumed) { Page page = BufferGetPage(buf); BTPageOpaque opaque; + Assert(isVacuum || lastBlockVacuumed == 0); + /* No ereport(ERROR) until changes are logged */ START_CRIT_SECTION(); /* Fix the page */ - PageIndexMultiDelete(page, itemnos, nitems); + if (nitems > 0) + PageIndexMultiDelete(page, itemnos, nitems); /* * We can clear the vacuum cycle ID since this page has certainly been @@ -688,15 +702,36 @@ _bt_delitems(Relation rel, Buffer buf, /* XLOG stuff */ if (!rel->rd_istemp) { - xl_btree_delete xlrec; XLogRecPtr recptr; XLogRecData rdata[2]; - xlrec.node = rel->rd_node; - xlrec.block = BufferGetBlockNumber(buf); + if (isVacuum) + { + xl_btree_vacuum xlrec_vacuum; + xlrec_vacuum.node = rel->rd_node; + xlrec_vacuum.block = BufferGetBlockNumber(buf); + + xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed; + rdata[0].data = (char *) &xlrec_vacuum; + rdata[0].len = SizeOfBtreeVacuum; + } + else + { + xl_btree_delete xlrec_delete; + xlrec_delete.node = rel->rd_node; + xlrec_delete.block = BufferGetBlockNumber(buf); + + /* + * XXX: We would like to set an accurate latestRemovedXid, but + * there is no easy way of obtaining a useful value. 
So we punt + * and store InvalidTransactionId, which forces the standby to + * wait for/cancel all currently running transactions. + */ + xlrec_delete.latestRemovedXid = InvalidTransactionId; + rdata[0].data = (char *) &xlrec_delete; + rdata[0].len = SizeOfBtreeDelete; + } - rdata[0].data = (char *) &xlrec; - rdata[0].len = SizeOfBtreeDelete; rdata[0].buffer = InvalidBuffer; rdata[0].next = &(rdata[1]); @@ -719,7 +754,10 @@ _bt_delitems(Relation rel, Buffer buf, rdata[1].buffer_std = true; rdata[1].next = NULL; - recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DELETE, rdata); + if (isVacuum) + recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM, rdata); + else + recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DELETE, rdata); PageSetLSN(page, recptr); PageSetTLI(page, ThisTimeLineID); diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index 87a8a225db..d166a811b8 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -12,7 +12,7 @@ * Portions Copyright (c) 1994, Regents of the University of California * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtree.c,v 1.172 2009/07/29 20:56:18 tgl Exp $ + * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtree.c,v 1.173 2009/12/19 01:32:33 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -57,7 +57,8 @@ typedef struct IndexBulkDeleteCallback callback; void *callback_state; BTCycleId cycleid; - BlockNumber lastUsedPage; + BlockNumber lastBlockVacuumed; /* last blkno reached by Vacuum scan */ + BlockNumber lastUsedPage; /* blkno of last non-recyclable page */ BlockNumber totFreePages; /* true total # of free pages */ MemoryContext pagedelcontext; } BTVacState; @@ -629,6 +630,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats, vstate.callback = callback; vstate.callback_state = callback_state; vstate.cycleid = cycleid; + vstate.lastBlockVacuumed = BTREE_METAPAGE; /* Initialise at first 
block */ vstate.lastUsedPage = BTREE_METAPAGE; vstate.totFreePages = 0; @@ -705,6 +707,32 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats, num_pages = new_pages; } + /* + * InHotStandby we need to scan right up to the end of the index for + * correct locking, so we may need to write a WAL record for the final + * block in the index if it was not vacuumed. It's possible that VACUUMing + * has actually removed zeroed pages at the end of the index so we need to + * take care to issue the record for last actual block and not for the + * last block that was scanned. Ignore empty indexes. + */ + if (XLogStandbyInfoActive() && + num_pages > 1 && vstate.lastBlockVacuumed < (num_pages - 1)) + { + Buffer buf; + + /* + * We can't use _bt_getbuf() here because it always applies + * _bt_checkpage(), which will barf on an all-zero page. We want to + * recycle all-zero pages, not fail. Also, we want to use a nondefault + * buffer access strategy. + */ + buf = ReadBufferExtended(rel, MAIN_FORKNUM, num_pages - 1, RBM_NORMAL, + info->strategy); + LockBufferForCleanup(buf); + _bt_delitems(rel, buf, NULL, 0, true, vstate.lastBlockVacuumed); + _bt_relbuf(rel, buf); + } + MemoryContextDelete(vstate.pagedelcontext); /* update statistics */ @@ -847,6 +875,26 @@ restart: itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum)); htup = &(itup->t_tid); + + /* + * During Hot Standby we currently assume that XLOG_BTREE_VACUUM + * records do not produce conflicts. That is only true as long + * as the callback function depends only upon whether the index + * tuple refers to heap tuples removed in the initial heap scan. + * When vacuum starts it derives a value of OldestXmin. Backends + * taking later snapshots could have a RecentGlobalXmin with a + * later xid than the vacuum's OldestXmin, so it is possible that + * row versions deleted after OldestXmin could be marked as killed + * by other backends. 
The callback function *could* look at the + * index tuple state in isolation and decide to delete the index + * tuple, though currently it does not. If it ever did, we would + * need to reconsider whether XLOG_BTREE_VACUUM records should + * cause conflicts. If they did cause conflicts they would be + * fairly harsh conflicts, since we haven't yet worked out a way + * to pass a useful value for latestRemovedXid on the + * XLOG_BTREE_VACUUM records. This applies to *any* type of index + * that marks index tuples as killed. + */ if (callback(htup, callback_state)) deletable[ndeletable++] = offnum; } @@ -858,7 +906,19 @@ restart: */ if (ndeletable > 0) { - _bt_delitems(rel, buf, deletable, ndeletable); + BlockNumber lastBlockVacuumed = BufferGetBlockNumber(buf); + + _bt_delitems(rel, buf, deletable, ndeletable, true, vstate->lastBlockVacuumed); + + /* + * Keep track of the block number of the lastBlockVacuumed, so + * we can scan those blocks as well during WAL replay. This then + * provides concurrency protection and allows btrees to be used + * while in recovery. 
+ */ + if (lastBlockVacuumed > vstate->lastBlockVacuumed) + vstate->lastBlockVacuumed = lastBlockVacuumed; + stats->tuples_removed += ndeletable; /* must recompute maxoff */ maxoff = PageGetMaxOffsetNumber(page); diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c index d132d6bdee..418eec162d 100644 --- a/src/backend/access/nbtree/nbtxlog.c +++ b/src/backend/access/nbtree/nbtxlog.c @@ -8,7 +8,7 @@ * Portions Copyright (c) 1994, Regents of the University of California * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtxlog.c,v 1.55 2009/06/11 14:48:54 momjian Exp $ + * $PostgreSQL: pgsql/src/backend/access/nbtree/nbtxlog.c,v 1.56 2009/12/19 01:32:33 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -16,7 +16,11 @@ #include "access/nbtree.h" #include "access/transam.h" +#include "access/xact.h" #include "storage/bufmgr.h" +#include "storage/procarray.h" +#include "storage/standby.h" +#include "miscadmin.h" /* * We must keep track of expected insertions due to page splits, and apply @@ -458,6 +462,97 @@ btree_xlog_split(bool onleft, bool isroot, xlrec->leftsib, xlrec->rightsib, isroot); } +static void +btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record) +{ + xl_btree_vacuum *xlrec; + Buffer buffer; + Page page; + BTPageOpaque opaque; + + xlrec = (xl_btree_vacuum *) XLogRecGetData(record); + + /* + * If queries might be active then we need to ensure every block is unpinned + * between the lastBlockVacuumed and the current block, if there are any. + * This ensures that every block in the index is touched during VACUUM as + * required to ensure scans work correctly. + */ + if (standbyState == STANDBY_SNAPSHOT_READY && + (xlrec->lastBlockVacuumed + 1) != xlrec->block) + { + BlockNumber blkno = xlrec->lastBlockVacuumed + 1; + + for (; blkno < xlrec->block; blkno++) + { + /* + * XXX we don't actually need to read the block, we + * just need to confirm it is unpinned. 
If we had a special call + * into the buffer manager we could optimise this so that + * if the block is not in shared_buffers we confirm it as unpinned. + * + * Another simple optimization would be to check if there's any + * backends running; if not, we could just skip this. + */ + buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno, RBM_NORMAL); + if (BufferIsValid(buffer)) + { + LockBufferForCleanup(buffer); + UnlockReleaseBuffer(buffer); + } + } + } + + /* + * If the block was restored from a full page image, nothing more to do. + * The RestoreBkpBlocks() call already pinned and took cleanup lock on + * it. XXX: Perhaps we should call RestoreBkpBlocks() *after* the loop + * above, to make the disk access more sequential. + */ + if (record->xl_info & XLR_BKP_BLOCK_1) + return; + + /* + * Like in btvacuumpage(), we need to take a cleanup lock on every leaf + * page. See nbtree/README for details. + */ + buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, xlrec->block, RBM_NORMAL); + if (!BufferIsValid(buffer)) + return; + LockBufferForCleanup(buffer); + page = (Page) BufferGetPage(buffer); + + if (XLByteLE(lsn, PageGetLSN(page))) + { + UnlockReleaseBuffer(buffer); + return; + } + + if (record->xl_len > SizeOfBtreeVacuum) + { + OffsetNumber *unused; + OffsetNumber *unend; + + unused = (OffsetNumber *) ((char *) xlrec + SizeOfBtreeVacuum); + unend = (OffsetNumber *) ((char *) xlrec + record->xl_len); + + if ((unend - unused) > 0) + PageIndexMultiDelete(page, unused, unend - unused); + } + + /* + * Mark the page as not containing any LP_DEAD items --- see comments in + * _bt_delitems(). 
+ */ + opaque = (BTPageOpaque) PageGetSpecialPointer(page); + opaque->btpo_flags &= ~BTP_HAS_GARBAGE; + + PageSetLSN(page, lsn); + PageSetTLI(page, ThisTimeLineID); + MarkBufferDirty(buffer); + UnlockReleaseBuffer(buffer); +} + static void btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record) { @@ -470,6 +565,11 @@ btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record) return; xlrec = (xl_btree_delete *) XLogRecGetData(record); + + /* + * We don't need to take a cleanup lock to apply these changes. + * See nbtree/README for details. + */ buffer = XLogReadBuffer(xlrec->node, xlrec->block, false); if (!BufferIsValid(buffer)) return; @@ -714,7 +814,43 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record) { uint8 info = record->xl_info & ~XLR_INFO_MASK; - RestoreBkpBlocks(lsn, record, false); + /* + * Btree delete records can conflict with standby queries. You might + * think that vacuum records would conflict as well, but we've handled + * that already. XLOG_HEAP2_CLEANUP_INFO records provide the highest xid + * cleaned by the vacuum of the heap and so we can resolve any conflicts + * just once when that arrives. After that we know that no conflicts + * exist from individual btree vacuum records on that index. + */ + if (InHotStandby) + { + if (info == XLOG_BTREE_DELETE) + { + xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record); + VirtualTransactionId *backends; + + /* + * XXX Currently we put everybody on death row, because + * currently _bt_delitems() supplies InvalidTransactionId. + * This can be fairly painful, so providing a better value + * here is worth some thought and possibly some effort to + * improve. + */ + backends = GetConflictingVirtualXIDs(xlrec->latestRemovedXid, + InvalidOid, + true); + + ResolveRecoveryConflictWithVirtualXIDs(backends, + "b-tree delete", + CONFLICT_MODE_ERROR); + } + } + + /* + * Vacuum needs to pin and take cleanup lock on every leaf page; + * a regular exclusive lock is enough for all other purposes.
+ */ + RestoreBkpBlocks(lsn, record, (info == XLOG_BTREE_VACUUM)); switch (info) { @@ -739,6 +875,9 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record) case XLOG_BTREE_SPLIT_R_ROOT: btree_xlog_split(false, true, lsn, record); break; + case XLOG_BTREE_VACUUM: + btree_xlog_vacuum(lsn, record); + break; case XLOG_BTREE_DELETE: btree_xlog_delete(lsn, record); break; @@ -843,13 +982,24 @@ btree_desc(StringInfo buf, uint8 xl_info, char *rec) xlrec->level, xlrec->firstright); break; } + case XLOG_BTREE_VACUUM: + { + xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec; + + appendStringInfo(buf, "vacuum: rel %u/%u/%u; blk %u, lastBlockVacuumed %u", + xlrec->node.spcNode, xlrec->node.dbNode, + xlrec->node.relNode, xlrec->block, + xlrec->lastBlockVacuumed); + break; + } case XLOG_BTREE_DELETE: { xl_btree_delete *xlrec = (xl_btree_delete *) rec; - appendStringInfo(buf, "delete: rel %u/%u/%u; blk %u", + appendStringInfo(buf, "delete: rel %u/%u/%u; blk %u, latestRemovedXid %u", xlrec->node.spcNode, xlrec->node.dbNode, - xlrec->node.relNode, xlrec->block); + xlrec->node.relNode, xlrec->block, + xlrec->latestRemovedXid); break; } case XLOG_BTREE_DELETE_PAGE: diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index 2edac9d088..05c41d487c 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -1,4 +1,4 @@ -$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.12 2008/10/20 19:18:18 alvherre Exp $ +$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.13 2009/12/19 01:32:33 sriggs Exp $ The Transaction System ====================== @@ -649,3 +649,34 @@ fsync it down to disk without any sort of interlock, as soon as it finishes the bulk update. However, all these paths are designed to write data that no other transaction can see until after T1 commits. The situation is thus not different from ordinary WAL-logged updates. 
+ +Transaction Emulation during Recovery +------------------------------------- + +During Recovery we replay transaction changes in the order they occurred. +As part of this replay we emulate some transactional behaviour, so that +read only backends can take MVCC snapshots. We do this by maintaining a +list of XIDs belonging to transactions that are being replayed, so that +each transaction that has recorded WAL records for database writes exists +in the array until it commits. Further details are given in comments in +procarray.c. + +Many actions write no WAL records at all, for example read only transactions. +These have no effect on MVCC in recovery and we can pretend they never +occurred at all. Subtransaction commit does not write a WAL record either +and has very little effect, since lock waiters need to wait for the +parent transaction to complete. + +Not all transactional behaviour is emulated, for example we do not insert +a transaction entry into the lock table, nor do we maintain the transaction +stack in memory. Clog entries are made normally. Multixact is not maintained +because its purpose is to record tuple level locks that an application has +requested to prevent write locks. Since write locks cannot be obtained at all, +there is never any conflict and so there is no reason to update multixact. +Subtrans is maintained during recovery but the details of the transaction +tree are ignored and all subtransactions reference the top-level TransactionId +directly. Since commit is atomic this provides correct lock wait behaviour +yet simplifies emulation of subtransactions considerably. + +Further details on locking mechanics in recovery are given in comments +with the Lock rmgr code.
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index 8544725abb..d94c09424a 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -26,7 +26,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/backend/access/transam/clog.c,v 1.53 2009/06/11 14:48:54 momjian Exp $ + * $PostgreSQL: pgsql/src/backend/access/transam/clog.c,v 1.54 2009/12/19 01:32:33 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -574,7 +574,7 @@ ExtendCLOG(TransactionId newestXact) LWLockAcquire(CLogControlLock, LW_EXCLUSIVE); /* Zero the page and make an XLOG entry about it */ - ZeroCLOGPage(pageno, true); + ZeroCLOGPage(pageno, !InRecovery); LWLockRelease(CLogControlLock); } diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index 46eca9c983..b272e9886b 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -42,7 +42,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/backend/access/transam/multixact.c,v 1.32 2009/11/23 09:58:36 heikki Exp $ + * $PostgreSQL: pgsql/src/backend/access/transam/multixact.c,v 1.33 2009/12/19 01:32:33 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -59,6 +59,7 @@ #include "storage/backendid.h" #include "storage/lmgr.h" #include "storage/procarray.h" +#include "utils/builtins.h" #include "utils/memutils.h" @@ -220,7 +221,6 @@ static MultiXactId GetNewMultiXactId(int nxids, MultiXactOffset *offset); static MultiXactId mXactCacheGetBySet(int nxids, TransactionId *xids); static int mXactCacheGetById(MultiXactId multi, TransactionId **xids); static void mXactCachePut(MultiXactId 
multi, int nxids, TransactionId *xids); -static int xidComparator(const void *arg1, const void *arg2); #ifdef MULTIXACT_DEBUG static char *mxid_to_string(MultiXactId multi, int nxids, TransactionId *xids); @@ -1221,27 +1221,6 @@ mXactCachePut(MultiXactId multi, int nxids, TransactionId *xids) MXactCache = entry; } -/* - * xidComparator - * qsort comparison function for XIDs - * - * We don't need to use wraparound comparison for XIDs, and indeed must - * not do so since that does not respect the triangle inequality! Any - * old sort order will do. - */ -static int -xidComparator(const void *arg1, const void *arg2) -{ - TransactionId xid1 = *(const TransactionId *) arg1; - TransactionId xid2 = *(const TransactionId *) arg2; - - if (xid1 > xid2) - return 1; - if (xid1 < xid2) - return -1; - return 0; -} - #ifdef MULTIXACT_DEBUG static char * mxid_to_string(MultiXactId multi, int nxids, TransactionId *xids) @@ -2051,11 +2030,18 @@ multixact_redo(XLogRecPtr lsn, XLogRecord *record) if (TransactionIdPrecedes(max_xid, xids[i])) max_xid = xids[i]; } + + /* We don't expect anyone else to modify nextXid, hence startup process + * doesn't need to hold a lock while checking this. We still acquire + * the lock to modify it, though. 
+ */ if (TransactionIdFollowsOrEquals(max_xid, ShmemVariableCache->nextXid)) { + LWLockAcquire(XidGenLock, LW_EXCLUSIVE); ShmemVariableCache->nextXid = max_xid; TransactionIdAdvance(ShmemVariableCache->nextXid); + LWLockRelease(XidGenLock); } } else diff --git a/src/backend/access/transam/recovery.conf.sample b/src/backend/access/transam/recovery.conf.sample index 1ef80ac60f..cdbb49295f 100644 --- a/src/backend/access/transam/recovery.conf.sample +++ b/src/backend/access/transam/recovery.conf.sample @@ -79,3 +79,10 @@ # # #--------------------------------------------------------------------------- +# HOT STANDBY PARAMETERS +#--------------------------------------------------------------------------- +# +# If you want to enable read-only connections during recovery, enable +# recovery_connections in postgresql.conf +# +#--------------------------------------------------------------------------- diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c index 44c3cd7769..7e1e0f60fc 100644 --- a/src/backend/access/transam/rmgr.c +++ b/src/backend/access/transam/rmgr.c @@ -3,7 +3,7 @@ * * Resource managers definition * - * $PostgreSQL: pgsql/src/backend/access/transam/rmgr.c,v 1.27 2008/11/19 10:34:50 heikki Exp $ + * $PostgreSQL: pgsql/src/backend/access/transam/rmgr.c,v 1.28 2009/12/19 01:32:33 sriggs Exp $ */ #include "postgres.h" @@ -21,6 +21,7 @@ #include "commands/sequence.h" #include "commands/tablespace.h" #include "storage/freespace.h" +#include "storage/standby.h" const RmgrData RmgrTable[RM_MAX_ID + 1] = { @@ -32,7 +33,7 @@ const RmgrData RmgrTable[RM_MAX_ID + 1] = { {"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL}, {"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL}, {"Reserved 7", NULL, NULL, NULL, NULL, NULL}, - {"Reserved 8", NULL, NULL, NULL, NULL, NULL}, + {"Standby", standby_redo, standby_desc, NULL, NULL, NULL}, {"Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL}, {"Heap", heap_redo, heap_desc, NULL, 
NULL, NULL}, {"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint}, diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c index 9c74e995db..2b9db48f3b 100644 --- a/src/backend/access/transam/subtrans.c +++ b/src/backend/access/transam/subtrans.c @@ -22,7 +22,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/backend/access/transam/subtrans.c,v 1.24 2009/01/01 17:23:36 momjian Exp $ + * $PostgreSQL: pgsql/src/backend/access/transam/subtrans.c,v 1.25 2009/12/19 01:32:33 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -68,15 +68,19 @@ static bool SubTransPagePrecedes(int page1, int page2); /* * Record the parent of a subtransaction in the subtrans log. + * + * In some cases we may need to overwrite an existing value. */ void -SubTransSetParent(TransactionId xid, TransactionId parent) +SubTransSetParent(TransactionId xid, TransactionId parent, bool overwriteOK) { int pageno = TransactionIdToPage(xid); int entryno = TransactionIdToEntry(xid); int slotno; TransactionId *ptr; + Assert(TransactionIdIsValid(parent)); + LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE); slotno = SimpleLruReadPage(SubTransCtl, pageno, true, xid); @@ -84,7 +88,8 @@ SubTransSetParent(TransactionId xid, TransactionId parent) ptr += entryno; /* Current state should be 0 */ - Assert(*ptr == InvalidTransactionId); + Assert(*ptr == InvalidTransactionId || + (*ptr == parent && overwriteOK)); *ptr = parent; diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index db5795324b..4c3a1b901c 100644 --- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -7,7 +7,7 @@ * Portions Copyright (c) 1994, Regents of the University of California * * IDENTIFICATION - * $PostgreSQL: 
pgsql/src/backend/access/transam/twophase.c,v 1.56 2009/11/23 09:58:36 heikki Exp $ + * $PostgreSQL: pgsql/src/backend/access/transam/twophase.c,v 1.57 2009/12/19 01:32:33 sriggs Exp $ * * NOTES * Each global transaction is associated with a global transaction @@ -57,6 +57,7 @@ #include "pgstat.h" #include "storage/fd.h" #include "storage/procarray.h" +#include "storage/sinvaladt.h" #include "storage/smgr.h" #include "utils/builtins.h" #include "utils/memutils.h" @@ -144,7 +145,10 @@ static void RecordTransactionCommitPrepared(TransactionId xid, int nchildren, TransactionId *children, int nrels, - RelFileNode *rels); + RelFileNode *rels, + int ninvalmsgs, + SharedInvalidationMessage *invalmsgs, + bool initfileinval); static void RecordTransactionAbortPrepared(TransactionId xid, int nchildren, TransactionId *children, @@ -736,10 +740,11 @@ TwoPhaseGetDummyProc(TransactionId xid) * 2. TransactionId[] (subtransactions) * 3. RelFileNode[] (files to be deleted at commit) * 4. RelFileNode[] (files to be deleted at abort) - * 5. TwoPhaseRecordOnDisk - * 6. ... - * 7. TwoPhaseRecordOnDisk (end sentinel, rmid == TWOPHASE_RM_END_ID) - * 8. CRC32 + * 5. SharedInvalidationMessage[] (inval messages to be sent at commit) + * 6. TwoPhaseRecordOnDisk + * 7. ... + * 8. TwoPhaseRecordOnDisk (end sentinel, rmid == TWOPHASE_RM_END_ID) + * 9. CRC32 * * Each segment except the final CRC32 is MAXALIGN'd. */ @@ -760,6 +765,8 @@ typedef struct TwoPhaseFileHeader int32 nsubxacts; /* number of following subxact XIDs */ int32 ncommitrels; /* number of delete-on-commit rels */ int32 nabortrels; /* number of delete-on-abort rels */ + int32 ninvalmsgs; /* number of cache invalidation messages */ + bool initfileinval; /* does relcache init file need invalidation? 
*/ char gid[GIDSIZE]; /* GID for transaction */ } TwoPhaseFileHeader; @@ -835,6 +842,7 @@ StartPrepare(GlobalTransaction gxact) TransactionId *children; RelFileNode *commitrels; RelFileNode *abortrels; + SharedInvalidationMessage *invalmsgs; /* Initialize linked list */ records.head = palloc0(sizeof(XLogRecData)); @@ -859,11 +867,16 @@ StartPrepare(GlobalTransaction gxact) hdr.nsubxacts = xactGetCommittedChildren(&children); hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels, NULL); hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels, NULL); + hdr.ninvalmsgs = xactGetCommittedInvalidationMessages(&invalmsgs, + &hdr.initfileinval); StrNCpy(hdr.gid, gxact->gid, GIDSIZE); save_state_data(&hdr, sizeof(TwoPhaseFileHeader)); - /* Add the additional info about subxacts and deletable files */ + /* + * Add the additional info about subxacts, deletable files and + * cache invalidation messages. + */ if (hdr.nsubxacts > 0) { save_state_data(children, hdr.nsubxacts * sizeof(TransactionId)); @@ -880,6 +893,12 @@ StartPrepare(GlobalTransaction gxact) save_state_data(abortrels, hdr.nabortrels * sizeof(RelFileNode)); pfree(abortrels); } + if (hdr.ninvalmsgs > 0) + { + save_state_data(invalmsgs, + hdr.ninvalmsgs * sizeof(SharedInvalidationMessage)); + pfree(invalmsgs); + } } /* @@ -1071,7 +1090,7 @@ RegisterTwoPhaseRecord(TwoPhaseRmgrId rmid, uint16 info, * contents of the file. Otherwise return NULL. 
*/ static char * -ReadTwoPhaseFile(TransactionId xid) +ReadTwoPhaseFile(TransactionId xid, bool give_warnings) { char path[MAXPGPATH]; char *buf; @@ -1087,10 +1106,11 @@ ReadTwoPhaseFile(TransactionId xid) fd = BasicOpenFile(path, O_RDONLY | PG_BINARY, 0); if (fd < 0) { - ereport(WARNING, - (errcode_for_file_access(), - errmsg("could not open two-phase state file \"%s\": %m", - path))); + if (give_warnings) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not open two-phase state file \"%s\": %m", + path))); return NULL; } @@ -1103,10 +1123,11 @@ ReadTwoPhaseFile(TransactionId xid) if (fstat(fd, &stat)) { close(fd); - ereport(WARNING, - (errcode_for_file_access(), - errmsg("could not stat two-phase state file \"%s\": %m", - path))); + if (give_warnings) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not stat two-phase state file \"%s\": %m", + path))); return NULL; } @@ -1134,10 +1155,11 @@ ReadTwoPhaseFile(TransactionId xid) if (read(fd, buf, stat.st_size) != stat.st_size) { close(fd); - ereport(WARNING, - (errcode_for_file_access(), - errmsg("could not read two-phase state file \"%s\": %m", - path))); + if (give_warnings) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not read two-phase state file \"%s\": %m", + path))); pfree(buf); return NULL; } @@ -1166,6 +1188,30 @@ ReadTwoPhaseFile(TransactionId xid) return buf; } +/* + * Confirms an xid is prepared, during recovery + */ +bool +StandbyTransactionIdIsPrepared(TransactionId xid) +{ + char *buf; + TwoPhaseFileHeader *hdr; + bool result; + + Assert(TransactionIdIsValid(xid)); + + /* Read and validate file */ + buf = ReadTwoPhaseFile(xid, false); + if (buf == NULL) + return false; + + /* Check header also */ + hdr = (TwoPhaseFileHeader *) buf; + result = TransactionIdEquals(hdr->xid, xid); + pfree(buf); + + return result; +} /* * FinishPreparedTransaction: execute COMMIT PREPARED or ROLLBACK PREPARED @@ -1184,6 +1230,7 @@ FinishPreparedTransaction(const char 
*gid, bool isCommit) RelFileNode *abortrels; RelFileNode *delrels; int ndelrels; + SharedInvalidationMessage *invalmsgs; int i; /* @@ -1196,7 +1243,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit) /* * Read and validate the state file */ - buf = ReadTwoPhaseFile(xid); + buf = ReadTwoPhaseFile(xid, true); if (buf == NULL) ereport(ERROR, (errcode(ERRCODE_DATA_CORRUPTED), @@ -1215,6 +1262,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit) bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode)); abortrels = (RelFileNode *) bufptr; bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode)); + invalmsgs = (SharedInvalidationMessage *) bufptr; + bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage)); /* compute latestXid among all children */ latestXid = TransactionIdLatest(xid, hdr->nsubxacts, children); @@ -1230,7 +1279,9 @@ FinishPreparedTransaction(const char *gid, bool isCommit) if (isCommit) RecordTransactionCommitPrepared(xid, hdr->nsubxacts, children, - hdr->ncommitrels, commitrels); + hdr->ncommitrels, commitrels, + hdr->ninvalmsgs, invalmsgs, + hdr->initfileinval); else RecordTransactionAbortPrepared(xid, hdr->nsubxacts, children, @@ -1277,6 +1328,18 @@ FinishPreparedTransaction(const char *gid, bool isCommit) smgrclose(srel); } + /* + * Handle cache invalidation messages. + * + * Relcache init file invalidation requires processing both + * before and after we send the SI messages. See AtEOXact_Inval() + */ + if (hdr->initfileinval) + RelationCacheInitFileInvalidate(true); + SendSharedInvalidMessages(invalmsgs, hdr->ninvalmsgs); + if (hdr->initfileinval) + RelationCacheInitFileInvalidate(false); + /* And now do the callbacks */ if (isCommit) ProcessRecords(bufptr, xid, twophase_postcommit_callbacks); @@ -1528,14 +1591,21 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon) * Our other responsibility is to determine and return the oldest valid XID * among the prepared xacts (if none, return ShmemVariableCache->nextXid). 
* This is needed to synchronize pg_subtrans startup properly. + * + * If xids_p and nxids_p are not NULL, pointer to a palloc'd array of all + * top-level xids is stored in *xids_p. The number of entries in the array + * is returned in *nxids_p. */ TransactionId -PrescanPreparedTransactions(void) +PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p) { TransactionId origNextXid = ShmemVariableCache->nextXid; TransactionId result = origNextXid; DIR *cldir; struct dirent *clde; + TransactionId *xids = NULL; + int nxids = 0; + int allocsize = 0; cldir = AllocateDir(TWOPHASE_DIR); while ((clde = ReadDir(cldir, TWOPHASE_DIR)) != NULL) @@ -1567,7 +1637,7 @@ PrescanPreparedTransactions(void) */ /* Read and validate file */ - buf = ReadTwoPhaseFile(xid); + buf = ReadTwoPhaseFile(xid, true); if (buf == NULL) { ereport(WARNING, @@ -1615,11 +1685,36 @@ PrescanPreparedTransactions(void) } } + + if (xids_p) + { + if (nxids == allocsize) + { + if (nxids == 0) + { + allocsize = 10; + xids = palloc(allocsize * sizeof(TransactionId)); + } + else + { + allocsize = allocsize * 2; + xids = repalloc(xids, allocsize * sizeof(TransactionId)); + } + } + xids[nxids++] = xid; + } + pfree(buf); } } FreeDir(cldir); + if (xids_p) + { + *xids_p = xids; + *nxids_p = nxids; + } + return result; } @@ -1636,6 +1731,7 @@ RecoverPreparedTransactions(void) char dir[MAXPGPATH]; DIR *cldir; struct dirent *clde; + bool overwriteOK = false; snprintf(dir, MAXPGPATH, "%s", TWOPHASE_DIR); @@ -1666,7 +1762,7 @@ RecoverPreparedTransactions(void) } /* Read and validate file */ - buf = ReadTwoPhaseFile(xid); + buf = ReadTwoPhaseFile(xid, true); if (buf == NULL) { ereport(WARNING, @@ -1687,6 +1783,15 @@ RecoverPreparedTransactions(void) bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId)); bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode)); bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode)); + bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage)); + + /* + * It's 
possible that SubTransSetParent has been set before, if the + * prepared transaction generated xid assignment records. Test + * here must match one used in AssignTransactionId(). + */ + if (InHotStandby && hdr->nsubxacts >= PGPROC_MAX_CACHED_SUBXIDS) + overwriteOK = true; /* * Reconstruct subtrans state for the transaction --- needed @@ -1696,7 +1801,7 @@ RecoverPreparedTransactions(void) * hierarchy, but there's no need to restore that exactly. */ for (i = 0; i < hdr->nsubxacts; i++) - SubTransSetParent(subxids[i], xid); + SubTransSetParent(subxids[i], xid, overwriteOK); /* * Recreate its GXACT and dummy PGPROC @@ -1719,6 +1824,14 @@ RecoverPreparedTransactions(void) */ ProcessRecords(bufptr, xid, twophase_recover_callbacks); + /* + * Release locks held by the standby process after we process each + * prepared transaction. As a result, we don't need too many + * additional locks at any one time. + */ + if (InHotStandby) + StandbyReleaseLockTree(xid, hdr->nsubxacts, subxids); + pfree(buf); } } @@ -1739,9 +1852,12 @@ RecordTransactionCommitPrepared(TransactionId xid, int nchildren, TransactionId *children, int nrels, - RelFileNode *rels) + RelFileNode *rels, + int ninvalmsgs, + SharedInvalidationMessage *invalmsgs, + bool initfileinval) { - XLogRecData rdata[3]; + XLogRecData rdata[4]; int lastrdata = 0; xl_xact_commit_prepared xlrec; XLogRecPtr recptr; @@ -1754,8 +1870,12 @@ RecordTransactionCommitPrepared(TransactionId xid, /* Emit the XLOG commit record */ xlrec.xid = xid; xlrec.crec.xact_time = GetCurrentTimestamp(); + xlrec.crec.xinfo = initfileinval ? 
XACT_COMPLETION_UPDATE_RELCACHE_FILE : 0; + xlrec.crec.nmsgs = 0; xlrec.crec.nrels = nrels; xlrec.crec.nsubxacts = nchildren; + xlrec.crec.nmsgs = ninvalmsgs; + rdata[0].data = (char *) (&xlrec); rdata[0].len = MinSizeOfXactCommitPrepared; rdata[0].buffer = InvalidBuffer; @@ -1777,6 +1897,15 @@ RecordTransactionCommitPrepared(TransactionId xid, rdata[2].buffer = InvalidBuffer; lastrdata = 2; } + /* dump cache invalidation messages */ + if (ninvalmsgs > 0) + { + rdata[lastrdata].next = &(rdata[3]); + rdata[3].data = (char *) invalmsgs; + rdata[3].len = ninvalmsgs * sizeof(SharedInvalidationMessage); + rdata[3].buffer = InvalidBuffer; + lastrdata = 3; + } rdata[lastrdata].next = NULL; recptr = XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT_PREPARED, rdata); diff --git a/src/backend/access/transam/twophase_rmgr.c b/src/backend/access/transam/twophase_rmgr.c index d1f7ac7aba..1bd83e043b 100644 --- a/src/backend/access/transam/twophase_rmgr.c +++ b/src/backend/access/transam/twophase_rmgr.c @@ -8,7 +8,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/access/transam/twophase_rmgr.c,v 1.10 2009/11/23 09:58:36 heikki Exp $ + * $PostgreSQL: pgsql/src/backend/access/transam/twophase_rmgr.c,v 1.11 2009/12/19 01:32:33 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -19,14 +19,12 @@ #include "commands/async.h" #include "pgstat.h" #include "storage/lock.h" -#include "utils/inval.h" const TwoPhaseCallback twophase_recover_callbacks[TWOPHASE_RM_MAX_ID + 1] = { NULL, /* END ID */ lock_twophase_recover, /* Lock */ - NULL, /* Inval */ NULL, /* notify/listen */ NULL, /* pgstat */ multixact_twophase_recover /* MultiXact */ @@ -36,7 +34,6 @@ const TwoPhaseCallback twophase_postcommit_callbacks[TWOPHASE_RM_MAX_ID + 1] = { NULL, /* END ID */ lock_twophase_postcommit, /* Lock */ - inval_twophase_postcommit, /* Inval */ notify_twophase_postcommit, /* notify/listen */ pgstat_twophase_postcommit, /* pgstat */ multixact_twophase_postcommit /* 
MultiXact */ @@ -46,8 +43,16 @@ const TwoPhaseCallback twophase_postabort_callbacks[TWOPHASE_RM_MAX_ID + 1] = { NULL, /* END ID */ lock_twophase_postabort, /* Lock */ - NULL, /* Inval */ NULL, /* notify/listen */ pgstat_twophase_postabort, /* pgstat */ multixact_twophase_postabort /* MultiXact */ }; + +const TwoPhaseCallback twophase_standby_recover_callbacks[TWOPHASE_RM_MAX_ID + 1] = +{ + NULL, /* END ID */ + lock_twophase_standby_recover, /* Lock */ + NULL, /* notify/listen */ + NULL, /* pgstat */ + NULL /* MultiXact */ +}; diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index f9a71760d3..a165692277 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -10,7 +10,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/access/transam/xact.c,v 1.277 2009/12/09 21:57:50 tgl Exp $ + * $PostgreSQL: pgsql/src/backend/access/transam/xact.c,v 1.278 2009/12/19 01:32:33 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -42,6 +42,7 @@ #include "storage/procarray.h" #include "storage/sinvaladt.h" #include "storage/smgr.h" +#include "storage/standby.h" #include "utils/combocid.h" #include "utils/guc.h" #include "utils/inval.h" @@ -139,6 +140,7 @@ typedef struct TransactionStateData Oid prevUser; /* previous CurrentUserId setting */ int prevSecContext; /* previous SecurityRestrictionContext */ bool prevXactReadOnly; /* entry-time xact r/o state */ + bool startedInRecovery; /* did we start in recovery? 
*/ struct TransactionStateData *parent; /* back link to parent */ } TransactionStateData; @@ -167,9 +169,17 @@ static TransactionStateData TopTransactionStateData = { InvalidOid, /* previous CurrentUserId setting */ 0, /* previous SecurityRestrictionContext */ false, /* entry-time xact r/o state */ + false, /* startedInRecovery */ NULL /* link to parent state block */ }; +/* + * unreportedXids holds XIDs of all subtransactions that have not yet been + * reported in a XLOG_XACT_ASSIGNMENT record. + */ +static int nUnreportedXids; +static TransactionId unreportedXids[PGPROC_MAX_CACHED_SUBXIDS]; + static TransactionState CurrentTransactionState = &TopTransactionStateData; /* @@ -392,6 +402,9 @@ AssignTransactionId(TransactionState s) bool isSubXact = (s->parent != NULL); ResourceOwner currentOwner; + if (RecoveryInProgress()) + elog(ERROR, "cannot assign TransactionIds during recovery"); + /* Assert that caller didn't screw up */ Assert(!TransactionIdIsValid(s->transactionId)); Assert(s->state == TRANS_INPROGRESS); @@ -414,7 +427,7 @@ AssignTransactionId(TransactionState s) s->transactionId = GetNewTransactionId(isSubXact); if (isSubXact) - SubTransSetParent(s->transactionId, s->parent->transactionId); + SubTransSetParent(s->transactionId, s->parent->transactionId, false); /* * Acquire lock on the transaction XID. (We assume this cannot block.) We @@ -435,8 +448,57 @@ AssignTransactionId(TransactionState s) } PG_END_TRY(); CurrentResourceOwner = currentOwner; -} + /* + * Every PGPROC_MAX_CACHED_SUBXIDS assigned transaction ids within each + * top-level transaction we issue a WAL record for the assignment. We + * include the top-level xid and all the subxids that have not yet been + * reported using XLOG_XACT_ASSIGNMENT records. + * + * This is required to limit the amount of shared memory required in a + * hot standby server to keep track of in-progress XIDs. See notes for + * RecordKnownAssignedTransactionIds(). 
+ * + * We don't keep track of the immediate parent of each subxid, + * only the top-level transaction that each subxact belongs to. This + * is correct in recovery only because aborted subtransactions are + * separately WAL logged. + */ + if (isSubXact && XLogStandbyInfoActive()) + { + unreportedXids[nUnreportedXids] = s->transactionId; + nUnreportedXids++; + + /* ensure this test matches similar one in RecoverPreparedTransactions() */ + if (nUnreportedXids >= PGPROC_MAX_CACHED_SUBXIDS) + { + XLogRecData rdata[2]; + xl_xact_assignment xlrec; + + /* + * xtop is always set by now because we recurse up transaction + * stack to the highest unassigned xid and then come back down + */ + xlrec.xtop = GetTopTransactionId(); + Assert(TransactionIdIsValid(xlrec.xtop)); + xlrec.nsubxacts = nUnreportedXids; + + rdata[0].data = (char *) &xlrec; + rdata[0].len = MinSizeOfXactAssignment; + rdata[0].buffer = InvalidBuffer; + rdata[0].next = &rdata[1]; + + rdata[1].data = (char *) unreportedXids; + rdata[1].len = PGPROC_MAX_CACHED_SUBXIDS * sizeof(TransactionId); + rdata[1].buffer = InvalidBuffer; + rdata[1].next = NULL; + + (void) XLogInsert(RM_XACT_ID, XLOG_XACT_ASSIGNMENT, rdata); + + nUnreportedXids = 0; + } + } +} /* * GetCurrentSubTransactionId @@ -596,6 +658,18 @@ TransactionIdIsCurrentTransactionId(TransactionId xid) return false; } +/* + * TransactionStartedDuringRecovery + * + * Returns true if the current transaction started while recovery was still + * in progress. Recovery might have ended since so RecoveryInProgress() might + * return false already. + */ +bool +TransactionStartedDuringRecovery(void) +{ + return CurrentTransactionState->startedInRecovery; +} /* * CommandCounterIncrement @@ -811,7 +885,7 @@ AtSubStart_ResourceOwner(void) * This is exported only to support an ugly hack in VACUUM FULL. 
*/ TransactionId -RecordTransactionCommit(void) +RecordTransactionCommit(bool isVacuumFull) { TransactionId xid = GetTopTransactionIdIfAny(); bool markXidCommitted = TransactionIdIsValid(xid); @@ -821,11 +895,15 @@ RecordTransactionCommit(void) bool haveNonTemp; int nchildren; TransactionId *children; + int nmsgs; + SharedInvalidationMessage *invalMessages = NULL; + bool RelcacheInitFileInval; /* Get data needed for commit record */ nrels = smgrGetPendingDeletes(true, &rels, &haveNonTemp); nchildren = xactGetCommittedChildren(&children); - + nmsgs = xactGetCommittedInvalidationMessages(&invalMessages, + &RelcacheInitFileInval); /* * If we haven't been assigned an XID yet, we neither can, nor do we want * to write a COMMIT record. @@ -859,13 +937,24 @@ RecordTransactionCommit(void) /* * Begin commit critical section and insert the commit XLOG record. */ - XLogRecData rdata[3]; + XLogRecData rdata[4]; int lastrdata = 0; xl_xact_commit xlrec; /* Tell bufmgr and smgr to prepare for commit */ BufmgrCommit(); + /* + * Set flags required for recovery processing of commits. + */ + xlrec.xinfo = 0; + if (RelcacheInitFileInval) + xlrec.xinfo |= XACT_COMPLETION_UPDATE_RELCACHE_FILE; + if (isVacuumFull) + xlrec.xinfo |= XACT_COMPLETION_VACUUM_FULL; + if (forceSyncCommit) + xlrec.xinfo |= XACT_COMPLETION_FORCE_SYNC_COMMIT; + /* * Mark ourselves as within our "commit critical section". 
This * forces any concurrent checkpoint to wait until we've updated @@ -890,6 +979,7 @@ RecordTransactionCommit(void) xlrec.xact_time = xactStopTimestamp; xlrec.nrels = nrels; xlrec.nsubxacts = nchildren; + xlrec.nmsgs = nmsgs; rdata[0].data = (char *) (&xlrec); rdata[0].len = MinSizeOfXactCommit; rdata[0].buffer = InvalidBuffer; @@ -911,6 +1001,15 @@ RecordTransactionCommit(void) rdata[2].buffer = InvalidBuffer; lastrdata = 2; } + /* dump shared cache invalidation messages */ + if (nmsgs > 0) + { + rdata[lastrdata].next = &(rdata[3]); + rdata[3].data = (char *) invalMessages; + rdata[3].len = nmsgs * sizeof(SharedInvalidationMessage); + rdata[3].buffer = InvalidBuffer; + lastrdata = 3; + } rdata[lastrdata].next = NULL; (void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT, rdata); @@ -1352,6 +1451,13 @@ AtSubAbort_childXids(void) s->childXids = NULL; s->nChildXids = 0; s->maxChildXids = 0; + + /* + * We could prune the unreportedXids array here. But we don't bother. + * That would potentially reduce number of XLOG_XACT_ASSIGNMENT records + * but it would likely introduce more CPU time into the more common + * paths, so we choose not to do that. + */ } /* ---------------------------------------------------------------- @@ -1461,9 +1567,23 @@ StartTransaction(void) /* * Make sure we've reset xact state variables + * + * If recovery is still in progress, mark this transaction as read-only. + * We have lower level defences in XLogInsert and elsewhere to stop us + * from modifying data during recovery, but this gives the normal + * indication to the user that the transaction is read-only. 
*/ + if (RecoveryInProgress()) + { + s->startedInRecovery = true; + XactReadOnly = true; + } + else + { + s->startedInRecovery = false; + XactReadOnly = DefaultXactReadOnly; + } XactIsoLevel = DefaultXactIsoLevel; - XactReadOnly = DefaultXactReadOnly; forceSyncCommit = false; MyXactAccessedTempRel = false; @@ -1475,6 +1595,11 @@ StartTransaction(void) currentCommandId = FirstCommandId; currentCommandIdUsed = false; + /* + * initialize reported xid accounting + */ + nUnreportedXids = 0; + /* * must initialize resource-management stuff first */ @@ -1619,7 +1744,7 @@ CommitTransaction(void) /* * Here is where we really truly commit. */ - latestXid = RecordTransactionCommit(); + latestXid = RecordTransactionCommit(false); TRACE_POSTGRESQL_TRANSACTION_COMMIT(MyProc->lxid); @@ -1853,7 +1978,6 @@ PrepareTransaction(void) StartPrepare(gxact); AtPrepare_Notify(); - AtPrepare_Inval(); AtPrepare_Locks(); AtPrepare_PgStat(); AtPrepare_MultiXact(); @@ -4199,29 +4323,108 @@ xactGetCommittedChildren(TransactionId **ptr) * XLOG support routines */ +/* + * Before 8.5 this was a fairly short function, but now it performs many + * actions for which the order of execution is critical. 
+ */ static void -xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid) +xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid, XLogRecPtr lsn) { TransactionId *sub_xids; + SharedInvalidationMessage *inval_msgs; TransactionId max_xid; int i; - /* Mark the transaction committed in pg_clog */ + /* subxid array follows relfilenodes */ sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]); - TransactionIdCommitTree(xid, xlrec->nsubxacts, sub_xids); + /* invalidation messages array follows subxids */ + inval_msgs = (SharedInvalidationMessage *) &(sub_xids[xlrec->nsubxacts]); - /* Make sure nextXid is beyond any XID mentioned in the record */ - max_xid = xid; - for (i = 0; i < xlrec->nsubxacts; i++) - { - if (TransactionIdPrecedes(max_xid, sub_xids[i])) - max_xid = sub_xids[i]; - } + max_xid = TransactionIdLatest(xid, xlrec->nsubxacts, sub_xids); + + /* + * Make sure nextXid is beyond any XID mentioned in the record. + * + * We don't expect anyone else to modify nextXid, hence we + * don't need to hold a lock while checking this. We still acquire + * the lock to modify it, though. + */ if (TransactionIdFollowsOrEquals(max_xid, ShmemVariableCache->nextXid)) { + LWLockAcquire(XidGenLock, LW_EXCLUSIVE); ShmemVariableCache->nextXid = max_xid; TransactionIdAdvance(ShmemVariableCache->nextXid); + LWLockRelease(XidGenLock); + } + + if (!InHotStandby || XactCompletionVacuumFull(xlrec)) + { + /* + * Mark the transaction committed in pg_clog. + * + * If InHotStandby and this is the first commit of a VACUUM FULL INPLACE + * we perform only the actual commit to clog. Strangely, there are two + * commits that share the same xid for every VFI, so we need to skip + * some steps for the first commit. It's OK to repeat the clog update + * when we see the second commit on a VFI. 
+ */ + TransactionIdCommitTree(xid, xlrec->nsubxacts, sub_xids); + } + else + { + /* + * If a transaction completion record arrives that has as-yet unobserved + * subtransactions then this will not have been fully handled by the call + * to RecordKnownAssignedTransactionIds() in the main recovery loop in + * xlog.c. So we need to do bookkeeping again to cover that case. This is + * confusing and it is easy to think this call is irrelevant, which has + * happened three times in development already. Leave it in. + */ + RecordKnownAssignedTransactionIds(max_xid); + + /* + * Mark the transaction committed in pg_clog. We use the async commit + * protocol during recovery to provide information on database + * consistency for when users try to set hint bits. It is important + * that we do not set hint bits until the minRecoveryPoint is past + * this commit record. This ensures that if we crash we don't see + * hint bits set on changes made by transactions that haven't yet + * recovered. It's unlikely but it's good to be safe. + */ + TransactionIdAsyncCommitTree(xid, xlrec->nsubxacts, sub_xids, lsn); + + /* + * We must mark clog before we update the ProcArray. + */ + ExpireTreeKnownAssignedTransactionIds(xid, xlrec->nsubxacts, sub_xids); + + /* + * Send any cache invalidations attached to the commit. We must + * maintain the same order of invalidation then lock release + * as occurs in CommitTransaction(). + */ + if (xlrec->nmsgs > 0) + { + /* + * Relcache init file invalidation requires processing both + * before and after we send the SI messages. See AtEOXact_Inval(). + */ + if (XactCompletionRelcacheInitFileInval(xlrec)) + RelationCacheInitFileInvalidate(true); + + SendSharedInvalidMessages(inval_msgs, xlrec->nmsgs); + + if (XactCompletionRelcacheInitFileInval(xlrec)) + RelationCacheInitFileInvalidate(false); + } + + /* + * Release locks, if any. We do this for both two-phase and normal + * one-phase transactions.
In effect we are ignoring the prepare + * phase and just going straight to lock release. + */ + StandbyReleaseLockTree(xid, xlrec->nsubxacts, sub_xids); } /* Make sure files supposed to be dropped are dropped */ @@ -4240,8 +4443,31 @@ xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid) } smgrclose(srel); } + + /* + * We issue an XLogFlush() for the same reason we emit ForceSyncCommit() in + * normal operation. For example, in DROP DATABASE, we delete all the files + * belonging to the database, and then commit the transaction. If we crash + * after all the files have been deleted but before the commit, we are left + * with an entry in pg_database without any files. To minimize the window for + * that, we use ForceSyncCommit() to rush the commit record to disk as + * quickly as possible. We have the same window during recovery, and forcing an + * XLogFlush() (which updates minRecoveryPoint during recovery) helps + * to reduce that window for any transaction that requested ForceSyncCommit(). + */ + if (XactCompletionForceSyncCommit(xlrec)) + XLogFlush(lsn); } +/* + * Be careful with the order of execution, as with xact_redo_commit(). + * The two functions are similar but differ in key places. + * + * Note also that an abort can be for a subtransaction and its children, + * not just for a top-level abort. That means we have to consider + * topxid != xid, whereas in commit we would find topxid == xid always + * because subtransaction commit is never WAL logged.
+ */ static void xact_redo_abort(xl_xact_abort *xlrec, TransactionId xid) { @@ -4249,22 +4475,55 @@ xact_redo_abort(xl_xact_abort *xlrec, TransactionId xid) TransactionId max_xid; int i; - /* Mark the transaction aborted in pg_clog */ sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]); - TransactionIdAbortTree(xid, xlrec->nsubxacts, sub_xids); + max_xid = TransactionIdLatest(xid, xlrec->nsubxacts, sub_xids); /* Make sure nextXid is beyond any XID mentioned in the record */ - max_xid = xid; - for (i = 0; i < xlrec->nsubxacts; i++) - { - if (TransactionIdPrecedes(max_xid, sub_xids[i])) - max_xid = sub_xids[i]; - } + /* We don't expect anyone else to modify nextXid, hence we + * don't need to hold a lock while checking this. We still acquire + * the lock to modify it, though. + */ if (TransactionIdFollowsOrEquals(max_xid, ShmemVariableCache->nextXid)) { + LWLockAcquire(XidGenLock, LW_EXCLUSIVE); ShmemVariableCache->nextXid = max_xid; TransactionIdAdvance(ShmemVariableCache->nextXid); + LWLockRelease(XidGenLock); + } + + if (InHotStandby) + { + /* + * If a transaction completion record arrives that has as-yet unobserved + * subtransactions then this will not have been fully handled by the call + * to RecordKnownAssignedTransactionIds() in the main recovery loop in + * xlog.c. So we need to do bookkeeping again to cover that case. This is + * confusing and it is easy to think this call is irrelevant, which has + * happened three times in development already. Leave it in. + */ + RecordKnownAssignedTransactionIds(max_xid); + } + + /* Mark the transaction aborted in pg_clog, no need for async stuff */ + TransactionIdAbortTree(xid, xlrec->nsubxacts, sub_xids); + + if (InHotStandby) + { + /* + * We must mark clog before we update the ProcArray. + */ + ExpireTreeKnownAssignedTransactionIds(xid, xlrec->nsubxacts, sub_xids); + + /* + * There are no flat files that need updating, nor invalidation + * messages to send or undo. + */ + + /* + * Release locks, if any. 
There are no invalidations to send. + */ + StandbyReleaseLockTree(xid, xlrec->nsubxacts, sub_xids); } /* Make sure files supposed to be dropped are dropped */ @@ -4297,7 +4556,7 @@ xact_redo(XLogRecPtr lsn, XLogRecord *record) { xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(record); - xact_redo_commit(xlrec, record->xl_xid); + xact_redo_commit(xlrec, record->xl_xid, lsn); } else if (info == XLOG_XACT_ABORT) { @@ -4315,7 +4574,7 @@ xact_redo(XLogRecPtr lsn, XLogRecord *record) { xl_xact_commit_prepared *xlrec = (xl_xact_commit_prepared *) XLogRecGetData(record); - xact_redo_commit(&xlrec->crec, xlrec->xid); + xact_redo_commit(&xlrec->crec, xlrec->xid, lsn); RemoveTwoPhaseFile(xlrec->xid, false); } else if (info == XLOG_XACT_ABORT_PREPARED) @@ -4325,6 +4584,14 @@ xact_redo(XLogRecPtr lsn, XLogRecord *record) xact_redo_abort(&xlrec->arec, xlrec->xid); RemoveTwoPhaseFile(xlrec->xid, false); } + else if (info == XLOG_XACT_ASSIGNMENT) + { + xl_xact_assignment *xlrec = (xl_xact_assignment *) XLogRecGetData(record); + + if (InHotStandby) + ProcArrayApplyXidAssignment(xlrec->xtop, + xlrec->nsubxacts, xlrec->xsub); + } else elog(PANIC, "xact_redo: unknown op code %u", info); } @@ -4333,6 +4600,14 @@ static void xact_desc_commit(StringInfo buf, xl_xact_commit *xlrec) { int i; + TransactionId *xacts; + SharedInvalidationMessage *msgs; + + xacts = (TransactionId *) &xlrec->xnodes[xlrec->nrels]; + msgs = (SharedInvalidationMessage *) &xacts[xlrec->nsubxacts]; + + if (XactCompletionRelcacheInitFileInval(xlrec)) + appendStringInfo(buf, "; relcache init file inval"); appendStringInfoString(buf, timestamptz_to_str(xlrec->xact_time)); if (xlrec->nrels > 0) @@ -4348,13 +4623,25 @@ xact_desc_commit(StringInfo buf, xl_xact_commit *xlrec) } if (xlrec->nsubxacts > 0) { - TransactionId *xacts = (TransactionId *) - &xlrec->xnodes[xlrec->nrels]; - appendStringInfo(buf, "; subxacts:"); for (i = 0; i < xlrec->nsubxacts; i++) appendStringInfo(buf, " %u", xacts[i]); } + if 
(xlrec->nmsgs > 0) + { + appendStringInfo(buf, "; inval msgs:"); + for (i = 0; i < xlrec->nmsgs; i++) + { + SharedInvalidationMessage *msg = &msgs[i]; + + if (msg->id >= 0) + appendStringInfo(buf, "catcache id%d ", msg->id); + else if (msg->id == SHAREDINVALRELCACHE_ID) + appendStringInfo(buf, "relcache "); + else if (msg->id == SHAREDINVALSMGR_ID) + appendStringInfo(buf, "smgr "); + } + } } static void @@ -4385,6 +4672,17 @@ xact_desc_abort(StringInfo buf, xl_xact_abort *xlrec) } } +static void +xact_desc_assignment(StringInfo buf, xl_xact_assignment *xlrec) +{ + int i; + + appendStringInfo(buf, "subxacts:"); + + for (i = 0; i < xlrec->nsubxacts; i++) + appendStringInfo(buf, " %u", xlrec->xsub[i]); +} + void xact_desc(StringInfo buf, uint8 xl_info, char *rec) { @@ -4412,16 +4710,28 @@ xact_desc(StringInfo buf, uint8 xl_info, char *rec) { xl_xact_commit_prepared *xlrec = (xl_xact_commit_prepared *) rec; - appendStringInfo(buf, "commit %u: ", xlrec->xid); + appendStringInfo(buf, "commit prepared %u: ", xlrec->xid); xact_desc_commit(buf, &xlrec->crec); } else if (info == XLOG_XACT_ABORT_PREPARED) { xl_xact_abort_prepared *xlrec = (xl_xact_abort_prepared *) rec; - appendStringInfo(buf, "abort %u: ", xlrec->xid); + appendStringInfo(buf, "abort prepared %u: ", xlrec->xid); xact_desc_abort(buf, &xlrec->arec); } + else if (info == XLOG_XACT_ASSIGNMENT) + { + xl_xact_assignment *xlrec = (xl_xact_assignment *) rec; + + /* + * Note that we ignore the WAL record's xid, since we're more + * interested in the top-level xid that issued the record + * and which xids are being reported here. 
+ */ + appendStringInfo(buf, "xid assignment xtop %u: ", xlrec->xtop); + xact_desc_assignment(buf, xlrec); + } else appendStringInfo(buf, "UNKNOWN"); } diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 331809a3b9..b861a76ee4 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -7,7 +7,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/backend/access/transam/xlog.c,v 1.353 2009/09/13 18:32:07 heikki Exp $ + * $PostgreSQL: pgsql/src/backend/access/transam/xlog.c,v 1.354 2009/12/19 01:32:33 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -67,6 +67,8 @@ int XLOGbuffers = 8; int XLogArchiveTimeout = 0; bool XLogArchiveMode = false; char *XLogArchiveCommand = NULL; +bool XLogRequestRecoveryConnections = true; +int MaxStandbyDelay = 30; bool fullPageWrites = true; bool log_checkpoints = false; int sync_method = DEFAULT_SYNC_METHOD; @@ -129,10 +131,16 @@ TimeLineID ThisTimeLineID = 0; * recovery mode". It should be examined primarily by functions that need * to act differently when called from a WAL redo function (e.g., to skip WAL * logging). To check whether the system is in recovery regardless of which - * process you're running in, use RecoveryInProgress(). + * process you're running in, use RecoveryInProgress() but only after shared + * memory startup and lock initialization. */ bool InRecovery = false; +/* Are we in Hot Standby mode? Only valid in startup process, see xlog.h */ +HotStandbyState standbyState = STANDBY_DISABLED; + +static XLogRecPtr LastRec; + /* * Local copy of SharedRecoveryInProgress variable. True actually means "not * known, need to check the shared state". 
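The new desc routine renders an XLOG_XACT_ASSIGNMENT record for WAL debugging output. A sketch of the same formatting using plain snprintf() in place of StringInfo (output shape taken from the xact_desc()/xact_desc_assignment() code above):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Render an assignment record as
 *   "xid assignment xtop N: subxacts: a b c"
 * mirroring the two desc functions above.
 */
static void desc_assignment(char *buf, size_t buflen, uint32_t xtop,
                            int nsubxacts, const uint32_t *xsub)
{
    int off = snprintf(buf, buflen, "xid assignment xtop %u: subxacts:", xtop);

    for (int i = 0; i < nsubxacts && off < (int) buflen; i++)
        off += snprintf(buf + off, buflen - off, " %u", xsub[i]);
}
```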
@@ -359,6 +367,8 @@ typedef struct XLogCtlData /* end+1 of the last record replayed (or being replayed) */ XLogRecPtr replayEndRecPtr; + /* timestamp of last record replayed (or being replayed) */ + TimestampTz recoveryLastXTime; slock_t info_lck; /* locks shared variables shown above */ } XLogCtlData; @@ -463,6 +473,7 @@ static void readRecoveryCommandFile(void); static void exitArchiveRecovery(TimeLineID endTLI, uint32 endLogId, uint32 endLogSeg); static bool recoveryStopsHere(XLogRecord *record, bool *includeThis); +static void CheckRequiredParameterValues(CheckPoint checkPoint); static void LocalSetXLogInsertAllowed(void); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); @@ -2103,9 +2114,40 @@ XLogAsyncCommitFlush(void) bool XLogNeedsFlush(XLogRecPtr record) { - /* XLOG doesn't need flushing during recovery */ + /* + * During recovery, we don't flush WAL but update minRecoveryPoint + * instead. So "needs flush" is taken to mean whether minRecoveryPoint + * would need to be updated. + */ if (RecoveryInProgress()) - return false; + { + /* Quick exit if already known updated */ + if (XLByteLE(record, minRecoveryPoint) || !updateMinRecoveryPoint) + return false; + + /* + * Update local copy of minRecoveryPoint. But if the lock is busy, + * just return a conservative guess. + */ + if (!LWLockConditionalAcquire(ControlFileLock, LW_SHARED)) + return true; + minRecoveryPoint = ControlFile->minRecoveryPoint; + LWLockRelease(ControlFileLock); + + /* + * An invalid minRecoveryPoint means that we need to recover all the WAL, + * i.e., we're doing crash recovery. We never modify the control file's + * value in that case, so we can short-circuit future checks here too. 
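The recovery branch of XLogNeedsFlush() above reinterprets "needs flush" as "would minRecoveryPoint have to advance?". A self-contained sketch of that decision, using an 8.x-style two-field WAL position (the lock-skipping and cached-copy details of the real function are omitted):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 8.x-style WAL position: (log file id, byte offset) pair (sketch). */
typedef struct { uint32_t xlogid; uint32_t xrecoff; } XLogRecPtr;

/* Lexicographic comparison, as XLByteLE() does. */
static bool xlbyte_le(XLogRecPtr a, XLogRecPtr b)
{
    return a.xlogid < b.xlogid ||
           (a.xlogid == b.xlogid && a.xrecoff <= b.xrecoff);
}

/*
 * During recovery a record "needs flush" only if replaying it would
 * have to advance minRecoveryPoint. An invalid (0/0) minRecoveryPoint
 * means crash recovery, where the value is never updated at all.
 */
static bool needs_min_recovery_update(XLogRecPtr record,
                                      XLogRecPtr min_recovery_point)
{
    if (min_recovery_point.xlogid == 0 && min_recovery_point.xrecoff == 0)
        return false;           /* crash recovery: never updated */
    return !xlbyte_le(record, min_recovery_point);
}
```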
+ */ + if (minRecoveryPoint.xlogid == 0 && minRecoveryPoint.xrecoff == 0) + updateMinRecoveryPoint = false; + + /* check again */ + if (XLByteLE(record, minRecoveryPoint) || !updateMinRecoveryPoint) + return false; + else + return true; + } /* Quick exit if already known flushed */ if (XLByteLE(record, LogwrtResult.Flush)) @@ -3259,10 +3301,11 @@ CleanupBackupHistory(void) * ignoring them as already applied, but that's not a huge drawback. * * If 'cleanup' is true, a cleanup lock is used when restoring blocks. - * Otherwise, a normal exclusive lock is used. At the moment, that's just - * pro forma, because there can't be any regular backends in the system - * during recovery. The 'cleanup' argument applies to all backup blocks - * in the WAL record, that suffices for now. + * Otherwise, a normal exclusive lock is used. During crash recovery, that's + * just pro forma because there can't be any regular backends in the system, + * but in hot standby mode the distinction is important. The 'cleanup' + * argument applies to all backup blocks in the WAL record, that suffices for + * now. 
*/ void RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup) @@ -4679,6 +4722,7 @@ BootStrapXLOG(void) checkPoint.oldestXid = FirstNormalTransactionId; checkPoint.oldestXidDB = TemplateDbOid; checkPoint.time = (pg_time_t) time(NULL); + checkPoint.oldestActiveXid = InvalidTransactionId; ShmemVariableCache->nextXid = checkPoint.nextXid; ShmemVariableCache->nextOid = checkPoint.nextOid; @@ -5117,22 +5161,43 @@ recoveryStopsHere(XLogRecord *record, bool *includeThis) TimestampTz recordXtime; /* We only consider stopping at COMMIT or ABORT records */ - if (record->xl_rmid != RM_XACT_ID) - return false; - record_info = record->xl_info & ~XLR_INFO_MASK; - if (record_info == XLOG_XACT_COMMIT) + if (record->xl_rmid == RM_XACT_ID) { - xl_xact_commit *recordXactCommitData; + record_info = record->xl_info & ~XLR_INFO_MASK; + if (record_info == XLOG_XACT_COMMIT) + { + xl_xact_commit *recordXactCommitData; - recordXactCommitData = (xl_xact_commit *) XLogRecGetData(record); - recordXtime = recordXactCommitData->xact_time; + recordXactCommitData = (xl_xact_commit *) XLogRecGetData(record); + recordXtime = recordXactCommitData->xact_time; + } + else if (record_info == XLOG_XACT_ABORT) + { + xl_xact_abort *recordXactAbortData; + + recordXactAbortData = (xl_xact_abort *) XLogRecGetData(record); + recordXtime = recordXactAbortData->xact_time; + } + else + return false; } - else if (record_info == XLOG_XACT_ABORT) + else if (record->xl_rmid == RM_XLOG_ID) { - xl_xact_abort *recordXactAbortData; + record_info = record->xl_info & ~XLR_INFO_MASK; + if (record_info == XLOG_CHECKPOINT_SHUTDOWN || + record_info == XLOG_CHECKPOINT_ONLINE) + { + CheckPoint checkPoint; - recordXactAbortData = (xl_xact_abort *) XLogRecGetData(record); - recordXtime = recordXactAbortData->xact_time; + memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint)); + recoveryLastXTime = checkPoint.time; + } + + /* + * We don't want to stop recovery on a checkpoint record, but we do + * want to 
update recoveryLastXTime. So return is unconditional. + */ + return false; } else return false; @@ -5216,6 +5281,67 @@ recoveryStopsHere(XLogRecord *record, bool *includeThis) return stopsHere; } +/* + * Returns bool with current recovery mode, a global state. + */ +Datum +pg_is_in_recovery(PG_FUNCTION_ARGS) +{ + PG_RETURN_BOOL(RecoveryInProgress()); +} + +/* + * Returns timestamp of last recovered commit/abort record. + */ +TimestampTz +GetLatestXLogTime(void) +{ + /* use volatile pointer to prevent code rearrangement */ + volatile XLogCtlData *xlogctl = XLogCtl; + + SpinLockAcquire(&xlogctl->info_lck); + recoveryLastXTime = xlogctl->recoveryLastXTime; + SpinLockRelease(&xlogctl->info_lck); + + return recoveryLastXTime; +} + +/* + * Note that text field supplied is a parameter name and does not require translation + */ +#define RecoveryRequiresIntParameter(param_name, currValue, checkpointValue) \ +{ \ + if (currValue < checkpointValue) \ + ereport(ERROR, \ + (errmsg("recovery connections cannot continue because " \ + "%s = %u is a lower setting than on WAL source server (value was %u)", \ + param_name, \ + currValue, \ + checkpointValue))); \ +} + +/* + * Check to see if required parameters are set high enough on this server + * for various aspects of recovery operation. + */ +static void +CheckRequiredParameterValues(CheckPoint checkPoint) +{ + /* We ignore autovacuum_max_workers when we make this test. 
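The RecoveryRequiresIntParameter() macro above enforces that each standby-side setting is at least as large as the value the primary recorded in its checkpoint; otherwise the standby cannot track the primary's transactions and locks. A sketch as a plain function (returning -1 in place of ereport(ERROR) is an assumption of this sketch):

```c
#include <assert.h>
#include <stdio.h>

/*
 * Fail (non-zero) if the local setting is lower than the value the
 * WAL source server had at checkpoint time, as the macro above does.
 */
static int check_recovery_parameter(const char *name,
                                    int curr_value, int checkpoint_value)
{
    if (curr_value < checkpoint_value)
    {
        fprintf(stderr,
                "recovery connections cannot continue because "
                "%s = %d is a lower setting than on the WAL source "
                "server (value was %d)\n",
                name, curr_value, checkpoint_value);
        return -1;
    }
    return 0;
}
```

CheckRequiredParameterValues() simply applies this check to max_connections, max_prepared_xacts, and max_locks_per_xact in turn.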
*/ + RecoveryRequiresIntParameter("max_connections", + MaxConnections, checkPoint.MaxConnections); + + RecoveryRequiresIntParameter("max_prepared_xacts", + max_prepared_xacts, checkPoint.max_prepared_xacts); + RecoveryRequiresIntParameter("max_locks_per_xact", + max_locks_per_xact, checkPoint.max_locks_per_xact); + + if (!checkPoint.XLogStandbyInfoMode) + ereport(ERROR, + (errmsg("recovery connections cannot start because the recovery_connections " + "parameter is disabled on the WAL source server"))); +} + /* * This must be called ONCE during postmaster or standalone-backend startup */ @@ -5228,7 +5354,6 @@ StartupXLOG(void) bool reachedStopPoint = false; bool haveBackupLabel = false; XLogRecPtr RecPtr, - LastRec, checkPointLoc, backupStopLoc, EndOfLog; @@ -5238,6 +5363,7 @@ StartupXLOG(void) uint32 freespace; TransactionId oldestActiveXID; bool bgwriterLaunched = false; + bool backendsAllowed = false; /* * Read control file and check XLOG status looks valid. @@ -5506,6 +5632,38 @@ StartupXLOG(void) BACKUP_LABEL_FILE, BACKUP_LABEL_OLD))); } + /* + * Initialize recovery connections, if enabled. We won't let backends + * in yet, not until we've reached the min recovery point specified + * in control file and we've established a recovery snapshot from a + * running-xacts WAL record. 
+ */ + if (InArchiveRecovery && XLogRequestRecoveryConnections) + { + TransactionId *xids; + int nxids; + + CheckRequiredParameterValues(checkPoint); + + ereport(LOG, + (errmsg("initializing recovery connections"))); + + InitRecoveryTransactionEnvironment(); + + if (wasShutdown) + oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids); + else + oldestActiveXID = checkPoint.oldestActiveXid; + Assert(TransactionIdIsValid(oldestActiveXID)); + + /* Startup commit log and related stuff */ + StartupCLOG(); + StartupSUBTRANS(oldestActiveXID); + StartupMultiXact(); + + ProcArrayInitRecoveryInfo(oldestActiveXID); + } + /* Initialize resource managers */ for (rmid = 0; rmid <= RM_MAX_ID; rmid++) { @@ -5580,7 +5738,9 @@ StartupXLOG(void) do { #ifdef WAL_DEBUG - if (XLOG_DEBUG) + if (XLOG_DEBUG || + (rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) || + (rmid != RM_XACT_ID && trace_recovery_messages <= DEBUG3)) { StringInfoData buf; @@ -5608,27 +5768,29 @@ StartupXLOG(void) } /* - * Check if we were requested to exit without finishing - * recovery. - */ - if (shutdown_requested) - proc_exit(1); - - /* - * Have we passed our safe starting point? If so, we can tell - * postmaster that the database is consistent now. + * Have we passed our safe starting point? */ if (!reachedMinRecoveryPoint && - XLByteLT(minRecoveryPoint, EndRecPtr)) + XLByteLE(minRecoveryPoint, EndRecPtr)) { reachedMinRecoveryPoint = true; - if (InArchiveRecovery) - { - ereport(LOG, - (errmsg("consistent recovery state reached"))); - if (IsUnderPostmaster) - SendPostmasterSignal(PMSIGNAL_RECOVERY_CONSISTENT); - } + ereport(LOG, + (errmsg("consistent recovery state reached at %X/%X", + EndRecPtr.xlogid, EndRecPtr.xrecoff))); + } + + /* + * Have we got a valid starting snapshot that will allow + * queries to be run? If so, we can tell postmaster that + * the database is consistent now, enabling connections. 
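Backends are only admitted once two independent conditions hold: replay has passed the minimum recovery point (the on-disk data is consistent) and a usable snapshot has been built from a running-xacts WAL record. A minimal sketch of that gate (the enum ordering is taken from the states named in the patch, but is an assumption of this sketch):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    STANDBY_DISABLED,
    STANDBY_INITIALIZED,
    STANDBY_SNAPSHOT_PENDING,
    STANDBY_SNAPSHOT_READY
} HotStandbyState;

/*
 * Mirror of the condition above that triggers
 * PMSIGNAL_RECOVERY_CONSISTENT: both a ready snapshot and a reached
 * minimum recovery point are required before connections are enabled.
 */
static bool can_allow_backends(HotStandbyState state,
                               bool reached_min_recovery_point)
{
    return state == STANDBY_SNAPSHOT_READY && reached_min_recovery_point;
}
```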
+ */ + if (standbyState == STANDBY_SNAPSHOT_READY && + !backendsAllowed && + reachedMinRecoveryPoint && + IsUnderPostmaster) + { + backendsAllowed = true; + SendPostmasterSignal(PMSIGNAL_RECOVERY_CONSISTENT); } /* @@ -5662,8 +5824,13 @@ StartupXLOG(void) */ SpinLockAcquire(&xlogctl->info_lck); xlogctl->replayEndRecPtr = EndRecPtr; + xlogctl->recoveryLastXTime = recoveryLastXTime; SpinLockRelease(&xlogctl->info_lck); + /* In Hot Standby mode, keep track of XIDs we've seen */ + if (InHotStandby && TransactionIdIsValid(record->xl_xid)) + RecordKnownAssignedTransactionIds(record->xl_xid); + RmgrTable[record->xl_rmid].rm_redo(EndRecPtr, record); /* Pop the error context stack */ @@ -5810,7 +5977,7 @@ StartupXLOG(void) } /* Pre-scan prepared transactions to find out the range of XIDs present */ - oldestActiveXID = PrescanPreparedTransactions(); + oldestActiveXID = PrescanPreparedTransactions(NULL, NULL); if (InRecovery) { @@ -5891,14 +6058,27 @@ StartupXLOG(void) ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid; TransactionIdRetreat(ShmemVariableCache->latestCompletedXid); - /* Start up the commit log and related stuff, too */ - StartupCLOG(); - StartupSUBTRANS(oldestActiveXID); - StartupMultiXact(); + /* + * Start up the commit log and related stuff, too. In hot standby mode + * we did this already before WAL replay. + */ + if (standbyState == STANDBY_DISABLED) + { + StartupCLOG(); + StartupSUBTRANS(oldestActiveXID); + StartupMultiXact(); + } /* Reload shared-memory state for prepared transactions */ RecoverPreparedTransactions(); + /* + * Shutdown the recovery environment. 
This must occur after + * RecoverPreparedTransactions(), see notes for lock_twophase_recover() + */ + if (standbyState != STANDBY_DISABLED) + ShutdownRecoveryTransactionEnvironment(); + /* Shut down readFile facility, free space */ if (readFile >= 0) { @@ -5964,8 +6144,9 @@ RecoveryInProgress(void) /* * Initialize TimeLineID and RedoRecPtr when we discover that recovery - * is finished. (If you change this, see also - * LocalSetXLogInsertAllowed.) + * is finished. InitPostgres() relies upon this behaviour to ensure + * that InitXLOGAccess() is called at backend startup. (If you change + * this, see also LocalSetXLogInsertAllowed.) */ if (!LocalRecoveryInProgress) InitXLOGAccess(); @@ -6151,7 +6332,7 @@ InitXLOGAccess(void) { /* ThisTimeLineID doesn't change so we need no lock to copy it */ ThisTimeLineID = XLogCtl->ThisTimeLineID; - Assert(ThisTimeLineID != 0); + Assert(ThisTimeLineID != 0 || IsBootstrapProcessingMode()); /* Use GetRedoRecPtr to copy the RedoRecPtr safely */ (void) GetRedoRecPtr(); @@ -6449,6 +6630,12 @@ CreateCheckPoint(int flags) MemSet(&checkPoint, 0, sizeof(checkPoint)); checkPoint.time = (pg_time_t) time(NULL); + /* Set important parameter values for use when replaying WAL */ + checkPoint.MaxConnections = MaxConnections; + checkPoint.max_prepared_xacts = max_prepared_xacts; + checkPoint.max_locks_per_xact = max_locks_per_xact; + checkPoint.XLogStandbyInfoMode = XLogStandbyInfoActive(); + /* * We must hold WALInsertLock while examining insert state to determine * the checkpoint REDO pointer. @@ -6624,6 +6811,21 @@ CreateCheckPoint(int flags) CheckPointGuts(checkPoint.redo, flags); + /* + * Take a snapshot of running transactions and write this to WAL. + * This allows us to reconstruct the state of running transactions + * during archive recovery, if required. Skip, if this info disabled. + * + * If we are shutting down, or Startup process is completing crash + * recovery we don't need to write running xact data. 
+ * + * Update checkPoint.nextXid since we have a later value + */ + if (!shutdown && XLogStandbyInfoActive()) + LogStandbySnapshot(&checkPoint.oldestActiveXid, &checkPoint.nextXid); + else + checkPoint.oldestActiveXid = InvalidTransactionId; + START_CRIT_SECTION(); /* @@ -6791,7 +6993,7 @@ RecoveryRestartPoint(const CheckPoint *checkPoint) if (RmgrTable[rmid].rm_safe_restartpoint != NULL) if (!(RmgrTable[rmid].rm_safe_restartpoint())) { - elog(DEBUG2, "RM %d not safe to record restart point at %X/%X", + elog(trace_recovery(DEBUG2), "RM %d not safe to record restart point at %X/%X", rmid, checkPoint->redo.xlogid, checkPoint->redo.xrecoff); @@ -6923,14 +7125,9 @@ CreateRestartPoint(int flags) LogCheckpointEnd(true); ereport((log_checkpoints ? LOG : DEBUG2), - (errmsg("recovery restart point at %X/%X", - lastCheckPoint.redo.xlogid, lastCheckPoint.redo.xrecoff))); - - /* XXX this is currently BROKEN because we are in the wrong process */ - if (recoveryLastXTime) - ereport((log_checkpoints ? LOG : DEBUG2), - (errmsg("last completed transaction was at log time %s", - timestamptz_to_str(recoveryLastXTime)))); + (errmsg("recovery restart point at %X/%X with latest known log time %s", + lastCheckPoint.redo.xlogid, lastCheckPoint.redo.xrecoff, + timestamptz_to_str(GetLatestXLogTime())))); LWLockRelease(CheckpointLock); return true; @@ -7036,6 +7233,19 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record) ShmemVariableCache->oldestXid = checkPoint.oldestXid; ShmemVariableCache->oldestXidDB = checkPoint.oldestXidDB; + /* Check to see if any changes to max_connections give problems */ + if (standbyState != STANDBY_DISABLED) + CheckRequiredParameterValues(checkPoint); + + if (standbyState >= STANDBY_INITIALIZED) + { + /* + * Remove stale transactions, if any. 
+ */ + ExpireOldKnownAssignedTransactionIds(checkPoint.nextXid); + StandbyReleaseOldLocks(checkPoint.nextXid); + } + /* ControlFile->checkPointCopy always tracks the latest ckpt XID */ ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch; ControlFile->checkPointCopy.nextXid = checkPoint.nextXid; @@ -7114,7 +7324,7 @@ xlog_desc(StringInfo buf, uint8 xl_info, char *rec) appendStringInfo(buf, "checkpoint: redo %X/%X; " "tli %u; xid %u/%u; oid %u; multi %u; offset %u; " - "oldest xid %u in DB %u; %s", + "oldest xid %u in DB %u; oldest running xid %u; %s", checkpoint->redo.xlogid, checkpoint->redo.xrecoff, checkpoint->ThisTimeLineID, checkpoint->nextXidEpoch, checkpoint->nextXid, @@ -7123,6 +7333,7 @@ xlog_desc(StringInfo buf, uint8 xl_info, char *rec) checkpoint->nextMultiOffset, checkpoint->oldestXid, checkpoint->oldestXidDB, + checkpoint->oldestActiveXid, (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online"); } else if (info == XLOG_NOOP) @@ -7155,6 +7366,9 @@ xlog_outrec(StringInfo buf, XLogRecord *record) record->xl_prev.xlogid, record->xl_prev.xrecoff, record->xl_xid); + appendStringInfo(buf, "; len %u", + record->xl_len); + for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++) { if (record->xl_info & XLR_SET_BKP_BLOCK(i)) @@ -7311,6 +7525,12 @@ pg_start_backup(PG_FUNCTION_ARGS) (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), errmsg("must be superuser to run a backup"))); + if (RecoveryInProgress()) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("recovery is in progress"), + errhint("WAL control functions cannot be executed during recovery."))); + if (!XLogArchivingActive()) ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), @@ -7498,6 +7718,12 @@ pg_stop_backup(PG_FUNCTION_ARGS) (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), (errmsg("must be superuser to run a backup")))); + if (RecoveryInProgress()) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("recovery is in progress"), + 
errhint("WAL control functions cannot be executed during recovery."))); + if (!XLogArchivingActive()) ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), @@ -7659,6 +7885,12 @@ pg_switch_xlog(PG_FUNCTION_ARGS) (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), (errmsg("must be superuser to switch transaction log files")))); + if (RecoveryInProgress()) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("recovery is in progress"), + errhint("WAL control functions cannot be executed during recovery."))); + switchpoint = RequestXLogSwitch(); /* @@ -7681,6 +7913,12 @@ pg_current_xlog_location(PG_FUNCTION_ARGS) { char location[MAXFNAMELEN]; + if (RecoveryInProgress()) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("recovery is in progress"), + errhint("WAL control functions cannot be executed during recovery."))); + /* Make sure we have an up-to-date local LogwrtResult */ { /* use volatile pointer to prevent code rearrangement */ @@ -7708,6 +7946,12 @@ pg_current_xlog_insert_location(PG_FUNCTION_ARGS) XLogRecPtr current_recptr; char location[MAXFNAMELEN]; + if (RecoveryInProgress()) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("recovery is in progress"), + errhint("WAL control functions cannot be executed during recovery."))); + /* * Get the current end-of-WAL position ... 
shared lock is sufficient */ diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c index 25dc2f5817..452f59d21c 100644 --- a/src/backend/commands/dbcommands.c +++ b/src/backend/commands/dbcommands.c @@ -13,7 +13,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/commands/dbcommands.c,v 1.228 2009/11/12 02:46:16 tgl Exp $ + * $PostgreSQL: pgsql/src/backend/commands/dbcommands.c,v 1.229 2009/12/19 01:32:34 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -26,6 +26,7 @@ #include "access/genam.h" #include "access/heapam.h" +#include "access/transam.h" #include "access/xact.h" #include "access/xlogutils.h" #include "catalog/catalog.h" @@ -48,6 +49,7 @@ #include "storage/ipc.h" #include "storage/procarray.h" #include "storage/smgr.h" +#include "storage/standby.h" #include "utils/acl.h" #include "utils/builtins.h" #include "utils/fmgroids.h" @@ -1941,6 +1943,26 @@ dbase_redo(XLogRecPtr lsn, XLogRecord *record) dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id); + if (InHotStandby) + { + VirtualTransactionId *database_users; + + /* + * Find all users connected to this database and, after the usual + * grace period, ask them politely to terminate their sessions + * before we process the drop database record. We don't wait for + * commit because drop database is + * non-transactional.
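The patch guards many write paths with the same pattern: error out cleanly when recovery is in progress, as in the pg_start_backup()/pg_switch_xlog() hunks above, nextval(), and the LOCK TABLE restriction to modes up to ROW EXCLUSIVE. A sketch of the lock-mode variant (returning -1 in place of ereport(ERROR) is an assumption of this sketch):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified lock-mode ordering, modeled on PostgreSQL's lock levels. */
typedef enum {
    NoLock, AccessShareLock, RowShareLock, RowExclusiveLock,
    ShareUpdateExclusiveLock, ShareLock, ShareRowExclusiveLock,
    ExclusiveLock, AccessExclusiveLock
} LockMode;

static bool in_recovery = true;   /* stand-in for RecoveryInProgress() */

/*
 * During recovery, only ACCESS SHARE, ROW SHARE, and ROW EXCLUSIVE
 * are accepted; anything stronger is rejected, matching the
 * LockTableCommand() test above.
 */
static int check_lock_allowed_in_recovery(LockMode mode)
{
    if (in_recovery && mode > RowExclusiveLock)
        return -1;        /* ERROR: recovery is in progress */
    return 0;
}
```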
+ */ + database_users = GetConflictingVirtualXIDs(InvalidTransactionId, + xlrec->db_id, + false); + + ResolveRecoveryConflictWithVirtualXIDs(database_users, + "drop database", + CONFLICT_MODE_FATAL); + } + /* Drop pages for this database that are in the shared buffer cache */ DropDatabaseBuffers(xlrec->db_id); diff --git a/src/backend/commands/lockcmds.c b/src/backend/commands/lockcmds.c index 043d68ac7a..dbf33e957f 100644 --- a/src/backend/commands/lockcmds.c +++ b/src/backend/commands/lockcmds.c @@ -8,7 +8,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/commands/lockcmds.c,v 1.25 2009/06/11 14:48:56 momjian Exp $ + * $PostgreSQL: pgsql/src/backend/commands/lockcmds.c,v 1.26 2009/12/19 01:32:34 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -47,6 +47,16 @@ LockTableCommand(LockStmt *lockstmt) reloid = RangeVarGetRelid(relation, false); + /* + * During recovery we only accept these variations: + * LOCK TABLE foo IN ACCESS SHARE MODE + * LOCK TABLE foo IN ROW SHARE MODE + * LOCK TABLE foo IN ROW EXCLUSIVE MODE + * This test must match the restrictions defined in LockAcquire() + */ + if (lockstmt->mode > RowExclusiveLock) + PreventCommandDuringRecovery(); + LockTableRecurse(reloid, relation, lockstmt->mode, lockstmt->nowait, recurse); } diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c index 5f590f0c73..fb10b3c230 100644 --- a/src/backend/commands/sequence.c +++ b/src/backend/commands/sequence.c @@ -8,7 +8,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/commands/sequence.c,v 1.162 2009/10/13 00:53:07 tgl Exp $ + * $PostgreSQL: pgsql/src/backend/commands/sequence.c,v 1.163 2009/12/19 01:32:34 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -458,6 +458,9 @@ nextval_internal(Oid relid) rescnt = 0; bool logit = false; + /* nextval() writes to database and must be prevented during recovery */ + 
PreventCommandDuringRecovery(); + /* open and AccessShareLock sequence */ init_sequence(relid, &elm, &seqrel); diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c index 595fb330b6..cd8c741289 100644 --- a/src/backend/commands/tablespace.c +++ b/src/backend/commands/tablespace.c @@ -37,7 +37,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/commands/tablespace.c,v 1.63 2009/11/10 18:53:38 tgl Exp $ + * $PostgreSQL: pgsql/src/backend/commands/tablespace.c,v 1.64 2009/12/19 01:32:34 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -50,6 +50,7 @@ #include "access/heapam.h" #include "access/sysattr.h" +#include "access/transam.h" #include "access/xact.h" #include "catalog/catalog.h" @@ -60,6 +61,8 @@ #include "miscadmin.h" #include "postmaster/bgwriter.h" #include "storage/fd.h" +#include "storage/procarray.h" +#include "storage/standby.h" #include "utils/acl.h" #include "utils/builtins.h" #include "utils/fmgroids.h" @@ -1317,11 +1320,58 @@ tblspc_redo(XLogRecPtr lsn, XLogRecord *record) { xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record); + /* + * If we issued a WAL record for a drop tablespace it is + * because there were no files in it at all. That means that + * no permanent objects can exist in it at this point. + * + * It is possible for standby users to be using this tablespace + * as a location for their temporary files, so if we fail to + * remove all files then do conflict processing and try again, + * if currently enabled. + */ if (!remove_tablespace_directories(xlrec->ts_id, true)) - ereport(ERROR, + { + VirtualTransactionId *temp_file_users; + + /* + * Standby users may be currently using this tablespace + * for their temporary files. We only care about current + * users because the temp_tablespaces parameter will just ignore + * tablespaces that no longer exist.
+ * + * Ask everybody to cancel their queries immediately so + * we can ensure no temp files remain and we can remove the + * tablespace. Nuke the entire site from orbit, it's the only + * way to be sure. + * + * XXX: We could work out the pids of active backends + * using this tablespace by examining the temp filenames in the + * directory. We would then convert the pids into VirtualXIDs + * before attempting to cancel them. + * + * We don't wait for commit because drop tablespace is + * non-transactional. + */ + temp_file_users = GetConflictingVirtualXIDs(InvalidTransactionId, + InvalidOid, + false); + ResolveRecoveryConflictWithVirtualXIDs(temp_file_users, + "drop tablespace", + CONFLICT_MODE_ERROR); + + /* + * If we did recovery processing then hopefully the + * backends who wrote temp files should have cleaned up and + * exited by now. So let's recheck before we throw an error. + * If !process_conflicts then this will just fail again. + */ + if (!remove_tablespace_directories(xlrec->ts_id, true)) + ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("tablespace %u is not empty", xlrec->ts_id))); + } } else elog(PANIC, "tblspc_redo: unknown op code %u", info); diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index 15648d5a39..b9d1fac088 100644 --- a/src/backend/commands/vacuum.c +++ b/src/backend/commands/vacuum.c @@ -13,7 +13,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/commands/vacuum.c,v 1.398 2009/12/09 21:57:51 tgl Exp $ + * $PostgreSQL: pgsql/src/backend/commands/vacuum.c,v 1.399 2009/12/19 01:32:34 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -141,6 +141,7 @@ typedef struct VRelStats /* vtlinks array for tuple chain following - sorted by new_tid */ int num_vtlinks; VTupleLink vtlinks; + TransactionId latestRemovedXid; } VRelStats; /*---------------------------------------------------------------------- @@ -224,7 +225,7 @@ static void
scan_heap(VRelStats *vacrelstats, Relation onerel, static bool repair_frag(VRelStats *vacrelstats, Relation onerel, VacPageList vacuum_pages, VacPageList fraged_pages, int nindexes, Relation *Irel); -static void move_chain_tuple(Relation rel, +static void move_chain_tuple(VRelStats *vacrelstats, Relation rel, Buffer old_buf, Page old_page, HeapTuple old_tup, Buffer dst_buf, Page dst_page, VacPage dst_vacpage, ExecContext ec, ItemPointer ctid, bool cleanVpd); @@ -237,7 +238,7 @@ static void update_hint_bits(Relation rel, VacPageList fraged_pages, int num_moved); static void vacuum_heap(VRelStats *vacrelstats, Relation onerel, VacPageList vacpagelist); -static void vacuum_page(Relation onerel, Buffer buffer, VacPage vacpage); +static void vacuum_page(VRelStats *vacrelstats, Relation onerel, Buffer buffer, VacPage vacpage); static void vacuum_index(VacPageList vacpagelist, Relation indrel, double num_tuples, int keep_tuples); static void scan_index(Relation indrel, double num_tuples); @@ -1300,6 +1301,7 @@ full_vacuum_rel(Relation onerel, VacuumStmt *vacstmt) vacrelstats->rel_tuples = 0; vacrelstats->rel_indexed_tuples = 0; vacrelstats->hasindex = false; + vacrelstats->latestRemovedXid = InvalidTransactionId; /* scan the heap */ vacuum_pages.num_pages = fraged_pages.num_pages = 0; @@ -1708,6 +1710,9 @@ scan_heap(VRelStats *vacrelstats, Relation onerel, { ItemId lpp; + HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data, + &vacrelstats->latestRemovedXid); + /* * Here we are building a temporary copy of the page with dead * tuples removed. 
Below we will apply @@ -2025,7 +2030,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel, /* there are dead tuples on this page - clean them */ Assert(!isempty); LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); - vacuum_page(onerel, buf, last_vacuum_page); + vacuum_page(vacrelstats, onerel, buf, last_vacuum_page); LockBuffer(buf, BUFFER_LOCK_UNLOCK); } else @@ -2514,7 +2519,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel, tuple.t_data = (HeapTupleHeader) PageGetItem(Cpage, Citemid); tuple_len = tuple.t_len = ItemIdGetLength(Citemid); - move_chain_tuple(onerel, Cbuf, Cpage, &tuple, + move_chain_tuple(vacrelstats, onerel, Cbuf, Cpage, &tuple, dst_buffer, dst_page, destvacpage, &ec, &Ctid, vtmove[ti].cleanVpd); @@ -2600,7 +2605,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel, dst_page = BufferGetPage(dst_buffer); /* if this page was not used before - clean it */ if (!PageIsEmpty(dst_page) && dst_vacpage->offsets_used == 0) - vacuum_page(onerel, dst_buffer, dst_vacpage); + vacuum_page(vacrelstats, onerel, dst_buffer, dst_vacpage); } else LockBuffer(dst_buffer, BUFFER_LOCK_EXCLUSIVE); @@ -2753,7 +2758,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel, HOLD_INTERRUPTS(); heldoff = true; ForceSyncCommit(); - (void) RecordTransactionCommit(); + (void) RecordTransactionCommit(true); } /* @@ -2781,7 +2786,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel, LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); page = BufferGetPage(buf); if (!PageIsEmpty(page)) - vacuum_page(onerel, buf, *curpage); + vacuum_page(vacrelstats, onerel, buf, *curpage); UnlockReleaseBuffer(buf); } } @@ -2917,7 +2922,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel, recptr = log_heap_clean(onerel, buf, NULL, 0, NULL, 0, unused, uncnt, - false); + vacrelstats->latestRemovedXid, false); PageSetLSN(page, recptr); PageSetTLI(page, ThisTimeLineID); } @@ -2969,7 +2974,7 @@ repair_frag(VRelStats *vacrelstats, Relation onerel, * already too long and almost unreadable. 
*/ static void -move_chain_tuple(Relation rel, +move_chain_tuple(VRelStats *vacrelstats, Relation rel, Buffer old_buf, Page old_page, HeapTuple old_tup, Buffer dst_buf, Page dst_page, VacPage dst_vacpage, ExecContext ec, ItemPointer ctid, bool cleanVpd) @@ -3027,7 +3032,7 @@ move_chain_tuple(Relation rel, int sv_offsets_used = dst_vacpage->offsets_used; dst_vacpage->offsets_used = 0; - vacuum_page(rel, dst_buf, dst_vacpage); + vacuum_page(vacrelstats, rel, dst_buf, dst_vacpage); dst_vacpage->offsets_used = sv_offsets_used; } @@ -3367,7 +3372,7 @@ vacuum_heap(VRelStats *vacrelstats, Relation onerel, VacPageList vacuum_pages) buf = ReadBufferExtended(onerel, MAIN_FORKNUM, (*vacpage)->blkno, RBM_NORMAL, vac_strategy); LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); - vacuum_page(onerel, buf, *vacpage); + vacuum_page(vacrelstats, onerel, buf, *vacpage); UnlockReleaseBuffer(buf); } } @@ -3397,7 +3402,7 @@ vacuum_heap(VRelStats *vacrelstats, Relation onerel, VacPageList vacuum_pages) * Caller must hold pin and lock on buffer. 
*/ static void -vacuum_page(Relation onerel, Buffer buffer, VacPage vacpage) +vacuum_page(VRelStats *vacrelstats, Relation onerel, Buffer buffer, VacPage vacpage) { Page page = BufferGetPage(buffer); int i; @@ -3426,7 +3431,7 @@ vacuum_page(Relation onerel, Buffer buffer, VacPage vacpage) recptr = log_heap_clean(onerel, buffer, NULL, 0, NULL, 0, vacpage->offsets, vacpage->offsets_free, - false); + vacrelstats->latestRemovedXid, false); PageSetLSN(page, recptr); PageSetTLI(page, ThisTimeLineID); } diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c index 50c96e948e..acc6c42707 100644 --- a/src/backend/commands/vacuumlazy.c +++ b/src/backend/commands/vacuumlazy.c @@ -29,7 +29,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/commands/vacuumlazy.c,v 1.124 2009/11/16 21:32:06 tgl Exp $ + * $PostgreSQL: pgsql/src/backend/commands/vacuumlazy.c,v 1.125 2009/12/19 01:32:34 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -98,6 +98,7 @@ typedef struct LVRelStats int max_dead_tuples; /* # slots allocated in array */ ItemPointer dead_tuples; /* array of ItemPointerData */ int num_index_scans; + TransactionId latestRemovedXid; } LVRelStats; @@ -265,6 +266,34 @@ lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, return heldoff; } +/* + * For Hot Standby we need to know the highest transaction id that will + * be removed by any change. VACUUM proceeds in a number of passes so + * we need to consider how each pass operates. The first phase runs + * heap_page_prune(), which can issue XLOG_HEAP2_CLEAN records as it + * progresses - these will have a latestRemovedXid on each record. + * In some cases this removes all of the tuples to be removed, though + * often we have dead tuples with index pointers so we must remember them + * for removal in phase 3. Index records for those rows are removed + * in phase 2 and index blocks do not have MVCC information attached. 
+ * So before we can allow removal of any index tuples we need to issue + * a WAL record containing the latestRemovedXid of rows that will be + * removed in phase three. This allows recovery queries to block at the + * correct place, i.e. before phase two, rather than during phase three + * which would be after the rows have become inaccessible. + */ +static void +vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats) +{ + /* + * No need to log changes for temp tables, they do not contain + * data visible on the standby server. + */ + if (rel->rd_istemp || !XLogArchivingActive()) + return; + + (void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid); +} /* * lazy_scan_heap() -- scan an open heap relation @@ -315,6 +344,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, nblocks = RelationGetNumberOfBlocks(onerel); vacrelstats->rel_pages = nblocks; vacrelstats->nonempty_pages = 0; + vacrelstats->latestRemovedXid = InvalidTransactionId; lazy_space_alloc(vacrelstats, nblocks); @@ -373,6 +403,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, if ((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxHeapTuplesPerPage && vacrelstats->num_dead_tuples > 0) { + /* Log cleanup info before we touch indexes */ + vacuum_log_cleanup_info(onerel, vacrelstats); + /* Remove index entries */ for (i = 0; i < nindexes; i++) lazy_vacuum_index(Irel[i], @@ -382,6 +415,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, lazy_vacuum_heap(onerel, vacrelstats); /* Forget the now-vacuumed tuples, and press on */ vacrelstats->num_dead_tuples = 0; + vacrelstats->latestRemovedXid = InvalidTransactionId; vacrelstats->num_index_scans++; } @@ -613,6 +647,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, if (tupgone) { lazy_record_dead_tuple(vacrelstats, &(tuple.t_self)); + HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data, + &vacrelstats->latestRemovedXid); tups_vacuumed += 1; } else @@ -661,6 +697,7 @@ 
lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats); /* Forget the now-vacuumed tuples, and press on */ vacrelstats->num_dead_tuples = 0; + vacrelstats->latestRemovedXid = InvalidTransactionId; vacuumed_pages++; } @@ -724,6 +761,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, /* XXX put a threshold on min number of tuples here? */ if (vacrelstats->num_dead_tuples > 0) { + /* Log cleanup info before we touch indexes */ + vacuum_log_cleanup_info(onerel, vacrelstats); + /* Remove index entries */ for (i = 0; i < nindexes; i++) lazy_vacuum_index(Irel[i], @@ -868,7 +908,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer, recptr = log_heap_clean(onerel, buffer, NULL, 0, NULL, 0, unused, uncnt, - false); + vacrelstats->latestRemovedXid, false); PageSetLSN(page, recptr); PageSetTLI(page, ThisTimeLineID); } diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index b616eaca13..21fc83ab4b 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -37,7 +37,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/postmaster/postmaster.c,v 1.596 2009/09/08 17:08:36 tgl Exp $ + * $PostgreSQL: pgsql/src/backend/postmaster/postmaster.c,v 1.597 2009/12/19 01:32:34 sriggs Exp $ * * NOTES * @@ -245,8 +245,9 @@ static bool RecoveryError = false; /* T if WAL recovery failed */ * When archive recovery is finished, the startup process exits with exit * code 0 and we switch to PM_RUN state. * - * Normal child backends can only be launched when we are in PM_RUN state. - * (We also allow it in PM_WAIT_BACKUP state, but only for superusers.) + * Normal child backends can only be launched when we are in PM_RUN or + * PM_RECOVERY_CONSISTENT state. (We also allow launch of normal + * child backends in PM_WAIT_BACKUP state, but only for superusers.) 
* In other states we handle connection requests by launching "dead_end" * child processes, which will simply send the client an error message and * quit. (We track these in the BackendList so that we can know when they @@ -1868,7 +1869,7 @@ static enum CAC_state canAcceptConnections(void) { /* - * Can't start backends when in startup/shutdown/recovery state. + * Can't start backends when in startup/shutdown/inconsistent recovery state. * * In state PM_WAIT_BACKUP only superusers can connect (this must be * allowed so that a superuser can end online backup mode); we return @@ -1882,9 +1883,11 @@ canAcceptConnections(void) return CAC_SHUTDOWN; /* shutdown is pending */ if (!FatalError && (pmState == PM_STARTUP || - pmState == PM_RECOVERY || - pmState == PM_RECOVERY_CONSISTENT)) + pmState == PM_RECOVERY)) return CAC_STARTUP; /* normal startup */ + if (!FatalError && + pmState == PM_RECOVERY_CONSISTENT) + return CAC_OK; /* connection OK during recovery */ return CAC_RECOVERY; /* else must be crash recovery */ } @@ -4003,9 +4006,8 @@ sigusr1_handler(SIGNAL_ARGS) Assert(PgStatPID == 0); PgStatPID = pgstat_start(); - /* XXX at this point we could accept read-only connections */ - ereport(DEBUG1, - (errmsg("database system is in consistent recovery mode"))); + ereport(LOG, + (errmsg("database system is ready to accept read only connections"))); pmState = PM_RECOVERY_CONSISTENT; } diff --git a/src/backend/storage/ipc/Makefile b/src/backend/storage/ipc/Makefile index 20ac1e75e4..1d897c5afb 100644 --- a/src/backend/storage/ipc/Makefile +++ b/src/backend/storage/ipc/Makefile @@ -1,7 +1,7 @@ # # Makefile for storage/ipc # -# $PostgreSQL: pgsql/src/backend/storage/ipc/Makefile,v 1.22 2009/07/31 20:26:23 tgl Exp $ +# $PostgreSQL: pgsql/src/backend/storage/ipc/Makefile,v 1.23 2009/12/19 01:32:35 sriggs Exp $ # subdir = src/backend/storage/ipc @@ -16,6 +16,6 @@ endif endif OBJS = ipc.o ipci.o pmsignal.o procarray.o procsignal.o shmem.o shmqueue.o \ - sinval.o sinvaladt.o + sinval.o 
sinvaladt.o standby.o include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c index 9a3d2f6260..c4ddf8f2bd 100644 --- a/src/backend/storage/ipc/procarray.c +++ b/src/backend/storage/ipc/procarray.c @@ -17,13 +17,27 @@ * as are the myProcLocks lists. They can be distinguished from regular * backend PGPROCs at need by checking for pid == 0. * + * During recovery, we also keep a list of XIDs representing transactions + * that are known to be running at the current point in WAL recovery. This + * list is kept in the KnownAssignedXids array, and updated by watching + * the sequence of arriving xids. This is very important because if we leave + * those xids out of the snapshot then they will appear to be already complete. + * Later, when they have actually completed, this could lead to confusion as to + * whether those xids are visible or not, blowing a huge hole in MVCC. + * We need 'em. + * + * It is theoretically possible for a FATAL error to explode before writing + * an abort record. This could tie up KnownAssignedXids indefinitely, so + * we prune the array when a valid list of running xids arrives. These quirks, + * if they ever do occur in reality, will not affect the correctness of + * snapshots.
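The header comment above is the crux of the Hot Standby snapshot design. A small standalone model (not part of the patch; the type, field, and function names here are invented for illustration, and plain uint32 comparisons stand in for the wraparound-aware TransactionIdPrecedes()) shows concretely what "blowing a huge hole in MVCC" means: an in-progress xid left out of the snapshot's running list is indistinguishable from a committed one.

```c
/* Toy model of MVCC snapshot visibility -- NOT PostgreSQL code. */
#include <stdbool.h>
#include <stdint.h>

typedef struct
{
    uint32_t    xmin;       /* all xids < xmin have finished */
    uint32_t    xmax;       /* all xids >= xmax are in the future */
    uint32_t    xip[8];     /* in-progress xids within [xmin, xmax) */
    int         xcnt;       /* number of entries in xip[] */
} ToySnapshot;

/* Should tuples created by 'xid' be visible to this snapshot?
 * (Commit status is ignored; every finished xid is treated as
 * committed, which is enough to show the failure mode.) */
static bool
toy_xid_visible(const ToySnapshot *snap, uint32_t xid)
{
    if (xid >= snap->xmax)
        return false;           /* started after the snapshot was taken */
    if (xid >= snap->xmin)
    {
        for (int i = 0; i < snap->xcnt; i++)
        {
            if (snap->xip[i] == xid)
                return false;   /* still running: must stay invisible */
        }
    }
    return true;                /* looks finished, so treated as visible */
}
```

If xid 95 is running but missing from `xip[]` (as it would be on a standby without KnownAssignedXids), `toy_xid_visible()` wrongly reports it visible, exactly the inconsistency the array exists to prevent.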
* * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/storage/ipc/procarray.c,v 1.51 2009/07/29 15:57:11 tgl Exp $ + * $PostgreSQL: pgsql/src/backend/storage/ipc/procarray.c,v 1.52 2009/12/19 01:32:35 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -31,14 +45,18 @@ #include +#include "access/clog.h" #include "access/subtrans.h" #include "access/transam.h" #include "access/xact.h" #include "access/twophase.h" #include "miscadmin.h" #include "storage/procarray.h" +#include "storage/standby.h" +#include "utils/builtins.h" #include "utils/snapmgr.h" +static RunningTransactionsData CurrentRunningXactsData; /* Our shared memory area */ typedef struct ProcArrayStruct @@ -46,6 +64,14 @@ typedef struct ProcArrayStruct int numProcs; /* number of valid procs entries */ int maxProcs; /* allocated size of procs array */ + int numKnownAssignedXids; /* current number of known assigned xids */ + int maxKnownAssignedXids; /* allocated size of known assigned xids */ + /* + * Highest subxid that overflowed KnownAssignedXids array. Similar to + * overflowing cached subxids in PGPROC entries. + */ + TransactionId lastOverflowedXid; + /* * We declare procs[] as 1 entry because C wants a fixed-size array, but * actually it is maxProcs entries long. @@ -55,6 +81,24 @@ typedef struct ProcArrayStruct static ProcArrayStruct *procArray; +/* + * Bookkeeping for tracking emulated transactions in recovery + */ +static HTAB *KnownAssignedXidsHash; +static TransactionId latestObservedXid = InvalidTransactionId; + +/* + * If we're in STANDBY_SNAPSHOT_PENDING state, standbySnapshotPendingXmin is + * the highest xid that might still be running that we don't have in + * KnownAssignedXids. 
+ */ +static TransactionId standbySnapshotPendingXmin; + +/* + * Oldest transaction still running according to the running-xacts snapshot + * we initialized standby mode from. + */ +static TransactionId snapshotOldestActiveXid; #ifdef XIDCACHE_DEBUG @@ -90,6 +134,17 @@ static void DisplayXidCache(void); #define xc_slow_answer_inc() ((void) 0) #endif /* XIDCACHE_DEBUG */ +/* Primitives for KnownAssignedXids array handling for standby */ +static Size KnownAssignedXidsShmemSize(int size); +static void KnownAssignedXidsInit(int size); +static int KnownAssignedXidsGet(TransactionId *xarray, TransactionId xmax); +static int KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin, + TransactionId xmax); +static bool KnownAssignedXidsExist(TransactionId xid); +static void KnownAssignedXidsAdd(TransactionId *xids, int nxids); +static void KnownAssignedXidsRemove(TransactionId xid); +static void KnownAssignedXidsRemoveMany(TransactionId xid, bool keepPreparedXacts); +static void KnownAssignedXidsDisplay(int trace_level); /* * Report shared-memory space needed by CreateSharedProcArray. @@ -100,8 +155,22 @@ ProcArrayShmemSize(void) Size size; size = offsetof(ProcArrayStruct, procs); - size = add_size(size, mul_size(sizeof(PGPROC *), - add_size(MaxBackends, max_prepared_xacts))); + + /* Normal processing - MyProc slots */ +#define PROCARRAY_MAXPROCS (MaxBackends + max_prepared_xacts) + size = add_size(size, mul_size(sizeof(PGPROC *), PROCARRAY_MAXPROCS)); + + /* + * During recovery processing we have a data structure called KnownAssignedXids, + * created in shared memory. Local data structures are also created in various + * backends during GetSnapshotData(), TransactionIdIsInProgress() and + * GetRunningTransactionData(). All of the main structures created in those + * functions must be identically sized, since we may at times copy the whole + * of the data structures around. We refer to this as TOTAL_MAX_CACHED_SUBXIDS. 
+ */ +#define TOTAL_MAX_CACHED_SUBXIDS ((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS) + if (XLogRequestRecoveryConnections) + size = add_size(size, KnownAssignedXidsShmemSize(TOTAL_MAX_CACHED_SUBXIDS)); return size; } @@ -116,15 +185,21 @@ CreateSharedProcArray(void) /* Create or attach to the ProcArray shared structure */ procArray = (ProcArrayStruct *) - ShmemInitStruct("Proc Array", ProcArrayShmemSize(), &found); + ShmemInitStruct("Proc Array", + mul_size(sizeof(PGPROC *), PROCARRAY_MAXPROCS), + &found); if (!found) { /* * We're the first - initialize. */ + /* Normal processing */ procArray->numProcs = 0; - procArray->maxProcs = MaxBackends + max_prepared_xacts; + procArray->maxProcs = PROCARRAY_MAXPROCS; + + if (XLogRequestRecoveryConnections) + KnownAssignedXidsInit(TOTAL_MAX_CACHED_SUBXIDS); } } @@ -302,6 +377,7 @@ ProcArrayClearTransaction(PGPROC *proc) proc->xid = InvalidTransactionId; proc->lxid = InvalidLocalTransactionId; proc->xmin = InvalidTransactionId; + proc->recoveryConflictMode = 0; /* redundant, but just in case */ proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK; @@ -312,6 +388,220 @@ ProcArrayClearTransaction(PGPROC *proc) proc->subxids.overflowed = false; } +void +ProcArrayInitRecoveryInfo(TransactionId oldestActiveXid) +{ + snapshotOldestActiveXid = oldestActiveXid; +} + +/* + * ProcArrayApplyRecoveryInfo -- apply recovery info about xids + * + * Takes us through 3 states: Uninitialized, Pending and Ready. + * The normal case is to go all the way to Ready straight away, though there + * are atypical cases where we need to take it in steps. + * + * Use the data about running transactions on master to create the initial + * state of KnownAssignedXids. We also use these records to regularly prune + * KnownAssignedXids because we know it is possible that some transactions + * with FATAL errors do not write abort records, which could cause eventual + * overflow. + * + * Only used during recovery.
Notice the signature is very similar to a + * _redo function and it's difficult to decide exactly where this code should + * reside. + */ +void +ProcArrayApplyRecoveryInfo(RunningTransactions running) +{ + int xid_index; /* main loop */ + TransactionId *xids; + int nxids; + + Assert(standbyState >= STANDBY_INITIALIZED); + + /* + * Remove stale transactions, if any. + */ + ExpireOldKnownAssignedTransactionIds(running->oldestRunningXid); + StandbyReleaseOldLocks(running->oldestRunningXid); + + /* + * If our snapshot is already valid, nothing else to do... + */ + if (standbyState == STANDBY_SNAPSHOT_READY) + return; + + /* + * If our initial RunningXactData had an overflowed snapshot then we + * knew we were missing some subxids from our snapshot. We can use + * this data as an initial snapshot, but we cannot yet mark it valid. + * We know that the missing subxids are equal to or earlier than + * nextXid. After we initialise we continue to apply changes during + * recovery, so once the oldestRunningXid is later than the nextXid + * from the initial snapshot we know that we no longer have missing + * information and can mark the snapshot as valid.
+ */ + if (standbyState == STANDBY_SNAPSHOT_PENDING) + { + if (TransactionIdPrecedes(standbySnapshotPendingXmin, + running->oldestRunningXid)) + { + standbyState = STANDBY_SNAPSHOT_READY; + elog(trace_recovery(DEBUG2), + "running xact data now proven complete"); + elog(trace_recovery(DEBUG2), + "recovery snapshots are now enabled"); + } + return; + } + + /* + * OK, we need to initialise from the RunningXactData record + */ + latestObservedXid = running->nextXid; + TransactionIdRetreat(latestObservedXid); + + /* + * If the snapshot overflowed, then we still initialise with what we + * know, but the recovery snapshot isn't fully valid yet because we + * know there are some subxids missing (ergo we don't know which ones) + */ + if (!running->subxid_overflow) + { + standbyState = STANDBY_SNAPSHOT_READY; + standbySnapshotPendingXmin = InvalidTransactionId; + } + else + { + standbyState = STANDBY_SNAPSHOT_PENDING; + standbySnapshotPendingXmin = latestObservedXid; + ereport(LOG, + (errmsg("consistent state delayed because recovery snapshot incomplete"))); + } + + nxids = running->xcnt; + xids = running->xids; + + KnownAssignedXidsDisplay(trace_recovery(DEBUG3)); + + /* + * Scan through the incoming array of RunningXacts and collect xids. + * We don't use SubtransSetParent because it doesn't matter yet. If + * we aren't overflowed then all xids will fit in snapshot and so we + * don't need subtrans. If we later overflow, an xid assignment record + * will add xids to subtrans. If RunningXacts is overflowed then we + * don't have enough information to correctly update subtrans anyway. 
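The STANDBY_SNAPSHOT_PENDING logic just applied can be sketched as a tiny standalone state machine (illustrative only, not patch code; the names are invented, and a plain uint32 compare stands in for TransactionIdPrecedes(), so xid wraparound is ignored):

```c
/* Toy model of the pending->ready snapshot transition -- NOT PostgreSQL code. */
#include <stdint.h>

typedef enum
{
    TOY_SNAPSHOT_PENDING,   /* initial snapshot overflowed: subxids missing */
    TOY_SNAPSHOT_READY      /* snapshot proven complete */
} ToyStandbyState;

/* Apply one incoming running-xacts record. Any subxid missing from the
 * initial snapshot was <= the pending xmin, so once the oldest xid still
 * running on the master has advanced past that point, every possibly
 * missing xid must have finished and the snapshot can be marked valid. */
static ToyStandbyState
toy_apply_running_xacts(ToyStandbyState state,
                        uint32_t pendingXmin,
                        uint32_t oldestRunningXid)
{
    if (state == TOY_SNAPSHOT_PENDING && pendingXmin < oldestRunningXid)
        return TOY_SNAPSHOT_READY;
    return state;       /* stay pending (or stay ready) */
}
```

Note the transition is one-way, matching the early `return` for STANDBY_SNAPSHOT_READY in the real function.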
+ */ + + /* + * Nobody else is running yet, but take locks anyhow + */ + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); + + /* Reset latestCompletedXid */ + ShmemVariableCache->latestCompletedXid = running->nextXid; + TransactionIdRetreat(ShmemVariableCache->latestCompletedXid); + + /* + * Add our new xids into the array + */ + for (xid_index = 0; xid_index < running->xcnt; xid_index++) + { + TransactionId xid = running->xids[xid_index]; + + /* + * The running-xacts snapshot can contain xids that did finish between + * when the snapshot was taken and when it was written to WAL. Such + * transactions are not running anymore, so ignore them. + */ + if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid)) + continue; + + KnownAssignedXidsAdd(&xid, 1); + } + + KnownAssignedXidsDisplay(trace_recovery(DEBUG3)); + + /* + * Update lastOverflowedXid if the snapshot had overflown. We don't know + * the exact value for this, so conservatively assume that it's nextXid-1 + */ + if (running->subxid_overflow && + TransactionIdFollows(latestObservedXid, procArray->lastOverflowedXid)) + procArray->lastOverflowedXid = latestObservedXid; + else if (TransactionIdFollows(running->oldestRunningXid, + procArray->lastOverflowedXid)) + procArray->lastOverflowedXid = InvalidTransactionId; + + LWLockRelease(ProcArrayLock); + + /* nextXid must be beyond any observed xid */ + if (TransactionIdFollows(running->nextXid, ShmemVariableCache->nextXid)) + ShmemVariableCache->nextXid = running->nextXid; + + elog(trace_recovery(DEBUG2), + "running transaction data initialized"); + if (standbyState == STANDBY_SNAPSHOT_READY) + elog(trace_recovery(DEBUG2), + "recovery snapshots are now enabled"); +} + +void +ProcArrayApplyXidAssignment(TransactionId topxid, + int nsubxids, TransactionId *subxids) +{ + TransactionId max_xid; + int i; + + if (standbyState < STANDBY_SNAPSHOT_PENDING) + return; + + max_xid = TransactionIdLatest(topxid, nsubxids, subxids); + + /* + * Mark all the subtransactions as 
observed. + * + * NOTE: This will fail if the subxid contains too many previously + * unobserved xids to fit into known-assigned-xids. That shouldn't happen + * as the code stands, because xid-assignment records should never contain + * more than PGPROC_MAX_CACHED_SUBXIDS entries. + */ + RecordKnownAssignedTransactionIds(max_xid); + + /* + * Notice that we update pg_subtrans with the top-level xid, rather + * than the parent xid. This is a difference between normal + * processing and recovery, yet is still correct in all cases. The + * reason is that subtransaction commit is not marked in clog until + * commit processing, so all aborted subtransactions have already been + * clearly marked in clog. As a result we are able to refer directly + * to the top-level transaction's state rather than skipping through + * all the intermediate states in the subtransaction tree. This + * should be the first time we have attempted to SubTransSetParent(). + */ + for (i = 0; i < nsubxids; i++) + SubTransSetParent(subxids[i], topxid, false); + + /* + * Uses same locking as transaction commit + */ + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); + + /* + * Remove from known-assigned-xacts. + */ + for (i = 0; i < nsubxids; i++) + KnownAssignedXidsRemove(subxids[i]); + + /* + * Advance lastOverflowedXid when required. + */ + if (TransactionIdPrecedes(procArray->lastOverflowedXid, max_xid)) + procArray->lastOverflowedXid = max_xid; + + LWLockRelease(ProcArrayLock); +} /* * TransactionIdIsInProgress -- is given transaction running in some backend @@ -384,8 +674,15 @@ TransactionIdIsInProgress(TransactionId xid) */ if (xids == NULL) { - xids = (TransactionId *) - malloc(arrayP->maxProcs * sizeof(TransactionId)); + /* + * In hot standby mode, reserve enough space to hold all xids in + * the known-assigned list. If we later finish recovery, we no longer + * need the bigger array, but we don't bother to shrink it. + */ + int maxxids = RecoveryInProgress() ? 
+ TOTAL_MAX_CACHED_SUBXIDS : arrayP->maxProcs; + + xids = (TransactionId *) malloc(maxxids * sizeof(TransactionId)); if (xids == NULL) ereport(ERROR, (errcode(ERRCODE_OUT_OF_MEMORY), @@ -465,11 +762,35 @@ TransactionIdIsInProgress(TransactionId xid) xids[nxids++] = pxid; } + /* In hot standby mode, check the known-assigned-xids list. */ + if (RecoveryInProgress()) + { + /* none of the PGPROC entries should have XIDs in hot standby mode */ + Assert(nxids == 0); + + if (KnownAssignedXidsExist(xid)) + { + LWLockRelease(ProcArrayLock); + /* XXX: should we have a separate counter for this? */ + /* xc_by_main_xid_inc(); */ + return true; + } + + /* + * If the KnownAssignedXids overflowed, we have to check + * pg_subtrans too. Copy all xids from KnownAssignedXids that are + * lower than xid, since if xid is a subtransaction its parent will + * always have a lower value. + */ + if (TransactionIdPrecedesOrEquals(xid, procArray->lastOverflowedXid)) + nxids = KnownAssignedXidsGet(xids, xid); + } + LWLockRelease(ProcArrayLock); /* * If none of the relevant caches overflowed, we know the Xid is not - * running without looking at pg_subtrans. + * running without even looking at pg_subtrans. */ if (nxids == 0) { @@ -590,6 +911,9 @@ GetOldestXmin(bool allDbs, bool ignoreVacuum) TransactionId result; int index; + /* Cannot look for individual databases during recovery */ + Assert(allDbs || !RecoveryInProgress()); + LWLockAcquire(ProcArrayLock, LW_SHARED); /* @@ -635,6 +959,13 @@ GetOldestXmin(bool allDbs, bool ignoreVacuum) LWLockRelease(ProcArrayLock); + /* + * Compute the cutoff XID, being careful not to generate a "permanent" XID + */ + result -= vacuum_defer_cleanup_age; + if (!TransactionIdIsNormal(result)) + result = FirstNormalTransactionId; + return result; } @@ -656,7 +987,7 @@ GetOldestXmin(bool allDbs, bool ignoreVacuum) * but since PGPROC has only a limited cache area for subxact XIDs, full * information may not be available.
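The vacuum_defer_cleanup_age adjustment added to GetOldestXmin() above is just an unsigned subtraction plus a clamp away from the reserved "permanent" xids. A minimal standalone sketch (hypothetical names; the real code uses TransactionId and TransactionIdIsNormal(), and xid wraparound is not modelled here):

```c
/* Toy model of the deferred-cleanup cutoff -- NOT PostgreSQL code. */
#include <stdint.h>

#define TOY_FIRST_NORMAL_XID 3   /* FirstNormalTransactionId: 0..2 are reserved */

/* Back the cleanup cutoff off by defer_age so a standby gets extra time
 * before rows it can still see are vacuumed away, clamping so that the
 * result never lands among the reserved "permanent" xids. */
static uint32_t
toy_defer_cutoff(uint32_t oldestXmin, uint32_t defer_age)
{
    uint32_t result = oldestXmin - defer_age;

    if (result < TOY_FIRST_NORMAL_XID)
        result = TOY_FIRST_NORMAL_XID;
    return result;
}
```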
If we find any overflowed subxid arrays, * we have to mark the snapshot's subxid data as overflowed, and extra work - * will need to be done to determine what's running (see XidInMVCCSnapshot() + * *may* need to be done to determine what's running (see XidInMVCCSnapshot() * in tqual.c). * * We also update the following backend-global variables: @@ -681,6 +1012,7 @@ GetSnapshotData(Snapshot snapshot) int index; int count = 0; int subcount = 0; + bool suboverflowed = false; Assert(snapshot != NULL); @@ -698,7 +1030,8 @@ GetSnapshotData(Snapshot snapshot) if (snapshot->xip == NULL) { /* - * First call for this snapshot + * First call for this snapshot. Snapshot is same size whether + * or not we are in recovery, see later comments. */ snapshot->xip = (TransactionId *) malloc(arrayP->maxProcs * sizeof(TransactionId)); @@ -708,13 +1041,15 @@ GetSnapshotData(Snapshot snapshot) errmsg("out of memory"))); Assert(snapshot->subxip == NULL); snapshot->subxip = (TransactionId *) - malloc(arrayP->maxProcs * PGPROC_MAX_CACHED_SUBXIDS * sizeof(TransactionId)); + malloc(TOTAL_MAX_CACHED_SUBXIDS * sizeof(TransactionId)); if (snapshot->subxip == NULL) ereport(ERROR, (errcode(ERRCODE_OUT_OF_MEMORY), errmsg("out of memory"))); } + snapshot->takenDuringRecovery = RecoveryInProgress(); + /* * It is sufficient to get shared lock on ProcArrayLock, even if we are * going to set MyProc->xmin. @@ -763,6 +1098,7 @@ GetSnapshotData(Snapshot snapshot) */ if (TransactionIdIsNormal(xid)) { + Assert(!snapshot->takenDuringRecovery); if (TransactionIdFollowsOrEquals(xid, xmax)) continue; if (proc != MyProc) @@ -785,16 +1121,17 @@ GetSnapshotData(Snapshot snapshot) * * Again, our own XIDs are not included in the snapshot. 
*/ - if (subcount >= 0 && proc != MyProc) + if (!suboverflowed && proc != MyProc) { if (proc->subxids.overflowed) - subcount = -1; /* overflowed */ + suboverflowed = true; else { int nxids = proc->subxids.nxids; if (nxids > 0) { + Assert(!snapshot->takenDuringRecovery); memcpy(snapshot->subxip + subcount, (void *) proc->subxids.xids, nxids * sizeof(TransactionId)); @@ -804,6 +1141,40 @@ GetSnapshotData(Snapshot snapshot) } } + /* + * If in recovery get any known assigned xids. + */ + if (snapshot->takenDuringRecovery) + { + Assert(count == 0); + + /* + * We store all xids directly into subxip[]. Here's why: + * + * In recovery we don't know which xids are top-level and which are + * subxacts, a design choice that greatly simplifies xid processing. + * + * It seems like we would want to try to put xids into xip[] only, + * but that is fairly small. We would either need to make that bigger + * or to increase the rate at which we WAL-log xid assignment; + * neither is an appealing choice. + * + * We could try to store xids into xip[] first and then into subxip[] + * if there are too many xids. That only works if the snapshot doesn't + * overflow because we do not search subxip[] in that case. A simpler + * way is to just store all xids in the subxact array because this + * is by far the bigger array. We just leave the xip array empty. + * + * Either way we need to change the way XidInMVCCSnapshot() works + * depending upon when the snapshot was taken, or change normal + * snapshot processing so it matches. 
+ */ + subcount = KnownAssignedXidsGetAndSetXmin(snapshot->subxip, &xmin, xmax); + + if (TransactionIdPrecedesOrEquals(xmin, procArray->lastOverflowedXid)) + suboverflowed = true; + } + if (!TransactionIdIsValid(MyProc->xmin)) MyProc->xmin = TransactionXmin = xmin; @@ -818,13 +1189,16 @@ GetSnapshotData(Snapshot snapshot) globalxmin = xmin; /* Update global variables too */ - RecentGlobalXmin = globalxmin; + RecentGlobalXmin = globalxmin - vacuum_defer_cleanup_age; + if (!TransactionIdIsNormal(RecentGlobalXmin)) + RecentGlobalXmin = FirstNormalTransactionId; RecentXmin = xmin; snapshot->xmin = xmin; snapshot->xmax = xmax; snapshot->xcnt = count; snapshot->subxcnt = subcount; + snapshot->suboverflowed = suboverflowed; snapshot->curcid = GetCurrentCommandId(false); @@ -839,6 +1213,129 @@ GetSnapshotData(Snapshot snapshot) return snapshot; } +/* + * GetRunningTransactionData -- returns information about running transactions. + * + * Similar to GetSnapshotData but returning more information. We include + * all PGPROCs with an assigned TransactionId, even VACUUM processes. + * + * This is never executed during recovery so there is no need to look at + * KnownAssignedXids. + * + * We don't worry about updating other counters, we want to keep this as + * simple as possible and leave GetSnapshotData() as the primary code for + * that bookkeeping. + */ +RunningTransactions +GetRunningTransactionData(void) +{ + ProcArrayStruct *arrayP = procArray; + RunningTransactions CurrentRunningXacts = (RunningTransactions) &CurrentRunningXactsData; + TransactionId latestCompletedXid; + TransactionId oldestRunningXid; + TransactionId *xids; + int index; + int count; + int subcount; + bool suboverflowed; + + Assert(!RecoveryInProgress()); + + /* + * Allocating space for maxProcs xids is usually overkill; numProcs would + * be sufficient. But it seems better to do the malloc while not holding + * the lock, so we can't look at numProcs.
Likewise, we allocate much + * more subxip storage than is probably needed. + * + * Should only be allocated for bgwriter, since only ever executed + * during checkpoints. + */ + if (CurrentRunningXacts->xids == NULL) + { + /* + * First call + */ + CurrentRunningXacts->xids = (TransactionId *) + malloc(TOTAL_MAX_CACHED_SUBXIDS * sizeof(TransactionId)); + if (CurrentRunningXacts->xids == NULL) + ereport(ERROR, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("out of memory"))); + } + + xids = CurrentRunningXacts->xids; + + count = subcount = 0; + suboverflowed = false; + + /* + * Ensure that no xids enter or leave the procarray while we obtain + * snapshot. + */ + LWLockAcquire(ProcArrayLock, LW_SHARED); + LWLockAcquire(XidGenLock, LW_SHARED); + + latestCompletedXid = ShmemVariableCache->latestCompletedXid; + + oldestRunningXid = ShmemVariableCache->nextXid; + /* + * Spin over procArray collecting all xids and subxids. + */ + for (index = 0; index < arrayP->numProcs; index++) + { + volatile PGPROC *proc = arrayP->procs[index]; + TransactionId xid; + int nxids; + + /* Fetch xid just once - see GetNewTransactionId */ + xid = proc->xid; + + /* + * We don't need to store transactions that don't have a TransactionId + * yet because they will not show as running on a standby server. + */ + if (!TransactionIdIsValid(xid)) + continue; + + xids[count++] = xid; + + if (TransactionIdPrecedes(xid, oldestRunningXid)) + oldestRunningXid = xid; + + /* + * Save subtransaction XIDs. Other backends can't add or remove entries + * while we're holding XidGenLock. 
+ */ + nxids = proc->subxids.nxids; + if (nxids > 0) + { + memcpy(&xids[count], (void *) proc->subxids.xids, + nxids * sizeof(TransactionId)); + count += nxids; + subcount += nxids; + + if (proc->subxids.overflowed) + suboverflowed = true; + + /* + * Top-level XID of a transaction is always greater than any of + * its subxids, so we don't need to check if any of the subxids + * are smaller than oldestRunningXid + */ + } + } + + CurrentRunningXacts->xcnt = count; + CurrentRunningXacts->subxid_overflow = suboverflowed; + CurrentRunningXacts->nextXid = ShmemVariableCache->nextXid; + CurrentRunningXacts->oldestRunningXid = oldestRunningXid; + + LWLockRelease(XidGenLock); + LWLockRelease(ProcArrayLock); + + return CurrentRunningXacts; +} + /* * GetTransactionsInCommit -- Get the XIDs of transactions that are committing * @@ -1101,6 +1598,154 @@ GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0, return vxids; } +/* + * GetConflictingVirtualXIDs -- returns an array of currently active VXIDs. + * + * The array is palloc'd and is terminated with an invalid VXID. + * + * Usage is limited to conflict resolution during recovery on standby servers. + * limitXmin is supplied as either latestRemovedXid, or InvalidTransactionId + * in cases where we cannot accurately determine a value for latestRemovedXid. + * If limitXmin is InvalidTransactionId then we know that the very + * latest xid that might have caused a cleanup record will be + * latestCompletedXid, so we set limitXmin to be latestCompletedXid instead. + * We then skip any backends with xmin > limitXmin. This means that + * cleanup records don't conflict with some recent snapshots. + * + * We replace InvalidTransactionId with latestCompletedXid here because + * this is the most convenient place to do that, while we hold ProcArrayLock. 
+ * The originator of the cleanup record wanted to avoid checking the value of + * latestCompletedXid since doing so would be a performance issue during + * normal running, so we check it essentially for free on the standby. + * + * If dbOid is valid we skip backends attached to other databases. Some + * callers choose to skipExistingConflicts. + * + * Be careful to *not* pfree the result from this function. We reuse + * this array sufficiently often that we use malloc for the result. + */ +VirtualTransactionId * +GetConflictingVirtualXIDs(TransactionId limitXmin, Oid dbOid, + bool skipExistingConflicts) +{ + static VirtualTransactionId *vxids; + ProcArrayStruct *arrayP = procArray; + int count = 0; + int index; + + /* + * If not first time through, get workspace to remember main XIDs in. We + * malloc it permanently to avoid repeated palloc/pfree overhead. + * Allow result space, remembering room for a terminator. + */ + if (vxids == NULL) + { + vxids = (VirtualTransactionId *) + malloc(sizeof(VirtualTransactionId) * (arrayP->maxProcs + 1)); + if (vxids == NULL) + ereport(ERROR, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("out of memory"))); + } + + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); + + /* + * If we don't know the TransactionId that created the conflict, set + * it to latestCompletedXid which is the latest possible value. 
+ */ + if (!TransactionIdIsValid(limitXmin)) + limitXmin = ShmemVariableCache->latestCompletedXid; + + for (index = 0; index < arrayP->numProcs; index++) + { + volatile PGPROC *proc = arrayP->procs[index]; + + /* Exclude prepared transactions */ + if (proc->pid == 0) + continue; + + if (skipExistingConflicts && proc->recoveryConflictMode > 0) + continue; + + if (!OidIsValid(dbOid) || + proc->databaseId == dbOid) + { + /* Fetch xmin just once - can't change on us, but good coding */ + TransactionId pxmin = proc->xmin; + + /* + * We ignore an invalid pxmin because this means that backend + * has no snapshot and cannot get another one while we hold exclusive lock. + */ + if (TransactionIdIsValid(pxmin) && !TransactionIdFollows(pxmin, limitXmin)) + { + VirtualTransactionId vxid; + + GET_VXID_FROM_PGPROC(vxid, *proc); + if (VirtualTransactionIdIsValid(vxid)) + vxids[count++] = vxid; + } + } + } + + LWLockRelease(ProcArrayLock); + + /* add the terminator */ + vxids[count].backendId = InvalidBackendId; + vxids[count].localTransactionId = InvalidLocalTransactionId; + + return vxids; +} + +/* + * CancelVirtualTransaction - used in recovery conflict processing + * + * Returns pid of the process signaled, or 0 if not found. 
+ */ +pid_t +CancelVirtualTransaction(VirtualTransactionId vxid, int cancel_mode) +{ + ProcArrayStruct *arrayP = procArray; + int index; + pid_t pid = 0; + + LWLockAcquire(ProcArrayLock, LW_SHARED); + + for (index = 0; index < arrayP->numProcs; index++) + { + VirtualTransactionId procvxid; + PGPROC *proc = arrayP->procs[index]; + + GET_VXID_FROM_PGPROC(procvxid, *proc); + + if (procvxid.backendId == vxid.backendId && + procvxid.localTransactionId == vxid.localTransactionId) + { + /* + * Issue orders for the proc to read next time it receives SIGINT + */ + if (proc->recoveryConflictMode < cancel_mode) + proc->recoveryConflictMode = cancel_mode; + + pid = proc->pid; + break; + } + } + + LWLockRelease(ProcArrayLock); + + if (pid != 0) + { + /* + * Kill the pid if it's still here. If not, that's what we wanted + * so ignore any errors. + */ + kill(pid, SIGINT); + } + + return pid; +} /* * CountActiveBackends --- count backends (other than myself) that are in @@ -1400,3 +2045,457 @@ DisplayXidCache(void) } #endif /* XIDCACHE_DEBUG */ + +/* ---------------------------------------------- + * KnownAssignedTransactions sub-module + * ---------------------------------------------- + */ + +/* + * In Hot Standby mode, we maintain a list of transactions that are (or were) + * running in the master at the current point in WAL. + * + * RecordKnownAssignedTransactionIds() should be run for *every* WAL record + * type apart from XLOG_XACT_RUNNING_XACTS, since that initialises the first + * snapshot so that RecordKnownAssignedTransactionIds() can be called. Uses + * local variables, so should only be called by Startup process. + * + * We record all xids that we know have been assigned. That includes + * all the xids on the WAL record, plus all unobserved xids that + * we can deduce have been assigned. We can deduce the existence of + * unobserved xids because we know xids are in sequence, with no gaps.
+ * + * During recovery we do not fret too much about the distinction between + * top-level xids and subtransaction xids. We hold both together in + * a hash table called KnownAssignedXids. In backends, this is copied into + * snapshots in GetSnapshotData(), taking advantage + * of the fact that XidInMVCCSnapshot() doesn't care about the distinction + * either. Subtransaction xids are effectively treated as top-level xids + * and in the typical case pg_subtrans is *not* maintained (and that + * does not affect visibility). + * + * KnownAssignedXids expands as new xids are observed or inferred, and + * contracts when transaction completion records arrive. We have room in a + * snapshot to hold maxProcs * (1 + PGPROC_MAX_CACHED_SUBXIDS) xids, so + * every transaction must report its subtransaction xids in a special + * WAL assignment record every PGPROC_MAX_CACHED_SUBXIDS. This allows us + * to remove the subtransaction xids and update pg_subtrans instead. Snapshots + * are still correct yet we don't overflow the SnapshotData structure. When we do + * this we need + * to keep track of which xids caused the snapshot to overflow. We do that + * by simply tracking the lastOverflowedXid - if it is within the bounds of + * the KnownAssignedXids then we know the snapshot overflowed. (Note that + * subxid overflow occurs on primary when 65th subxid arrives, whereas on + * standby it occurs when 64th subxid arrives - that is not an error). + * + * Should FATAL errors result in a backend on primary disappearing before + * it can write an abort record then we just leave those xids in + * KnownAssignedXids. They actually aborted but we think they were running; + * the distinction is irrelevant because either way any changes done by the + * transaction are not visible to backends in the standby. + * We prune KnownAssignedXids when XLOG_XACT_RUNNING_XACTS arrives, to + * ensure we do not overflow.
+ * + * If we are in STANDBY_SNAPSHOT_PENDING state, then we may try to remove + * xids that are not present. + */ +void +RecordKnownAssignedTransactionIds(TransactionId xid) +{ + /* + * Skip processing if the current snapshot is not initialized. + */ + if (standbyState < STANDBY_SNAPSHOT_PENDING) + return; + + /* + * We can see WAL records before the running-xacts snapshot that + * contain XIDs that are not in the running-xacts snapshot, but that we + * know to have finished before the running-xacts snapshot was taken. + * Don't waste precious shared memory by keeping them in the hash table. + * + * We can also see WAL records before the running-xacts snapshot that + * contain XIDs that are not in the running-xacts snapshot for a different + * reason: the transaction started *after* the running-xacts snapshot + * was taken, but before it was written to WAL. We must be careful to + * not ignore such XIDs. Because such a transaction started after the + * running-xacts snapshot was taken, it must have an XID larger than + * the oldest XID according to the running-xacts snapshot. + */ + if (TransactionIdPrecedes(xid, snapshotOldestActiveXid)) + return; + + ereport(trace_recovery(DEBUG4), + (errmsg("record known xact %u latestObservedXid %u", + xid, latestObservedXid))); + + /* + * When a newly observed xid arrives, it is frequently the case + * that it is *not* the next xid in sequence. When this occurs, we + * must treat the intervening xids as running also. + */ + if (TransactionIdFollows(xid, latestObservedXid)) + { + TransactionId next_expected_xid = latestObservedXid; + TransactionIdAdvance(next_expected_xid); + + /* + * Locking requirement is currently higher than for xid assignment + * in normal running. However, we only get called here for new + * high xids - so on a multi-processor where it is common that xids + * arrive out of order the average number of locks per assignment + * will actually reduce. So not too worried about this locking. 
+ * + * XXX It does seem possible that we could add a whole range + * of numbers atomically to KnownAssignedXids, if we use a sorted + * list for KnownAssignedXids. But that design also increases the + * length of time we hold lock when we process commits/aborts, so + * on balance don't worry about this. + */ + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); + + while (TransactionIdPrecedesOrEquals(next_expected_xid, xid)) + { + if (TransactionIdPrecedes(next_expected_xid, xid)) + ereport(trace_recovery(DEBUG4), + (errmsg("recording unobserved xid %u (latestObservedXid %u)", + next_expected_xid, latestObservedXid))); + KnownAssignedXidsAdd(&next_expected_xid, 1); + + /* + * Extend clog and subtrans like we do in GetNewTransactionId() + * during normal operation + */ + ExtendCLOG(next_expected_xid); + ExtendSUBTRANS(next_expected_xid); + + TransactionIdAdvance(next_expected_xid); + } + + LWLockRelease(ProcArrayLock); + + latestObservedXid = xid; + } + + /* nextXid must be beyond any observed xid */ + if (TransactionIdFollowsOrEquals(latestObservedXid, + ShmemVariableCache->nextXid)) + { + ShmemVariableCache->nextXid = latestObservedXid; + TransactionIdAdvance(ShmemVariableCache->nextXid); + } +} + +void +ExpireTreeKnownAssignedTransactionIds(TransactionId xid, int nsubxids, + TransactionId *subxids) +{ + int i; + TransactionId max_xid; + + if (standbyState == STANDBY_DISABLED) + return; + + max_xid = TransactionIdLatest(xid, nsubxids, subxids); + + /* + * Uses same locking as transaction commit + */ + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); + + if (TransactionIdIsValid(xid)) + KnownAssignedXidsRemove(xid); + for (i = 0; i < nsubxids; i++) + KnownAssignedXidsRemove(subxids[i]); + + /* Like in ProcArrayRemove, advance latestCompletedXid */ + if (TransactionIdFollowsOrEquals(max_xid, + ShmemVariableCache->latestCompletedXid)) + ShmemVariableCache->latestCompletedXid = max_xid; + + LWLockRelease(ProcArrayLock); +} + +void +ExpireAllKnownAssignedTransactionIds(void) +{ + 
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); + KnownAssignedXidsRemoveMany(InvalidTransactionId, false); + LWLockRelease(ProcArrayLock); +} + +void +ExpireOldKnownAssignedTransactionIds(TransactionId xid) +{ + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); + KnownAssignedXidsRemoveMany(xid, true); + LWLockRelease(ProcArrayLock); +} + +/* + * Private module functions to manipulate KnownAssignedXids + * + * There are 3 main users of the KnownAssignedXids data structure: + * + * * backends taking snapshots + * * startup process adding new knownassigned xids + * * startup process removing xids as transactions end + * + * If we make KnownAssignedXids a simple sorted array then the first two + * operations are fast, but the last one is at least O(N). If we make + * KnownAssignedXids a hash table then the last two operations are fast, + * though we have to do more work at snapshot time. Doing more work at + * commit could slow down taking snapshots anyway because of lwlock + * contention. Scanning the hash table is O(N) on the max size of the array, + * so performs poorly in comparison when we have very low numbers of + * write transactions to process. But at least it is constant overhead + * and a sequential memory scan will utilise hardware memory readahead + * to give much improved performance. In any case the emphasis must be on + * having the standby process changes quickly so that it can provide + * high availability. So we choose to implement as a hash table. 
+ */ + +static Size +KnownAssignedXidsShmemSize(int size) +{ + return hash_estimate_size(size, sizeof(TransactionId)); +} + +static void +KnownAssignedXidsInit(int size) +{ + HASHCTL info; + + /* assume no locking is needed yet */ + + info.keysize = sizeof(TransactionId); + info.entrysize = sizeof(TransactionId); + info.hash = tag_hash; + + KnownAssignedXidsHash = ShmemInitHash("KnownAssignedXids Hash", + size, size, + &info, + HASH_ELEM | HASH_FUNCTION); + + if (!KnownAssignedXidsHash) + elog(FATAL, "could not initialize known assigned xids hash table"); + + procArray->numKnownAssignedXids = 0; + procArray->maxKnownAssignedXids = TOTAL_MAX_CACHED_SUBXIDS; + procArray->lastOverflowedXid = InvalidTransactionId; +} + +/* + * Add xids into KnownAssignedXids. + * + * Must be called while holding ProcArrayLock in Exclusive mode + */ +static void +KnownAssignedXidsAdd(TransactionId *xids, int nxids) +{ + TransactionId *result; + bool found; + int i; + + for (i = 0; i < nxids; i++) + { + Assert(TransactionIdIsValid(xids[i])); + + elog(trace_recovery(DEBUG4), "adding KnownAssignedXid %u", xids[i]); + + procArray->numKnownAssignedXids++; + if (procArray->numKnownAssignedXids > procArray->maxKnownAssignedXids) + { + KnownAssignedXidsDisplay(LOG); + LWLockRelease(ProcArrayLock); + ereport(ERROR, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("too many KnownAssignedXids"))); + } + + result = (TransactionId *) hash_search(KnownAssignedXidsHash, &xids[i], HASH_ENTER, + &found); + + if (!result) + { + LWLockRelease(ProcArrayLock); + ereport(ERROR, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("out of shared memory"))); + } + + if (found) + { + KnownAssignedXidsDisplay(LOG); + LWLockRelease(ProcArrayLock); + elog(ERROR, "found duplicate KnownAssignedXid %u", xids[i]); + } + } +} + +/* + * Is an xid present in KnownAssignedXids? 
+ * + * Must be called while holding ProcArrayLock in shared mode + */ +static bool +KnownAssignedXidsExist(TransactionId xid) +{ + bool found; + (void) hash_search(KnownAssignedXidsHash, &xid, HASH_FIND, &found); + return found; +} + +/* + * Remove one xid from anywhere in KnownAssignedXids. + * + * Must be called while holding ProcArrayLock in Exclusive mode + */ +static void +KnownAssignedXidsRemove(TransactionId xid) +{ + bool found; + + Assert(TransactionIdIsValid(xid)); + + elog(trace_recovery(DEBUG4), "remove KnownAssignedXid %u", xid); + + (void) hash_search(KnownAssignedXidsHash, &xid, HASH_REMOVE, &found); + + if (found) + procArray->numKnownAssignedXids--; + Assert(procArray->numKnownAssignedXids >= 0); + + /* + * We can fail to find an xid if the xid came from a subtransaction + * that aborts, though the xid hadn't yet been reported and no WAL records + * have been written using the subxid. In that case the abort record will + * contain that subxid and we haven't seen it before. + * + * If we fail to find it for other reasons it might be a problem, but + * it isn't much use to log that it happened, since we can't divine much + * from just an isolated xid value. + */ +} + +/* + * KnownAssignedXidsGet - Get an array of xids by scanning KnownAssignedXids. + * We filter out anything higher than xmax. + * + * Must be called while holding ProcArrayLock (in shared mode) + */ +static int +KnownAssignedXidsGet(TransactionId *xarray, TransactionId xmax) +{ + TransactionId xtmp = InvalidTransactionId; + + return KnownAssignedXidsGetAndSetXmin(xarray, &xtmp, xmax); +} + +/* + * KnownAssignedXidsGetAndSetXmin - as KnownAssignedXidsGet, plus we reduce *xmin + * to the lowest xid value seen if not already lower. 
+ * + * Must be called while holding ProcArrayLock (in shared mode) + */ +static int +KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin, + TransactionId xmax) +{ + HASH_SEQ_STATUS status; + TransactionId *knownXid; + int count = 0; + + hash_seq_init(&status, KnownAssignedXidsHash); + while ((knownXid = (TransactionId *) hash_seq_search(&status)) != NULL) + { + /* + * Filter out anything higher than xmax + */ + if (TransactionIdPrecedes(xmax, *knownXid)) + continue; + + *xarray = *knownXid; + xarray++; + count++; + + /* update xmin if required */ + if (TransactionIdPrecedes(*knownXid, *xmin)) + *xmin = *knownXid; + } + + return count; +} + +/* + * Prune KnownAssignedXids up to, but *not* including xid. If xid is invalid + * then clear the whole table. + * + * Must be called while holding ProcArrayLock in Exclusive mode. + */ +static void +KnownAssignedXidsRemoveMany(TransactionId xid, bool keepPreparedXacts) +{ + TransactionId *knownXid; + HASH_SEQ_STATUS status; + + if (TransactionIdIsValid(xid)) + elog(trace_recovery(DEBUG4), "prune KnownAssignedXids to %u", xid); + else + elog(trace_recovery(DEBUG4), "removing all KnownAssignedXids"); + + hash_seq_init(&status, KnownAssignedXidsHash); + while ((knownXid = (TransactionId *) hash_seq_search(&status)) != NULL) + { + TransactionId removeXid = *knownXid; + bool found; + + if (!TransactionIdIsValid(xid) || TransactionIdPrecedes(removeXid, xid)) + { + if (keepPreparedXacts && StandbyTransactionIdIsPrepared(removeXid)) + continue; + else + { + (void) hash_search(KnownAssignedXidsHash, &removeXid, + HASH_REMOVE, &found); + if (found) + procArray->numKnownAssignedXids--; + Assert(procArray->numKnownAssignedXids >= 0); + } + } + } +} + +/* + * Display KnownAssignedXids to provide debug trail + * + * Must be called while holding ProcArrayLock (in shared mode) + */ +void +KnownAssignedXidsDisplay(int trace_level) +{ + HASH_SEQ_STATUS status; + TransactionId *knownXid; + StringInfoData buf; + TransactionId 
*xids; + int nxids; + int i; + + xids = palloc(sizeof(TransactionId) * TOTAL_MAX_CACHED_SUBXIDS); + nxids = 0; + + hash_seq_init(&status, KnownAssignedXidsHash); + while ((knownXid = (TransactionId *) hash_seq_search(&status)) != NULL) + xids[nxids++] = *knownXid; + + qsort(xids, nxids, sizeof(TransactionId), xidComparator); + + initStringInfo(&buf); + + for (i = 0; i < nxids; i++) + appendStringInfo(&buf, "%u ", xids[i]); + + elog(trace_level, "%d KnownAssignedXids %s", nxids, buf.data); + + pfree(buf.data); +} diff --git a/src/backend/storage/ipc/sinvaladt.c b/src/backend/storage/ipc/sinvaladt.c index dfa0ad7b5e..e33664fc48 100644 --- a/src/backend/storage/ipc/sinvaladt.c +++ b/src/backend/storage/ipc/sinvaladt.c @@ -8,7 +8,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/storage/ipc/sinvaladt.c,v 1.79 2009/07/31 20:26:23 tgl Exp $ + * $PostgreSQL: pgsql/src/backend/storage/ipc/sinvaladt.c,v 1.80 2009/12/19 01:32:35 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -144,6 +144,13 @@ typedef struct ProcState bool resetState; /* backend needs to reset its state */ bool signaled; /* backend has been sent catchup signal */ + /* + * Backend only sends invalidations, never receives them. This only makes sense + * for Startup process during recovery because it doesn't maintain a relcache, + * yet it fires inval messages to allow query backends to see schema changes. + */ + bool sendOnly; /* backend only sends, never receives */ + /* * Next LocalTransactionId to use for each idle backend slot. 
We keep * this here because it is indexed by BackendId and it is convenient to @@ -249,7 +256,7 @@ CreateSharedInvalidationState(void) * Initialize a new backend to operate on the sinval buffer */ void -SharedInvalBackendInit(void) +SharedInvalBackendInit(bool sendOnly) { int index; ProcState *stateP = NULL; @@ -308,6 +315,7 @@ SharedInvalBackendInit(void) stateP->nextMsgNum = segP->maxMsgNum; stateP->resetState = false; stateP->signaled = false; + stateP->sendOnly = sendOnly; LWLockRelease(SInvalWriteLock); @@ -579,7 +587,9 @@ SICleanupQueue(bool callerHasWriteLock, int minFree) /* * Recompute minMsgNum = minimum of all backends' nextMsgNum, identify the * furthest-back backend that needs signaling (if any), and reset any - * backends that are too far back. + * backends that are too far back. Note that because we ignore sendOnly + * backends here it is possible for them to keep sending messages without + * a problem even when they are the only active backend. */ min = segP->maxMsgNum; minsig = min - SIG_THRESHOLD; @@ -591,7 +601,7 @@ SICleanupQueue(bool callerHasWriteLock, int minFree) int n = stateP->nextMsgNum; /* Ignore if inactive or already in reset state */ - if (stateP->procPid == 0 || stateP->resetState) + if (stateP->procPid == 0 || stateP->resetState || stateP->sendOnly) continue; /* diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c new file mode 100644 index 0000000000..38bc005820 --- /dev/null +++ b/src/backend/storage/ipc/standby.c @@ -0,0 +1,717 @@ +/*------------------------------------------------------------------------- + * + * standby.c + * Misc functions used in Hot Standby mode. + * + * InitRecoveryTransactionEnvironment() + * ShutdownRecoveryTransactionEnvironment() + * + * ResolveRecoveryConflictWithVirtualXIDs() + * + * All functions for handling RM_STANDBY_ID, which relate to + * AccessExclusiveLocks and starting snapshots for Hot Standby mode. 
+ * + * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * IDENTIFICATION + * $PostgreSQL: pgsql/src/backend/storage/ipc/standby.c,v 1.1 2009/12/19 01:32:35 sriggs Exp $ + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" +#include "access/transam.h" +#include "access/twophase.h" +#include "access/xact.h" +#include "access/xlog.h" +#include "miscadmin.h" +#include "pgstat.h" +#include "storage/lmgr.h" +#include "storage/proc.h" +#include "storage/procarray.h" +#include "storage/sinvaladt.h" +#include "storage/standby.h" +#include "utils/ps_status.h" + +int vacuum_defer_cleanup_age; + +static List *RecoveryLockList; + +static void LogCurrentRunningXacts(RunningTransactions CurrRunningXacts); +static void LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks); + +/* + * InitRecoveryTransactionEnvironment + * Initialize tracking of in-progress transactions in master + * + * We need to issue shared invalidations and hold locks. Holding locks + * means others may want to wait on us, so we need to make lock table + * inserts appear like a transaction. We could create and delete + * lock table entries for each transaction but it's simpler just to create + * one permanent entry and leave it there all the time. Locks are then + * acquired and released as needed. Yes, this means you can see the + * Startup process in pg_locks once we have run this. + */ +void +InitRecoveryTransactionEnvironment(void) +{ + VirtualTransactionId vxid; + + /* + * Initialise shared invalidation management for Startup process, + * being careful to register ourselves as a sendOnly process so + * we don't need to read messages, nor will we get signalled + * when the queue starts filling up. + */ + SharedInvalBackendInit(true); + + /* + * Record the PID and PGPROC structure of the startup process.
+ */ + PublishStartupProcessInformation(); + + /* + * Lock a virtual transaction id for Startup process. + * + * We need to do GetNextLocalTransactionId() because + * SharedInvalBackendInit() leaves localTransactionid invalid and + * the lock manager doesn't like that at all. + * + * Note that we don't need to run XactLockTableInsert() because nobody + * needs to wait on xids. That sounds a little strange, but table locks + * are held by vxids and row level locks are held by xids. All queries + * hold AccessShareLocks so never block while we write or lock new rows. + */ + vxid.backendId = MyBackendId; + vxid.localTransactionId = GetNextLocalTransactionId(); + VirtualXactLockTableInsert(vxid); + + standbyState = STANDBY_INITIALIZED; +} + +/* + * ShutdownRecoveryTransactionEnvironment + * Shut down transaction tracking + * + * Prepare to switch from hot standby mode to normal operation. Shut down + * recovery-time transaction tracking. + */ +void +ShutdownRecoveryTransactionEnvironment(void) +{ + /* Mark all tracked in-progress transactions as finished. */ + ExpireAllKnownAssignedTransactionIds(); + + /* Release all locks the tracked transactions were holding */ + StandbyReleaseAllLocks(); +} + + +/* + * ----------------------------------------------------- + * Standby wait timers and backend cancel logic + * ----------------------------------------------------- + */ + +#define STANDBY_INITIAL_WAIT_US 1000 +static int standbyWait_us = STANDBY_INITIAL_WAIT_US; + +/* + * Standby wait logic for ResolveRecoveryConflictWithVirtualXIDs. + * We wait here for a while then return. If we decide we can't wait any + * more then we return true, if we can wait some more return false. + */ +static bool +WaitExceedsMaxStandbyDelay(void) +{ + long delay_secs; + int delay_usecs; + + /* max_standby_delay = -1 means wait forever, if necessary */ + if (MaxStandbyDelay < 0) + return false; + + /* Are we past max_standby_delay? 
*/ + TimestampDifference(GetLatestXLogTime(), GetCurrentTimestamp(), + &delay_secs, &delay_usecs); + if (delay_secs > MaxStandbyDelay) + return true; + + /* + * Sleep, then do bookkeeping. + */ + pg_usleep(standbyWait_us); + + /* + * Progressively increase the sleep times. + */ + standbyWait_us *= 2; + if (standbyWait_us > 1000000) + standbyWait_us = 1000000; + if (standbyWait_us > MaxStandbyDelay * 1000000 / 4) + standbyWait_us = MaxStandbyDelay * 1000000 / 4; + + return false; +} + +/* + * This is the main executioner for any query backend that conflicts with + * recovery processing. Judgement has already been passed on it within + * a specific rmgr. Here we just issue the orders to the procs. The procs + * then throw the required error as instructed. + * + * We may ask for a specific cancel_mode, typically ERROR or FATAL. + */ +void +ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist, + char *reason, int cancel_mode) +{ + char waitactivitymsg[100]; + + Assert(cancel_mode > 0); + + while (VirtualTransactionIdIsValid(*waitlist)) + { + long wait_s; + int wait_us; /* wait in microseconds (us) */ + TimestampTz waitStart; + bool logged; + + waitStart = GetCurrentTimestamp(); + standbyWait_us = STANDBY_INITIAL_WAIT_US; + logged = false; + + /* wait until the virtual xid is gone */ + while(!ConditionalVirtualXactLockTableWait(*waitlist)) + { + /* + * Report if we have been waiting for a while now... 
+ */ + TimestampTz now = GetCurrentTimestamp(); + TimestampDifference(waitStart, now, &wait_s, &wait_us); + if (!logged && (wait_s > 0 || wait_us > 500000)) + { + const char *oldactivitymsg; + int len; + + oldactivitymsg = get_ps_display(&len); + snprintf(waitactivitymsg, sizeof(waitactivitymsg), + "waiting for max_standby_delay (%u ms)", + MaxStandbyDelay); + set_ps_display(waitactivitymsg, false); + if (len > 100) + len = 100; + memcpy(waitactivitymsg, oldactivitymsg, len); + + ereport(trace_recovery(DEBUG5), + (errmsg("virtual transaction %u/%u is blocking %s", + waitlist->backendId, + waitlist->localTransactionId, + reason))); + + pgstat_report_waiting(true); + + logged = true; + } + + /* Is it time to kill it? */ + if (WaitExceedsMaxStandbyDelay()) + { + pid_t pid; + + /* + * Now find out who to throw out of the balloon. + */ + Assert(VirtualTransactionIdIsValid(*waitlist)); + pid = CancelVirtualTransaction(*waitlist, cancel_mode); + + if (pid != 0) + { + /* + * Startup process debug messages + */ + switch (cancel_mode) + { + case CONFLICT_MODE_FATAL: + elog(trace_recovery(DEBUG1), + "recovery disconnects session with pid %d because of conflict with %s", + pid, + reason); + break; + case CONFLICT_MODE_ERROR: + elog(trace_recovery(DEBUG1), + "recovery cancels virtual transaction %u/%u pid %d because of conflict with %s", + waitlist->backendId, + waitlist->localTransactionId, + pid, + reason); + break; + default: + /* No conflict pending, so fall through */ + break; + } + + /* + * Wait awhile for it to die so that we avoid flooding an + * unresponsive backend when system is heavily loaded. 
+					 */
+					pg_usleep(5000);
+				}
+			}
+		}
+
+		/* Reset ps display */
+		if (logged)
+		{
+			set_ps_display(waitactivitymsg, false);
+			pgstat_report_waiting(false);
+		}
+
+		/* The virtual transaction is gone now, wait for the next one */
+		waitlist++;
+	}
+}
+
+/*
+ * -----------------------------------------------------
+ *		Locking in Recovery Mode
+ * -----------------------------------------------------
+ *
+ * All locks are held by the Startup process using a single virtual
+ * transaction. This implementation is both simpler and, in some senses,
+ * more correct. The locks held mean "some original transaction held
+ * this lock, so query access is not allowed at this time". So the Startup
+ * process is the proxy by which the original locks are implemented.
+ *
+ * We only keep track of AccessExclusiveLocks, which are only ever held by
+ * one transaction on one relation, and don't worry about lock queuing.
+ *
+ * We keep a single dynamically expandable list of locks in local memory,
+ * RecoveryLockList, so we can keep track of the various entries made by
+ * the Startup process's virtual xid in the shared lock table.
+ *
+ * List elements use type xl_standby_lock, since the WAL record type exactly
+ * matches the information that we need to keep track of.
+ *
+ * We use session locks rather than normal locks so we don't need
+ * ResourceOwners.
+ */
+
+
+void
+StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid)
+{
+	xl_standby_lock *newlock;
+	LOCKTAG		locktag;
+	bool		report_memory_error = false;
+	int			num_attempts = 0;
+
+	/* Already processed? */
+	if (TransactionIdDidCommit(xid) || TransactionIdDidAbort(xid))
+		return;
+
+	elog(trace_recovery(DEBUG4),
+		 "adding recovery lock: db %d rel %d", dbOid, relOid);
+
+	/* dbOid is InvalidOid when we are locking a shared relation.
*/
+	Assert(OidIsValid(relOid));
+
+	newlock = palloc(sizeof(xl_standby_lock));
+	newlock->xid = xid;
+	newlock->dbOid = dbOid;
+	newlock->relOid = relOid;
+	RecoveryLockList = lappend(RecoveryLockList, newlock);
+
+	/*
+	 * Attempt to acquire the lock as requested.
+	 */
+	SET_LOCKTAG_RELATION(locktag, newlock->dbOid, newlock->relOid);
+
+	/*
+	 * Wait for lock to clear or kill anyone in our way.
+	 */
+	while (LockAcquireExtended(&locktag, AccessExclusiveLock,
+							   true, true, report_memory_error)
+							== LOCKACQUIRE_NOT_AVAIL)
+	{
+		VirtualTransactionId *backends;
+
+		/*
+		 * If blowing away everybody with conflicting locks doesn't work
+		 * after the first two attempts, then we just start blowing everybody
+		 * away until it does work. We do this because it's likely that we
+		 * either have too many locks and we just can't get one at all,
+		 * or that there are many people crowding for the same table.
+		 * Recovery must win; the end justifies the means.
+		 */
+		if (++num_attempts < 3)
+			backends = GetLockConflicts(&locktag, AccessExclusiveLock);
+		else
+		{
+			backends = GetConflictingVirtualXIDs(InvalidTransactionId,
+												 InvalidOid,
+												 true);
+			report_memory_error = true;
+		}
+
+		ResolveRecoveryConflictWithVirtualXIDs(backends,
+											   "exclusive lock",
+											   CONFLICT_MODE_ERROR);
+	}
+}
+
+static void
+StandbyReleaseLocks(TransactionId xid)
+{
+	ListCell   *cell,
+			   *prev,
+			   *next;
+
+	/*
+	 * Release all matching locks and remove them from list
+	 */
+	prev = NULL;
+	for (cell = list_head(RecoveryLockList); cell; cell = next)
+	{
+		xl_standby_lock *lock = (xl_standby_lock *) lfirst(cell);
+		next = lnext(cell);
+
+		if (!TransactionIdIsValid(xid) || lock->xid == xid)
+		{
+			LOCKTAG		locktag;
+
+			elog(trace_recovery(DEBUG4),
+				 "releasing recovery lock: xid %u db %d rel %d",
+				 lock->xid, lock->dbOid, lock->relOid);
+			SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid);
+			if (!LockRelease(&locktag, AccessExclusiveLock, true))
+				elog(trace_recovery(LOG),
+					 "RecoveryLockList contains entry for lock
" + "no longer recorded by lock manager " + "xid %u database %d relation %d", + lock->xid, lock->dbOid, lock->relOid); + + RecoveryLockList = list_delete_cell(RecoveryLockList, cell, prev); + pfree(lock); + } + else + prev = cell; + } +} + +/* + * Release locks for a transaction tree, starting at xid down, from + * RecoveryLockList. + * + * Called during WAL replay of COMMIT/ROLLBACK when in hot standby mode, + * to remove any AccessExclusiveLocks requested by a transaction. + */ +void +StandbyReleaseLockTree(TransactionId xid, int nsubxids, TransactionId *subxids) +{ + int i; + + StandbyReleaseLocks(xid); + + for (i = 0; i < nsubxids; i++) + StandbyReleaseLocks(subxids[i]); +} + +/* + * StandbyReleaseOldLocks + * Release standby locks held by XIDs < removeXid + * In some cases, keep prepared transactions. + */ +static void +StandbyReleaseLocksMany(TransactionId removeXid, bool keepPreparedXacts) +{ + ListCell *cell, + *prev, + *next; + LOCKTAG locktag; + + /* + * Release all matching locks. + */ + prev = NULL; + for (cell = list_head(RecoveryLockList); cell; cell = next) + { + xl_standby_lock *lock = (xl_standby_lock *) lfirst(cell); + next = lnext(cell); + + if (!TransactionIdIsValid(removeXid) || TransactionIdPrecedes(lock->xid, removeXid)) + { + if (keepPreparedXacts && StandbyTransactionIdIsPrepared(lock->xid)) + continue; + elog(trace_recovery(DEBUG4), + "releasing recovery lock: xid %u db %d rel %d", + lock->xid, lock->dbOid, lock->relOid); + SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid); + if (!LockRelease(&locktag, AccessExclusiveLock, true)) + elog(trace_recovery(LOG), + "RecoveryLockList contains entry for lock " + "no longer recorded by lock manager " + "xid %u database %d relation %d", + lock->xid, lock->dbOid, lock->relOid); + RecoveryLockList = list_delete_cell(RecoveryLockList, cell, prev); + pfree(lock); + } + else + prev = cell; + } +} + +/* + * Called at end of recovery and when we see a shutdown checkpoint. 
+ */
+void
+StandbyReleaseAllLocks(void)
+{
+	elog(trace_recovery(DEBUG2), "release all standby locks");
+	StandbyReleaseLocksMany(InvalidTransactionId, false);
+}
+
+/*
+ * StandbyReleaseOldLocks
+ *		Release standby locks held by XIDs < removeXid, as long
+ *		as they're not prepared transactions.
+ */
+void
+StandbyReleaseOldLocks(TransactionId removeXid)
+{
+	StandbyReleaseLocksMany(removeXid, true);
+}
+
+/*
+ * --------------------------------------------------------------------
+ *		Recovery handling for Rmgr RM_STANDBY_ID
+ *
+ * These record types will only be created if XLogStandbyInfoActive()
+ * --------------------------------------------------------------------
+ */
+
+void
+standby_redo(XLogRecPtr lsn, XLogRecord *record)
+{
+	uint8		info = record->xl_info & ~XLR_INFO_MASK;
+
+	/* Do nothing if we're not in standby mode */
+	if (standbyState == STANDBY_DISABLED)
+		return;
+
+	if (info == XLOG_STANDBY_LOCK)
+	{
+		xl_standby_locks *xlrec = (xl_standby_locks *) XLogRecGetData(record);
+		int			i;
+
+		for (i = 0; i < xlrec->nlocks; i++)
+			StandbyAcquireAccessExclusiveLock(xlrec->locks[i].xid,
+											  xlrec->locks[i].dbOid,
+											  xlrec->locks[i].relOid);
+	}
+	else if (info == XLOG_RUNNING_XACTS)
+	{
+		xl_running_xacts *xlrec = (xl_running_xacts *) XLogRecGetData(record);
+		RunningTransactionsData running;
+
+		running.xcnt = xlrec->xcnt;
+		running.subxid_overflow = xlrec->subxid_overflow;
+		running.nextXid = xlrec->nextXid;
+		running.oldestRunningXid = xlrec->oldestRunningXid;
+		running.xids = xlrec->xids;
+
+		ProcArrayApplyRecoveryInfo(&running);
+	}
+	else
+		elog(PANIC, "standby_redo: unknown op code %u", info);
+}
+
+static void
+standby_desc_running_xacts(StringInfo buf, xl_running_xacts *xlrec)
+{
+	int			i;
+
+	appendStringInfo(buf,
+					 " nextXid %u oldestRunningXid %u",
+					 xlrec->nextXid,
+					 xlrec->oldestRunningXid);
+	if (xlrec->xcnt > 0)
+	{
+		appendStringInfo(buf, "; %d xacts:", xlrec->xcnt);
+		for (i = 0; i < xlrec->xcnt; i++)
+			appendStringInfo(buf, " %u",
xlrec->xids[i]);
+	}
+
+	if (xlrec->subxid_overflow)
+		appendStringInfo(buf, "; subxid ovf");
+}
+
+void
+standby_desc(StringInfo buf, uint8 xl_info, char *rec)
+{
+	uint8		info = xl_info & ~XLR_INFO_MASK;
+
+	if (info == XLOG_STANDBY_LOCK)
+	{
+		xl_standby_locks *xlrec = (xl_standby_locks *) rec;
+		int			i;
+
+		appendStringInfo(buf, "AccessExclusive locks:");
+
+		for (i = 0; i < xlrec->nlocks; i++)
+			appendStringInfo(buf, " xid %u db %d rel %d",
+							 xlrec->locks[i].xid, xlrec->locks[i].dbOid,
+							 xlrec->locks[i].relOid);
+	}
+	else if (info == XLOG_RUNNING_XACTS)
+	{
+		xl_running_xacts *xlrec = (xl_running_xacts *) rec;
+
+		appendStringInfo(buf, " running xacts:");
+		standby_desc_running_xacts(buf, xlrec);
+	}
+	else
+		appendStringInfo(buf, "UNKNOWN");
+}
+
+/*
+ * Log details of the current snapshot to WAL. This allows the snapshot state
+ * to be reconstructed on the standby.
+ */
+void
+LogStandbySnapshot(TransactionId *oldestActiveXid, TransactionId *nextXid)
+{
+	RunningTransactions running;
+	xl_standby_lock *locks;
+	int			nlocks;
+
+	Assert(XLogStandbyInfoActive());
+
+	/*
+	 * Get details of any AccessExclusiveLocks being held at the moment.
+	 */
+	locks = GetRunningTransactionLocks(&nlocks);
+	if (nlocks > 0)
+		LogAccessExclusiveLocks(nlocks, locks);
+
+	/*
+	 * Log details of all in-progress transactions. This should be the last
+	 * record we write, because the standby will open up when it sees this.
+	 */
+	running = GetRunningTransactionData();
+	LogCurrentRunningXacts(running);
+
+	*oldestActiveXid = running->oldestRunningXid;
+	*nextXid = running->nextXid;
+}
+
+/*
+ * Record an enhanced snapshot of running transactions into WAL.
+ *
+ * The definitions of RunningTransactionsData and xl_running_xacts
+ * are similar. We keep them separate because xl_running_xacts
+ * is a contiguous chunk of memory and never exists fully until it is
+ * assembled in WAL.
+ */ +static void +LogCurrentRunningXacts(RunningTransactions CurrRunningXacts) +{ + xl_running_xacts xlrec; + XLogRecData rdata[2]; + int lastrdata = 0; + XLogRecPtr recptr; + + xlrec.xcnt = CurrRunningXacts->xcnt; + xlrec.subxid_overflow = CurrRunningXacts->subxid_overflow; + xlrec.nextXid = CurrRunningXacts->nextXid; + xlrec.oldestRunningXid = CurrRunningXacts->oldestRunningXid; + + /* Header */ + rdata[0].data = (char *) (&xlrec); + rdata[0].len = MinSizeOfXactRunningXacts; + rdata[0].buffer = InvalidBuffer; + + /* array of TransactionIds */ + if (xlrec.xcnt > 0) + { + rdata[0].next = &(rdata[1]); + rdata[1].data = (char *) CurrRunningXacts->xids; + rdata[1].len = xlrec.xcnt * sizeof(TransactionId); + rdata[1].buffer = InvalidBuffer; + lastrdata = 1; + } + + rdata[lastrdata].next = NULL; + + recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS, rdata); + + if (CurrRunningXacts->subxid_overflow) + ereport(trace_recovery(DEBUG2), + (errmsg("snapshot of %u running transactions overflowed (lsn %X/%X oldest xid %u next xid %u)", + CurrRunningXacts->xcnt, + recptr.xlogid, recptr.xrecoff, + CurrRunningXacts->oldestRunningXid, + CurrRunningXacts->nextXid))); + else + ereport(trace_recovery(DEBUG2), + (errmsg("snapshot of %u running transaction ids (lsn %X/%X oldest xid %u next xid %u)", + CurrRunningXacts->xcnt, + recptr.xlogid, recptr.xrecoff, + CurrRunningXacts->oldestRunningXid, + CurrRunningXacts->nextXid))); + +} + +/* + * Wholesale logging of AccessExclusiveLocks. Other lock types need not be + * logged, as described in backend/storage/lmgr/README. 
+ */ +static void +LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks) +{ + XLogRecData rdata[2]; + xl_standby_locks xlrec; + + xlrec.nlocks = nlocks; + + rdata[0].data = (char *) &xlrec; + rdata[0].len = offsetof(xl_standby_locks, locks); + rdata[0].buffer = InvalidBuffer; + rdata[0].next = &rdata[1]; + + rdata[1].data = (char *) locks; + rdata[1].len = nlocks * sizeof(xl_standby_lock); + rdata[1].buffer = InvalidBuffer; + rdata[1].next = NULL; + + (void) XLogInsert(RM_STANDBY_ID, XLOG_STANDBY_LOCK, rdata); +} + +/* + * Individual logging of AccessExclusiveLocks for use during LockAcquire() + */ +void +LogAccessExclusiveLock(Oid dbOid, Oid relOid) +{ + xl_standby_lock xlrec; + + /* + * Ensure that a TransactionId has been assigned to this transaction. + * We don't actually need the xid yet but if we don't do this then + * RecordTransactionCommit() and RecordTransactionAbort() will optimise + * away the transaction completion record which recovery relies upon to + * release locks. It's a hack, but for a corner case not worth adding + * code for into the main commit path. + */ + xlrec.xid = GetTopTransactionId(); + + /* + * Decode the locktag back to the original values, to avoid + * sending lots of empty bytes with every message. See + * lock.h to check how a locktag is defined for LOCKTAG_RELATION + */ + xlrec.dbOid = dbOid; + xlrec.relOid = relOid; + + LogAccessExclusiveLocks(1, &xlrec); +} diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README index db9db94f01..bfc6853941 100644 --- a/src/backend/storage/lmgr/README +++ b/src/backend/storage/lmgr/README @@ -1,4 +1,4 @@ -$PostgreSQL: pgsql/src/backend/storage/lmgr/README,v 1.24 2008/03/21 13:23:28 momjian Exp $ +$PostgreSQL: pgsql/src/backend/storage/lmgr/README,v 1.25 2009/12/19 01:32:35 sriggs Exp $ Locking Overview ================ @@ -517,3 +517,27 @@ interfere with each other. User locks are always held as session locks, so that they are not released at transaction end. 
They must be released explicitly by the
 application --- but they are released automatically when a backend
 terminates.
+
+Locking during Hot Standby
+--------------------------
+
+The Startup process is the only backend that can make changes during
+recovery; all other backends are read-only.  As a result, the Startup
+process does not acquire locks on relations or objects except when the lock
+level is AccessExclusiveLock.
+
+Regular backends are only allowed to take locks on relations or objects
+at RowExclusiveLock or lower. This ensures that they do not conflict with
+each other or with the Startup process, unless AccessExclusiveLocks are
+requested by one of the backends.
+
+Deadlocks involving AccessExclusiveLocks are not possible, so we need
+not be concerned that a user-initiated deadlock can prevent recovery from
+progressing.
+
+AccessExclusiveLocks on the primary or master node generate WAL records
+that are then applied by the Startup process. Locks are released at end
+of transaction just as they are in normal processing. These locks are
+held by the Startup process, acting as a proxy for the backends that
+originally acquired these locks. Again, these locks cannot conflict with
+one another, so the Startup process cannot deadlock itself either.
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 6a29210496..03459a71ec 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/storage/lmgr/lock.c,v 1.188 2009/06/11 14:49:02 momjian Exp $
+ *	  $PostgreSQL: pgsql/src/backend/storage/lmgr/lock.c,v 1.189 2009/12/19 01:32:35 sriggs Exp $
  *
  * NOTES
  *	  A lock table is a shared memory hash table.
When @@ -38,6 +38,7 @@ #include "miscadmin.h" #include "pg_trace.h" #include "pgstat.h" +#include "storage/standby.h" #include "utils/memutils.h" #include "utils/ps_status.h" #include "utils/resowner.h" @@ -468,6 +469,25 @@ LockAcquire(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock, bool dontWait) +{ + return LockAcquireExtended(locktag, lockmode, sessionLock, dontWait, true); +} + +/* + * LockAcquireExtended - allows us to specify additional options + * + * reportMemoryError specifies whether a lock request that fills the + * lock table should generate an ERROR or not. This allows a priority + * caller to note that the lock table is full and then begin taking + * extreme action to reduce the number of other lock holders before + * retrying the action. + */ +LockAcquireResult +LockAcquireExtended(const LOCKTAG *locktag, + LOCKMODE lockmode, + bool sessionLock, + bool dontWait, + bool reportMemoryError) { LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid; LockMethod lockMethodTable; @@ -490,6 +510,16 @@ LockAcquire(const LOCKTAG *locktag, if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes) elog(ERROR, "unrecognized lock mode: %d", lockmode); + if (RecoveryInProgress() && !InRecovery && + (locktag->locktag_type == LOCKTAG_OBJECT || + locktag->locktag_type == LOCKTAG_RELATION ) && + lockmode > RowExclusiveLock) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot acquire lockmode %s on database objects while recovery is in progress", + lockMethodTable->lockModeNames[lockmode]), + errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery."))); + #ifdef LOCK_DEBUG if (LOCK_DEBUG_ENABLED(locktag)) elog(LOG, "LockAcquire: lock [%u,%u] %s", @@ -578,10 +608,13 @@ LockAcquire(const LOCKTAG *locktag, if (!lock) { LWLockRelease(partitionLock); - ereport(ERROR, - (errcode(ERRCODE_OUT_OF_MEMORY), - errmsg("out of shared memory"), - errhint("You might need to increase 
max_locks_per_transaction.")));
+		if (reportMemoryError)
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("out of shared memory"),
+					 errhint("You might need to increase max_locks_per_transaction.")));
+		else
+			return LOCKACQUIRE_NOT_AVAIL;
 	}
 
 	locallock->lock = lock;
@@ -644,10 +677,13 @@ LockAcquire(const LOCKTAG *locktag,
 			elog(PANIC, "lock table corrupted");
 		}
 		LWLockRelease(partitionLock);
-		ereport(ERROR,
-				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("out of shared memory"),
-				 errhint("You might need to increase max_locks_per_transaction.")));
+		if (reportMemoryError)
+			ereport(ERROR,
+					(errcode(ERRCODE_OUT_OF_MEMORY),
+					 errmsg("out of shared memory"),
+					 errhint("You might need to increase max_locks_per_transaction.")));
+		else
+			return LOCKACQUIRE_NOT_AVAIL;
 	}
 
 	locallock->proclock = proclock;
@@ -778,6 +814,25 @@ LockAcquire(const LOCKTAG *locktag,
 		return LOCKACQUIRE_NOT_AVAIL;
 	}
 
+	/*
+	 * In Hot Standby we abort the lock wait if Startup process is waiting
+	 * since this would result in a deadlock. The deadlock occurs because
+	 * if we are waiting it must be behind an AccessExclusiveLock, which
+	 * can only clear when a transaction completion record is replayed.
+	 * If Startup process is waiting we will never clear that lock, so
+	 * waiting for it just causes a deadlock.
+	 */
+	if (RecoveryInProgress() && !InRecovery &&
+		locktag->locktag_type == LOCKTAG_RELATION)
+	{
+		LWLockRelease(partitionLock);
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_DEADLOCK_DETECTED),
+				 errmsg("possible deadlock detected"),
+				 errdetail("process conflicts with recovery - please resubmit query later"),
+				 errdetail_log("process conflicts with recovery")));
+	}
+
 	/*
 	 * Set bitmask of locks this process already holds on this object.
 	 */
@@ -827,6 +882,27 @@ LockAcquire(const LOCKTAG *locktag,
 
 	LWLockRelease(partitionLock);
 
+	/*
+	 * Emit a WAL record if acquisition of this lock needs to be replayed in
Only AccessExclusiveLocks can conflict with lock
+	 * types that read-only transactions can acquire in a standby server.
+	 *
+	 * Make sure this definition matches the one in GetRunningTransactionLocks().
+	 */
+	if (lockmode >= AccessExclusiveLock &&
+		locktag->locktag_type == LOCKTAG_RELATION &&
+		!RecoveryInProgress() &&
+		XLogStandbyInfoActive())
+	{
+		/*
+		 * Decode the locktag back to the original values, to avoid
+		 * sending lots of empty bytes with every message. See
+		 * lock.h to check how a locktag is defined for LOCKTAG_RELATION
+		 */
+		LogAccessExclusiveLock(locktag->locktag_field1,
+							   locktag->locktag_field2);
+	}
+
 	return LOCKACQUIRE_OK;
 }
 
@@ -2193,6 +2269,79 @@ GetLockStatusData(void)
 	return data;
 }
 
+/*
+ * Returns a list of currently held AccessExclusiveLocks, for use
+ * by GetRunningTransactionData().
+ */
+xl_standby_lock *
+GetRunningTransactionLocks(int *nlocks)
+{
+	PROCLOCK   *proclock;
+	HASH_SEQ_STATUS seqstat;
+	int			i;
+	int			index;
+	int			els;
+	xl_standby_lock *accessExclusiveLocks;
+
+	/*
+	 * Acquire lock on the entire shared lock data structure.
+	 *
+	 * Must grab LWLocks in partition-number order to avoid LWLock deadlock.
+	 */
+	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
+		LWLockAcquire(FirstLockMgrLock + i, LW_SHARED);
+
+	/* Now scan the tables to copy the data */
+	hash_seq_init(&seqstat, LockMethodProcLockHash);
+
+	/* Now we can safely count the number of proclocks */
+	els = hash_get_num_entries(LockMethodProcLockHash);
+
+	/*
+	 * Allocating enough space for all locks in the lock table is overkill,
+	 * but it's more convenient and faster than having to enlarge the array.
+	 */
+	accessExclusiveLocks = palloc(els * sizeof(xl_standby_lock));
+
+	/*
+	 * If lock is a currently granted AccessExclusiveLock then
+	 * it will have just one proclock holder, so locks are never
+	 * accessed twice in this particular case.
Don't copy this code + * for use elsewhere because in the general case this will + * give you duplicate locks when looking at non-exclusive lock types. + */ + index = 0; + while ((proclock = (PROCLOCK *) hash_seq_search(&seqstat))) + { + /* make sure this definition matches the one used in LockAcquire */ + if ((proclock->holdMask & LOCKBIT_ON(AccessExclusiveLock)) && + proclock->tag.myLock->tag.locktag_type == LOCKTAG_RELATION) + { + PGPROC *proc = proclock->tag.myProc; + LOCK *lock = proclock->tag.myLock; + + accessExclusiveLocks[index].xid = proc->xid; + accessExclusiveLocks[index].dbOid = lock->tag.locktag_field1; + accessExclusiveLocks[index].relOid = lock->tag.locktag_field2; + + index++; + } + } + + /* + * And release locks. We do this in reverse order for two reasons: (1) + * Anyone else who needs more than one of the locks will be trying to lock + * them in increasing order; we don't want to release the other process + * until it can get all the locks it needs. (2) This avoids O(N^2) + * behavior inside LWLockRelease. + */ + for (i = NUM_LOCK_PARTITIONS; --i >= 0;) + LWLockRelease(FirstLockMgrLock + i); + + *nlocks = index; + return accessExclusiveLocks; +} + /* Provide the textual name of any lock mode */ const char * GetLockmodeName(LOCKMETHODID lockmethodid, LOCKMODE mode) @@ -2288,6 +2437,24 @@ DumpAllLocks(void) * Because this function is run at db startup, re-acquiring the locks should * never conflict with running transactions because there are none. We * assume that the lock state represented by the stored 2PC files is legal. + * + * When switching from Hot Standby mode to normal operation, the locks will + * be already held by the startup process. 
The locks are acquired for the new
+ * procs without checking for conflicts, so we don't get a conflict between the
+ * startup process and the dummy procs, even though we will momentarily have
+ * a situation where two procs are holding the same AccessExclusiveLock,
+ * which isn't normally possible because of the conflict. If we're in standby
+ * mode, but a recovery snapshot hasn't been established yet, it's possible
+ * that some but not all of the locks are already held by the startup process.
+ *
+ * This approach is simple, but also a bit dangerous, because if there isn't
+ * enough shared memory to acquire the locks, an error will be thrown, which
+ * is promoted to FATAL and recovery will abort, bringing down the postmaster.
+ * A safer approach would be to transfer the locks like we do in
+ * AtPrepare_Locks, but then again, in hot standby mode it's possible for
+ * read-only backends to use up all the shared lock memory anyway, so that
+ * replaying the WAL record that needs to acquire a lock will throw an error
+ * and PANIC anyway.
  */
 void
 lock_twophase_recover(TransactionId xid, uint16 info,
@@ -2443,12 +2610,45 @@ lock_twophase_recover(TransactionId xid, uint16 info,
 
 	/*
 	 * We ignore any possible conflicts and just grant ourselves the lock.
+	 * Not only because we don't bother, but also to avoid deadlocks when
+	 * switching from standby to normal mode. See function comment.
 	 */
 	GrantLock(lock, proclock, lockmode);
 
 	LWLockRelease(partitionLock);
 }
 
+/*
+ * Re-acquire a lock belonging to a transaction that was prepared,
+ * when starting up into hot standby mode.
+ */ +void +lock_twophase_standby_recover(TransactionId xid, uint16 info, + void *recdata, uint32 len) +{ + TwoPhaseLockRecord *rec = (TwoPhaseLockRecord *) recdata; + LOCKTAG *locktag; + LOCKMODE lockmode; + LOCKMETHODID lockmethodid; + + Assert(len == sizeof(TwoPhaseLockRecord)); + locktag = &rec->locktag; + lockmode = rec->lockmode; + lockmethodid = locktag->locktag_lockmethodid; + + if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods)) + elog(ERROR, "unrecognized lock method: %d", lockmethodid); + + if (lockmode == AccessExclusiveLock && + locktag->locktag_type == LOCKTAG_RELATION) + { + StandbyAcquireAccessExclusiveLock(xid, + locktag->locktag_field1 /* dboid */, + locktag->locktag_field2 /* reloid */); + } +} + + /* * 2PC processing routine for COMMIT PREPARED case. * diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c index 56a785d138..4286a5e3ed 100644 --- a/src/backend/storage/lmgr/proc.c +++ b/src/backend/storage/lmgr/proc.c @@ -8,7 +8,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/storage/lmgr/proc.c,v 1.209 2009/08/31 19:41:00 tgl Exp $ + * $PostgreSQL: pgsql/src/backend/storage/lmgr/proc.c,v 1.210 2009/12/19 01:32:36 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -318,6 +318,7 @@ InitProcess(void) MyProc->waitProcLock = NULL; for (i = 0; i < NUM_LOCK_PARTITIONS; i++) SHMQueueInit(&(MyProc->myProcLocks[i])); + MyProc->recoveryConflictMode = 0; /* * We might be reusing a semaphore that belonged to a failed process. So @@ -374,6 +375,11 @@ InitProcessPhase2(void) * to the ProcArray or the sinval messaging mechanism, either. They also * don't get a VXID assigned, since this is only useful when we actually * hold lockmgr locks. + * + * Startup process however uses locks but never waits for them in the + * normal backend sense. Startup process also takes part in sinval messaging + * as a sendOnly process, so never reads messages from sinval queue. 
So
+ * Startup process does have a VXID and does show up in pg_locks.
  */
 void
 InitAuxiliaryProcess(void)
@@ -461,6 +467,24 @@ InitAuxiliaryProcess(void)
 	on_shmem_exit(AuxiliaryProcKill, Int32GetDatum(proctype));
 }
 
+/*
+ * Record the PID and PGPROC structures for the Startup process, for use in
+ * ProcSendSignal(). See comments there for further explanation.
+ */
+void
+PublishStartupProcessInformation(void)
+{
+	/* use volatile pointer to prevent code rearrangement */
+	volatile PROC_HDR *procglobal = ProcGlobal;
+
+	SpinLockAcquire(ProcStructLock);
+
+	procglobal->startupProc = MyProc;
+	procglobal->startupProcPid = MyProcPid;
+
+	SpinLockRelease(ProcStructLock);
+}
+
 /*
  * Check whether there are at least N free PGPROC objects.
  *
@@ -1289,7 +1313,31 @@ ProcWaitForSignal(void)
 void
 ProcSendSignal(int pid)
 {
-	PGPROC	   *proc = BackendPidGetProc(pid);
+	PGPROC	   *proc = NULL;
+
+	if (RecoveryInProgress())
+	{
+		/* use volatile pointer to prevent code rearrangement */
+		volatile PROC_HDR *procglobal = ProcGlobal;
+
+		SpinLockAcquire(ProcStructLock);
+
+		/*
+		 * Check to see whether it is the Startup process we wish to signal.
+		 * This call is made by the buffer manager when it wishes to wake
+		 * up a process that has been waiting for a pin, so it can obtain a
+		 * cleanup lock using LockBufferForCleanup(). Startup is not a normal
+		 * backend, so BackendPidGetProc() will not return any pid at all.
+		 * So we remember the information for this special case.
+ */ + if (pid == procglobal->startupProcPid) + proc = procglobal->startupProc; + + SpinLockRelease(ProcStructLock); + } + + if (proc == NULL) + proc = BackendPidGetProc(pid); if (proc != NULL) PGSemaphoreUnlock(&proc->sem); diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index f2e892374b..e8c4820a71 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -8,7 +8,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/tcop/postgres.c,v 1.578 2009/12/16 23:05:00 petere Exp $ + * $PostgreSQL: pgsql/src/backend/tcop/postgres.c,v 1.579 2009/12/19 01:32:36 sriggs Exp $ * * NOTES * this is the "main" module of the postgres backend and @@ -62,6 +62,7 @@ #include "storage/proc.h" #include "storage/procsignal.h" #include "storage/sinval.h" +#include "storage/standby.h" #include "tcop/fastpath.h" #include "tcop/pquery.h" #include "tcop/tcopprot.h" @@ -2643,8 +2644,8 @@ StatementCancelHandler(SIGNAL_ARGS) * the interrupt immediately. No point in interrupting if we're * waiting for input, however. */ - if (ImmediateInterruptOK && InterruptHoldoffCount == 0 && - CritSectionCount == 0 && !DoingCommandRead) + if (InterruptHoldoffCount == 0 && CritSectionCount == 0 && + (DoingCommandRead || ImmediateInterruptOK)) { /* bump holdoff count to make ProcessInterrupts() a no-op */ /* until we are done getting ready for it */ @@ -2735,9 +2736,58 @@ ProcessInterrupts(void) (errcode(ERRCODE_QUERY_CANCELED), errmsg("canceling autovacuum task"))); else + { + int cancelMode = MyProc->recoveryConflictMode; + + /* + * XXXHS: We don't yet have a clean way to cancel an + * idle-in-transaction session, so make it FATAL instead. + * This isn't as bad as it looks because we don't issue a + * CONFLICT_MODE_ERROR for a session with proc->xmin == 0 + * on cleanup conflicts. There's a possibility that we + * marked somebody as a conflict and then they go idle. 
+ */ + if (DoingCommandRead && IsTransactionBlock() && + cancelMode == CONFLICT_MODE_ERROR) + { + cancelMode = CONFLICT_MODE_FATAL; + } + + switch (cancelMode) + { + case CONFLICT_MODE_FATAL: + Assert(RecoveryInProgress()); + ereport(FATAL, + (errcode(ERRCODE_QUERY_CANCELED), + errmsg("canceling session due to conflict with recovery"))); + + case CONFLICT_MODE_ERROR: + /* + * We are aborting because we need to release + * locks. So we need to abort out of all + * subtransactions to make sure we release + * all locks at whatever their level. + * + * XXX Should we try to examine the + * transaction tree and cancel just enough + * subxacts to remove locks? Doubt it. + */ + Assert(RecoveryInProgress()); + AbortOutOfAnyTransaction(); + ereport(ERROR, + (errcode(ERRCODE_QUERY_CANCELED), + errmsg("canceling statement due to conflict with recovery"))); + return; + + default: + /* No conflict pending, so fall through */ + break; + } + ereport(ERROR, (errcode(ERRCODE_QUERY_CANCELED), errmsg("canceling statement due to user request"))); + } } /* If we get here, do nothing (probably, QueryCancelPending was reset) */ } diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c index 10fb728fc7..53e59b59b0 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -10,7 +10,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/tcop/utility.c,v 1.324 2009/12/15 20:04:49 tgl Exp $ + * $PostgreSQL: pgsql/src/backend/tcop/utility.c,v 1.325 2009/12/19 01:32:36 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -351,6 +351,7 @@ standard_ProcessUtility(Node *parsetree, break; case TRANS_STMT_PREPARE: + PreventCommandDuringRecovery(); if (!PrepareTransactionBlock(stmt->gid)) { /* report unsuccessful commit in completionTag */ @@ -360,11 +361,13 @@ standard_ProcessUtility(Node *parsetree, break; case TRANS_STMT_COMMIT_PREPARED: + PreventCommandDuringRecovery(); PreventTransactionChain(isTopLevel, "COMMIT 
PREPARED"); FinishPreparedTransaction(stmt->gid, true); break; case TRANS_STMT_ROLLBACK_PREPARED: + PreventCommandDuringRecovery(); PreventTransactionChain(isTopLevel, "ROLLBACK PREPARED"); FinishPreparedTransaction(stmt->gid, false); break; @@ -742,6 +745,7 @@ standard_ProcessUtility(Node *parsetree, break; case T_GrantStmt: + PreventCommandDuringRecovery(); ExecuteGrantStmt((GrantStmt *) parsetree); break; @@ -923,6 +927,7 @@ standard_ProcessUtility(Node *parsetree, case T_NotifyStmt: { NotifyStmt *stmt = (NotifyStmt *) parsetree; + PreventCommandDuringRecovery(); Async_Notify(stmt->conditionname); } @@ -931,6 +936,7 @@ standard_ProcessUtility(Node *parsetree, case T_ListenStmt: { ListenStmt *stmt = (ListenStmt *) parsetree; + PreventCommandDuringRecovery(); CheckRestrictedOperation("LISTEN"); Async_Listen(stmt->conditionname); @@ -940,6 +946,7 @@ standard_ProcessUtility(Node *parsetree, case T_UnlistenStmt: { UnlistenStmt *stmt = (UnlistenStmt *) parsetree; + PreventCommandDuringRecovery(); CheckRestrictedOperation("UNLISTEN"); if (stmt->conditionname) @@ -960,10 +967,12 @@ standard_ProcessUtility(Node *parsetree, break; case T_ClusterStmt: + PreventCommandDuringRecovery(); cluster((ClusterStmt *) parsetree, isTopLevel); break; case T_VacuumStmt: + PreventCommandDuringRecovery(); vacuum((VacuumStmt *) parsetree, InvalidOid, true, NULL, false, isTopLevel); break; @@ -1083,12 +1092,21 @@ standard_ProcessUtility(Node *parsetree, ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), errmsg("must be superuser to do CHECKPOINT"))); - RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT); + /* + * You might think we should have a PreventCommandDuringRecovery() + * here, but we interpret a CHECKPOINT command during recovery + * as a request for a restartpoint instead. We allow this since + * it can be a useful way of reducing switchover time when + * using various forms of replication. 
+ */ + RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT | + (RecoveryInProgress() ? 0 : CHECKPOINT_FORCE)); break; case T_ReindexStmt: { ReindexStmt *stmt = (ReindexStmt *) parsetree; + PreventCommandDuringRecovery(); switch (stmt->kind) { @@ -2604,3 +2622,12 @@ GetCommandLogLevel(Node *parsetree) return lev; } + +void +PreventCommandDuringRecovery(void) +{ + if (RecoveryInProgress()) + ereport(ERROR, + (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION), + errmsg("cannot be executed during recovery"))); +} diff --git a/src/backend/utils/adt/txid.c b/src/backend/utils/adt/txid.c index a4a5b86676..fe9f7c5d39 100644 --- a/src/backend/utils/adt/txid.c +++ b/src/backend/utils/adt/txid.c @@ -14,7 +14,7 @@ * Author: Jan Wieck, Afilias USA INC. * 64-bit txids: Marko Kreen, Skype Technologies * - * $PostgreSQL: pgsql/src/backend/utils/adt/txid.c,v 1.8 2009/01/01 17:23:50 momjian Exp $ + * $PostgreSQL: pgsql/src/backend/utils/adt/txid.c,v 1.9 2009/12/19 01:32:36 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -24,6 +24,7 @@ #include "access/transam.h" #include "access/xact.h" #include "funcapi.h" +#include "miscadmin.h" #include "libpq/pqformat.h" #include "utils/builtins.h" #include "utils/snapmgr.h" @@ -338,6 +339,15 @@ txid_current(PG_FUNCTION_ARGS) txid val; TxidEpoch state; + /* + * Must prevent this during recovery because if an xid has + * not been assigned, we would try to assign one, which would fail. + * Programs already rely on this function to always + * return a valid current xid, so we should not change + * this to return NULL or a similar invalid xid.
+ */ + PreventCommandDuringRecovery(); + load_xid_epoch(&state); val = convert_xid(GetTopTransactionId(), &state); diff --git a/src/backend/utils/adt/xid.c b/src/backend/utils/adt/xid.c index 0fbd394bbf..2cb197fbff 100644 --- a/src/backend/utils/adt/xid.c +++ b/src/backend/utils/adt/xid.c @@ -8,7 +8,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/utils/adt/xid.c,v 1.12 2009/01/01 17:23:50 momjian Exp $ + * $PostgreSQL: pgsql/src/backend/utils/adt/xid.c,v 1.13 2009/12/19 01:32:36 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -102,6 +102,25 @@ xid_age(PG_FUNCTION_ARGS) PG_RETURN_INT32((int32) (now - xid)); } +/* + * xidComparator + * qsort comparison function for XIDs + * + * We can't use wraparound comparison for XIDs because that does not respect + * the triangle inequality! Any old sort order will do. + */ +int +xidComparator(const void *arg1, const void *arg2) +{ + TransactionId xid1 = *(const TransactionId *) arg1; + TransactionId xid2 = *(const TransactionId *) arg2; + + if (xid1 > xid2) + return 1; + if (xid1 < xid2) + return -1; + return 0; +} /***************************************************************************** * COMMAND IDENTIFIER ROUTINES * diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c index 5fac924207..5e59d0ab8e 100644 --- a/src/backend/utils/cache/inval.c +++ b/src/backend/utils/cache/inval.c @@ -80,7 +80,7 @@ * Portions Copyright (c) 1994, Regents of the University of California * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/utils/cache/inval.c,v 1.89 2009/06/11 14:49:05 momjian Exp $ + * $PostgreSQL: pgsql/src/backend/utils/cache/inval.c,v 1.90 2009/12/19 01:32:36 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -155,6 +155,11 @@ typedef struct TransInvalidationInfo static TransInvalidationInfo *transInvalInfo = NULL; +static SharedInvalidationMessage *SharedInvalidMessagesArray; +static int 
numSharedInvalidMessagesArray; +static int maxSharedInvalidMessagesArray; + + /* * Dynamically-registered callback functions. Current implementation * assumes there won't be very many of these at once; could improve if needed. @@ -180,14 +185,6 @@ static struct RELCACHECALLBACK static int relcache_callback_count = 0; -/* info values for 2PC callback */ -#define TWOPHASE_INFO_MSG 0 /* SharedInvalidationMessage */ -#define TWOPHASE_INFO_FILE_BEFORE 1 /* relcache file inval */ -#define TWOPHASE_INFO_FILE_AFTER 2 /* relcache file inval */ - -static void PersistInvalidationMessage(SharedInvalidationMessage *msg); - - /* ---------------------------------------------------------------- * Invalidation list support functions * @@ -741,38 +738,8 @@ AtStart_Inval(void) MemoryContextAllocZero(TopTransactionContext, sizeof(TransInvalidationInfo)); transInvalInfo->my_level = GetCurrentTransactionNestLevel(); -} - -/* - * AtPrepare_Inval - * Save the inval lists state at 2PC transaction prepare. - * - * In this phase we just generate 2PC records for all the pending invalidation - * work. - */ -void -AtPrepare_Inval(void) -{ - /* Must be at top of stack */ - Assert(transInvalInfo != NULL && transInvalInfo->parent == NULL); - - /* - * Relcache init file invalidation requires processing both before and - * after we send the SI messages. 
- */ - if (transInvalInfo->RelcacheInitFileInval) - RegisterTwoPhaseRecord(TWOPHASE_RM_INVAL_ID, TWOPHASE_INFO_FILE_BEFORE, - NULL, 0); - - AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs, - &transInvalInfo->CurrentCmdInvalidMsgs); - - ProcessInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs, - PersistInvalidationMessage); - - if (transInvalInfo->RelcacheInitFileInval) - RegisterTwoPhaseRecord(TWOPHASE_RM_INVAL_ID, TWOPHASE_INFO_FILE_AFTER, - NULL, 0); + SharedInvalidMessagesArray = NULL; + numSharedInvalidMessagesArray = 0; } /* @@ -812,46 +779,98 @@ AtSubStart_Inval(void) } /* - * PersistInvalidationMessage - * Write an invalidation message to the 2PC state file. + * Collect invalidation messages into the SharedInvalidMessagesArray. */ static void -PersistInvalidationMessage(SharedInvalidationMessage *msg) +MakeSharedInvalidMessagesArray(const SharedInvalidationMessage *msgs, int n) { - RegisterTwoPhaseRecord(TWOPHASE_RM_INVAL_ID, TWOPHASE_INFO_MSG, - msg, sizeof(SharedInvalidationMessage)); + /* + * Initialise array first time through in each commit + */ + if (SharedInvalidMessagesArray == NULL) + { + maxSharedInvalidMessagesArray = FIRSTCHUNKSIZE; + numSharedInvalidMessagesArray = 0; + + /* + * Although this is being palloc'd we don't actually free it directly. + * We're so close to EOXact that we know we're going to lose it anyhow.
+ */ + SharedInvalidMessagesArray = palloc(maxSharedInvalidMessagesArray + * sizeof(SharedInvalidationMessage)); + } + + if ((numSharedInvalidMessagesArray + n) > maxSharedInvalidMessagesArray) + { + while ((numSharedInvalidMessagesArray + n) > maxSharedInvalidMessagesArray) + maxSharedInvalidMessagesArray *= 2; + + SharedInvalidMessagesArray = repalloc(SharedInvalidMessagesArray, + maxSharedInvalidMessagesArray + * sizeof(SharedInvalidationMessage)); + } + + /* + * Append the next chunk onto the array + */ + memcpy(SharedInvalidMessagesArray + numSharedInvalidMessagesArray, + msgs, n * sizeof(SharedInvalidationMessage)); + numSharedInvalidMessagesArray += n; } /* - * inval_twophase_postcommit - * Process an invalidation message from the 2PC state file. + * xactGetCommittedInvalidationMessages() is executed by + * RecordTransactionCommit() to add invalidation messages onto the + * commit record. This applies only to commit message types, never to + * abort records. Must always run before AtEOXact_Inval(), since that + * removes the data we need to see. + * + * Remember that this runs before we have officially committed, so we + * must not do anything here to change what might occur *if* we should + * fail between here and the actual commit. 
+ * + * see also xact_redo_commit() and xact_desc_commit() */ -void -inval_twophase_postcommit(TransactionId xid, uint16 info, - void *recdata, uint32 len) +int +xactGetCommittedInvalidationMessages(SharedInvalidationMessage **msgs, + bool *RelcacheInitFileInval) { - SharedInvalidationMessage *msg; + MemoryContext oldcontext; - switch (info) - { - case TWOPHASE_INFO_MSG: - msg = (SharedInvalidationMessage *) recdata; - Assert(len == sizeof(SharedInvalidationMessage)); - SendSharedInvalidMessages(msg, 1); - break; - case TWOPHASE_INFO_FILE_BEFORE: - RelationCacheInitFileInvalidate(true); - break; - case TWOPHASE_INFO_FILE_AFTER: - RelationCacheInitFileInvalidate(false); - break; - default: - Assert(false); - break; - } + /* Must be at top of stack */ + Assert(transInvalInfo != NULL && transInvalInfo->parent == NULL); + + /* + * Relcache init file invalidation requires processing both before and + * after we send the SI messages. However, we need not do anything + * unless we committed. + */ + *RelcacheInitFileInval = transInvalInfo->RelcacheInitFileInval; + + /* + * Walk through TransInvalidationInfo to collect all the messages + * into a single contiguous array of invalidation messages. It must + * be contiguous so we can copy directly into WAL message. Maintain the + * order that they would be processed in by AtEOXact_Inval(), to ensure + * emulated behaviour in redo is as similar as possible to original. + * We want the same bugs, if any, not new ones. 
+ */ + oldcontext = MemoryContextSwitchTo(CurTransactionContext); + + ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs, + MakeSharedInvalidMessagesArray); + ProcessInvalidationMessagesMulti(&transInvalInfo->PriorCmdInvalidMsgs, + MakeSharedInvalidMessagesArray); + MemoryContextSwitchTo(oldcontext); + + Assert(!(numSharedInvalidMessagesArray > 0 && + SharedInvalidMessagesArray == NULL)); + + *msgs = SharedInvalidMessagesArray; + + return numSharedInvalidMessagesArray; } - /* * AtEOXact_Inval * Process queued-up invalidation messages at end of main transaction. @@ -1028,6 +1047,8 @@ CommandEndInvalidationMessages(void) * no need to worry about cleaning up if there's an elog(ERROR) before * reaching EndNonTransactionalInvalidation (the invals will just be thrown * away if that happens). + * + * Note that these are not replayed in standby mode. */ void BeginNonTransactionalInvalidation(void) @@ -1041,6 +1062,9 @@ BeginNonTransactionalInvalidation(void) Assert(transInvalInfo->CurrentCmdInvalidMsgs.cclist == NULL); Assert(transInvalInfo->CurrentCmdInvalidMsgs.rclist == NULL); Assert(transInvalInfo->RelcacheInitFileInval == false); + + SharedInvalidMessagesArray = NULL; + numSharedInvalidMessagesArray = 0; } /* diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index 59fa07a379..552b3392da 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -42,7 +42,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/utils/error/elog.c,v 1.219 2009/11/28 23:38:07 tgl Exp $ + * $PostgreSQL: pgsql/src/backend/utils/error/elog.c,v 1.220 2009/12/19 01:32:37 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -2794,3 +2794,21 @@ is_log_level_output(int elevel, int log_min_level) return false; } + +/* + * If trace_recovery_messages is set to make this visible, then show as LOG, + * else display as whatever level is set. 
It may still be shown, but only + * if log_min_messages is set lower than trace_recovery_messages. + * + * Intention is to keep this for at least the whole of the 8.5 production + * release, so we can more easily diagnose production problems in the field. + */ +int +trace_recovery(int trace_level) +{ + if (trace_level < LOG && + trace_level >= trace_recovery_messages) + return LOG; + + return trace_level; +} diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index b6c93c7f8e..120405cae5 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -8,7 +8,7 @@ * * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/utils/init/postinit.c,v 1.198 2009/10/07 22:14:23 alvherre Exp $ + * $PostgreSQL: pgsql/src/backend/utils/init/postinit.c,v 1.199 2009/12/19 01:32:37 sriggs Exp $ * * *------------------------------------------------------------------------- @@ -481,7 +481,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, */ MyBackendId = InvalidBackendId; - SharedInvalBackendInit(); + SharedInvalBackendInit(false); if (MyBackendId > MaxBackends || MyBackendId <= 0) elog(FATAL, "bad backend id: %d", MyBackendId); @@ -495,11 +495,11 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, InitBufferPoolBackend(); /* - * Initialize local process's access to XLOG. In bootstrap case we may - * skip this since StartupXLOG() was run instead. + * Initialize local process's access to XLOG, if appropriate. In bootstrap + * case we skip this since StartupXLOG() was run instead. */ if (!bootstrap) - InitXLOGAccess(); + (void) RecoveryInProgress(); /* * Initialize the relation cache and the system catalog caches. Note that diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 900c366278..0c9998614f 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -10,7 +10,7 @@ * Written by Peter Eisentraut . 
* * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/utils/misc/guc.c,v 1.527 2009/12/11 03:34:56 itagaki Exp $ + * $PostgreSQL: pgsql/src/backend/utils/misc/guc.c,v 1.528 2009/12/19 01:32:37 sriggs Exp $ * *-------------------------------------------------------------------- */ @@ -114,6 +114,9 @@ extern char *default_tablespace; extern char *temp_tablespaces; extern bool synchronize_seqscans; extern bool fullPageWrites; +extern int vacuum_defer_cleanup_age; + +int trace_recovery_messages = LOG; #ifdef TRACE_SORT extern bool trace_sort; @@ -1206,6 +1209,17 @@ static struct config_bool ConfigureNamesBool[] = false, NULL, NULL }, + { + {"recovery_connections", PGC_POSTMASTER, WAL_SETTINGS, + gettext_noop("During recovery, allows connections and queries. " + " During normal running, causes additional info to be written" + " to WAL to enable hot standby mode on WAL standby nodes."), + NULL + }, + &XLogRequestRecoveryConnections, + true, NULL, NULL + }, + { {"allow_system_table_mods", PGC_POSTMASTER, DEVELOPER_OPTIONS, gettext_noop("Allows modifications of the structure of system tables."), @@ -1347,6 +1361,8 @@ static struct config_int ConfigureNamesInt[] = * plus autovacuum_max_workers plus one (for the autovacuum launcher). * * Likewise we have to limit NBuffers to INT_MAX/2. 
+ * + * See also CheckRequiredParameterValues() if this parameter changes */ { {"max_connections", PGC_POSTMASTER, CONN_AUTH_SETTINGS, @@ -1357,6 +1373,15 @@ static struct config_int ConfigureNamesInt[] = 100, 1, INT_MAX / 4, assign_maxconnections, NULL }, + { + {"max_standby_delay", PGC_SIGHUP, WAL_SETTINGS, + gettext_noop("Sets the maximum delay to avoid conflict processing on Hot Standby servers."), + NULL + }, + &MaxStandbyDelay, + 30, -1, INT_MAX, NULL, NULL + }, + { {"superuser_reserved_connections", PGC_POSTMASTER, CONN_AUTH_SETTINGS, gettext_noop("Sets the number of connection slots reserved for superusers."), @@ -1514,6 +1539,9 @@ static struct config_int ConfigureNamesInt[] = 1000, 25, INT_MAX, NULL, NULL }, + /* + * See also CheckRequiredParameterValues() if this parameter changes + */ { {"max_prepared_transactions", PGC_POSTMASTER, RESOURCES, gettext_noop("Sets the maximum number of simultaneously prepared transactions."), @@ -1572,6 +1600,18 @@ static struct config_int ConfigureNamesInt[] = 150000000, 0, 2000000000, NULL, NULL }, + { + {"vacuum_defer_cleanup_age", PGC_USERSET, CLIENT_CONN_STATEMENT, + gettext_noop("Age by which VACUUM and HOT cleanup should be deferred, if any."), + NULL + }, + &vacuum_defer_cleanup_age, + 0, 0, 1000000, NULL, NULL + }, + + /* + * See also CheckRequiredParameterValues() if this parameter changes + */ { {"max_locks_per_transaction", PGC_POSTMASTER, LOCK_MANAGEMENT, gettext_noop("Sets the maximum number of locks per transaction."), @@ -2684,6 +2724,16 @@ static struct config_enum ConfigureNamesEnum[] = assign_session_replication_role, NULL }, + { + {"trace_recovery_messages", PGC_SUSET, LOGGING_WHEN, + gettext_noop("Sets the message levels that are logged during recovery."), + gettext_noop("Each level includes all the levels that follow it. 
The later" + " the level, the fewer messages are sent.") + }, + &trace_recovery_messages, + DEBUG1, server_message_level_options, NULL, NULL + }, + { {"track_functions", PGC_SUSET, STATS_COLLECTOR, gettext_noop("Collects function-level statistics on database activity."), @@ -7511,6 +7561,18 @@ assign_transaction_read_only(bool newval, bool doit, GucSource source) if (source != PGC_S_OVERRIDE) return false; } + + /* Can't go to r/w mode while recovery is still active */ + if (newval == false && XactReadOnly && RecoveryInProgress()) + { + ereport(GUC_complaint_elevel(source), + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("cannot set transaction read-write mode during recovery"))); + /* source == PGC_S_OVERRIDE means do it anyway, eg at xact abort */ + if (source != PGC_S_OVERRIDE) + return false; + } + return true; } diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index d2da9b9c3d..c4ddeaf2bc 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -181,6 +181,9 @@ #archive_timeout = 0 # force a logfile segment switch after this # number of seconds; 0 disables +#recovery_connections = on # allows connections during recovery +#max_standby_delay = 30 # max acceptable standby lag (s) to help queries + # complete without conflict; -1 disables #------------------------------------------------------------------------------ # QUERY TUNING diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c index 41ff3405a3..cf3479e064 100644 --- a/src/backend/utils/time/snapmgr.c +++ b/src/backend/utils/time/snapmgr.c @@ -19,7 +19,7 @@ * Portions Copyright (c) 1994, Regents of the University of California * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/utils/time/snapmgr.c,v 1.12 2009/10/07 16:27:18 alvherre Exp $ + * $PostgreSQL: pgsql/src/backend/utils/time/snapmgr.c,v 1.13 2009/12/19 01:32:37 sriggs Exp $ * 
*------------------------------------------------------------------------- */ @@ -224,8 +224,14 @@ CopySnapshot(Snapshot snapshot) else newsnap->xip = NULL; - /* setup subXID array */ - if (snapshot->subxcnt > 0) + /* + * Setup subXID array. Don't bother to copy it if it had overflowed, + * though, because it's not used anywhere in that case. Except if it's + * a snapshot taken during recovery; all the top-level XIDs are in subxip + * as well in that case, so we mustn't lose them. + */ + if (snapshot->subxcnt > 0 && + (!snapshot->suboverflowed || snapshot->takenDuringRecovery)) { newsnap->subxip = (TransactionId *) ((char *) newsnap + subxipoff); memcpy(newsnap->subxip, snapshot->subxip, diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c index 6d8f86acc9..32eeabb999 100644 --- a/src/backend/utils/time/tqual.c +++ b/src/backend/utils/time/tqual.c @@ -50,7 +50,7 @@ * Portions Copyright (c) 1994, Regents of the University of California * * IDENTIFICATION - * $PostgreSQL: pgsql/src/backend/utils/time/tqual.c,v 1.113 2009/06/11 14:49:06 momjian Exp $ + * $PostgreSQL: pgsql/src/backend/utils/time/tqual.c,v 1.114 2009/12/19 01:32:37 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -1257,42 +1257,84 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot) return true; /* - * If the snapshot contains full subxact data, the fastest way to check - * things is just to compare the given XID against both subxact XIDs and - * top-level XIDs. If the snapshot overflowed, we have to use pg_subtrans - * to convert a subxact XID to its parent XID, but then we need only look - * at top-level XIDs not subxacts. + * Snapshot information is stored slightly differently in snapshots + * taken during recovery. 
*/ - if (snapshot->subxcnt >= 0) + if (!snapshot->takenDuringRecovery) + { + /* + * If the snapshot contains full subxact data, the fastest way to check + * things is just to compare the given XID against both subxact XIDs and + * top-level XIDs. If the snapshot overflowed, we have to use pg_subtrans + * to convert a subxact XID to its parent XID, but then we need only look + * at top-level XIDs not subxacts. + */ + if (!snapshot->suboverflowed) + { + /* full data, so search subxip */ + int32 j; + + for (j = 0; j < snapshot->subxcnt; j++) + { + if (TransactionIdEquals(xid, snapshot->subxip[j])) + return true; + } + + /* not there, fall through to search xip[] */ + } + else + { + /* overflowed, so convert xid to top-level */ + xid = SubTransGetTopmostTransaction(xid); + + /* + * If xid was indeed a subxact, we might now have an xid < xmin, so + * recheck to avoid an array scan. No point in rechecking xmax. + */ + if (TransactionIdPrecedes(xid, snapshot->xmin)) + return false; + } + + for (i = 0; i < snapshot->xcnt; i++) + { + if (TransactionIdEquals(xid, snapshot->xip[i])) + return true; + } + } + else { - /* full data, so search subxip */ int32 j; + /* + * In recovery we store all xids in the subxact array because it + * is by far the bigger array, and we mostly don't know which xids + * are top-level and which are subxacts. The xip array is empty. + * + * We start by searching subtrans, if we overflowed. + */ + if (snapshot->suboverflowed) + { + /* overflowed, so convert xid to top-level */ + xid = SubTransGetTopmostTransaction(xid); + + /* + * If xid was indeed a subxact, we might now have an xid < xmin, so + * recheck to avoid an array scan. No point in rechecking xmax. + */ + if (TransactionIdPrecedes(xid, snapshot->xmin)) + return false; + } + + /* + * We now have either a top-level xid higher than xmin or an + * indeterminate xid. We don't know whether it's top level or subxact + * but it doesn't matter. If it's present, the xid is visible. 
+ */ for (j = 0; j < snapshot->subxcnt; j++) { if (TransactionIdEquals(xid, snapshot->subxip[j])) return true; } - - /* not there, fall through to search xip[] */ - } - else - { - /* overflowed, so convert xid to top-level */ - xid = SubTransGetTopmostTransaction(xid); - - /* - * If xid was indeed a subxact, we might now have an xid < xmin, so - * recheck to avoid an array scan. No point in rechecking xmax. - */ - if (TransactionIdPrecedes(xid, snapshot->xmin)) - return false; - } - - for (i = 0; i < snapshot->xcnt; i++) - { - if (TransactionIdEquals(xid, snapshot->xip[i])) - return true; } return false; diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c index e5b6480eb7..9551a1d878 100644 --- a/src/bin/pg_controldata/pg_controldata.c +++ b/src/bin/pg_controldata/pg_controldata.c @@ -6,7 +6,7 @@ * copyright (c) Oliver Elphick , 2001; * licence: BSD * - * $PostgreSQL: pgsql/src/bin/pg_controldata/pg_controldata.c,v 1.44 2009/08/31 02:23:22 tgl Exp $ + * $PostgreSQL: pgsql/src/bin/pg_controldata/pg_controldata.c,v 1.45 2009/12/19 01:32:38 sriggs Exp $ */ #include "postgres_fe.h" @@ -196,6 +196,8 @@ main(int argc, char *argv[]) ControlFile.checkPointCopy.oldestXid); printf(_("Latest checkpoint's oldestXID's DB: %u\n"), ControlFile.checkPointCopy.oldestXidDB); + printf(_("Latest checkpoint's oldestActiveXID: %u\n"), + ControlFile.checkPointCopy.oldestActiveXid); printf(_("Time of latest checkpoint: %s\n"), ckpttime_str); printf(_("Minimum recovery ending location: %X/%X\n"), diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index f31a505dda..0f3eabce19 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -7,7 +7,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/include/access/heapam.h,v 1.144 2009/08/24 02:18:32 tgl Exp $ + * $PostgreSQL: 
pgsql/src/include/access/heapam.h,v 1.145 2009/12/19 01:32:42 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -130,11 +130,13 @@ extern XLogRecPtr log_heap_move(Relation reln, Buffer oldbuf, ItemPointerData from, Buffer newbuf, HeapTuple newtup, bool all_visible_cleared, bool new_all_visible_cleared); +extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode, + TransactionId latestRemovedXid); extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer, OffsetNumber *redirected, int nredirected, OffsetNumber *nowdead, int ndead, OffsetNumber *nowunused, int nunused, - bool redirect_move); + TransactionId latestRemovedXid, bool redirect_move); extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid, OffsetNumber *offsets, int offcnt); diff --git a/src/include/access/htup.h b/src/include/access/htup.h index f7fa60cb70..017f6917e1 100644 --- a/src/include/access/htup.h +++ b/src/include/access/htup.h @@ -7,7 +7,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/include/access/htup.h,v 1.107 2009/06/11 14:49:08 momjian Exp $ + * $PostgreSQL: pgsql/src/include/access/htup.h,v 1.108 2009/12/19 01:32:42 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -580,6 +580,7 @@ typedef HeapTupleData *HeapTuple; #define XLOG_HEAP2_FREEZE 0x00 #define XLOG_HEAP2_CLEAN 0x10 #define XLOG_HEAP2_CLEAN_MOVE 0x20 +#define XLOG_HEAP2_CLEANUP_INFO 0x30 /* * All what we need to find changed tuple @@ -668,6 +669,7 @@ typedef struct xl_heap_clean { RelFileNode node; BlockNumber block; + TransactionId latestRemovedXid; uint16 nredirected; uint16 ndead; /* OFFSET NUMBERS FOLLOW */ @@ -675,6 +677,19 @@ typedef struct xl_heap_clean #define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16)) +/* + * Cleanup_info is required in some 
cases during a lazy VACUUM. + * Used for reporting the results of HeapTupleHeaderAdvanceLatestRemovedXid() + * see vacuumlazy.c for full explanation + */ +typedef struct xl_heap_cleanup_info +{ + RelFileNode node; + TransactionId latestRemovedXid; +} xl_heap_cleanup_info; + +#define SizeOfHeapCleanupInfo (sizeof(xl_heap_cleanup_info)) + /* This is for replacing a page's contents in toto */ /* NB: this is used for indexes as well as heaps */ typedef struct xl_heap_newpage @@ -718,6 +733,9 @@ typedef struct xl_heap_freeze #define SizeOfHeapFreeze (offsetof(xl_heap_freeze, cutoff_xid) + sizeof(TransactionId)) +extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple, + TransactionId *latestRemovedXid); + /* HeapTupleHeader functions implemented in utils/time/combocid.c */ extern CommandId HeapTupleHeaderGetCmin(HeapTupleHeader tup); extern CommandId HeapTupleHeaderGetCmax(HeapTupleHeader tup); diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h index ed5ec57e47..bef4db461c 100644 --- a/src/include/access/nbtree.h +++ b/src/include/access/nbtree.h @@ -7,7 +7,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/include/access/nbtree.h,v 1.125 2009/07/29 20:56:19 tgl Exp $ + * $PostgreSQL: pgsql/src/include/access/nbtree.h,v 1.126 2009/12/19 01:32:42 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -214,12 +214,13 @@ typedef struct BTMetaPageData #define XLOG_BTREE_SPLIT_R 0x40 /* as above, new item on right */ #define XLOG_BTREE_SPLIT_L_ROOT 0x50 /* add tuple with split of root */ #define XLOG_BTREE_SPLIT_R_ROOT 0x60 /* as above, new item on right */ -#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuple */ +#define XLOG_BTREE_DELETE 0x70 /* delete leaf index tuples for a page */ #define XLOG_BTREE_DELETE_PAGE 0x80 /* delete an entire page */ #define 
XLOG_BTREE_DELETE_PAGE_META 0x90 /* same, and update metapage */ #define XLOG_BTREE_NEWROOT 0xA0 /* new root page */ #define XLOG_BTREE_DELETE_PAGE_HALF 0xB0 /* page deletion that makes * parent half-dead */ +#define XLOG_BTREE_VACUUM 0xC0 /* delete entries on a page during vacuum */ /* * All that we need to find changed index tuple @@ -306,16 +307,53 @@ typedef struct xl_btree_split /* * This is what we need to know about delete of individual leaf index tuples. * The WAL record can represent deletion of any number of index tuples on a - * single index page. + * single index page when *not* executed by VACUUM. */ typedef struct xl_btree_delete { RelFileNode node; BlockNumber block; + TransactionId latestRemovedXid; + int numItems; /* number of items in the offset array */ + /* TARGET OFFSET NUMBERS FOLLOW AT THE END */ } xl_btree_delete; -#define SizeOfBtreeDelete (offsetof(xl_btree_delete, block) + sizeof(BlockNumber)) +#define SizeOfBtreeDelete (offsetof(xl_btree_delete, latestRemovedXid) + sizeof(TransactionId)) + +/* + * This is what we need to know about vacuum of individual leaf index tuples. + * The WAL record can represent deletion of any number of index tuples on a + * single index page when executed by VACUUM. + * + * The correctness requirement for applying these changes during recovery is + * that we must do one of these two things for every block in the index: + * * lock the block for cleanup and apply any required changes + * * EnsureBlockUnpinned() + * The purpose of this is to ensure that no index scans started before we + * finish scanning the index are still running by the time we begin to remove + * heap tuples. + * + * Any changes to any one block are registered on just one WAL record. All + * blocks that we need to run EnsureBlockUnpinned() before we touch the changed + * block are also given on this record as a variable length array. 
The array + * is compressed by way of storing an array of block ranges, rather than an + * actual array of blockids. + * + * Note that the *last* WAL record in any vacuum of an index is allowed to + * have numItems == 0. All other WAL records must have numItems > 0. + */ +typedef struct xl_btree_vacuum +{ + RelFileNode node; + BlockNumber block; + BlockNumber lastBlockVacuumed; + int numItems; /* number of items in the offset array */ + + /* TARGET OFFSET NUMBERS FOLLOW */ +} xl_btree_vacuum; + +#define SizeOfBtreeVacuum (offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber)) /* * This is what we need to know about deletion of a btree page. The target @@ -537,7 +575,8 @@ extern void _bt_relbuf(Relation rel, Buffer buf); extern void _bt_pageinit(Page page, Size size); extern bool _bt_page_recyclable(Page page); extern void _bt_delitems(Relation rel, Buffer buf, - OffsetNumber *itemnos, int nitems); + OffsetNumber *itemnos, int nitems, bool isVacuum, + BlockNumber lastBlockVacuumed); extern int _bt_pagedel(Relation rel, Buffer buf, BTStack stack, bool vacuum_full); diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h index eeca306d8b..2761d1d8ad 100644 --- a/src/include/access/relscan.h +++ b/src/include/access/relscan.h @@ -7,7 +7,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/include/access/relscan.h,v 1.67 2009/01/01 17:23:56 momjian Exp $ + * $PostgreSQL: pgsql/src/include/access/relscan.h,v 1.68 2009/12/19 01:32:42 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -68,6 +68,7 @@ typedef struct IndexScanDescData /* signaling to index AM about killing index tuples */ bool kill_prior_tuple; /* last-returned tuple is dead */ bool ignore_killed_tuples; /* do not return killed entries */ + bool xactStartedInRecovery; /* prevents killing/seeing killed tuples */ /* 
index access method's private state */ void *opaque; /* access-method-specific info */ diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h index 44b8a07a3f..32b1bd535c 100644 --- a/src/include/access/rmgr.h +++ b/src/include/access/rmgr.h @@ -3,7 +3,7 @@ * * Resource managers definition * - * $PostgreSQL: pgsql/src/include/access/rmgr.h,v 1.19 2008/11/19 10:34:52 heikki Exp $ + * $PostgreSQL: pgsql/src/include/access/rmgr.h,v 1.20 2009/12/19 01:32:42 sriggs Exp $ */ #ifndef RMGR_H #define RMGR_H @@ -23,6 +23,7 @@ typedef uint8 RmgrId; #define RM_DBASE_ID 4 #define RM_TBLSPC_ID 5 #define RM_MULTIXACT_ID 6 +#define RM_STANDBY_ID 8 #define RM_HEAP2_ID 9 #define RM_HEAP_ID 10 #define RM_BTREE_ID 11 diff --git a/src/include/access/subtrans.h b/src/include/access/subtrans.h index 07941ac6c1..0658da0c5c 100644 --- a/src/include/access/subtrans.h +++ b/src/include/access/subtrans.h @@ -6,7 +6,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/include/access/subtrans.h,v 1.12 2009/01/01 17:23:56 momjian Exp $ + * $PostgreSQL: pgsql/src/include/access/subtrans.h,v 1.13 2009/12/19 01:32:42 sriggs Exp $ */ #ifndef SUBTRANS_H #define SUBTRANS_H @@ -14,7 +14,7 @@ /* Number of SLRU buffers to use for subtrans */ #define NUM_SUBTRANS_BUFFERS 32 -extern void SubTransSetParent(TransactionId xid, TransactionId parent); +extern void SubTransSetParent(TransactionId xid, TransactionId parent, bool overwriteOK); extern TransactionId SubTransGetParent(TransactionId xid); extern TransactionId SubTransGetTopmostTransaction(TransactionId xid); diff --git a/src/include/access/transam.h b/src/include/access/transam.h index 6f3370dcd2..5917129f17 100644 --- a/src/include/access/transam.h +++ b/src/include/access/transam.h @@ -7,7 +7,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the 
University of California * - * $PostgreSQL: pgsql/src/include/access/transam.h,v 1.70 2009/09/01 04:46:49 tgl Exp $ + * $PostgreSQL: pgsql/src/include/access/transam.h,v 1.71 2009/12/19 01:32:42 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -129,6 +129,9 @@ typedef VariableCacheData *VariableCache; * ---------------- */ +/* in transam/xact.c */ +extern bool TransactionStartedDuringRecovery(void); + /* in transam/varsup.c */ extern PGDLLIMPORT VariableCache ShmemVariableCache; diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h index 5652145f8c..864d6d6da0 100644 --- a/src/include/access/twophase.h +++ b/src/include/access/twophase.h @@ -7,7 +7,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/include/access/twophase.h,v 1.12 2009/11/23 09:58:36 heikki Exp $ + * $PostgreSQL: pgsql/src/include/access/twophase.h,v 1.13 2009/12/19 01:32:42 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -40,8 +40,10 @@ extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid, extern void StartPrepare(GlobalTransaction gxact); extern void EndPrepare(GlobalTransaction gxact); +extern bool StandbyTransactionIdIsPrepared(TransactionId xid); -extern TransactionId PrescanPreparedTransactions(void); +extern TransactionId PrescanPreparedTransactions(TransactionId **xids_p, + int *nxids_p); extern void RecoverPreparedTransactions(void); extern void RecreateTwoPhaseFile(TransactionId xid, void *content, int len); diff --git a/src/include/access/twophase_rmgr.h b/src/include/access/twophase_rmgr.h index 37d03495bc..5fd5e10a78 100644 --- a/src/include/access/twophase_rmgr.h +++ b/src/include/access/twophase_rmgr.h @@ -7,7 +7,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, 
Regents of the University of California * - * $PostgreSQL: pgsql/src/include/access/twophase_rmgr.h,v 1.9 2009/11/23 09:58:36 heikki Exp $ + * $PostgreSQL: pgsql/src/include/access/twophase_rmgr.h,v 1.10 2009/12/19 01:32:42 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -23,15 +23,15 @@ typedef uint8 TwoPhaseRmgrId; */ #define TWOPHASE_RM_END_ID 0 #define TWOPHASE_RM_LOCK_ID 1 -#define TWOPHASE_RM_INVAL_ID 2 -#define TWOPHASE_RM_NOTIFY_ID 3 -#define TWOPHASE_RM_PGSTAT_ID 4 -#define TWOPHASE_RM_MULTIXACT_ID 5 +#define TWOPHASE_RM_NOTIFY_ID 2 +#define TWOPHASE_RM_PGSTAT_ID 3 +#define TWOPHASE_RM_MULTIXACT_ID 4 #define TWOPHASE_RM_MAX_ID TWOPHASE_RM_MULTIXACT_ID extern const TwoPhaseCallback twophase_recover_callbacks[]; extern const TwoPhaseCallback twophase_postcommit_callbacks[]; extern const TwoPhaseCallback twophase_postabort_callbacks[]; +extern const TwoPhaseCallback twophase_standby_recover_callbacks[]; extern void RegisterTwoPhaseRecord(TwoPhaseRmgrId rmid, uint16 info, diff --git a/src/include/access/xact.h b/src/include/access/xact.h index 880b41b707..678a23da96 100644 --- a/src/include/access/xact.h +++ b/src/include/access/xact.h @@ -7,7 +7,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/include/access/xact.h,v 1.98 2009/06/11 14:49:09 momjian Exp $ + * $PostgreSQL: pgsql/src/include/access/xact.h,v 1.99 2009/12/19 01:32:42 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -84,19 +84,49 @@ typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid, #define XLOG_XACT_ABORT 0x20 #define XLOG_XACT_COMMIT_PREPARED 0x30 #define XLOG_XACT_ABORT_PREPARED 0x40 +#define XLOG_XACT_ASSIGNMENT 0x50 + +typedef struct xl_xact_assignment +{ + TransactionId xtop; /* assigned XID's top-level XID */ + int nsubxacts; /* number of 
subtransaction XIDs */ + TransactionId xsub[1]; /* assigned subxids */ +} xl_xact_assignment; + +#define MinSizeOfXactAssignment offsetof(xl_xact_assignment, xsub) typedef struct xl_xact_commit { TimestampTz xact_time; /* time of commit */ + uint32 xinfo; /* info flags */ int nrels; /* number of RelFileNodes */ int nsubxacts; /* number of subtransaction XIDs */ + int nmsgs; /* number of shared inval msgs */ /* Array of RelFileNode(s) to drop at commit */ RelFileNode xnodes[1]; /* VARIABLE LENGTH ARRAY */ /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */ + /* ARRAY OF SHARED INVALIDATION MESSAGES FOLLOWS */ } xl_xact_commit; #define MinSizeOfXactCommit offsetof(xl_xact_commit, xnodes) +/* + * These flags are set in the xinfo fields of WAL commit records, + * indicating a variety of additional actions that need to occur + * when emulating transaction effects during recovery. + * They are named XactCompletion... to differentiate them from + * EOXact... routines which run at the end of the original + * transaction completion. + */ +#define XACT_COMPLETION_UPDATE_RELCACHE_FILE 0x01 +#define XACT_COMPLETION_VACUUM_FULL 0x02 +#define XACT_COMPLETION_FORCE_SYNC_COMMIT 0x04 + +/* Access macros for above flags */ +#define XactCompletionRelcacheInitFileInval(xlrec) ((xlrec)->xinfo & XACT_COMPLETION_UPDATE_RELCACHE_FILE) +#define XactCompletionVacuumFull(xlrec) ((xlrec)->xinfo & XACT_COMPLETION_VACUUM_FULL) +#define XactCompletionForceSyncCommit(xlrec) ((xlrec)->xinfo & XACT_COMPLETION_FORCE_SYNC_COMMIT) + typedef struct xl_xact_abort { TimestampTz xact_time; /* time of abort */ @@ -106,6 +136,7 @@ typedef struct xl_xact_abort RelFileNode xnodes[1]; /* VARIABLE LENGTH ARRAY */ /* ARRAY OF ABORTED SUBTRANSACTION XIDs FOLLOWS */ } xl_xact_abort; +/* Note the intentional lack of an invalidation message array c.f. 
commit */ #define MinSizeOfXactAbort offsetof(xl_xact_abort, xnodes) @@ -181,7 +212,7 @@ extern void UnregisterXactCallback(XactCallback callback, void *arg); extern void RegisterSubXactCallback(SubXactCallback callback, void *arg); extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg); -extern TransactionId RecordTransactionCommit(void); +extern TransactionId RecordTransactionCommit(bool isVacuumFull); extern int xactGetCommittedChildren(TransactionId **ptr); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 052a314d74..ae624a0815 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -6,7 +6,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/include/access/xlog.h,v 1.93 2009/06/26 20:29:04 tgl Exp $ + * $PostgreSQL: pgsql/src/include/access/xlog.h,v 1.94 2009/12/19 01:32:42 sriggs Exp $ */ #ifndef XLOG_H #define XLOG_H @@ -133,7 +133,45 @@ typedef struct XLogRecData } XLogRecData; extern TimeLineID ThisTimeLineID; /* current TLI */ + +/* + * Prior to 8.4, all activity during recovery was carried out by the Startup + * process. This local variable continues to be used in many parts of the + * code to indicate actions taken by RecoveryManagers. Other processes that + * potentially perform work during recovery should check RecoveryInProgress(); + * see XLogCtl notes in xlog.c. + */ extern bool InRecovery; + +/* + * Like InRecovery, standbyState is only valid in the startup process. + * + * In DISABLED state, we're performing crash recovery or hot standby was + * disabled in recovery.conf. + * + * In INITIALIZED state, we haven't yet received a RUNNING_XACTS or shutdown + * checkpoint record to initialize our master transaction tracking system. + * + * When the transaction tracking is initialized, we enter the SNAPSHOT_PENDING + * state.
The tracked information might still be incomplete, so we can't allow + * connections yet, but redo functions must update the in-memory state when + * appropriate. + * + * In SNAPSHOT_READY mode, we have full knowledge of transactions that are + * (or were) running in the master at the current WAL location. Snapshots + * can be taken, and read-only queries can be run. + */ +typedef enum +{ + STANDBY_DISABLED, + STANDBY_INITIALIZED, + STANDBY_SNAPSHOT_PENDING, + STANDBY_SNAPSHOT_READY +} HotStandbyState; +extern HotStandbyState standbyState; + +#define InHotStandby (standbyState >= STANDBY_SNAPSHOT_PENDING) + extern XLogRecPtr XactLastRecEnd; /* these variables are GUC parameters related to XLOG */ @@ -143,9 +181,12 @@ extern bool XLogArchiveMode; extern char *XLogArchiveCommand; extern int XLogArchiveTimeout; extern bool log_checkpoints; +extern bool XLogRequestRecoveryConnections; +extern int MaxStandbyDelay; #define XLogArchivingActive() (XLogArchiveMode) #define XLogArchiveCommandSet() (XLogArchiveCommand[0] != '\0') +#define XLogStandbyInfoActive() (XLogRequestRecoveryConnections && XLogArchiveMode) #ifdef WAL_DEBUG extern bool XLOG_DEBUG; @@ -203,6 +244,7 @@ extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec); extern bool RecoveryInProgress(void); extern bool XLogInsertAllowed(void); +extern TimestampTz GetLatestXLogTime(void); extern void UpdateControlFile(void); extern Size XLOGShmemSize(void); diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h index 508c2eeb8d..9747dd8c96 100644 --- a/src/include/access/xlog_internal.h +++ b/src/include/access/xlog_internal.h @@ -11,7 +11,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/include/access/xlog_internal.h,v 1.25 2009/01/01 17:23:56 momjian Exp $ + * $PostgreSQL: pgsql/src/include/access/xlog_internal.h,v 1.26 2009/12/19 01:32:42 sriggs Exp $ 
*/ #ifndef XLOG_INTERNAL_H #define XLOG_INTERNAL_H @@ -71,7 +71,7 @@ typedef struct XLogContRecord /* * Each page of XLOG file has a header like this: */ -#define XLOG_PAGE_MAGIC 0xD063 /* can be used as WAL version indicator */ +#define XLOG_PAGE_MAGIC 0xD166 /* can be used as WAL version indicator */ typedef struct XLogPageHeaderData { @@ -255,5 +255,6 @@ extern Datum pg_current_xlog_location(PG_FUNCTION_ARGS); extern Datum pg_current_xlog_insert_location(PG_FUNCTION_ARGS); extern Datum pg_xlogfile_name_offset(PG_FUNCTION_ARGS); extern Datum pg_xlogfile_name(PG_FUNCTION_ARGS); +extern Datum pg_is_in_recovery(PG_FUNCTION_ARGS); #endif /* XLOG_INTERNAL_H */ diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h index c7a740ca71..a4d19f8b4a 100644 --- a/src/include/catalog/pg_control.h +++ b/src/include/catalog/pg_control.h @@ -8,7 +8,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/include/catalog/pg_control.h,v 1.44 2009/08/31 02:23:23 tgl Exp $ + * $PostgreSQL: pgsql/src/include/catalog/pg_control.h,v 1.45 2009/12/19 01:32:42 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -40,6 +40,20 @@ typedef struct CheckPoint TransactionId oldestXid; /* cluster-wide minimum datfrozenxid */ Oid oldestXidDB; /* database with minimum datfrozenxid */ pg_time_t time; /* time stamp of checkpoint */ + + /* Important parameter settings at time of shutdown checkpoints */ + int MaxConnections; + int max_prepared_xacts; + int max_locks_per_xact; + bool XLogStandbyInfoMode; + + /* + * Oldest XID still running. This is only needed to initialize hot standby + * mode from an online checkpoint, so we only bother calculating this for + * online checkpoints and only when archiving is enabled. Otherwise it's + * set to InvalidTransactionId. 
+ */ + TransactionId oldestActiveXid; } CheckPoint; /* XLOG info values for XLOG rmgr */ diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h index c9307df58a..ba5b49a563 100644 --- a/src/include/catalog/pg_proc.h +++ b/src/include/catalog/pg_proc.h @@ -7,7 +7,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/include/catalog/pg_proc.h,v 1.556 2009/12/06 02:55:54 tgl Exp $ + * $PostgreSQL: pgsql/src/include/catalog/pg_proc.h,v 1.557 2009/12/19 01:32:42 sriggs Exp $ * * NOTES * The script catalog/genbki.sh reads this file and generates .bki @@ -3285,6 +3285,9 @@ DESCR("xlog filename and byte offset, given an xlog location"); DATA(insert OID = 2851 ( pg_xlogfile_name PGNSP PGUID 12 1 0 0 f f f t f i 1 0 25 "25" _null_ _null_ _null_ _null_ pg_xlogfile_name _null_ _null_ _null_ )); DESCR("xlog filename, given an xlog location"); +DATA(insert OID = 3810 ( pg_is_in_recovery PGNSP PGUID 12 1 0 0 f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_is_in_recovery _null_ _null_ _null_ )); +DESCR("true if server is in recovery"); + DATA(insert OID = 2621 ( pg_reload_conf PGNSP PGUID 12 1 0 0 f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_reload_conf _null_ _null_ _null_ )); DESCR("reload configuration files"); DATA(insert OID = 2622 ( pg_rotate_logfile PGNSP PGUID 12 1 0 0 f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_rotate_logfile _null_ _null_ _null_ )); diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index 424af2e56a..da46c49496 100644 --- a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -13,7 +13,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/include/miscadmin.h,v 1.215 2009/12/09 21:57:51 tgl Exp $ + * $PostgreSQL: pgsql/src/include/miscadmin.h,v 1.216 
2009/12/19 01:32:41 sriggs Exp $ * * NOTES * some of the information in this file should be moved to other files. @@ -236,6 +236,12 @@ extern bool VacuumCostActive; /* in tcop/postgres.c */ extern void check_stack_depth(void); +/* in tcop/utility.c */ +extern void PreventCommandDuringRecovery(void); + +/* in utils/misc/guc.c */ +extern int trace_recovery_messages; +int trace_recovery(int trace_level); /***************************************************************************** * pdir.h -- * diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h index e2b27ccb98..2749b833f6 100644 --- a/src/include/storage/lock.h +++ b/src/include/storage/lock.h @@ -7,7 +7,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/include/storage/lock.h,v 1.116 2009/04/04 17:40:36 tgl Exp $ + * $PostgreSQL: pgsql/src/include/storage/lock.h,v 1.117 2009/12/19 01:32:44 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -477,6 +477,11 @@ extern LockAcquireResult LockAcquire(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock, bool dontWait); +extern LockAcquireResult LockAcquireExtended(const LOCKTAG *locktag, + LOCKMODE lockmode, + bool sessionLock, + bool dontWait, + bool report_memory_error); extern bool LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock); extern void LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks); @@ -494,6 +499,17 @@ extern void GrantAwaitedLock(void); extern void RemoveFromWaitQueue(PGPROC *proc, uint32 hashcode); extern Size LockShmemSize(void); extern LockData *GetLockStatusData(void); + +extern void ReportLockTableError(bool report); + +typedef struct xl_standby_lock +{ + TransactionId xid; /* xid of holder of AccessExclusiveLock */ + Oid dbOid; + Oid relOid; +} xl_standby_lock; + +extern xl_standby_lock *GetRunningTransactionLocks(int *nlocks); 
extern const char *GetLockmodeName(LOCKMETHODID lockmethodid, LOCKMODE mode); extern void lock_twophase_recover(TransactionId xid, uint16 info, @@ -502,6 +518,8 @@ extern void lock_twophase_postcommit(TransactionId xid, uint16 info, void *recdata, uint32 len); extern void lock_twophase_postabort(TransactionId xid, uint16 info, void *recdata, uint32 len); +extern void lock_twophase_standby_recover(TransactionId xid, uint16 info, + void *recdata, uint32 len); extern DeadLockState DeadLockCheck(PGPROC *proc); extern PGPROC *GetBlockingAutoVacuumPgproc(void); diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h index c8a9042fdc..cc55529128 100644 --- a/src/include/storage/proc.h +++ b/src/include/storage/proc.h @@ -7,7 +7,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/include/storage/proc.h,v 1.114 2009/08/31 19:41:00 tgl Exp $ + * $PostgreSQL: pgsql/src/include/storage/proc.h,v 1.115 2009/12/19 01:32:44 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -95,6 +95,13 @@ struct PGPROC uint8 vacuumFlags; /* vacuum-related flags, see above */ + /* + * While in hot standby mode, setting recoveryConflictMode instructs + * the backend to commit suicide. Possible values are the same as those + * passed to ResolveRecoveryConflictWithVirtualXIDs(). + */ + int recoveryConflictMode; + /* Info about LWLock the process is currently waiting for, if any. 
*/ bool lwWaiting; /* true if waiting for an LW lock */ bool lwExclusive; /* true if waiting for exclusive access */ @@ -135,6 +142,9 @@ typedef struct PROC_HDR PGPROC *autovacFreeProcs; /* Current shared estimate of appropriate spins_per_delay value */ int spins_per_delay; + /* The proc of the Startup process, since not in ProcArray */ + PGPROC *startupProc; + int startupProcPid; } PROC_HDR; /* @@ -165,6 +175,9 @@ extern void InitProcGlobal(void); extern void InitProcess(void); extern void InitProcessPhase2(void); extern void InitAuxiliaryProcess(void); + +extern void PublishStartupProcessInformation(void); + extern bool HaveNFreeProcs(int n); extern void ProcReleaseLocks(bool isCommit); diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h index fab84ee1a0..a7fb379cf6 100644 --- a/src/include/storage/procarray.h +++ b/src/include/storage/procarray.h @@ -7,7 +7,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/include/storage/procarray.h,v 1.26 2009/06/11 14:49:12 momjian Exp $ + * $PostgreSQL: pgsql/src/include/storage/procarray.h,v 1.27 2009/12/19 01:32:44 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -15,6 +15,7 @@ #define PROCARRAY_H #include "storage/lock.h" +#include "storage/standby.h" #include "utils/snapshot.h" @@ -26,6 +27,19 @@ extern void ProcArrayRemove(PGPROC *proc, TransactionId latestXid); extern void ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid); extern void ProcArrayClearTransaction(PGPROC *proc); +extern void ProcArrayInitRecoveryInfo(TransactionId oldestActiveXid); +extern void ProcArrayApplyRecoveryInfo(RunningTransactions running); +extern void ProcArrayApplyXidAssignment(TransactionId topxid, + int nsubxids, TransactionId *subxids); + +extern void RecordKnownAssignedTransactionIds(TransactionId xid); +extern void 
ExpireTreeKnownAssignedTransactionIds(TransactionId xid, + int nsubxids, TransactionId *subxids); +extern void ExpireAllKnownAssignedTransactionIds(void); +extern void ExpireOldKnownAssignedTransactionIds(TransactionId xid); + +extern RunningTransactions GetRunningTransactionData(void); + extern Snapshot GetSnapshotData(Snapshot snapshot); extern bool TransactionIdIsInProgress(TransactionId xid); @@ -42,6 +56,11 @@ extern bool IsBackendPid(int pid); extern VirtualTransactionId *GetCurrentVirtualXIDs(TransactionId limitXmin, bool excludeXmin0, bool allDbs, int excludeVacuum, int *nvxids); +extern VirtualTransactionId *GetConflictingVirtualXIDs(TransactionId limitXmin, + Oid dbOid, bool skipExistingConflicts); +extern pid_t CancelVirtualTransaction(VirtualTransactionId vxid, + int cancel_mode); + extern int CountActiveBackends(void); extern int CountDBBackends(Oid databaseid); extern int CountUserBackends(Oid roleid); diff --git a/src/include/storage/sinval.h b/src/include/storage/sinval.h index 84edb5b31e..2dbebaf9f7 100644 --- a/src/include/storage/sinval.h +++ b/src/include/storage/sinval.h @@ -7,7 +7,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/include/storage/sinval.h,v 1.53 2009/07/31 20:26:23 tgl Exp $ + * $PostgreSQL: pgsql/src/include/storage/sinval.h,v 1.54 2009/12/19 01:32:44 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -100,4 +100,7 @@ extern void HandleCatchupInterrupt(void); extern void EnableCatchupInterrupt(void); extern bool DisableCatchupInterrupt(void); +extern int xactGetCommittedInvalidationMessages(SharedInvalidationMessage **msgs, + bool *RelcacheInitFileInval); + #endif /* SINVAL_H */ diff --git a/src/include/storage/sinvaladt.h b/src/include/storage/sinvaladt.h index 87a8c6d3a1..5d188aaa2f 100644 --- a/src/include/storage/sinvaladt.h +++ 
b/src/include/storage/sinvaladt.h @@ -15,7 +15,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/include/storage/sinvaladt.h,v 1.51 2009/06/11 14:49:12 momjian Exp $ + * $PostgreSQL: pgsql/src/include/storage/sinvaladt.h,v 1.52 2009/12/19 01:32:44 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -29,7 +29,7 @@ */ extern Size SInvalShmemSize(void); extern void CreateSharedInvalidationState(void); -extern void SharedInvalBackendInit(void); +extern void SharedInvalBackendInit(bool sendOnly); extern bool BackendIdIsActive(int backendID); extern void SIInsertDataEntries(const SharedInvalidationMessage *data, int n); diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h new file mode 100644 index 0000000000..45d44f60f6 --- /dev/null +++ b/src/include/storage/standby.h @@ -0,0 +1,106 @@ +/*------------------------------------------------------------------------- + * + * standby.h + * Definitions for hot standby mode. 
+ * + * + * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * $PostgreSQL: pgsql/src/include/storage/standby.h,v 1.1 2009/12/19 01:32:44 sriggs Exp $ + * + *------------------------------------------------------------------------- + */ +#ifndef STANDBY_H +#define STANDBY_H + +#include "access/xlog.h" +#include "storage/lock.h" + +extern int vacuum_defer_cleanup_age; + +/* cancel modes for ResolveRecoveryConflictWithVirtualXIDs */ +#define CONFLICT_MODE_NOT_SET 0 +#define CONFLICT_MODE_ERROR 1 /* Conflict can be resolved by canceling query */ +#define CONFLICT_MODE_FATAL 2 /* Conflict can only be resolved by disconnecting session */ + +extern void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist, + char *reason, int cancel_mode); + +extern void InitRecoveryTransactionEnvironment(void); +extern void ShutdownRecoveryTransactionEnvironment(void); + +/* + * Standby Rmgr (RM_STANDBY_ID) + * + * Standby recovery manager exists to perform actions that are required + * to make hot standby work. That includes logging AccessExclusiveLocks taken + * by transactions and running-xacts snapshots. + */ +extern void StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid); +extern void StandbyReleaseLockTree(TransactionId xid, + int nsubxids, TransactionId *subxids); +extern void StandbyReleaseAllLocks(void); +extern void StandbyReleaseOldLocks(TransactionId removeXid); + +/* + * XLOG message types + */ +#define XLOG_STANDBY_LOCK 0x00 +#define XLOG_RUNNING_XACTS 0x10 + +typedef struct xl_standby_locks +{ + int nlocks; /* number of entries in locks array */ + xl_standby_lock locks[1]; /* VARIABLE LENGTH ARRAY */ +} xl_standby_locks; + +/* + * When we write running xact data to WAL, we use this structure. 
+ */ +typedef struct xl_running_xacts +{ + int xcnt; /* # of xact ids in xids[] */ + bool subxid_overflow; /* snapshot overflowed, subxids missing */ + TransactionId nextXid; /* copy of ShmemVariableCache->nextXid */ + TransactionId oldestRunningXid; /* *not* oldestXmin */ + + TransactionId xids[1]; /* VARIABLE LENGTH ARRAY */ +} xl_running_xacts; + +#define MinSizeOfXactRunningXacts offsetof(xl_running_xacts, xids) + + +/* Recovery handlers for the Standby Rmgr (RM_STANDBY_ID) */ +extern void standby_redo(XLogRecPtr lsn, XLogRecord *record); +extern void standby_desc(StringInfo buf, uint8 xl_info, char *rec); + +/* + * Declarations for GetRunningTransactionData(). Similar to Snapshots, but + * not quite. This has nothing at all to do with visibility on this server, + * so this is completely separate from snapmgr.c and snapmgr.h + * This data is important for creating the initial snapshot state on a + * standby server. We need lots more information than a normal snapshot, + * hence we use a specific data structure for our needs. This data + * is written to WAL as a separate record immediately after each + * checkpoint. That means that wherever we start a standby from we will + * almost immediately see the data we need to begin executing queries. 
+ */ + +typedef struct RunningTransactionsData +{ + int xcnt; /* # of xact ids in xids[] */ + bool subxid_overflow; /* snapshot overflowed, subxids missing */ + TransactionId nextXid; /* copy of ShmemVariableCache->nextXid */ + TransactionId oldestRunningXid; /* *not* oldestXmin */ + + TransactionId *xids; /* array of (sub)xids still running */ +} RunningTransactionsData; + +typedef RunningTransactionsData *RunningTransactions; + +extern void LogAccessExclusiveLock(Oid dbOid, Oid relOid); + +extern void LogStandbySnapshot(TransactionId *oldestActiveXid, TransactionId *nextXid); + +#endif /* STANDBY_H */ diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h index c400c67e0e..67b98632e0 100644 --- a/src/include/utils/builtins.h +++ b/src/include/utils/builtins.h @@ -7,7 +7,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/include/utils/builtins.h,v 1.341 2009/10/21 20:38:58 tgl Exp $ + * $PostgreSQL: pgsql/src/include/utils/builtins.h,v 1.342 2009/12/19 01:32:44 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -730,6 +730,7 @@ extern Datum xidrecv(PG_FUNCTION_ARGS); extern Datum xidsend(PG_FUNCTION_ARGS); extern Datum xideq(PG_FUNCTION_ARGS); extern Datum xid_age(PG_FUNCTION_ARGS); +extern int xidComparator(const void *arg1, const void *arg2); extern Datum cidin(PG_FUNCTION_ARGS); extern Datum cidout(PG_FUNCTION_ARGS); extern Datum cidrecv(PG_FUNCTION_ARGS); diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h index e5003b669a..18094180c4 100644 --- a/src/include/utils/snapshot.h +++ b/src/include/utils/snapshot.h @@ -6,7 +6,7 @@ * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * $PostgreSQL: pgsql/src/include/utils/snapshot.h,v 1.5 2009/06/11 14:49:13 momjian 
Exp $ + * $PostgreSQL: pgsql/src/include/utils/snapshot.h,v 1.6 2009/12/19 01:32:44 sriggs Exp $ * *------------------------------------------------------------------------- */ @@ -49,8 +49,10 @@ typedef struct SnapshotData uint32 xcnt; /* # of xact ids in xip[] */ TransactionId *xip; /* array of xact IDs in progress */ /* note: all ids in xip[] satisfy xmin <= xip[i] < xmax */ - int32 subxcnt; /* # of xact ids in subxip[], -1 if overflow */ + int32 subxcnt; /* # of xact ids in subxip[] */ TransactionId *subxip; /* array of subxact IDs in progress */ + bool suboverflowed; /* has the subxip array overflowed? */ + bool takenDuringRecovery; /* recovery-shaped snapshot? */ /* * note: all ids in subxip[] are >= xmin, but we don't bother filtering diff --git a/src/test/regress/GNUmakefile b/src/test/regress/GNUmakefile index 4a47bdfe4f..3a23918a1c 100644 --- a/src/test/regress/GNUmakefile +++ b/src/test/regress/GNUmakefile @@ -6,7 +6,7 @@ # Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group # Portions Copyright (c) 1994, Regents of the University of California # -# $PostgreSQL: pgsql/src/test/regress/GNUmakefile,v 1.80 2009/12/18 21:28:42 momjian Exp $ +# $PostgreSQL: pgsql/src/test/regress/GNUmakefile,v 1.81 2009/12/19 01:32:45 sriggs Exp $ # #------------------------------------------------------------------------- @@ -149,6 +149,8 @@ installcheck: all installcheck-parallel: all $(pg_regress_call) --psqldir=$(PSQLDIR) --schedule=$(srcdir)/parallel_schedule $(MAXCONNOPT) +standbycheck: all + $(pg_regress_call) --psqldir=$(PSQLDIR) --schedule=$(srcdir)/standby_schedule --use-existing # old interfaces follow... 
diff --git a/src/test/regress/expected/hs_standby_allowed.out b/src/test/regress/expected/hs_standby_allowed.out
new file mode 100644
index 0000000000..1abe5f6fe9
--- /dev/null
+++ b/src/test/regress/expected/hs_standby_allowed.out
@@ -0,0 +1,215 @@
+--
+-- Hot Standby tests
+--
+-- hs_standby_allowed.sql
+--
+-- SELECT
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+select count(*) as should_be_2 from hs2;
+ should_be_2 
+-------------
+           2
+(1 row)
+
+select count(*) as should_be_3 from hs3;
+ should_be_3 
+-------------
+           3
+(1 row)
+
+COPY hs1 TO '/tmp/copy_test';
+\! cat /tmp/copy_test
+1
+-- Access sequence directly
+select min_value as sequence_min_value from hsseq;
+ sequence_min_value 
+--------------------
+                  1
+(1 row)
+
+-- Transactions
+begin;
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+end;
+begin transaction read only;
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+end;
+begin transaction isolation level serializable;
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+commit;
+begin;
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+commit;
+begin;
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+abort;
+start transaction;
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+commit;
+begin;
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+rollback;
+begin;
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+savepoint s;
+select count(*) as should_be_2 from hs2;
+ should_be_2 
+-------------
+           2
+(1 row)
+
+commit;
+begin;
+select count(*) as should_be_1 from hs1;
+ should_be_1 
+-------------
+           1
+(1 row)
+
+savepoint s;
+select count(*) as should_be_2 from hs2;
+ should_be_2 
+-------------
+           2
+(1 row)
+
+release savepoint s;
+select count(*) as should_be_2 from hs2;
+ should_be_2 
+-------------
+           2
+(1 row)
+
+savepoint s;
+select count(*) as should_be_3 from hs3;
+ should_be_3 
+-------------
+           3
+(1 row)
+
+rollback to savepoint s;
+select count(*) as should_be_2 from hs2;
+ should_be_2 
+-------------
+           2
+(1 row)
+
+commit;
+-- SET parameters
+-- has no effect on read only transactions, but we can still set it
+set synchronous_commit = on;
+show synchronous_commit;
+ synchronous_commit 
+--------------------
+ on
+(1 row)
+
+reset synchronous_commit;
+discard temp;
+discard all;
+-- CURSOR commands
+BEGIN;
+DECLARE hsc CURSOR FOR select * from hs3;
+FETCH next from hsc;
+ col1 
+------
+  113
+(1 row)
+
+fetch first from hsc;
+ col1 
+------
+  113
+(1 row)
+
+fetch last from hsc;
+ col1 
+------
+  115
+(1 row)
+
+fetch 1 from hsc;
+ col1 
+------
+(0 rows)
+
+CLOSE hsc;
+COMMIT;
+-- Prepared plans
+PREPARE hsp AS select count(*) from hs1;
+PREPARE hsp_noexec (integer) AS insert into hs1 values ($1);
+EXECUTE hsp;
+ count 
+-------
+     1
+(1 row)
+
+DEALLOCATE hsp;
+-- LOCK
+BEGIN;
+LOCK hs1 IN ACCESS SHARE MODE;
+LOCK hs1 IN ROW SHARE MODE;
+LOCK hs1 IN ROW EXCLUSIVE MODE;
+COMMIT;
+-- LOAD
+-- should work, easier if there is no test for that...
+-- ALLOWED COMMANDS
+CHECKPOINT;
+discard all;
diff --git a/src/test/regress/expected/hs_standby_check.out b/src/test/regress/expected/hs_standby_check.out
new file mode 100644
index 0000000000..df885ea9e0
--- /dev/null
+++ b/src/test/regress/expected/hs_standby_check.out
@@ -0,0 +1,20 @@
+--
+-- Hot Standby tests
+--
+-- hs_standby_check.sql
+--
+--
+-- If the query below returns false then all other tests will fail after it.
+--
+select case pg_is_in_recovery() when false then
+ 'These tests are intended only for execution on a standby server that is reading ' ||
+ 'WAL from a server upon which the regression database is already created and into ' ||
+ 'which src/test/regress/sql/hs_primary_setup.sql has been run'
+else
+ 'Tests are running on a standby server during recovery'
+end;
+                         case                          
+-------------------------------------------------------
+ Tests are running on a standby server during recovery
+(1 row)
+
diff --git a/src/test/regress/expected/hs_standby_disallowed.out b/src/test/regress/expected/hs_standby_disallowed.out
new file mode 100644
index 0000000000..030201d30d
--- /dev/null
+++ b/src/test/regress/expected/hs_standby_disallowed.out
@@ -0,0 +1,137 @@
+--
+-- Hot Standby tests
+--
+-- hs_standby_disallowed.sql
+--
+SET transaction_read_only = off;
+ERROR:  cannot set transaction read-write mode during recovery
+begin transaction read write;
+ERROR:  cannot set transaction read-write mode during recovery
+commit;
+WARNING:  there is no transaction in progress
+-- SELECT
+select * from hs1 FOR SHARE;
+ERROR:  transaction is read-only
+select * from hs1 FOR UPDATE;
+ERROR:  transaction is read-only
+-- DML
+BEGIN;
+insert into hs1 values (37);
+ERROR:  transaction is read-only
+ROLLBACK;
+BEGIN;
+delete from hs1 where col1 = 1;
+ERROR:  transaction is read-only
+ROLLBACK;
+BEGIN;
+update hs1 set col1 = NULL where col1 > 0;
+ERROR:  transaction is read-only
+ROLLBACK;
+BEGIN;
+truncate hs3;
+ERROR:  transaction is read-only
+ROLLBACK;
+-- DDL
+create temporary table hstemp1 (col1 integer);
+ERROR:  transaction is read-only
+BEGIN;
+drop table hs2;
+ERROR:  transaction is read-only
+ROLLBACK;
+BEGIN;
+create table hs4 (col1 integer);
+ERROR:  transaction is read-only
+ROLLBACK;
+-- Sequences
+SELECT nextval('hsseq');
+ERROR:  cannot be executed during recovery
+-- Two-phase commit transaction stuff
+BEGIN;
+SELECT count(*) FROM hs1;
+ count 
+-------
+     1
+(1 row)
+
+PREPARE TRANSACTION 'foobar';
+ERROR:  cannot be executed during recovery
+ROLLBACK;
+BEGIN;
+SELECT count(*) FROM hs1;
+ count 
+-------
+     1
+(1 row)
+
+COMMIT PREPARED 'foobar';
+ERROR:  cannot be executed during recovery
+ROLLBACK;
+BEGIN;
+SELECT count(*) FROM hs1;
+ count 
+-------
+     1
+(1 row)
+
+PREPARE TRANSACTION 'foobar';
+ERROR:  cannot be executed during recovery
+ROLLBACK PREPARED 'foobar';
+ERROR:  current transaction is aborted, commands ignored until end of transaction block
+ROLLBACK;
+BEGIN;
+SELECT count(*) FROM hs1;
+ count 
+-------
+     1
+(1 row)
+
+ROLLBACK PREPARED 'foobar';
+ERROR:  cannot be executed during recovery
+ROLLBACK;
+-- Locks
+BEGIN;
+LOCK hs1;
+ERROR:  cannot be executed during recovery
+COMMIT;
+BEGIN;
+LOCK hs1 IN SHARE UPDATE EXCLUSIVE MODE;
+ERROR:  cannot be executed during recovery
+COMMIT;
+BEGIN;
+LOCK hs1 IN SHARE MODE;
+ERROR:  cannot be executed during recovery
+COMMIT;
+BEGIN;
+LOCK hs1 IN SHARE ROW EXCLUSIVE MODE;
+ERROR:  cannot be executed during recovery
+COMMIT;
+BEGIN;
+LOCK hs1 IN EXCLUSIVE MODE;
+ERROR:  cannot be executed during recovery
+COMMIT;
+BEGIN;
+LOCK hs1 IN ACCESS EXCLUSIVE MODE;
+ERROR:  cannot be executed during recovery
+COMMIT;
+-- Listen
+listen a;
+ERROR:  cannot be executed during recovery
+notify a;
+ERROR:  cannot be executed during recovery
+unlisten a;
+ERROR:  cannot be executed during recovery
+unlisten *;
+ERROR:  cannot be executed during recovery
+-- disallowed commands
+ANALYZE hs1;
+ERROR:  cannot be executed during recovery
+VACUUM hs2;
+ERROR:  cannot be executed during recovery
+CLUSTER hs2 using hs1_pkey;
+ERROR:  cannot be executed during recovery
+REINDEX TABLE hs2;
+ERROR:  cannot be executed during recovery
+REVOKE SELECT ON hs1 FROM PUBLIC;
+ERROR:  transaction is read-only
+GRANT SELECT ON hs1 TO PUBLIC;
+ERROR:  transaction is read-only
diff --git a/src/test/regress/expected/hs_standby_functions.out b/src/test/regress/expected/hs_standby_functions.out
new file mode 100644
index 0000000000..edcf1c72ad
--- /dev/null
+++ b/src/test/regress/expected/hs_standby_functions.out
@@ -0,0 +1,40 @@
+--
+-- Hot Standby tests
+--
+-- hs_standby_functions.sql
+--
+-- should fail
+select txid_current();
+ERROR:  cannot be executed during recovery
+select length(txid_current_snapshot()::text) >= 4;
+ ?column? 
+----------
+ t
+(1 row)
+
+select pg_start_backup('should fail');
+ERROR:  recovery is in progress
+HINT:  WAL control functions cannot be executed during recovery.
+select pg_switch_xlog();
+ERROR:  recovery is in progress
+HINT:  WAL control functions cannot be executed during recovery.
+select pg_stop_backup();
+ERROR:  recovery is in progress
+HINT:  WAL control functions cannot be executed during recovery.
+-- should return no rows
+select * from pg_prepared_xacts;
+ transaction | gid | prepared | owner | database 
+-------------+-----+----------+-------+----------
+(0 rows)
+
+-- just the startup process
+select locktype, virtualxid, virtualtransaction, mode, granted
+from pg_locks where virtualxid = '1/1';
+  locktype  | virtualxid | virtualtransaction |     mode      | granted 
+------------+------------+--------------------+---------------+---------
+ virtualxid | 1/1        | 1/0                | ExclusiveLock | t
+(1 row)
+
+-- suicide is painless
+select pg_cancel_backend(pg_backend_pid());
+ERROR:  canceling statement due to user request
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 7fe472b503..78c30bdb2f 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -11,7 +11,7 @@
  * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
- * $PostgreSQL: pgsql/src/test/regress/pg_regress.c,v 1.67 2009/11/23 16:02:24 tgl Exp $
+ * $PostgreSQL: pgsql/src/test/regress/pg_regress.c,v 1.68 2009/12/19 01:32:45 sriggs Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -93,6 +93,7 @@ static char *temp_install = NULL;
 static char *temp_config = NULL;
 static char *top_builddir = NULL;
 static bool nolocale = false;
+static bool use_existing = false;
 static char *hostname = NULL;
 static int	port = -1;
 static bool port_specified_by_user = false;
@@ -1545,7 +1546,7 @@ run_schedule(const char *schedule, test_function tfunc)
 
 		if (num_tests == 1)
 		{
-			status(_("test %-20s ... "), tests[0]);
+			status(_("test %-24s ... "), tests[0]);
 			pids[0] = (tfunc) (tests[0], &resultfiles[0], &expectfiles[0], &tags[0]);
 			wait_for_tests(pids, statuses, NULL, 1);
 			/* status line is finished below */
@@ -1590,7 +1591,7 @@ run_schedule(const char *schedule, test_function tfunc)
 			bool		differ = false;
 
 			if (num_tests > 1)
-				status(_("     %-20s ... "), tests[i]);
+				status(_("     %-24s ... "), tests[i]);
 
 			/*
 			 * Advance over all three lists simultaneously.
@@ -1918,6 +1919,7 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		{"dlpath", required_argument, NULL, 17},
 		{"create-role", required_argument, NULL, 18},
 		{"temp-config", required_argument, NULL, 19},
+		{"use-existing", no_argument, NULL, 20},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -2008,6 +2010,9 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 			case 19:
 				temp_config = strdup(optarg);
 				break;
+			case 20:
+				use_existing = true;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				fprintf(stderr, _("\nTry \"%s -h\" for more information.\n"),
@@ -2254,19 +2259,25 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Using an existing installation, so may need to get rid of
 		 * pre-existing database(s) and role(s)
 		 */
-		for (sl = dblist; sl; sl = sl->next)
-			drop_database_if_exists(sl->str);
-		for (sl = extraroles; sl; sl = sl->next)
-			drop_role_if_exists(sl->str);
+		if (!use_existing)
+		{
+			for (sl = dblist; sl; sl = sl->next)
+				drop_database_if_exists(sl->str);
+			for (sl = extraroles; sl; sl = sl->next)
+				drop_role_if_exists(sl->str);
+		}
 	}
 
 	/*
 	 * Create the test database(s) and role(s)
 	 */
-	for (sl = dblist; sl; sl = sl->next)
-		create_database(sl->str);
-	for (sl = extraroles; sl; sl = sl->next)
-		create_role(sl->str, dblist);
+	if (!use_existing)
+	{
+		for (sl = dblist; sl; sl = sl->next)
+			create_database(sl->str);
+		for (sl = extraroles; sl; sl = sl->next)
+			create_role(sl->str, dblist);
+	}
 
 	/*
 	 * Ready to run the tests
diff --git a/src/test/regress/sql/hs_primary_extremes.sql b/src/test/regress/sql/hs_primary_extremes.sql
new file mode 100644
index 0000000000..900bd1924e
--- /dev/null
+++ b/src/test/regress/sql/hs_primary_extremes.sql
@@ -0,0 +1,74 @@
+--
+-- Hot Standby tests
+--
+-- hs_primary_extremes.sql
+--
+
+drop table if exists hs_extreme;
+create table hs_extreme (col1 integer);
+
+CREATE OR REPLACE FUNCTION hs_subxids (n integer)
+RETURNS void
+LANGUAGE plpgsql
+AS $$
+	BEGIN
+		IF n <= 0 THEN RETURN; END IF;
+		INSERT INTO hs_extreme VALUES (n);
+		PERFORM hs_subxids(n - 1);
+		RETURN;
+	EXCEPTION WHEN raise_exception THEN NULL; END;
+$$;
+
+BEGIN;
+SELECT hs_subxids(257);
+ROLLBACK;
+BEGIN;
+SELECT hs_subxids(257);
+COMMIT;
+
+set client_min_messages = 'warning';
+
+CREATE OR REPLACE FUNCTION hs_locks_create (n integer)
+RETURNS void
+LANGUAGE plpgsql
+AS $$
+	BEGIN
+		IF n <= 0 THEN
+			CHECKPOINT;
+			RETURN;
+		END IF;
+		EXECUTE 'CREATE TABLE hs_locks_' || n::text || ' ()';
+		PERFORM hs_locks_create(n - 1);
+		RETURN;
+	EXCEPTION WHEN raise_exception THEN NULL; END;
+$$;
+
+CREATE OR REPLACE FUNCTION hs_locks_drop (n integer)
+RETURNS void
+LANGUAGE plpgsql
+AS $$
+	BEGIN
+		IF n <= 0 THEN
+			CHECKPOINT;
+			RETURN;
+		END IF;
+		EXECUTE 'DROP TABLE IF EXISTS hs_locks_' || n::text;
+		PERFORM hs_locks_drop(n - 1);
+		RETURN;
+	EXCEPTION WHEN raise_exception THEN NULL; END;
+$$;
+
+BEGIN;
+SELECT hs_locks_drop(257);
+SELECT hs_locks_create(257);
+SELECT count(*) > 257 FROM pg_locks;
+ROLLBACK;
+BEGIN;
+SELECT hs_locks_drop(257);
+SELECT hs_locks_create(257);
+SELECT count(*) > 257 FROM pg_locks;
+COMMIT;
+SELECT hs_locks_drop(257);
+
+SELECT pg_switch_xlog();
+
diff --git a/src/test/regress/sql/hs_primary_setup.sql b/src/test/regress/sql/hs_primary_setup.sql
new file mode 100644
index 0000000000..a00b367cbc
--- /dev/null
+++ b/src/test/regress/sql/hs_primary_setup.sql
@@ -0,0 +1,25 @@
+--
+-- Hot Standby tests
+--
+-- hs_primary_setup.sql
+--
+
+drop table if exists hs1;
+create table hs1 (col1 integer primary key);
+insert into hs1 values (1);
+
+drop table if exists hs2;
+create table hs2 (col1 integer primary key);
+insert into hs2 values (12);
+insert into hs2 values (13);
+
+drop table if exists hs3;
+create table hs3 (col1 integer primary key);
+insert into hs3 values (113);
+insert into hs3 values (114);
+insert into hs3 values (115);
+
+DROP sequence if exists hsseq;
+create sequence hsseq;
+
+SELECT pg_switch_xlog();
diff --git a/src/test/regress/sql/hs_standby_allowed.sql b/src/test/regress/sql/hs_standby_allowed.sql
new file mode 100644
index 0000000000..58e2c010d3
--- /dev/null
+++ b/src/test/regress/sql/hs_standby_allowed.sql
@@ -0,0 +1,121 @@
+--
+-- Hot Standby tests
+--
+-- hs_standby_allowed.sql
+--
+
+-- SELECT
+
+select count(*) as should_be_1 from hs1;
+
+select count(*) as should_be_2 from hs2;
+
+select count(*) as should_be_3 from hs3;
+
+COPY hs1 TO '/tmp/copy_test';
+\! cat /tmp/copy_test
+
+-- Access sequence directly
+select min_value as sequence_min_value from hsseq;
+
+-- Transactions
+
+begin;
+select count(*) as should_be_1 from hs1;
+end;
+
+begin transaction read only;
+select count(*) as should_be_1 from hs1;
+end;
+
+begin transaction isolation level serializable;
+select count(*) as should_be_1 from hs1;
+select count(*) as should_be_1 from hs1;
+select count(*) as should_be_1 from hs1;
+commit;
+
+begin;
+select count(*) as should_be_1 from hs1;
+commit;
+
+begin;
+select count(*) as should_be_1 from hs1;
+abort;
+
+start transaction;
+select count(*) as should_be_1 from hs1;
+commit;
+
+begin;
+select count(*) as should_be_1 from hs1;
+rollback;
+
+begin;
+select count(*) as should_be_1 from hs1;
+savepoint s;
+select count(*) as should_be_2 from hs2;
+commit;
+
+begin;
+select count(*) as should_be_1 from hs1;
+savepoint s;
+select count(*) as should_be_2 from hs2;
+release savepoint s;
+select count(*) as should_be_2 from hs2;
+savepoint s;
+select count(*) as should_be_3 from hs3;
+rollback to savepoint s;
+select count(*) as should_be_2 from hs2;
+commit;
+
+-- SET parameters
+
+-- has no effect on read only transactions, but we can still set it
+set synchronous_commit = on;
+show synchronous_commit;
+reset synchronous_commit;
+
+discard temp;
+discard all;
+
+-- CURSOR commands
+
+BEGIN;
+
+DECLARE hsc CURSOR FOR select * from hs3;
+
+FETCH next from hsc;
+fetch first from hsc;
+fetch last from hsc;
+fetch 1 from hsc;
+
+CLOSE hsc;
+
+COMMIT;
+
+-- Prepared plans
+
+PREPARE hsp AS select count(*) from hs1;
+PREPARE hsp_noexec (integer) AS insert into hs1 values ($1);
+
+EXECUTE hsp;
+
+DEALLOCATE hsp;
+
+-- LOCK
+
+BEGIN;
+LOCK hs1 IN ACCESS SHARE MODE;
+LOCK hs1 IN ROW SHARE MODE;
+LOCK hs1 IN ROW EXCLUSIVE MODE;
+COMMIT;
+
+-- LOAD
+-- should work, easier if there is no test for that...
+
+
+-- ALLOWED COMMANDS
+
+CHECKPOINT;
+
+discard all;
diff --git a/src/test/regress/sql/hs_standby_check.sql b/src/test/regress/sql/hs_standby_check.sql
new file mode 100644
index 0000000000..3fe8a02720
--- /dev/null
+++ b/src/test/regress/sql/hs_standby_check.sql
@@ -0,0 +1,16 @@
+--
+-- Hot Standby tests
+--
+-- hs_standby_check.sql
+--
+
+--
+-- If the query below returns false then all other tests will fail after it.
+--
+select case pg_is_in_recovery() when false then
+ 'These tests are intended only for execution on a standby server that is reading ' ||
+ 'WAL from a server upon which the regression database is already created and into ' ||
+ 'which src/test/regress/sql/hs_primary_setup.sql has been run'
+else
+ 'Tests are running on a standby server during recovery'
+end;
diff --git a/src/test/regress/sql/hs_standby_disallowed.sql b/src/test/regress/sql/hs_standby_disallowed.sql
new file mode 100644
index 0000000000..21bbf526b7
--- /dev/null
+++ b/src/test/regress/sql/hs_standby_disallowed.sql
@@ -0,0 +1,105 @@
+--
+-- Hot Standby tests
+--
+-- hs_standby_disallowed.sql
+--
+
+SET transaction_read_only = off;
+
+begin transaction read write;
+commit;
+
+-- SELECT
+
+select * from hs1 FOR SHARE;
+select * from hs1 FOR UPDATE;
+
+-- DML
+BEGIN;
+insert into hs1 values (37);
+ROLLBACK;
+BEGIN;
+delete from hs1 where col1 = 1;
+ROLLBACK;
+BEGIN;
+update hs1 set col1 = NULL where col1 > 0;
+ROLLBACK;
+BEGIN;
+truncate hs3;
+ROLLBACK;
+
+-- DDL
+
+create temporary table hstemp1 (col1 integer);
+BEGIN;
+drop table hs2;
+ROLLBACK;
+BEGIN;
+create table hs4 (col1 integer);
+ROLLBACK;
+
+-- Sequences
+
+SELECT nextval('hsseq');
+
+-- Two-phase commit transaction stuff
+
+BEGIN;
+SELECT count(*) FROM hs1;
+PREPARE TRANSACTION 'foobar';
+ROLLBACK;
+BEGIN;
+SELECT count(*) FROM hs1;
+COMMIT PREPARED 'foobar';
+ROLLBACK;
+
+BEGIN;
+SELECT count(*) FROM hs1;
+PREPARE TRANSACTION 'foobar';
+ROLLBACK PREPARED 'foobar';
+ROLLBACK;
+
+BEGIN;
+SELECT count(*) FROM hs1;
+ROLLBACK PREPARED 'foobar';
+ROLLBACK;
+
+
+-- Locks
+BEGIN;
+LOCK hs1;
+COMMIT;
+BEGIN;
+LOCK hs1 IN SHARE UPDATE EXCLUSIVE MODE;
+COMMIT;
+BEGIN;
+LOCK hs1 IN SHARE MODE;
+COMMIT;
+BEGIN;
+LOCK hs1 IN SHARE ROW EXCLUSIVE MODE;
+COMMIT;
+BEGIN;
+LOCK hs1 IN EXCLUSIVE MODE;
+COMMIT;
+BEGIN;
+LOCK hs1 IN ACCESS EXCLUSIVE MODE;
+COMMIT;
+
+-- Listen
+listen a;
+notify a;
+unlisten a;
+unlisten *;
+
+-- disallowed commands
+
+ANALYZE hs1;
+
+VACUUM hs2;
+
+CLUSTER hs2 using hs1_pkey;
+
+REINDEX TABLE hs2;
+
+REVOKE SELECT ON hs1 FROM PUBLIC;
+GRANT SELECT ON hs1 TO PUBLIC;
diff --git a/src/test/regress/sql/hs_standby_functions.sql b/src/test/regress/sql/hs_standby_functions.sql
new file mode 100644
index 0000000000..7577045f11
--- /dev/null
+++ b/src/test/regress/sql/hs_standby_functions.sql
@@ -0,0 +1,24 @@
+--
+-- Hot Standby tests
+--
+-- hs_standby_functions.sql
+--
+
+-- should fail
+select txid_current();
+
+select length(txid_current_snapshot()::text) >= 4;
+
+select pg_start_backup('should fail');
+select pg_switch_xlog();
+select pg_stop_backup();
+
+-- should return no rows
+select * from pg_prepared_xacts;
+
+-- just the startup process
+select locktype, virtualxid, virtualtransaction, mode, granted
+from pg_locks where virtualxid = '1/1';
+
+-- suicide is painless
+select pg_cancel_backend(pg_backend_pid());
diff --git a/src/test/regress/standby_schedule b/src/test/regress/standby_schedule
new file mode 100644
index 0000000000..7e239d4b28
--- /dev/null
+++ b/src/test/regress/standby_schedule
@@ -0,0 +1,21 @@
+# $PostgreSQL: pgsql/src/test/regress/standby_schedule,v 1.1 2009/12/19 01:32:45 sriggs Exp $
+#
+# Test schedule for Hot Standby
+#
+# First test checks we are on a standby server.
+# Subsequent tests rely upon a setup script having already
+# been executed in the appropriate database on the primary server
+# which is feeding WAL files to target standby.
+#
+# psql -f src/test/regress/sql/hs_primary_setup.sql regression
+#
+test: hs_standby_check
+#
+# These tests will pass on both primary and standby servers
+#
+test: hs_standby_allowed
+#
+# These tests will fail on a non-standby server
+#
+test: hs_standby_disallowed
+test: hs_standby_functions