postgresql/src/backend/access/transam/rmgr.c

/*
 * rmgr.c
 *
 * Resource managers definition
 *
 * src/backend/access/transam/rmgr.c
 */
#include "postgres.h"

#include "access/clog.h"
#include "access/gin.h"
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
#include "catalog/storage.h"
#include "commands/dbcommands.h"
#include "commands/sequence.h"
#include "commands/tablespace.h"
#include "storage/standby.h"
#include "utils/relmapper.h"


const RmgrData RmgrTable[RM_MAX_ID + 1] = {
	{"XLOG", xlog_redo, xlog_desc, NULL, NULL, NULL},
	{"Transaction", xact_redo, xact_desc, NULL, NULL, NULL},
	{"Storage", smgr_redo, smgr_desc, NULL, NULL, NULL},
	{"CLOG", clog_redo, clog_desc, NULL, NULL, NULL},
	{"Database", dbase_redo, dbase_desc, NULL, NULL, NULL},
	{"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL},
	{"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL},
	{"RelMap", relmap_redo, relmap_desc, NULL, NULL, NULL},
	{"Standby", standby_redo, standby_desc, NULL, NULL, NULL},
	{"Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL},
	{"Heap", heap_redo, heap_desc, NULL, NULL, NULL},
	{"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint},
	{"Hash", hash_redo, hash_desc, NULL, NULL, NULL},
	{"Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup, gin_safe_restartpoint},
	{"Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup, NULL},
	{"Sequence", seq_redo, seq_desc, NULL, NULL, NULL}
};
Replace implementation of pg_log as a relation accessed through the buffer manager with 'pg_clog', a specialized access method modeled on pg_xlog. This simplifies startup (don't need to play games to open pg_log; among other things, OverrideTransactionSystem goes away), should improve performance a little, and opens the door to recycling commit log space by removing no-longer-needed segments of the commit log. Actual recycling is not there yet, but I felt I should commit this part separately since it'd still be useful if we chose not to do transaction ID wraparound. 2001-08-25 20:52:43 +02:00			`/*`
			`* rmgr.c`
			`*`
			`* Resource managers definition`
			`*`
Remove cvs keywords from all files. 2010-09-20 22:08:53 +02:00			`* src/backend/access/transam/rmgr.c`
Replace implementation of pg_log as a relation accessed through the buffer manager with 'pg_clog', a specialized access method modeled on pg_xlog. This simplifies startup (don't need to play games to open pg_log; among other things, OverrideTransactionSystem goes away), should improve performance a little, and opens the door to recycling commit log space by removing no-longer-needed segments of the commit log. Actual recycling is not there yet, but I felt I should commit this part separately since it'd still be useful if we chose not to do transaction ID wraparound. 2001-08-25 20:52:43 +02:00			`*/`
Transaction log manager core code. It doesn't work currently but also don't break anything -:) 1999-09-27 17:48:12 +02:00			`#include "postgres.h"`
Replace implementation of pg_log as a relation accessed through the buffer manager with 'pg_clog', a specialized access method modeled on pg_xlog. This simplifies startup (don't need to play games to open pg_log; among other things, OverrideTransactionSystem goes away), should improve performance a little, and opens the door to recycling commit log space by removing no-longer-needed segments of the commit log. Actual recycling is not there yet, but I felt I should commit this part separately since it'd still be useful if we chose not to do transaction ID wraparound. 2001-08-25 20:52:43 +02:00
Add WAL logging for CREATE/DROP DATABASE and CREATE/DROP TABLESPACE. Fix TablespaceCreateDbspace() to be able to create a dummy directory in place of a dropped tablespace's symlink. This eliminates the open problem of a PANIC during WAL replay when a replayed action attempts to touch a file in a since-deleted tablespace. It also makes for a significant improvement in the usability of PITR replay. 2004-08-29 23:08:48 +02:00			`#include "access/clog.h"`
Alphabetically order reference to include files, "N" - "S". 2006-07-11 19:26:59 +02:00			`#include "access/gin.h"`
Cleanup GiST header files. Since GiST extensions are often written as external projects, we should be careful about what parts of the GiST API are considered implementation details, and which are part of the public API. Therefore, I've moved internal-only declarations into gist_private.h -- future backward-incompatible changes to gist.h should be made with care, to avoid needlessly breaking external GiST extensions. Also did some related header cleanup: remove some unnecessary #includes from gist.h, and remove some unused definitions: isAttByVal(), _gistdump(), and GISTNStrategies. 2005-05-17 05:34:18 +02:00			`#include "access/gist_private.h"`
Put external declarations into header files. 2000-11-21 22:16:06 +01:00			`#include "access/hash.h"`
			`#include "access/heapam.h"`
Change WAL-logging scheme for multixacts to be more like regular transaction IDs, rather than like subtrans; in particular, the information now survives a database restart. Per previous discussion, this is essential for PITR log shipping and for 2PC. 2005-06-08 17:50:28 +02:00			`#include "access/multixact.h"`
Put external declarations into header files. 2000-11-21 22:16:06 +01:00			`#include "access/nbtree.h"`
			`#include "access/xact.h"`
Invent WAL timelines, as per recent discussion, to make point-in-time recovery more manageable. Also, undo recent change to add FILE_HEADER and WASTED_SPACE records to XLOG; instead make the XLOG page header variable-size with extra fields in the first page of an XLOG file. This should fix the boundary-case bugs observed by Mark Kirkwood. initdb forced due to change of XLOG representation. 2004-07-22 00:31:26 +02:00			`#include "access/xlog_internal.h"`
Rethink the way FSM truncation works. Instead of WAL-logging FSM truncations in FSM code, call FreeSpaceMapTruncateRel from smgr_redo. To make that cleaner from modularity point of view, move the WAL-logging one level up to RelationTruncate, and move RelationTruncate and all the related WAL-logging to new src/backend/catalog/storage.c file. Introduce new RelationCreateStorage and RelationDropStorage functions that are used instead of calling smgrcreate/smgrscheduleunlink directly. Move the pending rel deletion stuff from smgrcreate/smgrscheduleunlink to the new functions. This leaves smgr.c as a thin wrapper around md.c; all the transactional stuff is now in storage.c. This will make it easier to add new forks with similar truncation logic, like the visibility map. 2008-11-19 11:34:52 +01:00			`#include "catalog/storage.h"`
Add WAL logging for CREATE/DROP DATABASE and CREATE/DROP TABLESPACE. Fix TablespaceCreateDbspace() to be able to create a dummy directory in place of a dropped tablespace's symlink. This eliminates the open problem of a PANIC during WAL replay when a replayed action attempts to touch a file in a since-deleted tablespace. It also makes for a significant improvement in the usability of PITR replay. 2004-08-29 23:08:48 +02:00			`#include "commands/dbcommands.h"`
XLOG stuff for sequences. CommitDelay in guc.c 2000-11-30 02:47:33 +01:00			`#include "commands/sequence.h"`
Add WAL logging for CREATE/DROP DATABASE and CREATE/DROP TABLESPACE. Fix TablespaceCreateDbspace() to be able to create a dummy directory in place of a dropped tablespace's symlink. This eliminates the open problem of a PANIC during WAL replay when a replayed action attempts to touch a file in a since-deleted tablespace. It also makes for a significant improvement in the usability of PITR replay. 2004-08-29 23:08:48 +02:00			`#include "commands/tablespace.h"`
Allow read only connections during recovery, known as Hot Standby. Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record. New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far. This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required. Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit. Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members. 2009-12-19 02:32:45 +01:00			`#include "storage/standby.h"`
Create a "relation mapping" infrastructure to support changing the relfilenodes of shared or nailed system catalogs. This has two key benefits: * The new CLUSTER-based VACUUM FULL can be applied safely to all catalogs. * We no longer have to use an unsafe reindex-in-place approach for reindexing shared catalogs. CLUSTER on nailed catalogs now works too, although I left it disabled on shared catalogs because the resulting pg_index.indisclustered update would only be visible in one database. Since reindexing shared system catalogs is now fully transactional and crash-safe, the former special cases in REINDEX behavior have been removed; shared catalogs are treated the same as non-shared. This commit does not do anything about the recently-discussed problem of deadlocks between VACUUM FULL/CLUSTER on a system catalog and other concurrent queries; will address that in a separate patch. As a stopgap, parallel_schedule has been tweaked to run vacuum.sql by itself, to avoid such failures during the regression tests. 2010-02-07 21:48:13 +01:00			`#include "utils/relmapper.h"`
WAL 2000-10-21 17:43:36 +02:00
Replace implementation of pg_log as a relation accessed through the buffer manager with 'pg_clog', a specialized access method modeled on pg_xlog. This simplifies startup (don't need to play games to open pg_log; among other things, OverrideTransactionSystem goes away), should improve performance a little, and opens the door to recycling commit log space by removing no-longer-needed segments of the commit log. Actual recycling is not there yet, but I felt I should commit this part separately since it'd still be useful if we chose not to do transaction ID wraparound. 2001-08-25 20:52:43 +02:00
Invent WAL timelines, as per recent discussion, to make point-in-time recovery more manageable. Also, undo recent change to add FILE_HEADER and WASTED_SPACE records to XLOG; instead make the XLOG page header variable-size with extra fields in the first page of an XLOG file. This should fix the boundary-case bugs observed by Mark Kirkwood. initdb forced due to change of XLOG representation. 2004-07-22 00:31:26 +02:00			`const RmgrData RmgrTable[RM_MAX_ID + 1] = {`
Make recovery from WAL be restartable, by executing a checkpoint-like operation every so often. This improves the usefulness of PITR log shipping for hot standby: formerly, if the standby server crashed, it was necessary to restart it from the last base backup and replay all the WAL since then. Now it will only need to reread about the same amount of WAL as the master server would. The behavior might also come in handy during a long PITR replay sequence. Simon Riggs, with some editorialization by Tom Lane. 2006-08-07 18:57:57 +02:00			`{"XLOG", xlog_redo, xlog_desc, NULL, NULL, NULL},`
			`{"Transaction", xact_redo, xact_desc, NULL, NULL, NULL},`
			`{"Storage", smgr_redo, smgr_desc, NULL, NULL, NULL},`
			`{"CLOG", clog_redo, clog_desc, NULL, NULL, NULL},`
			`{"Database", dbase_redo, dbase_desc, NULL, NULL, NULL},`
			`{"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL},`
			`{"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL},`
Create a "relation mapping" infrastructure to support changing the relfilenodes of shared or nailed system catalogs. This has two key benefits: * The new CLUSTER-based VACUUM FULL can be applied safely to all catalogs. * We no longer have to use an unsafe reindex-in-place approach for reindexing shared catalogs. CLUSTER on nailed catalogs now works too, although I left it disabled on shared catalogs because the resulting pg_index.indisclustered update would only be visible in one database. Since reindexing shared system catalogs is now fully transactional and crash-safe, the former special cases in REINDEX behavior have been removed; shared catalogs are treated the same as non-shared. This commit does not do anything about the recently-discussed problem of deadlocks between VACUUM FULL/CLUSTER on a system catalog and other concurrent queries; will address that in a separate patch. As a stopgap, parallel_schedule has been tweaked to run vacuum.sql by itself, to avoid such failures during the regression tests. 2010-02-07 21:48:13 +01:00			`{"RelMap", relmap_redo, relmap_desc, NULL, NULL, NULL},`
Allow read only connections during recovery, known as Hot Standby. Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record. New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far. This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required. Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit. Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members. 2009-12-19 02:32:45 +01:00			`{"Standby", standby_redo, standby_desc, NULL, NULL, NULL},`
Fix recently-understood problems with handling of XID freezing, particularly in PITR scenarios. We now WAL-log the replacement of old XIDs with FrozenTransactionId, so that such replacement is guaranteed to propagate to PITR slave databases. Also, rather than relying on hint-bit updates to be preserved, pg_clog is not truncated until all instances of an XID are known to have been replaced by FrozenTransactionId. Add new GUC variables and pg_autovacuum columns to allow management of the freezing policy, so that users can trade off the size of pg_clog against the amount of freezing work done. Revise the already-existing code that forces autovacuum of tables approaching the wraparound point to make it more bulletproof; also, revise the autovacuum logic so that anti-wraparound vacuuming is done per-table rather than per-database. initdb forced because of changes in pg_class, pg_database, and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane. 2006-11-05 23:42:10 +01:00			`{"Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL},`
Make recovery from WAL be restartable, by executing a checkpoint-like operation every so often. This improves the usefulness of PITR log shipping for hot standby: formerly, if the standby server crashed, it was necessary to restart it from the last base backup and replay all the WAL since then. Now it will only need to reread about the same amount of WAL as the master server would. The behavior might also come in handy during a long PITR replay sequence. Simon Riggs, with some editorialization by Tom Lane. 2006-08-07 18:57:57 +02:00			`{"Heap", heap_redo, heap_desc, NULL, NULL, NULL},`
			`{"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint},`
			`{"Hash", hash_redo, hash_desc, NULL, NULL, NULL},`
			`{"Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup, gin_safe_restartpoint},`
Rewrite the GiST insertion logic so that we don't need the post-recovery cleanup stage to finish incomplete inserts or splits anymore. There was two reasons for the cleanup step: 1. When a new tuple was inserted to a leaf page, the downlink in the parent needed to be updated to contain (ie. to be consistent with) the new key. Updating the parent in turn might require recursively updating the parent of the parent. We now handle that by updating the parent while traversing down the tree, so that when we insert the leaf tuple, all the parents are already consistent with the new key, and the tree is consistent at every step. 2. When a page is split, we need to insert the downlink for the new right page(s), and update the downlink for the original page to not include keys that moved to the right page(s). We now handle that by setting a new flag, F_FOLLOW_RIGHT, on the non-rightmost pages in the split. When that flag is set, scans always follow the rightlink, regardless of the NSN mechanism used to detect concurrent page splits. That way the tree is consistent right after split, even though the downlink is still missing. This is very similar to the way B-tree splits are handled. When the downlink is inserted in the parent, the flag is cleared. To keep the insertion algorithm simple, when an insertion sees an incomplete split, indicated by the F_FOLLOW_RIGHT flag, it finishes the split before doing anything else. These changes allow removing the whole "invalid tuple" mechanism, but I retained the scan code to still follow invalid tuples correctly. While we don't create any such tuples anymore, we want to handle them gracefully in case you pg_upgrade a GiST index that has them. If we encounter any on an insert, though, we just throw an error saying that you need to REINDEX. The issue that got me into doing this is that if you did a checkpoint while an insert or split was in progress, and the checkpoint finishes quickly so that there is no WAL record related to the insert between RedoRecPtr and the checkpoint record, recovery from that checkpoint would not know to finish the incomplete insert. IOW, we have the same issue we solved with the rm_safe_restartpoint mechanism during normal operation too. It's highly unlikely to happen in practice, and this fix is far too large to backpatch, so we're just going to live with in previous versions, but this refactoring fixes it going forward. With this patch, you don't get the annoying 'index "FOO" needs VACUUM or REINDEX to finish crash recovery' notices anymore if you crash at an unfortunate moment. 2010-12-23 15:03:08 +01:00			`{"Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup, NULL},`
Make recovery from WAL be restartable, by executing a checkpoint-like operation every so often. This improves the usefulness of PITR log shipping for hot standby: formerly, if the standby server crashed, it was necessary to restart it from the last base backup and replay all the WAL since then. Now it will only need to reread about the same amount of WAL as the master server would. The behavior might also come in handy during a long PITR replay sequence. Simon Riggs, with some editorialization by Tom Lane. 2006-08-07 18:57:57 +02:00			`{"Sequence", seq_redo, seq_desc, NULL, NULL, NULL}`
WAL 2000-10-21 17:43:36 +02:00			`};`