postgresql

mirror of https://git.postgresql.org/git/postgresql.git synced 2024-09-30 15:11:16 +02:00

Author	SHA1	Message	Date
Peter Eisentraut	c248d17120	Message tuning	2010-03-21 00:17:59 +00:00
Simon Riggs	3cdafe40e7	Adjust comment in .history file to match recovery target specified. Comment present since 8.0 was never fully meaningful, since two recovery targets cannot be specified. Refactor recovery target type to make this change and associated code easier to understand. No change in function. Bug report arising from internal support question.	2010-03-19 11:05:15 +00:00
Simon Riggs	5c73ae17d1	Reset btpo.xact following recovery of btree delete page. Add btpo_xact field into WAL record and reset it from there, rather than using FrozenTransactionId which can lead to some corner case bugs. Problem report and suggested route to a fix from Heikki, details by me.	2010-03-19 10:41:22 +00:00
Heikki Linnakangas	c21ac0b58e	Add restartpoint_command option to recovery.conf. Fix bug in %r handling in recovery_end_command, it always came out as 0 because InRedo was cleared before recovery_end_command was executed. Also, always take ControlFileLock when reading checkpoint location for %r. The recovery_end_command bug and the missing locking was present in 8.4 as well, that part of this patch will be backported separately.	2010-03-18 09:17:18 +00:00
Simon Riggs	1a163a0c68	Remove incorrect comment from GetWriteRecPtr(): the return value is always correct, as described in comments at start of xlog.c	2010-03-15 18:49:17 +00:00
Tom Lane	1f44a313bd	Add missing reset of need_initialization in reloptions code. This resulted in useless extra work during every call of parseRelOptions, but no bad effects other than that. Noted by Alvaro.	2010-03-11 21:47:19 +00:00
Itagaki Takahiro	17d8de0e61	pg_start_backup() can use a share lock to lock ControlFileLock instead of an exclusive lock. The change is almost for code cleanup. Since there seems to be no performance benefits from it, backports should not be needed. Fujii Masao	2010-03-10 02:04:48 +00:00
Bruce Momjian	65e806cba1	pgindent run for 9.0	2010-02-26 02:01:40 +00:00
Tom Lane	a2239b96e0	Make pg_stop_backup's reporting a bit more verbose in hopes of making error cases less intimidating for novices. Per discussion. Greg Smith	2010-02-25 02:17:50 +00:00
Tom Lane	05d8a561ff	Clean up handling of XactReadOnly and RecoveryInProgress checks. Add some checks that seem logically necessary, in particular let's make real sure that HS slave sessions cannot create temp tables. (If they did they would think that temp tables belonging to the master's session with the same BackendId were theirs. We must not allow myTempNamespace to become set in a slave session.) Change setval() and nextval() so that they are only allowed on temp sequences in a read-only transaction. This seems consistent with what we allow for table modifications in read-only transactions. Since an HS slave can't have a temp sequence, this also provides a nicer cure for the setval PANIC reported by Erik Rijkers. Make the error messages more uniform, and have them mention the specific command being complained of. This seems worth the trifling amount of extra code, since people are likely to see such messages a lot more than before.	2010-02-20 21:24:02 +00:00
Heikki Linnakangas	ad458cfe81	Don't use O_DIRECT when writing WAL files if archiving or streaming is enabled. Bypassing the kernel cache is counter-productive in that case, because the archiver/walsender process will read from the WAL file soon after it's written, and if it's not cached the read will cause a physical read, eating I/O bandwidth available on the WAL drive. Also, walreceiver process does unaligned writes, so disable O_DIRECT in walreceiver process for that reason too.	2010-02-19 10:51:04 +00:00
Itagaki Takahiro	3230fd056a	Fix STOP WAL LOCATION in backup history files no to return the next segment of XLOG_BACKUP_END record even if the the record is placed at a segment boundary. Furthermore the previous implementation could return nonexistent segment file name when the boundary is in segments that has "FE" suffix; We never use segments with "FF" suffix. Backpatch to 8.0, where hot backup was introduced. Reported by Fujii Masao.	2010-02-19 01:04:03 +00:00
Tom Lane	50a90fac40	Stamp HEAD as 9.0devel, and update various places that were referring to 8.5 (hope I got 'em all). Per discussion, this release will be 9.0 not 8.5.	2010-02-17 04:19:41 +00:00
Tom Lane	c64339face	When updating ShmemVariableCache from a checkpoint record, be sure to set all the values derived from oldestXid, not just that field. Brain fade in one of my patches associated with flat file removal, exposed by a report from Fujii Masao. With this change, xidVacLimit should always be valid, so remove a couple of bits of complexity associated with the previous assumption that sometimes it wouldn't get set right away.	2010-02-17 03:10:33 +00:00
Tom Lane	d1e027221d	Replace the pg_listener-based LISTEN/NOTIFY mechanism with an in-memory queue. In addition, add support for a "payload" string to be passed along with each notify event. This implementation should be significantly more efficient than the old one, and is also more compatible with Hot Standby usage. There is not yet any facility for HS slaves to receive notifications generated on the master, although such a thing is possible in future. Joachim Wieland, reviewed by Jeff Davis; also hacked on by me.	2010-02-16 22:34:57 +00:00
Robert Haas	e26c539e9f	Wrap calls to SearchSysCache and related functions using macros. The purpose of this change is to eliminate the need for every caller of SearchSysCache, SearchSysCacheCopy, SearchSysCacheExists, GetSysCacheOid, and SearchSysCacheList to know the maximum number of allowable keys for a syscache entry (currently 4). This will make it far easier to increase the maximum number of keys in a future release should we choose to do so, and it makes the code shorter, too. Design and review by Tom Lane.	2010-02-14 18:42:19 +00:00
Simon Riggs	dd428c79a4	Fix relcache init file invalidation during Hot Standby for the case where a database has a non-default tablespaceid. Pass thru MyDatabaseId and MyDatabaseTableSpace to allow file path to be re-created in standby and correct invalidation to take place in all cases. Update and rework xact_commit_desc() debug messages. Bug report from Tom by code inspection. Fix by me.	2010-02-13 16:15:48 +00:00
Simon Riggs	fafa374f2d	Introduce WAL records to log reuse of btree pages, allowing conflict resolution during Hot Standby. Page reuse interlock requested by Tom. Analysis and patch by me.	2010-02-13 00:59:58 +00:00
Heikki Linnakangas	e465390d03	Reduce the chatter to the log when starting a standby server. Don't echo all the recovery.conf options. Don't emit the "initializing recovery connections" message, which doesn't mean anything to a user. Remove the "starting archive recovery" message and replace the "automatic recovery in progress" message with a more informative message saying whether the server is doing PITR, normal archive recovery, or standby mode.	2010-02-12 09:49:08 +00:00
Heikki Linnakangas	54cbd1757e	If primary_conninfo is not set, don't try to establish streaming connection.	2010-02-12 07:56:36 +00:00
Heikki Linnakangas	9fa01f6c8a	Check for partial WAL files in standby mode. If restore_command restores a partial WAL file, assume it's because the file is just being copied to the archive and treat it the same as "file not found" in standby mode. pg_standby has a similar check, so it seems reasonable to have the same level of protection in the built-in standby mode.	2010-02-12 07:36:44 +00:00
Teodor Sigaev	5209c084a6	Generic implementation of red-black binary tree. It's planned to use in several places, but for now only GIN uses it during index creation. Using self-balanced tree greatly speeds up index creation in corner cases with preordered data.	2010-02-11 14:29:50 +00:00
Heikki Linnakangas	161d9d51b3	Now that streaming replication switches between streaming mode and restoring from archive, the last WAL segment is not necessarily open at the end of recovery. Fix assertion that assumed that. Fujii Masao, fixing the assertion failure reported by Martin Pihlak.	2010-02-10 08:25:25 +00:00
Tom Lane	cbe9d6beb4	Fix up rickety handling of relation-truncation interlocks. Move rd_targblock, rd_fsm_nblocks, and rd_vm_nblocks from relcache to the smgr relation entries, so that they will get reset to InvalidBlockNumber whenever an smgr-level flush happens. Because we now send smgr invalidation messages immediately (not at end of transaction) when a relation truncation occurs, this ensures that other backends will reset their values before they next access the relation. We no longer need the unreliable assumption that a VACUUM that's doing a truncation will hold its AccessExclusive lock until commit --- in fact, we can intentionally release that lock as soon as we've completed the truncation. This patch therefore reverts (most of) Alvaro's patch of 2009-11-10, as well as my marginal hacking on it yesterday. We can also get rid of assorted no-longer-needed relcache flushes, which are far more expensive than an smgr flush because they kill a lot more state. In passing this patch fixes smgr_redo's failure to perform visibility-map truncation, and cleans up some rather dubious assumptions in freespace.c and visibilitymap.c about when rd_fsm_nblocks and rd_vm_nblocks can be out of date.	2010-02-09 21:43:30 +00:00
Heikki Linnakangas	79647eed86	Fix bug in GIN WAL redo cleanup function: don't free fake relcache entry while it's still being used. Backpatch to 8.4, where the fake relcache method was introduced.	2010-02-09 20:31:24 +00:00
Heikki Linnakangas	4cea603128	Remove piece of code to zero out minRecoveryPoint when starting crash recovery. It's zeroed out whenever a checkpoint is written, so the only scenario where the removed code did anything is when you kill archive recovery, remove recovery.conf, and start up the server, so that it goes into crash recovery instead. That's a "don't do that" scenario, but it seems better to not clear minRecoveryPoint but instead update it like we do in archive recovery, which is what will now happen.	2010-02-08 09:08:51 +00:00
Tom Lane	68446b2c87	Remove some more dead VACUUM-FULL-only code.	2010-02-08 05:17:31 +00:00
Tom Lane	0a469c8769	Remove old-style VACUUM FULL (which was known for a little while as VACUUM FULL INPLACE), along with a boatload of subsidiary code and complexity. Per discussion, the use case for this method of vacuuming is no longer large enough to justify maintaining it; not to mention that we don't wish to invest the work that would be needed to make it play nicely with Hot Standby. Aside from the code directly related to old-style VACUUM FULL, this commit removes support for certain WAL record types that could only be generated within VACUUM FULL, redirect-pointer removal in heap_page_prune, and nontransactional generation of cache invalidation sinval messages (the last being the sticking point for Hot Standby). We still have to retain all code that copes with finding HEAP_MOVED_OFF and HEAP_MOVED_IN flag bits on existing tuples. This can't be removed as long as we want to support in-place update from pre-9.0 databases.	2010-02-08 04:33:55 +00:00
Tom Lane	b9b8831ad6	Create a "relation mapping" infrastructure to support changing the relfilenodes of shared or nailed system catalogs. This has two key benefits: * The new CLUSTER-based VACUUM FULL can be applied safely to all catalogs. * We no longer have to use an unsafe reindex-in-place approach for reindexing shared catalogs. CLUSTER on nailed catalogs now works too, although I left it disabled on shared catalogs because the resulting pg_index.indisclustered update would only be visible in one database. Since reindexing shared system catalogs is now fully transactional and crash-safe, the former special cases in REINDEX behavior have been removed; shared catalogs are treated the same as non-shared. This commit does not do anything about the recently-discussed problem of deadlocks between VACUUM FULL/CLUSTER on a system catalog and other concurrent queries; will address that in a separate patch. As a stopgap, parallel_schedule has been tweaked to run vacuum.sql by itself, to avoid such failures during the regression tests.	2010-02-07 20:48:13 +00:00
Tom Lane	9727c583fe	Restructure CLUSTER/newstyle VACUUM FULL/ALTER TABLE support so that swapping of old and new toast tables can be done either at the logical level (by swapping the heaps' reltoastrelid links) or at the physical level (by swapping the relfilenodes of the toast tables and their indexes). This is necessary infrastructure for upcoming changes to support CLUSTER/VAC FULL on shared system catalogs, where we cannot change reltoastrelid. The physical swap saves a few catalog updates too. We unfortunately have to keep the logical-level swap logic because in some cases we will be adding or deleting a toast table, so there's no possibility of a physical swap. However, that only happens as a consequence of schema changes in the table, which we do not need to support for system catalogs, so such cases aren't an obstacle for that. In passing, refactor the cluster support functions a little bit to eliminate unnecessarily-duplicated code; and fix the problem that while CLUSTER had been taught to rename the final toast table at need, ALTER TABLE had not.	2010-02-04 00:09:14 +00:00
Heikki Linnakangas	9de778b24b	Move the responsibility of writing a "unlogged WAL operation" record from heap_sync() to the callers, because heap_sync() is sometimes called even if the operation itself is WAL-logged. This eliminates the bogus unlogged records from CLUSTER that Simon Riggs reported, patch by Fujii Masao.	2010-02-03 10:01:30 +00:00
Simon Riggs	296578feb4	Revoke augmentation of WAL records for btree delete, per discussion.	2010-02-01 13:40:28 +00:00
Simon Riggs	6d2bc0a6cf	Augment WAL records for btree delete with GetOldestXmin() to reduce false positives during Hot Standby conflict processing. Simple patch to enhance conflict processing, following previous discussions. Controlled by parameter minimize_standby_conflicts = on \| off, with default off allows measurement of performance impact to see whether it should be set on all the time.	2010-01-29 18:39:05 +00:00
Simon Riggs	76be0c81cc	Filter recovery conflicts based upon dboid from relfilenode of WAL records for heap and btree. Minor change, mostly API changes to pass through the required values. This is a simple change though also provides the refactoring required for further enhancements to conflict processing using the relOid. Changes only have effect during Hot Standby.	2010-01-29 17:10:05 +00:00
Heikki Linnakangas	b0509ef601	Fix crashing bug at the end of recovery in Streaming Replication, when restore_command is not given. Fujii Masao.	2010-01-28 19:17:22 +00:00
Heikki Linnakangas	83cb7da7dc	Fix bug in wasender's xlogid boundary handling, reported by Erik Rijkers. LogwrtRqst.Write can be set to non-existent FF log segment, we mustn't try to send that in XLogSend(). Also fix similar bug in ReadRecord(), which I just introduced in the ReadRecord() refactoring patch.	2010-01-27 16:41:09 +00:00
Heikki Linnakangas	1bb2558046	Make standby server continuously retry restoring the next WAL segment with restore_command, if the connection to the primary server is lost. This ensures that the standby can recover automatically, if the connection is lost for a long time and standby falls behind so much that the required WAL segments have been archived and deleted in the master. This also makes standby_mode useful without streaming replication; the server will keep retrying restore_command every few seconds until the trigger file is found. That's the same basic functionality pg_standby offers, but without the bells and whistles. To implement that, refactor the ReadRecord/FetchRecord functions. The FetchRecord() function introduced in the original streaming replication patch is removed, and all the retry logic is now in a new function called XLogReadPage(). XLogReadPage() is now responsible for executing restore_command, launching walreceiver, and waiting for new WAL to arrive from primary, as required. This also changes the life cycle of walreceiver. When launched, it now only tries to connect to the master once, and exits if the connection fails, or is lost during streaming for any reason. The startup process detects the death, and re-launches walreceiver if necessary.	2010-01-27 15:27:51 +00:00
Simon Riggs	aed1a0121a	Fix longstanding gripe that we check for 0000000001.history at start of archive recovery, even when we know it is never present.	2010-01-26 00:07:13 +00:00
Tom Lane	875353b99f	Fix assorted core dumps and Assert failures that could occur during AbortTransaction or AbortSubTransaction, when trying to clean up after an error that prevented (sub)transaction start from completing: * access to TopTransactionResourceOwner that might not exist * assert failure in AtEOXact_GUC, if AtStart_GUC not called yet * assert failure or core dump in AfterTriggerEndSubXact, if AfterTriggerBeginSubXact not called yet Per testing by injecting elog(ERROR) at successive steps in StartTransaction and StartSubTransaction. It's not clear whether all of these cases could really occur in the field, but at least one of them is easily exposed by simple stress testing, as per my accidental discovery yesterday.	2010-01-24 21:49:17 +00:00
Simon Riggs	959ac58c04	In HS, Startup process sets SIGALRM when waiting for buffer pin. If woken by alarm we send SIGUSR1 to all backends requesting that they check to see if they are blocking Startup process. If so, they throw ERROR/FATAL as for other conflict resolutions. Deadlock stop gap removed. max_standby_delay = -1 option removed to prevent deadlock.	2010-01-23 16:37:12 +00:00
Robert Haas	76a47c0e74	Replace ALTER TABLE ... SET STATISTICS DISTINCT with a more general mechanism. Attributes can now have options, just as relations and tablespaces do, and the reloptions code is used to parse, validate, and store them. For simplicity and because these options are not performance critical, we store them in a separate cache rather than the main relcache. Thanks to Alex Hunsaker for the review.	2010-01-22 16:40:19 +00:00
Heikki Linnakangas	09b115f706	Write a WAL record whenever we perform an operation without WAL-logging that would've been WAL-logged if archiving was enabled. If we encounter such records in archive recovery anyway, we know that some data is missing from the log. A WARNING is emitted in that case. Original patch by Fujii Masao, with changes by me.	2010-01-20 19:43:40 +00:00
Teodor Sigaev	a0a7e63434	Fix incorrect comparison of scan key in GIN. Per report from Vyacheslav Kalinin <vka@mgcp.com>	2010-01-18 11:50:43 +00:00
Simon Riggs	a8ce974cdd	Teach standby conflict resolution to use SIGUSR1 Conflict reason is passed through directly to the backend, so we can take decisions about the effect of the conflict based upon the local state. No specific changes, as yet, though this prepares for later work. CancelVirtualTransaction() sends signals while holding ProcArrayLock. Introduce errdetail_abort() to give message detail explaining that the abort was caused by conflict processing. Remove CONFLICT_MODE states in favour of using PROCSIG_RECOVERY_CONFLICT states directly, for clarity.	2010-01-16 10:05:59 +00:00
Heikki Linnakangas	40f908bdcd	Introduce Streaming Replication. This includes two new kinds of postmaster processes, walsenders and walreceiver. Walreceiver is responsible for connecting to the primary server and streaming WAL to disk, while walsender runs in the primary server and streams WAL from disk to the client. Documentation still needs work, but the basics are there. We will probably pull the replication section to a new chapter later on, as well as the sections describing file-based replication. But let's do that as a separate patch, so that it's easier to see what has been added/changed. This patch also adds a new section to the chapter about FE/BE protocol, documenting the protocol used by walsender/walreceivxer. Bump catalog version because of two new functions, pg_last_xlog_receive_location() and pg_last_xlog_replay_location(), for monitoring the progress of replication. Fujii Masao, with additional hacking by me	2010-01-15 09:19:10 +00:00
Teodor Sigaev	4cbe473938	Add point_ops opclass for GiST.	2010-01-14 16:31:09 +00:00
Simon Riggs	e99767bc28	First part of refactoring of code for ResolveRecoveryConflict. Purposes of this are to centralise the conflict code to allow further change, as well as to allow passing through the full reason for the conflict through to the conflicting backends. Backend state alters how we can handle different types of conflict so this is now required. As originally suggested by Heikki, no longer optional.	2010-01-14 11:08:02 +00:00
Robert Haas	84b6d5f359	Remove partial, broken support for NULL pointers when fetching attributes. Previously, fastgetattr() and heap_getattr() tested their fourth argument against a null pointer, but any attempt to use them with a literal-NULL fourth argument evaluated to (void )0, resulting in a compiler error. Remove these NULL tests to avoid leading future readers of this code to believe that this has a chance of working. Also clean up related legacy code in nocachegetattr(), heap_getsysattr(), and nocache_index_getattr(). The new coding standard is that any code which calls a getattr-type function or macro which takes an isnull argument MUST pass a valid boolean pointer. Per discussion with Bruce Momjian, Tom Lane, Alvaro Herrera.	2010-01-10 04:26:36 +00:00
Simon Riggs	42edbd16fb	During Hot Standby, set DatabasePath correctly during relcache init file deletion, so that we attempt to unlink the correct filepath. unlink() errors are ignorable there, so lack of a DatabasePath initialization step did not cause visible problems until a related bug showed up on Solaris. Code refactored from xact_redo_commit() to ProcessCommittedInvalidationMessages() in inval.c. Recovery may replay shared invalidation messages for many databases, so we cannot SetDatabasePath() once as we do in normal backends. Read the databaseid from the shared invalidation messages, then set DatabasePath temporarily before calling RelationCacheInitFileInvalidate(). Problem report by Robert Treat, analysis and fix by me.	2010-01-09 16:49:27 +00:00
Tom Lane	901be0fad4	Remove all the special-case code for INT64_IS_BUSTED, per decision that we're not going to support that anymore. I did keep the 64-bit-CRC-with-32-bit-arithmetic code, since it has a performance excuse to live. It's a bit moot since that's all ifdef'd out, of course.	2010-01-07 04:53:35 +00:00
Robert Haas	d86d51a958	Support ALTER TABLESPACE name SET/RESET ( tablespace_options ). This patch only supports seq_page_cost and random_page_cost as parameters, but it provides the infrastructure to scalably support many more. In particular, we may want to add support for effective_io_concurrency, but I'm leaving that as future work for now. Thanks to Tom Lane for design help and Alvaro Herrera for the review.	2010-01-05 21:54:00 +00:00
Heikki Linnakangas	06f82b2961	Write an end-of-backup WAL record at pg_stop_backup(), and wait for it at recovery instead of reading the backup history file. This is more robust, as it stops you from prematurely starting up an inconsisten cluster if the backup history file is lost for some reason, or if the base backup was never finished with pg_stop_backup(). This also paves the way for a simpler streaming replication patch, which doesn't need to care about backup history files anymore. The backup history file is still created and archived as before, but it's not used by the system anymore. It's just for informational purposes now. Bump PG_CONTROL_VERSION as the location of the backup startpoint is now written to a new field in pg_control, and catversion because initdb is required Original patch by Fujii Masao per Simon's idea, with further fixes by me.	2010-01-04 12:50:50 +00:00
Tom Lane	5b76bb180f	Dept of second thoughts: my first cut at supporting "x IS NOT NULL" btree indexscans would do the wrong thing if index_rescan() was called with a NULL instead of a new set of scankeys and the index was DESC order, because sk_strategy would not get flipped a second time. I think that those provisions for a NULL argument are dead code now as far as the core backend goes, but possibly somebody somewhere is still using it. In any case, this refactoring seems clearer, and it's definitely shorter.	2010-01-03 05:39:08 +00:00
Bruce Momjian	0239800893	Update copyright for the year 2010.	2010-01-02 16:58:17 +00:00
Tom Lane	29c4ad9829	Support "x IS NOT NULL" clauses as indexscan conditions. This turns out to be just a minor extension of the previous patch that made "x IS NULL" indexable, because we can treat the IS NOT NULL condition as if it were "x < NULL" or "x > NULL" (depending on the index's NULLS FIRST/LAST option), just like IS NULL is treated like "x = NULL". Aside from any possible usefulness in its own right, this is an important improvement for index-optimized MAX/MIN aggregates: it is now reliably possible to get a column's min or max value cheaply, even when there are a lot of nulls cluttering the interesting end of the index.	2010-01-01 21:53:49 +00:00
Tom Lane	85d02a6586	Redefine Datum as uintptr_t, instead of unsigned long. This is more in keeping with modern practice, and is a first step towards porting to Win64 (which has sizeof(pointer) > sizeof(long)). Tsutomu Yamada, Magnus Hagander, Tom Lane	2009-12-31 19:41:37 +00:00
Heikki Linnakangas	ff1e1e45b9	Reset minRecoveryPoint at checkpoints, so that we don't uselessly update it in the control file at crash recovery following an archive recovery. Per Fujii Masao and subsequent discussion.	2009-12-30 08:37:21 +00:00
Tom Lane	668e37d138	Fix wrong WAL info value generated when gistContinueInsert() performs an index page split. This would result in index corruption, or even more likely an error during WAL replay, if we were unlucky enough to crash during end-of-recovery cleanup after having completed an incomplete GIST insertion. Yoichi Hirai	2009-12-24 17:52:04 +00:00
Simon Riggs	efc16ea520	Allow read only connections during recovery, known as Hot Standby. Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record. New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far. This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required. Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit. Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.	2009-12-19 01:32:45 +00:00
Tom Lane	62aba76568	Prevent indirect security attacks via changing session-local state within an allegedly immutable index function. It was previously recognized that we had to prevent such a function from executing SET/RESET ROLE/SESSION AUTHORIZATION, or it could trivially obtain the privileges of the session user. However, since there is in general no privilege checking for changes of session-local state, it is also possible for such a function to change settings in a way that might subvert later operations in the same session. Examples include changing search_path to cause an unexpected function to be called, or replacing an existing prepared statement with another one that will execute a function of the attacker's choosing. The present patch secures VACUUM, ANALYZE, and CREATE INDEX/REINDEX against these threats, which are the same places previously deemed to need protection against the SET ROLE issue. GUC changes are still allowed, since there are many useful cases for that, but we prevent security problems by forcing a rollback of any GUC change after completing the operation. Other cases are handled by throwing an error if any change is attempted; these include temp table creation, closing a cursor, and creating or deleting a prepared statement. (In 7.4, the infrastructure to roll back GUC changes doesn't exist, so we settle for rejecting changes of "search_path" in these contexts.) Original report and patch by Gurjeet Singh, additional analysis by Tom Lane. Security: CVE-2009-4136	2009-12-09 21:57:51 +00:00
Tom Lane	0cb65564e5	Add exclusion constraints, which generalize the concept of uniqueness to support any indexable commutative operator, not just equality. Two rows violate the exclusion constraint if "row1.col OP row2.col" is TRUE for each of the columns in the constraint. Jeff Davis, reviewed by Robert Haas	2009-12-07 05:22:23 +00:00
Heikki Linnakangas	cd87b6f8a5	Fix an old bug in multixact and two-phase commit. Prepared transactions can be part of multixacts, so allocate a slot for each prepared transaction in the "oldest member" array in multixact.c. On PREPARE TRANSACTION, transfer the oldest member value from the current backends slot to the prepared xact slot. Also save and recover the value from the 2pc state file. The symptom of the bug was that after a transaction prepared, a shared lock still held by the prepared transaction was sometimes ignored by other transactions. Fix back to 8.1, where both 2PC and multixact were introduced.	2009-11-23 09:58:36 +00:00
Teodor Sigaev	5e75f6790c	Fix multicolumn GIN's wrong results with fastupdate enabled. User-defined consistent functions believes the check array contains at least one true element which was not a true for scanning pending list. Per report from Yury Don <yura@vpcit.ru>	2009-11-13 11:17:04 +00:00
Tom Lane	7d535ebe5b	Dept of second thoughts: after studying index_getnext() a bit more I realize that it can scribble on scan->xs_ctup.t_self while following HOT chains, so we can't rely on that to stay valid between hashgettuple() calls. Introduce a private variable in HashScanOpaque, instead.	2009-11-01 22:30:54 +00:00
Tom Lane	c4afdca4c2	Fix two serious bugs introduced into hash indexes by the 8.4 patch that made hash indexes keep entries sorted by hash value. First, the original plans for concurrency assumed that insertions would happen only at the end of a page, which is no longer true; this could cause scans to transiently fail to find index entries in the presence of concurrent insertions. We can compensate by teaching scans to re-find their position after re-acquiring read locks. Second, neither the bucket split nor the bucket compaction logic had been fixed to preserve hashvalue ordering, so application of either of those processes could lead to permanent corruption of an index, in the sense that searches might fail to find entries that are present. This patch fixes the split and compaction logic to preserve hashvalue ordering, but it cannot do anything about pre-existing corruption. We will need to recommend reindexing all hash indexes in the 8.4.2 release notes. To buy back the performance loss hereby induced in split and compaction, fix them to use PageIndexMultiDelete instead of retail PageIndexDelete operations. We might later want to do something with qsort'ing the page contents rather than doing a binary search for each insertion, but that seemed more invasive than I cared to risk in a back-patch. Per bug #5157 from Jeff Janes and subsequent investigation.	2009-11-01 21:25:25 +00:00
Tom Lane	8d54c2482b	Code review for LIKE INCLUDING patch --- clean up some cosmetic and not so cosmetic stuff.	2009-10-13 00:53:08 +00:00
Andrew Dunstan	faa1afc6c1	CREATE LIKE INCLUDING COMMENTS and STORAGE, and INCLUDING ALL shortcut. Itagaki Takahiro.	2009-10-12 19:49:24 +00:00
Tom Lane	c970292a94	Remove very ancient tuple-counting infrastructure (IncrRetrieved() and friends). This code has all been ifdef'd out for many years, and doesn't seem to have any prospect of becoming any more useful in the future. EXPLAIN ANALYZE is what people use in practice, and I think if we did want process-wide counters we'd be more likely to put in dtrace events for that than try to resurrect this code. Get rid of it so as to have one less detail to worry about while refactoring execMain.c.	2009-10-08 22:34:57 +00:00
Tom Lane	e66d714386	Make sure that GIN fast-insert and regular code paths enforce the same tuple size limit. Improve the error message for index-tuple-too-large so that it includes the actual size, the limit, and the index name. Sync with the btree occurrences of the same error. Back-patch to 8.4 because it appears that the out-of-sync problem is occurring in the field. Teodor and Tom	2009-10-02 21:14:04 +00:00
Teodor Sigaev	f92bbb899a	Fix incorrect arguments for gist_box_penalty call. The bug could be observed only for secondary page split (i.e. for non-first columns of index) Patch by Paul Ramsey <pramsey@opengeo.org>	2009-09-18 14:01:56 +00:00
Tom Lane	384cad5c7b	Fix two distinct errors in creation of GIN_INSERT_LISTPAGE xlog records. In practice these mistakes were always masked when full_page_writes was on, because XLogInsert would always choose to log the full page, and then ginRedoInsertListPage wouldn't try to do anything. But with full_page_writes off a WAL replay failure was certain. The GIN_INSERT_LISTPAGE record type could probably be eliminated entirely in favor of using XLOG_HEAP_NEWPAGE, but I refrained from doing that now since it would have required a significantly more invasive patch. In passing do a little bit of code cleanup, including making the accounting for free space on GIN list pages more precise. (This wasn't a bug as the errors were always in the conservative direction.) Per report from Simon. Back-patch to 8.4 which contains the identical code.	2009-09-15 20:31:30 +00:00
Heikki Linnakangas	7f2a10fecd	Don't error out if recycling or removing an old WAL segment fails at the end of checkpoint. Although the checkpoint has been written to WAL at that point already, so that all data is safe, and we'll retry removing the WAL segment at the next checkpoint, if such a failure persists we won't be able to remove any other old WAL segments either and will eventually run out of disk space. It's better to treat the failure as non-fatal, and move on to clean any other WAL segment and continue with any other end-of-checkpoint cleanup. We don't normally expect any such failures, but on Windows it can happen with some anti-virus or backup software that lock files without FILE_SHARE_DELETE flag. Also, the loop in pgrename() to retry when the file is locked was broken. If a file is locked on Windows, you get ERROR_SHARE_VIOLATION, not ERROR_ACCESS_DENIED, at least on modern versions. Fix that, although I left the check for ERROR_ACCESS_DENIED in there as well (presumably it was correct in some environment), and added ERROR_LOCK_VIOLATION to be consistent with similar checks in pgwin32_open(). Reduce the timeout on the loop from 30s to 10s, on the grounds that since it's been broken, we've effectively had a timeout of 0s and no-one has complained, so a smaller timeout is actually closer to the old behavior. A longer timeout would mean that if recycling a WAL file fails because it's locked for some reason, InstallXLogFileSegment() will hold ControlFileLock for longer, potentially blocking other backends, so a long timeout isn't totally harmless. While we're at it, set errno correctly in pgrename(). Backpatch to 8.2, which is the oldest version supported on Windows. The xlog.c changes would make sense on other platforms and thus on older versions as well, but since there's no such locking issues on other platforms, it's not worth it.	2009-09-13 18:32:08 +00:00
Heikki Linnakangas	4e2d5efc6a	On Windows, when a file is deleted and another process still has an open file handle on it, the file goes into "pending deletion" state where it still shows up in directory listing, but isn't accessible otherwise. That confuses RemoveOldXLogFiles(), making it think that the file hasn't been archived yet, while it actually was, and it was deleted along with the .done file. Fix that by renaming the file with ".deleted" extension before deleting it. Also check the return value of rename() and unlink(), so that if the removal fails for any reason (e.g another process is holding the file locked), we don't delete the .done file until the WAL file is really gone. Backpatch to 8.2, which is the oldest version supported on Windows.	2009-09-10 09:42:10 +00:00
Tom Lane	794e3e81a0	Force VACUUM to recalculate oldestXmin even when we haven't changed our own database's datfrozenxid, if the current value is old enough to be forcing autovacuums or warning messages. This ensures that a bogus value is replaced as soon as possible. Per a comment from Heikki.	2009-09-01 04:46:49 +00:00
Tom Lane	14f445fccf	Actually, we need to bump the format identifier on twophase files because of readjustment of 2PC rmgr IDs for flatfile removal.	2009-09-01 04:15:45 +00:00
Alvaro Herrera	a8bb8eb583	Remove flatfiles.c, which is now obsolete. Recent commits have removed the various uses it was supporting. It was a performance bottleneck, according to bug report #4919 by Lauris Ulmanis; seems it slowed down user creation after a billion users.	2009-09-01 02:54:52 +00:00
Tom Lane	25ec228ef7	Track the current XID wrap limit (or more accurately, the oldest unfrozen XID) in checkpoint records. This eliminates the need to recompute the value from scratch during database startup, which is one of the two remaining reasons for the flatfile code to exist. It should also simplify life for hot-standby operation. To avoid bloating the checkpoint records unreasonably, I switched from tracking the oldest database by name to tracking it by OID. This turns out to save cycles in general (everywhere but the warning-generating paths, which we hardly care about) and also helps us deal with the case that the oldest database got dropped instead of being vacuumed. The prior coding might go for a long time without updating the wrap limit in that case, which is bad because it might result in a lot of useless autovacuum activity.	2009-08-31 02:23:23 +00:00
Alvaro Herrera	53af86c55c	Fix handling of autovacuum reloptions. In the original coding, setting a single reloption would cause default values to be used for all the other reloptions. This is a problem particularly for autovacuum reloptions. Itagaki Takahiro	2009-08-27 17:18:44 +00:00
Heikki Linnakangas	9cd6685f91	In the checkpoint written at the end of archive recovery, the WAL page header was incorrectly initialized with timeline ID 0. That rendered the WAL page unrecoverable, making a subsequent archive recovery stop at that point. ThisTimeLineID needs to be initialized before calling AdvanceXLInsertBuffer(). This fixes bug #5011 reported by James Bardin. Backpatch to 8.4, as the bug was introduced by the changes to use of bgwriter for writing the end-of-archive-recovery checkpoint. Patch by Tom Lane.	2009-08-27 07:15:41 +00:00
Tom Lane	7fc7a7c4d0	Fix a violation of WAL coding rules in the recent patch to include an "all tuples visible" flag in heap page headers. The flag update must be applied before calling XLogInsert, but heap_update and the tuple moving routines in VACUUM FULL were ignoring this rule. A crash and replay could therefore leave the flag incorrectly set, causing rows to appear visible in seqscans when they should not be. This might explain recent reports of data corruption from Jeff Ross and others. In passing, do a bit of editorialization on comments in visibilitymap.c.	2009-08-24 02:18:32 +00:00
Tom Lane	67a5f8ff9e	Department of marginal improvements: teach tupconvert.c to avoid doing a physical conversion when there are dropped columns in the same places in the input and output tupdescs. This avoids possible performance loss from the recent patch to improve dropped-column handling, in some cases where the old code would have worked.	2009-08-17 20:34:31 +00:00
Tom Lane	04011cc970	Allow backends to start up without use of the flat-file copy of pg_database. To make this work in the base case, pg_database now has a nailed-in-cache relation descriptor that is initialized using hardwired knowledge in relcache.c. This means pg_database is added to the set of relations that need to have a Schema_pg_xxx macro maintained in pg_attribute.h. When this path is taken, we'll have to do a seqscan of pg_database to find the row we need. In the normal case, we are able to do an indexscan to find the database's row by name. This is made possible by storing a global relcache init file that describes only the shared catalogs and their indexes (and therefore is usable by all backends in any database). A new backend loads this cache file, finds its database OID after an indexscan on pg_database, and then loads the local relcache init file for that database. This change should effectively eliminate number of databases as a factor in backend startup time, even with large numbers of databases. However, the real reason for doing it is as a first step towards getting rid of the flat files altogether. There are still several other sub-projects to be tackled before that can happen.	2009-08-12 20:53:31 +00:00
Tom Lane	97e14f6e93	Document that LocalSetXLogInsertAllowed can be re-executed. Per comment from Simon.	2009-08-08 16:39:17 +00:00
Tom Lane	87740caa01	rm_cleanup functions need to be allowed to write WAL entries. This oversight appears to explain the recent reports of "PANIC: cannot make new WAL entries during recovery".	2009-08-07 19:29:49 +00:00
Tom Lane	dcb2bda9b7	Improve plpgsql's ability to cope with rowtypes containing dropped columns, by supporting conversions in places that used to demand exact rowtype match. Since this issue is certain to come up elsewhere (in fact, already has, in ExecEvalConvertRowtype), factor out the support code into new core functions for tuple conversion. I chose to put these in a new source file since heaptuple.c is already overly long. Heavily revised version of a patch by Pavel Stehule.	2009-08-06 20:44:32 +00:00
Tom Lane	9072592946	Add ALTER TABLE ... ALTER COLUMN ... SET STATISTICS DISTINCT Robert Haas	2009-08-02 22:14:53 +00:00
Tom Lane	527f0ae3fa	Department of second thoughts: let's show the exact key during unique index build failures, too. Refactor a bit more since that error message isn't spelled the same.	2009-08-01 20:59:17 +00:00
Tom Lane	b680ae4bdb	Improve unique-constraint-violation error messages to include the exact values being complained of. In passing, also remove the arbitrary length limitation in the similar error detail message for foreign key violations. Itagaki Takahiro	2009-08-01 19:59:41 +00:00
Tom Lane	25d9bf2e3e	Support deferrable uniqueness constraints. The current implementation fires an AFTER ROW trigger for each tuple that looks like it might be non-unique according to the index contents at the time of insertion. This works well as long as there aren't many conflicts, but won't scale to massive unique-key reassignments. Improving that case is a TODO item. Dean Rasheed	2009-07-29 20:56:21 +00:00
Tom Lane	ca7c8168de	Tweak TOAST code so that columns marked with MAIN storage strategy are not forced out-of-line unless that is necessary to make the row fit on a page. Previously, they were forced out-of-line if needed to get the row down to the default target size (1/4th page). Kevin Grittner	2009-07-22 01:21:22 +00:00
Peter Eisentraut	de160e2c00	Make backend header files C++ safe This alters various incidental uses of C++ key words to use other similar identifiers, so that a C++ compiler won't choke outright. You still (probably) need extern "C" { }; around the inclusion of backend headers. based on a patch by Kurt Harriman <harriman@acm.org> Also add a script cpluspluscheck to check for C++ compatibility in the future. As of right now, this passes without error for me.	2009-07-16 06:33:46 +00:00
Tom Lane	2de48a83e6	Cleanup and code review for the patch that made bgwriter active during archive recovery. Invent a separate state variable and inquiry function for XLogInsertAllowed() to clarify some tests and make the management of writing the end-of-recovery checkpoint less klugy. Fix several places that were incorrectly testing InRecovery when they should be looking at RecoveryInProgress or XLogInsertAllowed (because they will now be executed in the bgwriter not startup process). Clarify handling of bad LSNs passed to XLogFlush during recovery. Use a spinlock for setting/testing SharedRecoveryInProgress. Improve quite a lot of comments. Heikki and Tom	2009-06-26 20:29:04 +00:00
Heikki Linnakangas	7e48b77b1c	Fix some serious bugs in archive recovery, now that bgwriter is active during it: When bgwriter is active, the startup process can't perform mdsync() correctly because it won't see the fsync requests accumulated in bgwriter's private pendingOpsTable. Therefore make bgwriter responsible for the end-of-recovery checkpoint as well, when it's active. When bgwriter is active (= archive recovery), the startup process must not accumulate fsync requests to its own pendingOpsTable, since bgwriter won't see them there when it performs restartpoints. Make startup process drop its pendingOpsTable when bgwriter is launched to avoid that. Update minimum recovery point one last time when leaving archive recovery. It won't be updated by the end-of-recovery checkpoint because XLogFlush() sees us as out of recovery already. This fixes bug #4879 reported by Fujii Masao.	2009-06-25 21:36:00 +00:00
Heikki Linnakangas	ebaa1952f1	The code to unlink dropped relations in FinishPreparedTransaction() was acting like runs inside WAL recovery, but it doesn't. I must've copy-pasted this from a redo-function in the relation forks patch. Noticed by Tom Lane while he was looking through callers of smgrdounlink().	2009-06-25 19:05:52 +00:00
Peter Eisentraut	4183b10661	Correct grammar in picksplit debug messages	2009-06-24 15:16:22 +00:00
Heikki Linnakangas	efa8544fd5	Fix a few errors in comments. Patch by Fujii Masao, plus the one in visibilitymap.c by me.	2009-06-18 10:08:08 +00:00
Bruce Momjian	d747140279	8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef list provided by Andrew.	2009-06-11 14:49:15 +00:00
Peter Eisentraut	746c28a8e4	Improve capitalization and punctuation in recently added GiST message.	2009-06-10 20:02:15 +00:00
Tom Lane	61dd4185ff	Keep rs_startblock the same during heap_rescan, so that a rescan of a SeqScan node starts from the same place as the first scan did. This avoids surprising behavior of scrollable and WITH HOLD cursors, as seen in Mark Kirkwood's bug report of yesterday. It's not entirely clear whether a rescan should be forced to drop out of the syncscan mode, but for the moment I left the code behaving the same on that point. Any change there would only be a performance and not a correctness issue, anyway. Back-patch to 8.3, since the unstable behavior was created by the syncscan patch.	2009-06-10 18:54:16 +00:00
Tom Lane	32ea236361	Improve the IndexVacuumInfo/IndexBulkDeleteResult API to allow somewhat sane behavior in cases where we don't know the heap tuple count accurately; in particular partial vacuum, but this also makes the API a bit more useful for ANALYZE. This patch adds "estimated_count" flags to both structs so that an approximate count can be flagged as such, and adjusts the logic so that approximate counts are not used for updating pg_class.reltuples. This fixes my previous complaint that VACUUM was putting ridiculous values into pg_class.reltuples for indexes. The actual impact of that bug is limited, because the planner only pays attention to reltuples for an index if the index is partial; which probably explains why beta testers hadn't noticed a degradation in plan quality from it. But it needs to be fixed. The whole thing is a bit messy and should be redesigned in future, because reltuples now has the potential to drift quite far away from reality when a long period elapses with no non-partial vacuums. But this is as good as it's going to get for 8.4.	2009-06-06 22:13:52 +00:00
Tom Lane	356eea24ce	Fix a serious bug introduced into GIN in 8.4: now that MergeItemPointers() is supposed to remove duplicate heap TIDs, we have to be sure to reduce the tuple size and posting-item count accordingly in addItemPointersToTuple(). Failing to do so resulted in the effective injection of garbage TIDs into the index contents, ie, whatever happened to be in the memory palloc'd for the new tuple. I'm not sure that this fully explains the index corruption reported by Tatsuo Ishii, but the test case I'm using no longer fails.	2009-06-06 02:39:40 +00:00
Heikki Linnakangas	7c8d7a2eec	Only recycle normal files in pg_xlog as WAL segments. pg_standby creates symbolic links with the -l option, and as Fujii Masao pointed out we ended up overwriting files in the archive directory before this patch. Patch by Aidan Van Dyk, Fujii Masao and me. Backpatch to 8.3, where pg_standby was introduced.	2009-06-02 06:18:06 +00:00
Heikki Linnakangas	2e6107cb62	When archiving is enabled, rotate the last WAL segment at shutdown so that all transactions are archived. Original patch by Guillaume Smet.	2009-05-28 11:02:16 +00:00
Tom Lane	c3707a4fcd	Use more-portable coding for the check on handing out the last available relopt_kind value in add_reloption_kind(). Per Zdenek Kotala.	2009-05-24 22:22:44 +00:00
Tom Lane	7280fab717	Fix bug #4814 (wrong subscript in consistent-function call), and add some minimal regression test coverage for matchPartialInPendingList().	2009-05-19 02:48:26 +00:00
Tom Lane	4616d57dad	Fix all the server-side SIGQUIT handlers (grumble ... why so many identical copies?) to ensure they really don't run proc_exit/shmem_exit callbacks, as was intended. I broke this behavior recently by installing atexit callbacks without thinking about the one case where we truly don't want to run those callback functions. Noted in an example from Dave Page.	2009-05-15 15:56:39 +00:00
Tom Lane	bfab3f19e3	Include recovery_end_command in recovery.conf.sample. Per suggestion of Jaime Casanova.	2009-05-14 22:22:01 +00:00
Tom Lane	284e12c398	Improve a couple of comments.	2009-05-14 21:28:35 +00:00
Heikki Linnakangas	9e403c2587	Add recovery_end_command option to recovery.conf. recovery_end_command is run at the end of archive recovery, providing a chance to do external cleanup. Modify pg_standby so that it no longer removes the trigger file, that is to be done using the recovery_end_command now. Provide a "smart" failover mode in pg_standby, where we don't fail over immediately, but only after recovering all unapplied WAL from the archive. That gives you zero data loss assuming all WAL was archived before failover, which is what most users of pg_standby actually want. recovery_end_command by Simon Riggs, pg_standby changes by Fujii Masao and myself.	2009-05-14 20:31:09 +00:00
Tom Lane	23543c732b	Rewrite xml.c's memory management (yet again). Give up on the idea of redirecting libxml's allocations into a Postgres context. Instead, just let it use malloc directly, and add PG_TRY blocks as needed to be sure we release libxml data structures in error recovery code paths. This is ugly but seems much more likely to play nicely with third-party uses of libxml, as seen in recent trouble reports about using Perl XML facilities in pl/perl and bug #4774 about contrib/xml2. I left the code for allocation redirection in place, but it's only built/used if you #define USE_LIBXMLCONTEXT. This is because I found it useful to corral libxml's allocations in a palloc context when hunting for libxml memory leaks, and we're surely going to have more of those in the future with this type of approach. But we don't want it turned on in a normal build because it breaks exactly what we need to fix. I have not re-indented most of the code sections that are now wrapped by PG_TRY(); that's for ease of review. pg_indent will fix it. This is a pre-existing bug in 8.3, but I don't dare back-patch this change until it's gotten a reasonable amount of field testing.	2009-05-13 20:27:17 +00:00
Tom Lane	f23bdda324	Fix LOCK TABLE to eliminate the race condition that could make it give weird errors when tables are concurrently dropped. To do this we must take lock on each relation before we check its privileges. The old code was trying to do that the other way around, which is a bit pointless when there are lots of other commands that lock relations before checking privileges. I did keep it checking each relation's privilege before locking the next relation, which is a detail that ALTER TABLE isn't too picky about.	2009-05-12 16:43:32 +00:00
Heikki Linnakangas	223431cba1	Request XLOG switch before writing checkpoint in pg_start_backup(). Otherwise you can end up with an unrecoverable backup if you start a new base backup right after finishing archive recovery. In that scenario, the redo pointer of the checkpoint that pg_start_backup() writes points to the XLOG segment where the timeline-changing end-of-archive-recovery checkpoint is. The beginning of that segment contains pages with the old timeline ID, and we don't accept that in recovery unless we find a history file covering the old timeline ID. If you omit pg_xlog from the base backup and clear the archive directory before starting the backup, there will be no such history file available. The bug is present in all versions since PITR was introduced in 8.0, but I'm back-patching only back to 8.2. Earlier versions didn't have XLOG switch records, making this fix unfeasible. Given the lack of reports until now, it doesn't seem worthwhile to spend more effort to fix 8.0 and 8.1. Per report and suggestion by Mikael Krantz	2009-05-07 11:25:25 +00:00
Tom Lane	8f348112f3	Insert CHECK_FOR_INTERRUPTS() calls into btree and hash index scans at the points where we step right or left to the next page. This should ensure reasonable response time to a query cancel request during an unsuccessful index scan, as seen in recent gripe from Marc Cousin. It's a bit trickier than it might seem at first glance, because CHECK_FOR_INTERRUPTS() is a no-op if executed while holding a buffer lock. So we have to do it just at the point where we've dropped one page lock and not yet acquired the next. Remove CHECK_FOR_INTERRUPTS calls at the top level of btgetbitmap and hashgetbitmap, since they're pointless given the added checks. I think that GIST is okay already --- at least, there's a CHECK_FOR_INTERRUPTS at a plausible-looking place in gistnext(). I don't claim to know GIN well enough to try to poke it for this, if indeed it has a problem at all. This is a pre-existing issue, but in view of the lack of prior complaints I'm not going to risk back-patching.	2009-05-05 19:36:32 +00:00
Tom Lane	2aa5ca952f	Update comment for _bt_relandgetbuf.	2009-05-05 19:02:22 +00:00
Tom Lane	8d4f2ecd41	Change the default value of max_prepared_transactions to zero, and add documentation warnings against setting it nonzero unless active use of prepared transactions is intended and a suitable transaction manager has been installed. This should help to prevent the type of scenario we've seen several times now where a prepared transaction is forgotten and eventually causes severe maintenance problems (or even anti-wraparound shutdown). The only real reason we had the default be nonzero in the first place was to support regression testing of the feature. To still be able to do that, tweak pg_regress to force a nonzero value during "make check". Since we cannot force a nonzero value in "make installcheck", add a variant regression test "expected" file that shows the results that will be obtained when max_prepared_transactions is zero. Also, extend the HINT messages for transaction wraparound warnings to mention the possibility that old prepared transactions are causing the problem. All per today's discussion.	2009-04-23 00:23:46 +00:00
Heikki Linnakangas	bae8102f52	After archive recovery, mark the last WAL segment from the parent timeline ready for archival. It was marked at the next checkpoint anyway, but waiting for the next checkpoint is an unnecessary delay. Fujii Masao	2009-04-22 19:51:12 +00:00
Tom Lane	387060951e	Add an optional parameter to pg_start_backup() that specifies whether to do the checkpoint in immediate or lazy mode. This is to address complaints that pg_start_backup() takes a long time even when there's no need to minimize its I/O consumption.	2009-04-07 00:31:26 +00:00
Teodor Sigaev	09368d23db	Fix 'all at one page bug' in picksplit method of R-tree emulation. Add defense from buggy user-defined picksplit to GiST.	2009-04-06 14:27:27 +00:00
Teodor Sigaev	329a5322e9	Fix infinite loop while checking of partial match in pending list. Improve comments. Now GIN-indexable operators should be strict. Per Tom's questions/suggestions.	2009-04-05 11:32:01 +00:00
Tom Lane	090173a3f9	Remove the recently added node types ReloptElem and OptionDefElem in favor of adding optional namespace and action fields to DefElem. Having three node types that do essentially the same thing bloats the code and leads to errors of confusion, such as in yesterday's bug report from Khee Chin.	2009-04-04 21:12:31 +00:00
Alvaro Herrera	1c855f01ea	Disallow setting fillfactor for TOAST tables. To implement this without almost duplicating the reloption table, treat relopt_kind as a bitmask instead of an integer value. This decreases the range of allowed values, but it's not clear that there's need for that much values anyway. This patch also makes heap_reloptions explicitly a no-op for relation kinds other than heap and TOAST tables. Patch by ITAGAKI Takahiro with minor edits from me. (In particular I removed the bit about adding relation kind to an error message, which I intend to commit separately.)	2009-04-04 00:45:02 +00:00
Bruce Momjian	0e550ff617	Revert DTrace patch from Robert Lor	2009-04-02 20:59:10 +00:00
Bruce Momjian	227f817c1f	Add support for additional DTrace probes. Robert Lor	2009-04-02 19:14:34 +00:00
Tom Lane	793d5662e8	Fix an oversight in the support for storing/retrieving "minimal tuples" in TupleTableSlots. We have functions for retrieving a minimal tuple from a slot after storing a regular tuple in it, or vice versa; but these were implemented by converting the internal storage from one format to the other. The problem with that is it invalidates any pass-by-reference Datums that were already fetched from the slot, since they'll be pointing into the just-freed version of the tuple. The known problem cases involve fetching both a whole-row variable and a pass-by-reference value from a slot that is fed from a tuplestore or tuplesort object. The added regression tests illustrate some simple cases, but there may be other failure scenarios traceable to the same bug. Note that the added tests probably only fail on unpatched code if it's built with --enable-cassert; otherwise the bug leads to fetching from freed memory, which will not have been overwritten without additional conditions. Fix by allowing a slot to contain both formats simultaneously; which turns out not to complicate the logic much at all, if anything it seems less contorted than before. Back-patch to 8.2, where minimal tuples were introduced.	2009-03-30 04:08:43 +00:00
Tom Lane	87b8db3774	Adjust the APIs for GIN opclass support functions to allow the extractQuery() method to pass extra data to the consistent() and comparePartial() methods. This is the core infrastructure needed to support the soon-to-appear contrib/btree_gin module. The APIs are still upward compatible with the definitions used in 8.3 and before, although not with the previous 8.4devel function definitions. catversion bump for changes in pg_proc entries (although these are just cosmetic, since GIN doesn't actually look at the function signature before calling it...) Teodor Sigaev and Oleg Bartunov	2009-03-25 22:19:02 +00:00
Tom Lane	e5efda442c	Install a search tree depth limit in GIN bulk-insert operations, to prevent them from degrading badly when the input is sorted or nearly so. In this scenario the tree is unbalanced to the point of becoming a mere linked list, so insertions become O(N^2). The easiest and most safely back-patchable solution is to stop growing the tree sooner, ie limit the growth of N. We might later consider a rebalancing tree algorithm, but it's not clear that the benefit would be worth the cost and complexity. Per report from Sergey Burladyan and an earlier complaint from Heikki. Back-patch to 8.2; older versions didn't have GIN indexes.	2009-03-24 22:06:03 +00:00
Tom Lane	ff301d6e69	Implement "fastupdate" support for GIN indexes, in which we try to accumulate multiple index entries in a holding area before adding them to the main index structure. This helps because bulk insert is (usually) significantly faster than retail insert for GIN. This patch also removes GIN support for amgettuple-style index scans. The API defined for amgettuple is difficult to support with fastupdate, and the previously committed partial-match feature didn't really work with it either. We might eventually figure a way to put back amgettuple support, but it won't happen for 8.4. catversion bumped because of change in GIN's pg_am entry, and because the format of GIN indexes changed on-disk (there's a metapage now, and possibly a pending list). Teodor Sigaev	2009-03-24 20:17:18 +00:00
Tom Lane	1079564979	Const-ify the parse table passed to fillRelOptions. The previous coding meant it had to be built on-the-fly at each entry to default_reloptions.	2009-03-23 16:36:27 +00:00
Tom Lane	e04810e8c4	Code review for dtrace probes added (so far) to 8.4. Adjust placement of some bufmgr probes, take out redundant and memory-leak-inducing path arguments to smgr__md__read__done and smgr__md__write__done, fix bogus attempt to recalculate space used in sort__done, clean up formatting in places where I'm not sure pgindent will do a nice job by itself.	2009-03-11 23:19:25 +00:00
Heikki Linnakangas	fb7df896fc	Reload config file in startup process on SIGHUP. Fujii Masao	2009-03-04 13:56:40 +00:00
Tom Lane	640796ff41	Reduce the maximum value of vacuum_cost_delay and autovacuum_vacuum_cost_delay to 100ms (from 1000). This still seems to be comfortably larger than the useful range of the parameter, and it should help discourage people from picking uselessly large values. Tweak the documentation to recommend small values, too. Per discussion of a couple weeks ago.	2009-02-28 00:10:52 +00:00
Heikki Linnakangas	bc134d7a51	Change the signaling of end-of-recovery. Startup process now indicates end of recovery by exiting with exit code 0, like in previous releases. Per Tom's suggestion.	2009-02-23 09:28:50 +00:00
Heikki Linnakangas	cdd46c7654	Start background writer during archive recovery. Background writer now performs its usual buffer cleaning duties during archive recovery, and it's responsible for performing restartpoints. This requires some changes in postmaster. When the startup process has done all the initialization and is ready to start WAL redo, it signals the postmaster to launch the background writer. The postmaster is signaled again when the point in recovery is reached where we know that the database is in consistent state. Postmaster isn't interested in that at the moment, but that's the point where we could let other backends in to perform read-only queries. The postmaster is signaled third time when the recovery has ended, so that postmaster knows that it's safe to start accepting connections. The startup process now traps SIGTERM, and performs a "clean" shutdown. If you do a fast shutdown during recovery, a shutdown restartpoint is performed, like a shutdown checkpoint, and postmaster kills the processes cleanly. You still have to continue the recovery at next startup, though. Currently, the background writer is only launched during archive recovery. We could launch it during crash recovery as well, but it seems better to keep that codepath as simple as possible, for the sake of robustness. And it couldn't do any restartpoints during crash recovery anyway, so it wouldn't be that useful. log_restartpoints is gone. Use log_checkpoints instead. This is yet to be documented. This whole operation is a pre-requisite for Hot Standby, but has some value of its own whether the hot standby patch makes 8.4 or not. Simon Riggs, with lots of modifications by me.	2009-02-18 15:58:41 +00:00
Tom Lane	8205258fa6	Adopt Bob Jenkins' improved hash function for hash_any(). This changes the contents of hash indexes (again), so bump catversion. Kenneth Marshall	2009-02-09 21:18:28 +00:00
Alvaro Herrera	834a6da4f7	Update autovacuum to use reloptions instead of a system catalog, for per-table overrides of parameters. This removes a whole class of problems related to misusing the catalog, and perhaps more importantly, gives us pg_dump support for the parameters. Based on a patch by Euler Taveira de Oliveira, heavily reworked by me.	2009-02-09 20:57:59 +00:00
Heikki Linnakangas	b75b66332a	Fix obsolete comment. Zdenek Kotala	2009-02-07 10:49:36 +00:00
Alvaro Herrera	3a5b773715	Allow reloption names to have qualifiers, initially supporting a TOAST qualifier, and add support for this in pg_dump. This allows TOAST tables to have user-defined fillfactor, and will also enable us to move the autovacuum parameters to reloptions without taking away the possibility of setting values for TOAST tables.	2009-02-02 19:31:40 +00:00
Alvaro Herrera	c0f92b57dc	Allow extracting and parsing of reloptions from a bare pg_class tuple, and refactor the relcache code that used to do that. This allows other callers (particularly autovacuum) to do the same without necessarily having to open and lock a table.	2009-01-26 19:41:06 +00:00
Heikki Linnakangas	9187cedd7c	Put back fast-path for the case that there's no backup blocks in RestoreBkpBlocks. Went missing in my recent refactoring patch, as pointed out by Simon's hot standby patch.	2009-01-23 11:19:34 +00:00
Tom Lane	3cb5d6580a	Support column-level privileges, as required by SQL standard. Stephen Frost, with help from KaiGai Kohei and others	2009-01-22 20:16:10 +00:00
Heikki Linnakangas	b2a667b9ee	Add a new option to RestoreBkpBlocks() to indicate if a cleanup lock should be used instead of the normal exclusive lock, and make WAL redo functions responsible for calling RestoreBkpBlocks(). They know better what kind of a lock they need. At the moment, this just moves things around with no functional change, but makes the hot standby patch that's under review cleaner.	2009-01-20 18:59:37 +00:00
Alvaro Herrera	8ebe1e356c	Simplify the writing of amoptions routines by introducing a convenience fillRelOptions routine that stores the parsed values in the struct using a table-based approach. Per Tom suggestion. Also remove the "continue" in HANDLE_*_RELOPTION macros, which were useless and in spirit they were assuming too much of how the macros were going to be used. (Note that these macros are now unused, but the intention is to introduce some usage in a future autovacuum patch, which is why they weren't completely removed.) Also, do not call the string validation routine when not validating. It seems less error-prone this way, per commentary on the amoptions SGML docs.	2009-01-12 21:02:15 +00:00
Tom Lane	1a37056a74	Re-enable the old code in xlog.c that tried to use posix_fadvise(), so that we can get some buildfarm feedback about whether that function is still problematic. (Note that the planned async-preread patch will not really prove anything one way or the other in buildfarm testing, since it will be inactive with default GUC settings.)	2009-01-11 18:02:17 +00:00
Tom Lane	43a57cf365	Revise the TIDBitmap API to support multiple concurrent iterations over a bitmap. This is extracted from Greg Stark's posix_fadvise patch; it seems worth committing separately, since it's potentially useful independently of posix_fadvise.	2009-01-10 21:08:36 +00:00
Alvaro Herrera	b813c8daca	A couple further reloptions improvements, per KaiGai Kohei: add a validation function to the string type and add a couple of macros for string handling. In passing, fix an off-by-one bug of mine.	2009-01-08 19:34:41 +00:00
Alvaro Herrera	b25433da5d	Fix string reloption handling, per KaiGai Kohei.	2009-01-06 14:47:37 +00:00
Bruce Momjian	1d2a185eae	Suppress compiler warning in a different way, per Alvaro.	2009-01-06 03:15:51 +00:00
Bruce Momjian	02aa5f19c6	Supress compiler warning.	2009-01-06 02:44:17 +00:00
Alvaro Herrera	ba748f7a11	Change the reloptions machinery to use a table-based parser, and provide a more complete framework for writing custom option processing routines by user-defined access methods. Catalog version bumped due to the general API changes, which are going to affect user-defined "amoptions" routines.	2009-01-05 17:14:28 +00:00
Bruce Momjian	511db38ace	Update copyright for 2009.	2009-01-01 17:24:05 +00:00
Bruce Momjian	4ee79fd20d	Change the name of dtrace wal tracepoints: TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY Robert Lor	2008-12-24 20:41:29 +00:00
Bruce Momjian	5a90bc1fbe	The attached patch contains a couple of fixes in the existing probes and includes a few new ones. - Fixed compilation errors on OS X for probes that use typedefs - Fixed a number of probes to pass ForkNumber per the relation forks patch - The new probes are those that were taken out from the previous submitted patch and required simple fixes. Will submit the other probes that may require more discussion in a separate patch. Robert Lor	2008-12-17 01:39:04 +00:00
Tom Lane	fc3297d828	Make heap_update() set newtup->t_tableOid correctly, for consistency with the other major heapam.c functions. The only known consequence of this omission is that UPDATE RETURNING failed to return the correct value for "tableoid", as per report from KaiGai Kohei. Back-patch to 8.2. Arguably it's wrong all the way back; but without evidence of visible breakage before RETURNING was added, I'll desist from patching the older branches.	2008-12-16 16:26:08 +00:00
Tom Lane	17dc173660	To reduce confusion over whether VACUUM FULL is needed for anti-wraparound vacuuming (it's not), say "database-wide VACUUM" instead of "full-database VACUUM" in the relevant hint messages. Also, document the permissions needed to do this. Per today's discussion.	2008-12-11 18:16:18 +00:00
Heikki Linnakangas	dea81a6cf6	Revert SIGUSR1 multiplexing patch, per Tom's objection.	2008-12-09 15:59:39 +00:00
Heikki Linnakangas	7b05b3fa39	Provide support for multiplexing SIGUSR1 signal. The upcoming synchronous replication patch needs a signal, but we've already used SIGUSR1 and SIGUSR2 in normal backends. This patch allows reusing SIGUSR1 for that, and for other purposes too if the need arises.	2008-12-09 14:28:20 +00:00
Heikki Linnakangas	7a567d9407	MAPSIZE macro needs to use MAXALIGN(SizeOfPageHeaderData) instead of SizeOfPageHeaderData, like PageGetContents does. Per report by Pavan Deolasee.	2008-12-06 17:31:37 +00:00
Alvaro Herrera	7b640b0345	Fix a couple of snapshot management bugs in the new ResourceOwner world: non-writable large objects need to have their snapshots registered on the transaction resowner, not the current portal's, because it must persist until the large object is closed (which the portal does not). Also, ensure that the serializable snapshot is recorded by the transaction resource owner too, even when a subtransaction has changed the current resource owner before serializable is taken. Per bug reports from Pavan Deolasee.	2008-12-04 14:51:02 +00:00
Teodor Sigaev	69b3383cfb	Initialize GISTScanOpaque->qual_ok even if there is no conditions.	2008-12-04 11:08:46 +00:00
Heikki Linnakangas	608195a3a3	Introduce visibility map. The visibility map is a bitmap with one bit per heap page, where a set bit indicates that all tuples on the page are visible to all transactions, and the page therefore doesn't need vacuuming. It is stored in a new relation fork. Lazy vacuum uses the visibility map to skip pages that don't need vacuuming. Vacuum is also responsible for setting the bits in the map. In the future, this can hopefully be used to implement index-only-scans, but we can't currently guarantee that the visibility map is always 100% up-to-date. In addition to the visibility map, there's a new PD_ALL_VISIBLE flag on each heap page, also indicating that all tuples on the page are visible to all transactions. It's important that this flag is kept up-to-date. It is also used to skip visibility tests in sequential scans, which gives a small performance gain on seqscans.	2008-12-03 13:05:22 +00:00
Heikki Linnakangas	b457b2a24e	If pg_stop_backup() is called just after switching to a new xlog file, wait for the previous instead of the new file to be archived. Based on patch by Simon Riggs.	2008-12-03 08:20:11 +00:00
Tom Lane	c1f3073333	Clean up the API for DestReceiver objects by eliminating the assumption that a Portal is a useful and sufficient additional argument for CreateDestReceiver --- it just isn't, in most cases. Instead formalize the approach of passing any needed parameters to the receiver separately. One unexpected benefit of this change is that we can declare typedef Portal in a less surprising location. This patch is just code rearrangement and doesn't change any functionality. I'll tackle the HOLD-cursor-vs-toast problem in a follow-on patch.	2008-11-30 20:51:25 +00:00
Heikki Linnakangas	9858a8c81c	Rely on relcache invalidation to update the cached size of the FSM.	2008-11-26 17:08:58 +00:00
Heikki Linnakangas	3396000684	Rethink the way FSM truncation works. Instead of WAL-logging FSM truncations in FSM code, call FreeSpaceMapTruncateRel from smgr_redo. To make that cleaner from modularity point of view, move the WAL-logging one level up to RelationTruncate, and move RelationTruncate and all the related WAL-logging to new src/backend/catalog/storage.c file. Introduce new RelationCreateStorage and RelationDropStorage functions that are used instead of calling smgrcreate/smgrscheduleunlink directly. Move the pending rel deletion stuff from smgrcreate/smgrscheduleunlink to the new functions. This leaves smgr.c as a thin wrapper around md.c; all the transactional stuff is now in storage.c. This will make it easier to add new forks with similar truncation logic, like the visibility map.	2008-11-19 10:34:52 +00:00
Alvaro Herrera	03e5248d0f	Replace the usage of heap_addheader to create pg_attribute tuples with regular heap_form_tuple. Since this removes the last remaining caller of heap_addheader, remove it. Extracted from the column privileges patch from Stephen Frost, with further code cleanups by me.	2008-11-14 01:57:42 +00:00
Tom Lane	10e3acb8e7	Prevent synchronous scan during GIN index build, because GIN is optimized for inserting tuples in increasing TID order. It's not clear whether this fully explains Ivan Sergio Borgonovo's complaint, but simple testing confirms that a scan that doesn't start at block 0 can slow GIN build by a factor of three or four. Backpatch to 8.3. Sync scan didn't exist before that.	2008-11-13 17:42:10 +00:00
Tom Lane	cad3a26a95	Fix sloppy omission of now-required #include's.	2008-11-11 14:17:02 +00:00
Heikki Linnakangas	7e8b0b9ab1	Change error messages to print the physical path, like "base/11517/3767_fsm", instead of symbolic names like "1663/11517/3767/1", per Alvaro's suggestion. I didn't change the messages in the higher-level index, heap and FSM routines, though, where the fork is implicit.	2008-11-11 13:19:16 +00:00
Tom Lane	1d577f5e49	Add a startup check that pg_xlog and pg_xlog/archive_status exist. If the latter doesn't exist, automatically recreate it. (We don't do this for pg_xlog, though, per discussion.) Jonah Harris	2008-11-09 17:51:15 +00:00
Tom Lane	85e2cedf98	Improve bulk-insert performance by keeping the current target buffer pinned (but not locked, as that would risk deadlocks). Also, make it work in a small ring of buffers to avoid having bulk inserts trash the whole buffer arena. Robert Haas, after an idea of Simon Riggs'.	2008-11-06 20:51:15 +00:00
Heikki Linnakangas	5ae29525d1	The logic in systable_beginscan to translate heap attribute numbers to index column numbers needs to handle the case where you have more than one scankey on the same index column. toast_fetch_datum_slice() needs it.	2008-11-06 13:07:08 +00:00
Tom Lane	b4eae023bb	Clean up the messy semantics (not to mention inefficiency) of PageGetTempPage by splitting it into three functions with better-defined behaviors. Zdenek Kotala	2008-11-03 20:47:49 +00:00
Alvaro Herrera	4ff0468371	Fix silly typo in previous commit.	2008-11-03 19:26:07 +00:00
Alvaro Herrera	d698bf83d1	Fix TransactionIdSetStatusBit so that it doesn't try to change a transaction from COMMITTED to SUBCOMMITTED during recovery. This wasn't previously possible, but it is now due to the recent changes on clog commit protocol for subtransactions. Simon Riggs	2008-11-03 19:24:03 +00:00
Alvaro Herrera	b107299c40	Fix mistakes in comment headers	2008-11-03 15:10:17 +00:00
Tom Lane	d7112cfa88	Remove the last vestiges of the MAKE_PTR/MAKE_OFFSET mechanism. We haven't allowed different processes to have different addresses for the shmem segment in quite a long time, but there were still a few places left that used the old coding convention. Clean them up to reduce confusion and improve the compiler's ability to detect pointer type mismatches. Kris Jurka	2008-11-02 21:24:52 +00:00
Tom Lane	902d1cb35f	Remove all uses of the deprecated functions heap_formtuple, heap_modifytuple, and heap_deformtuple in favor of the newer functions heap_form_tuple et al (which do the same things but use bool control flags instead of arbitrary char values). Eliminate the former duplicate coding of these functions, reducing the deprecated functions to mere wrappers around the newer ones. We can't get rid of them entirely because add-on modules probably still contain many instances of the old coding style. Kris Jurka	2008-11-02 01:45:28 +00:00
Heikki Linnakangas	e9816533e3	Update FSM on WAL replay. This is a bit limited; the FSM is only updated on non-full-page-image WAL records, and quite arbitrarily, only if there's less than 20% free space on the page after the insert/update (not on HOT updates, though). The 20% cutoff should avoid most of the overhead, when replaying a bulk insertion, for example, while ensuring that pages that are full are marked as full in the FSM. This is mostly to avoid the nasty worst case scenario, where you replay from a PITR archive, and the FSM information in the base backup is really out of date. If there was a lot of pages that the outdated FSM claims to have free space, but don't actually have any, the first unlucky inserter after the recovery would traverse through all those pages, just to find out that they're full. We didn't have this problem with the old FSM implementation, because we simply threw the FSM information away on a non-clean shutdown.	2008-10-31 19:40:27 +00:00
Heikki Linnakangas	19c8dc839b	Unite ReadBufferWithFork, ReadBufferWithStrategy, and ZeroOrReadBuffer functions into one ReadBufferExtended function, that takes the strategy and mode as argument. There's three modes, RBM_NORMAL which is the default used by plain ReadBuffer(), RBM_ZERO, which replaces ZeroOrReadBuffer, and a new mode RBM_ZERO_ON_ERROR, which allows callers to read corrupt pages without throwing an error. The FSM needs the new mode to recover from corrupt pages, which could happend if we crash after extending an FSM file, and the new page is "torn". Add fork number to some error messages in bufmgr.c, that still lacked it.	2008-10-31 15:05:00 +00:00
Tom Lane	2314baef38	Fix recoveryLastXTime logic so that it actually does what one would expect. Per gripe from Kevin Grittner. Backpatch to 8.3, where the bug was introduced.	2008-10-30 04:06:16 +00:00
Alvaro Herrera	c9d1efda96	No need for extra code to log freezing zero tuples. Callers already check that they are freezing a nonzero amount anyway.	2008-10-27 21:50:12 +00:00
Teodor Sigaev	b9856b67a7	Fix GiST's killing tuple: GISTScanOpaque->curpos wasn't correctly set. As result, killtuple() marks as dead wrong tuple on page. Bug was introduced by me while fixing possible duplicates during GiST index scan.	2008-10-22 12:53:56 +00:00
Alvaro Herrera	97227e9ec0	These functions no longer return a value, per complaint from gothic_moth via Zdenek Kotala.	2008-10-20 20:38:24 +00:00
Alvaro Herrera	06da3c570f	Rework subtransaction commit protocol for hot standby. This patch eliminates the marking of subtransactions as SUBCOMMITTED in pg_clog during their commit; instead they remain in-progress until main transaction commit. At main transaction commit, the commit protocol is atomic-by-page instead of one transaction at a time. To avoid a race condition with some subtransactions appearing committed before others in the case where they span more than one pg_clog page, we conserve the logic that marks them subcommitted before marking the parent committed. Simon Riggs with minor help from me	2008-10-20 19:18:18 +00:00
Teodor Sigaev	3afffbc902	Remove support of backward scan in GiST. Per discussion http://archives.postgresql.org/pgsql-hackers/2008-10/msg00857.php	2008-10-20 16:35:14 +00:00
Teodor Sigaev	77db9d9ff2	Remove mark/restore support in GIN and GiST indexes. Per Tom's comment. Also revome useless GISTScanOpaque->flags field.	2008-10-20 13:39:44 +00:00
Tom Lane	af59a0650b	Remove useless mark/restore support in hash index AM, per discussion. (I'm leaving GiST/GIN cleanup to Teodor.)	2008-10-17 23:50:57 +00:00
Teodor Sigaev	beeb3562dd	During repeated rescan of GiST index it's possible that scan key is NULL but SK_SEARCHNULL is not set. Add checking IS NULL of keys to set during key initialization. If key is NULL and SK_SEARCHNULL is not set then nothnig can be satisfied. With assert-enabled compilation that causes coredump. Bug was introduced in 8.3 by support of IS NULL index scan.	2008-10-17 17:02:21 +00:00
Tom Lane	30584cda35	Fix small query-lifespan memory leak introduced by 8.4 change in index AM API for bitmap index scans. Per report and test case from Kevin Grittner.	2008-10-10 14:17:08 +00:00
Tom Lane	3437286356	Modify the parser's error reporting to include a specific hint for the case of referencing a WITH item that's not yet in scope according to the SQL spec's semantics. This seems to be an easy error to make, and the bare "relation doesn't exist" message doesn't lead one's mind in the correct direction to fix it.	2008-10-08 01:14:44 +00:00
Heikki Linnakangas	89f373bf5b	Index FSMs needs to be vacuumed as well. Report by Jeff Davis.	2008-10-06 08:04:11 +00:00
Bruce Momjian	2cc1633a35	Update README.HOT to reflect new snapshot tracking and xmin advancement code in 8.4.	2008-10-02 20:59:31 +00:00
Heikki Linnakangas	15c121b3ed	Rewrite the FSM. Instead of relying on a fixed-size shared memory segment, the free space information is stored in a dedicated FSM relation fork, with each relation (except for hash indexes; they don't use FSM). This eliminates the max_fsm_relations and max_fsm_pages GUC options; remove any trace of them from the backend, initdb, and documentation. Rewrite contrib/pg_freespacemap to match the new FSM implementation. Also introduce a new variant of the get_raw_page(regclass, int4, int4) function in contrib/pageinspect that let's you to return pages from any relation fork, and a new fsm_page_contents() function to inspect the new FSM pages.	2008-09-30 10:52:14 +00:00
Heikki Linnakangas	61d9674988	Make LC_COLLATE and LC_CTYPE database-level settings. Collation and ctype are now more like encoding, stored in new datcollate and datctype columns in pg_database. This is a stripped-down version of Radek Strnad's patch, with further changes by me.	2008-09-23 09:20:39 +00:00
Tom Lane	4adc2f72a4	Change hash indexes to store only the hash code rather than the whole indexed value. This means that hash index lookups are always lossy and have to be rechecked when the heap is visited; however, the gain in index compactness outweighs this when the indexed values are wide. Also, we only need to perform datatype comparisons when the hash codes match exactly, rather than for every entry in the hash bucket; so it could also win for datatypes that have expensive comparison functions. A small additional win is gained by keeping hash index pages sorted by hash code and using binary search to reduce the number of index tuples we have to look at. Xiao Meng This commit also incorporates Zdenek Kotala's patch to isolate hash metapages and hash bitmaps a bit better from the page header datastructures.	2008-09-15 18:43:41 +00:00
Alvaro Herrera	d53a56687f	Initialize the minimum frozen Xid in vac_update_datfrozenxid using GetOldestXmin() instead of RecentGlobalXmin; this is safer because we do not depend on the latter being correctly set elsewhere, and while it is more expensive, this code path is not performance-critical. This is a real risk for autovacuum, because it can execute whole cycles without doing a single vacuum, which would mean that RecentGlobalXmin would stay at its initialization value, FirstNormalTransactionId, causing a bogus value to be inserted in pg_database. This bug could explain some recent reports of failure to truncate pg_clog. At the same time, change the initialization of RecentGlobalXmin to InvalidTransactionId, and ensure that it's set to something else whenever it's going to be used. Using it as FirstNormalTransactionId in HOT page pruning could incur in data loss. InitPostgres takes care of setting it to a valid value, but the extra checks are there to prevent "special" backends from behaving in unusual ways. Per Tom Lane's detailed problem dissection in 29544.1221061979@sss.pgh.pa.us	2008-09-11 14:01:10 +00:00
Tom Lane	ead21631e8	Fix a couple of problems pointed out by Fujii Masao in the 2008-Apr-05 patch for pg_stop_backup. First, it is possible that the history file name is not alphabetically later than the last WAL file name, so we should explicitly check that both have been archived. Second, the previous coding would wait forever if a checkpoint had managed to remove the WAL file before we look for it. Simon Riggs, plus some code cleanup by me.	2008-09-08 16:42:15 +00:00
Teodor Sigaev	5373817cf2	Fix strategy propagation to scanEntry for partial match by moving propagation to initializaion of scanEntry.	2008-09-04 11:47:05 +00:00
Teodor Sigaev	1dcf6fdf1b	Fix possible duplicate tuples while GiST scan. Now page is processed at once and ItemPointers are collected in memory. Remove tuple's killing by killtuple() if tuple was moved to another page - it could produce unaceptable overhead. Backpatch up to 8.1 because the bug was introduced by GiST's concurrency support.	2008-08-23 10:37:24 +00:00
Heikki Linnakangas	3f0e808c4a	Introduce the concept of relation forks. An smgr relation can now consist of multiple forks, and each fork can be created and grown separately. The bulk of this patch is about changing the smgr API to include an extra ForkNumber argument in every smgr function. Also, smgrscheduleunlink and smgrdounlink no longer implicitly call smgrclose, because other forks might still exist after unlinking one. The callers of those functions have been modified to call smgrclose instead. This patch in itself doesn't have any user-visible effect, but provides the infrastructure needed for upcoming patches. The additional forks envisioned are a rewritten FSM implementation that doesn't rely on a fixed-size shared memory block, and a visibility map to allow skipping portions of a table in VACUUM that have no dead tuples.	2008-08-11 11:05:11 +00:00

... 2 3 4 5 6 ...

1768 Commits