postgresql

Commit Graph

Author	SHA1	Message	Date
Tom Lane	2b78d101d1	Fix SQL function execution to be safe with long-lived FmgrInfos. fmgr_sql had been designed on the assumption that the FmgrInfo it's called with has only query lifespan. This is demonstrably unsafe in connection with range types, as shown in bug #7881 from Andrew Gierth. Fix things so that we re-generate the function's cache data if the (sub)transaction it was made in is no longer active. Back-patch to 9.2. This might be needed further back, but it's not clear whether the case can realistically arise without range types, so for now I'll desist from back-patching further.	2013-03-03 17:39:58 -05:00
Tom Lane	fdaf44862b	Invent pre-commit/pre-prepare/pre-subcommit events for xact callbacks. Currently it's only possible for loadable modules to get control during post-commit cleanup of a transaction. That doesn't work too well if they want to do something that could throw an error; for example, an FDW might need to issue a remote commit, which could well fail. To improve matters, extend the existing APIs for XactCallback and SubXactCallback functions to provide new pre-commit events for this purpose. The release notes will need to mention that existing callback functions should be checked to make sure they don't do something unwanted when one of the new event types occurs. In the examples within our source tree, contrib/sepgsql was fine but plpgsql had been a bit too cute.	2013-02-14 20:35:08 -05:00
Bruce Momjian	bd61a623ac	Update copyrights for 2013. Fully update git head, and update back branches in ./COPYRIGHT and legal.sgml files.	2013-01-01 17:15:01 -05:00
Heikki Linnakangas	7bffc9b7bf	Update minimum recovery point on truncation. If a file is truncated, we must update minRecoveryPoint. Once a file is truncated, there's no going back; it would not be safe to stop recovery at a point earlier than that anymore. Per report from Kyotaro HORIGUCHI. Backpatch to 8.4. Before that, minRecoveryPoint was not updated during recovery at all.	2012-12-10 16:57:16 +02:00
Simon Riggs	f21bb9cfb5	Refactor inCommit flag into generic delayChkpt flag. Rename PGXACT->inCommit flag into delayChkpt flag, and generalise comments to allow use in other situations, such as the forthcoming potential use in checksum patch. Replace wait loop to look for VXIDs with delayChkpt set. No user visible changes, no behaviour changes at present. Simon Riggs, reviewed and rebased by Jeff Davis	2012-12-03 13:13:53 +00:00
Alvaro Herrera	1577b46b7c	Split out rmgr rm_desc functions into their own files. This is necessary (but not sufficient) to have them compilable outside of a backend environment.	2012-11-28 13:01:15 -03:00
Simon Riggs	d9fad1076d	Skip searching for subxact locks at commit. At commit all standby locks are released for the top-level transaction, so searching for locks for each subtransaction is both pointless and costly (N^2) in the presence of many AccessExclusiveLocks.	2012-11-13 16:00:19 -03:00
Tom Lane	ff3f9c8de5	Close un-owned SMgrRelations at transaction end. If an SMgrRelation is not "owned" by a relcache entry, don't allow it to live past transaction end. This design allows the same SMgrRelation to be used for blind writes of multiple blocks during a transaction, but ensures that we don't hold onto such an SMgrRelation indefinitely. Because an SMgrRelation typically corresponds to open file descriptors at the fd.c level, leaving it open when there's no corresponding relcache entry can mean that we prevent the kernel from reclaiming deleted disk space. (While CacheInvalidateSmgr messages usually fix that, there are cases where they're not issued, such as DROP DATABASE. We might want to add some more sinval messaging for that, but I'd be inclined to keep this type of logic anyway, since allowing VFDs to accumulate indefinitely for blind-written relations doesn't seem like a good idea.) This code replaces a previous attempt towards the same goal that proved to be unreliable. Back-patch to 9.1 where the previous patch was added.	2012-10-17 12:38:21 -04:00
Alvaro Herrera	45326c5a11	Split resowner.h. This lets files that are mere users of ResourceOwner not automatically include the headers for stuff that is managed by the resowner mechanism.	2012-08-28 18:02:07 -04:00
Bruce Momjian	41fa3dfb0a	Update C comment to NOTICE to reflect previous commit changing the error level, per report from Tom.	2012-08-15 19:09:37 -04:00
Robert Haas	f11e8be3e8	Make commit_delay much smarter. Instead of letting every backend participating in a group commit wait independently, have the first one that becomes ready to flush WAL wait for the configured delay, and let all the others wait just long enough for that first process to complete its flush. This greatly increases the chances of being able to configure a commit_delay setting that actually improves performance. As a side consequence of this change, commit_delay now affects all WAL flushes, rather than just commits. There was some discussion on pgsql-hackers about whether to rename the GUC to, say, wal_flush_delay, but in the absence of consensus I am leaving it alone for now. Peter Geoghegan, with some changes, mostly to the documentation, by me.	2012-07-02 10:26:31 -04:00
Robert Haas	f83b59997d	Make walsender more responsive. Per testing by Andres Freund, this improves replication performance and reduces replication latency and latency jitter. I was a bit concerned about moving more work into XLogInsert, but testing seems to show that it's not a problem in practice. Along the way, improve comments for WaitLatchOrSocket. Andres Freund. Review and stylistic cleanup by me.	2012-07-02 09:41:01 -04:00
Alvaro Herrera	77ed0c6950	Tighten up includes in sinvaladt.h, twophase.h, proc.h. Remove proc.h from sinvaladt.h and twophase.h; also replace xlog.h in proc.h with xlogdefs.h.	2012-06-25 18:40:40 -04:00
Heikki Linnakangas	0ab9d1c4b3	Replace XLogRecPtr struct with a 64-bit integer. This simplifies code that needs to do arithmetic on XLogRecPtrs. To avoid changing on-disk format of data pages, the LSN on data pages is still stored in the old format. That should keep pg_upgrade happy. However, we have XLogRecPtrs embedded in the control file, and in the structs that are sent over the replication protocol, so this change breaks compatibility of pg_basebackup and server. I didn't do anything about this in this patch; per discussion on -hackers, the right thing to do would be to change the replication protocol to be architecture-independent, so that you could use a newer version of pg_receivexlog, for example, against an older server version.	2012-06-24 19:19:45 +03:00
Robert Haas	cd80073445	During transaction cleanup, release locks before deleting files. There's no need to hold onto the locks until the files are needed, and by doing it this way, we reduce the impact on other backends who may be awaiting locks we hold. Noah Misch	2012-06-14 10:19:33 -04:00
Bruce Momjian	927d61eeff	Run pgindent on 9.2 source tree in preparation for first 9.3 commit-fest.	2012-06-10 15:20:04 -04:00
Tom Lane	ece01aae47	Scan the buffer pool just once, not once per fork, during relation drop. This provides a speedup of about 4X when NBuffers is large enough. There is also a useful reduction in sinval traffic, since we only do CacheInvalidateSmgr() once not once per fork. Simon Riggs, reviewed and somewhat revised by Tom Lane	2012-06-07 17:43:11 -04:00
Tom Lane	a04dc87db1	Improve comment for GetStableLatestTransactionId().	2012-05-31 11:20:02 -04:00
Tom Lane	2755abf386	Teach AbortOutOfAnyTransaction to clean up partially-started transactions. AbortOutOfAnyTransaction failed to do anything if the state it saw on entry corresponded to failing partway through StartTransaction. I fixed AbortCurrentTransaction to cope with that case way back in commit `60b2444cc3`, but evidently overlooked that AbortOutOfAnyTransaction should do likewise. Back-patch to all supported branches. It's not clear that this omission has any more-than-cosmetic consequences, but it's also not clear that it doesn't, so back-patching seems the least risky choice.	2012-05-28 23:57:06 -04:00
Simon Riggs	867540b49c	Ensure backwards compatibility for GetStableLatestTransactionId()	2012-05-12 13:26:10 +01:00
Simon Riggs	b06679e012	Ensure age() returns a stable value rather than the latest value	2012-05-11 14:36:24 +01:00
Robert Haas	53c5b869b4	Tighten up error recovery for fast-path locking. The previous code could cause a backend crash after BEGIN; SAVEPOINT a; LOCK TABLE foo (interrupted by ^C or statement timeout); ROLLBACK TO SAVEPOINT a; LOCK TABLE foo, and might have leaked strong-lock counts in other situations. Report by Zoltán Böszörményi; patch review by Jeff Davis.	2012-04-18 11:17:30 -04:00
Heikki Linnakangas	fe546f3da6	Don't wait for the commit record to be replicated if we wrote no WAL. When using synchronous replication, we waited for the commit record to be replicated, but if our transaction didn't write any other WAL records, that's not required because we don't even flush the WAL locally to disk in that case. This led to long waits when committing a transaction that only modified a temporary table. Bug spotted by Thom Brown.	2012-04-17 16:28:31 +03:00
Bruce Momjian	e126958c2e	Update copyright notices for year 2012.	2012-01-01 18:01:58 -05:00
Tom Lane	d0024cd188	Avoid crashing when we have problems unlinking files post-commit. smgrdounlink takes care to not throw an ERROR if it fails to unlink something, but that caution was rendered useless by commit `3396000684`, which put an smgrexists call in front of it; smgrexists does throw error if anything looks funny, such as getting a permissions error from trying to open the file. If that happens post-commit, you get a PANIC, and what's worse the same logic appears in the WAL replay code, so the database even fails to restart. Restore the intended behavior by removing the smgrexists call --- it isn't accomplishing anything that we can't do better by adjusting mdunlink's ideas of whether it ought to warn about ENOENT or not. Per report from Joseph Shraibman of unrecoverable crash after trying to drop a table whose FSM fork had somehow gotten chmod'd to 000 permissions. Backpatch to 8.4, where the bogus coding was introduced.	2011-12-20 15:00:36 -05:00
Robert Haas	ed0b409d22	Move "hot" members of PGPROC into a separate PGXACT array. This speeds up snapshot-taking and reduces ProcArrayLock contention. Also, the PGPROC (and PGXACT) structures used by two-phase commit are now allocated as part of the main array, rather than in a separate array, and we keep ProcArray sorted in pointer order. These changes are intended to minimize the number of cache lines that must be pulled in to take a snapshot, and testing shows a substantial increase in performance on both read and write workloads at high concurrencies. Pavan Deolasee, Heikki Linnakangas, Robert Haas	2011-11-25 08:02:10 -05:00
Tom Lane	bb446b689b	Support synchronization of snapshots through an export/import procedure. A transaction can export a snapshot with pg_export_snapshot(), and then others can import it with SET TRANSACTION SNAPSHOT. The data does not leave the server so there are no security issues. A snapshot can only be imported while the exporting transaction is still running, and there are some other restrictions. I'm not totally convinced that we've covered all the bases for SSI (true serializable) mode, but it works fine for lesser isolation modes. Joachim Wieland, reviewed by Marko Tiikkaja, and rather heavily modified by Tom Lane	2011-10-22 18:23:30 -04:00
Tom Lane	57eb009092	Allow snapshot references to still work during transaction abort. In REPEATABLE READ (nee SERIALIZABLE) mode, an attempt to do GetTransactionSnapshot() between AbortTransaction and CleanupTransaction failed, because GetTransactionSnapshot would recompute the transaction snapshot (which is already wrong, given the isolation mode) and then re-register it in the TopTransactionResourceOwner, leading to an Assert because the TopTransactionResourceOwner should be empty of resources after AbortTransaction. This is the root cause of bug #6218 from Yamamoto Takashi. While changing plancache.c to avoid requesting a snapshot when handling a ROLLBACK masks the problem, I think this is really a snapmgr.c bug: it's lower-level than the resource manager mechanism and should not be shutting itself down before we unwind resource manager resources. However, just postponing the release of the transaction snapshot until cleanup time didn't work because of the circular dependency with TopTransactionResourceOwner. Fix by managing the internal reference to that snapshot manually instead of depending on TopTransactionResourceOwner. This saves a few cycles as well as making the module layering more straightforward. predicate.c's dependencies on TopTransactionResourceOwner go away too. I think this is a longstanding bug, but there's no evidence that it's more than a latent bug, so it doesn't seem worth any risk of back-patching.	2011-09-26 22:25:28 -04:00
Tom Lane	a7801b62f2	Move Timestamp/Interval typedefs and basic macros into datatype/timestamp.h. As per my recent proposal, this refactors things so that these typedefs and macros are available in a header that can be included in frontend-ish code. I also changed various headers that were undesirably including utils/timestamp.h to include datatype/timestamp.h instead. Unsurprisingly, this showed that half the system was getting utils/timestamp.h by way of xlog.h. No actual code changes here, just header refactoring.	2011-09-09 13:23:41 -04:00
Simon Riggs	df383b03e6	Partially revoke attempt to improve performance with many savepoints. Maintain difference between subtransaction release and commit introduced by earlier patch.	2011-09-07 12:11:26 +01:00
Tom Lane	1609797c25	Clean up the #include mess a little. walsender.h should depend on xlog.h, not vice versa. (Actually, the inclusion was circular until a couple hours ago, which was even sillier; but Bruce broke it in the expedient rather than logically correct direction.) Because of that poor decision, plus blind application of pgrminclude, we had a situation where half the system was depending on xlog.h to include such unrelated stuff as array.h and guc.h. Clean up the header inclusion, and manually revert a lot of what pgrminclude had done so things build again. This episode reinforces my feeling that pgrminclude should not be run without adult supervision. Inclusion changes in header files in particular need to be reviewed with great care. More generally, it'd be good if we had a clearer notion of module layering to dictate which headers can sanely include which others ... but that's a big task for another day.	2011-09-04 01:13:16 -04:00
Bruce Momjian	6416a82a62	Remove unnecessary #include references, per pgrminclude script.	2011-09-01 10:04:27 -04:00
Simon Riggs	7cb7122800	Remove O(N^2) performance issue with multiple SAVEPOINTs. Subtransaction locks now released en masse at main commit, rather than repeatedly re-scanning for locks as we ascend the nested transaction tree. Split transaction state TBLOCK_SUBEND into two states, TBLOCK_SUBCOMMIT and TBLOCK_SUBRELEASE to allow the commit path to be optimised using the existing code in ResourceOwnerRelease() which appears to have been intended for this usage, judging from comments therein.	2011-07-19 17:21:24 +01:00
Simon Riggs	465883b0a2	Introduce compact WAL record for the common case of commit (non-DDL). XLOG_XACT_COMMIT_COMPACT leaves out invalidation messages and relfilenodes, saving considerable space for the vast majority of transaction commits. XLOG_XACT_COMMIT keeps same definition as XLOG_PAGE_MAGIC 0xD067 and earlier. Leonardo Francalanci and Simon Riggs	2011-06-28 22:58:17 +01:00
Tom Lane	d2088ae949	Move RegisterPredicateLockingXid() call to a safer place. The SSI patch inserted a call of RegisterPredicateLockingXid into GetNewTransactionId, which was a bad idea on a couple of grounds. First, it's not necessary to hold XidGenLock while manipulating that shared memory, and doing so is bad because XidGenLock is a high-contention lock that should be held for as short a time as possible. (Not to mention that it adds an entirely unnecessary deadlock hazard, since we must take SerializableXactHashLock as well.) Second, the specific place where it was put was between extending CLOG and advancing nextXid, which could result in unpleasant behavior in case of a failure there. Pull the call out to AssignTransactionId, which is much safer and arguably better from a modularity standpoint too. There is more work to do to clean up the failure-before-advancing-nextXid issue, but that is a separate change that will need to be back-patched. So for the moment I just want to make GetNewTransactionId look the same as it did in prior versions.	2011-05-06 12:57:28 -04:00
Bruce Momjian	bf50caf105	pgindent run before PG 9.1 beta 1.	2011-04-10 11:42:00 -04:00
Simon Riggs	88f32b7ca2	Avoid assuming there will be only 3 states for synchronous_commit. Also avoid hardcoding the current default state by giving it the name "on" and replace with a meaningful name that reflects its behaviour. Coding only, no change in behaviour.	2011-04-04 23:23:13 +01:00
Robert Haas	240067b3b0	Merge synchronous_replication setting into synchronous_commit. This means one less thing to configure when setting up synchronous replication, and also avoids some ambiguity around what the behavior should be when the settings of these variables conflict. Fujii Masao, with additional hacking by me.	2011-04-04 16:25:52 -04:00
Simon Riggs	a8a8a3e096	Efficient transaction-controlled synchronous replication. If a standby is broadcasting reply messages and we have named one or more standbys in synchronous_standby_names then allow users who set synchronous_replication to wait for commit, which then provides strict data integrity guarantees. Design avoids sending and receiving transaction state information so minimises bookkeeping overheads. We synchronize with the highest priority standby that is connected and ready to synchronize. Other standbys can be defined to take over in case of standby failure. This version has very strict behaviour; more relaxed options may be added at a later date. Simon Riggs and Fujii Masao, with reviews by Yeb Havinga, Jaime Casanova, Heikki Linnakangas and Robert Haas, plus the assistance of many other design reviewers.	2011-03-06 22:49:16 +00:00
Tom Lane	a874fe7b4c	Refactor the executor's API to support data-modifying CTEs better. The originally committed patch for modifying CTEs didn't interact well with EXPLAIN, as noted by myself, and also had corner-case problems with triggers, as noted by Dean Rasheed. Those problems show it is really not practical for ExecutorEnd to call any user-defined code; so split the cleanup duties out into a new function ExecutorFinish, which must be called between the last ExecutorRun call and ExecutorEnd. Some Asserts have been added to these functions to help verify correct usage. It is no longer necessary for callers of the executor to call AfterTriggerBeginQuery/AfterTriggerEndQuery for themselves, as this is now done by ExecutorStart/ExecutorFinish respectively. If you really need to suppress that and do it for yourself, pass EXEC_FLAG_SKIP_TRIGGERS to ExecutorStart. Also, refactor portal commit processing to allow for the possibility that PortalDrop will invoke user-defined code. I think this is not actually necessary just yet, since the portal-execution-strategy logic forces any non-pure-SELECT query to be run to completion before we will consider committing. But it seems like good future-proofing.	2011-02-27 13:44:12 -05:00
Heikki Linnakangas	dafaa3efb7	Implement genuine serializable isolation level. Until now, our Serializable mode has in fact been what's called Snapshot Isolation, which allows some anomalies that could not occur in any serialized ordering of the transactions. This patch fixes that using a method called Serializable Snapshot Isolation, based on research papers by Michael J. Cahill (see README-SSI for full references). In Serializable Snapshot Isolation, transactions run like they do in Snapshot Isolation, but a predicate lock manager observes the reads and writes performed and aborts transactions if it detects that an anomaly might occur. This method produces some false positives, ie. it sometimes aborts transactions even though there is no anomaly. To track reads we implement predicate locking, see storage/lmgr/predicate.c. Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared memory is finite, so when a transaction takes many tuple-level locks on a page, the locks are promoted to a single page-level lock, and further to a single relation level lock if necessary. To lock key values with no matching tuple, a sequential scan always takes a relation-level lock, and an index scan acquires a page-level lock that covers the search key, whether or not there are any matching keys at the moment. A predicate lock doesn't conflict with any regular locks or with other predicate locks in the normal sense. They're only used by the predicate lock manager to detect the danger of anomalies. Only serializable transactions participate in predicate locking, so there should be no extra overhead for other transactions. Predicate locks can't be released at commit, but must be remembered until all the transactions that overlapped with it have completed. That means that we need to remember an unbounded amount of predicate locks, so we apply a lossy but conservative method of tracking locks for committed transactions. If we run short of shared memory, we overflow to a new "pg_serial" SLRU pool. We don't currently allow Serializable transactions in Hot Standby mode. That would be hard, because even read-only transactions can cause anomalies that wouldn't otherwise occur. Serializable isolation mode now means the new fully serializable level. Repeatable Read gives you the old Snapshot Isolation level that we have always had. Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and Anssi Kääriäinen	2011-02-08 00:09:08 +02:00
Robert Haas	dc8a14311a	Update comments in RecordTransactionCommit() to mention unlogged tables.	2011-01-03 10:29:22 -05:00
Bruce Momjian	5d950e3b0c	Stamp copyrights for year 2011.	2011-01-01 13:18:15 -05:00
Robert Haas	f6a0863e3c	Allow transactions that don't write WAL to commit asynchronously. This case can arise if a transaction has written data, but only to temporary tables. Loss of the commit record in case of a crash won't matter, because the temporary tables will be lost anyway. Reviewed by Heikki Linnakangas and Simon Riggs.	2010-12-20 12:59:33 -05:00
Simon Riggs	e620ee35b2	Optimize commit_siblings in two ways to improve group commit. First, avoid scanning the whole ProcArray once we know there are at least commit_siblings active; second, skip the check altogether if commit_siblings = 0. Greg Smith	2010-12-08 18:48:03 +00:00
Alvaro Herrera	17a16663d0	Remove AtStart_Cache() call in CommandCounterIncrement(). This call was present in the aboriginal code from Berkeley, and has never been touched; it may very well be that it was there to mask effects of bugs in other places and it may no longer be necessary. The removal has been foreseen in a code comment since 2007; this seems to be a good time to test this hypothesis.	2010-10-20 11:33:57 -03:00
Magnus Hagander	9f2e211386	Remove cvs keywords from all files.	2010-09-20 22:08:53 +02:00
Heikki Linnakangas	2746e5f21d	Introduce latches. A latch is a boolean variable, with the capability to wait until it is set. Latches can be used to reliably wait until a signal arrives, which is hard otherwise because signals don't interrupt select() on some platforms, and even when they do, there are race conditions. On Unix, latches use the so-called self-pipe trick under the covers to implement the sleep until the latch is set, without race conditions. On Windows, Windows events are used. Use the new latch abstraction to sleep in walsender, so that as soon as a transaction finishes, walsender is woken up to immediately send the WAL to the standby. This reduces the latency between master and standby, which is good. Preliminary work by Fujii Masao. The latch implementation is by me, with helpful comments from many people.	2010-09-11 15:48:04 +00:00
Robert Haas	debcec7dc3	Include the backend ID in the relpath of temporary relations. This allows us to reliably remove all leftover temporary relation files on cluster startup without reference to system catalogs or WAL; therefore, we no longer include temporary relations in XLOG_XACT_COMMIT and XLOG_XACT_ABORT WAL records. Since these changes require including a backend ID in each SharedInvalSmgrMsg, the size of the SharedInvalidationMessage.id field has been reduced from two bytes to one, and the maximum number of connections has been reduced from INT_MAX / 4 to 2^23-1. It would be possible to remove these restrictions by increasing the size of SharedInvalidationMessage by 4 bytes, but right now that doesn't seem like a good trade-off. Review by Jaime Casanova and Tom Lane.	2010-08-13 20:10:54 +00:00
Robert Haas	95ef7cd40d	Make RecordTransactionCommit() respect wal_level. Since the only purpose of WAL-logging SharedInvalidationMessages is to support Hot Standby operation, they needn't be included when wal_level < hot_standby. Back-patch to 9.0. Review by Heikki Linnakangas and Fujii Masao.	2010-08-13 15:42:21 +00:00
Robert Haas	30c22eb8fc	Correct sundry errors in Hot Standby-related comments. Fujii Masao	2010-08-12 23:24:54 +00:00
Simon Riggs	5b8bd0529e	Rename asyncCommitLSN to asyncXactLSN to reflect changed role in 9.0. Transaction aborts now record their LSN to avoid corner case behaviour in SR/HS, hence change of name of variables and functions. As pointed out by Fujii Masao. Cosmetic changes only.	2010-07-29 22:27:27 +00:00
Robert Haas	7be8946c78	Avoid deep recursion when assigning XIDs to multiple levels of subxacts. Backpatch to 8.0. Andres Freund, with cleanup and adjustment for older branches by me.	2010-07-23 00:43:00 +00:00
Bruce Momjian	239d769e7e	pgindent run for 9.0, second run	2010-07-06 19:19:02 +00:00
Bruce Momjian	b57ddccf05	Add C comment about why synchronous_commit=off behavior can lose committed transactions in a postmaster crash.	2010-06-29 18:44:58 +00:00
Simon Riggs	463f151a23	Ensure that top level aborts call XLogSetAsyncCommit(). Not doing so simply leads to data waiting in wal_buffers which then causes later commits to potentially do emergency writes and for all forms of replication to be potentially delayed without need or benefit. Issue pointed out exactly by Fujii Masao, following bug report by Robert Haas on a separate though related topic.	2010-05-13 11:39:30 +00:00
Simon Riggs	8431e296ea	Cleanup initialization of Hot Standby. Clarify working with reanalysis of requirements and documentation on LogStandbySnapshot(). Fixes two minor bugs reported by Tom Lane that would lead to an incorrect snapshot after transaction wraparound. Also fix two other problems discovered that would give incorrect snapshots in certain cases. ProcArrayApplyRecoveryInfo() substantially rewritten. Some minor refactoring of xact_redo_apply() and ExpireTreeKnownAssignedTransactionIds().	2010-05-13 11:15:38 +00:00
Bruce Momjian	65e806cba1	pgindent run for 9.0	2010-02-26 02:01:40 +00:00
Tom Lane	05d8a561ff	Clean up handling of XactReadOnly and RecoveryInProgress checks. Add some checks that seem logically necessary, in particular let's make real sure that HS slave sessions cannot create temp tables. (If they did they would think that temp tables belonging to the master's session with the same BackendId were theirs. We must not allow myTempNamespace to become set in a slave session.) Change setval() and nextval() so that they are only allowed on temp sequences in a read-only transaction. This seems consistent with what we allow for table modifications in read-only transactions. Since an HS slave can't have a temp sequence, this also provides a nicer cure for the setval PANIC reported by Erik Rijkers. Make the error messages more uniform, and have them mention the specific command being complained of. This seems worth the trifling amount of extra code, since people are likely to see such messages a lot more than before.	2010-02-20 21:24:02 +00:00
Tom Lane	50a90fac40	Stamp HEAD as 9.0devel, and update various places that were referring to 8.5 (hope I got 'em all). Per discussion, this release will be 9.0 not 8.5.	2010-02-17 04:19:41 +00:00
Tom Lane	d1e027221d	Replace the pg_listener-based LISTEN/NOTIFY mechanism with an in-memory queue. In addition, add support for a "payload" string to be passed along with each notify event. This implementation should be significantly more efficient than the old one, and is also more compatible with Hot Standby usage. There is not yet any facility for HS slaves to receive notifications generated on the master, although such a thing is possible in future. Joachim Wieland, reviewed by Jeff Davis; also hacked on by me.	2010-02-16 22:34:57 +00:00
Simon Riggs	dd428c79a4	Fix relcache init file invalidation during Hot Standby for the case where a database has a non-default tablespaceid. Pass thru MyDatabaseId and MyDatabaseTableSpace to allow file path to be re-created in standby and correct invalidation to take place in all cases. Update and rework xact_commit_desc() debug messages. Bug report from Tom by code inspection. Fix by me.	2010-02-13 16:15:48 +00:00
Tom Lane	0a469c8769	Remove old-style VACUUM FULL (which was known for a little while as VACUUM FULL INPLACE), along with a boatload of subsidiary code and complexity. Per discussion, the use case for this method of vacuuming is no longer large enough to justify maintaining it; not to mention that we don't wish to invest the work that would be needed to make it play nicely with Hot Standby. Aside from the code directly related to old-style VACUUM FULL, this commit removes support for certain WAL record types that could only be generated within VACUUM FULL, redirect-pointer removal in heap_page_prune, and nontransactional generation of cache invalidation sinval messages (the last being the sticking point for Hot Standby). We still have to retain all code that copes with finding HEAP_MOVED_OFF and HEAP_MOVED_IN flag bits on existing tuples. This can't be removed as long as we want to support in-place update from pre-9.0 databases.	2010-02-08 04:33:55 +00:00
Tom Lane	b9b8831ad6	Create a "relation mapping" infrastructure to support changing the relfilenodes of shared or nailed system catalogs. This has two key benefits: * The new CLUSTER-based VACUUM FULL can be applied safely to all catalogs. * We no longer have to use an unsafe reindex-in-place approach for reindexing shared catalogs. CLUSTER on nailed catalogs now works too, although I left it disabled on shared catalogs because the resulting pg_index.indisclustered update would only be visible in one database. Since reindexing shared system catalogs is now fully transactional and crash-safe, the former special cases in REINDEX behavior have been removed; shared catalogs are treated the same as non-shared. This commit does not do anything about the recently-discussed problem of deadlocks between VACUUM FULL/CLUSTER on a system catalog and other concurrent queries; will address that in a separate patch. As a stopgap, parallel_schedule has been tweaked to run vacuum.sql by itself, to avoid such failures during the regression tests.	2010-02-07 20:48:13 +00:00
Tom Lane	875353b99f	Fix assorted core dumps and Assert failures that could occur during AbortTransaction or AbortSubTransaction, when trying to clean up after an error that prevented (sub)transaction start from completing: * access to TopTransactionResourceOwner that might not exist * assert failure in AtEOXact_GUC, if AtStart_GUC not called yet * assert failure or core dump in AfterTriggerEndSubXact, if AfterTriggerBeginSubXact not called yet Per testing by injecting elog(ERROR) at successive steps in StartTransaction and StartSubTransaction. It's not clear whether all of these cases could really occur in the field, but at least one of them is easily exposed by simple stress testing, as per my accidental discovery yesterday.	2010-01-24 21:49:17 +00:00
Simon Riggs	a8ce974cdd	Teach standby conflict resolution to use SIGUSR1 Conflict reason is passed through directly to the backend, so we can take decisions about the effect of the conflict based upon the local state. No specific changes, as yet, though this prepares for later work. CancelVirtualTransaction() sends signals while holding ProcArrayLock. Introduce errdetail_abort() to give message detail explaining that the abort was caused by conflict processing. Remove CONFLICT_MODE states in favour of using PROCSIG_RECOVERY_CONFLICT states directly, for clarity.	2010-01-16 10:05:59 +00:00
Simon Riggs	42edbd16fb	During Hot Standby, set DatabasePath correctly during relcache init file deletion, so that we attempt to unlink the correct filepath. unlink() errors are ignorable there, so lack of a DatabasePath initialization step did not cause visible problems until a related bug showed up on Solaris. Code refactored from xact_redo_commit() to ProcessCommittedInvalidationMessages() in inval.c. Recovery may replay shared invalidation messages for many databases, so we cannot SetDatabasePath() once as we do in normal backends. Read the databaseid from the shared invalidation messages, then set DatabasePath temporarily before calling RelationCacheInitFileInvalidate(). Problem report by Robert Treat, analysis and fix by me.	2010-01-09 16:49:27 +00:00
Bruce Momjian	0239800893	Update copyright for the year 2010.	2010-01-02 16:58:17 +00:00
Simon Riggs	efc16ea520	Allow read only connections during recovery, known as Hot Standby. Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record. New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far. This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required. Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit. Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.	2009-12-19 01:32:45 +00:00
Tom Lane	62aba76568	Prevent indirect security attacks via changing session-local state within an allegedly immutable index function. It was previously recognized that we had to prevent such a function from executing SET/RESET ROLE/SESSION AUTHORIZATION, or it could trivially obtain the privileges of the session user. However, since there is in general no privilege checking for changes of session-local state, it is also possible for such a function to change settings in a way that might subvert later operations in the same session. Examples include changing search_path to cause an unexpected function to be called, or replacing an existing prepared statement with another one that will execute a function of the attacker's choosing. The present patch secures VACUUM, ANALYZE, and CREATE INDEX/REINDEX against these threats, which are the same places previously deemed to need protection against the SET ROLE issue. GUC changes are still allowed, since there are many useful cases for that, but we prevent security problems by forcing a rollback of any GUC change after completing the operation. Other cases are handled by throwing an error if any change is attempted; these include temp table creation, closing a cursor, and creating or deleting a prepared statement. (In 7.4, the infrastructure to roll back GUC changes doesn't exist, so we settle for rejecting changes of "search_path" in these contexts.) Original report and patch by Gurjeet Singh, additional analysis by Tom Lane. Security: CVE-2009-4136	2009-12-09 21:57:51 +00:00
Heikki Linnakangas	cd87b6f8a5	Fix an old bug in multixact and two-phase commit. Prepared transactions can be part of multixacts, so allocate a slot for each prepared transaction in the "oldest member" array in multixact.c. On PREPARE TRANSACTION, transfer the oldest member value from the current backends slot to the prepared xact slot. Also save and recover the value from the 2pc state file. The symptom of the bug was that after a transaction prepared, a shared lock still held by the prepared transaction was sometimes ignored by other transactions. Fix back to 8.1, where both 2PC and multixact were introduced.	2009-11-23 09:58:36 +00:00
Alvaro Herrera	a8bb8eb583	Remove flatfiles.c, which is now obsolete. Recent commits have removed the various uses it was supporting. It was a performance bottleneck, according to bug report #4919 by Lauris Ulmanis; seems it slowed down user creation after a billion users.	2009-09-01 02:54:52 +00:00
Bruce Momjian	d747140279	8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef list provided by Andrew.	2009-06-11 14:49:15 +00:00
Tom Lane	23543c732b	Rewrite xml.c's memory management (yet again). Give up on the idea of redirecting libxml's allocations into a Postgres context. Instead, just let it use malloc directly, and add PG_TRY blocks as needed to be sure we release libxml data structures in error recovery code paths. This is ugly but seems much more likely to play nicely with third-party uses of libxml, as seen in recent trouble reports about using Perl XML facilities in pl/perl and bug #4774 about contrib/xml2. I left the code for allocation redirection in place, but it's only built/used if you #define USE_LIBXMLCONTEXT. This is because I found it useful to corral libxml's allocations in a palloc context when hunting for libxml memory leaks, and we're surely going to have more of those in the future with this type of approach. But we don't want it turned on in a normal build because it breaks exactly what we need to fix. I have not re-indented most of the code sections that are now wrapped by PG_TRY(); that's for ease of review. pg_indent will fix it. This is a pre-existing bug in 8.3, but I don't dare back-patch this change until it's gotten a reasonable amount of field testing.	2009-05-13 20:27:17 +00:00
Heikki Linnakangas	b2a667b9ee	Add a new option to RestoreBkpBlocks() to indicate if a cleanup lock should be used instead of the normal exclusive lock, and make WAL redo functions responsible for calling RestoreBkpBlocks(). They know better what kind of a lock they need. At the moment, this just moves things around with no functional change, but makes the hot standby patch that's under review cleaner.	2009-01-20 18:59:37 +00:00
Bruce Momjian	511db38ace	Update copyright for 2009.	2009-01-01 17:24:05 +00:00
Alvaro Herrera	7b640b0345	Fix a couple of snapshot management bugs in the new ResourceOwner world: non-writable large objects need to have their snapshots registered on the transaction resowner, not the current portal's, because it must persist until the large object is closed (which the portal does not). Also, ensure that the serializable snapshot is recorded by the transaction resource owner too, even when a subtransaction has changed the current resource owner before serializable is taken. Per bug reports from Pavan Deolasee.	2008-12-04 14:51:02 +00:00
Heikki Linnakangas	3396000684	Rethink the way FSM truncation works. Instead of WAL-logging FSM truncations in FSM code, call FreeSpaceMapTruncateRel from smgr_redo. To make that cleaner from modularity point of view, move the WAL-logging one level up to RelationTruncate, and move RelationTruncate and all the related WAL-logging to new src/backend/catalog/storage.c file. Introduce new RelationCreateStorage and RelationDropStorage functions that are used instead of calling smgrcreate/smgrscheduleunlink directly. Move the pending rel deletion stuff from smgrcreate/smgrscheduleunlink to the new functions. This leaves smgr.c as a thin wrapper around md.c; all the transactional stuff is now in storage.c. This will make it easier to add new forks with similar truncation logic, like the visibility map.	2008-11-19 10:34:52 +00:00
Tom Lane	cad3a26a95	Fix sloppy omission of now-required #include's.	2008-11-11 14:17:02 +00:00
Heikki Linnakangas	7e8b0b9ab1	Change error messages to print the physical path, like "base/11517/3767_fsm", instead of symbolic names like "1663/11517/3767/1", per Alvaro's suggestion. I didn't change the messages in the higher-level index, heap and FSM routines, though, where the fork is implicit.	2008-11-11 13:19:16 +00:00
Alvaro Herrera	06da3c570f	Rework subtransaction commit protocol for hot standby. This patch eliminates the marking of subtransactions as SUBCOMMITTED in pg_clog during their commit; instead they remain in-progress until main transaction commit. At main transaction commit, the commit protocol is atomic-by-page instead of one transaction at a time. To avoid a race condition with some subtransactions appearing committed before others in the case where they span more than one pg_clog page, we conserve the logic that marks them subcommitted before marking the parent committed. Simon Riggs with minor help from me	2008-10-20 19:18:18 +00:00
Heikki Linnakangas	3f0e808c4a	Introduce the concept of relation forks. An smgr relation can now consist of multiple forks, and each fork can be created and grown separately. The bulk of this patch is about changing the smgr API to include an extra ForkNumber argument in every smgr function. Also, smgrscheduleunlink and smgrdounlink no longer implicitly call smgrclose, because other forks might still exist after unlinking one. The callers of those functions have been modified to call smgrclose instead. This patch in itself doesn't have any user-visible effect, but provides the infrastructure needed for upcoming patches. The additional forks envisioned are a rewritten FSM implementation that doesn't rely on a fixed-size shared memory block, and a visibility map to allow skipping portions of a table in VACUUM that have no dead tuples.	2008-08-11 11:05:11 +00:00
Alvaro Herrera	5da9da71c4	Improve snapshot manager by keeping explicit track of snapshots. There are two ways to track a snapshot: there's the "registered" list, which is used for arbitrary long-lived snapshots; and there's the "active stack", which is used for the snapshot that is considered "active" at any time. This also allows users of snapshots to stop worrying about snapshot memory allocation and freeing, and about using PG_TRY blocks around ActiveSnapshot assignment. This is all done automatically now. As a consequence, this allows us to reset MyProc->xmin when there are no more snapshots registered in the current backend, reducing the impact that long-running transactions have on VACUUM.	2008-05-12 20:02:02 +00:00
Alvaro Herrera	f8c4d7db60	Restructure some header files a bit, in particular heapam.h, by removing some unnecessary #include lines in it. Also, move some tuple routine prototypes and macros to htup.h, which allows removal of heapam.h inclusion from some .c files. For this to work, a new header file access/sysattr.h needed to be created, initially containing attribute numbers of system columns, for pg_dump usage. While at it, make contrib ltree, intarray and hstore header files more consistent with our header style.	2008-05-12 00:00:54 +00:00
Alvaro Herrera	78f02ca1f5	Rename snapmgmt.c/h to snapmgr.c/h, for consistency with other files. Per complaint from Tom Lane.	2008-03-26 18:48:59 +00:00
Alvaro Herrera	d43b085d57	Separate snapshot management code from tuple visibility code, create a snapmgmt.c file for the former. The header files have also been reorganized in three parts: the most basic snapshot definitions are now in a new file snapshot.h, and the also new snapmgmt.h keeps the definitions for snapmgmt.c. tqual.h has been reduced to the bare minimum. This patch is just a first step towards managing live snapshots within a transaction; there is no functionality change. Per my proposal to pgsql-patches on 20080318191940.GB27458@alvh.no-ip.org and subsequent discussion.	2008-03-26 16:20:48 +00:00
Peter Eisentraut	a7b7b07af3	Enable probes to work with Mac OS X Leopard and other OSes that will support DTrace in the future. Switch from using DTRACE_PROBEn macros to the dynamically generated macros. Use "dtrace -h" to create a header file that contains the dynamically generated macros to be used in the source code instead of the DTRACE_PROBEn macros. A dummy header file is generated for builds without DTrace support. Author: Robert Lor <Robert.Lor@sun.com>	2008-03-17 19:44:41 +00:00
Tom Lane	32846f8152	Fix TransactionIdIsCurrentTransactionId() to use binary search instead of linear search when checking child-transaction XIDs. This makes for an important speedup in transactions that have large numbers of children, as in a recent example from Craig Ringer. We can also get rid of an ugly kluge that represented lists of TransactionIds as lists of OIDs. Heikki Linnakangas	2008-03-17 02:18:55 +00:00
Tom Lane	7d6e6e2e97	Fix PREPARE TRANSACTION to reject the case where the transaction has dropped a temporary table; we can't support that because there's no way to clean up the source backend's internal state if the eventual COMMIT PREPARED is done by another backend. This was checked correctly in 8.1 but I broke it in 8.2 :-(. Patch by Heikki Linnakangas, original trouble report by John Smith.	2008-03-04 19:54:06 +00:00
Tom Lane	ac12412ede	Revise memory management for libxml calls. Instead of keeping libxml's data in whichever context happens to be current during a call of an xml.c function, use a dedicated context that will not go away until we explicitly delete it (which we do at transaction end or subtransaction abort). This makes recovery after an error much simpler --- we don't have to individually delete the data structures created by libxml. Also, we need to initialize and cleanup libxml only once per transaction (if there's no error) instead of once per function call, so it should be a bit faster. We'll need to keep an eye out for intra-transaction memory leaks, though. Alvaro and Tom.	2008-01-15 18:57:00 +00:00
Tom Lane	eedb068c0a	Make standard maintenance operations (including VACUUM, ANALYZE, REINDEX, and CLUSTER) execute as the table owner rather than the calling user, using the same privilege-switching mechanism already used for SECURITY DEFINER functions. The purpose of this change is to ensure that user-defined functions used in index definitions cannot acquire the privileges of a superuser account that is performing routine maintenance. While a function used in an index is supposed to be IMMUTABLE and thus not able to do anything very interesting, there are several easy ways around that restriction; and even if we could plug them all, there would remain a risk of reading sensitive information and broadcasting it through a covert channel such as CPU usage. To prevent bypassing this security measure, execution of SET SESSION AUTHORIZATION and SET ROLE is now forbidden within a SECURITY DEFINER context. Thanks to Itagaki Takahiro for reporting this vulnerability. Security: CVE-2007-6600	2008-01-03 21:23:15 +00:00
Bruce Momjian	9098ab9e32	Update copyrights in source tree to 2008.	2008-01-01 19:46:01 +00:00
Tom Lane	895a94de6d	Avoid incrementing the CommandCounter when CommandCounterIncrement is called but no database changes have been made since the last CommandCounterIncrement. This should result in a significant improvement in the number of "commands" that can typically be performed within a transaction before hitting the 2^32 CommandId size limit. In particular this buys back (and more) the possible adverse consequences of my previous patch to fix plan caching behavior. The implementation requires tracking whether the current CommandCounter value has been "used" to mark any tuples. CommandCounter values stored into snapshots are presumed not to be used for this purpose. This requires some small executor changes, since the executor used to conflate the curcid of the snapshot it was using with the command ID to mark output tuples with. Separating these concepts allows some small simplifications in executor APIs. Something for the TODO list: look into having CommandCounterIncrement not do AcceptInvalidationMessages. It seems fairly bogus to be doing it there, but exactly where to do it instead isn't clear, and I'm disinclined to mess with asynchronous behavior during late beta.	2007-11-30 21:22:54 +00:00
Bruce Momjian	fdf5a5efb7	pgindent run for 8.3.	2007-11-15 21:14:46 +00:00
Bruce Momjian	82748bc253	Reduce error level of ROLLBACK outside a transaction from WARNING to NOTICE.	2007-11-10 14:36:44 +00:00
Tom Lane	ef4d38c86c	Rename recently-added pg_stat_activity column from txn_start to xact_start, for consistency with other column names such as in pg_stat_database.	2007-09-11 03:28:05 +00:00
Tom Lane	6bd4f401b0	Replace the former method of determining snapshot xmax --- to wit, calling ReadNewTransactionId from GetSnapshotData --- with a "latestCompletedXid" variable that is updated during transaction commit or abort. Since latestCompletedXid is written only in places that had to lock ProcArrayLock exclusively anyway, and is read only in places that had to lock ProcArrayLock shared anyway, it adds no new locking requirements to the system despite being cluster-wide. Moreover, removing ReadNewTransactionId from snapshot acquisition eliminates the need to take both XidGenLock and ProcArrayLock at the same time. Since XidGenLock is sometimes held across I/O this can be a significant win. Some preliminary benchmarking suggested that this patch has no effect on average throughput but can significantly improve the worst-case transaction times seen in pgbench. Concept by Florian Pflug, implementation by Tom Lane.	2007-09-08 20:31:15 +00:00
Tom Lane	0a51e7073c	Don't take ProcArrayLock while exiting a transaction that has no XID; there is no need for serialization against snapshot-taking because the xact doesn't affect anyone else's snapshot anyway. Per discussion. Also, move various info about the interlocking of transactions and snapshots out of code comments and into a hopefully-more-cohesive discussion in access/transam/README. Also, remove a couple of now-obsolete comments about having to force some WAL to be written to persuade RecordTransactionCommit to do its thing.	2007-09-07 20:59:26 +00:00
Tom Lane	295e63983d	Implement lazy XID allocation: transactions that do not modify any database rows will normally never obtain an XID at all. We already did things this way for subtransactions, but this patch extends the concept to top-level transactions. In applications where there are lots of short read-only transactions, this should improve performance noticeably; not so much from removal of the actual XID-assignments, as from reduction of overhead that's driven by the rate of XID consumption. We add a concept of a "virtual transaction ID" so that active transactions can be uniquely identified even if they don't have a regular XID. This is a much lighter-weight concept: uniqueness of VXIDs is only guaranteed over the short term, and no on-disk record is made about them. Florian Pflug, with some editorialization by Tom.	2007-09-05 18:10:48 +00:00
Tom Lane	2abae34a2e	Implement function-local GUC parameter settings, as per recent discussion. There are still some loose ends: I didn't do anything about the SET FROM CURRENT idea yet, and it's not real clear whether we are happy with the interaction of SET LOCAL with function-local settings. The documentation is a bit spartan, too.	2007-09-03 00:39:26 +00:00

396 Commits