postgresql

mirror of https://git.postgresql.org/git/postgresql.git synced 2024-09-06 19:49:21 +02:00

Author	SHA1	Message	Date
Heikki Linnakangas	5762a4d909	Inherit max_safe_fds to child processes in EXEC_BACKEND mode. Postmaster sets max_safe_fds by testing how many open file descriptors it can open, and that is normally inherited by all child processes at fork(). Not so on EXEC_BACKEND, ie. Windows, however. Because of that, we effectively ignored max_files_per_process on Windows, and always assumed a conservative default of 32 simultaneous open files. That could have an impact on performance, if you need to access a lot of different files in a query. After this patch, the value is passed to child processes by save/restore_backend_variables() among many other global variables. It has been like this forever, but given the lack of complaints about it, I'm not backpatching this.	2012-03-29 08:19:11 +03:00
Robert Haas	40b9b95769	New GUC, track_iotiming, to track I/O timings. Currently, the only way to see the numbers this gathers is via EXPLAIN (ANALYZE, BUFFERS), but the plan is to add visibility through the stats collector and pg_stat_statements in subsequent patches. Ants Aasma, reviewed by Greg Smith, with some further changes by me.	2012-03-27 14:55:02 -04:00
Peter Eisentraut	0e85abd658	Clean up compiler warnings from unused variables with asserts disabled. For those variables only used when asserts are enabled, use a new macro PG_USED_FOR_ASSERTS_ONLY, which expands to __attribute__((unused)) when asserts are not enabled.	2012-03-21 23:33:10 +02:00
Robert Haas	07d1edb954	Extend object access hook framework to support arguments, and DROP. This allows loadable modules to get control at drop time, perhaps for the purpose of performing additional security checks or to log the event. The initial purpose of this code is to support sepgsql, but other applications should be possible as well. KaiGai Kohei, reviewed by me.	2012-03-09 14:34:56 -05:00
Heikki Linnakangas	d6a7271958	Correctly detect SSI conflicts of prepared transactions after crash. A prepared transaction can get new conflicts in and out after preparing, so we cannot rely on the in- and out-flags stored in the statefile at prepare- time. As a quick fix, make the conservative assumption that after a restart, all prepared transactions are considered to have both in- and out-conflicts. That can lead to unnecessary rollbacks after a crash, but that shouldn't be a big problem in practice; you don't want prepared transactions to hang around for a long time anyway. Dan Ports	2012-02-29 15:42:36 +02:00
Robert Haas	2254367435	Make EXPLAIN (BUFFERS) track blocks dirtied, as well as those written. Also expose the new counters through pg_stat_statements. Patch by me. Review by Fujii Masao and Greg Smith.	2012-02-22 20:33:05 -05:00
Heikki Linnakangas	1a01560cbb	Rename LWLockWaitUntilFree to LWLockAcquireOrWait. LWLockAcquireOrWait makes it more clear that the lock is acquired if it's free.	2012-02-08 09:17:13 +02:00
Heikki Linnakangas	15ad6f1510	When building with LWLOCK_STATS, initialize the stats in LWLockWaitUntilFree. If LWLockWaitUntilFree was called before the first LWLockAcquire call, you would either crash because of access to an uninitialized array or account the acquisition incorrectly. LWLockConditionalAcquire doesn't have this problem because it doesn't update the lwlock stats. In practice, this never happens because there is no codepath where you would call LWLockWaitUntilFree before LWLockAcquire after a new process is launched. But that's just accidental; there's no guarantee that that's always going to be true in the future. Spotted by Jeff Janes.	2012-02-07 10:11:54 +02:00
Tom Lane	c6d76d7c82	Add locking around WAL-replay modification of shared-memory variables. Originally, most of this code assumed that no Postgres backends could be running concurrently with it, and so no locking could be needed. That assumption fails in Hot Standby. While it's still true that Hot Standby backends should never change values like nextXid, they can examine them, and consistency is important in some cases such as when computing a snapshot. Therefore, prudence requires that WAL replay code obtain the relevant locks when modifying such variables, even though it can examine them without taking a lock. We were following that coding rule in some places but not all. This commit applies the coding rule uniformly to all updates of ShmemVariableCache and MultiXactState fields; a search of the replay routines did not find any other cases that seemed to be at risk. In addition, this commit fixes a longstanding thinko in replay of NEXTOID and checkpoint records: we tried to advance nextOid only if it was behind the value in the WAL record, but the comparison would draw the wrong conclusion if OID wraparound had occurred since the previous value. Better to just unconditionally assign the new value, since OID assignment shouldn't be happening during replay anyway. The additional locking seems to be more in the nature of future-proofing than fixing any live bug, so I am not going to back-patch it. The NEXTOID fix will be back-patched separately.	2012-02-06 12:34:10 -05:00
Tom Lane	2af72cefea	Add missing Assert and fix inaccurate elog message in standby_redo(). All other WAL redo routines either call RestoreBkpBlocks() or Assert that they haven't been passed any backup blocks. Make this one do likewise. Also, fix incorrect routine name in its failure message.	2012-02-04 22:32:35 -05:00
Heikki Linnakangas	82d4b262d9	Fix bug in the new wait-until-lwlock-is-free mechanism. If there was a wait-until-free process in the head of the wait queue, followed by an exclusive locker, the exclusive locker was not woken up as it should have been.	2012-01-31 00:09:30 +02:00
Heikki Linnakangas	9b38d46d9f	Make group commit more effective. When a backend needs to flush the WAL, and someone else is already flushing the WAL, wait until it releases the WALInsertLock and check if we still need to do the flush or if the other backend already did the work for us, before acquiring WALInsertLock. This helps group commit, because when the WAL flush finishes, all the backends that were waiting for it can be woken up in one go, and they can all concurrently observe that they're done, rather than waking them up one by one in a cascading fashion. This is based on a new LWLock function, LWLockWaitUntilFree(), which has peculiar semantics. If the lock is immediately free, it grabs the lock and returns true. If it's not free, it waits until it is released, but then returns false without grabbing the lock. This is used in XLogFlush(), so that when the lock is acquired, the backend flushes the WAL, but if it's not, the backend first checks the current flush location before retrying. Original patch and benchmarking by Peter Geoghegan and Simon Riggs, although this patch as committed ended up being very different from that.	2012-01-30 16:53:48 +02:00
Tom Lane	ad10853b30	Assorted comment fixes, mostly just typos, but some obsolete statements. YAMAMOTO Takashi	2012-01-29 19:23:56 -05:00
Magnus Hagander	672614cf21	Prevent logging "failed to stat file: success" for temp files. This was broken in commit `bc3347484a`, the addition of statistics counters for temp files. Reported by Thom Brown	2012-01-28 10:03:26 +01:00
Heikki Linnakangas	cf3fff6326	Initialize the new bgwriterLatch field properly. Peter Geoghegan	2012-01-27 18:25:32 +02:00
Heikki Linnakangas	6d90eaaa89	Make bgwriter sleep longer when it has no work to do, to save electricity. To make it wake up promptly when activity starts again, backends nudge it by setting a latch in MarkBufferDirty(). The latch is kept set while bgwriter is active, so there is very little overhead from that when the system is busy. It is only armed before going into longer sleep. Peter Geoghegan, with some changes by me.	2012-01-26 18:39:13 +02:00
Robert Haas	467ff207f5	Add missing #include, to suppress compiler warning.	2012-01-26 10:16:26 -05:00
Magnus Hagander	61cb8c5abb	Add deadlock counter to pg_stat_database. Adds a counter that tracks number of deadlocks that occurred in each database to pg_stat_database. Magnus Hagander, reviewed by Jaime Casanova	2012-01-26 15:58:19 +01:00
Robert Haas	0e549697d1	Classify DROP operations by whether or not they are user-initiated. This doesn't do anything useful just yet, but is intended as supporting infrastructure for allowing sepgsql to sensibly check DROP permissions. KaiGai Kohei and Robert Haas	2012-01-26 09:30:27 -05:00
Magnus Hagander	bc3347484a	Track temporary file count and size in pg_stat_database. Add counters for number and size of temporary files used for spill-to-disk queries for each database to the pg_stat_database view. Tomas Vondra, review by Magnus Hagander	2012-01-26 14:41:19 +01:00
Simon Riggs	c172b7b02e	Resolve timing issue with logging locks for Hot Standby. We log AccessExclusiveLocks for replay onto standby nodes, but because of timing issues on ProcArray it is possible to log a lock that is still held by a just committed transaction that is very soon to be removed. To avoid any timing issue we avoid applying locks made by transactions with InvalidXid. Simon Riggs, bug report Tom Lane, diagnosis Pavan Deolasee	2012-01-23 23:37:32 +00:00
Heikki Linnakangas	326b922e8b	Fix corner case in cleanup of transactions using SSI. When the only remaining active transactions are READ ONLY, we do a "partial cleanup" of committed transactions because certain types of conflicts aren't possible anymore. For committed r/w transactions, we release the SIREAD locks but keep the SERIALIZABLEXACT. However, for committed r/o transactions, we can go further and release the SERIALIZABLEXACT too. The problem was with the latter case: we were returning the SERIALIZABLEXACT to the free list without removing it from the finished list. The only real change in the patch is the SHMQueueDelete line, but I also reworked some of the surrounding code to make it obvious that r/o and r/w transactions are handled differently -- the existing code felt a bit too clever. Dan Ports	2012-01-18 17:57:33 +02:00
Robert Haas	33aaa139e6	Make the number of CLOG buffers adaptive, based on shared_buffers. Previously, this was hardcoded: we always had 8. Performance testing shows that isn't enough, especially on big SMP systems, so we allow it to scale up as high as 32 when there's adequate memory. On the flip side, when shared_buffers is very small, drop the number of CLOG buffers down to as little as 4, so that we can start the postmaster even when very little shared memory is available. Per extensive discussion with Simon Riggs, Tom Lane, and others on pgsql-hackers.	2012-01-06 14:32:18 -05:00
Robert Haas	7e4911b2ae	Fix variable confusion in BufferSync(). As noted by Heikki Linnakangas, the previous coding confused the "flags" variable with the "mask" variable. The effect of this appears to be that unlogged buffers would get written out at every checkpoint rather than only at shutdown time. Although that's arguably an acceptable failure mode, I'm back-patching this change, since it seems like a poor idea to rely on this happening to work.	2012-01-06 08:35:48 -05:00
Bruce Momjian	e126958c2e	Update copyright notices for year 2012.	2012-01-01 18:01:58 -05:00
Peter Eisentraut	d383c23f6f	Remove support for on_exit(). All supported platforms support the C89 standard function atexit() (SunOS 4 probably being the last one not to), and supporting both makes the code clumsy.	2011-12-27 20:57:59 +02:00
Tom Lane	d0024cd188	Avoid crashing when we have problems unlinking files post-commit. smgrdounlink takes care to not throw an ERROR if it fails to unlink something, but that caution was rendered useless by commit `3396000684`, which put an smgrexists call in front of it; smgrexists does throw error if anything looks funny, such as getting a permissions error from trying to open the file. If that happens post-commit, you get a PANIC, and what's worse the same logic appears in the WAL replay code, so the database even fails to restart. Restore the intended behavior by removing the smgrexists call --- it isn't accomplishing anything that we can't do better by adjusting mdunlink's ideas of whether it ought to warn about ENOENT or not. Per report from Joseph Shraibman of unrecoverable crash after trying to drop a table whose FSM fork had somehow gotten chmod'd to 000 permissions. Backpatch to 8.4, where the bogus coding was introduced.	2011-12-20 15:00:36 -05:00
Robert Haas	0d76b60db4	Various micro-optimizations for GetSnapshotData(). Heikki Linnakangas had the idea of rearranging GetSnapshotData to avoid checking for sub-XIDs when no top-level XID is present. This patch does that plus a bit of further, related rearrangement. Benchmarking shows a significant improvement on unlogged tables at higher concurrency levels, and a mostly indifferent result on permanent tables (which are presumably bottlenecked elsewhere). Most of the benefit seems to come from using the new NormalTransactionIdPrecedes() macro rather than the function call TransactionIdPrecedes().	2011-12-16 21:48:47 -05:00
Alvaro Herrera	9d3b502443	Improve logging of autovacuum I/O activity. This adds some I/O stats to the logging of autovacuum (when the operation takes long enough that log_autovacuum_min_duration causes it to be logged), so that it is easier to tune. Notably, it adds buffer I/O counts (hits, misses, dirtied) and read and write rate. Authors: Greg Smith and Noah Misch	2011-11-25 16:34:32 -03:00
Robert Haas	ed0b409d22	Move "hot" members of PGPROC into a separate PGXACT array. This speeds up snapshot-taking and reduces ProcArrayLock contention. Also, the PGPROC (and PGXACT) structures used by two-phase commit are now allocated as part of the main array, rather than in a separate array, and we keep ProcArray sorted in pointer order. These changes are intended to minimize the number of cache lines that must be pulled in to take a snapshot, and testing shows a substantial increase in performance on both read and write workloads at high concurrencies. Pavan Deolasee, Heikki Linnakangas, Robert Haas	2011-11-25 08:02:10 -05:00
Tom Lane	40d35036bb	Avoid floating-point underflow while tracking buffer allocation rate. When the system is idle for a while after activity, the "smoothed_alloc" state variable in BgBufferSync converges slowly to zero. With standard IEEE float arithmetic this results in several iterations with denormalized values, which causes kernel traps and annoying log messages on some poorly-designed platforms. There's no real need to track such small values of smoothed_alloc, so we can prevent the kernel traps by forcing it to zero as soon as it's too small to be interesting for our purposes. This issue is purely cosmetic, since the iterations don't happen fast enough for the kernel traps to pose any meaningful performance problem, but still it seems worth shutting up the log messages. The kernel log messages were previously reported by a number of people, but kudos to Greg Matthews for tracking down exactly where they were coming from.	2011-11-19 00:35:29 -05:00
Robert Haas	71b2b657c0	Revert removal of trace_userlocks, because userlocks aren't gone. This reverts commit `0180bd6180`. contrib/userlock is gone, but user-level locking still exists, and is exposed via the pg_advisory* family of functions.	2011-11-10 17:54:27 -05:00
Simon Riggs	86e3364899	Derive oldestActiveXid at correct time for Hot Standby. There was a timing window between when oldestActiveXid was derived and when it should have been derived that only shows itself under heavy load. Move code around to ensure correct timing of derivation. No change to StartupSUBTRANS() code, which is where this failed. Bug report by Chris Redekop	2011-11-02 08:54:56 +00:00
Simon Riggs	10b7c686e5	Start Hot Standby faster when initial snapshot is incomplete. If the initial snapshot had overflowed then we can start whenever the latest snapshot is empty, not overflowed or as we did already, start when the xmin on primary was higher than xmax of our starting snapshot, which proves we have full snapshot data. Bug report by Chris Redekop	2011-11-02 08:47:43 +00:00
Robert Haas	c2891b46a4	Initialize myProcLocks queues just once, at postmaster startup. In assert-enabled builds, we assert during the shutdown sequence that the queues have been properly emptied, and during process startup that we are inheriting empty queues. In non-assert enabled builds, we just save a few cycles.	2011-11-01 22:44:54 -04:00
Simon Riggs	806a2aee37	Split work of bgwriter between 2 processes: bgwriter and checkpointer. bgwriter is now a much less important process, responsible for page cleaning duties only. checkpointer is now responsible for checkpoints and so has a key role in shutdown. Later patches will correct doc references to the now old idea that bgwriter performs checkpoints. This has a beneficial effect on performance at high write rates, but it is mainly a refactoring to more easily allow changes for power reduction, by simplifying the previously tortuous code required to allow page cleaning and checkpointing to time-slice in the same process. Patch by me, review by Dickson Guedes	2011-11-01 17:14:47 +00:00
Robert Haas	53f1ca59b5	Allow hint bits to be set sooner for temporary and unlogged tables. We need not wait until the commit record is durably on disk, because in the event of a crash the page we're updating with hint bits will be gone anyway. Per off-list report from Heikki Linnakangas, this can significantly degrade the performance of unlogged tables; I was able to show a 2x speedup from this patch on a pgbench run with scale factor 15. In practice, this will mostly help small, heavily updated tables, because on larger tables you're unlikely to run into the same row again before the commit record makes it out to disk.	2011-10-28 17:08:09 -04:00
Heikki Linnakangas	cbf65509bb	Fix the number of lwlocks needed by the "fast path" lock patch. It needs one lock per backend or auxiliary process - the need for a lock for each aux process was not accounted for in NumLWLocks(). No-one noticed, because the three locks needed for the three aux processes fit into the few extra lwlocks we allocate for 3rd party modules that don't call RequestAddinLWLocks() (NUM_USER_DEFINED_LWLOCKS, 4 by default).	2011-10-27 22:39:58 +03:00
Tom Lane	bb446b689b	Support synchronization of snapshots through an export/import procedure. A transaction can export a snapshot with pg_export_snapshot(), and then others can import it with SET TRANSACTION SNAPSHOT. The data does not leave the server so there are no security issues. A snapshot can only be imported while the exporting transaction is still running, and there are some other restrictions. I'm not totally convinced that we've covered all the bases for SSI (true serializable) mode, but it works fine for lesser isolation modes. Joachim Wieland, reviewed by Marko Tiikkaja, and rather heavily modified by Tom Lane	2011-10-22 18:23:30 -04:00
Tom Lane	b4a0223d00	Simplify and improve ProcessStandbyHSFeedbackMessage logic. There's no need to clamp the standby's xmin to be greater than GetOldestXmin's result; if there were any such need this logic would be hopelessly inadequate anyway, because it fails to account for within-database versus cluster-wide values of GetOldestXmin. So get rid of that, and just rely on sanity-checking that the xmin is not wrapped around relative to the nextXid counter. Also, don't reset the walsender's xmin if the current feedback xmin is indeed out of range; that just creates more problems than we already had. Lastly, don't bother to take the ProcArrayLock; there's no need to do that to set xmin. Also improve the comments about this in GetOldestXmin itself.	2011-10-20 19:43:31 -04:00
Bruce Momjian	0180bd6180	Remove all "traces" of trace_userlocks, because userlocks were removed in PG 8.2.	2011-10-13 19:59:57 -04:00
Robert Haas	e76bcaba9c	Repair breakage in VirtualXactLock. I broke this in commit `84e3712677`. Report and fix by Fujii Masao.	2011-10-11 07:39:09 -04:00
Tom Lane	57eb009092	Allow snapshot references to still work during transaction abort. In REPEATABLE READ (nee SERIALIZABLE) mode, an attempt to do GetTransactionSnapshot() between AbortTransaction and CleanupTransaction failed, because GetTransactionSnapshot would recompute the transaction snapshot (which is already wrong, given the isolation mode) and then re-register it in the TopTransactionResourceOwner, leading to an Assert because the TopTransactionResourceOwner should be empty of resources after AbortTransaction. This is the root cause of bug #6218 from Yamamoto Takashi. While changing plancache.c to avoid requesting a snapshot when handling a ROLLBACK masks the problem, I think this is really a snapmgr.c bug: it's lower-level than the resource manager mechanism and should not be shutting itself down before we unwind resource manager resources. However, just postponing the release of the transaction snapshot until cleanup time didn't work because of the circular dependency with TopTransactionResourceOwner. Fix by managing the internal reference to that snapshot manually instead of depending on TopTransactionResourceOwner. This saves a few cycles as well as making the module layering more straightforward. predicate.c's dependencies on TopTransactionResourceOwner go away too. I think this is a longstanding bug, but there's no evidence that it's more than a latent bug, so it doesn't seem worth any risk of back-patching.	2011-09-26 22:25:28 -04:00
Robert Haas	0c8eda6258	Memory barrier support for PostgreSQL. This is not actually used anywhere yet, but it gets the basic infrastructure in place. It is fairly likely that there are bugs, and support for some important platforms may be missing, so we'll need to refine this as we go along.	2011-09-23 17:52:43 -04:00
Peter Eisentraut	1b81c2fe6e	Remove many -Wcast-qual warnings. This addresses only those cases that are easy to fix by adding or moving a const qualifier or removing an unnecessary cast. There are many more complicated cases remaining.	2011-09-11 21:54:32 +03:00
Tom Lane	a7801b62f2	Move Timestamp/Interval typedefs and basic macros into datatype/timestamp.h. As per my recent proposal, this refactors things so that these typedefs and macros are available in a header that can be included in frontend-ish code. I also changed various headers that were undesirably including utils/timestamp.h to include datatype/timestamp.h instead. Unsurprisingly, this showed that half the system was getting utils/timestamp.h by way of xlog.h. No actual code changes here, just header refactoring.	2011-09-09 13:23:41 -04:00
Tom Lane	1609797c25	Clean up the #include mess a little. walsender.h should depend on xlog.h, not vice versa. (Actually, the inclusion was circular until a couple hours ago, which was even sillier; but Bruce broke it in the expedient rather than logically correct direction.) Because of that poor decision, plus blind application of pgrminclude, we had a situation where half the system was depending on xlog.h to include such unrelated stuff as array.h and guc.h. Clean up the header inclusion, and manually revert a lot of what pgrminclude had done so things build again. This episode reinforces my feeling that pgrminclude should not be run without adult supervision. Inclusion changes in header files in particular need to be reviewed with great care. More generally, it'd be good if we had a clearer notion of module layering to dictate which headers can sanely include which others ... but that's a big task for another day.	2011-09-04 01:13:16 -04:00
Bruce Momjian	6416a82a62	Remove unnecessary #include references, per pgrminclude script.	2011-09-01 10:04:27 -04:00
Robert Haas	c01c25fbe5	Improve spinlock performance for HP-UX, ia64, non-gcc. At least on this architecture, it's very important to spin on a non-atomic instruction and only retry the atomic once it appears that it will succeed. To fix this, split TAS() into two macros: TAS(), for trying to grab the lock the first time, and TAS_SPIN(), for spinning until we get it. TAS_SPIN() defaults to same as TAS(), but we can override it when we know there's a better way. It's likely that some of the other cases in s_lock.h require similar treatment, but this is the only one we've got conclusive evidence for at present.	2011-08-29 10:05:48 -04:00
Bruce Momjian	f261deb4b4	Add missing includes after pgrminclude run.	2011-08-26 18:15:14 -04:00
Robert Haas	7488936478	Typo fix.	2011-08-22 12:16:27 -04:00
Robert Haas	24bf1552f6	Remove obsolete README file. Perhaps we ought to add some other kind of documentation here instead, but for now let's get rid of this woefully obsolete description of the sinval machinery.	2011-08-18 09:49:41 -04:00
Peter Eisentraut	e5475a80d2	Add "Reason code" prefix to internal SSI error messages. This makes it clearer that the error message is perhaps not supposed to be understood by users, and it also makes it somewhat clearer that it was not accidentally omitted from translation. Idea from Heikki Linnakangas, except that we don't mark "Reason code" for translation at this point, because that would make the implementation too cumbersome.	2011-08-15 15:20:16 +03:00
Tom Lane	4dab3d5ae1	Change the autovacuum launcher to use WaitLatch instead of a poll loop. In pursuit of this (and with the expectation that WaitLatch will be needed in more places), convert the latch field that was already added to PGPROC for sync rep into a generic latch that is activated for all PGPROC-owning processes, and change many of the standard backend signal handlers to set that latch when a signal happens. This will allow WaitLatch callers to be wakened properly by these signals. In passing, fix a whole bunch of signal handlers that had been hacked to do things that might change errno, without adding the necessary save/restore logic for errno. Also make some minor fixes in unix_latch.c, and clean up bizarre and unsafe scheme for disowning the process's latch. Much of this has to be back-patched into 9.1. Peter Geoghegan, with additional work by Tom	2011-08-10 12:22:21 -04:00
Tom Lane	4e15a4db5e	Documentation improvement and minor code cleanups for the latch facility. Improve the documentation around weak-memory-ordering risks, and do a pass of general editorialization on the comments in the latch code. Make the Windows latch code more like the Unix latch code where feasible; in particular provide the same Assert checks in both implementations. Fix poorly-placed WaitLatch call in syncrep.c. This patch resolves, for the moment, concerns around weak-memory-ordering bugs in latch-related code: we have documented the restrictions and checked that existing calls meet them. In 9.2 I hope that we will install suitable memory barrier instructions in SetLatch/ResetLatch, so that their callers don't need to be quite so careful.	2011-08-09 15:30:45 -04:00
Robert Haas	84e3712677	Create VXID locks "lazily" in the main lock table. Instead of entering them on transaction startup, we materialize them only when someone wants to wait, which will occur only during CREATE INDEX CONCURRENTLY. In Hot Standby mode, the startup process must also be able to probe for conflicting VXID locks, but the lock need never be fully materialized, because the startup process does not use the normal lock wait mechanism. Since most VXID locks never need to touch the lock manager partition locks, this can significantly reduce blocking contention on read-heavy workloads. Patch by me. Review by Jeff Davis.	2011-08-04 12:38:33 -04:00
Tom Lane	ac36e6f71f	Move CheckRecoveryConflictDeadlock() call to a safer place. This kluge was inserted in a spot apparently chosen at random: the lock manager's state is not yet fully set up for the wait, and in particular LockWaitCancel hasn't been armed by setting lockAwaited, so the ProcLock will not get cleaned up if the ereport is thrown. This seems to not cause any observable problem in trivial test cases, because LockReleaseAll will silently clean up the debris; but I was able to cause failures with tests involving subtransactions. Fixes breakage induced by commit `c85c941470`. Back-patch to all affected branches.	2011-08-02 15:16:29 -04:00
Tom Lane	2e53bd5517	Fix incorrect initialization of ProcGlobal->startupBufferPinWaitBufId. It was initialized in the wrong place and to the wrong value. With bad luck this could result in incorrect query-cancellation failures in hot standby sessions, should a HS backend be holding pin on buffer number 1 while trying to acquire a lock.	2011-08-02 13:23:52 -04:00
Robert Haas	85b436f7b1	Minor stylistic corrections.	2011-08-01 08:24:45 -04:00
Robert Haas	b4fbe392f8	Reduce sinval synchronization overhead. Testing shows that the overhead of acquiring and releasing SInvalReadLock and msgNumLock on high-core count boxes can waste a lot of CPU time and hurt performance. This patch adds a per-backend flag that allows us to skip all that locking in most cases. Further testing shows that this improves performance even when sinval traffic is very high. Patch by me. Review and testing by Noah Misch.	2011-07-29 16:46:13 -04:00
Peter Eisentraut	0fe8150827	Minor message style adjustment	2011-07-27 23:54:46 +03:00
Robert Haas	8e5ac74c12	Some refinement for the "fast path" lock patch. 1. In GetLockStatusData, avoid initializing instance before we've ensured that the array is large enough. Otherwise, if repalloc moves the block around, we're hosed. 2. Add the word "Relation" to the name of some identifiers, to avoid assuming that the fast-path mechanism will only ever apply to relations (though these particular parts certainly will). Some of the macros could possibly use similar treatment, but the names are getting awfully long already. 3. Add a missing word to comment in AtPrepare_Locks().	2011-07-19 12:10:15 -04:00
Peter Eisentraut	30f854537d	Change debug message from ereport to elog	2011-07-19 07:50:10 +03:00
Robert Haas	3cba8999b3	Create a "fast path" for acquiring weak relation locks. When an AccessShareLock, RowShareLock, or RowExclusiveLock is requested on an unshared database relation, and we can verify that no conflicting locks can possibly be present, record the lock in a per-backend queue, stored within the PGPROC, rather than in the primary lock table. This eliminates a great deal of contention on the lock manager LWLocks. This patch also refactors the interface between GetLockStatusData() and pg_lock_status() to be a bit more abstract, so that we don't rely so heavily on the lock manager's internal representation details. The new fast path lock structures don't have a LOCK or PROCLOCK structure to return, so we mustn't depend on that for purposes of listing outstanding locks. Review by Jeff Davis.	2011-07-18 00:49:28 -04:00
Tom Lane	9473bb96d0	Further thoughts about temp_file_limit patch. Move FileClose's decrement of temporary_files_size up, so that it will be executed even if elog() throws an error. This is reasonable since if the unlink() fails, the fact the file is still there is not our fault, and we are going to forget about it anyhow. So we won't count it against temp_file_limit anymore. Update fileSize and temporary_files_size correctly in FileTruncate. We probably don't have any places that truncate temp files, but fd.c surely should not assume that.	2011-07-17 15:05:44 -04:00
Tom Lane	23e5b16c71	Add temp_file_limit GUC parameter to constrain temporary file space usage. The limit is enforced against the total amount of temp file space used by each session. Mark Kirkwood, reviewed by Cédric Villemain and Tatsuo Ishii	2011-07-17 14:19:31 -04:00
Tom Lane	1af37ec96d	Replace errdetail("%s", ...) with errdetail_internal("%s", ...). There may be some other places where we should use errdetail_internal, but they'll have to be evaluated case-by-case. This commit just hits a bunch of places where invoking gettext is obviously a waste of cycles.	2011-07-16 14:22:18 -04:00
Tom Lane	3ee7c8710d	Use errdetail_internal() for SSI transaction cancellation details. Per discussion, these seem too technical to be worth translating. Kevin Grittner	2011-07-16 14:22:16 -04:00
Robert Haas	4240e429d0	Try to acquire relation locks in RangeVarGetRelid. In the previous coding, we would look up a relation in RangeVarGetRelid, lock the resulting OID, and then AcceptInvalidationMessages(). While this was sufficient to ensure that we noticed any changes to the relation definition before building the relcache entry, it didn't handle the possibility that the name we looked up no longer referenced the same OID. This was particularly problematic in the case where a table had been dropped and recreated: we'd latch on to the entry for the old relation and fail later on. Now, we acquire the relation lock inside RangeVarGetRelid, and retry the name lookup if we notice that invalidation messages have been processed meanwhile. Many operations that would previously have failed with an error in the presence of concurrent DDL will now succeed. There is a good deal of work remaining to be done here: many callers of RangeVarGetRelid still pass NoLock for one reason or another. In addition, nothing in this patch guards against the possibility that the meaning of an unqualified name might change due to the creation of a relation in a schema earlier in the user's search path than the one where it was previously found. Furthermore, there's nothing at all here to guard against similar race conditions for non-relations. For all that, it's a start. Noah Misch and Robert Haas	2011-07-08 22:19:30 -04:00
Heikki Linnakangas	89fd72cbf2	Introduce a pipe between postmaster and each backend, which can be used to detect postmaster death. Postmaster keeps the write-end of the pipe open, so when it dies, children get EOF in the read-end. That can conveniently be waited for in select(), which allows eliminating some of the polling loops that check for postmaster death. This patch doesn't yet change all the loops to use the new mechanism, expect a follow-on patch to do that. This changes the interface to WaitLatch, so that it takes as argument a bitmask of events that it waits for. Possible events are latch set, timeout, postmaster death, and socket becoming readable or writeable. The pipe method behaves slightly differently from the kill() method previously used in PostmasterIsAlive() in the case that postmaster has died, but its parent has not yet read its exit code with waitpid(). The pipe returns EOF as soon as the process dies, but kill() continues to return true until waitpid() has been called (IOW while the process is a zombie). Because of that, change PostmasterIsAlive() to use the pipe too, otherwise WaitLatch() would return immediately with WL_POSTMASTER_DEATH, while PostmasterIsAlive() would claim it's still alive. That could easily lead to busy-waiting while postmaster is in zombie state. Peter Geoghegan with further changes by me, reviewed by Fujii Masao and Florian Pflug.	2011-07-08 18:44:07 +03:00
Heikki Linnakangas	9598afa3b0	Fix one overflow and one signedness error, caused by the patch to calculate OLDSERXID_MAX_PAGE based on BLCKSZ. MSVC compiler warned about these.	2011-07-08 17:29:53 +03:00
Heikki Linnakangas	bdaabb9b22	There's a small window wherein a transaction is committed but not yet on the finished list, and we shouldn't flag it as a potential conflict if so. We can also skip adding a doomed transaction to the list of possible conflicts because we know it won't commit. Dan Ports and Kevin Grittner.	2011-07-08 00:36:30 +03:00
Heikki Linnakangas	406d61835b	SSI has a race condition, where the order of commit sequence numbers of transactions might not match the order in which the work done in those transactions becomes visible to others. The logic in SSI, however, assumed that it does. Fix that by having two sequence numbers for each serializable transaction, one taken before a transaction becomes visible to others, and one after it. This is easier than trying to make the transition totally atomic, which would require holding ProcArrayLock and SerializableXactHashLock at the same time. By using prepareSeqNo instead of commitSeqNo in a few places where commit sequence numbers are compared, we can make those comparisons err on the safe side when we don't know for sure which committed first. Per analysis by Kevin Grittner and Dan Ports, but this approach to fix it is different from the original patch.	2011-07-07 23:26:34 +03:00
Robert Haas	5b2b444f66	Adjust OLDSERXID_MAX_PAGE based on BLCKSZ. The value when BLCKSZ = 8192 is unchanged, but with larger-than-normal block sizes we might need to crank things back a bit, as we'll have more entries per page than normal in that case. Kevin Grittner	2011-07-07 15:05:21 -04:00
Heikki Linnakangas	928408d9e5	Fix a bug with SSI and prepared transactions: If there's a dangerous structure T0 ---> T1 ---> T2, and T2 commits first, we need to abort something. If T2 commits before both conflicts appear, then it should be caught by OnConflict_CheckForSerializationFailure. If both conflicts appear before T2 commits, it should be caught by PreCommit_CheckForSerializationFailure. But that is actually run when T2 prepares. Fix that in OnConflict_CheckForSerializationFailure, by treating a prepared T2 as if it committed already. This is mostly a problem for prepared transactions, which are in prepared state for some time, but also for regular transactions because they also go through the prepared state in the SSI code for a short moment when they're committed. Kevin Grittner and Dan Ports	2011-07-07 18:12:15 +03:00
Peter Eisentraut	27af66162b	Message style tweaks	2011-07-05 00:01:35 +03:00
Peter Eisentraut	21f1e15aaf	Unify spelling of "canceled", "canceling", "cancellation" We had previously (`af26857a27`) established the U.S. spellings as standard.	2011-06-29 09:28:46 +03:00
Tom Lane	223be216af	Undo overly enthusiastic de-const-ification. s/const//g wasn't exactly what I was suggesting here ... parameter declarations of the form "const structtype *param" are good and useful, so put those occurrences back. Likewise, avoid casting away the const in a "const void *" parameter.	2011-06-22 23:04:46 -04:00
Heikki Linnakangas	5da417f7c4	Remove pointless const qualifiers from function arguments in the SSI code. As Tom Lane pointed out, "const Relation foo" doesn't guarantee that you can't modify the data the "foo" pointer points to. It just means that you can't change the pointer to point to something else within the function, which is not very useful.	2011-06-22 12:18:39 +03:00
Tom Lane	a3290f655e	Minor editing for README-SSI. Fix some grammatical issues, try to clarify a couple of proofs, make the terminology more consistent.	2011-06-21 18:01:22 -04:00
Heikki Linnakangas	1eea8e8a06	Fix bug in PreCommit_CheckForSerializationFailure. A transaction that has already been marked as PREPARED cannot be killed. Kill the current transaction instead. One of the prepared_xacts regression tests actually hits this bug. I removed the anomaly from the duplicate-gids test so that it fails in the intended way, and added a new test to check serialization failures with a prepared transaction. Dan Ports	2011-06-21 14:49:50 +03:00
Heikki Linnakangas	7cb2ff9621	Fix bug introduced by recent SSI patch to merge ROLLED_BACK and MARKED_FOR_DEATH flags into one. We still need the ROLLED_BACK flag to mark transactions that are in the process of being rolled back. To be precise, ROLLED_BACK now means that a transaction has already been discounted from the count of transactions with the oldest xmin, but not yet removed from the list of active transactions. Dan Ports	2011-06-21 14:49:50 +03:00
Peter Eisentraut	8a8fbe7e79	Capitalization fixes	2011-06-19 00:37:30 +03:00
Robert Haas	c573486ce9	Fix minor thinko in ProcGlobalShmemSize(). There's no need to add space for startupBufferPinWaitBufId, because it's part of the PROC_HDR object for which this function already allocates space. This has been wrong for a while, but the only consequence is that our shared memory allocation is increased by 4 bytes, so no back-patch.	2011-06-17 09:12:19 -04:00
Heikki Linnakangas	78475b0eca	Update README-SSI. Add a section to describe the "dangerous structure" that SSI is based on, as well as the optimizations about relative commit times and read-only transactions. Plus a bunch of other misc fixes and improvements. Dan Ports	2011-06-16 21:20:39 +03:00
Heikki Linnakangas	cb94db91b2	pgindent run of recent SSI changes. Also, remove an unnecessary #include. Kevin Grittner	2011-06-16 16:17:22 +03:00
Heikki Linnakangas	264a6b127a	The rolled-back flag on serializable xacts was pointless and redundant with the marked-for-death flag. It was only set for a fleeting moment while a transaction was being cleaned up at rollback. All the places that checked for the rolled-back flag should also check the marked-for-death flag, as both flags mean that the transaction will roll back. I also renamed the marked-for-death into "doomed", which is a lot shorter name.	2011-06-15 13:35:28 +03:00
Heikki Linnakangas	0a0e2b52a5	Make non-MVCC snapshots exempt from predicate locking. Scans with non-MVCC snapshots, like in REINDEX, are basically non-transactional operations. The DDL operation itself might participate in SSI, but there's separate functions for that. Kevin Grittner and Dan Ports, with some changes by me.	2011-06-15 12:11:18 +03:00
Heikki Linnakangas	13000b44d6	Remove now-unnecessary casts. Kevin Grittner	2011-06-12 22:49:33 +03:00
Robert Haas	47ebcecc3e	Code cleanup for InitProcGlobal. The old code creates three separate arrays when only one is needed, using two different shmem allocation functions for no obvious reason. It also strangely splits up the initialization of AuxiliaryProcs between the top and bottom of the function to no evident purpose. Review by Tom Lane.	2011-06-12 00:07:04 -04:00
Heikki Linnakangas	cb2d158c58	Fix locking while setting flags in MySerializableXact. Even if a flag is modified only by the backend owning the transaction, it's not safe to modify it without a lock. Another backend might be setting or clearing a different flag in the flags field concurrently, and that operation might be lost because setting or clearing a bit in a word is not atomic. Make did-write flag a simple backend-private boolean variable, because it was only set or tested in the owning backend (except when committing a prepared transaction, but it's not worthwhile to optimize for the case of a read-only prepared transaction). This also eliminates the need to add locking where that flag is set. Also, set the did-write flag when doing DDL operations like DROP TABLE or TRUNCATE -- that was missed earlier.	2011-06-10 23:41:10 +03:00
Alvaro Herrera	fba105b109	Use "transient" files for blind writes, take 2 "Blind writes" are a mechanism to push buffers down to disk when evicting them; since they may belong to different databases than the one a backend is connected to, the backend does not necessarily have a relation to link them to, and thus no way to blow them away. We were keeping those files open indefinitely, which would cause a problem if the underlying table was deleted, because the operating system would not be able to reclaim the disk space used by those files. To fix, have bufmgr mark such files as transient to smgr; the lower layer is allowed to close the file descriptor when the current transaction ends. We must be careful to have any other access of the file to remove the transient markings, to prevent unnecessary expensive system calls when evicting buffers belonging to our own database (which files we're likely to require again soon.) This commit fixes a bug in the previous one, which neglected to cleanly handle the LRU ring that fd.c uses to manage open files, and caused an unacceptable failure just before beta2 and was thus reverted.	2011-06-10 13:43:02 -04:00
Alvaro Herrera	3d114b63b2	Use a constant sprintf format to silence compiler warning	2011-06-10 13:38:50 -04:00
Heikki Linnakangas	c79c570bd8	Small comment fixes and enhancements.	2011-06-10 17:22:46 +03:00
Alvaro Herrera	9261557eb1	Revert "Use "transient" files for blind writes" This reverts commit `54d9e8c6c1`, which caused a failure on the buildfarm. Not a good thing to have just before a beta release.	2011-06-09 16:41:44 -04:00
Alvaro Herrera	54d9e8c6c1	Use "transient" files for blind writes "Blind writes" are a mechanism to push buffers down to disk when evicting them; since they may belong to different databases than the one a backend is connected to, the backend does not necessarily have a relation to link them to, and thus no way to blow them away. We were keeping those files open indefinitely, which would cause a problem if the underlying table was deleted, because the operating system would not be able to reclaim the disk space used by those files. To fix, have bufmgr mark such files as transient to smgr; the lower layer is allowed to close the file descriptor when the current transaction ends. We must be careful to have any other access of the file to remove the transient markings, to prevent unnecessary expensive system calls when evicting buffers belonging to our own database (which files we're likely to require again soon.)	2011-06-09 16:25:49 -04:00
Heikki Linnakangas	e1c26ab853	Fix the truncation logic of the OldSerXid SLRU mechanism. We can't pass SimpleLruTruncate() a page number that's "in the future", because it will issue a warning and refuse to truncate anything. Instead, we leave behind the latest segment. If the slru is not needed before XID wrap-around, the segment will appear as new again, and not be cleaned up until it gets old enough again. That's a bit unpleasant, but better than not cleaning up anything. Also, fix broken calculation to check and warn if the span of the OldSerXid SLRU is getting too large to fit in the 64k SLRU pages that we have available. It was not XID wraparound aware. Kevin Grittner and me.	2011-06-09 21:39:39 +03:00
Heikki Linnakangas	5234161ac1	Mark the SLRU page as dirty when setting an entry in pg_serial. In the passing, fix an incorrect comment.	2011-06-09 12:10:14 +03:00
Heikki Linnakangas	8f9622bbb3	Make DDL operations play nicely with Serializable Snapshot Isolation. Truncating or dropping a table is treated like deletion of all tuples, and check for conflicts accordingly. If a table is clustered or rewritten by ALTER TABLE, all predicate locks on the heap are promoted to relation-level locks, because the tuple or page ids of any existing tuples will change and won't be valid after rewriting the table. Arguably ALTER TABLE should be treated like a mass-UPDATE of every row, but if you e.g change the datatype of a column, you could also argue that it's just a change to the physical layout, not a logical change. Reindexing promotes all locks on the index to relation-level lock on the heap. Kevin Grittner, with a lot of cosmetic changes by me.	2011-06-08 14:02:43 +03:00
Heikki Linnakangas	a31ff707a2	Make ascii-art in comments pgindent-safe, and some other formatting changes. Kevin Grittner	2011-06-07 09:54:24 +03:00
Heikki Linnakangas	c8630919e0	SSI comment fixes and enhancements. Notably, document that the conflict-out flag actually means that the transaction has a conflict out to a transaction that committed before the flagged transaction. Kevin Grittner	2011-06-03 12:45:42 +03:00
Peter Eisentraut	ba4cacf075	Recode non-ASCII characters in source to UTF-8 For consistency, have all non-ASCII characters from contributors' names in the source be in UTF-8. But remove some other more gratuitous uses of non-ASCII characters.	2011-05-31 23:11:46 +03:00
Heikki Linnakangas	3103f9a77d	The row-version chaining in Serializable Snapshot Isolation was still wrong. On further analysis, it turns out that it is not needed to duplicate predicate locks to the new row version at update, the lock on the version that the transaction saw as visible is enough. However, there was a different bug in the code that checks for dangerous structures when a new rw-conflict happens. Fix that bug, and remove all the row-version chaining related code. Kevin Grittner & Dan Ports, with some comment editorialization by me.	2011-05-30 20:47:17 +03:00
Robert Haas	74aaa2136d	Fix race condition in CheckTargetForConflictsIn. Dan Ports	2011-05-19 12:12:04 -04:00
Robert Haas	71932ecc2b	Add comment about memory reordering to PredicateLockTupleRowVersionLink. Dan Ports, per head-scratching from Simon Riggs and myself.	2011-05-06 21:55:10 -04:00
Robert Haas	02e6a115cc	Add fast paths for cases when no serializable transactions are running. Dan Ports	2011-04-25 09:52:01 -04:00
Heikki Linnakangas	4c37c1e3b2	Reduce the initial size of local lock hash to 16 entries. The hash table is seq scanned at transaction end, to release all locks, and making the hash table larger than necessary makes that slower. With very simple queries, that overhead can amount to a few percent of the total CPU time used. At the moment, backend startup needs 6 locks, and a simple query with one table and index needs 3 locks. 16 is enough for even quite complicated transactions, and it will grow automatically if it fills up.	2011-04-15 15:07:36 +03:00
Peter Eisentraut	5caa3479c2	Clean up most -Wunused-but-set-variable warnings from gcc 4.6 This warning is new in gcc 4.6 and part of -Wall. This patch cleans up most of the noise, but there are still some warnings that are trickier to remove.	2011-04-11 22:28:45 +03:00
Heikki Linnakangas	dad1f46382	TransferPredicateLocksToNewTarget should initialize a new lock entry's commitSeqNo to that of the old one being transferred, or take the minimum commitSeqNo if it is merging two lock entries. Also, CreatePredicateLock should initialize commitSeqNo to InvalidSerCommitSeqNo instead of to 0. (I don't think using 0 would actually affect anything, but we should be consistent.) I also added a couple of assertions I used to track this down: a lock's commitSeqNo should never be zero, and it should be InvalidSerCommitSeqNo if and only if the lock is not held by OldCommittedSxact. Dan Ports, to fix leak of predicate locks reported by YAMAMOTO Takashi.	2011-04-11 13:46:37 +03:00
Heikki Linnakangas	7c797e7194	Fix the size of predicate lock manager's shared memory hash tables at creation. This way they don't compete with the regular lock manager for the slack shared memory, making the behavior more predictable.	2011-04-11 13:43:31 +03:00
Bruce Momjian	bf50caf105	pgindent run before PG 9.1 beta 1.	2011-04-10 11:42:00 -04:00
Robert Haas	cdcdfca401	Truncate the predicate lock SLRU to empty, instead of almost empty. Otherwise, the SLRU machinery can get confused and think that the SLRU has wrapped around. Along the way, regardless of whether we're truncating all of the SLRU or just some of it, flush pages after truncating, rather than before. Kevin Grittner	2011-04-08 16:52:19 -04:00
Robert Haas	fbc0d07796	Partially roll back overenthusiastic SSI optimization. When a regular lock is held, SSI can use that in lieu of a predicate lock to detect rw conflicts; but if the regular lock is being taken by a subtransaction, we can't assume that it'll commit, so releasing the parent transaction's lock in that case is a no-no. Kevin Grittner	2011-04-08 15:29:02 -04:00
Robert Haas	56c7140ca8	Tweaks for SSI out-of-shared memory behavior. If we call hash_search() with HASH_ENTER, it will bail out rather than return NULL, so it's redundant to check for NULL again in the caller. Thus, in cases where we believe it's impossible for the hash table to run out of slots anyway, we can simplify the code slightly. On the flip side, in cases where it's theoretically possible to run out of space, we don't want to rely on dynahash.c to throw an error; instead, we pass HASH_ENTER_NULL and throw the error ourselves if a NULL comes back, so that we can provide a more descriptive error message. Kevin Grittner	2011-04-07 16:43:39 -04:00
Robert Haas	632f0faa7c	Repair some flakiness in CheckTargetForConflictsIn. When we release and reacquire SerializableXactHashLock, we must recheck whether an R/W conflict still needs to be flagged, because it could have changed under us in the meantime. And when we release the partition lock, we must re-walk the list of predicate locks from the beginning, because our pointer could get invalidated under us. Bug report #5952 by Yamamoto Takashi. Patch by Kevin Grittner.	2011-04-05 15:17:25 -04:00
Heikki Linnakangas	60b142b9a6	Fix a tiny race condition in predicate locking. Need to hold the lock while examining the head of predicate locks list. Also, fix the comment of RemoveTargetIfNoLongerUsed, it was neglected when we changed the way update chains are handled. Kevin Grittner	2011-03-31 18:43:23 +03:00
Simon Riggs	b98ac467f5	Prevent intermittent hang in recovery from bgwriter interaction. Startup process waited for cleanup lock but when hot_standby = off the pid was not registered, so that the bgwriter would not wake the waiting process as intended.	2011-03-23 13:30:05 +00:00
Robert Haas	6436098795	Minor sync rep corrections. Fujii Masao, with a bit of additional wordsmithing by me.	2011-03-10 14:57:02 -05:00
Heikki Linnakangas	46c333a963	Fix overly strict assertion in SummarizeOldestCommittedSxact(). There's a race condition where SummarizeOldestCommittedSxact() is called even though another backend already cleared out all finished sxact entries. That's OK, RegisterSerializableTransactionInt() can just retry getting a new sxact slot from the available-list when that happens. Reported by YAMAMOTO Takashi, bug #5918.	2011-03-08 21:06:26 +02:00
Heikki Linnakangas	93d888232e	Don't throw a warning if vacuum sees PD_ALL_VISIBLE flag set on a page that contains newly-inserted tuples that according to our OldestXmin are not yet visible to everyone. The value returned by GetOldestXmin() is conservative, and it can move backwards on repeated calls, so if we see that contradiction between the PD_ALL_VISIBLE flag and status of tuples on the page, we have to assume it's because an earlier vacuum calculated a higher OldestXmin value, and all the tuples really are visible to everyone. We have received several reports of this bug, with the "PD_ALL_VISIBLE flag was incorrectly set in relation ..." warning appearing in logs. We were finally able to hunt it down with David Gould's help to run extra diagnostics in an environment where this happened frequently. Also reword the warning, per Robert Haas' suggestion, to not imply that the PD_ALL_VISIBLE flag is necessarily at fault, as it might also be a symptom of corruption on a tuple header. Backpatch to 8.4, where the PD_ALL_VISIBLE flag was introduced.	2011-03-08 20:30:53 +02:00
Heikki Linnakangas	4cd3fb6e12	Truncate predicate lock manager's SLRU lazily at checkpoint. That's safer than doing it aggressively whenever the tail-XID pointer is advanced, because this way we don't need to do it while holding SerializableXactHashLock. This also fixes bug #5915 spotted by YAMAMOTO Takashi, and removes an obsolete comment spotted by Kevin Grittner.	2011-03-08 12:12:54 +02:00
Simon Riggs	a8a8a3e096	Efficient transaction-controlled synchronous replication. If a standby is broadcasting reply messages and we have named one or more standbys in synchronous_standby_names then allow users who set synchronous_replication to wait for commit, which then provides strict data integrity guarantees. Design avoids sending and receiving transaction state information so minimises bookkeeping overheads. We synchronize with the highest priority standby that is connected and ready to synchronize. Other standbys can be defined to take over in case of standby failure. This version has very strict behaviour; more relaxed options may be added at a later date. Simon Riggs and Fujii Masao, with reviews by Yeb Havinga, Jaime Casanova, Heikki Linnakangas and Robert Haas, plus the assistance of many other design reviewers.	2011-03-06 22:49:16 +00:00
Heikki Linnakangas	ee3838b1d3	You must hold a lock on the heap page when you call CheckForSerializableConflictOut(), because it can set hint bits. YAMAMOTO Takashi	2011-03-04 15:43:11 +02:00
Heikki Linnakangas	47ad79122b	Fix bugs in Serializable Snapshot Isolation. Change the way UPDATEs are handled. Instead of maintaining a chain of tuple-level locks in shared memory, copy any existing locks on the old tuple to the new tuple at UPDATE. Any existing page-level lock needs to be duplicated too, as a lock on the new tuple. That was neglected previously. Store xmin on tuple-level predicate locks, to distinguish a lock on an old already-recycled tuple from a new tuple at the same physical location. Failure to distinguish them caused loops in the tuple-lock chains, as reported by YAMAMOTO Takashi. Although we don't use the chain representation of UPDATEs anymore, it seems like a good idea to store the xmin to avoid some false positives if no other reason. CheckSingleTargetForConflictsIn now correctly handles the case where a lock that's being held is not reflected in the local lock table. That happens if another backend acquires a lock on our behalf due to an UPDATE or a page split. PredicateLockPageCombine now retains locks for the page that is being removed, rather than removing them. This prevents a potentially dangerous false-positive inconsistency where the local lock table believes that a lock is held, but it is actually not. Dan Ports and Kevin Grittner	2011-03-01 19:05:16 +02:00
Magnus Hagander	45a6d79b17	Properly initialize variables Kevin Grittner	2011-02-18 11:59:57 +01:00
Itagaki Takahiro	62c7bd31c8	Add transaction-level advisory locks. They share the same locking namespace with the existing session-level advisory locks, but they are automatically released at the end of the current transaction and cannot be released explicitly via unlock functions. Marko Tiikkaja, reviewed by me.	2011-02-18 14:05:12 +09:00
Simon Riggs	bca8b7f16a	Hot Standby feedback for avoidance of cleanup conflicts on standby. Standby optionally sends back information about oldestXmin of queries which is then checked and applied to the WALSender's proc->xmin. GetOldestXmin() is modified slightly to agree with GetSnapshotData(), so that all backends on primary include WALSender within their snapshots. Note this does nothing to change the snapshot xmin on either master or standby. Feedback piggybacks on the standby reply message. vacuum_defer_cleanup_age is no longer used on standby, though parameter still exists on primary, since some use cases still exist. Simon Riggs, review comments from Fujii Masao, Heikki Linnakangas, Robert Haas	2011-02-16 19:29:37 +00:00
Robert Haas	6a77e9385e	Rename max_predicate_locks_per_transaction. The new name, max_pred_locks_per_transaction, is shorter. Kevin Grittner, per discussion.	2011-02-15 08:04:55 -05:00
Heikki Linnakangas	cecb5901b8	Allocate all entries in the serializable xid hash up-front, so that you don't run out of shared memory when you try to assign an xid to a transaction. Kevin Grittner	2011-02-10 12:03:21 +02:00
Heikki Linnakangas	036bb15872	Fix allocation of RW-conflict pool in the new predicate lock manager, and also take the RW-conflict pool into account in the PredicateLockShmemSize() estimate.	2011-02-09 12:23:07 +02:00
Heikki Linnakangas	7202ad7b8d	Fix copy-pasto in description of pg_serial, and silence compiler warning about uninitialized field you get on some compilers.	2011-02-08 09:05:13 +02:00
Heikki Linnakangas	dafaa3efb7	Implement genuine serializable isolation level. Until now, our Serializable mode has in fact been what's called Snapshot Isolation, which allows some anomalies that could not occur in any serialized ordering of the transactions. This patch fixes that using a method called Serializable Snapshot Isolation, based on research papers by Michael J. Cahill (see README-SSI for full references). In Serializable Snapshot Isolation, transactions run like they do in Snapshot Isolation, but a predicate lock manager observes the reads and writes performed and aborts transactions if it detects that an anomaly might occur. This method produces some false positives, ie. it sometimes aborts transactions even though there is no anomaly. To track reads we implement predicate locking, see storage/lmgr/predicate.c. Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared memory is finite, so when a transaction takes many tuple-level locks on a page, the locks are promoted to a single page-level lock, and further to a single relation level lock if necessary. To lock key values with no matching tuple, a sequential scan always takes a relation-level lock, and an index scan acquires a page-level lock that covers the search key, whether or not there are any matching keys at the moment. A predicate lock doesn't conflict with any regular locks or with other predicate locks in the normal sense. They're only used by the predicate lock manager to detect the danger of anomalies. Only serializable transactions participate in predicate locking, so there should be no extra overhead for other transactions. Predicate locks can't be released at commit, but must be remembered until all the transactions that overlapped with it have completed. That means that we need to remember an unbounded amount of predicate locks, so we apply a lossy but conservative method of tracking locks for committed transactions. 
If we run short of shared memory, we overflow to a new "pg_serial" SLRU pool. We don't currently allow Serializable transactions in Hot Standby mode. That would be hard, because even read-only transactions can cause anomalies that wouldn't otherwise occur. Serializable isolation mode now means the new fully serializable level. Repeatable Read gives you the old Snapshot Isolation level that we have always had. Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and Anssi Kääriäinen	2011-02-08 00:09:08 +02:00
Simon Riggs	8585ad3625	Fix error code for canceling statement due to conflict with recovery. All retryable conflict errors now have an error code that indicates that a retry is possible, correcting my incomplete fix of 2010/05/12. Tatsuo Ishii and Simon Riggs, input from Robert Haas and Florian Pflug	2011-01-31 19:20:23 +00:00
Robert Haas	7f242d880b	Try to avoid running with a full fsync request queue. When we need to insert a new entry and the queue is full, compact the entire queue in the hopes of making room for the new entry. Doing this on every insertion might worsen contention on BgWriterCommLock, but when the queue is full, it's far better than allowing the backend to perform its own fsync, per testing by Greg Smith as reported in http://archives.postgresql.org/pgsql-hackers/2011-01/msg02665.php Original idea from Greg Smith. Patch by me. Review by Chris Browne and Greg Smith	2011-01-29 08:08:41 -05:00
Tom Lane	7ab6f2da23	Change inv_truncate() to not repeat its systable_getnext_ordered() scan. In the case where the initial call of systable_getnext_ordered() returned NULL, this function would nonetheless call it again. That's undefined behavior that only by chance failed to give visibly incorrect results. Put an if-test around the final loop to prevent that, and in passing improve some comments. No back-patch since there's no actual failure. Per report from YAMAMOTO Takashi.	2011-01-26 19:33:50 -05:00
Heikki Linnakangas	b1dc45c11d	Fix thinko in comment. Spotted by Jim Nasby.	2011-01-18 10:46:13 +02:00
Heikki Linnakangas	8f5d65e916	Treat a WAL sender process that hasn't started streaming yet as a regular backend, as far as the postmaster shutdown logic is concerned. That means, fast shutdown will wait for WAL sender processes to exit before signaling bgwriter to finish. This avoids race conditions between a base backup stopping or starting, and bgwriter writing the shutdown checkpoint WAL record. We don't want e.g the end-of-backup WAL record to be written after the shutdown checkpoint.	2011-01-15 16:38:21 +02:00
Bruce Momjian	5d950e3b0c	Stamp copyrights for year 2011.	2011-01-01 13:18:15 -05:00
Robert Haas	53dbc27c62	Support unlogged tables. The contents of an unlogged table are not WAL-logged; thus, they are not available on standby servers and are truncated whenever the database system enters recovery. Indexes on unlogged tables are also unlogged. Unlogged GiST indexes are not currently supported.	2010-12-29 06:48:53 -05:00
Robert Haas	24ecde7742	Work around unfortunate getppid() behavior on BSD-ish systems. On MacOS X, and apparently also on other BSD-derived systems, attaching a debugger causes getppid() to return the pid of the debugging process rather than the actual parent PID. As a result, debugging the autovacuum launcher, startup process, or WAL sender on such systems causes it to exit, because the previous coding of PostmasterIsAlive() detects postmaster death by testing whether getppid() == PostmasterPid. Work around that behavior by checking the return value of getppid() more carefully. If it's PostmasterPid, the postmaster must be alive; if it's 1, assume the postmaster is dead. If it's any other value, assume we've been debugged and fall through to the less-reliable kill() test. Review by Tom Lane.	2010-12-21 06:30:32 -05:00
Robert Haas	8bd4b89e24	Try to save a kernel call in ResolveRecoveryConflictWithVirtualXIDs. If there's no work to be done, just exit quickly, before initialization.	2010-12-17 11:32:02 -05:00
Robert Haas	611fed3712	Reset 'ps' display just once when resolving VXID conflicts. This prevents the word "waiting" from briefly disappearing from the ps status line when ResolveRecoveryConflictWithVirtualXIDs begins a new iteration of the outer loop. Along the way, remove some useless pgstat_report_waiting() calls; the startup process doesn't appear in pg_stat_activity. Fujii Masao	2010-12-17 08:30:57 -05:00
Robert Haas	34c70c7ac4	Instrument checkpoint sync calls. Greg Smith, reviewed by Jeff Janes	2010-12-14 09:26:19 -05:00
Robert Haas	5f7b58fad8	Generalize concept of temporary relations to "relation persistence". This commit replaces pg_class.relistemp with pg_class.relpersistence; and also modifies the RangeVar node type to carry relpersistence rather than istemp. It also removes rd_istemp from RelationData and instead performs the correct computation based on relpersistence. For clarity, we add three new macros: RelationNeedsWAL(), RelationUsesLocalBuffers(), and RelationUsesTempNamespace(), so that we can clarify the purpose of each check that previously depended on rd_istemp. This is intended as infrastructure for the upcoming unlogged tables patch, as well as for future possible work on global temporary tables.	2010-12-13 12:34:26 -05:00
Tom Lane	04f4e10cfc	Use symbolic names not octal constants for file permission flags. Purely cosmetic patch to make our coding standards more consistent --- we were doing symbolic some places and octal other places. This patch fixes all C-coded uses of mkdir, chmod, and umask. There might be some other calls I missed. Inconsistency noted while researching tablespace directory permissions issue.	2010-12-10 17:35:33 -05:00
Tom Lane	576477e73c	Force default wal_sync_method to be fdatasync on Linux. Recent versions of the Linux system header files cause xlogdefs.h to believe that open_datasync should be the default sync method, whereas formerly fdatasync was the default on Linux. open_datasync is a bad choice, first because it doesn't actually outperform fdatasync (in fact the reverse), and second because we try to use O_DIRECT with it, causing failures on certain filesystems (e.g., ext4 with data=journal option). This part of the patch is largely per a proposal from Marti Raudsepp. More extensive changes are likely to follow in HEAD, but this is as much change as we want to back-patch. Also clean up confusing code and incorrect documentation surrounding the fsync_writethrough option. Those changes shouldn't result in any actual behavioral change, but I chose to back-patch them anyway to keep the branches looking similar in this area. In 9.0 and HEAD, also do some copy-editing on the WAL Reliability documentation section. Back-patch to all supported branches, since any of them might get used on modern Linux versions.	2010-12-08 20:01:09 -05:00
Simon Riggs	e620ee35b2	Optimize commit_siblings in two ways to improve group commit. First, avoid scanning the whole ProcArray once we know there are at least commit_siblings active; second, skip the check altogether if commit_siblings = 0. Greg Smith	2010-12-08 18:48:03 +00:00
Heikki Linnakangas	5a031a5556	Fix bugs in the hot standby known-assigned-xids tracking logic. If there's an old transaction running in the master, and a lot of transactions have started and finished since, and a WAL-record is written in the gap between creating the running-xacts snapshot and WAL-logging it, recovery will fail with "too many KnownAssignedXids" error. This bug was reported by Joachim Wieland on Nov 19th. In the same scenario, when fewer transactions have started so that all the xids fit in KnownAssignedXids despite the first bug, a more serious bug arises. We incorrectly initialize the clog code with the oldest still running transaction, and when we see the WAL record belonging to a transaction with an XID larger than one that committed already before the checkpoint we're recovering from, we zero the clog page containing the already committed transaction, leading to data loss. In hindsight, trying to track xids in the known-assigned-xids array before seeing the running-xacts record was too complicated. To fix that, hold XidGenLock while the running-xacts snapshot is taken and WAL-logged. That ensures that no transaction can begin or end in that gap, so that in recovery we know that the snapshot contains all transactions running at that point in WAL.	2010-12-07 09:23:30 +01:00
Simon Riggs	ed78384acd	Move call to GetTopTransactionId() earlier in LockAcquire(), removing an infrequently occurring race condition in Hot Standby. An xid must be assigned before a lock appears in shared memory, rather than immediately after, else GetRunningTransactionLocks() may see InvalidTransactionId, causing assertion failures during lock processing on standby. Bug report and diagnosis by Fujii Masao, fix by me.	2010-11-29 01:08:02 +00:00
Robert Haas	cc1ed40d57	Object access hook framework, with post-creation hook. After a SQL object is created, we provide an opportunity for security or logging plugins to get control; for example, a security label provider could use this to assign an initial security label to newly created objects. The basic infrastructure is (hopefully) reusable for other types of events that might require similar treatment. KaiGai Kohei, with minor adjustments.	2010-11-25 11:50:13 -05:00
Robert Haas	c2281ac87c	Remove belt-and-suspenders guards against buffer pin leaks. Forcibly releasing all leftover buffer pins should be unnecessary now that we have a robust ResourceOwner mechanism, and it significantly increases the cost of process shutdown. Instead, in an assert-enabled build, assert that no pins are held; in a non-assert-enabled build, do nothing.	2010-11-25 00:06:46 -05:00
Peter Eisentraut	fc946c39ae	Remove useless whitespace at end of lines	2010-11-23 22:34:55 +02:00
Robert Haas	3134d8863e	Add new buffers_backend_fsync field to pg_stat_bgwriter. This new field counts the number of times that a backend which writes a buffer out to the OS must also fsync() it. This happens when the bgwriter fsync request queue is full, and is generally detrimental to performance, so it's good to know when it's happening. Along the way, log a new message at level DEBUG1 whenever we fail to hand off an fsync, so that the problem can also be seen in examination of log files (if the logging level is cranked up high enough). Greg Smith, with minor tweaks by me.	2010-11-15 12:42:59 -05:00
Simon Riggs	52010027ef	Avoid spurious Hot Standby conflicts from btree delete records. Similar conflicts were already avoided for related record types. Massive over-caution resulted in a usability bug. Clear theoretical basis for doing this is now confirmed by me. Request to remove from Heikki (twice), over-caution by me.	2010-11-15 09:30:13 +00:00
Robert Haas	11e482c350	Move copydir() prototype into its own header file. Having this in src/include/port.h makes no sense, now that copydir.c lives in src/backend/storage rather than src/port. Along the way, remove an obsolete comment from contrib/pg_upgrade that makes reference to the old location.	2010-11-12 16:39:53 -05:00
Tom Lane	54428dbe90	Fix error handling in temp-file deletion with log_temp_files active. The original coding in FileClose() reset the file-is-temp flag before unlinking the file, so that if control came back through due to an error, it wouldn't try to unlink the file twice. This was correct when written, but when the log_temp_files feature was added, the logging action was put in between those two steps. An error occurring during the logging action --- such as a query cancel --- would result in the unlink not getting done at all, as in recent report from Michael Glaesemann. To fix this, make sure that we do both the stat and the unlink before doing anything that could conceivably CHECK_FOR_INTERRUPTS. There is a judgment call here, which is which log message to emit first: if you can see only one, which should it be? I chose to log unlink failure at the risk of losing the log_temp_files log message --- after all, if the unlink does fail, the temp file is still there for you to see. Back-patch to all versions that have log_temp_files. The code was OK before that.	2010-11-08 22:14:48 -05:00
Tom Lane	5ac144d5c2	Improve messages for too many private files/dirs. Per Alexey Parshin.	2010-09-28 18:08:02 -04:00
Magnus Hagander	9f2e211386	Remove cvs keywords from all files.	2010-09-20 22:08:53 +02:00
Heikki Linnakangas	236b6bc29e	Simplify Windows implementation of latches. There's no need to keep a dynamic pool of event handles, we can permanently assign one for each shared latch. Thanks to that, we no longer need a separate shared memory block for latches, and we don't need to know in advance how many shared latches there are, so you no longer need to remember to update NumSharedLatches when you introduce a new latch to the system.	2010-09-15 10:06:21 +00:00
Heikki Linnakangas	2746e5f21d	Introduce latches. A latch is a boolean variable, with the capability to wait until it is set. Latches can be used to reliably wait until a signal arrives, which is hard otherwise because signals don't interrupt select() on some platforms, and even when they do, there are race conditions. On Unix, latches use the so-called self-pipe trick under the covers to implement the sleep until the latch is set, without race conditions. On Windows, Windows events are used. Use the new latch abstraction to sleep in walsender, so that as soon as a transaction finishes, walsender is woken up to immediately send the WAL to the standby. This reduces the latency between master and standby, which is good. Preliminary work by Fujii Masao. The latch implementation is by me, with helpful comments from many people.	2010-09-11 15:48:04 +00:00
Tom Lane	174a51332f	Cosmetic fixes for KnownAssignedXidsGetOldestXmin, per Fujii Masao.	2010-08-30 17:30:44 +00:00
Simon Riggs	e24d1dc069	Teach GetOldestXmin() about KnownAssignedXids during recovery. Very minor issue, though this is required for a later patch. Reported by Heikki Linnakangas.	2010-08-30 14:16:48 +00:00
Heikki Linnakangas	e1cc96dbf0	Fix typo in comment.	2010-08-30 06:33:22 +00:00
Tom Lane	b9defe0405	Marginal code cleanup for streaming replication. There is no reason that proc.c should have to get involved in this dirty hack for letting the postmaster know which children are walsenders. Revert that file to the way it was, and confine the kluge to pmsignal.c and postmaster.c.	2010-08-23 17:20:01 +00:00
Robert Haas	a481ff71af	Remove the isLocalBuf argument from ReadBuffer_common. Since an SMgrRelation now knows whether or not the underlying relation is temporary, there's no point in also passing that information via an additional argument.	2010-08-20 01:07:50 +00:00
Tom Lane	79dc97a401	Bring some sanity to the trace_recovery_messages code and docs. Per gripe from Fujii Masao, though this is not exactly his proposed patch. Categorize as DEVELOPER_OPTIONS and set context PGC_SIGHUP, as per Fujii, but set the default to LOG because higher values aren't really sensible (see the code for trace_recovery()). Fix the documentation to agree with the code and to try to explain what the variable actually does. Get rid of no-op calls trace_recovery(LOG), which accomplish nothing except to demonstrate that this option confuses even its author.	2010-08-19 22:55:01 +00:00
Tom Lane	bc7cb8f42c	Allocate local buffers in a context of their own, rather than dumping them into TopMemoryContext. This makes no functional difference, but makes it easier to see what the space is being used for in MemoryContextStats dumps. Per a recent example in which I was surprised by the size of TopMemoryContext.	2010-08-19 16:16:20 +00:00
Peter Eisentraut	3f11971916	Remove extra newlines at end and beginning of files, add missing newlines at end of files.	2010-08-19 05:57:36 +00:00
Robert Haas	d37781fa82	Tidy up a few calls to smgrextend(). In the new API introduced by my patch to include the backend ID in temprel filenames, the last argument to smgrextend() became skipFsync rather than isTemp, but these calls didn't get the memo. It's not really a problem to pass rel->rd_istemp rather than just plain false, because smgrextend() now automatically skips the fsync for temprels anyway, but this seems cleaner and saves some minute number of cycles.	2010-08-19 02:58:37 +00:00
Robert Haas	66b14030e8	Make LockDatabaseObject() AcceptInvalidationMessages(). This is appropriate for the same reasons we already do it in LockSharedObject(): things might have changed while we were waiting for the lock. There doesn't seem to be a live bug here at the moment, but that's mostly because it isn't currently used for very much.	2010-08-16 02:02:28 +00:00
Robert Haas	27f145a40e	Further dtrace adjustments for the backend-IDs-in-relpath patch. Update the documentation, and back out a few ill-considered changes whose folly I failed to realize for failure to read the documentation.	2010-08-14 02:22:10 +00:00
Robert Haas	105d4c5ffe	Fix assorted dtrace breakage caused by patch to include backend IDs in temp relpaths. Per buildfarm.	2010-08-13 22:54:17 +00:00
Robert Haas	debcec7dc3	Include the backend ID in the relpath of temporary relations. This allows us to reliably remove all leftover temporary relation files on cluster startup without reference to system catalogs or WAL; therefore, we no longer include temporary relations in XLOG_XACT_COMMIT and XLOG_XACT_ABORT WAL records. Since these changes require including a backend ID in each SharedInvalSmgrMsg, the size of the SharedInvalidationMessage.id field has been reduced from two bytes to one, and the maximum number of connections has been reduced from INT_MAX / 4 to 2^23-1. It would be possible to remove these restrictions by increasing the size of SharedInvalidationMessage by 4 bytes, but right now that doesn't seem like a good trade-off. Review by Jaime Casanova and Tom Lane.	2010-08-13 20:10:54 +00:00
Robert Haas	30c22eb8fc	Correct sundry errors in Hot Standby-related comments. Fujii Masao	2010-08-12 23:24:54 +00:00
Robert Haas	20be0d480a	Make log_temp_files based on kB, and revert docs & comments to match. Per extensive discussion on pgsql-hackers. We are deliberately not back-patching this even though the behavior of 8.3 and 8.4 is unquestionably broken, for fear of breaking existing users of this parameter. This incompatibility should be release-noted.	2010-07-06 22:55:26 +00:00
Bruce Momjian	239d769e7e	pgindent run for 9.0, second run	2010-07-06 19:19:02 +00:00
Tom Lane	aceedd88f6	Make vacuum_defer_cleanup_age be PGC_SIGHUP level, since it's not sensible to have different values in different processes of the primary server. Also put it into the "Streaming Replication" GUC category; it doesn't belong in "Standby Servers" because you use it on the master not the standby. In passing also correct guc.c's idea of wal_keep_segments' category.	2010-07-03 21:23:58 +00:00
Tom Lane	e76c1a0f4d	Replace max_standby_delay with two parameters, max_standby_archive_delay and max_standby_streaming_delay, and revise the implementation to avoid assuming that timestamps found in WAL records can meaningfully be compared to clock time on the standby server. Instead, the delay limits are compared to the elapsed time since we last obtained a new WAL segment from archive or since we were last "caught up" to WAL data arriving via streaming replication. This avoids problems with clock skew between primary and standby, as well as other corner cases that the original coding would misbehave in, such as the primary server having significant idle time between transactions. Per my complaint some time ago and considerable ensuing discussion. Do some desultory editing on the hot standby documentation, too.	2010-07-03 20:43:58 +00:00
Robert Haas	bb0fe9feb9	Move copydir.c from src/port to src/backend/storage/file. The previous commit to make copydir() interruptible prevented postgres.exe from linking on MinGW and Cygwin, because on those platforms libpgport_srv.a can't freely reference symbols defined by the backend. Since that code is already backend-specific anyway, just move the whole file into the backend rather than adding further kludges to deal with the symbols needed by CHECK_FOR_INTERRUPTS(). This probably needs some further cleanup, but this commit just moves the file as-is, which should hopefully be enough to turn the buildfarm green again.	2010-07-02 17:03:30 +00:00
Itagaki Takahiro	9e3cd37576	Remove max_standby_delay message from ps display of recovery process in waiting status. The parameter is not so interesting in ps display because it is referable in postgresql.conf.	2010-06-14 00:49:24 +00:00
Simon Riggs	f9dbac9476	HS Defer buffer pin deadlock check until deadlock_timeout has expired. During Hot Standby we need to check for buffer pin deadlocks when the Startup process begins to wait, in case it never wakes up again. We previously made the deadlock check immediately on the basis it was cheap, though clearer thinking and prima facie evidence shows that was too simple. Refactor existing code to make it easy to add in deferral of deadlock check until deadlock_timeout allowing a good reduction in deadlock checks since far fewer buffer pins are held for that duration. It's worth doing anyway, though major goal is to prevent further reports of context switching with high numbers of users on occasional tests.	2010-05-26 19:52:52 +00:00
Simon Riggs	fd34374b17	Add many new Asserts in code and fix simple bug that slipped through without them, related to previous commit. Report by Bruce Momjian.	2010-05-14 07:11:49 +00:00
Simon Riggs	8431e296ea	Cleanup initialization of Hot Standby. Clarify working with reanalysis of requirements and documentation on LogStandbySnapshot(). Fixes two minor bugs reported by Tom Lane that would lead to an incorrect snapshot after transaction wraparound. Also fix two other problems discovered that would give incorrect snapshots in certain cases. ProcArrayApplyRecoveryInfo() substantially rewritten. Some minor refactoring of xact_redo_apply() and ExpireTreeKnownAssignedTransactionIds().	2010-05-13 11:15:38 +00:00
Tom Lane	f9ed327f76	Clean up some awkward, inaccurate, and inefficient processing around MaxStandbyDelay. Use the GUC units mechanism for the value, and choose more appropriate timestamp functions for performing tests with it. Make the ps_activity manipulation in ResolveRecoveryConflictWithVirtualXIDs have behavior similar to ps_activity code elsewhere, notably not updating the display when update_process_title is off and not truncating the display contents at an arbitrarily-chosen length. Improve the docs to be explicit about what MaxStandbyDelay actually measures, viz the difference between primary and standby servers' clocks, and the possible hazards if their clocks aren't in sync.	2010-05-02 02:10:33 +00:00
Tom Lane	f0488bd57c	Rename the parameter recovery_connections to hot_standby, to reduce possible confusion with streaming-replication settings. Also, change its default value to "off", because of concern about executing new and poorly-tested code during ordinary non-replicating operation. Per discussion. In passing do some minor editing of related documentation.	2010-04-29 21:36:19 +00:00
Tom Lane	77acab75df	Modify ShmemInitStruct and ShmemInitHash to throw errors internally, rather than returning NULL for some-but-not-all failures as they used to. Remove now-redundant tests for NULL from call sites. We had to do something about this because many call sites were failing to check for NULL; and changing it like this seems a lot more useful and mistake-proof than adding checks to the call sites without them.	2010-04-28 16:54:16 +00:00
Heikki Linnakangas	9b8a73326e	Introduce wal_level GUC to explicitly control if information needed for archival or hot standby should be WAL-logged, instead of deducing that from other options like archive_mode. This replaces recovery_connections GUC in the primary, where it now has no effect, but it's still used in the standby to enable/disable hot standby. Remove the WAL-logging of "unlogged operations", like creating an index without WAL-logging and fsyncing it at the end. Instead, we keep a copy of the wal_level setting and the settings that affect how much shared memory a hot standby server needs to track master transactions (max_connections, max_prepared_xacts, max_locks_per_xact) in pg_control. Whenever the settings change, at server restart, write a WAL record noting the new settings and update pg_control. This allows us to notice the change in those settings in the standby at the right moment; they used to be included in checkpoint records, but that meant that a changed value was not reflected in the standby until the first checkpoint after the change. Bump PG_CONTROL_VERSION and XLOG_PAGE_MAGIC. Whack XLOG_PAGE_MAGIC back to the sequence it used to follow, before hot standby and subsequent patches changed it to 0x9003.	2010-04-28 16:10:43 +00:00
Tom Lane	2871b4618a	Replace the KnownAssignedXids hash table with a sorted-array data structure, and be more tense about the locking requirements for it, to improve performance in Hot Standby mode. In passing fix a few bugs and improve a number of comments in the existing HS code. Simon Riggs, with some editorialization by Tom	2010-04-28 00:09:05 +00:00
Robert Haas	33980a0640	Fix various instances of "the the". Two of these were pointed out by Erik Rijkers; the rest I found.	2010-04-23 23:21:44 +00:00
Simon Riggs	a2555571fb	Optimise btree delete processing when no active backends. Clarify comments, downgrade a message to DEBUG and remove some debug counters. Direct from ideas by Heikki Linnakangas.	2010-04-22 08:04:25 +00:00
Simon Riggs	0192abc4d7	Relax locking during GetCurrentVirtualXIDs(). Earlier improvements to handling of btree delete records mean that all snapshot conflicts on standby now have a valid, useful latestRemovedXid. Our earlier approach using LW_EXCLUSIVE was useful when we didn't always have a valid value, though is no longer useful or necessary. Asserts added to code path to prove and ensure this is the case. This will reduce contention and improve performance of larger Hot Standby servers.	2010-04-21 19:08:14 +00:00
Simon Riggs	7bc76d51fb	Check RecoveryInProgress() while holding ProcArrayLock during snapshots. This prevents a rare, yet possible race condition at the exact moment of transition from recovery to normal running.	2010-04-19 18:03:38 +00:00
Simon Riggs	21d6a6a128	Tune GetSnapshotData() during Hot Standby by avoiding loop through normal backends. Makes code clearer also, since we avoid various Assert()s. Performance of snapshots taken during recovery no longer depends upon number of read-only backends.	2010-04-18 18:06:07 +00:00
Simon Riggs	19c7a59b56	Change some debug ereports to elogs, as requested by translation team.	2010-04-06 10:50:57 +00:00
Peter Eisentraut	c248d17120	Message tuning	2010-03-21 00:17:59 +00:00
Tom Lane	f784f05e95	Clear error_context_stack and debug_query_string at the beginning of proc_exit, so that we won't try to attach any context printouts to messages that get emitted while exiting. Per report from Dennis Koegel, the context functions won't necessarily work after we've started shutting down the backend, and it seems possible that debug_query_string could be pointing at freed storage as well. The context information doesn't seem particularly relevant to such messages anyway, so there's little lost by suppressing it. Back-patch to all supported branches. I can only demonstrate a crash with log_disconnections messages back to 8.1, but the risk seems real in 8.0 and before anyway.	2010-03-20 00:58:09 +00:00
Heikki Linnakangas	e0f9e2b648	Fix bug in KnownAssignedXidsMany(). I saw this when looking at the assertion failure reported by Erik Rijkers, but this alone doesn't explain the failure.	2010-03-11 09:26:59 +00:00
Heikki Linnakangas	daaeac88aa	Fix comment which was apparently copy-pasted from another function.	2010-03-11 09:10:25 +00:00
Bruce Momjian	65e806cba1	pgindent run for 9.0	2010-02-26 02:01:40 +00:00
Tom Lane	e9a383303c	Adjust pg_fsync_writethrough so that it will set errno when failing on a platform that doesn't support this operation. The former coding would allow an unrelated errno to be reported, which would be quite misleading. Not sure if this has anything to do with the current buildfarm failures, but it's certainly bogus as-is.	2010-02-22 15:26:14 +00:00
Tom Lane	d1e027221d	Replace the pg_listener-based LISTEN/NOTIFY mechanism with an in-memory queue. In addition, add support for a "payload" string to be passed along with each notify event. This implementation should be significantly more efficient than the old one, and is also more compatible with Hot Standby usage. There is not yet any facility for HS slaves to receive notifications generated on the master, although such a thing is possible in future. Joachim Wieland, reviewed by Jeff Davis; also hacked on by me.	2010-02-16 22:34:57 +00:00
Greg Stark	f8c183a1ac	Speed up CREATE DATABASE by deferring the fsyncs until after copying all the data and using posix_fadvise to nudge the OS into flushing it earlier. This also hopefully makes CREATE DATABASE avoid spamming the cache. Tests show a big speedup on Linux at least on some filesystems. Idea and patch from Andres Freund.	2010-02-15 00:50:57 +00:00
Simon Riggs	8eccf7614b	Improvements to ps message of startup process during Hot Standby. Message is reset earlier and potential bug avoided. Andres Freund	2010-02-13 16:29:38 +00:00
Simon Riggs	b95a720a48	Re-enable max_standby_delay = -1 using deadlock detection on startup process. If startup waits on a buffer pin we send a request to all backends to cancel themselves if they are holding the buffer pin required and they are also waiting on a lock. If not, startup waits until max_standby_delay before cancelling any backend waiting for the requested buffer pin.	2010-02-13 01:32:20 +00:00
Simon Riggs	5cbf6dceea	Fix typo bug in Hot Standby from recent refactoring. Bug introduced into code recently patched by Andres Freund, so quickly fixed by him when bug report from Tatsuo Ishii arrived.	2010-02-11 19:35:22 +00:00
Tom Lane	cbe9d6beb4	Fix up rickety handling of relation-truncation interlocks. Move rd_targblock, rd_fsm_nblocks, and rd_vm_nblocks from relcache to the smgr relation entries, so that they will get reset to InvalidBlockNumber whenever an smgr-level flush happens. Because we now send smgr invalidation messages immediately (not at end of transaction) when a relation truncation occurs, this ensures that other backends will reset their values before they next access the relation. We no longer need the unreliable assumption that a VACUUM that's doing a truncation will hold its AccessExclusive lock until commit --- in fact, we can intentionally release that lock as soon as we've completed the truncation. This patch therefore reverts (most of) Alvaro's patch of 2009-11-10, as well as my marginal hacking on it yesterday. We can also get rid of assorted no-longer-needed relcache flushes, which are far more expensive than an smgr flush because they kill a lot more state. In passing this patch fixes smgr_redo's failure to perform visibility-map truncation, and cleans up some rather dubious assumptions in freespace.c and visibilitymap.c about when rd_fsm_nblocks and rd_vm_nblocks can be out of date.	2010-02-09 21:43:30 +00:00
Tom Lane	16e5859cd2	Allow free space map vacuuming to be interrupted.	2010-02-09 00:28:57 +00:00
Tom Lane	0a469c8769	Remove old-style VACUUM FULL (which was known for a little while as VACUUM FULL INPLACE), along with a boatload of subsidiary code and complexity. Per discussion, the use case for this method of vacuuming is no longer large enough to justify maintaining it; not to mention that we don't wish to invest the work that would be needed to make it play nicely with Hot Standby. Aside from the code directly related to old-style VACUUM FULL, this commit removes support for certain WAL record types that could only be generated within VACUUM FULL, redirect-pointer removal in heap_page_prune, and nontransactional generation of cache invalidation sinval messages (the last being the sticking point for Hot Standby). We still have to retain all code that copes with finding HEAP_MOVED_OFF and HEAP_MOVED_IN flag bits on existing tuples. This can't be removed as long as we want to support in-place update from pre-9.0 databases.	2010-02-08 04:33:55 +00:00
Tom Lane	70a2b05a59	Assorted cleanups in preparation for using a map file to support altering the relfilenode of currently-not-relocatable system catalogs. 1. Get rid of inval.c's dependency on relfilenode, by not having it emit smgr invalidations as a result of relcache flushes. Instead, smgr sinval messages are sent directly from smgr.c when an actual relation delete or truncate is done. This makes considerably more structural sense and allows elimination of a large number of useless smgr inval messages that were formerly sent even in cases where nothing was changing at the physical-relation level. Note that this reintroduces the concept of nontransactional inval messages, but that's okay --- because the messages are sent by smgr.c, they will be sent in Hot Standby slaves, just from a lower logical level than before. 2. Move setNewRelfilenode out of catalog/index.c, where it never logically belonged, into relcache.c; which is a somewhat debatable choice as well but better than before. (I considered catalog/storage.c, but that seemed too low level.) Rename to RelationSetNewRelfilenode. 3. Cosmetic cleanups of some other relfilenode manipulations.	2010-02-03 01:14:17 +00:00
Tom Lane	ab7c49c988	Fix assorted poorly-thought-out message strings: use %u not %d for printing OIDs, avoid random line breaks in strings somebody might grep for.	2010-02-02 22:01:53 +00:00
Simon Riggs	c85c941470	Detect early deadlock in Hot Standby when Startup is already waiting. First stage of required deadlock detection to allow re-enabling max_standby_delay setting of -1, which is now essential in the absence of improved relation-specific conflict resolution. Requested by Greg Stark et al.	2010-01-31 19:01:11 +00:00
Simon Riggs	29eedd3122	Adjust GetLockConflicts() so that it uses TopMemoryContext when executed InHotStandby. Cleaner solution than using malloc or palloc depending upon situation, as proposed by Tom.	2010-01-29 19:45:12 +00:00
Simon Riggs	76be0c81cc	Filter recovery conflicts based upon dboid from relfilenode of WAL records for heap and btree. Minor change, mostly API changes to pass through the required values. This is a simple change though also provides the refactoring required for further enhancements to conflict processing using the relOid. Changes only have effect during Hot Standby.	2010-01-29 17:10:05 +00:00
Simon Riggs	bcd8528f00	Use malloc() in GetLockConflicts() when called InHotStandby to avoid repeated palloc calls. Current code assumed this was already true, so this is a bug fix.	2010-01-28 10:05:37 +00:00
Simon Riggs	959ac58c04	In HS, Startup process sets SIGALRM when waiting for buffer pin. If woken by alarm we send SIGUSR1 to all backends requesting that they check to see if they are blocking Startup process. If so, they throw ERROR/FATAL as for other conflict resolutions. Deadlock stop gap removed. max_standby_delay = -1 option removed to prevent deadlock.	2010-01-23 16:37:12 +00:00
Simon Riggs	58565d78db	Better internal documentation of locking for Hot Standby conflict resolution. Discuss the reasons for the lock type we hold on ProcArrayLock while deriving the conflict list. Cover the idea of false positive conflicts and seemingly strange effects on snapshot derivation.	2010-01-21 00:53:58 +00:00
Tom Lane	e319e6799a	Fix bogus initialization of KnownAssignedXids shared memory state --- didn't work in EXEC_BACKEND case.	2010-01-16 17:17:26 +00:00
Simon Riggs	2edc31c439	Message mentions msec when it should be seconds, so use s instead of ms. Noticed by Andres Freund	2010-01-16 10:13:04 +00:00
Simon Riggs	a8ce974cdd	Teach standby conflict resolution to use SIGUSR1 Conflict reason is passed through directly to the backend, so we can take decisions about the effect of the conflict based upon the local state. No specific changes, as yet, though this prepares for later work. CancelVirtualTransaction() sends signals while holding ProcArrayLock. Introduce errdetail_abort() to give message detail explaining that the abort was caused by conflict processing. Remove CONFLICT_MODE states in favour of using PROCSIG_RECOVERY_CONFLICT states directly, for clarity.	2010-01-16 10:05:59 +00:00
Heikki Linnakangas	40f908bdcd	Introduce Streaming Replication. This includes two new kinds of postmaster processes, walsenders and walreceiver. Walreceiver is responsible for connecting to the primary server and streaming WAL to disk, while walsender runs in the primary server and streams WAL from disk to the client. Documentation still needs work, but the basics are there. We will probably pull the replication section to a new chapter later on, as well as the sections describing file-based replication. But let's do that as a separate patch, so that it's easier to see what has been added/changed. This patch also adds a new section to the chapter about FE/BE protocol, documenting the protocol used by walsender/walreceiver. Bump catalog version because of two new functions, pg_last_xlog_receive_location() and pg_last_xlog_replay_location(), for monitoring the progress of replication. Fujii Masao, with additional hacking by me	2010-01-15 09:19:10 +00:00
Simon Riggs	e99767bc28	First part of refactoring of code for ResolveRecoveryConflict. Purposes of this are to centralise the conflict code to allow further change, as well as to allow passing through the full reason for the conflict through to the conflicting backends. Backend state alters how we can handle different types of conflict so this is now required. As originally suggested by Heikki, no longer optional.	2010-01-14 11:08:02 +00:00
Bruce Momjian	228170410d	Place tablespace directories in their own subdirectory so pg_migrator can upgrade clusters without renaming the tablespace directories. New directory structure format is, e.g.: $PGDATA/pg_tblspc/20981/PG_8.5_201001061/719849/83292814	2010-01-12 02:42:52 +00:00
Simon Riggs	3bfcccc295	During Hot Standby, fix drop database when sessions idle. Previously we only cancelled sessions that were in-transaction. Simple fix is to just cancel all sessions without waiting. Doing it this way avoids complicating common code paths, which would not be worth the trouble to cover this rare case. Problem report and fix by Andres Freund, edited somewhat by me	2010-01-10 15:44:28 +00:00
Bruce Momjian	0239800893	Update copyright for the year 2010.	2010-01-02 16:58:17 +00:00
Tom Lane	bd8a35655b	Suppress compiler warning (pid_t isn't int everywhere)	2009-12-31 22:07:36 +00:00
Tom Lane	b4594a66ba	Add missing 'static' tag.	2009-12-31 21:47:12 +00:00
Tom Lane	85d02a6586	Redefine Datum as uintptr_t, instead of unsigned long. This is more in keeping with modern practice, and is a first step towards porting to Win64 (which has sizeof(pointer) > sizeof(long)). Tsutomu Yamada, Magnus Hagander, Tom Lane	2009-12-31 19:41:37 +00:00
Simon Riggs	efc16ea520	Allow read only connections during recovery, known as Hot Standby. Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record. New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far. This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required. Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit. Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.	2009-12-19 01:32:45 +00:00
Robert Haas	cddca5ec13	Add an EXPLAIN (BUFFERS) option to show buffer-usage statistics. This patch also removes buffer-usage statistics from the track_counts output, since this (or the global server statistics) is deemed to be a better interface to this information. Itagaki Takahiro, reviewed by Euler Taveira de Oliveira.	2009-12-15 04:57:48 +00:00
Itagaki Takahiro	f1325ce213	Add large object access control. A new system catalog pg_largeobject_metadata manages ownership and access privileges of large objects. KaiGai Kohei, reviewed by Jaime Casanova.	2009-12-11 03:34:57 +00:00
Heikki Linnakangas	ab3148b712	Fix bug in temporary file management with subtransactions. A cursor opened in a subtransaction stays open even if the subtransaction is aborted, so any temporary files related to it must stay alive as well. With the patch, we use ResourceOwners to track open temporary files and don't automatically close them at subtransaction end (though in the normal case temporary files are registered with the subtransaction resource owner and will therefore be closed). At end of top transaction, we still check that there's no temporary files marked as close-at-end-of-transaction open, but that's now just a debugging cross-check as the resource owner cleanup should've closed them already.	2009-12-03 11:03:29 +00:00
Tom Lane	00e6a16d01	Change the autovacuum launcher to read pg_database directly, rather than via the "flat files" facility. This requires making it enough like a backend to be able to run transactions; it's no longer an "auxiliary process" but more like the autovacuum worker processes. Also, its signal handling has to be brought into line with backends/workers. In particular, since it now has to handle procsignal.c processing, the special autovac-launcher-only signal conditions are moved to SIGUSR2. Alvaro, with some cleanup from Tom	2009-08-31 19:41:00 +00:00
Tom Lane	04011cc970	Allow backends to start up without use of the flat-file copy of pg_database. To make this work in the base case, pg_database now has a nailed-in-cache relation descriptor that is initialized using hardwired knowledge in relcache.c. This means pg_database is added to the set of relations that need to have a Schema_pg_xxx macro maintained in pg_attribute.h. When this path is taken, we'll have to do a seqscan of pg_database to find the row we need. In the normal case, we are able to do an indexscan to find the database's row by name. This is made possible by storing a global relcache init file that describes only the shared catalogs and their indexes (and therefore is usable by all backends in any database). A new backend loads this cache file, finds its database OID after an indexscan on pg_database, and then loads the local relcache init file for that database. This change should effectively eliminate number of databases as a factor in backend startup time, even with large numbers of databases. However, the real reason for doing it is as a first step towards getting rid of the flat files altogether. There are still several other sub-projects to be tackled before that can happen.	2009-08-12 20:53:31 +00:00
Heikki Linnakangas	23dc89d2c3	Improve error messages in md.c. When a filesystem operation like open() or fsync() fails, say "file" rather than "relation" when printing the filename. This makes messages that display block numbers a bit confusing. For example, in message 'could not read block 150000 of file "base/1234/5678.1"', 150000 is the block number from the beginning of the relation, ie. segment 0, not 150000th block within that segment. Per discussion, users aren't usually interested in the exact location within the file, so we can live with that. To ease constructing error messages, add FilePathName(File) function to return the pathname of a virtual fd.	2009-08-05 18:01:54 +00:00
Tom Lane	2487d872e0	Create a multiplexing structure for signals to Postgres child processes. This patch gets us out from under the Unix limitation of two user-defined signal types. We already had done something similar for signals directed to the postmaster process; this adds multiplexing for signals directed to backends and auxiliary processes (so long as they're connected to shared memory). As proof of concept, replace the former usage of SIGUSR1 and SIGUSR2 for backends with use of the multiplexing mechanism. There are still some hard-wired definitions of SIGUSR1 and SIGUSR2 for other process types, but getting rid of those doesn't seem interesting at the moment. Fujii Masao	2009-07-31 20:26:23 +00:00
Tom Lane	8504905793	Fix a thinko introduced into CountActiveBackends by a recent patch: we should ignore NULL array entries, not non-NULL ones. This had the effect of disabling commit_delay, and could have caused a crash in the rare race condition the patch was intended to fix. Bug report and diagnosis by Jeff Janes, in bug #4952.	2009-07-29 15:57:11 +00:00
Tom Lane	2de48a83e6	Cleanup and code review for the patch that made bgwriter active during archive recovery. Invent a separate state variable and inquiry function for XLogInsertAllowed() to clarify some tests and make the management of writing the end-of-recovery checkpoint less klugy. Fix several places that were incorrectly testing InRecovery when they should be looking at RecoveryInProgress or XLogInsertAllowed (because they will now be executed in the bgwriter not startup process). Clarify handling of bad LSNs passed to XLogFlush during recovery. Use a spinlock for setting/testing SharedRecoveryInProgress. Improve quite a lot of comments. Heikki and Tom	2009-06-26 20:29:04 +00:00
Heikki Linnakangas	7e48b77b1c	Fix some serious bugs in archive recovery, now that bgwriter is active during it: When bgwriter is active, the startup process can't perform mdsync() correctly because it won't see the fsync requests accumulated in bgwriter's private pendingOpsTable. Therefore make bgwriter responsible for the end-of-recovery checkpoint as well, when it's active. When bgwriter is active (= archive recovery), the startup process must not accumulate fsync requests to its own pendingOpsTable, since bgwriter won't see them there when it performs restartpoints. Make startup process drop its pendingOpsTable when bgwriter is launched to avoid that. Update minimum recovery point one last time when leaving archive recovery. It won't be updated by the end-of-recovery checkpoint because XLogFlush() sees us as out of recovery already. This fixes bug #4879 reported by Fujii Masao.	2009-06-25 21:36:00 +00:00
Tom Lane	6382448cf9	For bulk write operations (eg COPY IN), use a ring buffer of 16MB instead of the 256KB limit originally enforced by a patch committed 2008-11-06. Per recent test results, the smaller size resulted in an undesirable decrease in bulk data loading speed, due to COPY processing frequently getting blocked for WAL flushing. This area might need more tweaking later, but this setting seems to be good enough for 8.4.	2009-06-22 20:04:28 +00:00
Bruce Momjian	d747140279	8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef list provided by Andrew.	2009-06-11 14:49:15 +00:00
Tom Lane	4616d57dad	Fix all the server-side SIGQUIT handlers (grumble ... why so many identical copies?) to ensure they really don't run proc_exit/shmem_exit callbacks, as was intended. I broke this behavior recently by installing atexit callbacks without thinking about the one case where we truly don't want to run those callback functions. Noted in an example from Dave Page.	2009-05-15 15:56:39 +00:00
Tom Lane	249a899f73	Install an atexit(2) callback that ensures that proc_exit's cleanup processing will still be performed if something in a backend process calls exit() directly, instead of going through proc_exit() as we prefer. This is a second response to the issue that we might load third-party code that doesn't know it should not call exit(). Such a call will now cause a reasonably graceful backend shutdown, if possible. (Of course, if the reason for the exit() call is out-of-memory or some such, we might not be able to recover, but at least we will try.)	2009-05-05 20:06:07 +00:00
Tom Lane	969d7cd431	Install a "dead man switch" to allow the postmaster to detect cases where a backend has done exit(0) or exit(1) without having disengaged itself from shared memory. We are at risk for this whenever third-party code is loaded into a backend, since such code might not know it's supposed to go through proc_exit() instead. Also, it is reported that under Windows there are ways to externally kill a process that cause the status code returned to the postmaster to be indistinguishable from a voluntary exit (thank you, Microsoft). If this does happen then the system is probably hosed --- for instance, the dead session might still be holding locks. So the best recovery method is to treat this like a backend crash. The dead man switch is armed for a particular child process when it acquires a regular PGPROC, and disarmed when the PGPROC is released; these should be the first and last touches of shared memory resources in a backend, or close enough anyway. This choice means there is no coverage for auxiliary processes, but I doubt we need that, since they shouldn't be executing any user-provided code anyway. This patch also improves the management of the EXEC_BACKEND ShmemBackendArray array a bit, by reducing search costs. Although this problem is of long standing, the lack of field complaints seems to mean it's not critical enough to risk back-patching; at least not till we get some more testing of this mechanism.	2009-05-05 19:59:00 +00:00
Tom Lane	c973051ae6	A session that does not have any live snapshots does not have to be waited for when we are waiting for old snapshots to go away during a concurrent index build. In particular, this rule lets us avoid waiting for idle-in-transaction sessions. This logic could be improved further if we had some way to wake up when the session we are currently waiting for goes idle-in-transaction. However that would be a significantly more complex/invasive patch, so it'll have to wait for some other day. Simon Riggs, with some improvements by Tom.	2009-04-04 17:40:36 +00:00
Tom Lane	1b2bb33a54	Add a comment documenting the question of whether PrefetchBuffer should try to protect an already-existing buffer from being evicted. This was left as an open issue when the posix_fadvise patch was committed. I'm not sure there's any evidence to justify more work in this area, but we should have some record about it in the source code.	2009-04-03 18:17:43 +00:00
Tom Lane	948d6ec90f	Modify the relcache to record the temp status of both local and nonlocal temp relations; this is no more expensive than before, now that we have pg_class.relistemp. Insert tests into bufmgr.c to prevent attempting to fetch pages from nonlocal temp relations. This provides a low-level defense against bugs-of-omission allowing temp pages to be loaded into shared buffers, as in the contrib/pgstattuple problem reported by Stuart Bishop. While at it, tweak a bunch of places to use new relcache tests (instead of expensive probes into pg_namespace) to detect local or nonlocal temp tables.	2009-03-31 22:12:48 +00:00
Heikki Linnakangas	eeeb782e60	Fix a rare race condition when commit_siblings > 0 and a transaction commits at the same instant as a new backend is spawned. Since CountActiveBackends() doesn't hold ProcArrayLock, it needs to be prepared for the case that a pointer at the end of the proc array is still NULL even though numProcs says it should be valid. Backpatch to 8.1. 8.0 and earlier had this right, but it was broken in the split of PGPROC and sinval shared memory arrays. Per report and proposal by Marko Kreen.	2009-03-31 05:18:33 +00:00
Tom Lane	471913a6a5	More fixes for 8.4 DTrace probes. Remove useless BUFFER_HIT/BUFFER_MISS probes --- the BUFFER_READ_DONE probe provides the same information and more besides. Expand the LOCK_WAIT_START/DONE probe arguments so that there's actually some chance of telling what is being waited for. Update and clean up the documentation.	2009-03-23 01:52:38 +00:00
Tom Lane	44023dc5f5	Add isExtend to the parameters of the buffer_read_start and buffer_read_done DTrace probes, so that ordinary reads can be distinguished from relation extension operations. Move buffer_read_start probe to before the smgrnblocks() call that's needed in the isExtend case, since really that step should be charged as part of the time needed for the extension operation. (This makes it slightly harder to match the read_start with the associated read_done, since now you can't match them on blockNumber, but it should still be possible since isExtend operations on the same relation can never be interleaved.) Per recent discussion. In passing, add the page identity (forkNum/blockNum) to the parameters of the buffer_flush_start/buffer_flush_done probes, which were unaccountably lacking the info.	2009-03-22 22:39:05 +00:00
Tom Lane	d287c9eff0	Restore previous ordering of BUFFER_FLUSH_START probe. I had wanted to make it include the time for the possible smgropen() call, but that results in a null pointer dereference :-(. An alternative solution would be to fetch the buffer tag instead of looking at *reln, but I'll just put it back as it was for the moment. BTW, this indicates that DTrace probes evaluate their arguments even when nominally inactive. What was that about "zero cost", again?	2009-03-13 17:46:21 +00:00
Tom Lane	e04810e8c4	Code review for dtrace probes added (so far) to 8.4. Adjust placement of some bufmgr probes, take out redundant and memory-leak-inducing path arguments to smgr__md__read__done and smgr__md__write__done, fix bogus attempt to recalculate space used in sort__done, clean up formatting in places where I'm not sure pgindent will do a nice job by itself.	2009-03-11 23:19:25 +00:00
Peter Eisentraut	9add9f95c3	Don't actively violate the system limit of maximum open files (RLIMIT_NOFILE). This avoids irritating kernel logs (if system overstep violations are enabled) and also the grsecurity alert when starting PostgreSQL. original patch by Jacek Drobiecki References: http://archives.postgresql.org/pgsql-bugs/2004-05/msg00103.php http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=248967	2009-03-04 09:12:49 +00:00
Heikki Linnakangas	cdd46c7654	Start background writer during archive recovery. Background writer now performs its usual buffer cleaning duties during archive recovery, and it's responsible for performing restartpoints. This requires some changes in postmaster. When the startup process has done all the initialization and is ready to start WAL redo, it signals the postmaster to launch the background writer. The postmaster is signaled again when the point in recovery is reached where we know that the database is in consistent state. Postmaster isn't interested in that at the moment, but that's the point where we could let other backends in to perform read-only queries. The postmaster is signaled third time when the recovery has ended, so that postmaster knows that it's safe to start accepting connections. The startup process now traps SIGTERM, and performs a "clean" shutdown. If you do a fast shutdown during recovery, a shutdown restartpoint is performed, like a shutdown checkpoint, and postmaster kills the processes cleanly. You still have to continue the recovery at next startup, though. Currently, the background writer is only launched during archive recovery. We could launch it during crash recovery as well, but it seems better to keep that codepath as simple as possible, for the sake of robustness. And it couldn't do any restartpoints during crash recovery anyway, so it wouldn't be that useful. log_restartpoints is gone. Use log_checkpoints instead. This is yet to be documented. This whole operation is a pre-requisite for Hot Standby, but has some value of its own whether the hot standby patch makes 8.4 or not. Simon Riggs, with lots of modifications by me.	2009-02-18 15:58:41 +00:00
Heikki Linnakangas	b2a667b9ee	Add a new option to RestoreBkpBlocks() to indicate if a cleanup lock should be used instead of the normal exclusive lock, and make WAL redo functions responsible for calling RestoreBkpBlocks(). They know better what kind of a lock they need. At the moment, this just moves things around with no functional change, but makes the hot standby patch that's under review cleaner.	2009-01-20 18:59:37 +00:00
Tom Lane	b7b8f0b609	Implement prefetching via posix_fadvise() for bitmap index scans. A new GUC variable effective_io_concurrency controls how many concurrent block prefetch requests will be issued. (The best way to handle this for plain index scans is still under debate, so that part is not applied yet --- tgl) Greg Stark	2009-01-12 05:10:45 +00:00
Tom Lane	dad75a62bf	Create a "shmem_startup_hook" to be called at the end of shared memory initialization, to give loadable modules a reasonable place to perform creation of any shared memory areas they need. This is the logical conclusion of our previous creation of RequestAddinShmemSpace() and RequestAddinLWLocks(). We don't need an explicit shmem_shutdown_hook, because the existing on_shmem_exit and on_proc_exit mechanisms serve that need. Also, adjust SubPostmasterMain so that libraries that got loaded into the postmaster will be loaded into all child processes, not only regular backends. This improves consistency with the non-EXEC_BACKEND behavior, and might be necessary for functionality for some types of add-ons.	2009-01-03 17:08:39 +00:00
Bruce Momjian	511db38ace	Update copyright for 2009.	2009-01-01 17:24:05 +00:00
Bruce Momjian	5a90bc1fbe	The attached patch contains a couple of fixes in the existing probes and includes a few new ones. - Fixed compilation errors on OS X for probes that use typedefs - Fixed a number of probes to pass ForkNumber per the relation forks patch - The new probes are those that were taken out from the previous submitted patch and required simple fixes. Will submit the other probes that may require more discussion in a separate patch. Robert Lor	2008-12-17 01:39:04 +00:00
Tom Lane	55368223cd	Tweak the tree descent loop in fsm_search_avail to not look at the right child if it doesn't need to. This saves some minuscule number of cycles, but the ulterior motive is to avoid an optimization bug known to exist in SCO's C compiler (and perhaps others?)	2008-12-10 17:11:18 +00:00
Heikki Linnakangas	dea81a6cf6	Revert SIGUSR1 multiplexing patch, per Tom's objection.	2008-12-09 15:59:39 +00:00
Heikki Linnakangas	7b05b3fa39	Provide support for multiplexing SIGUSR1 signal. The upcoming synchronous replication patch needs a signal, but we've already used SIGUSR1 and SIGUSR2 in normal backends. This patch allows reusing SIGUSR1 for that, and for other purposes too if the need arises.	2008-12-09 14:28:20 +00:00
Alvaro Herrera	7b640b0345	Fix a couple of snapshot management bugs in the new ResourceOwner world: non-writable large objects need to have their snapshots registered on the transaction resowner, not the current portal's, because it must persist until the large object is closed (which the portal does not). Also, ensure that the serializable snapshot is recorded by the transaction resource owner too, even when a subtransaction has changed the current resource owner before serializable is taken. Per bug reports from Pavan Deolasee.	2008-12-04 14:51:02 +00:00
Heikki Linnakangas	011fa3662e	Small comment fixes.	2008-12-03 12:22:53 +00:00
Heikki Linnakangas	4d6ee26171	Don't force creation of the FSM on searches. It will still be created as soon as the first page fills up, and is marked as (almost) full, though.	2008-11-27 13:32:26 +00:00
Heikki Linnakangas	58bece7a60	Fix #ifdeffed debugging code to work with relation forks.	2008-11-27 07:38:01 +00:00
Heikki Linnakangas	9858a8c81c	Rely on relcache invalidation to update the cached size of the FSM.	2008-11-26 17:08:58 +00:00
Heikki Linnakangas	3396000684	Rethink the way FSM truncation works. Instead of WAL-logging FSM truncations in FSM code, call FreeSpaceMapTruncateRel from smgr_redo. To make that cleaner from modularity point of view, move the WAL-logging one level up to RelationTruncate, and move RelationTruncate and all the related WAL-logging to new src/backend/catalog/storage.c file. Introduce new RelationCreateStorage and RelationDropStorage functions that are used instead of calling smgrcreate/smgrscheduleunlink directly. Move the pending rel deletion stuff from smgrcreate/smgrscheduleunlink to the new functions. This leaves smgr.c as a thin wrapper around md.c; all the transactional stuff is now in storage.c. This will make it easier to add new forks with similar truncation logic, like the visibility map.	2008-11-19 10:34:52 +00:00
Heikki Linnakangas	f06b7604ca	Fix oversight in previous error-reporting patch; mustn't pfree path string before passing it to elog.	2008-11-14 11:09:50 +00:00
Tom Lane	cad3a26a95	Fix sloppy omission of now-required #include's.	2008-11-11 14:17:02 +00:00
Heikki Linnakangas	7e8b0b9ab1	Change error messages to print the physical path, like "base/11517/3767_fsm", instead of symbolic names like "1663/11517/3767/1", per Alvaro's suggestion. I didn't change the messages in the higher-level index, heap and FSM routines, though, where the fork is implicit.	2008-11-11 13:19:16 +00:00
Tom Lane	6517f377d6	Implement ALTER DATABASE SET TABLESPACE to move a whole database (or at least as much of it as lives in its default tablespace) to a new tablespace. Guillaume Lelarge, with some help from Bernd Helmle and Tom Lane	2008-11-07 18:25:07 +00:00
Tom Lane	85e2cedf98	Improve bulk-insert performance by keeping the current target buffer pinned (but not locked, as that would risk deadlocks). Also, make it work in a small ring of buffers to avoid having bulk inserts trash the whole buffer arena. Robert Haas, after an idea of Simon Riggs'.	2008-11-06 20:51:15 +00:00
Tom Lane	b4eae023bb	Clean up the messy semantics (not to mention inefficiency) of PageGetTempPage by splitting it into three functions with better-defined behaviors. Zdenek Kotala	2008-11-03 20:47:49 +00:00
Tom Lane	d7112cfa88	Remove the last vestiges of the MAKE_PTR/MAKE_OFFSET mechanism. We haven't allowed different processes to have different addresses for the shmem segment in quite a long time, but there were still a few places left that used the old coding convention. Clean them up to reduce confusion and improve the compiler's ability to detect pointer type mismatches. Kris Jurka	2008-11-02 21:24:52 +00:00
Tom Lane	902d1cb35f	Remove all uses of the deprecated functions heap_formtuple, heap_modifytuple, and heap_deformtuple in favor of the newer functions heap_form_tuple et al (which do the same things but use bool control flags instead of arbitrary char values). Eliminate the former duplicate coding of these functions, reducing the deprecated functions to mere wrappers around the newer ones. We can't get rid of them entirely because add-on modules probably still contain many instances of the old coding style. Kris Jurka	2008-11-02 01:45:28 +00:00
Heikki Linnakangas	e9816533e3	Update FSM on WAL replay. This is a bit limited; the FSM is only updated on non-full-page-image WAL records, and quite arbitrarily, only if there's less than 20% free space on the page after the insert/update (not on HOT updates, though). The 20% cutoff should avoid most of the overhead, when replaying a bulk insertion, for example, while ensuring that pages that are full are marked as full in the FSM. This is mostly to avoid the nasty worst case scenario, where you replay from a PITR archive, and the FSM information in the base backup is really out of date. If there were a lot of pages that the outdated FSM claims have free space, but that don't actually have any, the first unlucky inserter after the recovery would traverse through all those pages, just to find out that they're full. We didn't have this problem with the old FSM implementation, because we simply threw the FSM information away on a non-clean shutdown.	2008-10-31 19:40:27 +00:00
Heikki Linnakangas	19c8dc839b	Unite ReadBufferWithFork, ReadBufferWithStrategy, and ZeroOrReadBuffer functions into one ReadBufferExtended function, that takes the strategy and mode as argument. There are three modes: RBM_NORMAL, which is the default used by plain ReadBuffer(), RBM_ZERO, which replaces ZeroOrReadBuffer, and a new mode RBM_ZERO_ON_ERROR, which allows callers to read corrupt pages without throwing an error. The FSM needs the new mode to recover from corrupt pages, which could happen if we crash after extending an FSM file, and the new page is "torn". Add fork number to some error messages in bufmgr.c, that still lacked it.	2008-10-31 15:05:00 +00:00
Alvaro Herrera	089ae3bc9a	Properly access a buffer's LSN using existing access macros instead of abusing knowledge of page layout. Stolen from Jonah Harris' CRC patch	2008-10-20 21:11:15 +00:00
Tom Lane	dd4c165bc3	Improve some of the comments in fsmpage.c.	2008-10-07 21:10:11 +00:00
Heikki Linnakangas	89f373bf5b	Index FSMs needs to be vacuumed as well. Report by Jeff Davis.	2008-10-06 08:04:11 +00:00
Tom Lane	68827a7ada	Suppress an uninitialized-variable warning (not all versions of gcc complain here, but some do)	2008-10-01 14:59:23 +00:00
Heikki Linnakangas	f06ef2bede	Fix WAL redo of FSM truncation. We can't call smgrtruncate() during WAL replay, because it tries to XLogInsert().	2008-10-01 08:12:14 +00:00
Tom Lane	6ca1b1cd95	Fix compiler warning (unportable sprintf usage)	2008-09-30 14:15:58 +00:00
Heikki Linnakangas	15c121b3ed	Rewrite the FSM. Instead of relying on a fixed-size shared memory segment, the free space information is stored in a dedicated FSM relation fork, with each relation (except for hash indexes; they don't use FSM). This eliminates the max_fsm_relations and max_fsm_pages GUC options; remove any trace of them from the backend, initdb, and documentation. Rewrite contrib/pg_freespacemap to match the new FSM implementation. Also introduce a new variant of the get_raw_page(regclass, int4, int4) function in contrib/pageinspect that lets you return pages from any relation fork, and a new fsm_page_contents() function to inspect the new FSM pages.	2008-09-30 10:52:14 +00:00
Alvaro Herrera	5817d861e9	Optimize CleanupTempFiles by having a boolean flag that keeps track of whether there are FD_XACT_TEMPORARY files to clean up at transaction end. Per performance profiling results on AWeber's huge systems. Patch by me after an idea suggested by Simon Riggs.	2008-09-19 04:57:10 +00:00
Tom Lane	35c2a3c3cf	Allow ShowBufferUsage() to report the number of reads/writes that have occurred to temporary files. This replaces the unused NDirectFileRead/NDirectFileWrite counters. Itagaki Takahiro	2008-09-17 13:15:55 +00:00
Heikki Linnakangas	3f0e808c4a	Introduce the concept of relation forks. An smgr relation can now consist of multiple forks, and each fork can be created and grown separately. The bulk of this patch is about changing the smgr API to include an extra ForkNumber argument in every smgr function. Also, smgrscheduleunlink and smgrdounlink no longer implicitly call smgrclose, because other forks might still exist after unlinking one. The callers of those functions have been modified to call smgrclose instead. This patch in itself doesn't have any user-visible effect, but provides the infrastructure needed for upcoming patches. The additional forks envisioned are a rewritten FSM implementation that doesn't rely on a fixed-size shared memory block, and a visibility map to allow skipping portions of a table in VACUUM that have no dead tuples.	2008-08-11 11:05:11 +00:00
Tom Lane	d8b04d5fac	In ReadOrZeroBuffer (and related entry points), don't bother to call PageHeaderIsValid when we zero the buffer instead of reading the page in. The actual performance improvement is probably marginal since this function isn't very heavily used, but a cycle saved is a cycle earned. Zdenek Kotala	2008-08-05 15:09:04 +00:00
Tom Lane	4abd7b49f1	Improve CREATE/DROP/RENAME DATABASE so that when failing because the source or target database is being accessed by other users, it tells you whether the "other users" are live sessions or uncommitted prepared transactions. (Indeed, it tells you exactly how many of each, but that's mostly just because it was easy to do so.) This should help forestall the gotcha of not realizing that a prepared transaction is what's blocking the command. Per discussion.	2008-08-04 18:03:46 +00:00
Alvaro Herrera	e36e6b1cab	Add a few more DTrace probes to the backend. Robert Lor	2008-08-01 13:16:09 +00:00
Tom Lane	dc02a4814a	Fix a race condition that I introduced into sinvaladt.c during the recent rewrite. When called from SIInsertDataEntries, SICleanupQueue releases the write lock if it has to issue a kill() to signal some laggard backend. That still seems like a good idea --- but it's possible that by the time we get the lock back, there are no longer enough free message slots to satisfy SIInsertDataEntries' requirement. Must recheck, and repeat the whole SICleanupQueue process if not. Noted while reading code.	2008-07-18 14:45:48 +00:00
Tom Lane	6816577a78	Change the PageGetContents() macro to guarantee its result is maxalign'd, thereby forestalling any problems with alignment of the data structure placed there. Since SizeOfPageHeaderData is maxalign'd anyway in 8.3 and HEAD, this does not actually change anything right now, but it is foreseeable that the header size will change again someday. I had to fix a couple of places that were assuming that the content offset is just SizeOfPageHeaderData rather than MAXALIGN(SizeOfPageHeaderData). Per discussion of Zdenek's page-macros patch.	2008-07-13 21:50:04 +00:00
Tom Lane	9d035f4254	Clean up the use of some page-header-access macros: principally, use SizeOfPageHeaderData instead of sizeof(PageHeaderData) in places where that makes the code clearer, and avoid casting between Page and PageHeader where possible. Zdenek Kotala, with some additional cleanup by Heikki Linnakangas. I did not apply the parts of the proposed patch that would have resulted in slightly changing the on-disk format of hash indexes; it seems to me that's not a win as long as there's any chance of having in-place upgrade for 8.4.	2008-07-13 20:45:47 +00:00
Alvaro Herrera	110147653a	Make sure we only try to free snapshots that have been passed through CopySnapshot, per Neil Conway. Also add a comment about the assumption in GetSnapshotData that the argument is statically allocated. Also, fix some more typos in comments in snapmgr.c.	2008-07-11 02:10:14 +00:00
Tom Lane	5b965bf08b	Teach autovacuum how to determine whether a temp table belongs to a crashed backend. If so, send a LOG message to the postmaster log, and if the table is beyond the vacuum-for-wraparound horizon, forcibly drop it. Per recent discussions. Perhaps we ought to back-patch this, but it probably needs to age a bit in HEAD first.	2008-07-01 02:09:34 +00:00
Tom Lane	dab421d2f0	Seems I was too optimistic in supposing that sinval's maxMsgNum could be read and written without a lock. The value itself is atomic, sure, but on processors with weak memory ordering it's possible for a reader to see the value change before it sees the associated message written into the buffer array. Fix by introducing a spinlock that's used just to read and write maxMsgNum. (We could do this with less overhead if we recognized a concept of "memory access barrier"; is it worth introducing such a thing? At the moment probably not --- I can't measure any clear slowdown from adding the spinlock, so this solution is probably fine.) Per buildfarm results.	2008-06-20 00:24:53 +00:00
Tom Lane	fad153ec45	Rewrite the sinval messaging mechanism to reduce contention and avoid unnecessary cache resets. The major changes are: * When the queue overflows, we only issue a cache reset to the specific backend or backends that still haven't read the oldest message, rather than resetting everyone as in the original coding. * When we observe backend(s) falling well behind, we signal SIGUSR1 to only one backend, the one that is furthest behind and doesn't already have a signal outstanding for it. When it finishes catching up, it will in turn signal SIGUSR1 to the next-furthest-back guy, if there is one that is far enough behind to justify a signal. The PMSIGNAL_WAKEN_CHILDREN mechanism is removed. * We don't attempt to clean out dead messages after every message-receipt operation; rather, we do it on the insertion side, and only when the queue fullness passes certain thresholds. * Split SInvalLock into SInvalReadLock and SInvalWriteLock so that readers don't block writers nor vice versa (except during the infrequent queue cleanout operations). * Transfer multiple sinval messages for each acquisition of a read or write lock.	2008-06-19 21:32:56 +00:00
Alvaro Herrera	a3540b0f65	Improve our #include situation by moving pointer types away from the corresponding struct definitions. This allows other headers to avoid including certain highly-loaded headers such as rel.h and relscan.h, instead using just relcache.h, heapam.h or genam.h, which are more lightweight and thus cause less unnecessary dependencies.	2008-06-19 00:46:06 +00:00
Tom Lane	86fdb32bd0	Remove freeBackends counter from the sinval shared memory area. We used to use it to help enforce superuser_reserved_backends, but since 8.1 it's just been dead weight.	2008-06-17 20:07:08 +00:00
Heikki Linnakangas	a213f1ee6c	Refactor XLogOpenRelation() and XLogReadBuffer() in preparation for relation forks. XLogOpenRelation() and the associated light-weight relation cache in xlogutils.c is gone, and XLogReadBuffer() now takes a RelFileNode as argument, instead of Relation. For functions that still need a Relation struct during WAL replay, there's a new function called CreateFakeRelcacheEntry() that returns a fake entry like XLogOpenRelation() used to.	2008-06-12 09:12:31 +00:00
Neil Conway	8374246054	Further tweak for comment in CheckDeadLock(), per Tom.	2008-06-09 18:23:05 +00:00
Neil Conway	da80a4b97e	Fix typo in comment.	2008-06-09 06:55:34 +00:00
Alvaro Herrera	cc87402d6e	Move BufferGetPageSize and BufferGetPage from bufpage.h to bufmgr.h. It is more logical that way, and also it reduces the amount of unnecessary includes in bufpage.h, which is widely used. Zdenek Kotala. My previous patch to bufpage.h should also have credited him as author, but I forgot (sorry about that).	2008-06-08 22:00:48 +00:00
Bruce Momjian	d82a1d582c	This patch replaces offnum++ with OffsetNumberNext, to be consistent. OffsetNumberNext() has some casting that makes it useful. Fujii Masao	2008-05-13 15:44:08 +00:00
Alvaro Herrera	5da9da71c4	Improve snapshot manager by keeping explicit track of snapshots. There are two ways to track a snapshot: there's the "registered" list, which is used for arbitrary long-lived snapshots; and there's the "active stack", which is used for the snapshot that is considered "active" at any time. This also allows users of snapshots to stop worrying about snapshot memory allocation and freeing, and about using PG_TRY blocks around ActiveSnapshot assignment. This is all done automatically now. As a consequence, this allows us to reset MyProc->xmin when there are no more snapshots registered in the current backend, reducing the impact that long-running transactions have on VACUUM.	2008-05-12 20:02:02 +00:00
Alvaro Herrera	9084399782	Put back bufmgr.h in bufpage.h -- it is needed by some macros. Remove #include bufmgr.h from (most?) source files which already include bufpage.h.	2008-05-12 16:06:10 +00:00
Alvaro Herrera	f8c4d7db60	Restructure some header files a bit, in particular heapam.h, by removing some unnecessary #include lines in it. Also, move some tuple routine prototypes and macros to htup.h, which allows removal of heapam.h inclusion from some .c files. For this to work, a new header file access/sysattr.h needed to be created, initially containing attribute numbers of system columns, for pg_dump usage. While at it, make contrib ltree, intarray and hstore header files more consistent with our header style.	2008-05-12 00:00:54 +00:00
Tom Lane	3c6248a828	Remove the recently added USE_SEGMENTED_FILES option, and indeed remove all support for a nonsegmented mode from md.c. Per recent discussions, there doesn't seem to be much value in a "never segment" option as opposed to segmenting with a suitably large segment size. So instead provide a configure-time switch to set the desired segment size in units of gigabytes. While at it, expose a configure switch for BLCKSZ as well. Zdenek Kotala	2008-05-02 01:08:27 +00:00
Heikki Linnakangas	9cb91f90c9	Fix two race conditions between the pending unlink mechanism that was put in place to prevent reusing relation OIDs before next checkpoint, and DROP DATABASE. First, if a database was dropped, bgwriter would still try to unlink the files that the rmtree() call by the DROP DATABASE command has already deleted, or is just about to delete. Second, if a database is dropped, and another database is created with the same OID, bgwriter would in the worst case delete a relation in the new database that happened to get the same OID as a dropped relation in the old database. To fix these race conditions: (1) make rmtree() ignore ENOENT errors, which fixes the 1st race condition; (2) make ForgetDatabaseFsyncRequests forget unlink requests as well; (3) force a checkpoint in dropdb on all platforms. Since ForgetDatabaseFsyncRequests() is asynchronous, the 2nd change isn't enough on its own to fix the problem of dropping and creating a database with same OID, but forcing a checkpoint on DROP DATABASE makes it sufficient. Per Tom Lane's bug report and proposal. Backpatch to 8.3.	2008-04-18 06:48:38 +00:00
Tom Lane	d1cbd26ded	Repair two places where SIGTERM exit could leave shared memory state corrupted. (Neither is very important if SIGTERM is used to shut down the whole database cluster together, but there's a problem if someone tries to SIGTERM individual backends.) To do this, introduce new infrastructure macros PG_ENSURE_ERROR_CLEANUP/PG_END_ENSURE_ERROR_CLEANUP that take care of transiently pushing an on_shmem_exit cleanup hook. Also use this method for createdb cleanup --- that wasn't a shared-memory-corruption problem, but SIGTERM abort of createdb could leave orphaned files lying around. Backpatch as far as 8.2. The shmem corruption cases don't exist in 8.1, and the createdb usage doesn't seem important enough to risk backpatching further.	2008-04-16 23:59:40 +00:00
Tom Lane	ec498cdcbb	Create new routines systable_beginscan_ordered, systable_getnext_ordered, systable_endscan_ordered that have API similar to systable_beginscan etc (in particular, the passed-in scankeys have heap not index attnums), but guarantee ordered output, unlike the existing functions. For the moment these are just very thin wrappers around index_beginscan/index_getnext/etc. Someday they might need to get smarter; but for now this is just a code refactoring exercise to reduce the number of direct callers of index_getnext, in preparation for changing that function's API. In passing, remove index_getnext_indexitem, which has been dead code for quite some time, and will have even less use than that in the presence of run-time-lossy indexes.	2008-04-12 23:14:21 +00:00
Alvaro Herrera	73b0300b2a	Move the HTSU_Result enum definition into snapshot.h, to avoid including tqual.h into heapam.h. This makes all inclusion of tqual.h explicit. I also sorted alphabetically the includes on some source files.	2008-03-26 21:10:39 +00:00
Alvaro Herrera	78f02ca1f5	Rename snapmgmt.c/h to snapmgr.c/h, for consistency with other files. Per complaint from Tom Lane.	2008-03-26 18:48:59 +00:00
Alvaro Herrera	d43b085d57	Separate snapshot management code from tuple visibility code, create a snapmgmt.c file for the former. The header files have also been reorganized in three parts: the most basic snapshot definitions are now in a new file snapshot.h, and the also new snapmgmt.h keeps the definitions for snapmgmt.c. tqual.h has been reduced to the bare minimum. This patch is just a first step towards managing live snapshots within a transaction; there is no functionality change. Per my proposal to pgsql-patches on 20080318191940.GB27458@alvh.no-ip.org and subsequent discussion.	2008-03-26 16:20:48 +00:00
Tom Lane	9b8e1eb375	Adjust the recent patch for reporting of deadlocked queries so that we report query texts only to the server log. This eliminates the issue of possible leaking of security-sensitive data in other sessions' queries. Since the log is presumed secure, we can now log the queries of all sessions involved in the deadlock, whether or not they belong to the same user as the one reporting the failure.	2008-03-24 18:22:36 +00:00
Tom Lane	4b7ae4afae	Report the current queries of all backends involved in a deadlock (if they'd be visible to the current user in pg_stat_activity). This might look like it's subject to race conditions, but it's actually pretty safe because at the time DeadLockReport() is constructing the report, we haven't yet aborted our transaction and so we can expect that everyone else involved in the deadlock is still blocked on some lock. (There are corner cases where that might not be true, such as a statement timeout triggering in another backend before we finish reporting; but at worst we'd report a misleading activity string, so it seems acceptable considering the usefulness of reporting the queries.) Original patch by Itagaki Takahiro, heavily modified by me.	2008-03-21 21:08:31 +00:00
Bruce Momjian	fca9fff41b	More README src cleanups.	2008-03-21 13:23:29 +00:00
Bruce Momjian	4e228447aa	Make source code READMEs more consistent. Add CVS tags to all README files.	2008-03-20 17:55:15 +00:00
Alvaro Herrera	d54bb24cdd	Move elog(DEBUG4) call outside the locked area, per suggestion from Tom Lane.	2008-03-18 12:36:43 +00:00
Peter Eisentraut	a7b7b07af3	Enable probes to work with Mac OS X Leopard and other OSes that will support DTrace in the future. Switch from using DTRACE_PROBEn macros to the dynamically generated macros. Use "dtrace -h" to create a header file that contains the dynamically generated macros to be used in the source code instead of the DTRACE_PROBEn macros. A dummy header file is generated for builds without DTrace support. Author: Robert Lor <Robert.Lor@sun.com>	2008-03-17 19:44:41 +00:00
Alvaro Herrera	23057f51f5	Move ProcState definition into sinvaladt.c from sinvaladt.h, since it's not needed anywhere after my previous patch. Noticed by Tom Lane. Also, remove #include <signal.h> from sinval.c.	2008-03-17 11:50:27 +00:00
Alvaro Herrera	ec6550c6c0	Modify interactions between sinval.c and sinvaladt.c. The code that actually deals with the queue, including locking etc, is all in sinvaladt.c. This means that the struct definition of the queue, and the queue pointer, are now internal "implementation details" inside sinvaladt.c. Per my proposal dated 25-Jun-2007 and followup discussion.	2008-03-16 19:47:34 +00:00
Tom Lane	611b4393f2	Make TransactionIdIsInProgress check transam.c's single-item XID status cache before it goes groveling through the ProcArray. In situations where the same recently-committed transaction ID is checked repeatedly by tqual.c, this saves a lot of shared-memory searches. And it's cheap enough that it shouldn't hurt noticeably when it doesn't help. Concept and patch by Simon, some minor tweaking and comment-cleanup by Tom.	2008-03-11 20:20:35 +00:00
Tom Lane	f0828b2fc3	Provide a build-time option to store large relations as single files, rather than dividing them into 1GB segments as has been our longtime practice. This requires working support for large files in the operating system; at least for the time being, it won't be the default. Zdenek Kotala	2008-03-10 20:06:27 +00:00
Tom Lane	3fcc7e8e18	Reduce memory consumption during VACUUM of large relations, by using FSMPageData (6 bytes) instead of PageFreeSpaceInfo (8 or 16 bytes) for the temporary array of page-free-space information. Itagaki Takahiro	2008-03-10 02:04:10 +00:00
Tom Lane	7d6e6e2e97	Fix PREPARE TRANSACTION to reject the case where the transaction has dropped a temporary table; we can't support that because there's no way to clean up the source backend's internal state if the eventual COMMIT PREPARED is done by another backend. This was checked correctly in 8.1 but I broke it in 8.2 :-(. Patch by Heikki Linnakangas, original trouble report by John Smith.	2008-03-04 19:54:06 +00:00
Tom Lane	d50e256b67	Fix another place that was assuming that a local variable declared as "struct varlena" would be at least word-aligned. Per buildfarm results from gypsy_moth. I did a little bit of trawling for other instances of this coding pattern, and didn't find any; but if we turn up any more of them I think we'd better revert the "char [4]" patch and find another way of making tuptoaster.c alignment-safe.	2008-03-01 19:26:22 +00:00
Peter Eisentraut	0474dcb608	Refactor backend makefiles to remove lots of duplicate code	2008-02-19 10:30:09 +00:00
Tom Lane	082aca9ec2	Fix PageGetExactFreeSpace() so that it actually behaves sensibly if pd_lower > pd_upper, rather than merely claiming to. This would only matter if the page header were corrupt, which shouldn't occur, but ...	2008-02-10 20:39:08 +00:00
Tom Lane	6f906905b1	Fix WaitOnLock() to ensure that the process's "waiting" flag is reset after erroring out of a wait. We can use a PG_TRY block for this, but add a comment explaining why it'd be a bad idea to use it for any other state cleanup. Back-patch to 8.2. Prior releases had the same issue, but only with respect to the process title, which is likely to get reset almost immediately anyway after the transaction aborts, so it seems not worth changing them. In 8.2 and HEAD, the pg_stat_activity "waiting" flag could remain set incorrectly for a long time. Per report from Gurjeet Singh.	2008-02-02 22:26:17 +00:00
Tom Lane	6322e84430	Change StatementCancelHandler() to check the DoingCommandRead flag to decide whether to execute an immediate interrupt, rather than testing whether LockWaitCancel() cancelled a lock wait. The old way misclassified the case where we were blocked in ProcWaitForSignal(), and arguably would misclassify any other future additions of new ImmediateInterruptOK states too. This allows reverting the old kluge that gave LockWaitCancel() a return value, since no callers care anymore. Improve comments in the various implementations of PGSemaphoreLock() to explain that on some platforms, the assumption that semop() exits after a signal is wrong, and so we must ensure that the signal handler itself throws elog if we want cancel or die interrupts to be effective. Per testing related to bug #3883, though this patch doesn't solve those problems fully. Perhaps this change should be back-patched, but since pre-8.3 branches aren't really relying on autovacuum to respond to SIGINT, it doesn't seem critical for them.	2008-01-26 19:55:08 +00:00
Tom Lane	ceb9360067	Fix CREATE INDEX CONCURRENTLY to not deadlock against an automatic or manual VACUUM that is blocked waiting to get lock on the table being indexed. Per report and fix suggestion from Greg Stark.	2008-01-09 21:52:36 +00:00
Tom Lane	da3df47c84	lmgr.c:DescribeLockTag was never taught about virtual xids, per Greg Stark. Also a couple of minor tweaks to try to future-proof the code a bit better against future locktag additions.	2008-01-08 23:18:51 +00:00
Bruce Momjian	9098ab9e32	Update copyrights in source tree to 2008.	2008-01-01 19:46:01 +00:00
Peter Eisentraut	5ca3d50db7	Clarify log messages	2007-12-13 11:55:44 +00:00
Tom Lane	895a94de6d	Avoid incrementing the CommandCounter when CommandCounterIncrement is called but no database changes have been made since the last CommandCounterIncrement. This should result in a significant improvement in the number of "commands" that can typically be performed within a transaction before hitting the 2^32 CommandId size limit. In particular this buys back (and more) the possible adverse consequences of my previous patch to fix plan caching behavior. The implementation requires tracking whether the current CommandCounter value has been "used" to mark any tuples. CommandCounter values stored into snapshots are presumed not to be used for this purpose. This requires some small executor changes, since the executor used to conflate the curcid of the snapshot it was using with the command ID to mark output tuples with. Separating these concepts allows some small simplifications in executor APIs. Something for the TODO list: look into having CommandCounterIncrement not do AcceptInvalidationMessages. It seems fairly bogus to be doing it there, but exactly where to do it instead isn't clear, and I'm disinclined to mess with asynchronous behavior during late beta.	2007-11-30 21:22:54 +00:00
Tom Lane	eae7e00f1f	Fix stupid typo in recently-added code :-(	2007-11-16 00:57:55 +00:00
Bruce Momjian	f6e8730d11	Re-run pgindent with updated list of typedefs. (Updated README should avoid this problem in the future.)	2007-11-15 22:25:18 +00:00
Tom Lane	591b9b091c	Use ftruncate() not truncate() in mdunlink. Seems Windows doesn't support the latter.	2007-11-15 21:49:47 +00:00
Bruce Momjian	fdf5a5efb7	pgindent run for 8.3.	2007-11-15 21:14:46 +00:00
Tom Lane	6cc4451b5c	Prevent re-use of a deleted relation's relfilenode until after the next checkpoint. This guards against an unlikely data-loss scenario in which we re-use the relfilenode, then crash, then replay the deletion and recreation of the file. Even then we'd be OK if all insertions into the new relation had been WAL-logged ... but that's not guaranteed given all the no-WAL-logging optimizations that have recently been added. Patch by Heikki Linnakangas, per a discussion last month.	2007-11-15 20:36:40 +00:00
Tom Lane	69500b05d6	Prevent continuing disk-space bloat when profiling (with PROFILE_PID_DIR enabled) and autovacuum is on. Since there will be a steady stream of autovac worker processes exiting and dropping gmon.out files, allowing them to make separate subdirectories results in serious bloat; and it seems unlikely that anyone will care about those profiles anyway. Limit the damage by forcing all autovac workers to dump in one subdirectory, PGDATA/gprof/avworker/. Per report from Jörg Beyer and subsequent discussion.	2007-11-04 17:55:15 +00:00
Alvaro Herrera	acac68b2bc	Allow an autovacuum worker to be interrupted automatically when it is found to be blocking another process (except when it's working to prevent Xid wraparound problems).	2007-10-26 20:45:10 +00:00
Alvaro Herrera	745c1b2c2a	Rearrange vacuum-related bits in PGPROC as a bitmask, to better support having several of them. Add two more flags: whether the process is executing an ANALYZE, and whether a vacuum is for Xid wraparound (which is obviously only set by autovacuum). Sneakily move the worker's recently-acquired PostAuthDelay to a more useful place.	2007-10-24 20:55:36 +00:00
Tom Lane	7a315a09dc	Dept. of second thoughts: fix loop in BgBufferSync so that the exit when bgwriter_lru_maxpages is exceeded leaves the loop variables in the expected state. In the original coding, we'd fail to advance next_to_clean, causing that buffer to be probably-uselessly rechecked next time, and also have an off-by-one idea of the number of buffers scanned.	2007-09-25 22:11:48 +00:00
Tom Lane	6f5c38dcd0	Just-in-time background writing strategy. This code avoids re-scanning buffers that cannot possibly need to be cleaned, and estimates how many buffers it should try to clean based on moving averages of recent allocation requests and density of reusable buffers. The patch also adds a couple more columns to pg_stat_bgwriter to help measure the effectiveness of the bgwriter. Greg Smith, building on his own work and ideas from several other people, in particular a much older patch from Itagaki Takahiro.	2007-09-25 20:03:38 +00:00
Tom Lane	1b3d400cac	TransactionIdIsInProgress can skip scanning the ProcArray if the target XID is later than latestCompletedXid, per Florian Pflug. Also some minor improvements in the XIDCACHE_DEBUG code --- make sure each call of TransactionIdIsInProgress is counted one way or another.	2007-09-23 18:50:38 +00:00
Tom Lane	cc59049daf	Improve handling of prune/no-prune decisions by storing a page's oldest unpruned XMAX in its header. At the cost of 4 bytes per page, this keeps us from performing heap_page_prune when there's no chance of pruning anything. Seems to be necessary per Heikki's preliminary performance testing.	2007-09-21 21:25:42 +00:00
Tom Lane	da072ab2ab	Make some simple performance improvements in TransactionIdIsInProgress(). For XIDs of our own transaction and subtransactions, it's cheaper to ask TransactionIdIsCurrentTransactionId() than to look in shared memory. Also, the xids[] work array is always the same size within any given process, so malloc it just once instead of doing a palloc/pfree on every call; aside from being faster this lets us get rid of some goto's, since we no longer have any end-of-function pfree to do. Both ideas by Heikki.	2007-09-21 17:36:53 +00:00
Tom Lane	282d2a03dd	HOT updates. When we update a tuple without changing any of its indexed columns, and the new version can be stored on the same heap page, we no longer generate extra index entries for the new version. Instead, index searches follow the HOT-chain links to ensure they find the correct tuple version. In addition, this patch introduces the ability to "prune" dead tuples on a per-page basis, without having to do a complete VACUUM pass to recover space. VACUUM is still needed to clean up dead index entries, however. Pavan Deolasee, with help from a bunch of other people.	2007-09-20 17:56:33 +00:00
Tom Lane	6889303531	Redefine the lp_flags field of item pointers as having four states, rather than two independent bits (one of which was never used in heap pages anyway, or at least hadn't been in a very long time). This gives us flexibility to add the HOT notions of redirected and dead item pointers without requiring anything so klugy as magic values of lp_off and lp_len. The state values are chosen so that for the states currently in use (pre-HOT) there is no change in the physical representation.	2007-09-12 22:10:26 +00:00
Tom Lane	6bd4f401b0	Replace the former method of determining snapshot xmax --- to wit, calling ReadNewTransactionId from GetSnapshotData --- with a "latestCompletedXid" variable that is updated during transaction commit or abort. Since latestCompletedXid is written only in places that had to lock ProcArrayLock exclusively anyway, and is read only in places that had to lock ProcArrayLock shared anyway, it adds no new locking requirements to the system despite being cluster-wide. Moreover, removing ReadNewTransactionId from snapshot acquisition eliminates the need to take both XidGenLock and ProcArrayLock at the same time. Since XidGenLock is sometimes held across I/O this can be a significant win. Some preliminary benchmarking suggested that this patch has no effect on average throughput but can significantly improve the worst-case transaction times seen in pgbench. Concept by Florian Pflug, implementation by Tom Lane.	2007-09-08 20:31:15 +00:00
Tom Lane	0a51e7073c	Don't take ProcArrayLock while exiting a transaction that has no XID; there is no need for serialization against snapshot-taking because the xact doesn't affect anyone else's snapshot anyway. Per discussion. Also, move various info about the interlocking of transactions and snapshots out of code comments and into a hopefully-more-cohesive discussion in access/transam/README. Also, remove a couple of now-obsolete comments about having to force some WAL to be written to persuade RecordTransactionCommit to do its thing.	2007-09-07 20:59:26 +00:00
Tom Lane	cd1aae5864	Allow CREATE INDEX CONCURRENTLY to disregard transactions in other databases, per gripe from hubert depesz lubaczewski. Patch from Simon Riggs.	2007-09-07 00:58:57 +00:00
Tom Lane	0ecb4ea773	Volatile-qualify the ProcArray PGPROC pointer in a bunch of routines that examine fields that could change under them. This is just to make really sure that when we are fetching a value 'only once', that's what actually happens. Possibly this is a bug that should be back-patched, but in the absence of solid evidence that it's needed, I won't bother.	2007-09-05 21:11:19 +00:00
Tom Lane	295e63983d	Implement lazy XID allocation: transactions that do not modify any database rows will normally never obtain an XID at all. We already did things this way for subtransactions, but this patch extends the concept to top-level transactions. In applications where there are lots of short read-only transactions, this should improve performance noticeably; not so much from removal of the actual XID-assignments, as from reduction of overhead that's driven by the rate of XID consumption. We add a concept of a "virtual transaction ID" so that active transactions can be uniquely identified even if they don't have a regular XID. This is a much lighter-weight concept: uniqueness of VXIDs is only guaranteed over the short term, and no on-disk record is made about them. Florian Pflug, with some editorialization by Tom.	2007-09-05 18:10:48 +00:00
Tom Lane	24d4517b3b	Improve behavior of log_lock_waits patch. Ensure that something gets logged even if the "deadlock detected" ERROR message is suppressed by an exception catcher. Be clearer about the event sequence when a soft deadlock is fixed: the fixing process might or might not still have to wait, so log that separately. Fix race condition when someone releases us from the lock partway through printing all this junk --- we'd not get confused about our state, but the log message sequence could have been misleading, ie, a "still waiting" message with no subsequent "acquired" message. Greg Stark and Tom Lane.	2007-08-28 03:23:44 +00:00
Tom Lane	e4f4a7f5a4	Remove FileUnlink(), which wasn't being used anywhere and interacted poorly with the recent patch to log temp file sizes at removal time. Doesn't seem worth fixing since it's unused. In passing, make a few elog messages conform to the message style guide.	2007-07-26 15:15:18 +00:00
Tom Lane	82eed4dba2	Arrange to put TOAST tables belonging to temporary tables into special schemas named pg_toast_temp_nnn, alongside the pg_temp_nnn schemas used for the temp tables themselves. This allows low-level code such as the relcache to recognize that these tables are indeed temporary, which enables various optimizations such as not WAL-logging changes and using local rather than shared buffers for access. Aside from obvious performance benefits, this provides a solution to bug #3483, in which other backends unexpectedly held open file references to temporary tables. The scheme preserves the property that TOAST tables are not in any schema that's normally in the search path, so they don't conflict with user table names. initdb forced because of changes in system view definitions.	2007-07-25 22:16:18 +00:00
Tom Lane	fdb5b69e9c	Suppress warning when compiling with -DPROFILE_PID_DIR: sys/stat.h is supposed to be included when using mkdir().	2007-07-25 19:58:56 +00:00
Tom Lane	04fbe29a83	Fix WAL replay of truncate operations to cope with the possibility that the truncated relation was deleted later in the WAL sequence. Since replay normally auto-creates a relation upon its first reference by a WAL log entry, failure is seen only if the truncate entry happens to be the first reference after the checkpoint we're restarting from; which is a pretty unusual case but of course not impossible. Fix by making truncate entries auto-create like the other ones do. Per report and test case from Dharmendra Goyal.	2007-07-20 16:29:53 +00:00
Tom Lane	82b3684672	Add comments spelling out why it's a good idea to release multiple partition locks in reverse order.	2007-07-16 21:09:50 +00:00
Tom Lane	b09cb0cf12	Remove the pgstat_drop_relation() call from smgr_internal_unlink(), because we don't know at that point which relation OID to tell pgstat to forget. The code was passing the relfilenode, which is incorrect, and could possibly cause some other relation's stats to be zeroed out. While we could try to clean this up, it seems much simpler and more reliable to let the next invocation of pgstat_vacuum_tabstat() fix things; which indeed is how it worked before I introduced the buggy code into 8.1.3 and later :-(. Problem noticed by Itagaki Takahiro, fix is per subsequent discussion.	2007-07-08 22:23:16 +00:00
Tom Lane	83aaebba63	Fix incorrect comment about the timing of AbsorbFsyncRequests() during checkpoint. The comment claimed that we could do this anytime after setting the checkpoint REDO point, but actually BufferSync is relying on the assumption that buffers dumped by other backends will be fsync'd too. So we really could not do it any sooner than we are doing it.	2007-07-03 14:51:24 +00:00
Tom Lane	beba73763b	Fix comments not updated in recent patch.	2007-07-01 02:22:23 +00:00
Tom Lane	9fc25c0511	Improve logging of checkpoints. Patch by Greg Smith, worked over by Heikki and a little bit by me.	2007-06-30 19:12:02 +00:00
Alvaro Herrera	10af02b912	Arrange for SIGINT in autovacuum workers to cancel the current table and continue with the schedule. Change current uses of SIGINT to abort a worker into SIGTERM, which keeps the old behaviour of terminating the process. Patch from ITAGAKI Takahiro, with some editorializing of my own.	2007-06-29 17:07:39 +00:00
Tom Lane	867e2c91a0	Implement "distributed" checkpoints in which the checkpoint I/O is spread over a fairly long period of time, rather than being spat out in a burst. This happens only for background checkpoints carried out by the bgwriter; other cases, such as a shutdown checkpoint, are still done at full speed. Remove the "all buffers" scan in the bgwriter, and associated stats infrastructure, since this seems no longer very useful when the checkpoint itself is properly throttled. Original patch by Itagaki Takahiro, reworked by Heikki Linnakangas, and some minor API editorialization by me.	2007-06-28 00:02:40 +00:00
Tom Lane	9cce91dba0	Only log 'process acquired lock' if we actually did get the lock. This test seems inessential right now since the only control path for not getting the lock is via CHECK_FOR_INTERRUPTS which won't return control to ProcSleep, but it would be important if we ever allow the deadlock code to kill someone else's transaction instead of our own.	2007-06-19 22:01:15 +00:00
Tom Lane	6e07228728	Code review for log_lock_waits patch. Don't try to issue log messages from within a signal handler (this might be safe given the relatively narrow code range in which the interrupt is enabled, but it seems awfully risky); do issue more informative log messages that tell what is being waited for and the exact length of the wait; minor other code cleanup. Greg Stark and Tom Lane	2007-06-19 20:13:22 +00:00
Tom Lane	de6a6383a7	Update obsolete comment: it's no longer the case that mdread() will allow reads beyond EOF, except by special coercion.	2007-06-18 00:47:20 +00:00
Tom Lane	e976fd43c6	Add some simple defenses against null fields in pg_largeobject, and add comments noting that there's an alignment assumption now that the data field could be in 1-byte-header format. Per discussion with Greg Stark.	2007-06-12 19:46:24 +00:00
Tom Lane	a04a423599	Arrange for large sequential scans to synchronize with each other, so that when multiple backends are scanning the same relation concurrently, each page is (ideally) read only once. Jeff Davis, with review by Heikki and Tom.	2007-06-08 18:23:53 +00:00
Tom Lane	6d6d14b6d5	Redefine IsTransactionState() to only return true for TRANS_INPROGRESS state, which is the only state in which it's safe to initiate database queries. It turns out that all but two of the callers thought that's what it meant; and the other two were using it as a proxy for "will GetTopTransactionId() return a nonzero XID"? Since it was in fact an unreliable guide to that, make those two just invoke GetTopTransactionId() always, then deal with a zero result if they get one.	2007-06-07 21:45:59 +00:00
Tom Lane	24ee8af573	Rework temp_tablespaces patch so that temp tablespaces are assigned separately for each temp file, rather than once per sort or hashjoin; this allows spreading the data of a large sort or join across multiple tablespaces. (I remain dubious that this will make any difference in practice, but certain people insisted.) Arrange to cache the results of parsing the GUC variable instead of recomputing from scratch on every demand, and push usage of the cache down to the bottommost fd.c level.	2007-06-07 19:19:57 +00:00
Tom Lane	acfce502ba	Create a GUC parameter temp_tablespaces that allows selection of the tablespace(s) in which to store temp tables and temporary files. This is a list to allow spreading the load across multiple tablespaces (a random list element is chosen each time a temp object is to be created). Temp files are not stored in per-database pgsql_tmp/ directories anymore, but per-tablespace directories. Jaime Casanova and Albert Cervera, with review by Bernd Helmle and Tom Lane.	2007-06-03 17:08:34 +00:00
Tom Lane	964ec46cfe	Fix aboriginal bug in BufFileDumpBuffer that would cause it to write the wrong data when dumping a bufferload that crosses a component-file boundary. This probably has not been seen in the wild because (a) component files are normally 1GB apiece and (b) non-block-aligned buffer usage is relatively rare. But it's fairly easy to reproduce a problem if one reduces RELSEG_SIZE in a test build. Kudos to Kurt Harriman for spotting the bug.	2007-06-01 23:43:11 +00:00
Tom Lane	bd0a260928	Make CREATE/DROP/RENAME DATABASE wait a little bit to see if other backends will exit before failing because of conflicting DB usage. Per discussion, this seems a good idea to help mask the fact that backend exit takes nonzero time. Remove a couple of thereby-obsoleted sleeps in contrib and PL regression test sequences.	2007-06-01 19:38:07 +00:00
Tom Lane	d526575f89	Make large sequential scans and VACUUMs work in a limited-size "ring" of buffers, rather than blowing out the whole shared-buffer arena. Aside from avoiding cache spoliation, this fixes the problem that VACUUM formerly tended to cause a WAL flush for every page it modified, because we had it hacked to use only a single buffer. Those flushes will now occur only once per ring-ful. The exact ring size, and the threshold for seqscans to switch into the ring usage pattern, remain under debate; but the infrastructure seems done. The key bit of infrastructure is a new optional BufferAccessStrategy object that can be passed to ReadBuffer operations; this replaces the former StrategyHintVacuum API. This patch also changes the buffer usage-count methodology a bit: we now advance usage_count when first pinning a buffer, rather than when last unpinning it. To preserve the behavior that a buffer's lifetime starts to decrease when it's released, the clock sweep code is modified to not decrement usage_count of pinned buffers. Work not done in this commit: teach GiST and GIN indexes to use the vacuum BufferAccessStrategy for vacuum-driven fetches. Original patch by Simon, reworked by Heikki and again by Tom.	2007-05-30 20:12:03 +00:00
Tom Lane	77947c51c0	Fix up pgstats counting of live and dead tuples to recognize that committed and aborted transactions have different effects; also teach it not to assume that prepared transactions are always committed. Along the way, simplify the pgstats API by tying counting directly to Relations; I cannot detect any redeeming social value in having stats pointers in HeapScanDesc and IndexScanDesc structures. And fix a few corner cases in which counts might be missed because the relation's pgstat_info pointer hadn't been set.	2007-05-27 03:50:39 +00:00
Tom Lane	63735ca815	Dept. of second thoughts: add comments cautioning against using ReadOrZeroBuffer to fetch pages from beyond physical EOF. This would usually work, but would cause problems for md.c if writes occurred beyond a segment boundary when the previous segment file hadn't been fully extended.	2007-05-02 23:34:48 +00:00
Tom Lane	8c3cc86e7b	During WAL recovery, when reading a page that we intend to overwrite completely from the WAL data, don't bother to physically read it; just have bufmgr.c return a zeroed-out buffer instead. This speeds recovery significantly, and also avoids unnecessary failures when a page-to-be-overwritten has corrupt page headers on disk. This replaces a former kluge that accomplished the latter by pretending zero_damaged_pages was always ON during WAL recovery; which was OK when the kluge was put in, but is unsafe when restoring a WAL log that was written with full_page_writes off. Heikki Linnakangas	2007-05-02 23:18:03 +00:00
Bruce Momjian	1c8302cab3	Add comment on why deadlock detection error messages only print numbers.	2007-04-20 20:15:52 +00:00
Alvaro Herrera	e2a186b03c	Add a multi-worker capability to autovacuum. This allows multiple worker processes to be running simultaneously. Also, now autovacuum processes do not count towards the max_connections limit; they are counted separately from regular processes, and are limited by the new GUC variable autovacuum_max_workers. The launcher now has intelligence to launch workers on each database every autovacuum_naptime seconds, limited only by the maximum number of worker slots available. Also, the global worker I/O utilization is limited by the vacuum cost-based delay feature. Workers are "balanced" so that the total I/O consumption does not exceed the established limit. This part of the patch was contributed by ITAGAKI Takahiro. Per discussion.	2007-04-16 18:30:04 +00:00
Tom Lane	995ba280c1	Rearrange mdsync() looping logic to avoid the problem that a sufficiently fast flow of new fsync requests can prevent mdsync() from ever completing. This was an unforeseen consequence of a patch added in Mar 2006 to prevent the fsync request queue from overflowing. Problem identified by Heikki Linnakangas and independently by ITAGAKI Takahiro; fix based on ideas from Takahiro-san, Heikki, and Tom. Back-patch as far as 8.1 because a previous back-patch introduced the problem into 8.1 ...	2007-04-12 17:10:55 +00:00
Tom Lane	3e23b68dac	Support varlena fields with single-byte headers and unaligned storage. This commit breaks any code that assumes that the mere act of forming a tuple (without writing it to disk) does not "toast" any fields. While all available regression tests pass, I'm not totally sure that we've fixed every nook and cranny, especially in contrib. Greg Stark with some help from Tom Lane	2007-04-06 04:21:44 +00:00
Tom Lane	9c9b619473	Remove the CheckpointStartLock in favor of having backends show whether they are in their commit critical sections via flags in the ProcArray. Checkpoint can watch the ProcArray to determine when it's safe to proceed. This is a considerably better solution to the original problem of race conditions between checkpoint and transaction commit: it speeds up commit, since there's one less lock to fool with, and it prevents the problem of checkpoint being delayed indefinitely when there's a constant flow of commits. Heikki, with some kibitzing from Tom.	2007-04-03 16:34:36 +00:00
Magnus Hagander	335feca441	Add some instrumentation to the bgwriter, through the stats collector. New view pg_stat_bgwriter, and the functions required to build it.	2007-03-30 18:34:56 +00:00
Tom Lane	e85a01df67	Clean up the representation of special snapshots by including a "method pointer" in every Snapshot struct. This allows removal of the case-by-case tests in HeapTupleSatisfiesVisibility, which should make it a bit faster (I didn't try any performance tests though). More importantly, we are no longer violating portable C practices by assuming that small integers are distinct from all pointer values, and HeapTupleSatisfiesDirty no longer has a non-reentrant API involving side-effects on a global variable. There were a couple of places calling HeapTupleSatisfiesXXX routines directly rather than through the HeapTupleSatisfiesVisibility macro. Since these places had to be changed anyway, I chose to make them go through the macro for uniformity. Along the way I renamed HeapTupleSatisfiesSnapshot to HeapTupleSatisfiesMVCC to emphasize that it's only used with MVCC-type snapshots. I was sorely tempted to rename HeapTupleSatisfiesVisibility to HeapTupleSatisfiesSnapshot, but forebore for the moment to avoid confusion and reduce the likelihood that this patch breaks some of the pending patches. Might want to reconsider doing that later.	2007-03-25 19:45:14 +00:00
Bruce Momjian	1e2bfb5811	Cleanup for procarray.c.	2007-03-23 03:16:39 +00:00
Alvaro Herrera	626eb02198	Clean up the bootstrap code a little, and rename "dummy procs" in the code comments and variables to "auxiliary proc", per Heikki's request.	2007-03-07 13:35:03 +00:00
Bruce Momjian	a535cdf130	Revert temp_tablespaces because of coding problems, per Tom.	2007-03-06 02:06:15 +00:00
Bruce Momjian	0763a56501	Add lo_truncate() to backend and libpq for large object truncation. Kris Jurka	2007-03-03 19:52:47 +00:00
Neil Conway	90d76525c5	Add resetStringInfo(), which clears the content of a StringInfo, and fixup various places in the tree that were clearing a StringInfo by hand. Making this function a part of the API simplifies client code slightly, and avoids needlessly peeking inside the StringInfo interface.	2007-03-03 19:32:55 +00:00
Bruce Momjian	e52c4a6e26	Add GUC log_lock_waits to log long wait times. Simon Riggs	2007-03-03 18:46:40 +00:00
Tom Lane	fb276438b6	Suppress useless searches for unused line pointers in PageAddItem. To do this, add a 16-bit "flags" field to page headers by stealing some bits from pd_tli. We use one flag bit as a hint to indicate whether there are any unused line pointers; the remaining 15 are available for future use. This is a cut-down form of an idea proposed by Hiroki Kataoka in July 2005. At the time it was rejected because the original patch increased the size of page headers and it wasn't clear that the benefit outweighed the distributed cost. The flag-bit approach gets most of the benefit without requiring an increase in the page header size. Heikki Linnakangas and Tom Lane	2007-03-02 00:48:44 +00:00
Magnus Hagander	2c6feff5e7	Remove temporary Windows-specific debugging code.	2007-02-28 15:59:30 +00:00
Tom Lane	234a02b2a8	Replace direct assignments to VARATT_SIZEP(x) with SET_VARSIZE(x, len). Get rid of VARATT_SIZE and VARATT_DATA, which were simply redundant with VARSIZE and VARDATA, and as a consequence almost no code was using the longer names. Rename the length fields of struct varlena and various derived structures to catch anyplace that was accessing them directly; and clean up various places so caught. In itself this patch doesn't change any behavior at all, but it is necessary infrastructure if we hope to play any games with the representation of varlena headers. Greg Stark and Tom Lane	2007-02-27 23:48:10 +00:00
Bruce Momjian	6f519ad01c	btree source code cleanups: I refactored findsplitloc and checksplitloc so that the division of labor is more clear IMO. I pushed all the space calculation inside the loop to checksplitloc. I also fixed the off by 4 in free space calculation caused by PageGetFreeSpace subtracting sizeof(ItemIdData), even though it was harmless, because it was distracting and I felt it might come back to bite us in the future if we change the page layout or alignments. There's now a new function PageGetExactFreeSpace that doesn't do the subtraction. findsplitloc now tries the "just the new item to right page" split as well. If people don't like the refactoring, I can write a patch to just add that. Heikki Linnakangas	2007-02-21 20:02:17 +00:00
Bruce Momjian	6765df9174	Add configure --enable-profiling to enable GCC profiling. Patches from Korry Douglas and Nikhil S	2007-02-21 15:12:39 +00:00
Alvaro Herrera	1820650934	Restructure autovacuum in two processes: a dummy process, which runs continuously, and requests vacuum runs of "autovacuum workers" to postmaster. The workers do the actual vacuum work. This allows for future improvements, like allowing multiple autovacuum jobs running in parallel. For now, the code keeps the original behavior of having a single autovac process at any time by sleeping until the previous worker has finished.	2007-02-15 23:23:23 +00:00
Peter Eisentraut	c138b966d4	Replace useless uses of := by = in makefiles.	2007-02-09 15:56:00 +00:00
Bruce Momjian	8b4ff8b6a1	Wording cleanup for error messages. Also change can't -> cannot. Standard English uses "may", "can", and "might" in different ways: may - permission, "You may borrow my rake." can - ability, "I can lift that log." might - possibility, "It might rain today." Unfortunately, in conversational English, their use is often mixed, as in, "You may use this variable to do X", when in fact, "can" is a better choice. Similarly, "It may crash" is better stated, "It might crash".	2007-02-01 19:10:30 +00:00
Bruce Momjian	148ea5cbea	Add GUC temp_tablespaces to provide a default location for temporary objects. Jaime Casanova	2007-01-25 04:35:11 +00:00
Peter Eisentraut	2cc01004c6	Remove remains of old depend target.	2007-01-20 17:16:17 +00:00
Tom Lane	eddbf39756	Extend yesterday's patch so that the bgwriter is also told to forget pending fsyncs during DROP DATABASE. Obviously necessary in hindsight :-(	2007-01-17 16:25:01 +00:00
Tom Lane	6d660587f6	Revise bgwriter fsync-request mechanism to improve robustness when a table is deleted. A backend about to unlink a file now sends a "revoke fsync" request to the bgwriter to make it clean out pending fsync requests. There is still a race condition where the bgwriter may try to fsync after the unlink has happened, but we can resolve that by rechecking the fsync request queue to see if a revoke request arrived meanwhile. This eliminates the former kluge of "just assuming" that an ENOENT failure is okay, and lets us handle the fact that on Windows it might be EACCES too without introducing any questionable assumptions. After an idea of mine improved by Magnus. The HEAD patch doesn't apply cleanly to 8.2, but I'll see about a back-port later. In the meantime this could do with some testing on Windows; I've been able to force it through the code path via ENOENT, but that doesn't prove that it actually fixes the Windows problem ...	2007-01-17 00:17:21 +00:00
Alvaro Herrera	eb63cc3da8	Arrange for autovacuum to be killed when another operation, such as DROP DATABASE, needs exclusive access to the database. This allows the regression tests to pass with autovacuum enabled, which opens the gates for finally enabling autovacuum by default.	2007-01-16 13:28:57 +00:00
Bruce Momjian	d64995aa89	Remove trace macro call from new log_temp_files, until it gets more research.	2007-01-09 22:03:51 +00:00
Bruce Momjian	be8a431881	Add GUC log_temp_files to log the use of temporary files. Bill Moran	2007-01-09 21:31:17 +00:00
Bruce Momjian	29dccf5fe0	Update CVS HEAD for 2007 copyright. Back branches are typically not back-stamped for this.	2007-01-05 22:20:05 +00:00
Tom Lane	ef07221997	Clean up smgr.c/md.c APIs as per discussion a couple months ago. Instead of having md.c return a success/failure boolean to smgr.c, which was just going to elog anyway, let md.c issue the elog messages itself. This allows better error reporting, particularly in cases such as "short read" or "short write" which Peter was complaining of. Also, remove the kluge of allowing mdread() to return zeroes from a read-beyond-EOF: this is now an error condition except when InRecovery or zero_damaged_pages = true. (Hash indexes used to require that behavior, but no more.) Also, enforce that mdwrite() is to be used for rewriting existing blocks while mdextend() is to be used for extending the relation EOF. This restriction lets us get rid of the old ad-hoc defense against creating huge files by an accidental reference to a bogus block number: we'll only create new segments in mdextend() not mdwrite() or mdread(). (Again, when InRecovery we allow it anyway, since we need to allow updates of blocks that were later truncated away.) Also, clean up the original makeshift patch for bug #2737: move the responsibility for padding relation segments to full length into md.c.	2007-01-03 18:11:01 +00:00
Tom Lane	72619f8191	Modify local buffer management to request memory for local buffers in blocks of increasing size, instead of one at a time. This reduces the memory management overhead when num_temp_buffers is large: in the previous coding we would actually waste 50% of the space used for temp buffers, because aset.c would round the individual requests up to 16K. Problem noted while studying a performance issue reported by Steven Flatt. Back-patch as far as 8.1 --- older versions used few enough local buffers that the issue isn't significant for them.	2006-12-27 22:31:54 +00:00
Peter Eisentraut	409600942b	KB -> kB	2006-11-24 09:20:12 +00:00
Tom Lane	3ad0728c81	On systems that have setsid(2) (which should be just about everything except Windows), arrange for each postmaster child process to be its own process group leader, and deliver signals SIGINT, SIGTERM, SIGQUIT to the whole process group not only the direct child process. This provides saner behavior for archive and recovery scripts; in particular, it's possible to shut down a warm-standby recovery server using "pg_ctl stop -m immediate", since delivery of SIGQUIT to the startup subprocess will result in killing the waiting recovery_command. Also, this makes Query Cancel and statement_timeout apply to scripts being run from backends via system(). (There is no support in the core backend for that, but it's widely done using untrusted PLs.) Per gripe from Stephen Harris and subsequent discussion.	2006-11-21 20:59:53 +00:00
Tom Lane	1a5c450f30	When truncating a relation in-place (eg during VACUUM), do not try to unlink any no-longer-needed segments; just truncate them to zero bytes and leave the files in place for possible future re-use. This avoids problems when the segments are re-used due to relation growth shortly after truncation. Before, the bgwriter, and possibly other backends, could still be holding open file references to the old segment files, and would write dirty blocks into those files where they'd disappear from the view of other processes. Back-patch as far as 8.0. I believe the 7.x branches are not vulnerable, because they had no bgwriter, and "blind" writes by other backends would always be done via freshly-opened file references.	2006-11-20 01:07:56 +00:00
Tom Lane	36e012e727	Remove temporary Windows-specific debugging code; it seems the problem with fopen() not using FILE_SHARE_DELETE was indeed the bug we were after, given lack of recent reports.	2006-11-06 17:10:22 +00:00
Tom Lane	48188e1621	Fix recently-understood problems with handling of XID freezing, particularly in PITR scenarios. We now WAL-log the replacement of old XIDs with FrozenTransactionId, so that such replacement is guaranteed to propagate to PITR slave databases. Also, rather than relying on hint-bit updates to be preserved, pg_clog is not truncated until all instances of an XID are known to have been replaced by FrozenTransactionId. Add new GUC variables and pg_autovacuum columns to allow management of the freezing policy, so that users can trade off the size of pg_clog against the amount of freezing work done. Revise the already-existing code that forces autovacuum of tables approaching the wraparound point to make it more bulletproof; also, revise the autovacuum logic so that anti-wraparound vacuuming is done per-table rather than per-database. initdb forced because of changes in pg_class, pg_database, and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.	2006-11-05 22:42:10 +00:00
Tom Lane	954c1813ac	Remove an unnecessary HOLD_INTERRUPTS/RESUME_INTERRUPTS pair. This was required back when RESUME_INTERRUPTS could actually execute ProcessInterrupts, but that hasn't been true since 2001...	2006-10-22 20:34:54 +00:00
Tom Lane	e0dece127d	Redesign the patch for allocation of shmem space and LWLocks for add-on modules; the first try was not usable in EXEC_BACKEND builds (e.g., Windows). Instead, just provide some entry points to increase the allocation requests during postmaster start, and provide a dedicated LWLock that can be used to synchronize allocation operations performed by backends. Per discussion with Marc Munro.	2006-10-15 22:04:08 +00:00
Bruce Momjian	f99a569a2e	pgindent run for 8.2.	2006-10-04 00:30:14 +00:00
Tom Lane	c92f7e258e	Replace strncpy with strlcpy in selected places that seem possibly relevant to performance. (A wholesale effort to get rid of strncpy should be undertaken sometime, but not during beta.) This commit also fixes dynahash.c to correctly truncate overlength string keys for hashtables, so that its callers don't have to anymore.	2006-09-27 18:40:10 +00:00
Tom Lane	ffae5cc5a6	Add a check to prevent overwriting valid data if smgrnblocks() gives a wrong answer, as has been seen to occur with a buggy Linux kernel. Not really our bug, but it's a simple test in a seldom-used control path, so might as well have a defense.	2006-09-25 22:01:10 +00:00
Tom Lane	d40d34863e	Fix pg_locks view to call advisory locks advisory locks, while preserving backward compatibility for anyone using the old userlock code that's now on pgfoundry --- locks from that code still show as 'userlock'.	2006-09-22 23:20:14 +00:00
Tom Lane	9e936693a9	Fix free space map to correctly track the total amount of FSM space needed even when a single relation requires more than max_fsm_pages pages. Also, make VACUUM emit a warning in this case, since it likely means that VACUUM FULL or other drastic corrective measure is needed. Per reports from Jeff Frost and others of unexpected changes in the claimed max_fsm_pages need.	2006-09-21 20:31:22 +00:00
Tom Lane	9b4cda0df6	Add built-in userlock manipulation functions to replace the former contrib functionality. Along the way, remove the USER_LOCKS configuration symbol, since it no longer makes any sense to try to compile that out. No user documentation yet ... mmoncure has promised to write some. Thanks to Abhijit Menon-Sen for creating a first draft to work from.	2006-09-18 22:40:40 +00:00
Tom Lane	2e5e856f6b	Marginal cleanup in arrangements for ensuring StrategyHintVacuum is cleared after an error during VACUUM. We have a PG_TRY block anyway around the only call sites, so just reset it in the CATCH clause instead of having AtEOXact_Buffers blindly do it during xact end. I think the old code was actively wrong for the case of a failure during ANALYZE inside a subtransaction --- the flag wouldn't get cleared until main transaction end. Probably not worth back-patching though.	2006-09-17 22:16:22 +00:00
Bruce Momjian	a0e87ad7a5	Specify lo_write() to take a _const_ buffer, to match documentation.	2006-09-07 15:37:25 +00:00
Tom Lane	8fad2e3ff4	Arrange for GetSnapshotData to copy live-subtransaction XIDs from the PGPROC array into snapshots, and use this information to avoid visits to pg_subtrans in HeapTupleSatisfiesSnapshot. This appears to solve the pg_subtrans-related context swap storm problem that's been reported by several people for 8.1. While at it, modify GetSnapshotData to not take an exclusive lock on ProcArrayLock, as closer analysis shows that shared lock is always sufficient. Itagaki Takahiro and Tom Lane	2006-09-03 15:59:39 +00:00
Tom Lane	e06fda0a8b	Add a function GetLockConflicts() to lock.c to report xacts holding locks that would conflict with a specified lock request, without actually trying to get that lock. Use this instead of the former ad hoc method of doing the first wait step in CREATE INDEX CONCURRENTLY. Fixes problem with undetected deadlock and in many cases will allow the index creation to proceed sooner than it otherwise could've. Per discussion with Greg Stark.	2006-08-27 19:14:34 +00:00
Tom Lane	e093dcdd28	Add the ability to create indexes 'concurrently', that is, without blocking concurrent writes to the table. Greg Stark, with a little help from Tom Lane.	2006-08-25 04:06:58 +00:00
Tom Lane	f836c2e37e	Add some debug logging code to AllocateFile's failure path to log the specific Windows error code (GetLastError). This is a hopefully temporary hack to try to diagnose rare failures. Magnus Hagander	2006-08-24 03:15:43 +00:00
Tom Lane	9bf760f7de	Add a 'waiting' column to pg_stat_activity to carry the same information that ps_status provides by appending 'waiting' to the PS display. This completes the project of making it feasible to turn off process title updates and instead rely on pg_stat_activity. Per my suggestion a few weeks ago.	2006-08-19 01:36:34 +00:00
Tom Lane	7aa772f03e	Now that we've rearranged relation open to get a lock before touching the rel, it's easy to get rid of the narrow race-condition window that used to exist in VACUUM and CLUSTER. Did some minor code-beautification work in the same area, too.	2006-08-18 16:09:13 +00:00
Tom Lane	2dd7ab0627	Put back another improperly-removed #include.	2006-08-07 21:56:25 +00:00
Tom Lane	3467758809	Fix missing 'static' keywords --- some compilers gripe about this.	2006-08-04 16:42:56 +00:00
Bruce Momjian	2c6d96cef6	Add support for loadable modules to allocate shared memory and lightweight locks. Marc Munro	2006-08-01 19:03:11 +00:00
Tom Lane	09d3670df3	Change the relation_open protocol so that we obtain lock on a relation (table or index) before trying to open its relcache entry. This fixes race conditions in which someone else commits a change to the relation's catalog entries while we are in process of doing relcache load. Problems of that ilk have been reported sporadically for years, but it was not really practical to fix until recently --- for instance, the recent addition of WAL-log support for in-place updates helped. Along the way, remove pg_am.amconcurrent: all AMs are now expected to support concurrent update.	2006-07-31 20:09:10 +00:00
Tom Lane	8822263635	Fix a couple of comments.	2006-07-30 20:17:11 +00:00
Alvaro Herrera	92c2ecc130	Modify snapshot definition so that lazy vacuums are ignored by other vacuums. This allows an OLTP-like system with big tables to continue regular vacuuming on small-but-frequently-updated tables while the big tables are being vacuumed. Original patch from Hannu Krossing, rewritten by Tom Lane and updated by me.	2006-07-30 02:07:18 +00:00
Peter Eisentraut	e9b4969062	DTrace support, with a small initial set of probes by Robert Lor	2006-07-24 16:32:45 +00:00
Tom Lane	a794fb0681	Convert the lock manager to use the new dynahash.c support for partitioned hash tables, instead of the previous kluge involving multiple hash tables. This partially undoes my patch of last December.	2006-07-23 23:08:46 +00:00
Tom Lane	b25dc481c8	Fix oversight in sizing of shared buffer lookup hashtable. Because BufferAlloc tries to insert a new mapping entry before deleting the old one for a buffer, we have a transient need for more than NBuffers entries --- one more in 8.1, and as many as NUM_BUFFER_PARTITIONS more in CVS HEAD. In theory this could lead to an "out of shared memory" failure if shmem had already been completely claimed by the time the extra entries were needed.	2006-07-23 18:34:45 +00:00
Tom Lane	10b9ca3d05	Split the buffer mapping table into multiple separately lockable partitions, as per discussion. Passes functionality checks, but I don't have any performance data yet.	2006-07-23 03:07:58 +00:00
Tom Lane	51ee9fa157	Add support to dynahash.c for partitioning shared hashtables according to the low-order bits of the entry hash value. Also make some incidental cleanups in the dynahash API, such as not exporting the hash header structs to the world.	2006-07-22 23:04:39 +00:00
Tom Lane	c0e9b3139f	Hmm, seems --disable-spinlocks has been broken for awhile and nobody noticed. Fix SpinlockSemas() to report the correct count considering that PG 8.1 adds a spinlock to each shared-buffer header.	2006-07-22 21:04:40 +00:00
Tom Lane	3ff58b48c9	Put back another not-so-unnecessary #include, per report from Hiroshi Saito.	2006-07-16 01:05:23 +00:00
Tom Lane	daecd97617	Put back some more not-so-unused-as-all-that #includes. This un-breaks the EXEC_BACKEND code on my machines, so hopefully it will fix the Windows buildfarm members.	2006-07-15 15:47:17 +00:00
Tom Lane	cd24163f6d	Fix another passel of include-file breakage. Kris Jurka, Tom Lane	2006-07-14 16:59:19 +00:00
Bruce Momjian	e0522505bd	Remove 576 references of include files that were not needed.	2006-07-14 14:52:27 +00:00
Tom Lane	ae643747b1	Fix a passel of recently-committed violations of the rule 'thou shalt have no other gods before c.h'. Also remove some demonstrably redundant #include lines, mostly of <errno.h> which was added to c.h years ago.	2006-07-14 05:28:29 +00:00
Bruce Momjian	a22d76d96a	Allow include files to compile on their own. Strip unused include files out of include files, and add needed includes to C files. The next step is to remove unused include files in C files.	2006-07-13 16:49:20 +00:00
Bruce Momjian	370a709c75	Add GUC update_process_title to control whether 'ps' display is updated for every command, default to on.	2006-06-27 22:16:44 +00:00
Tom Lane	27c3e3de09	Remove redundant gettimeofday() calls to the extent practical without changing semantics too much. statement_timestamp is now set immediately upon receipt of a client command message, and the various places that used to do their own gettimeofday() calls to mark command startup are referenced to that instead. I have also made stats_command_string use that same value for pg_stat_activity.query_start for both the command itself and its eventual replacement by <IDLE> or <idle in transaction>. There was some debate about that, but no argument that seemed convincing enough to justify an extra gettimeofday() call.	2006-06-20 22:52:00 +00:00
Tom Lane	b13c9686d0	Take the statistics collector out of the loop for monitoring backends' current commands; instead, store current-status information in shared memory. This substantially reduces the overhead of stats_command_string and also ensures that pg_stat_activity is fully up to date at all times. Per my recent proposal.	2006-06-19 01:51:22 +00:00
Tom Lane	8ff80c1bd3	Remove obsolete comment about VACUUM FULL: it takes buffer content locks now, and must do so to ensure bgwriter doesn't write a page that is in process of being compacted.	2006-06-08 14:58:33 +00:00
Bruce Momjian	26cfefabad	Fix printf mask for SizeVfdCache Qingqing Zhou	2006-05-30 13:04:59 +00:00
Tom Lane	2246e31775	Upon closer inspection, the sparc code in s_lock.c is dead code, and always has been, because it's not got any .globl declaration! We've been relying on the solaris_sparc.s code instead. Rip it out. (Not back-patched, since this is just cosmetic cleanup.)	2006-05-12 16:50:52 +00:00
Tom Lane	ab1ad7a653	Remove unnecessary .seg/.section directives, per Alan Stange.	2006-05-11 21:58:22 +00:00
Tom Lane	5749f6ef0c	Rewrite btree vacuuming to fold the former bulkdelete and cleanup operations into a single mostly-physical-order scan of the index. This requires some ticklish interlocking considerations, but should create no material performance impact on normal index operations (at least given the already-committed changes to make scans work a page at a time). VACUUM itself should get significantly faster in any index that's degenerated to a very nonlinear page order. Also, we save one pass over the index entirely, except in the case where there were no deletions to do and so only one pass happened anyway. Original patch by Heikki Linnakangas, rework by Tom Lane.	2006-05-08 00:00:17 +00:00
Tom Lane	52667d56a3	Rethink the locking mechanisms used for CREATE/DROP/RENAME DATABASE. The former approach used ExclusiveLock on pg_database, which being a cluster-wide lock meant only one of these operations could proceed at a time; worse, it also blocked all incoming connections in ReverifyMyDatabase. Now that we have LockSharedObject(), we can use locks of different types applied to databases considered as objects. This allows much more flexible management of the interlocking: two CREATE DATABASEs need not block each other, and need not block connections except to the template database being used. Similarly DROP DATABASE doesn't block unrelated operations. The locking used in flatfiles.c is also much narrower in scope than before. Per recent proposal.	2006-05-04 16:07:29 +00:00
Bruce Momjian	a1ee621589	Fix s_lock_test to use tas.o file, if needed.	2006-04-28 22:54:31 +00:00
Tom Lane	486f994be7	Revise large-object access routines to avoid running with CurrentMemoryContext set to the large object context ("fscxt"), as this is inevitably a source of transaction-duration memory leaks. Not sure why we'd not noticed it before; maybe people weren't touching a whole lot of LOs in the same transaction before the 8.1 pg_dump changes. Per report from Wayne Conrad. Backpatched as far as 8.1, but the problem doubtless goes all the way back. I'm disinclined to spend the time to try to verify that the older branches would still work if patched, seeing that this code was significantly modified for 8.0 and again for 8.1, and that we don't have any trouble reports before 8.1. (Maybe the leaks were smaller before?)	2006-04-26 00:34:57 +00:00
Tom Lane	b5498a26de	Add some optional code (conditionally compiled under #ifdef LWLOCK_STATS) to track the number of LWLock acquisitions and the number of times we block waiting for an LWLock, on a per-process basis. After having needed this twice in the past few months, seems like it should go into CVS.	2006-04-21 16:45:12 +00:00
Tom Lane	defe93463c	Make the world safe for full_page_writes. Allow XLOG records that try to update no-longer-existing pages to fall through as no-ops, but make a note of each page number referenced by such records. If we don't see a later XLOG entry dropping the table or truncating away the page, complain at the end of XLOG replay. Since this fixes the known failure mode for full_page_writes = off, revert my previous band-aid patch that disabled that GUC variable.	2006-04-14 20:27:24 +00:00
Tom Lane	0fcc3c2f1d	Repair a low-probability race condition identified by Qingqing Zhou. If a process abandons a wait in LockBufferForCleanup (in practice, only happens if someone cancels a VACUUM) just before someone else sends it a signal indicating the buffer is available, it was possible for the wakeup to remain in the process' semaphore, causing misbehavior next time the process waited for an lmgr lock. Rather than try to prevent the race condition directly, it seems best to make the lock manager robust against leftover wakeups, by having it repeat waiting on the semaphore if the lock has not actually been granted or denied yet.	2006-04-14 03:38:56 +00:00
Tom Lane	a8b8f4db23	Clean up WAL/buffer interactions as per my recent proposal. Get rid of the misleadingly-named WriteBuffer routine, and instead require routines that change buffer pages to call MarkBufferDirty (which does exactly what it says). We also require that they do so before calling XLogInsert; this takes care of the synchronization requirement documented in SyncOneBuffer. Note that because bufmgr takes the buffer content lock (in shared mode) while writing out any buffer, it doesn't matter whether MarkBufferDirty is executed before the buffer content change is complete, so long as the content change is completed before releasing exclusive lock on the buffer. So it's OK to set the dirtybit before we fill in the LSN. This eliminates the former kluge of needing to set the dirtybit in LockBuffer. Aside from making the code more transparent, we can also add some new debugging assertions, in particular that the caller of MarkBufferDirty must hold the buffer content lock, not merely a pin.	2006-03-31 23:32:07 +00:00
Tom Lane	4243f2387a	Suppress attempts to report dropped tables to the stats collector from a startup or recovery process. Since such a process isn't a real backend, pgstat.c gets confused. This accounts for recent reports of strange "invalid server process ID -1" log messages during crash recovery. There isn't any point in attempting to make the report, since we'll discard stats in such scenarios anyhow.	2006-03-30 22:11:55 +00:00
Tom Lane	6d61cdec07	Clean up and document the API for XLogOpenRelation and XLogReadBuffer. This commit doesn't make much functional change, but it does eliminate some duplicated code --- for instance, PageIsNew tests are now done inside XLogReadBuffer rather than by each caller. The GIST xlog code still needs a lot of love, but I'll worry about that separately.	2006-03-29 21:17:39 +00:00
Tom Lane	0a20207060	Arrange to emit a description of the current XLOG record as error context when an error occurs during xlog replay. Also, replace the former risky 'write into a fixed-size buffer with no overflow detection' API for XLOG record description routines; use an expansible StringInfo instead. (The latter accounts for most of the patch bulk.) Qingqing Zhou	2006-03-24 04:32:13 +00:00
Bruce Momjian	f2f5b05655	Update copyright for 2006. Update scripts.	2006-03-05 15:59:11 +00:00
Tom Lane	60d3c9fdf4	Declare the arguments of AllocateFile() as const char *, not char *. This is consistent with the standard definition of fopen().	2006-03-04 21:32:47 +00:00
Tom Lane	9a506a6257	Arrange to call AbsorbFsyncRequests every so often while performing a checkpoint in the bgwriter. This forestalls overflow of the fsync request queue, which is not fatal but causes considerable performance degradation when it occurs (because backends then have to do their own fsyncs). Per patch from Itagaki Takahiro, modified a little bit by me.	2006-03-03 00:02:02 +00:00
Bruce Momjian	d5dd3d451e	Add contrib/pg_freespacemap to display free space map information. Mark Kirkwood	2006-02-12 03:55:53 +00:00
Bruce Momjian	59bb147353	Update random() usage so ranges are inclusive/exclusive as required.	2006-02-03 12:45:47 +00:00
Tom Lane	d5db3abfb6	Modify pgstats code to reduce performance penalties from oversized stats data files: avoid creating stats hashtable entries for tables that aren't being touched except by vacuum/analyze, ensure that entries for dropped tables are removed promptly, and tweak the data layout to avoid storing useless struct padding. Also improve the performance of pgstat_vacuum_tabstat(), and make sure that autovacuum invokes it exactly once per autovac cycle rather than multiple times or not at all. This should cure recent complaints about 8.1 showing much higher stats I/O volume than was seen in 8.0. It'd still be a good idea to revisit the design with an eye to not re-writing the entire stats dataset every half second ... but that would be too much to backpatch, I fear.	2006-01-18 20:35:06 +00:00
Tom Lane	558bc2584d	Fix fsync code to test whether F_FULLFSYNC is available, instead of assuming it always is on Darwin. Per report from Neil Brandt.	2006-01-17 23:52:31 +00:00
Tom Lane	39fc1fb07a	Remove logic in XactLockTableWait() that attempted to mark a crashed transaction as aborted. Since we only call XactLockTableWait on XIDs that we believe to be currently running, the odds of this code ever actually firing are minimal. It's certainly unnecessary, since a transaction that's not either running or committed will be presumed aborted anyway. What's more, it's not hard to imagine scenarios where this could result in corrupting pg_clog: for instance, if a bogus XID somehow got passed to XactLockTableWait. I think the code probably dates from the ancient era when we didn't have TransactionIdIsInProgress; back then it may have been necessary, but now I think it's a waste of cycles and potentially dangerous. Per discussion with Qingqing Zhou and Karsten Hilbert.	2006-01-13 21:32:12 +00:00
Tom Lane	304160c3e2	Fix ReadBuffer() to correctly handle the case where it's trying to extend the relation but it finds a pre-existing valid buffer. The buffer does not correspond to any page known to the kernel, so we must do smgrextend to ensure that the space becomes allocated. The 7.x branches all do this correctly, but the corner case got lost somewhere during 8.0 bufmgr rewrites. (My fault no doubt :-( ... I think I assumed that such a buffer must be not-BM_VALID, which is not so.)	2006-01-06 00:04:20 +00:00
Bruce Momjian	44f9021223	Remove BEOS port.	2006-01-05 03:01:38 +00:00
Tom Lane	349f40b2c2	Rearrange backend startup sequence so that ShmemIndexLock can become an LWLock instead of a spinlock. This hardly matters on Unix machines but should improve startup performance on Windows (or any port using EXEC_BACKEND). Per previous discussion.	2006-01-04 21:06:32 +00:00
Tom Lane	195f164228	Get rid of the SpinLockAcquire/SpinLockAcquire_NoHoldoff distinction in favor of having just one set of macros that don't do HOLD/RESUME_INTERRUPTS (hence, these correspond to the old SpinLockAcquire_NoHoldoff case). Given our coding rules for spinlock use, there is no reason to allow CHECK_FOR_INTERRUPTS to be done while holding a spinlock, and also there is no situation where ImmediateInterruptOK will be true while holding a spinlock. Therefore doing HOLD/RESUME_INTERRUPTS while taking/releasing a spinlock is just a waste of cycles. Qingqing Zhou and Tom Lane.	2005-12-29 18:08:05 +00:00
Tom Lane	fb3dbdf986	Rethink prior patch to filter out dead backend entries from the pgstats file. The original code probed the PGPROC array separately for each PID, which was not good for large numbers of backends: not only is the runtime O(N^2) but most of it is spent holding ProcArrayLock. Instead, take the lock just once and copy the active PIDs into an array, then use qsort and bsearch so that the lookup time is more like O(N log N).	2005-12-16 04:03:40 +00:00
Tom Lane	ec0baf949e	Divide the lock manager's shared state into 'partitions', so as to reduce contention for the former single LockMgrLock. Per my recent proposal. I set it up for 16 partitions, but on a pgbench test this gives only a marginal further improvement over 4 partitions --- we need to test more scenarios to choose the number of partitions.	2005-12-11 21:02:18 +00:00
Tom Lane	c599a247bb	Simplify lock manager data structures by making a clear separation between the data defining the semantics of a lock method (ie, conflict resolution table and ancillary data, which is all constant) and the hash tables storing the current state. The only thing we give up by this is the ability to use separate hashtables for different lock methods, but there is no need for that anyway. Put some extra fields into the LockMethod definition structs to clean up some other uglinesses, like hard-wired tests for DEFAULT_LOCKMETHOD and USER_LOCKMETHOD. This commit doesn't do anything about the performance issues we were discussing, but it clears away some of the underbrush that's in the way of fixing that.	2005-12-09 01:22:04 +00:00
Tom Lane	f38c3e778a	Fix thinko in comment.	2005-12-08 15:38:29 +00:00
Tom Lane	887a7c61f6	Get rid of slru.c's hardwired insistence on a fixed number of slots per SLRU area. The number of slots is still a compile-time constant (someday we might want to change that), but at least it's a different constant for each SLRU area. Increase number of subtrans buffers to 32 based on experimentation with a heavily subtrans-bashing test case, and increase number of multixact member buffers to 16, since it's obviously silly for it not to be at least twice the number of multixact offset buffers.	2005-12-06 23:08:34 +00:00
Tom Lane	a98871b7ac	Tweak indexscan machinery to avoid taking an AccessShareLock on an index if we already have a stronger lock due to the index's table being the update target table of the query. Same optimization I applied earlier at the table level. There doesn't seem to be much interest in the more radical idea of not locking indexes at all, so do what we can ...	2005-12-03 05:51:03 +00:00
Tom Lane	ace17c1d82	Retry in FileRead and FileWrite if Windows returns ERROR_NO_SYSTEM_RESOURCES. Also add a retry for Unixen returning EINTR, which hasn't been reported as an issue but at least theoretically could be. Patch by Qingqing Zhou, some minor adjustments by me.	2005-12-01 20:24:18 +00:00
Bruce Momjian	436a2956d8	Re-run pgindent, fixing a problem where comment lines after a blank comment line were output as too long, and update typedefs for /lib directory. Also fix case where identifiers were used as variable names in the backend, but as typedefs in ecpg (favor the backend for indenting). Backpatch to 8.1.X.	2005-11-22 18:17:34 +00:00
Tom Lane	c859308aba	DropRelFileNodeBuffers failed to fix the state of the lookup hash table that was added to localbuf.c in 8.1; therefore, applying it to a temp table left corrupt lookup state in memory. The only case where this had a significant chance of causing problems was an ON COMMIT DELETE ROWS temp table; the other possible paths left bogus state that was unlikely to be used again. Per report from Csaba Nagy.	2005-11-17 17:42:02 +00:00
Tom Lane	48052de722	Repair an error introduced by log_line_prefix patch: it is not acceptable to assume that the string pointer passed to set_ps_display is good forever. There's no need to anyway since ps_status.c itself saves the string, and we already had an API (get_ps_display) to return it. I believe this explains Jim Nasby's report of intermittent crashes in elog.c when %i format code is in use in log_line_prefix. While at it, repair a previously unnoticed problem: on some platforms such as Darwin, the string returned by get_ps_display was blank-padded to the maximum length, meaning that lock.c's attempt to append " waiting" to it never worked.	2005-11-05 03:04:53 +00:00
Peter Eisentraut	07bb9f086b	Message corrections	2005-10-29 00:31:52 +00:00
Tom Lane	fbbe00242d	Tweak buffer manager so that 'internal' accesses to a buffer do not advance its usage_count. This includes writes of dirty buffers triggered by bgwriter, checkpoint, or FlushRelationBuffers, as well as various corner cases that really ought not count as accesses to the page. Should make for some marginal improvement in the quality of our decisions about when to recycle buffers. Per suggestion from ITAGAKI Takahiro.	2005-10-27 17:07:58 +00:00
Bruce Momjian	1dc3498251	Standard pgindent run for 8.1.	2005-10-15 02:49:52 +00:00
Neil Conway	c10dba2fe3	Remove an antiquated comment.	2005-10-13 06:24:05 +00:00
Tom Lane	fa72121594	Fix another recently-changed place that was messing with spinlock-protected data structures and not using a volatile pointer for same.	2005-10-12 16:55:59 +00:00
Tom Lane	07eeb9d109	Do all accesses to shared buffer headers through volatile-qualified pointers, to ensure that compilers won't rearrange accesses to occur while we're not holding the buffer header spinlock. It's probably not necessary to mark volatile in every single place in bufmgr.c, but better safe than sorry. Per trouble report from Kevin Grittner.	2005-10-12 16:45:14 +00:00
Tom Lane	a72ee09090	Add infrastructure for making spins_per_delay variable depending on whether we seem to be running in a uniprocessor or multiprocessor. The adjustment rules could probably still use further tweaking, but I'm convinced this should be a win overall.	2005-10-11 20:41:32 +00:00
Tom Lane	82e861fbe1	Fix LWLockAssign() so that it can safely be executed after postmaster initialization. Add spinlocking, fix EXEC_BACKEND unsafeness.	2005-10-07 21:42:38 +00:00
Tom Lane	bb55e583f6	Allocate a few extra LWLocks for possible use by add-on modules. Per request from Marc Munro.	2005-10-07 20:11:03 +00:00
Bruce Momjian	4f915cd377	This patch cleans up the access to members of ItemIdData. It uses existing macros instead of touching directly. ITAGAKI Takahiro	2005-09-22 16:46:00 +00:00
Bruce Momjian	658657177e	Print proper cause of statement cancel, user interaction or timeout.	2005-09-19 17:21:49 +00:00
Tom Lane	dc06734a72	Force the size and alignment of LWLock array entries to be either 16 or 32 bytes. This shouldn't make any difference on x86 machines, where the size happened to be 16 bytes anyway, but on 64-bit machines and machines with slock_t int or wider, it will speed array indexing and hopefully reduce SMP cache contention effects. Per recent experimentation.	2005-09-16 00:30:05 +00:00
Tom Lane	396526d8c3	Adjust m68k spinlock code to avoid duplicate in-line and not-in-line definitions on recent Linux systems, per Martin Pitt.	2005-08-26 14:47:35 +00:00
Tom Lane	1a33436224	Replace out-of-line tas() assembly code for MIPS with a properly constrained GCC inline version. Thiemo Seufer, by way of Martin Pitt.	2005-08-25 17:17:10 +00:00
Tom Lane	0007490e09	Convert the arithmetic for shared memory size calculation from 'int' to 'Size' (that is, size_t), and install overflow detection checks in it. This allows us to remove the former arbitrary restrictions on NBuffers etc. It won't make any difference in a 32-bit machine, but in a 64-bit machine you could theoretically have terabytes of shared buffers. (How efficiently we could manage 'em remains to be seen.) Similarly, num_temp_buffers, work_mem, and maintenance_work_mem can be set above 2Gb on a 64-bit machine. Original patch from Koichi Suzuki, additional work by moi.	2005-08-20 23:26:37 +00:00
Tatsuo Ishii	bc3991c185	Add BackendXidGetPid().	2005-08-20 01:26:36 +00:00
Bruce Momjian	28d0515d18	Fix FSM warning to mention increasing max_fsm_pages. Was incorrectly max_fsm_relations.	2005-08-17 03:50:59 +00:00
Bruce Momjian	27639809d2	Reverse out Assert addition.	2005-08-12 23:13:54 +00:00
Bruce Momjian	fab177e64f	Improve documentation on loading large data sets into plperl. David Fetter	2005-08-12 21:42:53 +00:00
Tom Lane	3ae7e4a33b	Remove BufferBlockPointers array in favor of a base + (bufnum) * BLCKSZ computation. On modern machines this is as fast if not faster, and we don't have to clog the CPU's L2 cache with a tens-of-KB pointer array. If we ever decide to adopt a more dynamic allocation method for shared buffers, we'll probably have to revert this patch, but in the meantime we might as well save a few bytes and nanoseconds. Per Qingqing Zhou.	2005-08-12 05:05:51 +00:00
Tom Lane	721e53785d	Solve the problem of OID collisions by probing for duplicate OIDs whenever we generate a new OID. This prevents occasional duplicate-OID errors that can otherwise occur once the OID counter has wrapped around. Duplicate relfilenode values are also checked for when creating new physical files. Per my recent proposal.	2005-08-12 01:36:05 +00:00
Tom Lane	15269b5955	Avoid useless loop overhead in AtEOXact routines when the backend is compiled with USE_ASSERT_CHECKING but is running with assert_enabled false.	2005-08-08 19:44:22 +00:00
Tom Lane	7117cd3a77	Cause ShutdownPostgres to do a normal transaction abort during backend exit, instead of trying to take shortcuts. Introduce some additional shutdown callback routines to eliminate kluges like having ProcKill be responsible for shutting down the buffer manager. Ensure that the order of operations during shutdown is predictable and what you would expect given the module layering.	2005-08-08 03:12:16 +00:00
Tom Lane	5337ad464e	Fix count_usable_fds() to stop trying to open files once it reaches max_files_per_process. Going further than that is just a waste of cycles, and it seems that current Cygwin does not cope gracefully with deliberately running the system out of FDs. Per Andrew Dunstan.	2005-08-07 18:47:19 +00:00
Tom Lane	6eac4e69cf	Tweak BgBufferSync() so that a persistent write error on a dirty buffer doesn't block the bgwriter from making progress writing out other buffers. This was a hard problem in the context of the ARC/2Q design, but it's trivial in the context of clock sweep ... just advance the sweep counter before we try to write, not after.	2005-08-02 20:52:08 +00:00
Tom Lane	2a4fad1a0e	Add NOWAIT option to SELECT FOR UPDATE/SHARE. Original patch by Hans-Juergen Schoenig, revisions by Karel Zak and Tom Lane.	2005-08-01 20:31:16 +00:00
Tom Lane	d42cf5a42a	Add per-user and per-database connection limit options. This patch also includes preliminary update of pg_dumpall for roles. Petr Jelinek, with review by Bruce Momjian and Tom Lane.	2005-07-31 17:19:22 +00:00
Bruce Momjian	1521aef1db	SUNOS4_CC -> SUNOS_CC.	2005-07-30 03:07:42 +00:00
Tom Lane	eb5949d190	Arrange for the postmaster (and standalone backends, initdb, etc) to chdir into PGDATA and subsequently use relative paths instead of absolute paths to access all files under PGDATA. This seems to give a small performance improvement, and it should make the system more robust against naive DBAs doing things like moving a database directory that has a live postmaster in it. Per recent discussion.	2005-07-04 04:51:52 +00:00
Tom Lane	b95ae32b41	Avoid WAL-logging individual tuple insertions during CREATE TABLE AS (a/k/a SELECT INTO). Instead, flush and fsync the whole relation before committing. We do still need the WAL log when PITR is active, however. Simon Riggs and Tom Lane.	2005-06-20 18:37:02 +00:00
Tom Lane	3f749924f8	Simplify uses of readdir() by creating a function ReadDir() that includes error checking and an appropriate ereport(ERROR) message. This gets rid of rather tedious and error-prone manipulation of errno, as well as a Windows-specific bug workaround, at more than a dozen call sites. After an idea in a recent patch by Heikki Linnakangas.	2005-06-19 21:34:03 +00:00
Tom Lane	d0a89683a3	Two-phase commit. Original patch by Heikki Linnakangas, with additional hacking by Alvaro Herrera and Tom Lane.	2005-06-17 22:32:51 +00:00
Tom Lane	8563ccae2c	Simplify shared-memory lock data structures as per recent discussion: it is sufficient to track whether a backend holds a lock or not, and store information about transaction vs. session locks only in the inside-the-backend LocalLockTable. Since there can now be but one PROCLOCK per lock per backend, LockCountMyLocks() is no longer needed, thus eliminating some O(N^2) behavior when a backend holds many locks. Also simplify the LockAcquire/LockRelease API by passing just a 'sessionLock' boolean instead of a transaction ID. The previous API was designed with the idea that per-transaction lock holding would be important for subtransactions, but now that we have subtransactions we know that this is unwanted. While at it, add an 'isTempObject' parameter to LockAcquire to indicate whether the lock is being taken on a temp table. This is not used just yet, but will be needed shortly for two-phase commit.	2005-06-14 22:15:33 +00:00
Tom Lane	a2fb7b8a1f	Adjust lo_open() so that specifying INV_READ without INV_WRITE creates a descriptor that uses the current transaction snapshot, rather than SnapshotNow as it did before (and still does if INV_WRITE is set). This means pg_dump will now dump a consistent snapshot of large object contents, as it never could do before. Also, add a lo_create() function that is similar to lo_creat() but allows the desired OID of the large object to be specified. This will simplify pg_restore considerably (but I'll fix that in a separate commit).	2005-06-13 02:26:53 +00:00
Tom Lane	ee7ac7b11e	Modify XLogInsert API to make callers specify whether pages to be backed up have the standard layout with unused space between pd_lower and pd_upper. When this is set, XLogInsert will omit the unused space without bothering to scan it to see if it's zero. That saves time in XLogInsert, and also allows reversion of my earlier patch to make PageRepairFragmentation et al explicitly re-zero freed space. Per suggestion by Heikki Linnakangas.	2005-06-06 20:22:58 +00:00
Tom Lane	4c8495a1f2	Remove the mostly-stubbed-out-anyway support routines for WAL UNDO. That code is never going to be used in the foreseeable future, and where it's more than a stub it's making the redo routines harder to read.	2005-06-06 17:01:25 +00:00
Tom Lane	21fda22ec4	Change CRCs in WAL records from 64bit to 32bit for performance reasons. Instead of a separate CRC on each backup block, include backup blocks in their parent WAL record's CRC; this is important to ensure that the backup block really goes with the WAL record, ie there was not a page tear right at the start of the backup block. Implement a simple form of compression of backup blocks: drop any run of zeroes starting at pd_lower, so as not to store the unused 'hole' that commonly exists in PG heap and index pages. Tweak PageRepairFragmentation and related routines to ensure they keep the unused space zeroed, so that the above compression method remains effective. All per recent discussions.	2005-06-02 05:55:29 +00:00
Tom Lane	140b078d2a	Improve LockAcquire API per my recent proposal. All error conditions are now reported via elog, eliminating the need to test the result code at most call sites. Make it possible for the caller to distinguish a freshly acquired lock from one already held in the current transaction. Use that capability to avoid redundant AcceptInvalidationMessages() calls in LockRelation().	2005-05-29 22:45:02 +00:00
Tom Lane	e92a88272e	Modify hash_search() API to prevent future occurrences of the error spotted by Qingqing Zhou. The HASH_ENTER action now automatically fails with elog(ERROR) on out-of-memory --- which incidentally lets us eliminate duplicate error checks in quite a bunch of places. If you really need the old return-NULL-on-out-of-memory behavior, you can ask for HASH_ENTER_NULL. But there is now an Assert in that path checking that you aren't hoping to get that behavior in a palloc-based hash table. Along the way, remove the old HASH_FIND_SAVE/HASH_REMOVE_SAVED actions, which were not being used anywhere anymore, and were surely too ugly and unsafe to want to see revived again.	2005-05-29 04:23:07 +00:00
Bruce Momjian	6dc7760ac3	Add support for wal_fsync_writethrough for Darwin, and restructure the code to better handle writethrough. Chris Campbell	2005-05-20 14:53:26 +00:00
Tom Lane	f519d04a43	Update comment that I missed the first time around.	2005-05-19 23:57:11 +00:00
Tom Lane	191b13aaca	Factor out lock cleanup code that is needed in several places in lock.c. Also, remove the rather useless return value of LockReleaseAll. Change response to detection of corruption in the shared lock tables to PANIC, since that is the only way of cleaning up fully. Originally an idea of Heikki Linnakangas, variously hacked on by Alvaro Herrera and Tom Lane.	2005-05-19 23:30:18 +00:00
Tom Lane	ee3b71f6bc	Split the shared-memory array of PGPROC pointers out of the sinval communication structure, and make it its own module with its own lock. This should reduce contention at least a little, and it definitely makes the code seem cleaner. Per my recent proposal.	2005-05-19 21:35:48 +00:00
Neil Conway	f38e413b20	Code cleanup: in C89, there is no point casting the first argument to memset() or MemSet() to a char *. For one, memset()'s first argument is a void *, and further void * can be implicitly coerced to/from any other pointer type.	2005-05-11 01:26:02 +00:00
Tom Lane	93b2477278	Use the standard lock manager to establish priority order when there is contention for a tuple-level lock. This solves the problem of a would-be exclusive locker being starved out by an indefinite succession of share-lockers. Per recent discussion with Alvaro.	2005-04-30 19:03:33 +00:00
Tom Lane	3a694bb0a1	Restructure LOCKTAG as per discussions of a couple months ago. Essentially, we shoehorn in a lockable-object-type field by taking a byte away from the lockmethodid, which can surely fit in one byte instead of two. This allows less artificial definitions of all the other fields of LOCKTAG; we can get rid of the special pg_xactlock pseudo-relation, and also support locks on individual tuples and general database objects (including shared objects). None of those possibilities are actually exploited just yet, however. I removed pg_xactlock from pg_class, but did not force initdb for that change. At this point, relkind 's' (SPECIAL) is unused and could be removed entirely.	2005-04-29 22:28:24 +00:00
Tom Lane	bedb78d386	Implement sharable row-level locks, and use them for foreign key references to eliminate unnecessary deadlocks. This commit adds SELECT ... FOR SHARE paralleling SELECT ... FOR UPDATE. The implementation uses a new SLRU data structure (managed much like pg_subtrans) to represent multiple-transaction-ID sets. When more than one transaction is holding a shared lock on a particular row, we create a MultiXactId representing that set of transactions and store its ID in the row's XMAX. This scheme allows an effectively unlimited number of row locks, just as we did before, while not costing any extra overhead except when a shared lock actually has to be shared. Still TODO: use the regular lock manager to control the grant order when multiple backends are waiting for a row lock. Alvaro Herrera and Tom Lane.	2005-04-28 21:47:18 +00:00
Bruce Momjian	3b0a5e50d7	Update VACUUM VERBOSE FSM message, per Tom.	2005-04-24 03:51:49 +00:00
Bruce Momjian	714d5a4c37	Update VACUUM VERBOSE update, per Alvaro.	2005-04-23 21:16:34 +00:00
Bruce Momjian	9ba6587f8b	Update wording of VACUUM VERBOSE.	2005-04-23 21:10:20 +00:00
Bruce Momjian	52e08c35f7	Make VACUUM VERBOSE FSM output all output in a single INFO output statement.	2005-04-23 20:56:01 +00:00
Bruce Momjian	e947e1153a	Modify output of VACUUM VERBOSE to be clearer.	2005-04-23 15:20:39 +00:00
Neil Conway	ea208aca00	Remove an unused variable "waitingForSignal". From Qingqing Zhou.	2005-04-15 04:18:10 +00:00
Tom Lane	162bd08b3f	Completion of project to use fixed OIDs for all system catalogs and indexes. Replace all heap_openr and index_openr calls by heap_open and index_open. Remove runtime lookups of catalog OID numbers in various places. Remove relcache's support for looking up system catalogs by name. Bulky but mostly very boring patch ...	2005-04-14 20:03:27 +00:00
Tom Lane	2193a856a2	Simplify initdb-time assignment of OIDs as I proposed yesterday, and avoid encroaching on the 'user' range of OIDs by allowing automatic OID assignment to use values below 16k until we reach normal operation. initdb not forced since this doesn't make any incompatible change; however a lot of stuff will have different OIDs after your next initdb.	2005-04-13 18:54:57 +00:00
Tom Lane	badb83f9ec	If we're going to have a non-panic check for held_lwlocks[] overrun, it must occur before we get into the critical state of holding a lock we have no place to record. Per discussion with Qingqing Zhou.	2005-04-08 14:18:35 +00:00
Tom Lane	e794dfa511	Use an always-there test, not an Assert, to check for overrun of the held_lwlocks[] array. Per Qingqing Zhou.	2005-04-08 03:43:54 +00:00
Neil Conway	5b1c607abe	Remove an unused variable `ShmemBootstrap', and remove an obsolete comment. Patch from Alvaro.	2005-04-04 04:34:41 +00:00
Tom Lane	94e03330cb	Create a routine PageIndexMultiDelete() that replaces a loop around PageIndexTupleDelete() with a single pass of compactification --- logic mostly lifted from PageRepairFragmentation. I noticed while profiling that a VACUUM that's cleaning up a whole lot of deleted tuples would spend as much as a third of its CPU time in PageIndexTupleDelete; not too surprising considering the loop method was roughly O(N^2) in the number of tuples involved.	2005-03-22 06:17:03 +00:00
Tom Lane	354049c709	Remove unnecessary calls of FlushRelationBuffers: there is no need to write out data that we are about to tell the filesystem to drop. smgr_internal_unlink already had a DropRelFileNodeBuffers call to get rid of dead buffers without a write after it's no longer possible to roll back the deleting transaction. Adding a similar call in smgrtruncate simplifies callers and makes the overall division of labor clearer. This patch removes the former behavior that VACUUM would write all dirty buffers of a relation unconditionally.	2005-03-20 22:00:54 +00:00
Tom Lane	91728fa26c	Add temp_buffers GUC variable to allow users to determine the size of the local buffer arena for temporary table access.	2005-03-19 23:27:11 +00:00
Tom Lane	d65522aeb6	Upgrade localbuf.c to use a hash table instead of linear search to find already-allocated local buffers. This is the last obstacle in the way of setting NLocBuffer to something reasonably large.	2005-03-19 17:39:43 +00:00
Tom Lane	88164799ce	Need to reset local buffer pin counts, not only shared buffer pins, before we attempt any file deletions in ShutdownPostgres. Per Tatsuo.	2005-03-18 16:16:09 +00:00
Tom Lane	cef01c3355	Avoid infinite loop in InvalidateBuffer if we ourselves are holding a pin on the victim buffer.	2005-03-18 05:25:23 +00:00
Bruce Momjian	2c4dea126a	Issue free space notices to both the user and the server log file.	2005-03-14 20:15:09 +00:00
Bruce Momjian	45905425a0	Add warning about the need to increase "max_fsm_relations" and "max_fsm_pages" for vacuums. Also improve VACUUM VERBOSE final message text. Ron Mayer	2005-03-12 05:21:52 +00:00
Neil Conway	c129c16492	Slight refactoring and optimization of some code in WaitOnLock().	2005-03-11 03:52:06 +00:00
Tom Lane	5d5087363d	Replace the BufMgrLock with separate locks on the lookup hashtable and the freelist, plus per-buffer spinlocks that protect access to individual shared buffer headers. This requires abandoning a global freelist (since the freelist is a global contention point), which shoots down ARC and 2Q as well as plain LRU management. Adopt a clock sweep algorithm instead. Preliminary results show substantial improvement in multi-backend situations.	2005-03-04 20:21:07 +00:00
Tom Lane	a2ad04f4b0	Release proclock immediately in RemoveFromWaitQueue() if it represents no held locks. This maintains the invariant that proclocks are present only for procs that are holding or awaiting a lock; when this is not true, LockRelease will fail. Per report from Stephen Clouse.	2005-03-01 21:14:59 +00:00
Bruce Momjian	0542b1e2fe	Use _() macro consistently rather than gettext(). Add translation macros around strings that were missing them.	2005-02-22 04:43:23 +00:00
Neil Conway	11635c3f6f	Refactor some duplicated code in lock.c: create UnGrantLock(), move code from LockRelease() and LockReleaseAll() into it. From Heikki Linnakangas.	2005-02-04 02:04:53 +00:00
Tom Lane	cc4f58f4cd	Ensure that all details of the ARC algorithm are hidden within freelist.c. This refactoring does not change any algorithms or data structures, just remove visibility of the ARC datastructures from other source files.	2005-02-03 23:29:19 +00:00
Neil Conway	a885ecd6ef	Change heap_modifytuple() to require a TupleDesc rather than a Relation. Patch from Alvaro Herrera, minor editorializing by Neil Conway.	2005-01-27 23:24:11 +00:00
Tom Lane	0ce4d56924	Phase 1 of fix for 'SMgrRelation hashtable corrupted' problem. This is the minimum required fix. I want to look next at taking advantage of it by simplifying the message semantics in the shared inval message queue, but that part can be held over for 8.1 if it turns out too ugly.	2005-01-10 20:02:24 +00:00
Tom Lane	c9d8edc906	Repair bufmgr deadlock problem reported by Michael Wildpaner. Must take share lock on a buffer being written out before releasing BufMgrLock in the BufferAlloc code path; if we do it later we might block on someone who's re-pinned the buffer. I believe this is only an issue for BufferAlloc and not the other places that call FlushBuffer. BufferSync must continue to do it the old way since it may well be trying to write buffers that other backends have pinned; but it should not be holding any conflicting locks. FlushRelationBuffers is okay since it's got exclusive lock at the relation level.	2005-01-03 18:49:41 +00:00
PostgreSQL Daemon	2ff501590b	Tag appropriate files for rc3 Also performed an initial run through of upgrading our Copyright date to extend to 2005 ... first run here was very simple ... change everything where: grep 1996-2004 && the word 'Copyright' ... scanned through the generated list with 'less' first, and after, to make sure that I only picked up the right entries ...	2004-12-31 22:04:05 +00:00
Tom Lane	96ecf9d5aa	Support Sun's compiler on SunOS4 (a/k/a Solaris 9). Per ayan@ayan.net	2004-12-29 23:47:40 +00:00
Tom Lane	eee5abce46	Refactor EXEC_BACKEND code so that postmaster child processes reattach to shared memory as soon as possible, ie, right after read_backend_variables. The effective difference from the original code is that this happens before instead of after read_nondefault_variables(), which loads GUC information and is apparently capable of expanding the backend's memory allocation more than you'd think it should. This should fix the failure-to-attach-to-shared-memory reports we've been seeing on Windows. Also clean up a few bits of unnecessarily grotty EXEC_BACKEND code.	2004-12-29 21:36:09 +00:00
Bruce Momjian	08690d0688	Allow NetBSD, m68k to compile the ASM spinlock code. Rémi Zara	2004-12-18 22:12:52 +00:00
Neil Conway	4acc97d7e4	Assert that BufferIsPinned() in IncrBufferRefCount(), rather than using a home-brewed combination of assertions that boiled down to the same thing.	2004-11-24 02:56:17 +00:00
Tom Lane	8ecbc46bdb	Reduce the default size of the local lock hash table. There's usually no need for it to be nearly as big as the global hash table, and since it's not in shared memory it can grow if it does need to be bigger. By reducing the size, we speed up hash_seq_search(), which saves a significant fraction of subtransaction entry/exit overhead.	2004-11-20 20:16:54 +00:00
Peter Eisentraut	0ed3c7665e	Small message clarifications	2004-11-05 17:11:34 +00:00
Neil Conway	8ec05b28b7	Modify hash_create() to elog(ERROR) if an error occurs, rather than returning a NULL pointer (some callers remembered to check the return value, but some did not -- it is safer to just bail out). Also, cleanup pgstat.c to use elog(ERROR) rather than elog(LOG) followed by exit().	2004-10-25 00:46:43 +00:00
Tom Lane	4347cc2392	Allow background writing to be shut down by setting limit values to zero. This does not disable the bgwriter process: it still has to wake up often enough to collect fsync requests from backends in a timely fashion. But it responds to the recent gripe about not being able to prevent the disk from being spun up constantly.	2004-10-17 22:01:51 +00:00
Tom Lane	fdd13f1568	Give the ResourceOwner mechanism full responsibility for releasing buffer pins at end of transaction, and reduce AtEOXact_Buffers to an Assert cross-check that this was done correctly. When not USE_ASSERT_CHECKING, AtEOXact_Buffers is a complete no-op. This gets rid of an O(NBuffers) bottleneck during transaction commit/abort, which recent testing has shown becomes significant above a few tens of thousands of shared buffers.	2004-10-16 18:57:26 +00:00
Tom Lane	1c2de47746	Remove BufferLocks[] array in favor of a single pointer to the buffer (if any) currently waited for by LockBufferForCleanup(), which is all that we were using it for anymore. Saves some space and eliminates proportional-to-NBuffers slowdown in UnlockBuffers().	2004-10-16 18:05:07 +00:00
Tom Lane	9ffc8ed58b	Repair possible failure to update hint bits back to disk, per http://archives.postgresql.org/pgsql-hackers/2004-10/msg00464.php. This fix is intended to be permanent: it moves the responsibility for calling SetBufferCommitInfoNeedsSave() into the tqual.c routines, eliminating the requirement for callers to test whether t_infomask changed. Also, tighten validity checking on buffer IDs in bufmgr.c --- several routines were paranoid about out-of-range shared buffer numbers but not about out-of-range local ones, which seems a tad pointless.	2004-10-15 22:40:29 +00:00
Neil Conway	0683a47556	Allow the spinlock test to be compiled successfully in a vpath build.	2004-10-07 00:08:04 +00:00
Tom Lane	0fb3152ea9	Minor adjustments to improve the accuracy of our computation of required shared memory size.	2004-09-29 15:15:56 +00:00
Tom Lane	3a246cc285	Arrange to preallocate all required space for the buffer and FSM hash tables in shared memory. This ensures that overflow of the lock table creates no long-lasting problems. Per discussion with Merlin Moncure.	2004-09-28 20:46:37 +00:00
Tom Lane	86fff990b2	RecentXmin is too recent to use as the cutoff point for accessing pg_subtrans --- what we need is the oldest xmin of any snapshot in use in the current top transaction. Introduce a new variable TransactionXmin to play this role. Fixes intermittent regression failure reported by Neil Conway.	2004-09-16 18:35:23 +00:00
Tom Lane	8f9f198603	Restructure subtransaction handling to reduce resource consumption, as per recent discussions. Invent SubTransactionIds that are managed like CommandIds (ie, counter is reset at start of each top transaction), and use these instead of TransactionIds to keep track of subtransaction status in those modules that need it. This means that a subtransaction does not need an XID unless it actually inserts/modifies rows in the database. Accordingly, don't assign it an XID nor take a lock on the XID until it tries to do that. This saves a lot of overhead for subtransactions that are only used for error recovery (eg plpgsql exceptions). Also, arrange to release a subtransaction's XID lock as soon as the subtransaction exits, in both the commit and abort cases. This avoids holding many unique locks after a long series of subtransactions. The price is some additional overhead in XactLockTableWait, but that seems acceptable. Finally, restructure the state machine in xact.c to have a more orthogonal set of states for subtransactions.	2004-09-16 16:58:44 +00:00
Tom Lane	abc98dcc15	When LockAcquire fails at the stage of creating a proclock object, be sure to clean up the already-created lock object, if it has no other references. Avoids possibly-permanent leak of shared memory.	2004-09-12 18:30:50 +00:00
Tom Lane	083258e535	Fix a number of places where brittle data structures or overly strong Asserts would lead to a server core dump if an error occurred while trying to abort a failed subtransaction (thereby leading to re-execution of whatever parts of AbortSubTransaction had already run). This of course does not prevent such an error from creating an infinite loop, but at least we don't make the situation worse. Responds to an open item on the subtransactions to-do list.	2004-09-06 23:33:48 +00:00
Tom Lane	23645f0582	Fix incorrect ordering of smgr cleanup relative to buffer pin cleanup during transaction abort. Add a regression test case to catch related mistakes in future. Alvaro Herrera and Tom Lane.	2004-09-06 17:56:33 +00:00
Tom Lane	eb917c1a21	I can't see any good reason for DropRelFileNodeBuffers to be issuing FATAL when it detects a nonzero reference count. Reduce to ERROR.	2004-09-06 17:31:32 +00:00
Tom Lane	a421b4e850	FlushRelationBuffers was also being a bit cavalier about whether the relation is already opened by smgr.	2004-08-31 16:13:06 +00:00
Tom Lane	332ee2dc41	Improve spinlock selftest to make it able to detect misdeclaration of the slock_t datatype (ie, declared type smaller than what the hardware TAS instruction needs).	2004-08-30 23:47:20 +00:00
Tom Lane	303e46ea93	Tweak md.c logic to cope with the situation where WAL replay tries to write into a high-numbered segment of a relation that was later deleted. We need to temporarily recreate missing segment files, instead of failing.	2004-08-30 03:52:43 +00:00
Bruce Momjian	15d3f9f6b7	Another pgindent run with lib typedefs added.	2004-08-30 02:54:42 +00:00
Bruce Momjian	b6b71b85bc	Pgindent run for 8.0.	2004-08-29 05:07:03 +00:00
Bruce Momjian	da9a8649d8	Update copyright to 2004.	2004-08-29 04:13:13 +00:00
Tom Lane	1785acebf2	Introduce local hash table for lock state, as per recent proposal. PROCLOCK structs in shared memory now have only a bitmask for held locks, rather than counts (making them 40 bytes smaller, which is a good thing). Multiple locks within a transaction are counted in the local hash table instead, and we have provision for tracking which ResourceOwner each count belongs to. Solves recently reported problem with memory leakage within long transactions.	2004-08-27 17:07:42 +00:00
Tom Lane	337b513e07	Fix user locks. Broken some time ago for all platforms by Windows-related changes.	2004-08-26 17:23:30 +00:00
Tom Lane	4dbb880d3c	Rearrange pg_subtrans handling as per recent discussion. pg_subtrans updates are no longer WAL-logged nor even fsync'd; we do not need to, since after a crash no old pg_subtrans data is needed again. We truncate pg_subtrans to RecentGlobalXmin at each checkpoint. slru.c's API is refactored a little bit to separate out the necessary decisions.	2004-08-23 23:22:45 +00:00
Tom Lane	f009c316ba	Tweak code so that pg_subtrans is never consulted for XIDs older than RecentXmin (== MyProc->xmin). This ensures that it will be safe to truncate pg_subtrans at RecentGlobalXmin, which should largely eliminate any fear of bloat. Along the way, eliminate SubTransXidsHaveCommonAncestor, which isn't really needed and could not give a trustworthy result anyway under the lookback restriction. In an unrelated but nearby change, #ifdef out GetUndoRecPtr, which has been dead code since 2001 and seems unlikely to ever be resurrected.	2004-08-22 02:41:58 +00:00
Tom Lane	1a3de15a3a	Dept. of further reflection: I looked around to see if any other callers of XLogInsert had the same sort of checkpoint interlock problem as RecordTransactionCommit, and indeed I found some. Btree index build and ALTER TABLE SET TABLESPACE write data outside the friendly confines of the buffer manager, and therefore they have to take their own responsibility for checkpoint interlock. The easiest solution seems to be to force smgrimmedsync at the end of the index build or table copy, even when the operation is being WAL-logged. This is sufficient since the new index or table will be of interest to no one if we don't get as far as committing the current transaction.	2004-08-15 23:44:46 +00:00
Tom Lane	057ea3471f	Xmin calculations should consider only top transaction IDs, and therefore starting with GetCurrentTransactionId is wrong. Fixes miscomputation of RecentGlobalXmin leading to bizarre behavior reported by Gavin Sherry.	2004-08-15 17:03:36 +00:00
Tom Lane	efcaf1e868	Some mop-up work for savepoints (nested transactions). Store a small number of active subtransaction XIDs in each backend's PGPROC entry, and use this to avoid expensive probes into pg_subtrans during TransactionIdIsInProgress. Extend EOXactCallback API to allow add-on modules to get control at subxact start/end. (This is deliberately not compatible with the former API, since any uses of that API probably need manual review anyway.) Add basic reference documentation for SAVEPOINT and related commands. Minor other cleanups to check off some of the open issues for subtransactions. Alvaro Herrera and Tom Lane.	2004-08-01 17:32:22 +00:00
Tom Lane	a393fbf937	Restructure error handling as recently discussed. It is now really possible to trap an error inside a function rather than letting it propagate out to PostgresMain. You still have to use AbortCurrentTransaction to clean up, but at least the error handling itself will cooperate.	2004-07-31 00:45:57 +00:00
Tom Lane	1bf3d61504	Fix subtransaction behavior for large objects, temp namespace, files, password/group files. Also allow read-only subtransactions of a read-write parent, but not vice versa. These are the reasonably noncontroversial parts of Alvaro's recent mop-up patch, plus further work on large objects to minimize use of the TopTransactionResourceOwner.	2004-07-28 14:23:31 +00:00
Tom Lane	cc813fc2b8	Replace nested-BEGIN syntax for subtransactions with spec-compliant SAVEPOINT/RELEASE/ROLLBACK-TO syntax. (Alvaro) Cause COMMIT of a failed transaction to report ROLLBACK instead of COMMIT in its command tag. (Tom) Fix a few loose ends in the nested-transactions stuff.	2004-07-27 05:11:48 +00:00
Tom Lane	2042b3428d	Invent WAL timelines, as per recent discussion, to make point-in-time recovery more manageable. Also, undo recent change to add FILE_HEADER and WASTED_SPACE records to XLOG; instead make the XLOG page header variable-size with extra fields in the first page of an XLOG file. This should fix the boundary-case bugs observed by Mark Kirkwood. initdb forced due to change of XLOG representation.	2004-07-21 22:31:26 +00:00
Tom Lane	fe548629c5	Invent ResourceOwner mechanism as per my recent proposal, and use it to keep track of portal-related resources separately from transaction-related resources. This allows cursors to work in a somewhat sane fashion with nested transactions. For now, cursor behavior is non-subtransactional, that is a cursor's state does not roll back if you abort a subtransaction that fetched from the cursor. We might want to change that later.	2004-07-17 03:32:14 +00:00
Tom Lane	8801110b20	Move TablespaceCreateDbspace() call into smgrcreate(), which is where it probably should have been to begin with; this is to cover cases like needing to recreate the per-db directory during WAL replay. Also, fix heap_create to force pg_class.reltablespace to be zero instead of the database's default tablespace; this makes the world safe for CREATE DATABASE to handle all tables in the default tablespace alike, as per previous discussion. And force pg_class.reltablespace to zero when creating a relation without physical storage (eg, a view); this avoids possibly having dangling references in this column after a subsequent DROP TABLESPACE.	2004-07-11 19:52:52 +00:00
Tom Lane	77a436ba55	Fix seriously nasty memory leak in new TransactionIdIsInProgress code.	2004-07-01 03:13:05 +00:00
Tom Lane	573a71a5da	Nested transactions. There is still much left to do, especially on the performance front, but with feature freeze upon us I think it's time to drive a stake in the ground and say that this will be in 7.5. Alvaro Herrera, with some help from Tom Lane.	2004-07-01 00:52:04 +00:00
Tom Lane	c1d9dec3e3	Looks like s_lock_test needs <time.h> on some platforms.	2004-06-19 20:31:55 +00:00
Tom Lane	1232878159	s_lock_test requires libpgport to build now.	2004-06-19 19:43:11 +00:00
Tom Lane	2467394ee1	Tablespaces. Alternate database locations are dead, long live tablespaces. There are various things left to do: contrib dbsize and oid2name modules need work, and so does the documentation. Also someone should think about COMMENT ON TABLESPACE and maybe RENAME TABLESPACE. Also initlocation is dead, it just doesn't know it yet. Gavin Sherry and Tom Lane.	2004-06-18 06:14:31 +00:00
Tom Lane	bbf0ebadaf	StrategyDirtyBufferList wasn't being careful to honor max_buffers limit. Bug is only latent given that sole caller is passing NBuffers, but it could bite someone in the rear someday.	2004-06-11 17:20:39 +00:00
Tom Lane	e6cba71503	Add some code to Assert that when we release pin on a buffer, we are not holding the buffer's cntx_lock or io_in_progress_lock. A recent report from Litao Wu makes me wonder whether it is ever possible for us to drop a buffer and forget to release its cntx_lock. The Assert does not fire in the regression tests, but that proves little ...	2004-06-11 16:43:24 +00:00
Bruce Momjian	a1ccbb9019	Previous code cleanup was for bufpage.c, not bufmgr.c. This cleanup just cleans up a comment.	2004-06-09 13:11:34 +00:00
Bruce Momjian	ce04221a1e	Stylistic changes in bufmgr.c Basically replaces (*a).b with a->b as it is everywhere else in Postgres. Manfred Koizar	2004-06-08 14:00:35 +00:00
Tom Lane	c3a153afed	Tweak palloc/repalloc to allow zero bytes to be requested, as per recent proposal. Eliminate several dozen now-unnecessary hacks to avoid palloc(0). (It's likely there are more that I didn't find.)	2004-06-05 19:48:09 +00:00
Tom Lane	921d749bd4	Adjust our timezone library to use pg_time_t (typedef'd as int64) in place of time_t, as per prior discussion. The behavior does not change on machines without a 64-bit-int type, but on machines with one, which is most, we are rid of the bizarre boundary behavior at the edges of the 32-bit-time_t range (1901 and 2038). The system will now treat times over the full supported timestamp range as being in your local time zone. It may seem a little bizarre to consider that times in 4000 BC are PST or EST, but this is surely at least as reasonable as propagating Gregorian calendar rules back that far. I did not modify the format of the zic timezone database files, which means that for the moment the system will not know about daylight-savings periods outside the range 1901-2038. Given the way the files are set up, it's not a simple decision like 'widen to 64 bits'; we have to actually think about the range of years that need to be supported. We should probably inquire what the plans of the upstream zic people are before making any decisions of our own.	2004-06-03 02:08:07 +00:00
Bruce Momjian	e8d9d68ca4	Per previous discussions, here are two functions to send INT and TERM (cancel and terminate) signals to other backends. They permit only INT and TERM, and permit sending only to postgresql backends. Magnus Hagander	2004-06-02 21:29:29 +00:00
Tom Lane	2095206de1	Adjust btree index build to not use shared buffers, thereby avoiding the locking conflict against concurrent CHECKPOINT that was discussed a few weeks ago. Also, if not using WAL archiving (which is always true ATM but won't be if PITR makes it into this release), there's no need to WAL-log the index build process; it's sufficient to force-fsync the completed index before commit. This seems to gain about a factor of 2 in my tests, which is consistent with writing half as much data. I did not try it with WAL on a separate drive though --- probably the gain would be a lot less in that scenario.	2004-06-02 17:28:18 +00:00
Tom Lane	91d20ff7aa	Additional mop-up for sync-to-fsync changes: avoid issuing fsyncs for temp tables, and avoid WAL-logging truncations of temp tables. Do issue fsync on truncated files (not sure this is necessary but it seems like a good idea).	2004-05-31 20:31:33 +00:00
Tom Lane	e674707968	Minor code rationalization: FlushRelationBuffers just returns void, rather than an error code, and does elog(ERROR) not elog(WARNING) when it detects a problem. All callers were simply elog(ERROR)'ing on failure return anyway, and I find it hard to envision a caller that would not, so we may as well simplify the callers and produce the more useful error message directly.	2004-05-31 19:24:05 +00:00
Tom Lane	9b178555fc	Per previous discussions, get rid of use of sync(2) in favor of explicitly fsync'ing every (non-temp) file we have written since the last checkpoint. In the vast majority of cases, the burden of the fsyncs should fall on the bgwriter process not on backends. (To this end, we assume that an fsync issued by the bgwriter will force out blocks written to the same file by other processes using other file descriptors. Anyone have a problem with that?) This makes the world safe for WIN32, which ain't even got sync(2), and really makes the world safe for Unixen as well, because sync(2) never had the semantics we need: it offers no way to wait for the requested I/O to finish. Along the way, fix a bug I recently introduced in xlog recovery: file truncation replay failed to clear bufmgr buffers for the dropped blocks, which could result in 'PANIC: heap_delete_redo: no block' later on in xlog replay.	2004-05-31 03:48:10 +00:00
Tom Lane	c6719a2784	Implement new PostmasterIsAlive() check for WIN32, per Claudio Natoli. In passing, align a few error messages with the style guide.	2004-05-30 03:50:15 +00:00
Tom Lane	076a055acf	Separate out bgwriter code into a logically separate module, rather than being random pieces of other files. Give bgwriter responsibility for all checkpoint activity (other than a post-recovery checkpoint); so this child process absorbs the functionality of the former transient checkpoint and shutdown subprocesses. While at it, create an actual include file for postmaster.c, which for some reason never had its own file before.	2004-05-29 22:48:23 +00:00
Tom Lane	1a321f26d8	Code review for EXEC_BACKEND changes. Reduce the number of #ifdefs by about a third, make it work on non-Windows platforms again. (But perhaps I broke the WIN32 code, since I have no way to test that.) Fold all the paths that fork postmaster child processes to go through the single routine SubPostmasterMain, which takes care of resurrecting the state that would normally be inherited from the postmaster (including GUC variables). Clean up some places where there's no particularly good reason for the EXEC and non-EXEC cases to work differently. Take care of one or two FIXMEs that remained in the code.	2004-05-28 05:13:32 +00:00
Tom Lane	ebfc56d3fb	Handle impending sinval queue overflow by means of a separate signal (SIGUSR1, which we have not been using recently) instead of piggybacking on SIGUSR2-driven NOTIFY processing. This has several good results: the processing needed to drain the sinval queue is a lot less than the processing needed to answer a NOTIFY; there's less contention since we don't have a bunch of backends all trying to acquire exclusive lock on pg_listener; backends that are sitting inside a transaction block can still drain the queue, whereas NOTIFY processing can't run if there's an open transaction block. (This last is a fairly serious issue that I don't think we ever recognized before --- with clients like JDBC that tend to sit with open transaction blocks, the sinval queue draining mechanism never really worked as intended, probably resulting in a lot of useless cache-reset overhead.) This is the last of several proposed changes in response to Philip Warner's recent report of sinval-induced performance problems.	2004-05-23 03:50:45 +00:00
Tom Lane	4af3421161	Get rid of rd_nblocks field in relcache entries. Turns out this was costing us lots more to maintain than it was worth. On shared tables it was of exactly zero benefit because we couldn't trust it to be up to date. On temp tables it sometimes saved an lseek, but not often enough to be worth getting excited about. And the real problem was that we forced an lseek on every relcache flush in order to update the field. So all in all it seems best to lose the complexity.	2004-05-08 19:09:25 +00:00
Neil Conway	0370951347	Tiny assorted fixes: correct a typo in a comment in vacuumlazy.c, remove some unused #include directives from bufmgr.c, and clarify comments in bufmgr.h and buf.h	2004-04-25 23:50:58 +00:00
Neil Conway	139abc2896	Make LocalRefCount and PrivateRefCount arrays of int32, rather than long. This saves a small amount of per-backend memory for LP64 machines.	2004-04-22 07:21:55 +00:00
Tom Lane	95a03e9cdf	Another round of code cleanup on bufmgr. Use BM_VALID flag to keep track of whether we have successfully read data into a buffer; this makes the error behavior a bit more transparent (IMHO anyway), and also makes it work correctly for local buffers which don't use Start/TerminateBufferIO. Collapse three separate functions for writing a shared buffer into one. This overlaps a bit with cleanups that Neil proposed awhile back, but seems not to have committed yet.	2004-04-21 18:06:30 +00:00
Tom Lane	011c3e62e7	Code review for ARC patch. Eliminate static variables, improve handling of VACUUM cases so that VACUUM requests don't affect the ARC state at all, avoid corner case where BufferSync would uselessly rewrite a buffer that no longer contains the page that was to be flushed. Make some minor other cleanups in and around the bufmgr as well, such as moving PinBuffer and UnpinBuffer into bufmgr.c where they really belong.	2004-04-19 23:27:17 +00:00
Bruce Momjian	31338352bd	* Most changes are to fix warnings issued when compiling win32 * removed a few redundant defines * get_user_name safe under win32 * rationalized pipe read EOF for win32 (UPDATED PATCH USED) * changed all backend instances of sleep() to pg_usleep - except for the SLEEP_ON_ASSERT in assert.c, as it would exceed a 32-bit long [Note to patcher: If a SLEEP_ON_ASSERT of 2000 seconds is acceptable, please replace with pg_usleep(2000000000L)] I added a comment to that part of the code: /* * It would be nice to use pg_usleep() here, but only does 2000 sec * or 33 minutes, which seems too short. */ sleep(1000000); Claudio Natoli	2004-04-19 17:42:59 +00:00
Bruce Momjian	48b2802eee	When changing select() calls for delays into pg_usleep(), two comments in s_lock.c were not updated, and still refers to select. Made my grep hit the wrong files, so I figured a simple patch was in order.. (other refs in the same comment block was changed..) Magnus Hagander	2004-03-23 21:39:46 +00:00
Bruce Momjian	3947f653f9	* postmaster.c: cleanup pmdaemonize under win32; missed failure message in CreateOptsFile * s_lock.c: minor comment fix * findbe.c: variables not used under win32 moved within #ifndef WIN32 case Claudio Natoli	2004-03-15 16:18:43 +00:00
Bruce Momjian	c672aa823b	For application to HEAD, following community review. * Changes incorrect CYGWIN defines to __CYGWIN__ * Some localtime returns NULL checks (when unchecked cause SEGVs under Win32 regression tests) * Rationalized CreateSharedMemoryAndSemaphores and AttachSharedMemoryAndSemaphores (Bruce, I finally remembered to do it); requires attention. Claudio Natoli	2004-02-25 19:41:23 +00:00
Tom Lane	7a57a67278	Replace opendir/closedir calls throughout the backend with AllocateDir and FreeDir routines modeled on the existing AllocateFile/FreeFile. Like the latter, these routines will avoid failing on EMFILE/ENFILE conditions whenever possible, and will prevent leakage of directory descriptors if an elog() occurs while one is open. Also, reduce PANIC to ERROR in MoveOfflineLogs() --- this is not critical code and there is no reason to force a DB restart on failure. All per recent trouble report from Olivier Hubaut.	2004-02-23 23:03:10 +00:00
Tom Lane	f83356c7f5	Do a direct probe during postmaster startup to determine the maximum number of openable files and the number already opened. This eliminates depending on sysconf(_SC_OPEN_MAX), and allows much saner behavior on platforms where open-file slots are used up by semaphores.	2004-02-23 20:45:59 +00:00
Bruce Momjian	af3b182a57	Here is a patch that implements setitimer() on win32. With this patch applied, deadlock detection and statement_timeout now works. The file timer.c goes into src/backend/port/win32/. The patch also removes two lines of "printf debugging" accidentally left in pqsignal.h, in the console control handler. Magnus Hagander	2004-02-18 16:25:12 +00:00
Tom Lane	da99cce7cd	Avoid delaying postmaster shutdown by up to 10 seconds on platforms where signals do not terminate sleep() delays.	2004-02-12 20:07:26 +00:00
Jan Wieck	fc65a3e1fd	Fixed bug where FlushRelationBuffers() did call StrategyInvalidateBuffer() for already empty buffers because their buffer tag was not cleared out when the buffers have been invalidated before. Also removed the misnamed BM_FREE bufhdr flag and replaced the checks, which effectively ask if the buffer is unpinned, with checks against the refcount field. Jan	2004-02-12 15:06:56 +00:00
Tom Lane	c3c09be34b	Commit the reasonably uncontroversial parts of J.R. Nield's PITR patch, to wit: Add a header record to each WAL segment file so that it can be reliably identified. Avoid splitting WAL records across segment files (this is not strictly necessary, but makes it simpler to incorporate the header records). Make WAL entries for file creation, deletion, and truncation (as foreseen but never implemented by Vadim). Also, add support for making XLOG_SEG_SIZE configurable at compile time, similarly to BLCKSZ. Fix a couple bugs I introduced in WAL replay during recent smgr API changes. initdb is forced due to changes in pg_control contents.	2004-02-11 22:55:26 +00:00
Tom Lane	58f337a343	Centralize implementation of delay code by creating a pg_usleep() subroutine in src/port/pgsleep.c. Remove platform dependencies from miscadmin.h and put them in port.h where they belong. Extend recent vacuum cost-based-delay patch to apply to VACUUM FULL, ANALYZE, and non-btree index vacuuming. By the way, where is the documentation for the cost-based-delay patch?	2004-02-10 03:42:45 +00:00
Tom Lane	87bd956385	Restructure smgr API as per recent proposal. smgr no longer depends on the relcache, and so the notion of 'blind write' is gone. This should improve efficiency in bgwriter and background checkpoint processes. Internal restructuring in md.c to remove the not-very-useful array of MdfdVec objects --- might as well just use pointers. Also remove the long-dead 'persistent main memory' storage manager (mm.c), since it seems quite unlikely to ever get resurrected.	2004-02-10 01:55:27 +00:00
Neil Conway	f06e79525a	Win32 signals cleanup. Patch by Magnus Hagander, with input from Claudio Natoli and Bruce Momjian (and some cosmetic fixes from Neil Conway). Changes: - remove duplicate signal definitions from pqsignal.h - replace pqkill() with kill() and redefine kill() in Win32 - use ereport() in place of fprintf() in some error handling in pqsignal.c - export pg_queue_signal() and make use of it where necessary - add a console control handler for Ctrl-C and similar handling on Win32 - do WaitForSingleObjectEx() in CHECK_FOR_INTERRUPTS() on Win32; query cancelling should now work on Win32 - various other fixes and cleanups	2004-02-08 22:28:57 +00:00
Jan Wieck	f425b605f4	Cost based vacuum delay feature. Jan	2004-02-06 19:36:18 +00:00
Jan Wieck	8d09e25693	Backing out the background writer sync() option. Jan	2004-02-04 01:24:53 +00:00
Bruce Momjian	5ee2ae2049	Remove sleep() and use single PG_SLEEP call for Win32 signal handling and consistency. Change PG_USLEEP to use SleepEx() for signal interruptibility.	2004-01-30 15:57:04 +00:00
Bruce Momjian	50491963cb	Here's the latest win32 signals code, this time in the form of a patch against the latest snapshot. It also includes the replacement of kill() with pqkill() and sigsetmask() with pqsigsetmask(). Passes all tests fine on my linux machine once applied. Still doesn't link completely on Win32 - there are a few things still required. But much closer than before. At Bruce's request, I'm going to write up a README file about the method of signals delivery chosen and why the others were rejected (basically a summary of the mailing list discussions). I'll finish that up once/if the patch is accepted. Magnus Hagander	2004-01-27 00:45:26 +00:00
Bruce Momjian	eec08b95e7	[all] Removed call to getppid in SendPostmasterSignal, replacing with a PostmasterPid variable, which gets set (early) in PostmasterMain getppid would not be the postmaster? [fork/exec] Implements processCancelRequest by keeping an array of pid/cancel_key structs in shared mem [fork/exec] Moves AttachSharedMemoryAndSemaphores call for backends into SubPostmasterMain [win32] Implements reaper/waitpid by keeping arrays of children pids, handles in postmaster local mem - this item is largely untested, for reasons which should be obvious, but appears sound [win32/all] Added extern for pgpipe in Win32 case, and changed the second pipe call (which seems to have been missed earlier) to pgpipe [win32] #define'd ftruncate to chsize in the Win32 case [win32] PG_USLEEP for Win32 has a misplaced paren. Fixed. [win32] DLLIMPORT handling for MingW case Claudio Natoli	2004-01-26 22:59:54 +00:00

1950 Commits