Commit Graph

1950 Commits

Author SHA1 Message Date
Heikki Linnakangas
5762a4d909 Inherit max_safe_fds to child processes in EXEC_BACKEND mode.
Postmaster sets max_safe_fds by testing how many open file descriptors it
can open, and that is normally inherited by all child processes at fork().
Not so on EXEC_BACKEND, ie. Windows, however. Because of that, we
effectively ignored max_files_per_process on Windows, and always assumed
a conservative default of 32 simultaneous open files. That could have an
impact on performance, if you need to access a lot of different files
in a query. After this patch, the value is passed to child processes by
save/restore_backend_variables() among many other global variables.

It has been like this forever, but given the lack of complaints about it,
I'm not backpatching this.
2012-03-29 08:19:11 +03:00
Robert Haas
40b9b95769 New GUC, track_iotiming, to track I/O timings.
Currently, the only way to see the numbers this gathers is via
EXPLAIN (ANALYZE, BUFFERS), but the plan is to add visibility through
the stats collector and pg_stat_statements in subsequent patches.

Ants Aasma, reviewed by Greg Smith, with some further changes by me.
2012-03-27 14:55:02 -04:00
Peter Eisentraut
0e85abd658 Clean up compiler warnings from unused variables with asserts disabled
For those variables only used when asserts are enabled, use a new
macro PG_USED_FOR_ASSERTS_ONLY, which expands to
__attribute__((unused)) when asserts are not enabled.
2012-03-21 23:33:10 +02:00
Robert Haas
07d1edb954 Extend object access hook framework to support arguments, and DROP.
This allows loadable modules to get control at drop time, perhaps for the
purpose of performing additional security checks or to log the event.
The initial purpose of this code is to support sepgsql, but other
applications should be possible as well.

KaiGai Kohei, reviewed by me.
2012-03-09 14:34:56 -05:00
Heikki Linnakangas
d6a7271958 Correctly detect SSI conflicts of prepared transactions after crash.
A prepared transaction can get new conflicts in and out after preparing, so
we cannot rely on the in- and out-flags stored in the statefile at prepare-
time. As a quick fix, make the conservative assumption that after a restart,
all prepared transactions are considered to have both in- and out-conflicts.
That can lead to unnecessary rollbacks after a crash, but that shouldn't be
a big problem in practice; you don't want prepared transactions to hang
around for a long time anyway.

Dan Ports
2012-02-29 15:42:36 +02:00
Robert Haas
2254367435 Make EXPLAIN (BUFFERS) track blocks dirtied, as well as those written.
Also expose the new counters through pg_stat_statements.

Patch by me.  Review by Fujii Masao and Greg Smith.
2012-02-22 20:33:05 -05:00
Heikki Linnakangas
1a01560cbb Rename LWLockWaitUntilFree to LWLockAcquireOrWait.
LWLockAcquireOrWait makes it more clear that the lock is acquired if it's
free.
2012-02-08 09:17:13 +02:00
Heikki Linnakangas
15ad6f1510 When building with LWLOCK_STATS, initialize the stats in LWLockWaitUntilFree.
If LWLockWaitUntilFree was called before the first LWLockAcquire call, you
would either crash because of access to uninitialized array or account the
acquisition incorrectly. LWLockConditionalAcquire doesn't have this problem
because it doesn't update the lwlock stats.

In practice, this never happens because there is no codepath where you would
call LWLockWaitUntilfree before LWLockAcquire after a new process is
launched. But that's just accidental, there's no guarantee that that's
always going to be true in the future.

Spotted by Jeff Janes.
2012-02-07 10:11:54 +02:00
Tom Lane
c6d76d7c82 Add locking around WAL-replay modification of shared-memory variables.
Originally, most of this code assumed that no Postgres backends could be
running concurrently with it, and so no locking could be needed.  That
assumption fails in Hot Standby.  While it's still true that Hot Standby
backends should never change values like nextXid, they can examine them,
and consistency is important in some cases such as when computing a
snapshot.  Therefore, prudence requires that WAL replay code obtain the
relevant locks when modifying such variables, even though it can examine
them without taking a lock.  We were following that coding rule in some
places but not all.  This commit applies the coding rule uniformly to all
updates of ShmemVariableCache and MultiXactState fields; a search of the
replay routines did not find any other cases that seemed to be at risk.

In addition, this commit fixes a longstanding thinko in replay of NEXTOID
and checkpoint records: we tried to advance nextOid only if it was behind
the value in the WAL record, but the comparison would draw the wrong
conclusion if OID wraparound had occurred since the previous value.
Better to just unconditionally assign the new value, since OID assignment
shouldn't be happening during replay anyway.

The additional locking seems to be more in the nature of future-proofing
than fixing any live bug, so I am not going to back-patch it.  The NEXTOID
fix will be back-patched separately.
2012-02-06 12:34:10 -05:00
Tom Lane
2af72cefea Add missing Assert and fix inaccurate elog message in standby_redo().
All other WAL redo routines either call RestoreBkpBlocks() or Assert that
they haven't been passed any backup blocks.  Make this one do likewise.
Also, fix incorrect routine name in its failure message.
2012-02-04 22:32:35 -05:00
Heikki Linnakangas
82d4b262d9 Fix bug in the new wait-until-lwlock-is-free mechanism.
If there was a wait-until-free process in the head of the wait queue,
followed by an exclusive locker, the exclusive locker was not be woken up
as it should.
2012-01-31 00:09:30 +02:00
Heikki Linnakangas
9b38d46d9f Make group commit more effective.
When a backend needs to flush the WAL, and someone else is already flushing
the WAL, wait until it releases the WALInsertLock and check if we still need
to do the flush or if the other backend already did the work for us, before
acquiring WALInsertLock. This helps group commit, because when the WAL flush
finishes, all the backends that were waiting for it can be woken up in one
go, and the can all concurrently observe that they're done, rather than
waking them up one by one in a cascading fashion.

This is based on a new LWLock function, LWLockWaitUntilFree(), which has
peculiar semantics. If the lock is immediately free, it grabs the lock and
returns true. If it's not free, it waits until it is released, but then
returns false without grabbing the lock. This is used in XLogFlush(), so
that when the lock is acquired, the backend flushes the WAL, but if it's
not, the backend first checks the current flush location before retrying.

Original patch and benchmarking by Peter Geoghegan and Simon Riggs, although
this patch as committed ended up being very different from that.
2012-01-30 16:53:48 +02:00
Tom Lane
ad10853b30 Assorted comment fixes, mostly just typos, but some obsolete statements.
YAMAMOTO Takashi
2012-01-29 19:23:56 -05:00
Magnus Hagander
672614cf21 Prevent logging "failed to stat file: success" for temp files
This was broken in commit bc3347484a, the
addition of statistics counters for temp files.

Reported by Thom Brown
2012-01-28 10:03:26 +01:00
Heikki Linnakangas
cf3fff6326 Initialize the new bgwriterLatch field properly.
Peter Geoghegan
2012-01-27 18:25:32 +02:00
Heikki Linnakangas
6d90eaaa89 Make bgwriter sleep longer when it has no work to do, to save electricity.
To make it wake up promptly when activity starts again, backends nudge it
by setting a latch in MarkBufferDirty(). The latch is kept set while
bgwriter is active, so there is very little overhead from that when the
system is busy. It is only armed before going into longer sleep.

Peter Geoghegan, with some changes by me.
2012-01-26 18:39:13 +02:00
Robert Haas
467ff207f5 Add missing #include, to suppress compiler warning. 2012-01-26 10:16:26 -05:00
Magnus Hagander
61cb8c5abb Add deadlock counter to pg_stat_database
Adds a counter that tracks number of deadlocks that occurred in
each database to pg_stat_database.

Magnus Hagander, reviewed by Jaime Casanova
2012-01-26 15:58:19 +01:00
Robert Haas
0e549697d1 Classify DROP operations by whether or not they are user-initiated.
This doesn't do anything useful just yet, but is intended as supporting
infrastructure for allowing sepgsql to sensibly check DROP permissions.

KaiGai Kohei and Robert Haas
2012-01-26 09:30:27 -05:00
Magnus Hagander
bc3347484a Track temporary file count and size in pg_stat_database
Add counters for number and size of temporary files used
for spill-to-disk queries for each database to the
pg_stat_database view.

Tomas Vondra, review by Magnus Hagander
2012-01-26 14:41:19 +01:00
Simon Riggs
c172b7b02e Resolve timing issue with logging locks for Hot Standby.
We log AccessExclusiveLocks for replay onto standby nodes,
but because of timing issues on ProcArray it is possible to
log a lock that is still held by a just committed transaction
that is very soon to be removed. To avoid any timing issue we
avoid applying locks made by transactions with InvalidXid.

Simon Riggs, bug report Tom Lane, diagnosis Pavan Deolasee
2012-01-23 23:37:32 +00:00
Heikki Linnakangas
326b922e8b Fix corner case in cleanup of transactions using SSI.
When the only remaining active transactions are READ ONLY, we do a "partial
cleanup" of committed transactions because certain types of conflicts
aren't possible anymore. For committed r/w transactions, we release the
SIREAD locks but keep the SERIALIZABLEXACT. However, for committed r/o
transactions, we can go further and release the SERIALIZABLEXACT too. The
problem was with the latter case: we were returning the SERIALIZABLEXACT to
the free list without removing it from the finished list.

The only real change in the patch is the SHMQueueDelete line, but I also
reworked some of the surrounding code to make it obvious that r/o and r/w
transactions are handled differently -- the existing code felt a bit too
clever.

Dan Ports
2012-01-18 17:57:33 +02:00
Robert Haas
33aaa139e6 Make the number of CLOG buffers adaptive, based on shared_buffers.
Previously, this was hardcoded: we always had 8.  Performance testing
shows that isn't enough, especially on big SMP systems, so we allow it
to scale up as high as 32 when there's adequate memory.  On the flip
side, when shared_buffers is very small, drop the number of CLOG buffers
down to as little as 4, so that we can start the postmaster even
when very little shared memory is available.

Per extensive discussion with Simon Riggs, Tom Lane, and others on
pgsql-hackers.
2012-01-06 14:32:18 -05:00
Robert Haas
7e4911b2ae Fix variable confusion in BufferSync().
As noted by Heikki Linnakangas, the previous coding confused the "flags"
variable with the "mask" variable.  The affect of this appears to be that
unlogged buffers would get written out at every checkpoint rather than
only at shutdown time.  Although that's arguably an acceptable failure
mode, I'm back-patching this change, since it seems like a poor idea to
rely on this happening to work.
2012-01-06 08:35:48 -05:00
Bruce Momjian
e126958c2e Update copyright notices for year 2012. 2012-01-01 18:01:58 -05:00
Peter Eisentraut
d383c23f6f Remove support for on_exit()
All supported platforms support the C89 standard function atexit()
(SunOS 4 probably being the last one not to), and supporting both
makes the code clumsy.
2011-12-27 20:57:59 +02:00
Tom Lane
d0024cd188 Avoid crashing when we have problems unlinking files post-commit.
smgrdounlink takes care to not throw an ERROR if it fails to unlink
something, but that caution was rendered useless by commit
3396000684, which put an smgrexists call in
front of it; smgrexists *does* throw error if anything looks funny, such
as getting a permissions error from trying to open the file.  If that
happens post-commit, you get a PANIC, and what's worse the same logic
appears in the WAL replay code, so the database even fails to restart.

Restore the intended behavior by removing the smgrexists call --- it isn't
accomplishing anything that we can't do better by adjusting mdunlink's
ideas of whether it ought to warn about ENOENT or not.

Per report from Joseph Shraibman of unrecoverable crash after trying to
drop a table whose FSM fork had somehow gotten chmod'd to 000 permissions.
Backpatch to 8.4, where the bogus coding was introduced.
2011-12-20 15:00:36 -05:00
Robert Haas
0d76b60db4 Various micro-optimizations for GetSnapshopData().
Heikki Linnakangas had the idea of rearranging GetSnapshotData to
avoid checking for sub-XIDs when no top-level XID is present.  This
patch does that plus further a bit of further, related rearrangement.
Benchmarking show a significant improvement on unlogged tables at
higher concurrency levels, and mostly indifferent result on permanent
tables (which are presumably bottlenecked elsewhere).  Most of the
benefit seems to come from using the new NormalTransactionIdPrecedes()
macro rather than the function call TransactionIdPrecedes().
2011-12-16 21:48:47 -05:00
Alvaro Herrera
9d3b502443 Improve logging of autovacuum I/O activity
This adds some I/O stats to the logging of autovacuum (when the
operation takes long enough that log_autovacuum_min_duration causes it
to be logged), so that it is easier to tune.  Notably, it adds buffer
I/O counts (hits, misses, dirtied) and read and write rate.

Authors: Greg Smith and Noah Misch
2011-11-25 16:34:32 -03:00
Robert Haas
ed0b409d22 Move "hot" members of PGPROC into a separate PGXACT array.
This speeds up snapshot-taking and reduces ProcArrayLock contention.
Also, the PGPROC (and PGXACT) structures used by two-phase commit are
now allocated as part of the main array, rather than in a separate
array, and we keep ProcArray sorted in pointer order.  These changes
are intended to minimize the number of cache lines that must be pulled
in to take a snapshot, and testing shows a substantial increase in
performance on both read and write workloads at high concurrencies.

Pavan Deolasee, Heikki Linnakangas, Robert Haas
2011-11-25 08:02:10 -05:00
Tom Lane
40d35036bb Avoid floating-point underflow while tracking buffer allocation rate.
When the system is idle for awhile after activity, the "smoothed_alloc"
state variable in BgBufferSync converges slowly to zero.  With standard
IEEE float arithmetic this results in several iterations with denormalized
values, which causes kernel traps and annoying log messages on some
poorly-designed platforms.  There's no real need to track such small values
of smoothed_alloc, so we can prevent the kernel traps by forcing it to zero
as soon as it's too small to be interesting for our purposes.  This issue
is purely cosmetic, since the iterations don't happen fast enough for the
kernel traps to pose any meaningful performance problem, but still it seems
worth shutting up the log messages.

The kernel log messages were previously reported by a number of people,
but kudos to Greg Matthews for tracking down exactly where they were coming
from.
2011-11-19 00:35:29 -05:00
Robert Haas
71b2b657c0 Revert removal of trace_userlocks, because userlocks aren't gone.
This reverts commit 0180bd6180.
contrib/userlock is gone, but user-level locking still exists,
and is exposed via the pg_advisory* family of functions.
2011-11-10 17:54:27 -05:00
Simon Riggs
86e3364899 Derive oldestActiveXid at correct time for Hot Standby.
There was a timing window between when oldestActiveXid was derived
and when it should have been derived that only shows itself under
heavy load. Move code around to ensure correct timing of derivation.
No change to StartupSUBTRANS() code, which is where this failed.

Bug report by Chris Redekop
2011-11-02 08:54:56 +00:00
Simon Riggs
10b7c686e5 Start Hot Standby faster when initial snapshot is incomplete.
If the initial snapshot had overflowed then we can start whenever
the latest snapshot is empty, not overflowed or as we did already,
start when the xmin on primary was higher than xmax of our starting
snapshot, which proves we have full snapshot data.

Bug report by Chris Redekop
2011-11-02 08:47:43 +00:00
Robert Haas
c2891b46a4 Initialize myProcLocks queues just once, at postmaster startup.
In assert-enabled builds, we assert during the shutdown sequence that
the queues have been properly emptied, and during process startup that
we are inheriting empty queues.  In non-assert enabled builds, we just
save a few cycles.
2011-11-01 22:44:54 -04:00
Simon Riggs
806a2aee37 Split work of bgwriter between 2 processes: bgwriter and checkpointer.
bgwriter is now a much less important process, responsible for page
cleaning duties only. checkpointer is now responsible for checkpoints
and so has a key role in shutdown. Later patches will correct doc
references to the now old idea that bgwriter performs checkpoints.
Has beneficial effect on performance at high write rates, but mainly
refactoring to more easily allow changes for power reduction by
simplifying previously tortuous code around required to allow page
cleaning and checkpointing to time slice in the same process.

Patch by me, Review by Dickson Guedes
2011-11-01 17:14:47 +00:00
Robert Haas
53f1ca59b5 Allow hint bits to be set sooner for temporary and unlogged tables.
We need not wait until the commit record is durably on disk, because
in the event of a crash the page we're updating with hint bits will
be gone anyway.  Per off-list report from Heikki Linnakangas, this
can significantly degrade the performance of unlogged tables; I was
able to show a 2x speedup from this patch on a pgbench run with scale
factor 15.  In practice, this will mostly help small, heavily updated
tables, because on larger tables you're unlikely to run into the same
row again before the commit record makes it out to disk.
2011-10-28 17:08:09 -04:00
Heikki Linnakangas
cbf65509bb Fix the number of lwlocks needed by the "fast path" lock patch. It needs
one lock per backend or auxiliary process - the need for a lock for each
aux processes was not accounted for in NumLWLocks(). No-one noticed,
because the three locks needed for the three aux processes fit into the
few extra lwlocks we allocate for 3rd party modules that don't call
RequestAddinLWLocks() (NUM_USER_DEFINED_LWLOCKS, 4 by default).
2011-10-27 22:39:58 +03:00
Tom Lane
bb446b689b Support synchronization of snapshots through an export/import procedure.
A transaction can export a snapshot with pg_export_snapshot(), and then
others can import it with SET TRANSACTION SNAPSHOT.  The data does not
leave the server so there are not security issues.  A snapshot can only
be imported while the exporting transaction is still running, and there
are some other restrictions.

I'm not totally convinced that we've covered all the bases for SSI (true
serializable) mode, but it works fine for lesser isolation modes.

Joachim Wieland, reviewed by Marko Tiikkaja, and rather heavily modified
by Tom Lane
2011-10-22 18:23:30 -04:00
Tom Lane
b4a0223d00 Simplify and improve ProcessStandbyHSFeedbackMessage logic.
There's no need to clamp the standby's xmin to be greater than
GetOldestXmin's result; if there were any such need this logic would be
hopelessly inadequate anyway, because it fails to account for
within-database versus cluster-wide values of GetOldestXmin.  So get rid of
that, and just rely on sanity-checking that the xmin is not wrapped around
relative to the nextXid counter.  Also, don't reset the walsender's xmin if
the current feedback xmin is indeed out of range; that just creates more
problems than we already had.  Lastly, don't bother to take the
ProcArrayLock; there's no need to do that to set xmin.

Also improve the comments about this in GetOldestXmin itself.
2011-10-20 19:43:31 -04:00
Bruce Momjian
0180bd6180 Remove all "traces" of trace_userlocks, because userlocks were removed
in PG 8.2.
2011-10-13 19:59:57 -04:00
Robert Haas
e76bcaba9c Repair breakage in VirtualXactLock.
I broke this in commit 84e3712677.  Report and
fix by Fujii Masao.
2011-10-11 07:39:09 -04:00
Tom Lane
57eb009092 Allow snapshot references to still work during transaction abort.
In REPEATABLE READ (nee SERIALIZABLE) mode, an attempt to do
GetTransactionSnapshot() between AbortTransaction and CleanupTransaction
failed, because GetTransactionSnapshot would recompute the transaction
snapshot (which is already wrong, given the isolation mode) and then
re-register it in the TopTransactionResourceOwner, leading to an Assert
because the TopTransactionResourceOwner should be empty of resources after
AbortTransaction.  This is the root cause of bug #6218 from Yamamoto
Takashi.  While changing plancache.c to avoid requesting a snapshot when
handling a ROLLBACK masks the problem, I think this is really a snapmgr.c
bug: it's lower-level than the resource manager mechanism and should not be
shutting itself down before we unwind resource manager resources.  However,
just postponing the release of the transaction snapshot until cleanup time
didn't work because of the circular dependency with
TopTransactionResourceOwner.  Fix by managing the internal reference to
that snapshot manually instead of depending on TopTransactionResourceOwner.
This saves a few cycles as well as making the module layering more
straightforward.  predicate.c's dependencies on TopTransactionResourceOwner
go away too.

I think this is a longstanding bug, but there's no evidence that it's more
than a latent bug, so it doesn't seem worth any risk of back-patching.
2011-09-26 22:25:28 -04:00
Robert Haas
0c8eda6258 Memory barrier support for PostgreSQL.
This is not actually used anywhere yet, but it gets the basic
infrastructure in place.  It is fairly likely that there are bugs, and
support for some important platforms may be missing, so we'll need to
refine this as we go along.
2011-09-23 17:52:43 -04:00
Peter Eisentraut
1b81c2fe6e Remove many -Wcast-qual warnings
This addresses only those cases that are easy to fix by adding or
moving a const qualifier or removing an unnecessary cast.  There are
many more complicated cases remaining.
2011-09-11 21:54:32 +03:00
Tom Lane
a7801b62f2 Move Timestamp/Interval typedefs and basic macros into datatype/timestamp.h.
As per my recent proposal, this refactors things so that these typedefs and
macros are available in a header that can be included in frontend-ish code.
I also changed various headers that were undesirably including
utils/timestamp.h to include datatype/timestamp.h instead.  Unsurprisingly,
this showed that half the system was getting utils/timestamp.h by way of
xlog.h.

No actual code changes here, just header refactoring.
2011-09-09 13:23:41 -04:00
Tom Lane
1609797c25 Clean up the #include mess a little.
walsender.h should depend on xlog.h, not vice versa.  (Actually, the
inclusion was circular until a couple hours ago, which was even sillier;
but Bruce broke it in the expedient rather than logically correct
direction.)  Because of that poor decision, plus blind application of
pgrminclude, we had a situation where half the system was depending on
xlog.h to include such unrelated stuff as array.h and guc.h.  Clean up
the header inclusion, and manually revert a lot of what pgrminclude had
done so things build again.

This episode reinforces my feeling that pgrminclude should not be run
without adult supervision.  Inclusion changes in header files in particular
need to be reviewed with great care.  More generally, it'd be good if we
had a clearer notion of module layering to dictate which headers can sanely
include which others ... but that's a big task for another day.
2011-09-04 01:13:16 -04:00
Bruce Momjian
6416a82a62 Remove unnecessary #include references, per pgrminclude script. 2011-09-01 10:04:27 -04:00
Robert Haas
c01c25fbe5 Improve spinlock performance for HP-UX, ia64, non-gcc.
At least on this architecture, it's very important to spin on a
non-atomic instruction and only retry the atomic once it appears
that it will succeed.  To fix this, split TAS() into two macros:
TAS(), for trying to grab the lock the first time, and TAS_SPIN(),
for spinning until we get it.  TAS_SPIN() defaults to same as TAS(),
but we can override it when we know there's a better way.

It's likely that some of the other cases in s_lock.h require
similar treatment, but this is the only one we've got conclusive
evidence for at present.
2011-08-29 10:05:48 -04:00
Bruce Momjian
f261deb4b4 Add missing includes after pgrminclude run. 2011-08-26 18:15:14 -04:00
Robert Haas
7488936478 Typo fix. 2011-08-22 12:16:27 -04:00
Robert Haas
24bf1552f6 Remove obsolete README file.
Perhaps we ought to add some other kind of documentation here instead,
but for now let's get rid of this woefully obsolete description of the
sinval machinery.
2011-08-18 09:49:41 -04:00
Peter Eisentraut
e5475a80d2 Add "Reason code" prefix to internal SSI error messages
This makes it clearer that the error message is perhaps not supposed
to be understood by users, and it also makes it somewhat clearer that
it was not accidentally omitted from translation.

Idea from Heikki Linnakangas, except that we don't mark "Reason code"
for translation at this point, because that would make the
implementation too cumbersome.
2011-08-15 15:20:16 +03:00
Tom Lane
4dab3d5ae1 Change the autovacuum launcher to use WaitLatch instead of a poll loop.
In pursuit of this (and with the expectation that WaitLatch will be needed
in more places), convert the latch field that was already added to PGPROC
for sync rep into a generic latch that is activated for all PGPROC-owning
processes, and change many of the standard backend signal handlers to set
that latch when a signal happens.  This will allow WaitLatch callers to be
wakened properly by these signals.

In passing, fix a whole bunch of signal handlers that had been hacked to do
things that might change errno, without adding the necessary save/restore
logic for errno.  Also make some minor fixes in unix_latch.c, and clean
up bizarre and unsafe scheme for disowning the process's latch.  Much of
this has to be back-patched into 9.1.

Peter Geoghegan, with additional work by Tom
2011-08-10 12:22:21 -04:00
Tom Lane
4e15a4db5e Documentation improvement and minor code cleanups for the latch facility.
Improve the documentation around weak-memory-ordering risks, and do a pass
of general editorialization on the comments in the latch code.  Make the
Windows latch code more like the Unix latch code where feasible; in
particular provide the same Assert checks in both implementations.
Fix poorly-placed WaitLatch call in syncrep.c.

This patch resolves, for the moment, concerns around weak-memory-ordering
bugs in latch-related code: we have documented the restrictions and checked
that existing calls meet them.  In 9.2 I hope that we will install suitable
memory barrier instructions in SetLatch/ResetLatch, so that their callers
don't need to be quite so careful.
2011-08-09 15:30:45 -04:00
Robert Haas
84e3712677 Create VXID locks "lazily" in the main lock table.
Instead of entering them on transaction startup, we materialize them
only when someone wants to wait, which will occur only during CREATE
INDEX CONCURRENTLY.  In Hot Standby mode, the startup process must also
be able to probe for conflicting VXID locks, but the lock need never be
fully materialized, because the startup process does not use the normal
lock wait mechanism.  Since most VXID locks never need to touch the
lock manager partition locks, this can significantly reduce blocking
contention on read-heavy workloads.

Patch by me.  Review by Jeff Davis.
2011-08-04 12:38:33 -04:00
Tom Lane
ac36e6f71f Move CheckRecoveryConflictDeadlock() call to a safer place.
This kluge was inserted in a spot apparently chosen at random: the lock
manager's state is not yet fully set up for the wait, and in particular
LockWaitCancel hasn't been armed by setting lockAwaited, so the ProcLock
will not get cleaned up if the ereport is thrown.  This seems to not cause
any observable problem in trivial test cases, because LockReleaseAll will
silently clean up the debris; but I was able to cause failures with tests
involving subtransactions.

Fixes breakage induced by commit c85c941470.
Back-patch to all affected branches.
2011-08-02 15:16:29 -04:00
Tom Lane
2e53bd5517 Fix incorrect initialization of ProcGlobal->startupBufferPinWaitBufId.
It was initialized in the wrong place and to the wrong value.  With bad
luck this could result in incorrect query-cancellation failures in hot
standby sessions, should a HS backend be holding pin on buffer number 1
while trying to acquire a lock.
2011-08-02 13:23:52 -04:00
Robert Haas
85b436f7b1 Minor stylistic corrections. 2011-08-01 08:24:45 -04:00
Robert Haas
b4fbe392f8 Reduce sinval synchronization overhead.
Testing shows that the overhead of acquiring and releasing
SInvalReadLock and msgNumLock on high-core count boxes can waste a lot
of CPU time and hurt performance.  This patch adds a per-backend flag
that allows us to skip all that locking in most cases.  Further
testing shows that this improves performance even when sinval traffic
is very high.

Patch by me.  Review and testing by Noah Misch.
2011-07-29 16:46:13 -04:00
Peter Eisentraut
0fe8150827 Minor message style adjustment 2011-07-27 23:54:46 +03:00
Robert Haas
8e5ac74c12 Some refinement for the "fast path" lock patch.
1. In GetLockStatusData, avoid initializing instance before we've ensured
that the array is large enough.  Otherwise, if repalloc moves the block
around, we're hosed.

2. Add the word "Relation" to the name of some identifiers, to avoid
assuming that the fast-path mechanism will only ever apply to relations
(though these particular parts certainly will).  Some of the macros
could possibly use similar treatment, but the names are getting awfully
long already.

3. Add a missing word to comment in AtPrepare_Locks().
2011-07-19 12:10:15 -04:00
Peter Eisentraut
30f854537d Change debug message from ereport to elog 2011-07-19 07:50:10 +03:00
Robert Haas
3cba8999b3 Create a "fast path" for acquiring weak relation locks.
When an AccessShareLock, RowShareLock, or RowExclusiveLock is requested
on an unshared database relation, and we can verify that no conflicting
locks can possibly be present, record the lock in a per-backend queue,
stored within the PGPROC, rather than in the primary lock table.  This
eliminates a great deal of contention on the lock manager LWLocks.

This patch also refactors the interface between GetLockStatusData() and
pg_lock_status() to be a bit more abstract, so that we don't rely so
heavily on the lock manager's internal representation details.  The new
fast path lock structures don't have a LOCK or PROCLOCK structure to
return, so we mustn't depend on that for purposes of listing outstanding
locks.

Review by Jeff Davis.
2011-07-18 00:49:28 -04:00
Tom Lane
9473bb96d0 Further thoughts about temp_file_limit patch.
Move FileClose's decrement of temporary_files_size up, so that it will be
executed even if elog() throws an error.  This is reasonable since if the
unlink() fails, the fact the file is still there is not our fault, and we
are going to forget about it anyhow.  So we won't count it against
temp_file_limit anymore.

Update fileSize and temporary_files_size correctly in FileTruncate.
We probably don't have any places that truncate temp files, but fd.c
surely should not assume that.
2011-07-17 15:05:44 -04:00
Tom Lane
23e5b16c71 Add temp_file_limit GUC parameter to constrain temporary file space usage.
The limit is enforced against the total amount of temp file space used by
each session.

Mark Kirkwood, reviewed by Cédric Villemain and Tatsuo Ishii
2011-07-17 14:19:31 -04:00
Tom Lane
1af37ec96d Replace errdetail("%s", ...) with errdetail_internal("%s", ...).
There may be some other places where we should use errdetail_internal,
but they'll have to be evaluated case-by-case.  This commit just hits
a bunch of places where invoking gettext is obviously a waste of cycles.
2011-07-16 14:22:18 -04:00
Tom Lane
3ee7c8710d Use errdetail_internal() for SSI transaction cancellation details.
Per discussion, these seem too technical to be worth translating.

Kevin Grittner
2011-07-16 14:22:16 -04:00
Robert Haas
4240e429d0 Try to acquire relation locks in RangeVarGetRelid.
In the previous coding, we would look up a relation in RangeVarGetRelid,
lock the resulting OID, and then AcceptInvalidationMessages().  While
this was sufficient to ensure that we noticed any changes to the
relation definition before building the relcache entry, it didn't
handle the possibility that the name we looked up no longer referenced
the same OID.  This was particularly problematic in the case where a
table had been dropped and recreated: we'd latch on to the entry for
the old relation and fail later on.  Now, we acquire the relation lock
inside RangeVarGetRelid, and retry the name lookup if we notice that
invalidation messages have been processed meanwhile.  Many operations
that would previously have failed with an error in the presence of
concurrent DDL will now succeed.

There is a good deal of work remaining to be done here: many callers
of RangeVarGetRelid still pass NoLock for one reason or another.  In
addition, nothing in this patch guards against the possibility that
the meaning of an unqualified name might change due to the creation
of a relation in a schema earlier in the user's search path than the
one where it was previously found.  Furthermore, there's nothing at
all here to guard against similar race conditions for non-relations.
For all that, it's a start.

Noah Misch and Robert Haas
2011-07-08 22:19:30 -04:00
Heikki Linnakangas
89fd72cbf2 Introduce a pipe between postmaster and each backend, which can be used to
detect postmaster death. Postmaster keeps the write-end of the pipe open,
so when it dies, children get EOF in the read-end. That can conveniently
be waited for in select(), which allows eliminating some of the polling
loops that check for postmaster death. This patch doesn't yet change all
the loops to use the new mechanism, expect a follow-on patch to do that.

This changes the interface to WaitLatch, so that it takes as argument a
bitmask of events that it waits for. Possible events are latch set, timeout,
postmaster death, and socket becoming readable or writeable.

The pipe method behaves slightly differently from the kill() method
previously used in PostmasterIsAlive() in the case that postmaster has died,
but its parent has not yet read its exit code with waitpid(). The pipe
returns EOF as soon as the process dies, but kill() continues to return
true until waitpid() has been called (IOW while the process is a zombie).
Because of that, change PostmasterIsAlive() to use the pipe too, otherwise
WaitLatch() would return immediately with WL_POSTMASTER_DEATH, while
PostmasterIsAlive() would claim it's still alive. That could easily lead to
busy-waiting while postmaster is in zombie state.

Peter Geoghegan with further changes by me, reviewed by Fujii Masao and
Florian Pflug.
2011-07-08 18:44:07 +03:00
Heikki Linnakangas
9598afa3b0 Fix one overflow and one signedness error, caused by the patch to calculate
OLDSERXID_MAX_PAGE based on BLCKSZ. MSVC compiler warned about these.
2011-07-08 17:29:53 +03:00
Heikki Linnakangas
bdaabb9b22 There's a small window wherein a transaction is committed but not yet
on the finished list, and we shouldn't flag it as a potential conflict
if so. We can also skip adding a doomed transaction to the list of
possible conflicts because we know it won't commit.

Dan Ports and Kevin Grittner.
2011-07-08 00:36:30 +03:00
Heikki Linnakangas
406d61835b SSI has a race condition, where the order of commit sequence numbers of
transactions might not match the order the work done in those transactions
become visible to others. The logic in SSI, however, assumed that it does.
Fix that by having two sequence numbers for each serializable transaction,
one taken before a transaction becomes visible to others, and one after it.
This is easier than trying to make the the transition totally atomic, which
would require holding ProcArrayLock and SerializableXactHashLock at the same
time. By using prepareSeqNo instead of commitSeqNo in a few places where
commit sequence numbers are compared, we can make those comparisons err on
the safe side when we don't know for sure which committed first.

Per analysis by Kevin Grittner and Dan Ports, but this approach to fix it
is different from the original patch.
2011-07-07 23:26:34 +03:00
Robert Haas
5b2b444f66 Adjust OLDSERXID_MAX_PAGE based on BLCKSZ.
The value when BLCKSZ = 8192 is unchanged, but with larger-than-normal
block sizes we might need to crank things back a bit, as we'll have
more entries per page than normal in that case.

Kevin Grittner
2011-07-07 15:05:21 -04:00
Heikki Linnakangas
928408d9e5 Fix a bug with SSI and prepared transactions:
If there's a dangerous structure T0 ---> T1 ---> T2, and T2 commits first,
we need to abort something. If T2 commits before both conflicts appear,
then it should be caught by OnConflict_CheckForSerializationFailure. If
both conflicts appear before T2 commits, it should be caught by
PreCommit_CheckForSerializationFailure. But that is actually run when
T2 *prepares*. Fix that in OnConflict_CheckForSerializationFailure, by
treating a prepared T2 as if it committed already.

This is mostly a problem for prepared transactions, which are in prepared
state for some time, but also for regular transactions because they also go
through the prepared state in the SSI code for a short moment when they're
committed.

Kevin Grittner and Dan Ports
2011-07-07 18:12:15 +03:00
Peter Eisentraut
27af66162b Message style tweaks 2011-07-05 00:01:35 +03:00
Peter Eisentraut
21f1e15aaf Unify spelling of "canceled", "canceling", "cancellation"
We had previously (af26857a27)
established the U.S. spellings as standard.
2011-06-29 09:28:46 +03:00
Tom Lane
223be216af Undo overly enthusiastic de-const-ification.
s/const//g wasn't exactly what I was suggesting here ... parameter
declarations of the form "const structtype *param" are good and useful,
so put those occurrences back.  Likewise, avoid casting away the const
in a "const void *" parameter.
2011-06-22 23:04:46 -04:00
Heikki Linnakangas
5da417f7c4 Remove pointless const qualifiers from function arguments in the SSI code.
As Tom Lane pointed out, "const Relation foo" doesn't guarantee that you
can't modify the data the "foo" pointer points to. It just means that you
can't change the pointer to point to something else within the function,
which is not very useful.
2011-06-22 12:18:39 +03:00
Tom Lane
a3290f655e Minor editing for README-SSI.
Fix some grammatical issues, try to clarify a couple of proofs, make the
terminology more consistent.
2011-06-21 18:01:22 -04:00
Heikki Linnakangas
1eea8e8a06 Fix bug in PreCommit_CheckForSerializationFailure. A transaction that has
already been marked as PREPARED cannot be killed. Kill the current
transaction instead.

One of the prepared_xacts regression tests actually hits this bug. I
removed the anomaly from the duplicate-gids test so that it fails in the
intended way, and added a new test to check serialization failures with
a prepared transaction.

Dan Ports
2011-06-21 14:49:50 +03:00
Heikki Linnakangas
7cb2ff9621 Fix bug introduced by recent SSI patch to merge ROLLED_BACK and
MARKED_FOR_DEATH flags into one. We still need the ROLLED_BACK flag to
mark transactions that are in the process of being rolled back. To be
precise, ROLLED_BACK now means that a transaction has already been
discounted from the count of transactions with the oldest xmin, but not
yet removed from the list of active transactions.

Dan Ports
2011-06-21 14:49:50 +03:00
Peter Eisentraut
8a8fbe7e79 Capitalization fixes 2011-06-19 00:37:30 +03:00
Robert Haas
c573486ce9 Fix minor thinko in ProcGlobalShmemSize().
There's no need to add space for startupBufferPinWaitBufId, because
it's part of the PROC_HDR object for which this function already
allocates space.

This has been wrong for a while, but the only consequence is that our
shared memory allocation is increased by 4 bytes, so no back-patch.
2011-06-17 09:12:19 -04:00
Heikki Linnakangas
78475b0eca Update README-SSI. Add a section to describe the "dangerous structure" that
SSI is based on, as well as the optimizations about relative commit times
and read-only transactions. Plus a bunch of other misc fixes and
improvements.

Dan Ports
2011-06-16 21:20:39 +03:00
Heikki Linnakangas
cb94db91b2 pgindent run of recent SSI changes. Also, remove an unnecessary #include.
Kevin Grittner
2011-06-16 16:17:22 +03:00
Heikki Linnakangas
264a6b127a The rolled-back flag on serializable xacts was pointless and redundant with
the marked-for-death flag. It was only set for a fleeting moment while a
transaction was being cleaned up at rollback. All the places that checked
for the rolled-back flag should also check the marked-for-death flag, as
both flags mean that the transaction will roll back. I also renamed the
marked-for-death into "doomed", which is a lot shorter name.
2011-06-15 13:35:28 +03:00
Heikki Linnakangas
0a0e2b52a5 Make non-MVCC snapshots exempt from predicate locking. Scans with non-MVCC
snapshots, like in REINDEX, are basically non-transactional operations. The
DDL operation itself might participate in SSI, but there's separate
functions for that.

Kevin Grittner and Dan Ports, with some changes by me.
2011-06-15 12:11:18 +03:00
Heikki Linnakangas
13000b44d6 Remove now-unnecessary casts.
Kevin Grittner
2011-06-12 22:49:33 +03:00
Robert Haas
47ebcecc3e Code cleanup for InitProcGlobal.
The old code creates three separate arrays when only one is needed,
using two different shmem allocation functions for no obvious reason.
It also strangely splits up the initialization of AuxilaryProcs
between the top and bottom of the function to no evident purpose.

Review by Tom Lane.
2011-06-12 00:07:04 -04:00
Heikki Linnakangas
cb2d158c58 Fix locking while setting flags in MySerializableXact.
Even if a flag is modified only by the backend owning the transaction, it's
not safe to modify it without a lock. Another backend might be setting or
clearing a different flag in the flags field concurrently, and that
operation might be lost because setting or clearing a bit in a word is not
atomic.

Make did-write flag a simple backend-private boolean variable, because it
was only set or tested in the owning backend (except when committing a
prepared transaction, but it's not worthwhile to optimize for the case of a
read-only prepared transaction). This also eliminates the need to add
locking where that flag is set.

Also, set the did-write flag when doing DDL operations like DROP TABLE or
TRUNCATE -- that was missed earlier.
2011-06-10 23:41:10 +03:00
Alvaro Herrera
fba105b109 Use "transient" files for blind writes, take 2
"Blind writes" are a mechanism to push buffers down to disk when
evicting them; since they may belong to different databases than the one
a backend is connected to, the backend does not necessarily have a
relation to link them to, and thus no way to blow them away.  We were
keeping those files open indefinitely, which would cause a problem if
the underlying table was deleted, because the operating system would not
be able to reclaim the disk space used by those files.

To fix, have bufmgr mark such files as transient to smgr; the lower
layer is allowed to close the file descriptor when the current
transaction ends.  We must be careful to have any other access of the
file to remove the transient markings, to prevent unnecessary expensive
system calls when evicting buffers belonging to our own database (which
files we're likely to require again soon.)

This commit fixes a bug in the previous one, which neglected to cleanly
handle the LRU ring that fd.c uses to manage open files, and caused an
unacceptable failure just before beta2 and was thus reverted.
2011-06-10 13:43:02 -04:00
Alvaro Herrera
3d114b63b2 Use a constant sprintf format to silence compiler warning 2011-06-10 13:38:50 -04:00
Heikki Linnakangas
c79c570bd8 Small comment fixes and enhancements. 2011-06-10 17:22:46 +03:00
Alvaro Herrera
9261557eb1 Revert "Use "transient" files for blind writes"
This reverts commit 54d9e8c6c1, which
caused a failure on the buildfarm.  Not a good thing to have just before
a beta release.
2011-06-09 16:41:44 -04:00
Alvaro Herrera
54d9e8c6c1 Use "transient" files for blind writes
"Blind writes" are a mechanism to push buffers down to disk when
evicting them; since they may belong to different databases than the one
a backend is connected to, the backend does not necessarily have a
relation to link them to, and thus no way to blow them away.  We were
keeping those files open indefinitely, which would cause a problem if
the underlying table was deleted, because the operating system would not
be able to reclaim the disk space used by those files.

To fix, have bufmgr mark such files as transient to smgr; the lower
layer is allowed to close the file descriptor when the current
transaction ends.  We must be careful to have any other access of the
file to remove the transient markings, to prevent unnecessary expensive
system calls when evicting buffers belonging to our own database (which
files we're likely to require again soon.)
2011-06-09 16:25:49 -04:00
Heikki Linnakangas
e1c26ab853 Fix the truncation logic of the OldSerXid SLRU mechanism. We can't pass
SimpleLruTruncate() a page number that's "in the future", because it will
issue a warning and refuse to truncate anything. Instead, we leave behind
the latest segment. If the slru is not needed before XID wrap-around, the
segment will appear as new again, and not be cleaned up until it gets old
enough again. That's a bit unpleasant, but better than not cleaning up
anything.

Also, fix broken calculation to check and warn if the span of the OldSerXid
SLRU is getting too large to fit in the 64k SLRU pages that we have
available. It was not XID wraparound aware.

Kevin Grittner and me.
2011-06-09 21:39:39 +03:00
Heikki Linnakangas
5234161ac1 Mark the SLRU page as dirty when setting an entry in pg_serial. In the
passing, fix an incorrect comment.
2011-06-09 12:10:14 +03:00
Heikki Linnakangas
8f9622bbb3 Make DDL operations play nicely with Serializable Snapshot Isolation.
Truncating or dropping a table is treated like deletion of all tuples, and
check for conflicts accordingly. If a table is clustered or rewritten by
ALTER TABLE, all predicate locks on the heap are promoted to relation-level
locks, because the tuple or page ids of any existing tuples will change and
won't be valid after rewriting the table. Arguably ALTER TABLE should be
treated like a mass-UPDATE of every row, but if you e.g change the datatype
of a column, you could also argue that it's just a change to the physical
layout, not a logical change. Reindexing promotes all locks on the index to
relation-level lock on the heap.

Kevin Grittner, with a lot of cosmetic changes by me.
2011-06-08 14:02:43 +03:00
Heikki Linnakangas
a31ff707a2 Make ascii-art in comments pgindent-safe, and some other formatting changes.
Kevin Grittner
2011-06-07 09:54:24 +03:00
Heikki Linnakangas
c8630919e0 SSI comment fixes and enhancements. Notably, document that the conflict-out
flag actually means that the transaction has a conflict out to a transaction
that committed before the flagged transaction.

Kevin Grittner
2011-06-03 12:45:42 +03:00
Peter Eisentraut
ba4cacf075 Recode non-ASCII characters in source to UTF-8
For consistency, have all non-ASCII characters from contributors'
names in the source be in UTF-8.  But remove some other more
gratuitous uses of non-ASCII characters.
2011-05-31 23:11:46 +03:00
Heikki Linnakangas
3103f9a77d The row-version chaining in Serializable Snapshot Isolation was still wrong.
On further analysis, it turns out that it is not needed to duplicate predicate
locks to the new row version at update, the lock on the version that the
transaction saw as visible is enough. However, there was a different bug in
the code that checks for dangerous structures when a new rw-conflict happens.
Fix that bug, and remove all the row-version chaining related code.

Kevin Grittner & Dan Ports, with some comment editorialization by me.
2011-05-30 20:47:17 +03:00
Robert Haas
74aaa2136d Fix race condition in CheckTargetForConflictsIn.
Dan Ports
2011-05-19 12:12:04 -04:00
Robert Haas
71932ecc2b Add comment about memory reordering to PredicateLockTupleRowVersionLink.
Dan Ports, per head-scratching from Simon Riggs and myself.
2011-05-06 21:55:10 -04:00
Robert Haas
02e6a115cc Add fast paths for cases when no serializable transactions are running.
Dan Ports
2011-04-25 09:52:01 -04:00
Heikki Linnakangas
4c37c1e3b2 Reduce the initial size of local lock hash to 16 entries.
The hash table is seq scanned at transaction end, to release all locks,
and making the hash table larger than necessary makes that slower. With
very simple queries, that overhead can amount to a few percent of the total
CPU time used.

At the moment, backend startup needs 6 locks, and a simple query with one
table and index needs 3 locks. 16 is enough for even quite complicated
transactions, and it will grow automatically if it fills up.
2011-04-15 15:07:36 +03:00
Peter Eisentraut
5caa3479c2 Clean up most -Wunused-but-set-variable warnings from gcc 4.6
This warning is new in gcc 4.6 and part of -Wall.  This patch cleans
up most of the noise, but there are some still warnings that are
trickier to remove.
2011-04-11 22:28:45 +03:00
Heikki Linnakangas
dad1f46382 TransferPredicateLocksToNewTarget should initialize a new lock
entry's commitSeqNo to that of the old one being transferred, or take
the minimum commitSeqNo if it is merging two lock entries.

Also, CreatePredicateLock should initialize commitSeqNo for to
InvalidSerCommitSeqNo instead of to 0. (I don't think using 0 would
actually affect anything, but we should be consistent.)

I also added a couple of assertions I used to track this down: a
lock's commitSeqNo should never be zero, and it should be
InvalidSerCommitSeqNo if and only if the lock is not held by
OldCommittedSxact.

Dan Ports, to fix leak of predicate locks reported by YAMAMOTO Takashi.
2011-04-11 13:46:37 +03:00
Heikki Linnakangas
7c797e7194 Fix the size of predicate lock manager's shared memory hash tables at creation.
This way they don't compete with the regular lock manager for the slack shared
memory, making the behavior more predictable.
2011-04-11 13:43:31 +03:00
Bruce Momjian
bf50caf105 pgindent run before PG 9.1 beta 1. 2011-04-10 11:42:00 -04:00
Robert Haas
cdcdfca401 Truncate the predicate lock SLRU to empty, instead of almost empty.
Otherwise, the SLRU machinery can get confused and think that the SLRU
has wrapped around.  Along the way, regardless of whether we're
truncating all of the SLRU or just some of it, flush pages after
truncating, rather than before.

Kevin Grittner
2011-04-08 16:52:19 -04:00
Robert Haas
fbc0d07796 Partially roll back overenthusiastic SSI optimization.
When a regular lock is held, SSI can use that in lieu of a predicate lock
to detect rw conflicts; but if the regular lock is being taken by a
subtransaction, we can't assume that it'll commit, so releasing the
parent transaction's lock in that case is a no-no.

Kevin Grittner
2011-04-08 15:29:02 -04:00
Robert Haas
56c7140ca8 Tweaks for SSI out-of-shared memory behavior.
If we call hash_search() with HASH_ENTER, it will bail out rather than
return NULL, so it's redundant to check for NULL again in the caller.
Thus, in cases where we believe it's impossible for the hash table to run
out of slots anyway, we can simplify the code slightly.

On the flip side, in cases where it's theoretically possible to run out of
space, we don't want to rely on dynahash.c to throw an error; instead,
we pass HASH_ENTER_NULL and throw the error ourselves if a NULL comes
back, so that we can provide a more descriptive error message.

Kevin Grittner
2011-04-07 16:43:39 -04:00
Robert Haas
632f0faa7c Repair some flakiness in CheckTargetForConflictsIn.
When we release and reacquire SerializableXactHashLock, we must recheck
whether an R/W conflict still needs to be flagged, because it could have
changed under us in the meantime.  And when we release the partition
lock, we must re-walk the list of predicate locks from the beginning,
because our pointer could get invalidated under us.

Bug report #5952 by Yamamoto Takashi.  Patch by Kevin Grittner.
2011-04-05 15:17:25 -04:00
Heikki Linnakangas
60b142b9a6 Fix a tiny race condition in predicate locking. Need to hold the lock while
examining the head of predicate locks list. Also, fix the comment of
RemoveTargetIfNoLongerUsed, it was neglected when we changed the way update
chains are handled.

Kevin Grittner
2011-03-31 18:43:23 +03:00
Simon Riggs
b98ac467f5 Prevent intermittent hang in recovery from bgwriter interaction.
Startup process waited for cleanup lock but when hot_standby = off
the pid was not registered, so that the bgwriter would not wake
the waiting process as intended.
2011-03-23 13:30:05 +00:00
Robert Haas
6436098795 Minor sync rep corrections.
Fujii Masao, with a bit of additional wordsmithing by me.
2011-03-10 14:57:02 -05:00
Heikki Linnakangas
46c333a963 Fix overly strict assertion in SummarizeOldestCommittedSxact(). There's a
race condition where SummarizeOldestCommittedSxact() is called even though
another backend already cleared out all finished sxact entries. That's OK,
RegisterSerializableTransactionInt() can just retry getting a news xact
slot from the available-list when that happens.

Reported by YAMAMOTO Takashi, bug #5918.
2011-03-08 21:06:26 +02:00
Heikki Linnakangas
93d888232e Don't throw a warning if vacuum sees PD_ALL_VISIBLE flag set on a page that
contains newly-inserted tuples that according to our OldestXmin are not
yet visible to everyone. The value returned by GetOldestXmin() is conservative,
and it can move backwards on repeated calls, so if we see that contradiction
between the PD_ALL_VISIBLE flag and status of tuples on the page, we have to
assume it's because an earlier vacuum calculated a higher OldestXmin value,
and all the tuples really are visible to everyone.

We have received several reports of this bug, with the "PD_ALL_VISIBLE flag
was incorrectly set in relation ..." warning appearing in logs. We were
finally able to hunt it down with David Gould's help to run extra diagnostics
in an environment where this happened frequently.

Also reword the warning, per Robert Haas' suggestion, to not imply that the
PD_ALL_VISIBLE flag is necessarily at fault, as it might also be a symptom
of corruption on a tuple header.

Backpatch to 8.4, where the PD_ALL_VISIBLE flag was introduced.
2011-03-08 20:30:53 +02:00
Heikki Linnakangas
4cd3fb6e12 Truncate predicate lock manager's SLRU lazily at checkpoint. That's safer
than doing it aggressively whenever the tail-XID pointer is advanced, because
this way we don't need to do it while holding SerializableXactHashLock.

This also fixes bug #5915 spotted by YAMAMOTO Takashi, and removes an
obsolete comment spotted by Kevin Grittner.
2011-03-08 12:12:54 +02:00
Simon Riggs
a8a8a3e096 Efficient transaction-controlled synchronous replication.
If a standby is broadcasting reply messages and we have named
one or more standbys in synchronous_standby_names then allow
users who set synchronous_replication to wait for commit, which
then provides strict data integrity guarantees. Design avoids
sending and receiving transaction state information so minimises
bookkeeping overheads. We synchronize with the highest priority
standby that is connected and ready to synchronize. Other standbys
can be defined to takeover in case of standby failure.

This version has very strict behaviour; more relaxed options
may be added at a later date.

Simon Riggs and Fujii Masao, with reviews by Yeb Havinga, Jaime
Casanova, Heikki Linnakangas and Robert Haas, plus the assistance
of many other design reviewers.
2011-03-06 22:49:16 +00:00
Heikki Linnakangas
ee3838b1d3 You must hold a lock on the heap page when you call
CheckForSerializableConflictOut(), because it can set hint bits.

YAMAMOTO Takashi
2011-03-04 15:43:11 +02:00
Heikki Linnakangas
47ad79122b Fix bugs in Serializable Snapshot Isolation.
Change the way UPDATEs are handled. Instead of maintaining a chain of
tuple-level locks in shared memory, copy any existing locks on the old
tuple to the new tuple at UPDATE. Any existing page-level lock needs to
be duplicated too, as a lock on the new tuple. That was neglected
previously.

Store xmin on tuple-level predicate locks, to distinguish a lock on an old
already-recycled tuple from a new tuple at the same physical location.
Failure to distinguish them caused loops in the tuple-lock chains, as
reported by YAMAMOTO Takashi. Although we don't use the chain representation
of UPDATEs anymore, it seems like a good idea to store the xmin to avoid
some false positives if no other reason.

CheckSingleTargetForConflictsIn now correctly handles the case where a lock
that's being held is not reflected in the local lock table. That happens
if another backend acquires a lock on our behalf due to an UPDATE or a page
split.

PredicateLockPageCombine now retains locks for the page that is being
removed, rather than removing them. This prevents a potentially dangerous
false-positive inconsistency where the local lock table believes that a lock
is held, but it is actually not.

Dan Ports and Kevin Grittner
2011-03-01 19:05:16 +02:00
Magnus Hagander
45a6d79b17 Properly initialize variables
Kevin Grittner
2011-02-18 11:59:57 +01:00
Itagaki Takahiro
62c7bd31c8 Add transaction-level advisory locks.
They share the same locking namespace with the existing session-level
advisory locks, but they are automatically released at the end of the
current transaction and cannot be released explicitly via unlock
functions.

Marko Tiikkaja, reviewed by me.
2011-02-18 14:05:12 +09:00
Simon Riggs
bca8b7f16a Hot Standby feedback for avoidance of cleanup conflicts on standby.
Standby optionally sends back information about oldestXmin of queries
which is then checked and applied to the WALSender's proc->xmin.
GetOldestXmin() is modified slightly to agree with GetSnapshotData(),
so that all backends on primary include WALSender within their snapshots.
Note this does nothing to change the snapshot xmin on either master or
standby. Feedback piggybacks on the standby reply message.
vacuum_defer_cleanup_age is no longer used on standby, though parameter
still exists on primary, since some use cases still exist.

Simon Riggs, review comments from Fujii Masao, Heikki Linnakangas, Robert Haas
2011-02-16 19:29:37 +00:00
Robert Haas
6a77e9385e Rename max_predicate_locks_per_transaction.
The new name, max_pred_locks_per_transaction, is shorter.

Kevin Grittner, per discussion.
2011-02-15 08:04:55 -05:00
Heikki Linnakangas
cecb5901b8 Allocate all entries in the serializable xid hash up-front, so that you don't
run out of shared memory when you try to assign an xid to a transaction.

Kevin Grittner
2011-02-10 12:03:21 +02:00
Heikki Linnakangas
036bb15872 Fix allocation of RW-conflict pool in the new predicate lock manager, and
also take the RW-conflict pool into account in the PredicateLockShmemSize()
estimate.
2011-02-09 12:23:07 +02:00
Heikki Linnakangas
7202ad7b8d Fix copy-pasto in description of pg_serial, and silence compiler warning
about uninitialized field you get on some compilers.
2011-02-08 09:05:13 +02:00
Heikki Linnakangas
dafaa3efb7 Implement genuine serializable isolation level.
Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.

To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.

A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.

Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.

We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.

Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.

Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
2011-02-08 00:09:08 +02:00
Simon Riggs
8585ad3625 Fix error code for canceling statement due to conflict with recovery.
All retryable conflict errors now have an error code that indicates that
a retry is possible, correcting my incomplete fix of 2010/05/12

Tatsuo Ishii and Simon Riggs, input from Robert Haas and Florian Pflug
2011-01-31 19:20:23 +00:00
Robert Haas
7f242d880b Try to avoid running with a full fsync request queue.
When we need to insert a new entry and the queue is full, compact the
entire queue in the hopes of making room for the new entry.  Doing this
on every insertion might worsen contention on BgWriterCommLock, but
when the queue it's full, it's far better than allowing the backend to
perform its own fsync, per testing by Greg Smith as reported in
http://archives.postgresql.org/pgsql-hackers/2011-01/msg02665.php

Original idea from Greg Smith.  Patch by me.  Review by Chris Browne
and Greg Smith
2011-01-29 08:08:41 -05:00
Tom Lane
7ab6f2da23 Change inv_truncate() to not repeat its systable_getnext_ordered() scan.
In the case where the initial call of systable_getnext_ordered() returned
NULL, this function would nonetheless call it again.  That's undefined
behavior that only by chance failed to not give visibly incorrect results.
Put an if-test around the final loop to prevent that, and in passing
improve some comments.  No back-patch since there's no actual failure.

Per report from YAMAMOTO Takashi.
2011-01-26 19:33:50 -05:00
Heikki Linnakangas
b1dc45c11d Fix thinko in comment. Spotted by Jim Nasby. 2011-01-18 10:46:13 +02:00
Heikki Linnakangas
8f5d65e916 Treat a WAL sender process that hasn't started streaming yet as a regular
backend, as far as the postmaster shutdown logic is concerned. That means,
fast shutdown will wait for WAL sender processes to exit before signaling
bgwriter to finish. This avoids race conditions between a base backup stopping
or starting, and bgwriter writing the shutdown checkpoint WAL record. We don't
want e.g the end-of-backup WAL record to be written after the shutdown
checkpoint.
2011-01-15 16:38:21 +02:00
Bruce Momjian
5d950e3b0c Stamp copyrights for year 2011. 2011-01-01 13:18:15 -05:00
Robert Haas
53dbc27c62 Support unlogged tables.
The contents of an unlogged table are WAL-logged; thus, they are not
available on standby servers and are truncated whenever the database
system enters recovery.  Indexes on unlogged tables are also unlogged.
Unlogged GiST indexes are not currently supported.
2010-12-29 06:48:53 -05:00
Robert Haas
24ecde7742 Work around unfortunate getppid() behavior on BSD-ish systems.
On MacOS X, and apparently also on other BSD-derived systems, attaching
a debugger causes getppid() to return the pid of the debugging process
rather than the actual parent PID.  As a result, debugging the
autovacuum launcher, startup process, or WAL sender on such systems
causes it to exit, because the previous coding of PostmasterIsAlive()
detects postmaster death by testing whether getppid() == PostmasterPid.

Work around that behavior by checking the return value of getppid()
more carefully.  If it's PostmasterPid, the postmaster must be alive;
if it's 1, assume the postmaster is dead.  If it's any other value,
assume we've been debugged and fall through to the less-reliable
kill() test.

Review by Tom Lane.
2010-12-21 06:30:32 -05:00
Robert Haas
8bd4b89e24 Try to save a kernel call in ResolveRecoveryConflictWithVirtualXIDs.
If there's no work to be done, just exit quickly, before initialization.
2010-12-17 11:32:02 -05:00
Robert Haas
611fed3712 Reset 'ps' display just once when resolving VXID conflicts.
This prevents the word "waiting" from briefly disappearing from the ps
status line when ResolveRecoveryConflictWithVirtualXIDs begins a new
iteration of the outer loop.

Along the way, remove some useless pgstat_report_waiting() calls;
the startup process doesn't appear in pg_stat_activity.

Fujii Masao
2010-12-17 08:30:57 -05:00
Robert Haas
34c70c7ac4 Instrument checkpoint sync calls.
Greg Smith, reviewed by Jeff Janes
2010-12-14 09:26:19 -05:00
Robert Haas
5f7b58fad8 Generalize concept of temporary relations to "relation persistence".
This commit replaces pg_class.relistemp with pg_class.relpersistence;
and also modifies the RangeVar node type to carry relpersistence rather
than istemp.  It also removes removes rd_istemp from RelationData and
instead performs the correct computation based on relpersistence.

For clarity, we add three new macros: RelationNeedsWAL(),
RelationUsesLocalBuffers(), and RelationUsesTempNamespace(), so that we
can clarify the purpose of each check that previous depended on
rd_istemp.

This is intended as infrastructure for the upcoming unlogged tables
patch, as well as for future possible work on global temporary tables.
2010-12-13 12:34:26 -05:00
Tom Lane
04f4e10cfc Use symbolic names not octal constants for file permission flags.
Purely cosmetic patch to make our coding standards more consistent ---
we were doing symbolic some places and octal other places.  This patch
fixes all C-coded uses of mkdir, chmod, and umask.  There might be some
other calls I missed.  Inconsistency noted while researching tablespace
directory permissions issue.
2010-12-10 17:35:33 -05:00
Tom Lane
576477e73c Force default wal_sync_method to be fdatasync on Linux.
Recent versions of the Linux system header files cause xlogdefs.h to
believe that open_datasync should be the default sync method, whereas
formerly fdatasync was the default on Linux.  open_datasync is a bad
choice, first because it doesn't actually outperform fdatasync (in fact
the reverse), and second because we try to use O_DIRECT with it, causing
failures on certain filesystems (e.g., ext4 with data=journal option).
This part of the patch is largely per a proposal from Marti Raudsepp.
More extensive changes are likely to follow in HEAD, but this is as much
change as we want to back-patch.

Also clean up confusing code and incorrect documentation surrounding the
fsync_writethrough option.  Those changes shouldn't result in any actual
behavioral change, but I chose to back-patch them anyway to keep the
branches looking similar in this area.

In 9.0 and HEAD, also do some copy-editing on the WAL Reliability
documentation section.

Back-patch to all supported branches, since any of them might get used
on modern Linux versions.
2010-12-08 20:01:09 -05:00
Simon Riggs
e620ee35b2 Optimize commit_siblings in two ways to improve group commit.
First, avoid scanning the whole ProcArray once we know there
are at least commit_siblings active; second, skip the check
altogether if commit_siblings = 0.

Greg Smith
2010-12-08 18:48:03 +00:00
Heikki Linnakangas
5a031a5556 Fix bugs in the hot standby known-assigned-xids tracking logic. If there's
an old transaction running in the master, and a lot of transactions have
started and finished since, and a WAL-record is written in the gap between
the creating the running-xacts snapshot and WAL-logging it, recovery will fail
with "too many KnownAssignedXids" error. This bug was reported by
Joachim Wieland on Nov 19th.

In the same scenario, when fewer transactions have started so that all the
xids fit in KnownAssignedXids despite the first bug, a more serious bug
arises. We incorrectly initialize the clog code with the oldest still running
transaction, and when we see the WAL record belonging to a transaction with
an XID larger than one that committed already before the checkpoint we're
recovering from, we zero the clog page containing the already committed
transaction, leading to data loss.

In hindsight, trying to track xids in the known-assigned-xids array before
seeing the running-xacts record was too complicated. To fix that, hold
XidGenLock while the running-xacts snapshot is taken and WAL-logged. That
ensures that no transaction can begin or end in that gap, so that in recvoery
we know that the snapshot contains all transactions running at that point in
WAL.
2010-12-07 09:23:30 +01:00
Simon Riggs
ed78384acd Move call to GetTopTransactionId() earlier in LockAcquire(),
removing an infrequently occurring race condition in Hot Standby.
An xid must be assigned before a lock appears in shared memory,
rather than immediately after, else GetRunningTransactionLocks()
may see InvalidTransactionId, causing assertion failures during
lock processing on standby.

Bug report and diagnosis by Fujii Masao, fix by me.
2010-11-29 01:08:02 +00:00
Robert Haas
cc1ed40d57 Object access hook framework, with post-creation hook.
After a SQL object is created, we provide an opportunity for security
or logging plugins to get control; for example, a security label provider
could use this to assign an initial security label to newly created
objects.  The basic infrastructure is (hopefully) reusable for other types
of events that might require similar treatment.

KaiGai Kohei, with minor adjustments.
2010-11-25 11:50:13 -05:00
Robert Haas
c2281ac87c Remove belt-and-suspenders guards against buffer pin leaks.
Forcibly releasing all leftover buffer pins should be unnecessary now
that we have a robust ResourceOwner mechanism, and it significantly
increases the cost of process shutdown.  Instead, in an assert-enabled
build, assert that no pins are held; in a non-assert-enabled build, do
nothing.
2010-11-25 00:06:46 -05:00
Peter Eisentraut
fc946c39ae Remove useless whitespace at end of lines 2010-11-23 22:34:55 +02:00
Robert Haas
3134d8863e Add new buffers_backend_fsync field to pg_stat_bgwriter.
This new field counts the number of times that a backend which writes a
buffer out to the OS must also fsync() it.  This happens when the
bgwriter fsync request queue is full, and is generally detrimental to
performance, so it's good to know when it's happening.  Along the way,
log a new message at level DEBUG1 whenever we fail to hand off an fsync,
so that the problem can also be seen in examination of log files
(if the logging level is cranked up high enough).

Greg Smith, with minor tweaks by me.
2010-11-15 12:42:59 -05:00
Simon Riggs
52010027ef Avoid spurious Hot Standby conflicts from btree delete records.
Similar conflicts were already avoided for related record types.
Massive over-caution resulted in a usability bug. Clear theoretical
basis for doing this is now confirmed by me.
Request to remove from Heikki (twice), over-caution by me.
2010-11-15 09:30:13 +00:00
Robert Haas
11e482c350 Move copydir() prototype into its own header file.
Having this in src/include/port.h makes no sense, now that copydir.c lives
in src/backend/strorage rather than src/port.  Along the way, remove an
obsolete comment from contrib/pg_upgrade that makes reference to the old
location.
2010-11-12 16:39:53 -05:00
Tom Lane
54428dbe90 Fix error handling in temp-file deletion with log_temp_files active.
The original coding in FileClose() reset the file-is-temp flag before
unlinking the file, so that if control came back through due to an error,
it wouldn't try to unlink the file twice.  This was correct when written,
but when the log_temp_files feature was added, the logging action was put
in between those two steps.  An error occurring during the logging action
--- such as a query cancel --- would result in the unlink not getting done
at all, as in recent report from Michael Glaesemann.

To fix this, make sure that we do both the stat and the unlink before doing
anything that could conceivably CHECK_FOR_INTERRUPTS.  There is a judgment
call here, which is which log message to emit first: if you can see only
one, which should it be?  I chose to log unlink failure at the risk of
losing the log_temp_files log message --- after all, if the unlink does
fail, the temp file is still there for you to see.

Back-patch to all versions that have log_temp_files.  The code was OK
before that.
2010-11-08 22:14:48 -05:00
Tom Lane
5ac144d5c2 Improve messages for too many private files/dirs. Per Alexey Parshin. 2010-09-28 18:08:02 -04:00
Magnus Hagander
9f2e211386 Remove cvs keywords from all files. 2010-09-20 22:08:53 +02:00
Heikki Linnakangas
236b6bc29e Simplify Windows implementation of latches. There's no need to keep a
dynamic pool of event handles, we can permanently assign one for each
shared latch. Thanks to that, we no longer need a separate shared memory
block for latches, and we don't need to know in advance how many shared
latches there is, so you no longer need to remember to update
NumSharedLatches when you introduce a new latch to the system.
2010-09-15 10:06:21 +00:00
Heikki Linnakangas
2746e5f21d Introduce latches. A latch is a boolean variable, with the capability to
wait until it is set. Latches can be used to reliably wait until a signal
arrives, which is hard otherwise because signals don't interrupt select()
on some platforms, and even when they do, there's race conditions.

On Unix, latches use the so called self-pipe trick under the covers to
implement the sleep until the latch is set, without race conditions. On
Windows, Windows events are used.

Use the new latch abstraction to sleep in walsender, so that as soon as
a transaction finishes, walsender is woken up to immediately send the WAL
to the standby. This reduces the latency between master and standby, which
is good.

Preliminary work by Fujii Masao. The latch implementation is by me, with
helpful comments from many people.
2010-09-11 15:48:04 +00:00
Tom Lane
174a51332f Cosmetic fixes for KnownAssignedXidsGetOldestXmin, per Fujii Masao. 2010-08-30 17:30:44 +00:00
Simon Riggs
e24d1dc069 Teach GetOldestXmin() about KnownAssignedXids during recovery.
Very minor issue, though this is required for a later patch.
Reported by Heikki Linnakangas.
2010-08-30 14:16:48 +00:00
Heikki Linnakangas
e1cc96dbf0 Fix typo in comment. 2010-08-30 06:33:22 +00:00
Tom Lane
b9defe0405 Marginal code cleanup for streaming replication.
There is no reason that proc.c should have to get involved in this dirty hack
for letting the postmaster know which children are walsenders.  Revert that
file to the way it was, and confine the kluge to pmsignal.c and postmaster.c.
2010-08-23 17:20:01 +00:00
Robert Haas
a481ff71af Remove the isLocalBuf argument from ReadBuffer_common.
Since an SMgrRelation now knows whether or not the underlying relation is
temporary, there's no point in also passing that information via an
additional argument.
2010-08-20 01:07:50 +00:00
Tom Lane
79dc97a401 Bring some sanity to the trace_recovery_messages code and docs.
Per gripe from Fujii Masao, though this is not exactly his proposed patch.
Categorize as DEVELOPER_OPTIONS and set context PGC_SIGHUP, as per Fujii,
but set the default to LOG because higher values aren't really sensible
(see the code for trace_recovery()).  Fix the documentation to agree with
the code and to try to explain what the variable actually does.  Get rid
of no-op calls trace_recovery(LOG), which accomplish nothing except to
demonstrate that this option confuses even its author.
2010-08-19 22:55:01 +00:00
Tom Lane
bc7cb8f42c Allocate local buffers in a context of their own, rather than dumping them
into TopMemoryContext.  This makes no functional difference, but makes it
easier to see what the space is being used for in MemoryContextStats dumps.
Per a recent example in which I was surprised by the size of TopMemoryContext.
2010-08-19 16:16:20 +00:00
Peter Eisentraut
3f11971916 Remove extra newlines at end and beginning of files, add missing newlines
at end of files.
2010-08-19 05:57:36 +00:00
Robert Haas
d37781fa82 Tidy up a few calls to smrgextend().
In the new API introduced by my patch to include the backend ID in
temprel filenames, the last argument to smrgextend() became skipFsync
rather than isTemp, but these calls didn't get the memo.  It's not
really a problem to pass rel->rd_istemp rather than just plain false,
because smgrextend() now automatically skips the fsync for temprels
anyway, but this seems cleaner and saves some minute number of cycles.
2010-08-19 02:58:37 +00:00
Robert Haas
66b14030e8 Make LockDatabaseObject() AcceptInvalidationMessages().
This is appropriate for the same reasons we already do it in
LockSharedObject(): things might have changed while we were waiting
for the lock.  There doesn't seem to be a live bug here at the moment,
but that's mostly because it isn't currently used for very much.
2010-08-16 02:02:28 +00:00
Robert Haas
27f145a40e Further dtrace adjustments for the backend-IDs-in-relpath patch.
Update the documentation, and back out a few ill-considered changes
whose folly I failed to realize for failure to read the documentation.
2010-08-14 02:22:10 +00:00
Robert Haas
105d4c5ffe Fix assorted dtrace breakage caused by patch to include backend IDs
in temp relpaths.

Per buildfarm.
2010-08-13 22:54:17 +00:00
Robert Haas
debcec7dc3 Include the backend ID in the relpath of temporary relations.
This allows us to reliably remove all leftover temporary relation
files on cluster startup without reference to system catalogs or WAL;
therefore, we no longer include temporary relations in XLOG_XACT_COMMIT
and XLOG_XACT_ABORT WAL records.

Since these changes require including a backend ID in each
SharedInvalSmgrMsg, the size of the SharedInvalidationMessage.id
field has been reduced from two bytes to one, and the maximum number
of connections has been reduced from INT_MAX / 4 to 2^23-1.  It would
be possible to remove these restrictions by increasing the size of
SharedInvalidationMessage by 4 bytes, but right now that doesn't seem
like a good trade-off.

Review by Jaime Casanova and Tom Lane.
2010-08-13 20:10:54 +00:00
Robert Haas
30c22eb8fc Correct sundry errors in Hot Standby-related comments.
Fujii Masao
2010-08-12 23:24:54 +00:00
Robert Haas
20be0d480a Make log_temp_files based on kB, and revert docs & comments to match.
Per extensive discussion on pgsql-hackers.  We are deliberately not
back-patching this even though the behavior of 8.3 and 8.4 is
unquestionably broken, for fear of breaking existing users of this
parameter.  This incompatibility should be release-noted.
2010-07-06 22:55:26 +00:00
Bruce Momjian
239d769e7e pgindent run for 9.0, second run 2010-07-06 19:19:02 +00:00
Tom Lane
aceedd88f6 Make vacuum_defer_cleanup_age be PGC_SIGHUP level, since it's not sensible
to have different values in different processes of the primary server.
Also put it into the "Streaming Replication" GUC category; it doesn't belong
in "Standby Servers" because you use it on the master not the standby.
In passing also correct guc.c's idea of wal_keep_segments' category.
2010-07-03 21:23:58 +00:00
Tom Lane
e76c1a0f4d Replace max_standby_delay with two parameters, max_standby_archive_delay and
max_standby_streaming_delay, and revise the implementation to avoid assuming
that timestamps found in WAL records can meaningfully be compared to clock
time on the standby server.  Instead, the delay limits are compared to the
elapsed time since we last obtained a new WAL segment from archive or since
we were last "caught up" to WAL data arriving via streaming replication.
This avoids problems with clock skew between primary and standby, as well
as other corner cases that the original coding would misbehave in, such
as the primary server having significant idle time between transactions.
Per my complaint some time ago and considerable ensuing discussion.

Do some desultory editing on the hot standby documentation, too.
2010-07-03 20:43:58 +00:00
Robert Haas
bb0fe9feb9 Move copydir.c from src/port to src/backend/storage/file
The previous commit to make copydir() interruptible prevented
postgres.exe from linking on MinGW and Cygwin, because on those
platforms libpgport_srv.a can't freely reference symbols defined
by the backend.  Since that code is already backend-specific anyway,
just move the whole file into the backend rather than adding further
kludges to deal with the symbols needed by CHECK_FOR_INTERRUPTS().

This probably needs some further cleanup, but this commit just moves
the file as-is, which should hopefully be enough to turn the
buildfarm green again.
2010-07-02 17:03:30 +00:00
Itagaki Takahiro
9e3cd37576 Remove max_standby_delay message from ps display of recovery process
in waiting status. The parameter is not so interesting in ps display
because it is referable in postgresql.conf.
2010-06-14 00:49:24 +00:00
Simon Riggs
f9dbac9476 HS Defer buffer pin deadlock check until deadlock_timeout has expired.
During Hot Standby we need to check for buffer pin deadlocks when the
Startup process begins to wait, in case it never wakes up again. We
previously made the deadlock check immediately on the basis it was
cheap, though clearer thinking and prima facie evidence shows that
was too simple. Refactor existing code to make it easy to add in
deferral of deadlock check until deadlock_timeout allowing a good
reduction in deadlock checks since far few buffer pins are held for
that duration. It's worth doing anyway, though major goal is to
prevent further reports of context switching with high numbers of
users on occasional tests.
2010-05-26 19:52:52 +00:00
Simon Riggs
fd34374b17 Add many new Asserts in code and fix simple bug that slipped through
without them, related to previous commit. Report by Bruce Momjian.
2010-05-14 07:11:49 +00:00
Simon Riggs
8431e296ea Cleanup initialization of Hot Standby. Clarify working with reanalysis
of requirements and documentation on LogStandbySnapshot(). Fixes
two minor bugs reported by Tom Lane that would lead to an incorrect
snapshot after transaction wraparound. Also fix two other problems
discovered that would give incorrect snapshots in certain cases.
ProcArrayApplyRecoveryInfo() substantially rewritten. Some minor
refactoring of xact_redo_apply() and ExpireTreeKnownAssignedTransactionIds().
2010-05-13 11:15:38 +00:00
Tom Lane
f9ed327f76 Clean up some awkward, inaccurate, and inefficient processing around
MaxStandbyDelay.  Use the GUC units mechanism for the value, and choose more
appropriate timestamp functions for performing tests with it.  Make the
ps_activity manipulation in ResolveRecoveryConflictWithVirtualXIDs have
behavior similar to ps_activity code elsewhere, notably not updating the
display when update_process_title is off and not truncating the display
contents at an arbitrarily-chosen length.  Improve the docs to be explicit
about what MaxStandbyDelay actually measures, viz the difference between
primary and standby servers' clocks, and the possible hazards if their clocks
aren't in sync.
2010-05-02 02:10:33 +00:00
Tom Lane
f0488bd57c Rename the parameter recovery_connections to hot_standby, to reduce possible
confusion with streaming-replication settings.  Also, change its default
value to "off", because of concern about executing new and poorly-tested
code during ordinary non-replicating operation.  Per discussion.

In passing do some minor editing of related documentation.
2010-04-29 21:36:19 +00:00
Tom Lane
77acab75df Modify ShmemInitStruct and ShmemInitHash to throw errors internally,
rather than returning NULL for some-but-not-all failures as they used to.
Remove now-redundant tests for NULL from call sites.

We had to do something about this because many call sites were failing to
check for NULL; and changing it like this seems a lot more useful and
mistake-proof than adding checks to the call sites without them.
2010-04-28 16:54:16 +00:00
Heikki Linnakangas
9b8a73326e Introduce wal_level GUC to explicitly control if information needed for
archival or hot standby should be WAL-logged, instead of deducing that from
other options like archive_mode. This replaces recovery_connections GUC in
the primary, where it now has no effect, but it's still used in the standby
to enable/disable hot standby.

Remove the WAL-logging of "unlogged operations", like creating an index
without WAL-logging and fsyncing it at the end. Instead, we keep a copy of
the wal_mode setting and the settings that affect how much shared memory a
hot standby server needs to track master transactions (max_connections,
max_prepared_xacts, max_locks_per_xact) in pg_control. Whenever the settings
change, at server restart, write a WAL record noting the new settings and
update pg_control. This allows us to notice the change in those settings in
the standby at the right moment, they used to be included in checkpoint
records, but that meant that a changed value was not reflected in the
standby until the first checkpoint after the change.

Bump PG_CONTROL_VERSION and XLOG_PAGE_MAGIC. Whack XLOG_PAGE_MAGIC back to
the sequence it used to follow, before hot standby and subsequent patches
changed it to 0x9003.
2010-04-28 16:10:43 +00:00
Tom Lane
2871b4618a Replace the KnownAssignedXids hash table with a sorted-array data structure,
and be more tense about the locking requirements for it, to improve performance
in Hot Standby mode.  In passing fix a few bugs and improve a number of
comments in the existing HS code.

Simon Riggs, with some editorialization by Tom
2010-04-28 00:09:05 +00:00
Robert Haas
33980a0640 Fix various instances of "the the".
Two of these were pointed out by Erik Rijkers; the rest I found.
2010-04-23 23:21:44 +00:00
Simon Riggs
a2555571fb Optimise btree delete processing when no active backends.
Clarify comments, downgrade a message to DEBUG and remove some
debug counters. Direct from ideas by Heikki Linnakangas.
2010-04-22 08:04:25 +00:00
Simon Riggs
0192abc4d7 Relax locking during GetCurrentVirtualXIDs(). Earlier improvements
to handling of btree delete records mean that all snapshot
conflicts on standby now have a valid, useful latestRemovedXid.
Our earlier approach using LW_EXCLUSIVE was useful when we didnt
always have a valid value, though is no longer useful or necessary.
Asserts added to code path to prove and ensure this is the case.
This will reduce contention and improve performance of larger Hot
Standby servers.
2010-04-21 19:08:14 +00:00
Simon Riggs
7bc76d51fb Check RecoveryInProgress() while holding ProcArrayLock during snapshots.
This prevents a rare, yet possible race condition at the exact moment
of transition from recovery to normal running.
2010-04-19 18:03:38 +00:00
Simon Riggs
21d6a6a128 Tune GetSnapshotData() during Hot Standby by avoiding loop
through normal backends. Makes code clearer also, since we
avoid various Assert()s. Performance of snapshots taken
during recovery no longer depends upon number of read-only
backends.
2010-04-18 18:06:07 +00:00
Simon Riggs
19c7a59b56 Change some debug ereports to elogs, as requested by translation team. 2010-04-06 10:50:57 +00:00
Peter Eisentraut
c248d17120 Message tuning 2010-03-21 00:17:59 +00:00
Tom Lane
f784f05e95 Clear error_context_stack and debug_query_string at the beginning of proc_exit,
so that we won't try to attach any context printouts to messages that get
emitted while exiting.  Per report from Dennis Koegel, the context functions
won't necessarily work after we've started shutting down the backend, and it
seems possible that debug_query_string could be pointing at freed storage
as well.  The context information doesn't seem particularly relevant to
such messages anyway, so there's little lost by suppressing it.

Back-patch to all supported branches.  I can only demonstrate a crash with
log_disconnections messages back to 8.1, but the risk seems real in 8.0 and
before anyway.
2010-03-20 00:58:09 +00:00
Heikki Linnakangas
e0f9e2b648 Fix bug in KnownAssignedXidsMany(). I saw this when looking at the
assertion failure reported by Erik Rijkers, but this alone doesn't explain
the failure.
2010-03-11 09:26:59 +00:00
Heikki Linnakangas
daaeac88aa Fix comment which was apparently copy-pasted from another function. 2010-03-11 09:10:25 +00:00
Bruce Momjian
65e806cba1 pgindent run for 9.0 2010-02-26 02:01:40 +00:00
Tom Lane
e9a383303c Adjust pg_fsync_writethrough so that it will set errno when failing
on a platform that doesn't support this operation.  The former coding
would allow an unrelated errno to be reported, which would be quite
misleading.  Not sure if this has anything to do with the current
buildfarm failures, but it's certainly bogus as-is.
2010-02-22 15:26:14 +00:00
Tom Lane
d1e027221d Replace the pg_listener-based LISTEN/NOTIFY mechanism with an in-memory queue.
In addition, add support for a "payload" string to be passed along with
each notify event.

This implementation should be significantly more efficient than the old one,
and is also more compatible with Hot Standby usage.  There is not yet any
facility for HS slaves to receive notifications generated on the master,
although such a thing is possible in future.

Joachim Wieland, reviewed by Jeff Davis; also hacked on by me.
2010-02-16 22:34:57 +00:00
Greg Stark
f8c183a1ac Speed up CREATE DATABASE by deferring the fsyncs until after copying
all the data and using posix_fadvise to nudge the OS into flushing it
earlier. This also hopefully makes CREATE DATABASE avoid spamming the
cache.

Tests show a big speedup on Linux at least on some filesystems.

Idea and patch from Andres Freund.
2010-02-15 00:50:57 +00:00
Simon Riggs
8eccf7614b Improvements to ps message of startup process during Hot Standby.
Message is reset earlier and potential bug avoided.

Andres Freund
2010-02-13 16:29:38 +00:00
Simon Riggs
b95a720a48 Re-enable max_standby_delay = -1 using deadlock detection on startup
process. If startup waits on a buffer pin we send a request to all
backends to cancel themselves if they are holding the buffer pin
required and they are also waiting on a lock. If not, startup waits
until max_standby_delay before cancelling any backend waiting for
the requested buffer pin.
2010-02-13 01:32:20 +00:00
Simon Riggs
5cbf6dceea Fix typo bug in Hot Standby from recent refactoring. Bug introduced
into code recently patched by Andres Freund, so quickly fixed by him
when bug report from Tatsuo Ishii arrived.
2010-02-11 19:35:22 +00:00
Tom Lane
cbe9d6beb4 Fix up rickety handling of relation-truncation interlocks.
Move rd_targblock, rd_fsm_nblocks, and rd_vm_nblocks from relcache to the smgr
relation entries, so that they will get reset to InvalidBlockNumber whenever
an smgr-level flush happens.  Because we now send smgr invalidation messages
immediately (not at end of transaction) when a relation truncation occurs,
this ensures that other backends will reset their values before they next
access the relation.  We no longer need the unreliable assumption that a
VACUUM that's doing a truncation will hold its AccessExclusive lock until
commit --- in fact, we can intentionally release that lock as soon as we've
completed the truncation.  This patch therefore reverts (most of) Alvaro's
patch of 2009-11-10, as well as my marginal hacking on it yesterday.  We can
also get rid of assorted no-longer-needed relcache flushes, which are far more
expensive than an smgr flush because they kill a lot more state.

In passing this patch fixes smgr_redo's failure to perform visibility-map
truncation, and cleans up some rather dubious assumptions in freespace.c and
visibilitymap.c about when rd_fsm_nblocks and rd_vm_nblocks can be out of
date.
2010-02-09 21:43:30 +00:00
Tom Lane
16e5859cd2 Allow free space map vacuuming to be interrupted. 2010-02-09 00:28:57 +00:00
Tom Lane
0a469c8769 Remove old-style VACUUM FULL (which was known for a little while as
VACUUM FULL INPLACE), along with a boatload of subsidiary code and complexity.
Per discussion, the use case for this method of vacuuming is no longer large
enough to justify maintaining it; not to mention that we don't wish to invest
the work that would be needed to make it play nicely with Hot Standby.

Aside from the code directly related to old-style VACUUM FULL, this commit
removes support for certain WAL record types that could only be generated
within VACUUM FULL, redirect-pointer removal in heap_page_prune, and
nontransactional generation of cache invalidation sinval messages (the last
being the sticking point for Hot Standby).

We still have to retain all code that copes with finding HEAP_MOVED_OFF and
HEAP_MOVED_IN flag bits on existing tuples.  This can't be removed as long
as we want to support in-place update from pre-9.0 databases.
2010-02-08 04:33:55 +00:00
Tom Lane
70a2b05a59 Assorted cleanups in preparation for using a map file to support altering
the relfilenode of currently-not-relocatable system catalogs.

1. Get rid of inval.c's dependency on relfilenode, by not having it emit
smgr invalidations as a result of relcache flushes.  Instead, smgr sinval
messages are sent directly from smgr.c when an actual relation delete or
truncate is done.  This makes considerably more structural sense and allows
elimination of a large number of useless smgr inval messages that were
formerly sent even in cases where nothing was changing at the
physical-relation level.  Note that this reintroduces the concept of
nontransactional inval messages, but that's okay --- because the messages
are sent by smgr.c, they will be sent in Hot Standby slaves, just from a
lower logical level than before.

2. Move setNewRelfilenode out of catalog/index.c, where it never logically
belonged, into relcache.c; which is a somewhat debatable choice as well but
better than before.  (I considered catalog/storage.c, but that seemed too
low level.)  Rename to RelationSetNewRelfilenode.

3. Cosmetic cleanups of some other relfilenode manipulations.
2010-02-03 01:14:17 +00:00
Tom Lane
ab7c49c988 Fix assorted poorly-thought-out message strings: use %u not %d for printing
OIDs, avoid random line breaks in strings somebody might grep for.
2010-02-02 22:01:53 +00:00
Simon Riggs
c85c941470 Detect early deadlock in Hot Standby when Startup is already waiting. First
stage of required deadlock detection to allow re-enabling max_standby_delay
setting of -1, which is now essential in the absence of improved relation-
specific conflict resoluton. Requested by Greg Stark et al.
2010-01-31 19:01:11 +00:00
Simon Riggs
29eedd3122 Adjust GetLockConflicts() so that it uses TopMemoryContext when
executed InHotStandby. Cleaner solution than using malloc or palloc
depending upon situation, as proposed by Tom.
2010-01-29 19:45:12 +00:00
Simon Riggs
76be0c81cc Filter recovery conflicts based upon dboid from relfilenode of WAL
records for heap and btree. Minor change, mostly API changes to
pass through the required values. This is a simple change though
also provides the refactoring required for further enhancements
to conflict processing using the relOid. Changes only have effect
during Hot Standby.
2010-01-29 17:10:05 +00:00
Simon Riggs
bcd8528f00 Use malloc() in GetLockConflicts() when called InHotStandby to avoid repeated
palloc calls. Current code assumed this was already true, so this is a bug fix.
2010-01-28 10:05:37 +00:00
Simon Riggs
959ac58c04 In HS, Startup process sets SIGALRM when waiting for buffer pin. If
woken by alarm we send SIGUSR1 to all backends requesting that they
check to see if they are blocking Startup process. If so, they throw
ERROR/FATAL as for other conflict resolutions. Deadlock stop gap
removed. max_standby_delay = -1 option removed to prevent deadlock.
2010-01-23 16:37:12 +00:00
Simon Riggs
58565d78db Better internal documentation of locking for Hot Standby conflict resolution.
Discuss the reasons for the lock type we hold on ProcArrayLock while deriving
the conflict list. Cover the idea of false positive conflicts and seemingly
strange effects on snapshot derivation.
2010-01-21 00:53:58 +00:00
Tom Lane
e319e6799a Fix bogus initialization of KnownAssignedXids shared memory state ---
didn't work in EXEC_BACKEND case.
2010-01-16 17:17:26 +00:00
Simon Riggs
2edc31c439 Message mentions msec when it should be seconds, so use s instead of ms.
Noticed by Andres Freund
2010-01-16 10:13:04 +00:00
Simon Riggs
a8ce974cdd Teach standby conflict resolution to use SIGUSR1
Conflict reason is passed through directly to the backend, so we can
take decisions about the effect of the conflict based upon the local
state. No specific changes, as yet, though this prepares for later work.
CancelVirtualTransaction() sends signals while holding ProcArrayLock.
Introduce errdetail_abort() to give message detail explaining that the
abort was caused by conflict processing. Remove CONFLICT_MODE states
in favour of using PROCSIG_RECOVERY_CONFLICT states directly, for clarity.
2010-01-16 10:05:59 +00:00
Heikki Linnakangas
40f908bdcd Introduce Streaming Replication.
This includes two new kinds of postmaster processes, walsenders and
walreceiver. Walreceiver is responsible for connecting to the primary server
and streaming WAL to disk, while walsender runs in the primary server and
streams WAL from disk to the client.

Documentation still needs work, but the basics are there. We will probably
pull the replication section to a new chapter later on, as well as the
sections describing file-based replication. But let's do that as a separate
patch, so that it's easier to see what has been added/changed. This patch
also adds a new section to the chapter about FE/BE protocol, documenting the
protocol used by walsender/walreceivxer.

Bump catalog version because of two new functions,
pg_last_xlog_receive_location() and pg_last_xlog_replay_location(), for
monitoring the progress of replication.

Fujii Masao, with additional hacking by me
2010-01-15 09:19:10 +00:00
Simon Riggs
e99767bc28 First part of refactoring of code for ResolveRecoveryConflict. Purposes
of this are to centralise the conflict code to allow further change,
as well as to allow passing through the full reason for the conflict
through to the conflicting backends. Backend state alters how we
can handle different types of conflict so this is now required.
As originally suggested by Heikki, no longer optional.
2010-01-14 11:08:02 +00:00
Bruce Momjian
228170410d Please tablespace directories in their own subdirectory so pg_migrator
can upgrade clusters without renaming the tablespace directories.  New
directory structure format is, e.g.:

	$PGDATA/pg_tblspc/20981/PG_8.5_201001061/719849/83292814
2010-01-12 02:42:52 +00:00
Simon Riggs
3bfcccc295 During Hot Standby, fix drop database when sessions idle.
Previously we only cancelled sessions that were in-transaction.

Simple fix is to just cancel all sessions without waiting. Doing
it this way avoids complicating common code paths, which would
not be worth the trouble to cover this rare case.

Problem report and fix by Andres Freund, edited somewhat by me
2010-01-10 15:44:28 +00:00
Bruce Momjian
0239800893 Update copyright for the year 2010. 2010-01-02 16:58:17 +00:00
Tom Lane
bd8a35655b Suppress compiler warning (pid_t isn't int everywhere) 2009-12-31 22:07:36 +00:00
Tom Lane
b4594a66ba Add missing 'static' tag. 2009-12-31 21:47:12 +00:00
Tom Lane
85d02a6586 Redefine Datum as uintptr_t, instead of unsigned long.
This is more in keeping with modern practice, and is a first step towards
porting to Win64 (which has sizeof(pointer) > sizeof(long)).

Tsutomu Yamada, Magnus Hagander, Tom Lane
2009-12-31 19:41:37 +00:00
Simon Riggs
efc16ea520 Allow read only connections during recovery, known as Hot Standby.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.

New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.

This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.

Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.

Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
2009-12-19 01:32:45 +00:00
Robert Haas
cddca5ec13 Add an EXPLAIN (BUFFERS) option to show buffer-usage statistics.
This patch also removes buffer-usage statistics from the track_counts
output, since this (or the global server statistics) is deemed to be a better
interface to this information.

Itagaki Takahiro, reviewed by Euler Taveira de Oliveira.
2009-12-15 04:57:48 +00:00
Itagaki Takahiro
f1325ce213 Add large object access control.
A new system catalog pg_largeobject_metadata manages
ownership and access privileges of large objects.

KaiGai Kohei, reviewed by Jaime Casanova.
2009-12-11 03:34:57 +00:00
Heikki Linnakangas
ab3148b712 Fix bug in temporary file management with subtransactions. A cursor opened
in a subtransaction stays open even if the subtransaction is aborted, so
any temporary files related to it must stay alive as well. With the patch,
we use ResourceOwners to track open temporary files and don't automatically
close them at subtransaction end (though in the normal case temporary files
are registered with the subtransaction resource owner and will therefore be
closed).

At end of top transaction, we still check that there's no temporary files
marked as close-at-end-of-transaction open, but that's now just a debugging
cross-check as the resource owner cleanup should've closed them already.
2009-12-03 11:03:29 +00:00
Tom Lane
00e6a16d01 Change the autovacuum launcher to read pg_database directly, rather than
via the "flat files" facility.  This requires making it enough like a backend
to be able to run transactions; it's no longer an "auxiliary process" but
more like the autovacuum worker processes.  Also, its signal handling has
to be brought into line with backends/workers.  In particular, since it
now has to handle procsignal.c processing, the special autovac-launcher-only
signal conditions are moved to SIGUSR2.

Alvaro, with some cleanup from Tom
2009-08-31 19:41:00 +00:00
Tom Lane
04011cc970 Allow backends to start up without use of the flat-file copy of pg_database.
To make this work in the base case, pg_database now has a nailed-in-cache
relation descriptor that is initialized using hardwired knowledge in
relcache.c.  This means pg_database is added to the set of relations that
need to have a Schema_pg_xxx macro maintained in pg_attribute.h.  When this
path is taken, we'll have to do a seqscan of pg_database to find the row
we need.

In the normal case, we are able to do an indexscan to find the database's row
by name.  This is made possible by storing a global relcache init file that
describes only the shared catalogs and their indexes (and therefore is usable
by all backends in any database).  A new backend loads this cache file,
finds its database OID after an indexscan on pg_database, and then loads
the local relcache init file for that database.

This change should effectively eliminate number of databases as a factor
in backend startup time, even with large numbers of databases.  However,
the real reason for doing it is as a first step towards getting rid of
the flat files altogether.  There are still several other sub-projects
to be tackled before that can happen.
2009-08-12 20:53:31 +00:00
Heikki Linnakangas
23dc89d2c3 Improve error messages in md.c. When a filesystem operation like open() or
fsync() fails, say "file" rather than "relation" when printing the filename.

This makes messages that display block numbers a bit confusing. For example,
in message 'could not read block 150000 of file "base/1234/5678.1"', 150000
is the block number from the beginning of the relation, ie. segment 0, not
150000th block within that segment. Per discussion, users aren't usually
interested in the exact location within the file, so we can live with that.

To ease constructing error messages, add FilePathName(File) function to
return the pathname of a virtual fd.
2009-08-05 18:01:54 +00:00
Tom Lane
2487d872e0 Create a multiplexing structure for signals to Postgres child processes.
This patch gets us out from under the Unix limitation of two user-defined
signal types.  We already had done something similar for signals directed to
the postmaster process; this adds multiplexing for signals directed to
backends and auxiliary processes (so long as they're connected to shared
memory).

As proof of concept, replace the former usage of SIGUSR1 and SIGUSR2
for backends with use of the multiplexing mechanism.  There are still some
hard-wired definitions of SIGUSR1 and SIGUSR2 for other process types,
but getting rid of those doesn't seem interesting at the moment.

Fujii Masao
2009-07-31 20:26:23 +00:00
Tom Lane
8504905793 Fix a thinko introduced into CountActiveBackends by a recent patch:
we should ignore NULL array entries, not non-NULL ones.  This had the
effect of disabling commit_delay, and could have caused a crash in the
rare race condition the patch was intended to fix.

Bug report and diagnosis by Jeff Janes, in bug #4952.
2009-07-29 15:57:11 +00:00
Tom Lane
2de48a83e6 Cleanup and code review for the patch that made bgwriter active during
archive recovery.  Invent a separate state variable and inquiry function
for XLogInsertAllowed() to clarify some tests and make the management of
writing the end-of-recovery checkpoint less klugy.  Fix several places
that were incorrectly testing InRecovery when they should be looking at
RecoveryInProgress or XLogInsertAllowed (because they will now be executed
in the bgwriter not startup process).  Clarify handling of bad LSNs passed
to XLogFlush during recovery.  Use a spinlock for setting/testing
SharedRecoveryInProgress.  Improve quite a lot of comments.

Heikki and Tom
2009-06-26 20:29:04 +00:00
Heikki Linnakangas
7e48b77b1c Fix some serious bugs in archive recovery, now that bgwriter is active
during it:

When bgwriter is active, the startup process can't perform mdsync() correctly
because it won't see the fsync requests accumulated in bgwriter's private
pendingOpsTable. Therefore make bgwriter responsible for the end-of-recovery
checkpoint as well, when it's active.

When bgwriter is active (= archive recovery), the startup process must not
accumulate fsync requests to its own pendingOpsTable, since bgwriter won't
see them there when it performs restartpoints. Make startup process drop its
pendingOpsTable when bgwriter is launched to avoid that.

Update minimum recovery point one last time when leaving archive recovery.
It won't be updated by the end-of-recovery checkpoint because XLogFlush()
sees us as out of recovery already.

This fixes bug #4879 reported by Fujii Masao.
2009-06-25 21:36:00 +00:00
Tom Lane
6382448cf9 For bulk write operations (eg COPY IN), use a ring buffer of 16MB instead
of the 256KB limit originally enforced by a patch committed 2008-11-06.
Per recent test results, the smaller size resulted in an undesirable decrease
in bulk data loading speed, due to COPY processing frequently getting blocked
for WAL flushing.  This area might need more tweaking later, but this setting
seems to be good enough for 8.4.
2009-06-22 20:04:28 +00:00
Bruce Momjian
d747140279 8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef list
provided by Andrew.
2009-06-11 14:49:15 +00:00
Tom Lane
4616d57dad Fix all the server-side SIGQUIT handlers (grumble ... why so many identical
copies?) to ensure they really don't run proc_exit/shmem_exit callbacks,
as was intended.  I broke this behavior recently by installing atexit
callbacks without thinking about the one case where we truly don't want
to run those callback functions.  Noted in an example from Dave Page.
2009-05-15 15:56:39 +00:00
Tom Lane
249a899f73 Install an atexit(2) callback that ensures that proc_exit's cleanup processing
will still be performed if something in a backend process calls exit()
directly, instead of going through proc_exit() as we prefer.  This is a second
response to the issue that we might load third-party code that doesn't know it
should not call exit().  Such a call will now cause a reasonably graceful
backend shutdown, if possible.  (Of course, if the reason for the exit() call
is out-of-memory or some such, we might not be able to recover, but at least
we will try.)
2009-05-05 20:06:07 +00:00
Tom Lane
969d7cd431 Install a "dead man switch" to allow the postmaster to detect cases where
a backend has done exit(0) or exit(1) without having disengaged itself
from shared memory.  We are at risk for this whenever third-party code is
loaded into a backend, since such code might not know it's supposed to go
through proc_exit() instead.  Also, it is reported that under Windows
there are ways to externally kill a process that cause the status code
returned to the postmaster to be indistinguishable from a voluntary exit
(thank you, Microsoft).  If this does happen then the system is probably
hosed --- for instance, the dead session might still be holding locks.
So the best recovery method is to treat this like a backend crash.

The dead man switch is armed for a particular child process when it
acquires a regular PGPROC, and disarmed when the PGPROC is released;
these should be the first and last touches of shared memory resources
in a backend, or close enough anyway.  This choice means there is no
coverage for auxiliary processes, but I doubt we need that, since they
shouldn't be executing any user-provided code anyway.

This patch also improves the management of the EXEC_BACKEND
ShmemBackendArray array a bit, by reducing search costs.

Although this problem is of long standing, the lack of field complaints
seems to mean it's not critical enough to risk back-patching; at least
not till we get some more testing of this mechanism.
2009-05-05 19:59:00 +00:00
Tom Lane
c973051ae6 A session that does not have any live snapshots does not have to be waited for
when we are waiting for old snapshots to go away during a concurrent index
build.  In particular, this rule lets us avoid waiting for
idle-in-transaction sessions.

This logic could be improved further if we had some way to wake up when
the session we are currently waiting for goes idle-in-transaction.  However
that would be a significantly more complex/invasive patch, so it'll have to
wait for some other day.

Simon Riggs, with some improvements by Tom.
2009-04-04 17:40:36 +00:00
Tom Lane
1b2bb33a54 Add a comment documenting the question of whether PrefetchBuffer should
try to protect an already-existing buffer from being evicted.  This was
left as an open issue when the posix_fadvise patch was committed.  I'm
not sure there's any evidence to justify more work in this area, but we
should have some record about it in the source code.
2009-04-03 18:17:43 +00:00
Tom Lane
948d6ec90f Modify the relcache to record the temp status of both local and nonlocal
temp relations; this is no more expensive than before, now that we have
pg_class.relistemp.  Insert tests into bufmgr.c to prevent attempting
to fetch pages from nonlocal temp relations.  This provides a low-level
defense against bugs-of-omission allowing temp pages to be loaded into shared
buffers, as in the contrib/pgstattuple problem reported by Stuart Bishop.
While at it, tweak a bunch of places to use new relcache tests (instead of
expensive probes into pg_namespace) to detect local or nonlocal temp tables.
2009-03-31 22:12:48 +00:00
Heikki Linnakangas
eeeb782e60 Fix a rare race condition when commit_siblings > 0 and a transaction commits
at the same instant as a new backend is spawned. Since CountActiveBackends()
doesn't hold ProcArrayLock, it needs to be prepared for the case that a
pointer at the end of the proc array is still NULL even though numProcs says
it should be valid, since it doesn't hold ProcArrayLock. Backpatch to 8.1.
8.0 and earlier had this right, but it was broken in the split of PGPROC and
sinval shared memory arrays.

Per report and proposal by Marko Kreen.
2009-03-31 05:18:33 +00:00
Tom Lane
471913a6a5 More fixes for 8.4 DTrace probes. Remove useless BUFFER_HIT/BUFFER_MISS
probes --- the BUFFER_READ_DONE probe provides the same information and more
besides.  Expand the LOCK_WAIT_START/DONE probe arguments so that there's
actually some chance of telling what is being waited for.  Update and
clean up the documentation.
2009-03-23 01:52:38 +00:00
Tom Lane
44023dc5f5 Add isExtend to the parameters of the buffer_read_start and buffer_read_done
DTrace probes, so that ordinary reads can be distinguished from relation
extension operations.  Move buffer_read_start probe to before the
smgrnblocks() call that's needed in the isExtend case, since really that step
should be charged as part of the time needed for the extension operation.
(This makes it slightly harder to match the read_start with the associated
read_done, since now you can't match them on blockNumber, but it should still
be possible since isExtend operations on the same relation can never be
interleaved.)  Per recent discussion.

In passing, add the page identity (forkNum/blockNum) to the parameters of the
buffer_flush_start/buffer_flush_done probes, which were unaccountably lacking
the info.
2009-03-22 22:39:05 +00:00
Tom Lane
d287c9eff0 Restore previous ordering of BUFFER_FLUSH_START probe. I had wanted to
make it include the time for the possible smgropen() call, but that
results in a null pointer dereference :-(.

An alternative solution would be to fetch the buffer tag instead of
looking at *reln, but I'll just put it back as it was for the moment.

BTW, this indicates that DTrace probes evaluate their arguments even
when nominally inactive.  What was that about "zero cost", again?
2009-03-13 17:46:21 +00:00
Tom Lane
e04810e8c4 Code review for dtrace probes added (so far) to 8.4. Adjust placement of
some bufmgr probes, take out redundant and memory-leak-inducing path arguments
to smgr__md__read__done and smgr__md__write__done, fix bogus attempt to
recalculate space used in sort__done, clean up formatting in places where
I'm not sure pgindent will do a nice job by itself.
2009-03-11 23:19:25 +00:00
Peter Eisentraut
9add9f95c3 Don't actively violate the system limit of maximum open files (RLIMIT_NOFILE).
This avoids irritating kernel logs (if system overstep violations are enabled)
and also the grsecurity alert when starting PostgreSQL.

original patch by Jacek Drobiecki

References:
http://archives.postgresql.org/pgsql-bugs/2004-05/msg00103.php
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=248967
2009-03-04 09:12:49 +00:00
Heikki Linnakangas
cdd46c7654 Start background writer during archive recovery. Background writer now performs
its usual buffer cleaning duties during archive recovery, and it's responsible
for performing restartpoints.

This requires some changes in postmaster. When the startup process has done
all the initialization and is ready to start WAL redo, it signals the
postmaster to launch the background writer. The postmaster is signaled again
when the point in recovery is reached where we know that the database is in
consistent state. Postmaster isn't interested in that at the moment, but
that's the point where we could let other backends in to perform read-only
queries. The postmaster is signaled third time when the recovery has ended,
so that postmaster knows that it's safe to start accepting connections.

The startup process now traps SIGTERM, and performs a "clean" shutdown. If
you do a fast shutdown during recovery, a shutdown restartpoint is performed,
like a shutdown checkpoint, and postmaster kills the processes cleanly. You
still have to continue the recovery at next startup, though.

Currently, the background writer is only launched during archive recovery.
We could launch it during crash recovery as well, but it seems better to keep
that codepath as simple as possible, for the sake of robustness. And it
couldn't do any restartpoints during crash recovery anyway, so it wouldn't be
that useful.

log_restartpoints is gone. Use log_checkpoints instead. This is yet to be
documented.

This whole operation is a pre-requisite for Hot Standby, but has some value of
its own whether the hot standby patch makes 8.4 or not.

Simon Riggs, with lots of modifications by me.
2009-02-18 15:58:41 +00:00
Heikki Linnakangas
b2a667b9ee Add a new option to RestoreBkpBlocks() to indicate if a cleanup lock should
be used instead of the normal exclusive lock, and make WAL redo functions
responsible for calling RestoreBkpBlocks(). They know better what kind of a
lock they need.

At the moment, this just moves things around with no functional change, but
makes the hot standby patch that's under review cleaner.
2009-01-20 18:59:37 +00:00
Tom Lane
b7b8f0b609 Implement prefetching via posix_fadvise() for bitmap index scans. A new
GUC variable effective_io_concurrency controls how many concurrent block
prefetch requests will be issued.

(The best way to handle this for plain index scans is still under debate,
so that part is not applied yet --- tgl)

Greg Stark
2009-01-12 05:10:45 +00:00
Tom Lane
dad75a62bf Create a "shmem_startup_hook" to be called at the end of shared memory
initialization, to give loadable modules a reasonable place to perform
creation of any shared memory areas they need.  This is the logical conclusion
of our previous creation of RequestAddinShmemSpace() and RequestAddinLWLocks().
We don't need an explicit shmem_shutdown_hook, because the existing
on_shmem_exit and on_proc_exit mechanisms serve that need.

Also, adjust SubPostmasterMain so that libraries that got loaded into the
postmaster will be loaded into all child processes, not only regular backends.
This improves consistency with the non-EXEC_BACKEND behavior, and might be
necessary for functionality for some types of add-ons.
2009-01-03 17:08:39 +00:00
Bruce Momjian
511db38ace Update copyright for 2009. 2009-01-01 17:24:05 +00:00
Bruce Momjian
5a90bc1fbe The attached patch contains a couple of fixes in the existing probes and
includes a few new ones.

- Fixed compilation errors on OS X for probes that use typedefs
- Fixed a number of probes to pass ForkNumber per the relation forks
patch
- The new probes are those that were taken out from the previous
submitted patch and required simple fixes. Will submit the other probes
that may require more discussion in a separate patch.

Robert Lor
2008-12-17 01:39:04 +00:00
Tom Lane
55368223cd Tweak the tree descent loop in fsm_search_avail to not look at the
right child if it doesn't need to.  This saves some miniscule number
of cycles, but the ulterior motive is to avoid an optimization bug
known to exist in SCO's C compiler (and perhaps others?)
2008-12-10 17:11:18 +00:00
Heikki Linnakangas
dea81a6cf6 Revert SIGUSR1 multiplexing patch, per Tom's objection. 2008-12-09 15:59:39 +00:00
Heikki Linnakangas
7b05b3fa39 Provide support for multiplexing SIGUSR1 signal. The upcoming synchronous
replication patch needs a signal, but we've already used SIGUSR1 and
SIGUSR2 in normal backends. This patch allows reusing SIGUSR1 for that,
and for other purposes too if the need arises.
2008-12-09 14:28:20 +00:00
Alvaro Herrera
7b640b0345 Fix a couple of snapshot management bugs in the new ResourceOwner world:
non-writable large objects need to have their snapshots registered on the
transaction resowner, not the current portal's, because it must persist until
the large object is closed (which the portal does not).  Also, ensure that the
serializable snapshot is recorded by the transaction resource owner too, even
when a subtransaction has changed the current resource owner before
serializable is taken.

Per bug reports from Pavan Deolasee.
2008-12-04 14:51:02 +00:00
Heikki Linnakangas
011fa3662e Small comment fixes. 2008-12-03 12:22:53 +00:00
Heikki Linnakangas
4d6ee26171 Don't force creation of the FSM on searches. It will still be created
as soon as the first page fills up, and is marked as (almost) full,
though.
2008-11-27 13:32:26 +00:00
Heikki Linnakangas
58bece7a60 Fix #ifdeffed debugging code to work with relation forks. 2008-11-27 07:38:01 +00:00
Heikki Linnakangas
9858a8c81c Rely on relcache invalidation to update the cached size of the FSM. 2008-11-26 17:08:58 +00:00
Heikki Linnakangas
3396000684 Rethink the way FSM truncation works. Instead of WAL-logging FSM
truncations in FSM code, call FreeSpaceMapTruncateRel from smgr_redo. To
make that cleaner from modularity point of view, move the WAL-logging one
level up to RelationTruncate, and move RelationTruncate and all the
related WAL-logging to new src/backend/catalog/storage.c file. Introduce
new RelationCreateStorage and RelationDropStorage functions that are used
instead of calling smgrcreate/smgrscheduleunlink directly. Move the
pending rel deletion stuff from smgrcreate/smgrscheduleunlink to the new
functions. This leaves smgr.c as a thin wrapper around md.c; all the
transactional stuff is now in storage.c.

This will make it easier to add new forks with similar truncation logic,
like the visibility map.
2008-11-19 10:34:52 +00:00
Heikki Linnakangas
f06b7604ca Fix oversight in previous error-reporting patch; mustn't pfree path string
before passing it to elog.
2008-11-14 11:09:50 +00:00
Tom Lane
cad3a26a95 Fix sloppy omission of now-required #include's. 2008-11-11 14:17:02 +00:00
Heikki Linnakangas
7e8b0b9ab1 Change error messages to print the physical path, like
"base/11517/3767_fsm", instead of symbolic names like "1663/11517/3767/1",
per Alvaro's suggestion. I didn't change the messages in the higher-level
index, heap and FSM routines, though, where the fork is implicit.
2008-11-11 13:19:16 +00:00
Tom Lane
6517f377d6 Implement ALTER DATABASE SET TABLESPACE to move a whole database (or at least
as much of it as lives in its default tablespace) to a new tablespace.

Guillaume Lelarge, with some help from Bernd Helmle and Tom Lane
2008-11-07 18:25:07 +00:00
Tom Lane
85e2cedf98 Improve bulk-insert performance by keeping the current target buffer pinned
(but not locked, as that would risk deadlocks).  Also, make it work in a small
ring of buffers to avoid having bulk inserts trash the whole buffer arena.

Robert Haas, after an idea of Simon Riggs'.
2008-11-06 20:51:15 +00:00
Tom Lane
b4eae023bb Clean up the messy semantics (not to mention inefficiency) of PageGetTempPage
by splitting it into three functions with better-defined behaviors.

Zdenek Kotala
2008-11-03 20:47:49 +00:00
Tom Lane
d7112cfa88 Remove the last vestiges of the MAKE_PTR/MAKE_OFFSET mechanism. We haven't
allowed different processes to have different addresses for the shmem segment
in quite a long time, but there were still a few places left that used the
old coding convention.  Clean them up to reduce confusion and improve the
compiler's ability to detect pointer type mismatches.

Kris Jurka
2008-11-02 21:24:52 +00:00
Tom Lane
902d1cb35f Remove all uses of the deprecated functions heap_formtuple, heap_modifytuple,
and heap_deformtuple in favor of the newer functions heap_form_tuple et al
(which do the same things but use bool control flags instead of arbitrary
char values).  Eliminate the former duplicate coding of these functions,
reducing the deprecated functions to mere wrappers around the newer ones.
We can't get rid of them entirely because add-on modules probably still
contain many instances of the old coding style.

Kris Jurka
2008-11-02 01:45:28 +00:00
Heikki Linnakangas
e9816533e3 Update FSM on WAL replay. This is a bit limited; the FSM is only updated
on non-full-page-image WAL records, and quite arbitrarily, only if there's
less than 20% free space on the page after the insert/update (not on HOT
updates, though). The 20% cutoff should avoid most of the overhead, when
replaying a bulk insertion, for example, while ensuring that pages that
are full are marked as full in the FSM.

This is mostly to avoid the nasty worst case scenario, where you replay
from a PITR archive, and the FSM information in the base backup is really
out of date. If there was a lot of pages that the outdated FSM claims to
have free space, but don't actually have any, the first unlucky inserter
after the recovery would traverse through all those pages, just to find
out that they're full. We didn't have this problem with the old FSM
implementation, because we simply threw the FSM information away on a
non-clean shutdown.
2008-10-31 19:40:27 +00:00
Heikki Linnakangas
19c8dc839b Unite ReadBufferWithFork, ReadBufferWithStrategy, and ZeroOrReadBuffer
functions into one ReadBufferExtended function, that takes the strategy
and mode as argument. There's three modes, RBM_NORMAL which is the default
used by plain ReadBuffer(), RBM_ZERO, which replaces ZeroOrReadBuffer, and
a new mode RBM_ZERO_ON_ERROR, which allows callers to read corrupt pages
without throwing an error. The FSM needs the new mode to recover from
corrupt pages, which could happend if we crash after extending an FSM file,
and the new page is "torn".

Add fork number to some error messages in bufmgr.c, that still lacked it.
2008-10-31 15:05:00 +00:00
Alvaro Herrera
089ae3bc9a Properly access a buffer's LSN using existing access macros instead of abusing
knowledge of page layout.

Stolen from Jonah Harris' CRC patch
2008-10-20 21:11:15 +00:00
Tom Lane
dd4c165bc3 Improve some of the comments in fsmpage.c. 2008-10-07 21:10:11 +00:00
Heikki Linnakangas
89f373bf5b Index FSMs needs to be vacuumed as well. Report by Jeff Davis. 2008-10-06 08:04:11 +00:00
Tom Lane
68827a7ada Suppress an uninitialized-variable warning (not all versions of gcc
complain here, but some do)
2008-10-01 14:59:23 +00:00
Heikki Linnakangas
f06ef2bede Fix WAL redo of FSM truncation. We can't call smgrtruncate() during WAL
replay, because it tries to XLogInsert().
2008-10-01 08:12:14 +00:00
Tom Lane
6ca1b1cd95 Fix compiler warning (unportable sprintf usage) 2008-09-30 14:15:58 +00:00
Heikki Linnakangas
15c121b3ed Rewrite the FSM. Instead of relying on a fixed-size shared memory segment, the
free space information is stored in a dedicated FSM relation fork, with each
relation (except for hash indexes; they don't use FSM).

This eliminates the max_fsm_relations and max_fsm_pages GUC options; remove any
trace of them from the backend, initdb, and documentation.

Rewrite contrib/pg_freespacemap to match the new FSM implementation. Also
introduce a new variant of the get_raw_page(regclass, int4, int4) function in
contrib/pageinspect that let's you to return pages from any relation fork, and
a new fsm_page_contents() function to inspect the new FSM pages.
2008-09-30 10:52:14 +00:00
Alvaro Herrera
5817d861e9 Optimize CleanupTempFiles by having a boolean flag that keeps track of whether
there are FD_XACT_TEMPORARY files to clean up at transaction end.

Per performance profiling results on AWeber's huge systems.

Patch by me after an idea suggested by Simon Riggs.
2008-09-19 04:57:10 +00:00
Tom Lane
35c2a3c3cf Allow ShowBufferUsage() to report the number of reads/writes that have
occurred to temporary files.  This replaces the unused
NDirectFileRead/NDirectFileWrite counters.

Itagaki Takahiro
2008-09-17 13:15:55 +00:00
Heikki Linnakangas
3f0e808c4a Introduce the concept of relation forks. An smgr relation can now consist
of multiple forks, and each fork can be created and grown separately.

The bulk of this patch is about changing the smgr API to include an extra
ForkNumber argument in every smgr function. Also, smgrscheduleunlink and
smgrdounlink no longer implicitly call smgrclose, because other forks might
still exist after unlinking one. The callers of those functions have been
modified to call smgrclose instead.

This patch in itself doesn't have any user-visible effect, but provides the
infrastructure needed for upcoming patches. The additional forks envisioned
are a rewritten FSM implementation that doesn't rely on a fixed-size shared
memory block, and a visibility map to allow skipping portions of a table in
VACUUM that have no dead tuples.
2008-08-11 11:05:11 +00:00
Tom Lane
d8b04d5fac In ReadOrZeroBuffer (and related entry points), don't bother to call
PageHeaderIsValid when we zero the buffer instead of reading the page in.
The actual performance improvement is probably marginal since this function
isn't very heavily used, but a cycle saved is a cycle earned.

Zdenek Kotala
2008-08-05 15:09:04 +00:00
Tom Lane
4abd7b49f1 Improve CREATE/DROP/RENAME DATABASE so that when failing because the source
or target database is being accessed by other users, it tells you whether
the "other users" are live sessions or uncommitted prepared transactions.
(Indeed, it tells you exactly how many of each, but that's mostly just
because it was easy to do so.)  This should help forestall the gotcha of
not realizing that a prepared transaction is what's blocking the command.
Per discussion.
2008-08-04 18:03:46 +00:00
Alvaro Herrera
e36e6b1cab Add a few more DTrace probes to the backend.
Robert Lor
2008-08-01 13:16:09 +00:00
Tom Lane
dc02a4814a Fix a race condition that I introduced into sinvaladt.c during the recent
rewrite.  When called from SIInsertDataEntries, SICleanupQueue releases
the write lock if it has to issue a kill() to signal some laggard backend.
That still seems like a good idea --- but it's possible that by the time
we get the lock back, there are no longer enough free message slots to
satisfy SIInsertDataEntries' requirement.  Must recheck, and repeat the
whole SICleanupQueue process if not.  Noted while reading code.
2008-07-18 14:45:48 +00:00
Tom Lane
6816577a78 Change the PageGetContents() macro to guarantee its result is maxalign'd,
thereby forestalling any problems with alignment of the data structure placed
there.  Since SizeOfPageHeaderData is maxalign'd anyway in 8.3 and HEAD, this
does not actually change anything right now, but it is foreseeable that the
header size will change again someday.  I had to fix a couple of places that
were assuming that the content offset is just SizeOfPageHeaderData rather than
MAXALIGN(SizeOfPageHeaderData).  Per discussion of Zdenek's page-macros patch.
2008-07-13 21:50:04 +00:00
Tom Lane
9d035f4254 Clean up the use of some page-header-access macros: principally, use
SizeOfPageHeaderData instead of sizeof(PageHeaderData) in places where that
makes the code clearer, and avoid casting between Page and PageHeader where
possible.  Zdenek Kotala, with some additional cleanup by Heikki Linnakangas.

I did not apply the parts of the proposed patch that would have resulted in
slightly changing the on-disk format of hash indexes; it seems to me that's
not a win as long as there's any chance of having in-place upgrade for 8.4.
2008-07-13 20:45:47 +00:00
Alvaro Herrera
110147653a Make sure we only try to free snapshots that have been passed through
CopySnapshot, per Neil Conway.  Also add a comment about the assumption in
GetSnapshotData that the argument is statically allocated.

Also, fix some more typos in comments in snapmgr.c.
2008-07-11 02:10:14 +00:00
Tom Lane
5b965bf08b Teach autovacuum how to determine whether a temp table belongs to a crashed
backend.  If so, send a LOG message to the postmaster log, and if the table
is beyond the vacuum-for-wraparound horizon, forcibly drop it.  Per recent
discussions.  Perhaps we ought to back-patch this, but it probably needs
to age a bit in HEAD first.
2008-07-01 02:09:34 +00:00
Tom Lane
dab421d2f0 Seems I was too optimistic in supposing that sinval's maxMsgNum could be
read and written without a lock.  The value itself is atomic, sure, but on
processors with weak memory ordering it's possible for a reader to see the
value change before it sees the associated message written into the buffer
array.  Fix by introducing a spinlock that's used just to read and write
maxMsgNum.  (We could do this with less overhead if we recognized a concept
of "memory access barrier"; is it worth introducing such a thing?  At the
moment probably not --- I can't measure any clear slowdown from adding the
spinlock, so this solution is probably fine.)  Per buildfarm results.
2008-06-20 00:24:53 +00:00
Tom Lane
fad153ec45 Rewrite the sinval messaging mechanism to reduce contention and avoid
unnecessary cache resets.  The major changes are:

* When the queue overflows, we only issue a cache reset to the specific
backend or backends that still haven't read the oldest message, rather
than resetting everyone as in the original coding.

* When we observe backend(s) falling well behind, we signal SIGUSR1
to only one backend, the one that is furthest behind and doesn't already
have a signal outstanding for it.  When it finishes catching up, it will
in turn signal SIGUSR1 to the next-furthest-back guy, if there is one that
is far enough behind to justify a signal.  The PMSIGNAL_WAKEN_CHILDREN
mechanism is removed.

* We don't attempt to clean out dead messages after every message-receipt
operation; rather, we do it on the insertion side, and only when the queue
fullness passes certain thresholds.

* Split SInvalLock into SInvalReadLock and SInvalWriteLock so that readers
don't block writers nor vice versa (except during the infrequent queue
cleanout operations).

* Transfer multiple sinval messages for each acquisition of a read or
write lock.
2008-06-19 21:32:56 +00:00
Alvaro Herrera
a3540b0f65 Improve our #include situation by moving pointer types away from the
corresponding struct definitions.  This allows other headers to avoid including
certain highly-loaded headers such as rel.h and relscan.h, instead using just
relcache.h, heapam.h or genam.h, which are more lightweight and thus cause less
unnecessary dependencies.
2008-06-19 00:46:06 +00:00
Tom Lane
86fdb32bd0 Remove freeBackends counter from the sinval shared memory area. We used to
use it to help enforce superuser_reserved_backends, but since 8.1 it's
just been dead weight.
2008-06-17 20:07:08 +00:00
Heikki Linnakangas
a213f1ee6c Refactor XLogOpenRelation() and XLogReadBuffer() in preparation for relation
forks. XLogOpenRelation() and the associated light-weight relation cache in
xlogutils.c is gone, and XLogReadBuffer() now takes a RelFileNode as argument,
instead of Relation.

For functions that still need a Relation struct during WAL replay, there's a
new function called CreateFakeRelcacheEntry() that returns a fake entry like
XLogOpenRelation() used to.
2008-06-12 09:12:31 +00:00
Neil Conway
8374246054 Further tweak for comment in CheckDeadLock(), per Tom. 2008-06-09 18:23:05 +00:00
Neil Conway
da80a4b97e Fix typo in comment. 2008-06-09 06:55:34 +00:00
Alvaro Herrera
cc87402d6e Move BufferGetPageSize and BufferGetPage from bufpage.h to bufmgr.h. It is
more logical that way, and also it reduces the amount of unnecessary includes
in bufpage.h, which is widely used.

Zdenek Kotala.

My previous patch to bufpage.h should also have credited him as author, but I
forgot (sorry about that).
2008-06-08 22:00:48 +00:00
Bruce Momjian
d82a1d582c This is the patch replace offnum++ by OffsetNumberNext, to be
consistent.  OffsetNumberNext() has some casting that makes it useful.

Fujii Masao
2008-05-13 15:44:08 +00:00
Alvaro Herrera
5da9da71c4 Improve snapshot manager by keeping explicit track of snapshots.
There are two ways to track a snapshot: there's the "registered" list, which
is used for arbitrary long-lived snapshots; and there's the "active stack",
which is used for the snapshot that is considered "active" at any time.
This also allows users of snapshots to stop worrying about snapshot memory
allocation and freeing, and about using PG_TRY blocks around ActiveSnapshot
assignment.  This is all done automatically now.

As a consequence, this allows us to reset MyProc->xmin when there are no
more snapshots registered in the current backend, reducing the impact that
long-running transactions have on VACUUM.
2008-05-12 20:02:02 +00:00
Alvaro Herrera
9084399782 Put back bufmgr.h in bufpage.h -- it is needed by some macros.
Remove #include bufmgr.h from (most?) source files which already include
bufpage.h.
2008-05-12 16:06:10 +00:00
Alvaro Herrera
f8c4d7db60 Restructure some header files a bit, in particular heapam.h, by removing some
unnecessary #include lines in it.  Also, move some tuple routine prototypes and
macros to htup.h, which allows removal of heapam.h inclusion from some .c
files.

For this to work, a new header file access/sysattr.h needed to be created,
initially containing attribute numbers of system columns, for pg_dump usage.

While at it, make contrib ltree, intarray and hstore header files more
consistent with our header style.
2008-05-12 00:00:54 +00:00
Tom Lane
3c6248a828 Remove the recently added USE_SEGMENTED_FILES option, and indeed remove all
support for a nonsegmented mode from md.c.  Per recent discussions, there
doesn't seem to be much value in a "never segment" option as opposed to
segmenting with a suitably large segment size.  So instead provide a
configure-time switch to set the desired segment size in units of gigabytes.
While at it, expose a configure switch for BLCKSZ as well.

Zdenek Kotala
2008-05-02 01:08:27 +00:00
Heikki Linnakangas
9cb91f90c9 Fix two race conditions between the pending unlink mechanism that was put in
place to prevent reusing relation OIDs before next checkpoint, and DROP
DATABASE. First, if a database was dropped, bgwriter would still try to unlink
the files that the rmtree() call by the DROP DATABASE command has already
deleted, or is just about to delete. Second, if a database is dropped, and
another database is created with the same OID, bgwriter would in the worst
case delete a relation in the new database that happened to get the same OID
as a dropped relation in the old database.

To fix these race conditions:
- make rmtree() ignore ENOENT errors. This fixes the 1st race condition.
- make ForgetDatabaseFsyncRequests forget unlink requests as well.
- force checkpoint on in dropdb on all platforms

Since ForgetDatabaseFsyncRequests() is asynchronous, the 2nd change isn't
enough on its own to fix the problem of dropping and creating a database with
same OID, but forcing a checkpoint on DROP DATABASE makes it sufficient.

Per Tom Lane's bug report and proposal. Backpatch to 8.3.
2008-04-18 06:48:38 +00:00
Tom Lane
d1cbd26ded Repair two places where SIGTERM exit could leave shared memory state
corrupted.  (Neither is very important if SIGTERM is used to shut down the
whole database cluster together, but there's a problem if someone tries to
SIGTERM individual backends.)  To do this, introduce new infrastructure
macros PG_ENSURE_ERROR_CLEANUP/PG_END_ENSURE_ERROR_CLEANUP that take care
of transiently pushing an on_shmem_exit cleanup hook.  Also use this method
for createdb cleanup --- that wasn't a shared-memory-corruption problem,
but SIGTERM abort of createdb could leave orphaned files lying around.

Backpatch as far as 8.2.  The shmem corruption cases don't exist in 8.1,
and the createdb usage doesn't seem important enough to risk backpatching
further.
2008-04-16 23:59:40 +00:00
Tom Lane
ec498cdcbb Create new routines systable_beginscan_ordered, systable_getnext_ordered,
systable_endscan_ordered that have API similar to systable_beginscan etc
(in particular, the passed-in scankeys have heap not index attnums),
but guarantee ordered output, unlike the existing functions.  For the moment
these are just very thin wrappers around index_beginscan/index_getnext/etc.
Someday they might need to get smarter; but for now this is just a code
refactoring exercise to reduce the number of direct callers of index_getnext,
in preparation for changing that function's API.

In passing, remove index_getnext_indexitem, which has been dead code for
quite some time, and will have even less use than that in the presence
of run-time-lossy indexes.
2008-04-12 23:14:21 +00:00
Alvaro Herrera
73b0300b2a Move the HTSU_Result enum definition into snapshot.h, to avoid including
tqual.h into heapam.h.  This makes all inclusion of tqual.h explicit.

I also sorted alphabetically the includes on some source files.
2008-03-26 21:10:39 +00:00
Alvaro Herrera
78f02ca1f5 Rename snapmgmt.c/h to snapmgr.c/h, for consistency with other files.
Per complaint from Tom Lane.
2008-03-26 18:48:59 +00:00
Alvaro Herrera
d43b085d57 Separate snapshot management code from tuple visibility code, create a
snapmgmt.c file for the former.  The header files have also been reorganized
in three parts: the most basic snapshot definitions are now in a new file
snapshot.h, and the also new snapmgmt.h keeps the definitions for snapmgmt.c.
tqual.h has been reduced to the bare minimum.

This patch is just a first step towards managing live snapshots within a
transaction; there is no functionality change.

Per my proposal to pgsql-patches on 20080318191940.GB27458@alvh.no-ip.org and
subsequent discussion.
2008-03-26 16:20:48 +00:00
Tom Lane
9b8e1eb375 Adjust the recent patch for reporting of deadlocked queries so that we report
query texts only to the server log.  This eliminates the issue of possible
leaking of security-sensitive data in other sessions' queries.  Since the
log is presumed secure, we can now log the queries of all sessions involved
in the deadlock, whether or not they belong to the same user as the one
reporting the failure.
2008-03-24 18:22:36 +00:00
Tom Lane
4b7ae4afae Report the current queries of all backends involved in a deadlock
(if they'd be visible to the current user in pg_stat_activity).

This might look like it's subject to race conditions, but it's actually
pretty safe because at the time DeadLockReport() is constructing the
report, we haven't yet aborted our transaction and so we can expect that
everyone else involved in the deadlock is still blocked on some lock.
(There are corner cases where that might not be true, such as a statement
timeout triggering in another backend before we finish reporting; but at
worst we'd report a misleading activity string, so it seems acceptable
considering the usefulness of reporting the queries.)

Original patch by Itagaki Takahiro, heavily modified by me.
2008-03-21 21:08:31 +00:00
Bruce Momjian
fca9fff41b More README src cleanups. 2008-03-21 13:23:29 +00:00
Bruce Momjian
4e228447aa Make source code READMEs more consistent. Add CVS tags to all README files. 2008-03-20 17:55:15 +00:00
Alvaro Herrera
d54bb24cdd Move elog(DEBUG4) call outside the locked area, per suggestion from Tom Lane. 2008-03-18 12:36:43 +00:00
Peter Eisentraut
a7b7b07af3 Enable probes to work with Mac OS X Leopard and other OSes that will
support DTrace in the future.

Switch from using DTRACE_PROBEn macros to the dynamically generated macros.
Use "dtrace -h" to create a header file that contains the dynamically
generated macros to be used in the source code instead of the DTRACE_PROBEn
macros.  A dummy header file is generated for builds without DTrace support.

Author: Robert Lor <Robert.Lor@sun.com>
2008-03-17 19:44:41 +00:00
Alvaro Herrera
23057f51f5 Move ProcState definition into sinvaladt.c from sinvaladt.h, since it's not
needed anywhere after my previous patch.  Noticed by Tom Lane.

Also, remove #include <signal.h> from sinval.c.
2008-03-17 11:50:27 +00:00
Alvaro Herrera
ec6550c6c0 Modify interactions between sinval.c and sinvaladt.c. The code that actually
deals with the queue, including locking etc, is all in sinvaladt.c.  This means
that the struct definition of the queue, and the queue pointer, are now
internal "implementation details" inside sinvaladt.c.

Per my proposal dated 25-Jun-2007 and followup discussion.
2008-03-16 19:47:34 +00:00
Tom Lane
611b4393f2 Make TransactionIdIsInProgress check transam.c's single-item XID status cache
before it goes groveling through the ProcArray.  In situations where the same
recently-committed transaction ID is checked repeatedly by tqual.c, this saves
a lot of shared-memory searches.  And it's cheap enough that it shouldn't
hurt noticeably when it doesn't help.
Concept and patch by Simon, some minor tweaking and comment-cleanup by Tom.
2008-03-11 20:20:35 +00:00
Tom Lane
f0828b2fc3 Provide a build-time option to store large relations as single files, rather
than dividing them into 1GB segments as has been our longtime practice.  This
requires working support for large files in the operating system; at least for
the time being, it won't be the default.

Zdenek Kotala
2008-03-10 20:06:27 +00:00
Tom Lane
3fcc7e8e18 Reduce memory consumption during VACUUM of large relations, by using
FSMPageData (6 bytes) instead of PageFreeSpaceInfo (8 or 16 bytes)
for the temporary array of page-free-space information.

Itagaki Takahiro
2008-03-10 02:04:10 +00:00
Tom Lane
7d6e6e2e97 Fix PREPARE TRANSACTION to reject the case where the transaction has dropped a
temporary table; we can't support that because there's no way to clean up the
source backend's internal state if the eventual COMMIT PREPARED is done by
another backend.  This was checked correctly in 8.1 but I broke it in 8.2 :-(.
Patch by Heikki Linnakangas, original trouble report by John Smith.
2008-03-04 19:54:06 +00:00
Tom Lane
d50e256b67 Fix another place that was assuming that a local variable declared as
"struct varlena" would be at least word-aligned.  Per buildfarm results
from gypsy_moth.  I did a little bit of trawling for other instances of
this coding pattern, and didn't find any; but if we turn up any more
of them I think we'd better revert the "char [4]" patch and find another
way of making tuptoaster.c alignment-safe.
2008-03-01 19:26:22 +00:00
Peter Eisentraut
0474dcb608 Refactor backend makefiles to remove lots of duplicate code 2008-02-19 10:30:09 +00:00
Tom Lane
082aca9ec2 Fix PageGetExactFreeSpace() so that it actually behaves sensibly
if pd_lower > pd_upper, rather than merely claiming to.  This would
only matter if the page header were corrupt, which shouldn't occur,
but ...
2008-02-10 20:39:08 +00:00
Tom Lane
6f906905b1 Fix WaitOnLock() to ensure that the process's "waiting" flag is reset after
erroring out of a wait.  We can use a PG_TRY block for this, but add a comment
explaining why it'd be a bad idea to use it for any other state cleanup.

Back-patch to 8.2.  Prior releases had the same issue, but only with respect
to the process title, which is likely to get reset almost immediately anyway
after the transaction aborts, so it seems not worth changing them.  In 8.2
and HEAD, the pg_stat_activity "waiting" flag could remain set incorrectly
for a long time.

Per report from Gurjeet Singh.
2008-02-02 22:26:17 +00:00
Tom Lane
6322e84430 Change StatementCancelHandler() to check the DoingCommandRead flag to decide
whether to execute an immediate interrupt, rather than testing whether
LockWaitCancel() cancelled a lock wait.  The old way misclassified the case
where we were blocked in ProcWaitForSignal(), and arguably would misclassify
any other future additions of new ImmediateInterruptOK states too.  This
allows reverting the old kluge that gave LockWaitCancel() a return value,
since no callers care anymore.  Improve comments in the various
implementations of PGSemaphoreLock() to explain that on some platforms, the
assumption that semop() exits after a signal is wrong, and so we must ensure
that the signal handler itself throws elog if we want cancel or die interrupts
to be effective.  Per testing related to bug #3883, though this patch doesn't
solve those problems fully.

Perhaps this change should be back-patched, but since pre-8.3 branches aren't
really relying on autovacuum to respond to SIGINT, it doesn't seem critical
for them.
2008-01-26 19:55:08 +00:00
Tom Lane
ceb9360067 Fix CREATE INDEX CONCURRENTLY to not deadlock against an automatic or manual
VACUUM that is blocked waiting to get lock on the table being indexed.
Per report and fix suggestion from Greg Stark.
2008-01-09 21:52:36 +00:00
Tom Lane
da3df47c84 lmgr.c:DescribeLockTag was never taught about virtual xids, per Greg Stark.
Also a couple of minor tweaks to try to future-proof the code a bit better
against future locktag additions.
2008-01-08 23:18:51 +00:00
Bruce Momjian
9098ab9e32 Update copyrights in source tree to 2008. 2008-01-01 19:46:01 +00:00
Peter Eisentraut
5ca3d50db7 Clarify log messages 2007-12-13 11:55:44 +00:00
Tom Lane
895a94de6d Avoid incrementing the CommandCounter when CommandCounterIncrement is called
but no database changes have been made since the last CommandCounterIncrement.
This should result in a significant improvement in the number of "commands"
that can typically be performed within a transaction before hitting the 2^32
CommandId size limit.  In particular this buys back (and more) the possible
adverse consequences of my previous patch to fix plan caching behavior.

The implementation requires tracking whether the current CommandCounter
value has been "used" to mark any tuples.  CommandCounter values stored into
snapshots are presumed not to be used for this purpose.  This requires some
small executor changes, since the executor used to conflate the curcid of
the snapshot it was using with the command ID to mark output tuples with.
Separating these concepts allows some small simplifications in executor APIs.

Something for the TODO list: look into having CommandCounterIncrement not do
AcceptInvalidationMessages.  It seems fairly bogus to be doing it there,
but exactly where to do it instead isn't clear, and I'm disinclined to mess
with asynchronous behavior during late beta.
2007-11-30 21:22:54 +00:00
Tom Lane
eae7e00f1f Fix stupid typo in recently-added code :-( 2007-11-16 00:57:55 +00:00
Bruce Momjian
f6e8730d11 Re-run pgindent with updated list of typedefs. (Updated README should
avoid this problem in the future.)
2007-11-15 22:25:18 +00:00
Tom Lane
591b9b091c Use ftruncate() not truncate() in mdunlink. Seems Windows doesn't
support the latter.
2007-11-15 21:49:47 +00:00
Bruce Momjian
fdf5a5efb7 pgindent run for 8.3. 2007-11-15 21:14:46 +00:00
Tom Lane
6cc4451b5c Prevent re-use of a deleted relation's relfilenode until after the next
checkpoint.  This guards against an unlikely data-loss scenario in which
we re-use the relfilenode, then crash, then replay the deletion and
recreation of the file.  Even then we'd be OK if all insertions into the
new relation had been WAL-logged ... but that's not guaranteed given all
the no-WAL-logging optimizations that have recently been added.

Patch by Heikki Linnakangas, per a discussion last month.
2007-11-15 20:36:40 +00:00
Tom Lane
69500b05d6 Prevent continuing disk-space bloat when profiling (with PROFILE_PID_DIR
enabled) and autovacuum is on.  Since there will be a steady stream of autovac
worker processes exiting and dropping gmon.out files, allowing them to make
separate subdirectories results in serious bloat; and it seems unlikely that
anyone will care about those profiles anyway.  Limit the damage by forcing all
autovac workers to dump in one subdirectory, PGDATA/gprof/avworker/.

Per report from Jšrg Beyer and subsequent discussion.
2007-11-04 17:55:15 +00:00
Alvaro Herrera
acac68b2bc Allow an autovacuum worker to be interrupted automatically when it is found
to be locking another process (except when it's working to prevent Xid
wraparound problems).
2007-10-26 20:45:10 +00:00
Alvaro Herrera
745c1b2c2a Rearrange vacuum-related bits in PGPROC as a bitmask, to better support
having several of them.  Add two more flags: whether the process is
executing an ANALYZE, and whether a vacuum is for Xid wraparound (which
is obviously only set by autovacuum).

Sneakily move the worker's recently-acquired PostAuthDelay to a more useful
place.
2007-10-24 20:55:36 +00:00
Tom Lane
7a315a09dc Dept. of second thoughts: fix loop in BgBufferSync so that the exit when
bgwriter_lru_maxpages is exceeded leaves the loop variables in the
expected state.  In the original coding, we'd fail to advance
next_to_clean, causing that buffer to be probably-uselessly rechecked next
time, and also have an off-by-one idea of the number of buffers scanned.
2007-09-25 22:11:48 +00:00
Tom Lane
6f5c38dcd0 Just-in-time background writing strategy. This code avoids re-scanning
buffers that cannot possibly need to be cleaned, and estimates how many
buffers it should try to clean based on moving averages of recent allocation
requests and density of reusable buffers.  The patch also adds a couple
more columns to pg_stat_bgwriter to help measure the effectiveness of the
bgwriter.

Greg Smith, building on his own work and ideas from several other people,
in particular a much older patch from Itagaki Takahiro.
2007-09-25 20:03:38 +00:00
Tom Lane
1b3d400cac TransactionIdIsInProgress can skip scanning the ProcArray if the target XID is
later than latestCompletedXid, per Florian Pflug.  Also some minor
improvements in the XIDCACHE_DEBUG code --- make sure each call of
TransactionIdIsInProgress is counted one way or another.
2007-09-23 18:50:38 +00:00
Tom Lane
cc59049daf Improve handling of prune/no-prune decisions by storing a page's oldest
unpruned XMAX in its header.  At the cost of 4 bytes per page, this keeps us
from performing heap_page_prune when there's no chance of pruning anything.
Seems to be necessary per Heikki's preliminary performance testing.
2007-09-21 21:25:42 +00:00
Tom Lane
da072ab2ab Make some simple performance improvements in TransactionIdIsInProgress().
For XIDs of our own transaction and subtransactions, it's cheaper to ask
TransactionIdIsCurrentTransactionId() than to look in shared memory.
Also, the xids[] work array is always the same size within any given
process, so malloc it just once instead of doing a palloc/pfree on every
call; aside from being faster this lets us get rid of some goto's, since
we no longer have any end-of-function pfree to do.  Both ideas by Heikki.
2007-09-21 17:36:53 +00:00
Tom Lane
282d2a03dd HOT updates. When we update a tuple without changing any of its indexed
columns, and the new version can be stored on the same heap page, we no longer
generate extra index entries for the new version.  Instead, index searches
follow the HOT-chain links to ensure they find the correct tuple version.

In addition, this patch introduces the ability to "prune" dead tuples on a
per-page basis, without having to do a complete VACUUM pass to recover space.
VACUUM is still needed to clean up dead index entries, however.

Pavan Deolasee, with help from a bunch of other people.
2007-09-20 17:56:33 +00:00
Tom Lane
6889303531 Redefine the lp_flags field of item pointers as having four states, rather
than two independent bits (one of which was never used in heap pages anyway,
or at least hadn't been in a very long time).  This gives us flexibility to
add the HOT notions of redirected and dead item pointers without requiring
anything so klugy as magic values of lp_off and lp_len.  The state values
are chosen so that for the states currently in use (pre-HOT) there is no
change in the physical representation.
2007-09-12 22:10:26 +00:00
Tom Lane
6bd4f401b0 Replace the former method of determining snapshot xmax --- to wit, calling
ReadNewTransactionId from GetSnapshotData --- with a "latestCompletedXid"
variable that is updated during transaction commit or abort.  Since
latestCompletedXid is written only in places that had to lock ProcArrayLock
exclusively anyway, and is read only in places that had to lock ProcArrayLock
shared anyway, it adds no new locking requirements to the system despite being
cluster-wide.  Moreover, removing ReadNewTransactionId from snapshot
acquisition eliminates the need to take both XidGenLock and ProcArrayLock at
the same time.  Since XidGenLock is sometimes held across I/O this can be a
significant win.  Some preliminary benchmarking suggested that this patch has
no effect on average throughput but can significantly improve the worst-case
transaction times seen in pgbench.  Concept by Florian Pflug, implementation
by Tom Lane.
2007-09-08 20:31:15 +00:00
Tom Lane
0a51e7073c Don't take ProcArrayLock while exiting a transaction that has no XID; there is
no need for serialization against snapshot-taking because the xact doesn't
affect anyone else's snapshot anyway.  Per discussion.  Also, move various
info about the interlocking of transactions and snapshots out of code comments
and into a hopefully-more-cohesive discussion in access/transam/README.

Also, remove a couple of now-obsolete comments about having to force some WAL
to be written to persuade RecordTransactionCommit to do its thing.
2007-09-07 20:59:26 +00:00
Tom Lane
cd1aae5864 Allow CREATE INDEX CONCURRENTLY to disregard transactions in other
databases, per gripe from hubert depesz lubaczewski.  Patch from
Simon Riggs.
2007-09-07 00:58:57 +00:00
Tom Lane
0ecb4ea773 Volatile-qualify the ProcArray PGPROC pointer in a bunch of routines
that examine fields that could change under them.  This is just to make
really sure that when we are fetching a value 'only once', that's what
actually happens.  Possibly this is a bug that should be back-patched,
but in the absence of solid evidence that it's needed, I won't bother.
2007-09-05 21:11:19 +00:00
Tom Lane
295e63983d Implement lazy XID allocation: transactions that do not modify any database
rows will normally never obtain an XID at all.  We already did things this way
for subtransactions, but this patch extends the concept to top-level
transactions.  In applications where there are lots of short read-only
transactions, this should improve performance noticeably; not so much from
removal of the actual XID-assignments, as from reduction of overhead that's
driven by the rate of XID consumption.  We add a concept of a "virtual
transaction ID" so that active transactions can be uniquely identified even
if they don't have a regular XID.  This is a much lighter-weight concept:
uniqueness of VXIDs is only guaranteed over the short term, and no on-disk
record is made about them.

Florian Pflug, with some editorialization by Tom.
2007-09-05 18:10:48 +00:00
Tom Lane
24d4517b3b Improve behavior of log_lock_waits patch. Ensure that something gets logged
even if the "deadlock detected" ERROR message is suppressed by an exception
catcher.  Be clearer about the event sequence when a soft deadlock is fixed:
the fixing process might or might not still have to wait, so log that
separately.  Fix race condition when someone releases us from the lock partway
through printing all this junk --- we'd not get confused about our state, but
the log message sequence could have been misleading, ie, a "still waiting"
message with no subsequent "acquired" message.  Greg Stark and Tom Lane.
2007-08-28 03:23:44 +00:00
Tom Lane
e4f4a7f5a4 Remove FileUnlink(), which wasn't being used anywhere and interacted poorly
with the recent patch to log temp file sizes at removal time.  Doesn't seem
worth fixing since it's unused.
In passing, make a few elog messages conform to the message style guide.
2007-07-26 15:15:18 +00:00
Tom Lane
82eed4dba2 Arrange to put TOAST tables belonging to temporary tables into special schemas
named pg_toast_temp_nnn, alongside the pg_temp_nnn schemas used for the temp
tables themselves.  This allows low-level code such as the relcache to
recognize that these tables are indeed temporary, which enables various
optimizations such as not WAL-logging changes and using local rather than
shared buffers for access.  Aside from obvious performance benefits, this
provides a solution to bug #3483, in which other backends unexpectedly held
open file references to temporary tables.  The scheme preserves the property
that TOAST tables are not in any schema that's normally in the search path,
so they don't conflict with user table names.

initdb forced because of changes in system view definitions.
2007-07-25 22:16:18 +00:00
Tom Lane
fdb5b69e9c Suppress warning when compiling with -DPROFILE_PID_DIR: sys/stat.h is
supposed to be included when using mkdir().
2007-07-25 19:58:56 +00:00
Tom Lane
04fbe29a83 Fix WAL replay of truncate operations to cope with the possibility that the
truncated relation was deleted later in the WAL sequence.  Since replay
normally auto-creates a relation upon its first reference by a WAL log entry,
failure is seen only if the truncate entry happens to be the first reference
after the checkpoint we're restarting from; which is a pretty unusual case but
of course not impossible.  Fix by making truncate entries auto-create like
the other ones do.  Per report and test case from Dharmendra Goyal.
2007-07-20 16:29:53 +00:00
Tom Lane
82b3684672 Add comments spelling out why it's a good idea to release multiple
partition locks in reverse order.
2007-07-16 21:09:50 +00:00
Tom Lane
b09cb0cf12 Remove the pgstat_drop_relation() call from smgr_internal_unlink(), because
we don't know at that point which relation OID to tell pgstat to forget.
The code was passing the relfilenode, which is incorrect, and could possibly
cause some other relation's stats to be zeroed out.  While we could try to
clean this up, it seems much simpler and more reliable to let the next
invocation of pgstat_vacuum_tabstat() fix things; which indeed is how it
worked before I introduced the buggy code into 8.1.3 and later :-(.
Problem noticed by Itagaki Takahiro, fix is per subsequent discussion.
2007-07-08 22:23:16 +00:00
Tom Lane
83aaebba63 Fix incorrect comment about the timing of AbsorbFsyncRequests() during
checkpoint.  The comment claimed that we could do this anytime after
setting the checkpoint REDO point, but actually BufferSync is relying
on the assumption that buffers dumped by other backends will be fsync'd
too.  So we really could not do it any sooner than we are doing it.
2007-07-03 14:51:24 +00:00
Tom Lane
beba73763b Fix comments not updated in recent patch. 2007-07-01 02:22:23 +00:00
Tom Lane
9fc25c0511 Improve logging of checkpoints. Patch by Greg Smith, worked over
by Heikki and a little bit by me.
2007-06-30 19:12:02 +00:00
Alvaro Herrera
10af02b912 Arrange for SIGINT in autovacuum workers to cancel the current table and
continue with the schedule.  Change current uses of SIGINT to abort a worker
into SIGTERM, which keeps the old behaviour of terminating the process.

Patch from ITAGAKI Takahiro, with some editorializing of my own.
2007-06-29 17:07:39 +00:00
Tom Lane
867e2c91a0 Implement "distributed" checkpoints in which the checkpoint I/O is spread
over a fairly long period of time, rather than being spat out in a burst.
This happens only for background checkpoints carried out by the bgwriter;
other cases, such as a shutdown checkpoint, are still done at full speed.

Remove the "all buffers" scan in the bgwriter, and associated stats
infrastructure, since this seems no longer very useful when the checkpoint
itself is properly throttled.

Original patch by Itagaki Takahiro, reworked by Heikki Linnakangas,
and some minor API editorialization by me.
2007-06-28 00:02:40 +00:00
Tom Lane
9cce91dba0 Only log 'process acquired lock' if we actually did get the lock. This
test seems inessential right now since the only control path for not
getting the lock is via CHECK_FOR_INTERRUPTS which won't return control
to ProcSleep, but it would be important if we ever allow the deadlock
code to kill someone else's transaction instead of our own.
2007-06-19 22:01:15 +00:00
Tom Lane
6e07228728 Code review for log_lock_waits patch. Don't try to issue log messages from
within a signal handler (this might be safe given the relatively narrow code
range in which the interrupt is enabled, but it seems awfully risky); do issue
more informative log messages that tell what is being waited for and the exact
length of the wait; minor other code cleanup.  Greg Stark and Tom Lane
2007-06-19 20:13:22 +00:00
Tom Lane
de6a6383a7 Update obsolete comment: it's no longer the case that mdread() will allow
reads beyond EOF, except by special coercion.
2007-06-18 00:47:20 +00:00
Tom Lane
e976fd43c6 Add some simple defenses against null fields in pg_largeobject, and add
comments noting that there's an alignment assumption now that the data
field could be in 1-byte-header format.  Per discussion with Greg Stark.
2007-06-12 19:46:24 +00:00
Tom Lane
a04a423599 Arrange for large sequential scans to synchronize with each other, so that
when multiple backends are scanning the same relation concurrently, each page
is (ideally) read only once.

Jeff Davis, with review by Heikki and Tom.
2007-06-08 18:23:53 +00:00
Tom Lane
6d6d14b6d5 Redefine IsTransactionState() to only return true for TRANS_INPROGRESS state,
which is the only state in which it's safe to initiate database queries.
It turns out that all but two of the callers thought that's what it meant;
and the other two were using it as a proxy for "will GetTopTransactionId()
return a nonzero XID"?  Since it was in fact an unreliable guide to that,
make those two just invoke GetTopTransactionId() always, then deal with a
zero result if they get one.
2007-06-07 21:45:59 +00:00
Tom Lane
24ee8af573 Rework temp_tablespaces patch so that temp tablespaces are assigned separately
for each temp file, rather than once per sort or hashjoin; this allows
spreading the data of a large sort or join across multiple tablespaces.
(I remain dubious that this will make any difference in practice, but certain
people insisted.)  Arrange to cache the results of parsing the GUC variable
instead of recomputing from scratch on every demand, and push usage of the
cache down to the bottommost fd.c level.
2007-06-07 19:19:57 +00:00
Tom Lane
acfce502ba Create a GUC parameter temp_tablespaces that allows selection of the
tablespace(s) in which to store temp tables and temporary files.  This is a
list to allow spreading the load across multiple tablespaces (a random list
element is chosen each time a temp object is to be created).  Temp files are
not stored in per-database pgsql_tmp/ directories anymore, but per-tablespace
directories.

Jaime Casanova and Albert Cervera, with review by Bernd Helmle and Tom Lane.
2007-06-03 17:08:34 +00:00
Tom Lane
964ec46cfe Fix aboriginal bug in BufFileDumpBuffer that would cause it to write the
wrong data when dumping a bufferload that crosses a component-file boundary.
This probably has not been seen in the wild because (a) component files are
normally 1GB apiece and (b) non-block-aligned buffer usage is relatively
rare.  But it's fairly easy to reproduce a problem if one reduces RELSEG_SIZE
in a test build.  Kudos to Kurt Harriman for spotting the bug.
2007-06-01 23:43:11 +00:00
Tom Lane
bd0a260928 Make CREATE/DROP/RENAME DATABASE wait a little bit to see if other backends
will exit before failing because of conflicting DB usage.  Per discussion,
this seems a good idea to help mask the fact that backend exit takes nonzero
time.  Remove a couple of thereby-obsoleted sleeps in contrib and PL
regression test sequences.
2007-06-01 19:38:07 +00:00
Tom Lane
d526575f89 Make large sequential scans and VACUUMs work in a limited-size "ring" of
buffers, rather than blowing out the whole shared-buffer arena.  Aside from
avoiding cache spoliation, this fixes the problem that VACUUM formerly tended
to cause a WAL flush for every page it modified, because we had it hacked to
use only a single buffer.  Those flushes will now occur only once per
ring-ful.  The exact ring size, and the threshold for seqscans to switch into
the ring usage pattern, remain under debate; but the infrastructure seems
done.  The key bit of infrastructure is a new optional BufferAccessStrategy
object that can be passed to ReadBuffer operations; this replaces the former
StrategyHintVacuum API.

This patch also changes the buffer usage-count methodology a bit: we now
advance usage_count when first pinning a buffer, rather than when last
unpinning it.  To preserve the behavior that a buffer's lifetime starts to
decrease when it's released, the clock sweep code is modified to not decrement
usage_count of pinned buffers.

Work not done in this commit: teach GiST and GIN indexes to use the vacuum
BufferAccessStrategy for vacuum-driven fetches.

Original patch by Simon, reworked by Heikki and again by Tom.
2007-05-30 20:12:03 +00:00
Tom Lane
77947c51c0 Fix up pgstats counting of live and dead tuples to recognize that committed
and aborted transactions have different effects; also teach it not to assume
that prepared transactions are always committed.

Along the way, simplify the pgstats API by tying counting directly to
Relations; I cannot detect any redeeming social value in having stats
pointers in HeapScanDesc and IndexScanDesc structures.  And fix a few
corner cases in which counts might be missed because the relation's
pgstat_info pointer hadn't been set.
2007-05-27 03:50:39 +00:00
Tom Lane
63735ca815 Dept. of second thoughts: add comments cautioning against using
ReadOrZeroBuffer to fetch pages from beyond physical EOF.  This would
usually work, but would cause problems for md.c if writes occurred
beyond a segment boundary when the previous segment file hadn't been
fully extended.
2007-05-02 23:34:48 +00:00
Tom Lane
8c3cc86e7b During WAL recovery, when reading a page that we intend to overwrite completely
from the WAL data, don't bother to physically read it; just have bufmgr.c
return a zeroed-out buffer instead.  This speeds recovery significantly,
and also avoids unnecessary failures when a page-to-be-overwritten has corrupt
page headers on disk.  This replaces a former kluge that accomplished the
latter by pretending zero_damaged_pages was always ON during WAL recovery;
which was OK when the kluge was put in, but is unsafe when restoring a WAL
log that was written with full_page_writes off.

Heikki Linnakangas
2007-05-02 23:18:03 +00:00
Bruce Momjian
1c8302cab3 Add comment on why deadlock detection error messages only prints numbers. 2007-04-20 20:15:52 +00:00
Alvaro Herrera
e2a186b03c Add a multi-worker capability to autovacuum. This allows multiple worker
processes to be running simultaneously.  Also, now autovacuum processes do not
count towards the max_connections limit; they are counted separately from
regular processes, and are limited by the new GUC variable
autovacuum_max_workers.

The launcher now has intelligence to launch workers on each database every
autovacuum_naptime seconds, limited only on the max amount of worker slots
available.

Also, the global worker I/O utilization is limited by the vacuum cost-based
delay feature.  Workers are "balanced" so that the total I/O consumption does
not exceed the established limit.  This part of the patch was contributed by
ITAGAKI Takahiro.

Per discussion.
2007-04-16 18:30:04 +00:00
Tom Lane
995ba280c1 Rearrange mdsync() looping logic to avoid the problem that a sufficiently
fast flow of new fsync requests can prevent mdsync() from ever completing.
This was an unforeseen consequence of a patch added in Mar 2006 to prevent
the fsync request queue from overflowing.  Problem identified by Heikki
Linnakangas and independently by ITAGAKI Takahiro; fix based on ideas from
Takahiro-san, Heikki, and Tom.

Back-patch as far as 8.1 because a previous back-patch introduced the problem
into 8.1 ...
2007-04-12 17:10:55 +00:00
Tom Lane
3e23b68dac Support varlena fields with single-byte headers and unaligned storage.
This commit breaks any code that assumes that the mere act of forming a tuple
(without writing it to disk) does not "toast" any fields.  While all available
regression tests pass, I'm not totally sure that we've fixed every nook and
cranny, especially in contrib.

Greg Stark with some help from Tom Lane
2007-04-06 04:21:44 +00:00
Tom Lane
9c9b619473 Remove the CheckpointStartLock in favor of having backends show whether they
are in their commit critical sections via flags in the ProcArray.  Checkpoint
can watch the ProcArray to determine when it's safe to proceed.  This is
a considerably better solution to the original problem of race conditions
between checkpoint and transaction commit: it speeds up commit, since there's
one less lock to fool with, and it prevents the problem of checkpoint being
delayed indefinitely when there's a constant flow of commits.  Heikki, with
some kibitzing from Tom.
2007-04-03 16:34:36 +00:00
Magnus Hagander
335feca441 Add some instrumentation to the bgwriter, through the stats collector.
New view pg_stat_bgwriter, and the functions required to build it.
2007-03-30 18:34:56 +00:00
Tom Lane
e85a01df67 Clean up the representation of special snapshots by including a "method
pointer" in every Snapshot struct.  This allows removal of the case-by-case
tests in HeapTupleSatisfiesVisibility, which should make it a bit faster
(I didn't try any performance tests though).  More importantly, we are no
longer violating portable C practices by assuming that small integers are
distinct from all pointer values, and HeapTupleSatisfiesDirty no longer
has a non-reentrant API involving side-effects on a global variable.

There were a couple of places calling HeapTupleSatisfiesXXX routines
directly rather than through the HeapTupleSatisfiesVisibility macro.
Since these places had to be changed anyway, I chose to make them go
through the macro for uniformity.

Along the way I renamed HeapTupleSatisfiesSnapshot to HeapTupleSatisfiesMVCC
to emphasize that it's only used with MVCC-type snapshots.  I was sorely
tempted to rename HeapTupleSatisfiesVisibility to HeapTupleSatisfiesSnapshot,
but forebore for the moment to avoid confusion and reduce the likelihood that
this patch breaks some of the pending patches.  Might want to reconsider
doing that later.
2007-03-25 19:45:14 +00:00
Bruce Momjian
1e2bfb5811 Cleanup for procarray.c. 2007-03-23 03:16:39 +00:00
Alvaro Herrera
626eb02198 Cleanup the bootstrap code a little, and rename "dummy procs" in the code
comments and variables to "auxiliary proc", per Heikki's request.
2007-03-07 13:35:03 +00:00
Bruce Momjian
a535cdf130 Revert temp_tablespaces because of coding problems, per Tom. 2007-03-06 02:06:15 +00:00
Bruce Momjian
0763a56501 Add lo_truncate() to backend and libpq for large object truncation.
Kris Jurka
2007-03-03 19:52:47 +00:00
Neil Conway
90d76525c5 Add resetStringInfo(), which clears the content of a StringInfo, and
fixup various places in the tree that were clearing a StringInfo by hand.
Making this function a part of the API simplifies client code slightly,
and avoids needlessly peeking inside the StringInfo interface.
2007-03-03 19:32:55 +00:00
Bruce Momjian
e52c4a6e26 Add GUC log_lock_waits to log long wait times.
Simon Riggs
2007-03-03 18:46:40 +00:00
Tom Lane
fb276438b6 Suppress useless searches for unused line pointers in PageAddItem. To do
this, add a 16-bit "flags" field to page headers by stealing some bits from
pd_tli.  We use one flag bit as a hint to indicate whether there are any
unused line pointers; the remaining 15 are available for future use.

This is a cut-down form of an idea proposed by Hiroki Kataoka in July 2005.
At the time it was rejected because the original patch increased the size of
page headers and it wasn't clear that the benefit outweighed the distributed
cost.  The flag-bit approach gets most of the benefit without requiring an
increase in the page header size.

Heikki Linnakangas and Tom Lane
2007-03-02 00:48:44 +00:00
Magnus Hagander
2c6feff5e7 Remove temporary Windows-specific debugging code. 2007-02-28 15:59:30 +00:00
Tom Lane
234a02b2a8 Replace direct assignments to VARATT_SIZEP(x) with SET_VARSIZE(x, len).
Get rid of VARATT_SIZE and VARATT_DATA, which were simply redundant with
VARSIZE and VARDATA, and as a consequence almost no code was using the
longer names.  Rename the length fields of struct varlena and various
derived structures to catch anyplace that was accessing them directly;
and clean up various places so caught.  In itself this patch doesn't
change any behavior at all, but it is necessary infrastructure if we hope
to play any games with the representation of varlena headers.
Greg Stark and Tom Lane
2007-02-27 23:48:10 +00:00
Bruce Momjian
6f519ad01c btree source code cleanups:
I refactored findsplitloc and checksplitloc so that the division of
labor is more clear IMO. I pushed all the space calculation inside the
loop to checksplitloc.

I also fixed the off by 4 in free space calculation caused by
PageGetFreeSpace subtracting sizeof(ItemIdData), even though it was
harmless, because it was distracting and I felt it might come back to
bite us in the future if we change the page layout or alignments.
There's now a new function PageGetExactFreeSpace that doesn't do the
subtraction.

findsplitloc now tries the "just the new item to right page" split as
well. If people don't like the refactoring, I can write a patch to just
add that.

Heikki Linnakangas
2007-02-21 20:02:17 +00:00
Bruce Momjian
6765df9174 Add configure --enable-profiling to enable GCC profiling. Patches from
Korry Douglas and Nikhil S
2007-02-21 15:12:39 +00:00
Alvaro Herrera
1820650934 Restructure autovacuum in two processes: a dummy process, which runs
continuously, and requests vacuum runs of "autovacuum workers" to postmaster.
The workers do the actual vacuum work.  This allows for future improvements,
like allowing multiple autovacuum jobs running in parallel.

For now, the code keeps the original behavior of having a single autovac
process at any time by sleeping until the previous worker has finished.
2007-02-15 23:23:23 +00:00
Peter Eisentraut
c138b966d4 Replace useless uses of := by = in makefiles. 2007-02-09 15:56:00 +00:00
Bruce Momjian
8b4ff8b6a1 Wording cleanup for error messages. Also change can't -> cannot.
Standard English uses "may", "can", and "might" in different ways:

        may - permission, "You may borrow my rake."

        can - ability, "I can lift that log."

        might - possibility, "It might rain today."

Unfortunately, in conversational English, their use is often mixed, as
in, "You may use this variable to do X", when in fact, "can" is a better
choice.  Similarly, "It may crash" is better stated, "It might crash".
2007-02-01 19:10:30 +00:00
Bruce Momjian
148ea5cbea Add GUC temp_tablespaces to provide a default location for temporary
objects.

Jaime Casanova
2007-01-25 04:35:11 +00:00
Peter Eisentraut
2cc01004c6 Remove remains of old depend target. 2007-01-20 17:16:17 +00:00
Tom Lane
eddbf39756 Extend yesterday's patch so that the bgwriter is also told to forget
pending fsyncs during DROP DATABASE.  Obviously necessary in hindsight :-(
2007-01-17 16:25:01 +00:00
Tom Lane
6d660587f6 Revise bgwriter fsync-request mechanism to improve robustness when a table
is deleted.  A backend about to unlink a file now sends a "revoke fsync"
request to the bgwriter to make it clean out pending fsync requests.  There
is still a race condition where the bgwriter may try to fsync after the unlink
has happened, but we can resolve that by rechecking the fsync request queue
to see if a revoke request arrived meanwhile.  This eliminates the former
kluge of "just assuming" that an ENOENT failure is okay, and lets us handle
the fact that on Windows it might be EACCES too without introducing any
questionable assumptions.  After an idea of mine improved by Magnus.

The HEAD patch doesn't apply cleanly to 8.2, but I'll see about a back-port
later.  In the meantime this could do with some testing on Windows; I've been
able to force it through the code path via ENOENT, but that doesn't prove that
it actually fixes the Windows problem ...
2007-01-17 00:17:21 +00:00
Alvaro Herrera
eb63cc3da8 Arrange for autovacuum to be killed when another operation wants to be alone
accessing it, like DROP DATABASE.  This allows the regression tests to pass
with autovacuum enabled, which open the gates for finally enabling autovacuum
by default.
2007-01-16 13:28:57 +00:00
Bruce Momjian
d64995aa89 Remove trace macro call from new log_temp_files, until it gets more
research.
2007-01-09 22:03:51 +00:00
Bruce Momjian
be8a431881 Add GUC log_temp_files to log the use of temporary files.
Bill Moran
2007-01-09 21:31:17 +00:00
Bruce Momjian
29dccf5fe0 Update CVS HEAD for 2007 copyright. Back branches are typically not
back-stamped for this.
2007-01-05 22:20:05 +00:00
Tom Lane
ef07221997 Clean up smgr.c/md.c APIs as per discussion a couple months ago. Instead of
having md.c return a success/failure boolean to smgr.c, which was just going
to elog anyway, let md.c issue the elog messages itself.  This allows better
error reporting, particularly in cases such as "short read" or "short write"
which Peter was complaining of.  Also, remove the kluge of allowing mdread()
to return zeroes from a read-beyond-EOF: this is now an error condition
except when InRecovery or zero_damaged_pages = true.  (Hash indexes used to
require that behavior, but no more.)  Also, enforce that mdwrite() is to be
used for rewriting existing blocks while mdextend() is to be used for
extending the relation EOF.  This restriction lets us get rid of the old
ad-hoc defense against creating huge files by an accidental reference to
a bogus block number: we'll only create new segments in mdextend() not
mdwrite() or mdread().  (Again, when InRecovery we allow it anyway, since
we need to allow updates of blocks that were later truncated away.)
Also, clean up the original makeshift patch for bug #2737: move the
responsibility for padding relation segments to full length into md.c.
2007-01-03 18:11:01 +00:00
Tom Lane
72619f8191 Modify local buffer management to request memory for local buffers in blocks
of increasing size, instead of one at a time.  This reduces the memory
management overhead when num_temp_buffers is large: in the previous coding
we would actually waste 50% of the space used for temp buffers, because aset.c
would round the individual requests up to 16K.  Problem noted while studying
a performance issue reported by Steven Flatt.

Back-patch as far as 8.1 --- older versions used few enough local buffers
that the issue isn't significant for them.
2006-12-27 22:31:54 +00:00
Peter Eisentraut
409600942b KB -> kB 2006-11-24 09:20:12 +00:00
Tom Lane
3ad0728c81 On systems that have setsid(2) (which should be just about everything except
Windows), arrange for each postmaster child process to be its own process
group leader, and deliver signals SIGINT, SIGTERM, SIGQUIT to the whole
process group not only the direct child process.  This provides saner behavior
for archive and recovery scripts; in particular, it's possible to shut down a
warm-standby recovery server using "pg_ctl stop -m immediate", since delivery
of SIGQUIT to the startup subprocess will result in killing the waiting
recovery_command.  Also, this makes Query Cancel and statement_timeout apply
to scripts being run from backends via system().  (There is no support in the
core backend for that, but it's widely done using untrusted PLs.)  Per gripe
from Stephen Harris and subsequent discussion.
2006-11-21 20:59:53 +00:00
Tom Lane
1a5c450f30 When truncating a relation in-place (eg during VACUUM), do not try to unlink
any no-longer-needed segments; just truncate them to zero bytes and leave
the files in place for possible future re-use.  This avoids problems when
the segments are re-used due to relation growth shortly after truncation.
Before, the bgwriter, and possibly other backends, could still be holding
open file references to the old segment files, and would write dirty blocks
into those files where they'd disappear from the view of other processes.

Back-patch as far as 8.0.  I believe the 7.x branches are not vulnerable,
because they had no bgwriter, and "blind" writes by other backends would
always be done via freshly-opened file references.
2006-11-20 01:07:56 +00:00
Tom Lane
36e012e727 Remove temporary Windows-specific debugging code; it seems the problem
with fopen() not using FILE_SHARE_DELETE was indeed the bug we were after,
given lack of recent reports.
2006-11-06 17:10:22 +00:00
Tom Lane
48188e1621 Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios.  We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases.  Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId.  Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done.  Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database.  initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs.  Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 22:42:10 +00:00
Tom Lane
954c1813ac Remove an unnecessary HOLD_INTERRUPTS/RESUME_INTERRUPTS pair.
This was required back when RESUME_INTERRUPTS could actually
execute ProcessInterrupts, but that hasn't been true since 2001...
2006-10-22 20:34:54 +00:00
Tom Lane
e0dece127d Redesign the patch for allocation of shmem space and LWLocks for add-on
modules; the first try was not usable in EXEC_BACKEND builds (e.g.,
Windows).  Instead, just provide some entry points to increase the
allocation requests during postmaster start, and provide a dedicated
LWLock that can be used to synchronize allocation operations performed
by backends.  Per discussion with Marc Munro.
2006-10-15 22:04:08 +00:00
Bruce Momjian
f99a569a2e pgindent run for 8.2. 2006-10-04 00:30:14 +00:00
Tom Lane
c92f7e258e Replace strncpy with strlcpy in selected places that seem possibly relevant
to performance.  (A wholesale effort to get rid of strncpy should be
undertaken sometime, but not during beta.)  This commit also fixes dynahash.c
to correctly truncate overlength string keys for hashtables, so that its
callers don't have to anymore.
2006-09-27 18:40:10 +00:00
Tom Lane
ffae5cc5a6 Add a check to prevent overwriting valid data if smgrnblocks() gives a
wrong answer, as has been seen to occur with a buggy Linux kernel.  Not
really our bug, but it's a simple test in a seldom-used control path,
so might as well have a defense.
2006-09-25 22:01:10 +00:00
Tom Lane
d40d34863e Fix pg_locks view to call advisory locks advisory locks, while preserving
backward compatibility for anyone using the old userlock code that's now
on pgfoundry --- locks from that code still show as 'userlock'.
2006-09-22 23:20:14 +00:00
Tom Lane
9e936693a9 Fix free space map to correctly track the total amount of FSM space needed
even when a single relation requires more than max_fsm_pages pages.  Also,
make VACUUM emit a warning in this case, since it likely means that VACUUM
FULL or other drastic corrective measure is needed.  Per reports from Jeff
Frost and others of unexpected changes in the claimed max_fsm_pages need.
2006-09-21 20:31:22 +00:00
Tom Lane
9b4cda0df6 Add built-in userlock manipulation functions to replace the former
contrib functionality.  Along the way, remove the USER_LOCKS configuration
symbol, since it no longer makes any sense to try to compile that out.
No user documentation yet ... mmoncure has promised to write some.
Thanks to Abhijit Menon-Sen for creating a first draft to work from.
2006-09-18 22:40:40 +00:00
Tom Lane
2e5e856f6b Marginal cleanup in arrangements for ensuring StrategyHintVacuum is cleared
after an error during VACUUM.  We have a PG_TRY block anyway around the only
call sites, so just reset it in the CATCH clause instead of having
AtEOXact_Buffers blindly do it during xact end.  I think the old code was
actively wrong for the case of a failure during ANALYZE inside a
subtransaction --- the flag wouldn't get cleared until main transaction end.
Probably not worth back-patching though.
2006-09-17 22:16:22 +00:00
Bruce Momjian
a0e87ad7a5 Specify lo_write() to take a _const_ buffer, to match documentation. 2006-09-07 15:37:25 +00:00
Tom Lane
8fad2e3ff4 Arrange for GetSnapshotData to copy live-subtransaction XIDs from the
PGPROC array into snapshots, and use this information to avoid visits
to pg_subtrans in HeapTupleSatisfiesSnapshot.  This appears to solve
the pg_subtrans-related context swap storm problem that's been reported
by several people for 8.1.  While at it, modify GetSnapshotData to not
take an exclusive lock on ProcArrayLock, as closer analysis shows that
shared lock is always sufficient.
Itagaki Takahiro and Tom Lane
2006-09-03 15:59:39 +00:00
Tom Lane
e06fda0a8b Add a function GetLockConflicts() to lock.c to report xacts holding
locks that would conflict with a specified lock request, without
actually trying to get that lock.  Use this instead of the former ad hoc
method of doing the first wait step in CREATE INDEX CONCURRENTLY.
Fixes problem with undetected deadlock and in many cases will allow the
index creation to proceed sooner than it otherwise could've.  Per
discussion with Greg Stark.
2006-08-27 19:14:34 +00:00
Tom Lane
e093dcdd28 Add the ability to create indexes 'concurrently', that is, without
blocking concurrent writes to the table.  Greg Stark, with a little help
from Tom Lane.
2006-08-25 04:06:58 +00:00
Tom Lane
f836c2e37e Add some debug logging code to AllocateFile's failure path to log the
specific Windows error code (GetLastError).  This is a hopefully temporary
hack to try to diagnose rare failures.  Magnus Hagander
2006-08-24 03:15:43 +00:00
Tom Lane
9bf760f7de Add a 'waiting' column to pg_stat_activity to carry the same information
that ps_status provides by appending 'waiting' to the PS display.  This
completes the project of making it feasible to turn off process title
updates and instead rely on pg_stat_activity.  Per my suggestion a few
weeks ago.
2006-08-19 01:36:34 +00:00
Tom Lane
7aa772f03e Now that we've rearranged relation open to get a lock before touching
the rel, it's easy to get rid of the narrow race-condition window that
used to exist in VACUUM and CLUSTER.  Did some minor code-beautification
work in the same area, too.
2006-08-18 16:09:13 +00:00
Tom Lane
2dd7ab0627 Put back another improperly-removed #include. 2006-08-07 21:56:25 +00:00
Tom Lane
3467758809 Fix missing 'static' keywords --- some compilers gripe about this. 2006-08-04 16:42:56 +00:00
Bruce Momjian
2c6d96cef6 Add support for loadable modules to allocated shared memory and
lightweight locks.

Marc Munro
2006-08-01 19:03:11 +00:00
Tom Lane
09d3670df3 Change the relation_open protocol so that we obtain lock on a relation
(table or index) before trying to open its relcache entry.  This fixes
race conditions in which someone else commits a change to the relation's
catalog entries while we are in process of doing relcache load.  Problems
of that ilk have been reported sporadically for years, but it was not
really practical to fix until recently --- for instance, the recent
addition of WAL-log support for in-place updates helped.

Along the way, remove pg_am.amconcurrent: all AMs are now expected to support
concurrent update.
2006-07-31 20:09:10 +00:00
Tom Lane
8822263635 Fix a couple of comments. 2006-07-30 20:17:11 +00:00
Alvaro Herrera
92c2ecc130 Modify snapshot definition so that lazy vacuums are ignored by other
vacuums.  This allows a OLTP-like system with big tables to continue
regular vacuuming on small-but-frequently-updated tables while the
big tables are being vacuumed.

Original patch from Hannu Krossing, rewritten by Tom Lane and updated
by me.
2006-07-30 02:07:18 +00:00
Peter Eisentraut
e9b4969062 DTrace support, with a small initial set of probes
by Robert Lor
2006-07-24 16:32:45 +00:00
Tom Lane
a794fb0681 Convert the lock manager to use the new dynahash.c support for partitioned
hash tables, instead of the previous kluge involving multiple hash tables.
This partially undoes my patch of last December.
2006-07-23 23:08:46 +00:00
Tom Lane
b25dc481c8 Fix oversight in sizing of shared buffer lookup hashtable. Because
BufferAlloc tries to insert a new mapping entry before deleting the old one
for a buffer, we have a transient need for more than NBuffers entries ---
one more in 8.1, and as many as NUM_BUFFER_PARTITIONS more in CVS HEAD.
In theory this could lead to an "out of shared memory" failure if shmem
had already been completely claimed by the time the extra entries were
needed.
2006-07-23 18:34:45 +00:00
Tom Lane
10b9ca3d05 Split the buffer mapping table into multiple separately lockable
partitions, as per discussion.  Passes functionality checks, but
I don't have any performance data yet.
2006-07-23 03:07:58 +00:00
Tom Lane
51ee9fa157 Add support to dynahash.c for partitioning shared hashtables according
to the low-order bits of the entry hash value.  Also make some incidental
cleanups in the dynahash API, such as not exporting the hash header
structs to the world.
2006-07-22 23:04:39 +00:00
Tom Lane
c0e9b3139f Hmm, seems --disable-spinlocks has been broken for awhile and nobody
noticed.  Fix SpinlockSemas() to report the correct count considering
that PG 8.1 adds a spinlock to each shared-buffer header.
2006-07-22 21:04:40 +00:00
Tom Lane
3ff58b48c9 Put back another not-so-unnecessary #include, per report from Hiroshi Saito. 2006-07-16 01:05:23 +00:00
Tom Lane
daecd97617 Put back some more not-so-unused-as-all-that #includes. This un-breaks
the EXEC_BACKEND code on my machines, so hopefully it will fix the
Windows buildfarm members.
2006-07-15 15:47:17 +00:00
Tom Lane
cd24163f6d Fix another passel of include-file breakage. Kris Jurka, Tom Lane 2006-07-14 16:59:19 +00:00
Bruce Momjian
e0522505bd Remove 576 references of include files that were not needed. 2006-07-14 14:52:27 +00:00
Tom Lane
ae643747b1 Fix a passel of recently-committed violations of the rule 'thou shalt
have no other gods before c.h'.  Also remove some demonstrably redundant
#include lines, mostly of <errno.h> which was added to c.h years ago.
2006-07-14 05:28:29 +00:00
Bruce Momjian
a22d76d96a Allow include files to compile own their own.
Strip unused include files out unused include files, and add needed
includes to C files.

The next step is to remove unused include files in C files.
2006-07-13 16:49:20 +00:00
Bruce Momjian
370a709c75 Add GUC update_process_title to control whether 'ps' display is updated
for every command, default to on.
2006-06-27 22:16:44 +00:00
Tom Lane
27c3e3de09 Remove redundant gettimeofday() calls to the extent practical without
changing semantics too much.  statement_timestamp is now set immediately
upon receipt of a client command message, and the various places that used
to do their own gettimeofday() calls to mark command startup are referenced
to that instead.  I have also made stats_command_string use that same
value for pg_stat_activity.query_start for both the command itself and
its eventual replacement by <IDLE> or <idle in transaction>.  There was
some debate about that, but no argument that seemed convincing enough to
justify an extra gettimeofday() call.
2006-06-20 22:52:00 +00:00
Tom Lane
b13c9686d0 Take the statistics collector out of the loop for monitoring backends'
current commands; instead, store current-status information in shared
memory.  This substantially reduces the overhead of stats_command_string
and also ensures that pg_stat_activity is fully up to date at all times.
Per my recent proposal.
2006-06-19 01:51:22 +00:00
Tom Lane
8ff80c1bd3 Remove obsolete comment about VACUUM FULL: it takes buffer content locks
now, and must do so to ensure bgwriter doesn't write a page that is in
process of being compacted.
2006-06-08 14:58:33 +00:00
Bruce Momjian
26cfefabad Fix printf mask for SizeVfdCache
Qingqing Zhou
2006-05-30 13:04:59 +00:00
Tom Lane
2246e31775 Upon closer inspection, the sparc code in s_lock.c is dead code, and
always has been, because it's not got any .globl declaration!  We've
been relying on the solaris_sparc.s code instead.  Rip it out.
(Not back-patched, since this is just cosmetic cleanup.)
2006-05-12 16:50:52 +00:00
Tom Lane
ab1ad7a653 Remove unnecessary .seg/.section directives, per Alan Stange. 2006-05-11 21:58:22 +00:00
Tom Lane
5749f6ef0c Rewrite btree vacuuming to fold the former bulkdelete and cleanup operations
into a single mostly-physical-order scan of the index.  This requires some
ticklish interlocking considerations, but should create no material
performance impact on normal index operations (at least given the
already-committed changes to make scans work a page at a time).  VACUUM
itself should get significantly faster in any index that's degenerated to a
very nonlinear page order.  Also, we save one pass over the index entirely,
except in the case where there were no deletions to do and so only one pass
happened anyway.

Original patch by Heikki Linnakangas, rework by Tom Lane.
2006-05-08 00:00:17 +00:00
Tom Lane
52667d56a3 Rethink the locking mechanisms used for CREATE/DROP/RENAME DATABASE.
The former approach used ExclusiveLock on pg_database, which being a
cluster-wide lock meant only one of these operations could proceed at
a time; worse, it also blocked all incoming connections in ReverifyMyDatabase.
Now that we have LockSharedObject(), we can use locks of different types
applied to databases considered as objects.  This allows much more
flexible management of the interlocking: two CREATE DATABASEs need not
block each other, and need not block connections except to the template
database being used.  Similarly DROP DATABASE doesn't block unrelated
operations.  The locking used in flatfiles.c is also much narrower in
scope than before.  Per recent proposal.
2006-05-04 16:07:29 +00:00
Bruce Momjian
a1ee621589 Fix s_lock_test to use tas.o file, if needed. 2006-04-28 22:54:31 +00:00
Tom Lane
486f994be7 Revise large-object access routines to avoid running with CurrentMemoryContext
set to the large object context ("fscxt"), as this is inevitably a source of
transaction-duration memory leaks.  Not sure why we'd not noticed it before;
maybe people weren't touching a whole lot of LOs in the same transaction
before the 8.1 pg_dump changes.  Per report from Wayne Conrad.

Backpatched as far as 8.1, but the problem doubtless goes all the way back.
I'm disinclined to spend the time to try to verify that the older branches
would still work if patched, seeing that this code was significantly modified
for 8.0 and again for 8.1, and that we don't have any trouble reports before
8.1.  (Maybe the leaks were smaller before?)
2006-04-26 00:34:57 +00:00
Tom Lane
b5498a26de Add some optional code (conditionally compiled under #ifdef LWLOCK_STATS)
to track the number of LWLock acquisitions and the number of times we
block waiting for an LWLock, on a per-process basis.  After having needed
this twice in the past few months, seems like it should go into CVS.
2006-04-21 16:45:12 +00:00
Tom Lane
defe93463c Make the world safe for full_page_writes. Allow XLOG records that try to
update no-longer-existing pages to fall through as no-ops, but make a note
of each page number referenced by such records.  If we don't see a later
XLOG entry dropping the table or truncating away the page, complain at
the end of XLOG replay.  Since this fixes the known failure mode for
full_page_writes = off, revert my previous band-aid patch that disabled
that GUC variable.
2006-04-14 20:27:24 +00:00
Tom Lane
0fcc3c2f1d Repair a low-probability race condition identified by Qingqing Zhou.
If a process abandons a wait in LockBufferForCleanup (in practice,
only happens if someone cancels a VACUUM) just before someone else
sends it a signal indicating the buffer is available, it was possible
for the wakeup to remain in the process' semaphore, causing misbehavior
next time the process waited for an lmgr lock.  Rather than try to
prevent the race condition directly, it seems best to make the lock
manager robust against leftover wakeups, by having it repeat waiting
on the semaphore if the lock has not actually been granted or denied
yet.
2006-04-14 03:38:56 +00:00
Tom Lane
a8b8f4db23 Clean up WAL/buffer interactions as per my recent proposal. Get rid of the
misleadingly-named WriteBuffer routine, and instead require routines that
change buffer pages to call MarkBufferDirty (which does exactly what it says).
We also require that they do so before calling XLogInsert; this takes care of
the synchronization requirement documented in SyncOneBuffer.  Note that
because bufmgr takes the buffer content lock (in shared mode) while writing
out any buffer, it doesn't matter whether MarkBufferDirty is executed before
the buffer content change is complete, so long as the content change is
completed before releasing exclusive lock on the buffer.  So it's OK to set
the dirtybit before we fill in the LSN.
This eliminates the former kluge of needing to set the dirtybit in LockBuffer.
Aside from making the code more transparent, we can also add some new
debugging assertions, in particular that the caller of MarkBufferDirty must
hold the buffer content lock, not merely a pin.
2006-03-31 23:32:07 +00:00
Tom Lane
4243f2387a Suppress attempts to report dropped tables to the stats collector from a
startup or recovery process.  Since such a process isn't a real backend,
pgstat.c gets confused.  This accounts for recent reports of strange
"invalid server process ID -1" log messages during crash recovery.
There isn't any point in attempting to make the report, since we'll discard
stats in such scenarios anyhow.
2006-03-30 22:11:55 +00:00
Tom Lane
6d61cdec07 Clean up and document the API for XLogOpenRelation and XLogReadBuffer.
This commit doesn't make much functional change, but it does eliminate some
duplicated code --- for instance, PageIsNew tests are now done inside
XLogReadBuffer rather than by each caller.
The GIST xlog code still needs a lot of love, but I'll worry about that
separately.
2006-03-29 21:17:39 +00:00
Tom Lane
0a20207060 Arrange to emit a description of the current XLOG record as error context
when an error occurs during xlog replay.  Also, replace the former risky
'write into a fixed-size buffer with no overflow detection' API for XLOG
record description routines; use an expansible StringInfo instead.  (The
latter accounts for most of the patch bulk.)

Qingqing Zhou
2006-03-24 04:32:13 +00:00
Bruce Momjian
f2f5b05655 Update copyright for 2006. Update scripts. 2006-03-05 15:59:11 +00:00
Tom Lane
60d3c9fdf4 Declare the arguments of AllocateFile() as const char *, not char *.
This is consistent with the standard definition of fopen().
2006-03-04 21:32:47 +00:00
Tom Lane
9a506a6257 Arrange to call AbsorbFsyncRequests every so often while performing a
checkpoint in the bgwriter.  This forestalls overflow of the fsync request
queue, which is not fatal but causes considerable performance degradation
when it occurs (because backends then have to do their own fsyncs).  Per
patch from Itagaki Takahiro, modified a little bit by me.
2006-03-03 00:02:02 +00:00
Bruce Momjian
d5dd3d451e Add contrib/pg_freespacemap to display free space map information.
Mark Kirkwood
2006-02-12 03:55:53 +00:00
Bruce Momjian
59bb147353 Update random() usage so ranges are inclusive/exclusive as required. 2006-02-03 12:45:47 +00:00
Tom Lane
d5db3abfb6 Modify pgstats code to reduce performance penalties from oversized stats data
files: avoid creating stats hashtable entries for tables that aren't being
touched except by vacuum/analyze, ensure that entries for dropped tables are
removed promptly, and tweak the data layout to avoid storing useless struct
padding.  Also improve the performance of pgstat_vacuum_tabstat(), and make
sure that autovacuum invokes it exactly once per autovac cycle rather than
multiple times or not at all.  This should cure recent complaints about 8.1
showing much higher stats I/O volume than was seen in 8.0.  It'd still be a
good idea to revisit the design with an eye to not re-writing the entire
stats dataset every half second ... but that would be too much to backpatch,
I fear.
2006-01-18 20:35:06 +00:00
Tom Lane
558bc2584d Fix fsync code to test whether F_FULLFSYNC is available, instead of
assuming it always is on Darwin.  Per report from Neil Brandt.
2006-01-17 23:52:31 +00:00
Tom Lane
39fc1fb07a Remove logic in XactLockTableWait() that attempted to mark a crashed
transaction as aborted.  Since we only call XactLockTableWait on XIDs
that we believe to be currently running, the odds of this code ever
actually firing are minimal.  It's certainly unnecessary, since a
transaction that's not either running or committed will be presumed
aborted anyway.  What's more, it's not hard to imagine scenarios where
this could result in corrupting pg_clog: for instance, if a bogus XID
somehow got passed to XactLockTableWait.  I think the code probably
dates from the ancient era when we didn't have TransactionIdIsInProgress;
back then it may have been necessary, but now I think it's a waste of
cycles and potentially dangerous.  Per discussion with Qingqing Zhou
and Karsten Hilbert.
2006-01-13 21:32:12 +00:00
Tom Lane
304160c3e2 Fix ReadBuffer() to correctly handle the case where it's trying to extend
the relation but it finds a pre-existing valid buffer.  The buffer does not
correspond to any page known to the kernel, so we *must* do smgrextend to
ensure that the space becomes allocated.  The 7.x branches all do this
correctly, but the corner case got lost somewhere during 8.0 bufmgr rewrites.
(My fault no doubt :-( ... I think I assumed that such a buffer must be
not-BM_VALID, which is not so.)
2006-01-06 00:04:20 +00:00
Bruce Momjian
44f9021223 Remove BEOS port. 2006-01-05 03:01:38 +00:00
Tom Lane
349f40b2c2 Rearrange backend startup sequence so that ShmemIndexLock can become
an LWLock instead of a spinlock.  This hardly matters on Unix machines
but should improve startup performance on Windows (or any port using
EXEC_BACKEND).  Per previous discussion.
2006-01-04 21:06:32 +00:00
Tom Lane
195f164228 Get rid of the SpinLockAcquire/SpinLockAcquire_NoHoldoff distinction
in favor of having just one set of macros that don't do HOLD/RESUME_INTERRUPTS
(hence, these correspond to the old SpinLockAcquire_NoHoldoff case).
Given our coding rules for spinlock use, there is no reason to allow
CHECK_FOR_INTERRUPTS to be done while holding a spinlock, and also there
is no situation where ImmediateInterruptOK will be true while holding a
spinlock.  Therefore doing HOLD/RESUME_INTERRUPTS while taking/releasing a
spinlock is just a waste of cycles.  Qingqing Zhou and Tom Lane.
2005-12-29 18:08:05 +00:00
Tom Lane
fb3dbdf986 Rethink prior patch to filter out dead backend entries from the pgstats
file.  The original code probed the PGPROC array separately for each PID,
which was not good for large numbers of backends: not only is the runtime
O(N^2) but most of it is spent holding ProcArrayLock.  Instead, take the
lock just once and copy the active PIDs into an array, then use qsort
and bsearch so that the lookup time is more like O(N log N).
2005-12-16 04:03:40 +00:00
Tom Lane
ec0baf949e Divide the lock manager's shared state into 'partitions', so as to
reduce contention for the former single LockMgrLock.  Per my recent
proposal.  I set it up for 16 partitions, but on a pgbench test this
gives only a marginal further improvement over 4 partitions --- we need
to test more scenarios to choose the number of partitions.
2005-12-11 21:02:18 +00:00
Tom Lane
c599a247bb Simplify lock manager data structures by making a clear separation between
the data defining the semantics of a lock method (ie, conflict resolution
table and ancillary data, which is all constant) and the hash tables
storing the current state.  The only thing we give up by this is the
ability to use separate hashtables for different lock methods, but there
is no need for that anyway.  Put some extra fields into the LockMethod
definition structs to clean up some other uglinesses, like hard-wired
tests for DEFAULT_LOCKMETHOD and USER_LOCKMETHOD.  This commit doesn't
do anything about the performance issues we were discussing, but it clears
away some of the underbrush that's in the way of fixing that.
2005-12-09 01:22:04 +00:00
Tom Lane
f38c3e778a Fix thinko in comment. 2005-12-08 15:38:29 +00:00
Tom Lane
887a7c61f6 Get rid of slru.c's hardwired insistence on a fixed number of slots per
SLRU area.  The number of slots is still a compile-time constant (someday
we might want to change that), but at least it's a different constant for
each SLRU area.  Increase number of subtrans buffers to 32 based on
experimentation with a heavily subtrans-bashing test case, and increase
number of multixact member buffers to 16, since it's obviously silly for
it not to be at least twice the number of multixact offset buffers.
2005-12-06 23:08:34 +00:00
Tom Lane
a98871b7ac Tweak indexscan machinery to avoid taking an AccessShareLock on an index
if we already have a stronger lock due to the index's table being the
update target table of the query.  Same optimization I applied earlier
at the table level.  There doesn't seem to be much interest in the more
radical idea of not locking indexes at all, so do what we can ...
2005-12-03 05:51:03 +00:00
Tom Lane
ace17c1d82 Retry in FileRead and FileWrite if Windows returns ERROR_NO_SYSTEM_RESOURCES.
Also add a retry for Unixen returning EINTR, which hasn't been reported
as an issue but at least theoretically could be.  Patch by Qingqing Zhou,
some minor adjustments by me.
2005-12-01 20:24:18 +00:00
Bruce Momjian
436a2956d8 Re-run pgindent, fixing a problem where comment lines after a blank
comment line where output as too long, and update typedefs for /lib
directory.  Also fix case where identifiers were used as variable names
in the backend, but as typedefs in ecpg (favor the backend for
indenting).

Backpatch to 8.1.X.
2005-11-22 18:17:34 +00:00
Tom Lane
c859308aba DropRelFileNodeBuffers failed to fix the state of the lookup hash table
that was added to localbuf.c in 8.1; therefore, applying it to a temp table
left corrupt lookup state in memory.  The only case where this had a
significant chance of causing problems was an ON COMMIT DELETE ROWS temp
table; the other possible paths left bogus state that was unlikely to
be used again.  Per report from Csaba Nagy.
2005-11-17 17:42:02 +00:00
Tom Lane
48052de722 Repair an error introduced by log_line_prefix patch: it is not acceptable
to assume that the string pointer passed to set_ps_display is good forever.
There's no need to anyway since ps_status.c itself saves the string, and
we already had an API (get_ps_display) to return it.
I believe this explains Jim Nasby's report of intermittent crashes in
elog.c when %i format code is in use in log_line_prefix.
While at it, repair a previously unnoticed problem: on some platforms such as
Darwin, the string returned by get_ps_display was blank-padded to the maximum
length, meaning that lock.c's attempt to append " waiting" to it never worked.
2005-11-05 03:04:53 +00:00
Peter Eisentraut
07bb9f086b Message corrections 2005-10-29 00:31:52 +00:00
Tom Lane
fbbe00242d Tweak buffer manager so that 'internal' accesses to a buffer do not
advance its usage_count.  This includes writes of dirty buffers triggered
by bgwriter, checkpoint, or FlushRelationBuffers, as well as various
corner cases that really ought not count as accesses to the page.
Should make for some marginal improvement in the quality of our decisions
about when to recycle buffers.  Per suggestion from ITAGAKI Takahiro.
2005-10-27 17:07:58 +00:00
Bruce Momjian
1dc3498251 Standard pgindent run for 8.1. 2005-10-15 02:49:52 +00:00
Neil Conway
c10dba2fe3 Remove an antiquated comment. 2005-10-13 06:24:05 +00:00
Tom Lane
fa72121594 Fix another recently-changed place that was messing with spinlock-
protected data structures and not using a volatile pointer for same.
2005-10-12 16:55:59 +00:00
Tom Lane
07eeb9d109 Do all accesses to shared buffer headers through volatile-qualified
pointers, to ensure that compilers won't rearrange accesses to occur
while we're not holding the buffer header spinlock.  It's probably
not necessary to mark volatile in every single place in bufmgr.c,
but better safe than sorry.  Per trouble report from Kevin Grittner.
2005-10-12 16:45:14 +00:00
Tom Lane
a72ee09090 Add infrastructure for making spins_per_delay variable depending on
whether we seem to be running in a uniprocessor or multiprocessor.
The adjustment rules could probably still use further tweaking, but
I'm convinced this should be a win overall.
2005-10-11 20:41:32 +00:00
Tom Lane
82e861fbe1 Fix LWLockAssign() so that it can safely be executed after postmaster
initialization.  Add spinlocking, fix EXEC_BACKEND unsafeness.
2005-10-07 21:42:38 +00:00
Tom Lane
bb55e583f6 Allocate a few extra LWLocks for possible use by add-on modules.
Per request from Marc Munro.
2005-10-07 20:11:03 +00:00
Bruce Momjian
4f915cd377 This patch cleans up the access to members of ItemIdData.
It uses existing macros instead of touching directly.

ITAGAKI Takahiro
2005-09-22 16:46:00 +00:00
Bruce Momjian
658657177e Print proper cause of statement cancel, user interaction or timeout. 2005-09-19 17:21:49 +00:00
Tom Lane
dc06734a72 Force the size and alignment of LWLock array entries to be either 16 or 32
bytes.  This shouldn't make any difference on x86 machines, where the size
happened to be 16 bytes anyway, but on 64-bit machines and machines with
slock_t int or wider, it will speed array indexing and hopefully reduce
SMP cache contention effects.  Per recent experimentation.
2005-09-16 00:30:05 +00:00
Tom Lane
396526d8c3 Adjust m68k spinlock code to avoid duplicate in-line and not-in-line
definitions on recent Linux systems, per Martin Pitt.
2005-08-26 14:47:35 +00:00
Tom Lane
1a33436224 Replace out-of-line tas() assembly code for MIPS with a properly
constrained GCC inline version.  Thiemo Seufer, by way of Martin Pitt.
2005-08-25 17:17:10 +00:00
Tom Lane
0007490e09 Convert the arithmetic for shared memory size calculation from 'int'
to 'Size' (that is, size_t), and install overflow detection checks in it.
This allows us to remove the former arbitrary restrictions on NBuffers
etc.  It won't make any difference in a 32-bit machine, but in a 64-bit
machine you could theoretically have terabytes of shared buffers.
(How efficiently we could manage 'em remains to be seen.)  Similarly,
num_temp_buffers, work_mem, and maintenance_work_mem can be set above
2Gb on a 64-bit machine.  Original patch from Koichi Suzuki, additional
work by moi.
2005-08-20 23:26:37 +00:00
Tatsuo Ishii
bc3991c185 Add BackendXidGetPid(). 2005-08-20 01:26:36 +00:00
Bruce Momjian
28d0515d18 Fix FSM warning to mention increasing max_fsm_pages. Was incorrectly
max_fsm_relations.
2005-08-17 03:50:59 +00:00
Bruce Momjian
27639809d2 Reverse out Assert addition. 2005-08-12 23:13:54 +00:00
Bruce Momjian
fab177e64f Improve documention on loading large data sets into plperl.
David Fetter
2005-08-12 21:42:53 +00:00
Tom Lane
3ae7e4a33b Remove BufferBlockPointers array in favor of a base + (bufnum) * BLCKSZ
computation.  On modern machines this is as fast if not faster, and we
don't have to clog the CPU's L2 cache with a tens-of-KB pointer array.
If we ever decide to adopt a more dynamic allocation method for shared
buffers, we'll probably have to revert this patch, but in the meantime
we might as well save a few bytes and nanoseconds.  Per Qingqing Zhou.
2005-08-12 05:05:51 +00:00
Tom Lane
721e53785d Solve the problem of OID collisions by probing for duplicate OIDs
whenever we generate a new OID.  This prevents occasional duplicate-OID
errors that can otherwise occur once the OID counter has wrapped around.
Duplicate relfilenode values are also checked for when creating new
physical files.  Per my recent proposal.
2005-08-12 01:36:05 +00:00
Tom Lane
15269b5955 Avoid useless loop overhead in AtEOXact routines when the backend is
compiled with USE_ASSERT_CHECKING but is running with assert_enabled false.
2005-08-08 19:44:22 +00:00
Tom Lane
7117cd3a77 Cause ShutdownPostgres to do a normal transaction abort during backend
exit, instead of trying to take shortcuts.  Introduce some additional
shutdown callback routines to eliminate kluges like having ProcKill
be responsible for shutting down the buffer manager.  Ensure that the
order of operations during shutdown is predictable and what you would
expect given the module layering.
2005-08-08 03:12:16 +00:00
Tom Lane
5337ad464e Fix count_usable_fds() to stop trying to open files once it reaches
max_files_per_process.  Going further than that is just a waste of
cycles, and it seems that current Cygwin does not cope gracefully
with deliberately running the system out of FDs.  Per Andrew Dunstan.
2005-08-07 18:47:19 +00:00
Tom Lane
6eac4e69cf Tweak BgBufferSync() so that a persistent write error on a dirty buffer
doesn't block the bgwriter from making progress writing out other buffers.
This was a hard problem in the context of the ARC/2Q design, but it's
trivial in the context of clock sweep ... just advance the sweep counter
before we try to write not after.
2005-08-02 20:52:08 +00:00
Tom Lane
2a4fad1a0e Add NOWAIT option to SELECT FOR UPDATE/SHARE.
Original patch by Hans-Juergen Schoenig, revisions by Karel Zak
and Tom Lane.
2005-08-01 20:31:16 +00:00
Tom Lane
d42cf5a42a Add per-user and per-database connection limit options.
This patch also includes preliminary update of pg_dumpall for roles.
Petr Jelinek, with review by Bruce Momjian and Tom Lane.
2005-07-31 17:19:22 +00:00
Bruce Momjian
1521aef1db SUNOS4_CC -> SUNOS_CC. 2005-07-30 03:07:42 +00:00
Tom Lane
eb5949d190 Arrange for the postmaster (and standalone backends, initdb, etc) to
chdir into PGDATA and subsequently use relative paths instead of absolute
paths to access all files under PGDATA.  This seems to give a small
performance improvement, and it should make the system more robust
against naive DBAs doing things like moving a database directory that
has a live postmaster in it.  Per recent discussion.
2005-07-04 04:51:52 +00:00
Tom Lane
b95ae32b41 Avoid WAL-logging individual tuple insertions during CREATE TABLE AS
(a/k/a SELECT INTO).  Instead, flush and fsync the whole relation before
committing.  We do still need the WAL log when PITR is active, however.
Simon Riggs and Tom Lane.
2005-06-20 18:37:02 +00:00
Tom Lane
3f749924f8 Simplify uses of readdir() by creating a function ReadDir() that
includes error checking and an appropriate ereport(ERROR) message.
This gets rid of rather tedious and error-prone manipulation of errno,
as well as a Windows-specific bug workaround, at more than a dozen
call sites.  After an idea in a recent patch by Heikki Linnakangas.
2005-06-19 21:34:03 +00:00
Tom Lane
d0a89683a3 Two-phase commit. Original patch by Heikki Linnakangas, with additional
hacking by Alvaro Herrera and Tom Lane.
2005-06-17 22:32:51 +00:00
Tom Lane
8563ccae2c Simplify shared-memory lock data structures as per recent discussion:
it is sufficient to track whether a backend holds a lock or not, and
store information about transaction vs. session locks only in the
inside-the-backend LocalLockTable.  Since there can now be but one
PROCLOCK per lock per backend, LockCountMyLocks() is no longer needed,
thus eliminating some O(N^2) behavior when a backend holds many locks.
Also simplify the LockAcquire/LockRelease API by passing just a
'sessionLock' boolean instead of a transaction ID.  The previous API
was designed with the idea that per-transaction lock holding would be
important for subtransactions, but now that we have subtransactions we
know that this is unwanted.  While at it, add an 'isTempObject' parameter
to LockAcquire to indicate whether the lock is being taken on a temp
table.  This is not used just yet, but will be needed shortly for
two-phase commit.
2005-06-14 22:15:33 +00:00
Tom Lane
a2fb7b8a1f Adjust lo_open() so that specifying INV_READ without INV_WRITE creates
a descriptor that uses the current transaction snapshot, rather than
SnapshotNow as it did before (and still does if INV_WRITE is set).
This means pg_dump will now dump a consistent snapshot of large object
contents, as it never could do before.  Also, add a lo_create() function
that is similar to lo_creat() but allows the desired OID of the large
object to be specified.  This will simplify pg_restore considerably
(but I'll fix that in a separate commit).
2005-06-13 02:26:53 +00:00
Tom Lane
ee7ac7b11e Modify XLogInsert API to make callers specify whether pages to be backed
up have the standard layout with unused space between pd_lower and pd_upper.
When this is set, XLogInsert will omit the unused space without bothering
to scan it to see if it's zero.  That saves time in XLogInsert, and also
allows reversion of my earlier patch to make PageRepairFragmentation et al
explicitly re-zero freed space.  Per suggestion by Heikki Linnakangas.
2005-06-06 20:22:58 +00:00
Tom Lane
4c8495a1f2 Remove the mostly-stubbed-out-anyway support routines for WAL UNDO.
That code is never going to be used in the foreseeable future, and
where it's more than a stub it's making the redo routines harder to
read.
2005-06-06 17:01:25 +00:00
Tom Lane
21fda22ec4 Change CRCs in WAL records from 64bit to 32bit for performance reasons.
Instead of a separate CRC on each backup block, include backup blocks
in their parent WAL record's CRC; this is important to ensure that the
backup block really goes with the WAL record, ie there was not a page
tear right at the start of the backup block.  Implement a simple form
of compression of backup blocks: drop any run of zeroes starting at
pd_lower, so as not to store the unused 'hole' that commonly exists in
PG heap and index pages.  Tweak PageRepairFragmentation and related
routines to ensure they keep the unused space zeroed, so that the above
compression method remains effective.  All per recent discussions.
2005-06-02 05:55:29 +00:00
Tom Lane
140b078d2a Improve LockAcquire API per my recent proposal. All error conditions
are now reported via elog, eliminating the need to test the result code
at most call sites.  Make it possible for the caller to distinguish a
freshly acquired lock from one already held in the current transaction.
Use that capability to avoid redundant AcceptInvalidationMessages() calls
in LockRelation().
2005-05-29 22:45:02 +00:00
Tom Lane
e92a88272e Modify hash_search() API to prevent future occurrences of the error
spotted by Qingqing Zhou.  The HASH_ENTER action now automatically
fails with elog(ERROR) on out-of-memory --- which incidentally lets
us eliminate duplicate error checks in quite a bunch of places.  If
you really need the old return-NULL-on-out-of-memory behavior, you
can ask for HASH_ENTER_NULL.  But there is now an Assert in that path
checking that you aren't hoping to get that behavior in a palloc-based
hash table.
Along the way, remove the old HASH_FIND_SAVE/HASH_REMOVE_SAVED actions,
which were not being used anywhere anymore, and were surely too ugly
and unsafe to want to see revived again.
2005-05-29 04:23:07 +00:00
Bruce Momjian
6dc7760ac3 Add support for wal_fsync_writethrough for Darwin, and restructure the
code to better handle writethrough.

Chris Campbell
2005-05-20 14:53:26 +00:00
Tom Lane
f519d04a43 Update comment that I missed the first time around. 2005-05-19 23:57:11 +00:00
Tom Lane
191b13aaca Factor out lock cleanup code that is needed in several places in lock.c.
Also, remove the rather useless return value of LockReleaseAll.  Change
response to detection of corruption in the shared lock tables to PANIC,
since that is the only way of cleaning up fully.
Originally an idea of Heikki Linnakangas, variously hacked on by
Alvaro Herrera and Tom Lane.
2005-05-19 23:30:18 +00:00
Tom Lane
ee3b71f6bc Split the shared-memory array of PGPROC pointers out of the sinval
communication structure, and make it its own module with its own lock.
This should reduce contention at least a little, and it definitely makes
the code seem cleaner.  Per my recent proposal.
2005-05-19 21:35:48 +00:00
Neil Conway
f38e413b20 Code cleanup: in C89, there is no point casting the first argument to
memset() or MemSet() to a char *. For one, memset()'s first argument is
a void *, and further void * can be implicitly coerced to/from any other
pointer type.
2005-05-11 01:26:02 +00:00
Tom Lane
93b2477278 Use the standard lock manager to establish priority order when there
is contention for a tuple-level lock.  This solves the problem of a
would-be exclusive locker being starved out by an indefinite succession
of share-lockers.  Per recent discussion with Alvaro.
2005-04-30 19:03:33 +00:00
Tom Lane
3a694bb0a1 Restructure LOCKTAG as per discussions of a couple months ago.
Essentially, we shoehorn in a lockable-object-type field by taking
a byte away from the lockmethodid, which can surely fit in one byte
instead of two.  This allows less artificial definitions of all the
other fields of LOCKTAG; we can get rid of the special pg_xactlock
pseudo-relation, and also support locks on individual tuples and
general database objects (including shared objects).  None of those
possibilities are actually exploited just yet, however.

I removed pg_xactlock from pg_class, but did not force initdb for
that change.  At this point, relkind 's' (SPECIAL) is unused and
could be removed entirely.
2005-04-29 22:28:24 +00:00
Tom Lane
bedb78d386 Implement sharable row-level locks, and use them for foreign key references
to eliminate unnecessary deadlocks.  This commit adds SELECT ... FOR SHARE
paralleling SELECT ... FOR UPDATE.  The implementation uses a new SLRU
data structure (managed much like pg_subtrans) to represent multiple-
transaction-ID sets.  When more than one transaction is holding a shared
lock on a particular row, we create a MultiXactId representing that set
of transactions and store its ID in the row's XMAX.  This scheme allows
an effectively unlimited number of row locks, just as we did before,
while not costing any extra overhead except when a shared lock actually
has to be shared.   Still TODO: use the regular lock manager to control
the grant order when multiple backends are waiting for a row lock.

Alvaro Herrera and Tom Lane.
2005-04-28 21:47:18 +00:00
Bruce Momjian
3b0a5e50d7 Update VACUUM VERBOSE FSM message, per Tom. 2005-04-24 03:51:49 +00:00
Bruce Momjian
714d5a4c37 Update VACUUM VERBOSE update, per Alvaro. 2005-04-23 21:16:34 +00:00
Bruce Momjian
9ba6587f8b Update working of VACUUM VERBOSE. 2005-04-23 21:10:20 +00:00
Bruce Momjian
52e08c35f7 Make VACUUM VERBOSE FSM output all output in a single INFO output
statement.
2005-04-23 20:56:01 +00:00
Bruce Momjian
e947e1153a Modify output of VACUUM VERBOSE to be clearer. 2005-04-23 15:20:39 +00:00
Neil Conway
ea208aca00 Remove an unused variable "waitingForSignal". From Qingqing Zhou. 2005-04-15 04:18:10 +00:00
Tom Lane
162bd08b3f Completion of project to use fixed OIDs for all system catalogs and
indexes.  Replace all heap_openr and index_openr calls by heap_open
and index_open.  Remove runtime lookups of catalog OID numbers in
various places.  Remove relcache's support for looking up system
catalogs by name.  Bulky but mostly very boring patch ...
2005-04-14 20:03:27 +00:00
Tom Lane
2193a856a2 Simplify initdb-time assignment of OIDs as I proposed yesterday, and
avoid encroaching on the 'user' range of OIDs by allowing automatic
OID assignment to use values below 16k until we reach normal operation.

initdb not forced since this doesn't make any incompatible change;
however a lot of stuff will have different OIDs after your next initdb.
2005-04-13 18:54:57 +00:00
Tom Lane
badb83f9ec If we're going to have a non-panic check for held_lwlocks[] overrun,
it must occur *before* we get into the critical state of holding a
lock we have no place to record.  Per discussion with Qingqing Zhou.
2005-04-08 14:18:35 +00:00
Tom Lane
e794dfa511 Use an always-there test, not an Assert, to check for overrun of
the held_lwlocks[] array.  Per Qingqing Zhou.
2005-04-08 03:43:54 +00:00
Neil Conway
5b1c607abe Remove an unused variable `ShmemBootstrap', and remove an obsolete
comment. Patch from Alvaro.
2005-04-04 04:34:41 +00:00
Tom Lane
94e03330cb Create a routine PageIndexMultiDelete() that replaces a loop around
PageIndexTupleDelete() with a single pass of compactification ---
logic mostly lifted from PageRepairFragmentation.  I noticed while
profiling that a VACUUM that's cleaning up a whole lot of deleted
tuples would spend as much as a third of its CPU time in
PageIndexTupleDelete; not too surprising considering the loop method
was roughly O(N^2) in the number of tuples involved.
2005-03-22 06:17:03 +00:00
Tom Lane
354049c709 Remove unnecessary calls of FlushRelationBuffers: there is no need
to write out data that we are about to tell the filesystem to drop.
smgr_internal_unlink already had a DropRelFileNodeBuffers call to
get rid of dead buffers without a write after it's no longer possible
to roll back the deleting transaction.  Adding a similar call in
smgrtruncate simplifies callers and makes the overall division of
labor clearer.  This patch removes the former behavior that VACUUM
would write all dirty buffers of a relation unconditionally.
2005-03-20 22:00:54 +00:00
Tom Lane
91728fa26c Add temp_buffers GUC variable to allow users to determine the size
of the local buffer arena for temporary table access.
2005-03-19 23:27:11 +00:00
Tom Lane
d65522aeb6 Upgrade localbuf.c to use a hash table instead of linear search to
find already-allocated local buffers.  This is the last obstacle
in the way of setting NLocBuffer to something reasonably large.
2005-03-19 17:39:43 +00:00
Tom Lane
88164799ce Need to reset local buffer pin counts, not only shared buffer pins,
before we attempt any file deletions in ShutdownPostgres.  Per Tatsuo.
2005-03-18 16:16:09 +00:00
Tom Lane
cef01c3355 Avoid infinite loop in InvalidateBuffer if we ourselves are holding
a pin on the victim buffer.
2005-03-18 05:25:23 +00:00
Bruce Momjian
2c4dea126a Issue free space notices to both the user and the server log file. 2005-03-14 20:15:09 +00:00
Bruce Momjian
45905425a0 Add warning about the need to increase "max_fsm_relations" and
"max_fsm_relations" for vacuums.  Also improve VACUUM VERBOSE final
message text.

Ron Mayer
2005-03-12 05:21:52 +00:00
Neil Conway
c129c16492 Slight refactoring and optimization of some code in WaitOnLock(). 2005-03-11 03:52:06 +00:00
Tom Lane
5d5087363d Replace the BufMgrLock with separate locks on the lookup hashtable and
the freelist, plus per-buffer spinlocks that protect access to individual
shared buffer headers.  This requires abandoning a global freelist (since
the freelist is a global contention point), which shoots down ARC and 2Q
as well as plain LRU management.  Adopt a clock sweep algorithm instead.
Preliminary results show substantial improvement in multi-backend situations.
2005-03-04 20:21:07 +00:00
Tom Lane
a2ad04f4b0 Release proclock immediately in RemoveFromWaitQueue() if it represents
no held locks.  This maintains the invariant that proclocks are present
only for procs that are holding or awaiting a lock; when this is not
true, LockRelease will fail.  Per report from Stephen Clouse.
2005-03-01 21:14:59 +00:00
Bruce Momjian
0542b1e2fe Use _() macro consistently rather than gettext(). Add translation
macros around strings that were missing them.
2005-02-22 04:43:23 +00:00
Neil Conway
11635c3f6f Refactor some duplicated code in lock.c: create UnGrantLock(), move code
from LockRelease() and LockReleaseAll() into it. From Heikki Linnakangas.
2005-02-04 02:04:53 +00:00
Tom Lane
cc4f58f4cd Ensure that all details of the ARC algorithm are hidden within freelist.c.
This refactoring does not change any algorithms or data structures, just
remove visibility of the ARC datastructures from other source files.
2005-02-03 23:29:19 +00:00
Neil Conway
a885ecd6ef Change heap_modifytuple() to require a TupleDesc rather than a
Relation. Patch from Alvaro Herrera, minor editorializing by
Neil Conway.
2005-01-27 23:24:11 +00:00
Tom Lane
0ce4d56924 Phase 1 of fix for 'SMgrRelation hashtable corrupted' problem. This
is the minimum required fix.  I want to look next at taking advantage of
it by simplifying the message semantics in the shared inval message queue,
but that part can be held over for 8.1 if it turns out too ugly.
2005-01-10 20:02:24 +00:00
Tom Lane
c9d8edc906 Repair bufmgr deadlock problem reported by Michael Wildpaner. Must take
share lock on a buffer being written out before releasing BufMgrLock in
the BufferAlloc code path; if we do it later we might block on someone
who's re-pinned the buffer.  I believe this is only an issue for BufferAlloc
and not the other places that call FlushBuffer.  BufferSync must continue
to do it the old way since it may well be trying to write buffers that
other backends have pinned; but it should not be holding any conflicting
locks.  FlushRelationBuffers is okay since it's got exclusive lock at the
relation level.
2005-01-03 18:49:41 +00:00
PostgreSQL Daemon
2ff501590b Tag appropriate files for rc3
Also performed an initial run through of upgrading our Copyright date to
extend to 2005 ... first run here was very simple ... change everything
where: grep 1996-2004 && the word 'Copyright' ... scanned through the
generated list with 'less' first, and after, to make sure that I only
picked up the right entries ...
2004-12-31 22:04:05 +00:00
Tom Lane
96ecf9d5aa Support Sun's compiler on SunOS4 (a/k/a Solaris 9). Per ayan@ayan.net 2004-12-29 23:47:40 +00:00
Tom Lane
eee5abce46 Refactor EXEC_BACKEND code so that postmaster child processes reattach
to shared memory as soon as possible, ie, right after read_backend_variables.
The effective difference from the original code is that this happens
before instead of after read_nondefault_variables(), which loads GUC
information and is apparently capable of expanding the backend's memory
allocation more than you'd think it should.  This should fix the
failure-to-attach-to-shared-memory reports we've been seeing on Windows.
Also clean up a few bits of unnecessarily grotty EXEC_BACKEND code.
2004-12-29 21:36:09 +00:00
Bruce Momjian
08690d0688 Allow NetBSD, m64k to compile the ASM spinlock code.
R?mi Zara
2004-12-18 22:12:52 +00:00
Neil Conway
4acc97d7e4 Assert that BufferIsPinned() in IncrBufferRefCount(), rather than using
a home-brewed combination of assertions that boiled down to the same
thing.
2004-11-24 02:56:17 +00:00
Tom Lane
8ecbc46bdb Reduce the default size of the local lock hash table. There's usually
no need for it to be nearly as big as the global hash table, and since
it's not in shared memory it can grow if it does need to be bigger.
By reducing the size, we speed up hash_seq_search(), which saves a
significant fraction of subtransaction entry/exit overhead.
2004-11-20 20:16:54 +00:00
Peter Eisentraut
0ed3c7665e Small message clarifications 2004-11-05 17:11:34 +00:00
Neil Conway
8ec05b28b7 Modify hash_create() to elog(ERROR) if an error occurs, rather than
returning a NULL pointer (some callers remembered to check the return
value, but some did not -- it is safer to just bail out).

Also, cleanup pgstat.c to use elog(ERROR) rather than elog(LOG) followed
by exit().
2004-10-25 00:46:43 +00:00
Tom Lane
4347cc2392 Allow background writing to be shut down by setting limit values to zero.
This does not disable the bgwriter process: it still has to wake up often
enough to collect fsync requests from backends in a timely fashion.  But
it responds to the recent gripe about not being able to prevent the disk
from being spun up constantly.
2004-10-17 22:01:51 +00:00
Tom Lane
fdd13f1568 Give the ResourceOwner mechanism full responsibility for releasing buffer
pins at end of transaction, and reduce AtEOXact_Buffers to an Assert
cross-check that this was done correctly.  When not USE_ASSERT_CHECKING,
AtEOXact_Buffers is a complete no-op.  This gets rid of an O(NBuffers)
bottleneck during transaction commit/abort, which recent testing has shown
becomes significant above a few tens of thousands of shared buffers.
2004-10-16 18:57:26 +00:00
Tom Lane
1c2de47746 Remove BufferLocks[] array in favor of a single pointer to the buffer
(if any) currently waited for by LockBufferForCleanup(), which is all
that we were using it for anymore.  Saves some space and eliminates
proportional-to-NBuffers slowdown in UnlockBuffers().
2004-10-16 18:05:07 +00:00
Tom Lane
9ffc8ed58b Repair possible failure to update hint bits back to disk, per
http://archives.postgresql.org/pgsql-hackers/2004-10/msg00464.php.
This fix is intended to be permanent: it moves the responsibility for
calling SetBufferCommitInfoNeedsSave() into the tqual.c routines,
eliminating the requirement for callers to test whether t_infomask changed.
Also, tighten validity checking on buffer IDs in bufmgr.c --- several
routines were paranoid about out-of-range shared buffer numbers but not
about out-of-range local ones, which seems a tad pointless.
2004-10-15 22:40:29 +00:00
Neil Conway
0683a47556 Allow the spinlock test to be compiled successfully in a vpath build. 2004-10-07 00:08:04 +00:00
Tom Lane
0fb3152ea9 Minor adjustments to improve the accuracy of our computation of required
shared memory size.
2004-09-29 15:15:56 +00:00
Tom Lane
3a246cc285 Arrange to preallocate all required space for the buffer and FSM hash
tables in shared memory.  This ensures that overflow of the lock table
creates no long-lasting problems.  Per discussion with Merlin Moncure.
2004-09-28 20:46:37 +00:00
Tom Lane
86fff990b2 RecentXmin is too recent to use as the cutoff point for accessing
pg_subtrans --- what we need is the oldest xmin of any snapshot in use
in the current top transaction.  Introduce a new variable TransactionXmin
to play this role.  Fixes intermittent regression failure reported by
Neil Conway.
2004-09-16 18:35:23 +00:00
Tom Lane
8f9f198603 Restructure subtransaction handling to reduce resource consumption,
as per recent discussions.  Invent SubTransactionIds that are managed like
CommandIds (ie, counter is reset at start of each top transaction), and
use these instead of TransactionIds to keep track of subtransaction status
in those modules that need it.  This means that a subtransaction does not
need an XID unless it actually inserts/modifies rows in the database.
Accordingly, don't assign it an XID nor take a lock on the XID until it
tries to do that.  This saves a lot of overhead for subtransactions that
are only used for error recovery (eg plpgsql exceptions).  Also, arrange
to release a subtransaction's XID lock as soon as the subtransaction
exits, in both the commit and abort cases.  This avoids holding many
unique locks after a long series of subtransactions.  The price is some
additional overhead in XactLockTableWait, but that seems acceptable.
Finally, restructure the state machine in xact.c to have a more orthogonal
set of states for subtransactions.
2004-09-16 16:58:44 +00:00
Tom Lane
abc98dcc15 When LockAcquire fails at the stage of creating a proclock object, be
sure to clean up the already-created lock object, if it has no other
references.  Avoids possibly-permanent leak of shared memory.
2004-09-12 18:30:50 +00:00
Tom Lane
083258e535 Fix a number of places where brittle data structures or overly strong
Asserts would lead to a server core dump if an error occurred while
trying to abort a failed subtransaction (thereby leading to re-execution
of whatever parts of AbortSubTransaction had already run).  This of course
does not prevent such an error from creating an infinite loop, but at
least we don't make the situation worse.  Responds to an open item on
the subtransactions to-do list.
2004-09-06 23:33:48 +00:00
Tom Lane
23645f0582 Fix incorrect ordering of smgr cleanup relative to buffer pin cleanup
during transaction abort.  Add a regression test case to catch related
mistakes in future.  Alvaro Herrera and Tom Lane.
2004-09-06 17:56:33 +00:00
Tom Lane
eb917c1a21 I can't see any good reason for DropRelFileNodeBuffers to be issuing
FATAL when it detects a nonzero reference count.  Reduce to ERROR.
2004-09-06 17:31:32 +00:00
Tom Lane
a421b4e850 FlushRelationBuffers was also being a bit cavalier about whether the
relation is already opened by smgr.
2004-08-31 16:13:06 +00:00
Tom Lane
332ee2dc41 Improve spinlock selftest to make it able to detect misdeclaration of
the slock_t datatype (ie, declared type smaller than what the hardware
TAS instruction needs).
2004-08-30 23:47:20 +00:00
Tom Lane
303e46ea93 Tweak md.c logic to cope with the situation where WAL replay tries to
write into a high-numbered segment of a relation that was later deleted.
We need to temporarily recreate missing segment files, instead of
failing.
2004-08-30 03:52:43 +00:00
Bruce Momjian
15d3f9f6b7 Another pgindent run with lib typedefs added. 2004-08-30 02:54:42 +00:00
Bruce Momjian
b6b71b85bc Pgindent run for 8.0. 2004-08-29 05:07:03 +00:00
Bruce Momjian
da9a8649d8 Update copyright to 2004. 2004-08-29 04:13:13 +00:00
Tom Lane
1785acebf2 Introduce local hash table for lock state, as per recent proposal.
PROCLOCK structs in shared memory now have only a bitmask for held
locks, rather than counts (making them 40 bytes smaller, which is a
good thing).  Multiple locks within a transaction are counted in the
local hash table instead, and we have provision for tracking which
ResourceOwner each count belongs to.  Solves recently reported problem
with memory leakage within long transactions.
2004-08-27 17:07:42 +00:00
Tom Lane
337b513e07 Fix user locks. Broken some time ago for all platforms by Windows-related
changes.
2004-08-26 17:23:30 +00:00
Tom Lane
4dbb880d3c Rearrange pg_subtrans handling as per recent discussion. pg_subtrans
updates are no longer WAL-logged nor even fsync'd; we do not need to,
since after a crash no old pg_subtrans data is needed again.  We truncate
pg_subtrans to RecentGlobalXmin at each checkpoint.  slru.c's API is
refactored a little bit to separate out the necessary decisions.
2004-08-23 23:22:45 +00:00
Tom Lane
f009c316ba Tweak code so that pg_subtrans is never consulted for XIDs older than
RecentXmin (== MyProc->xmin).  This ensures that it will be safe to
truncate pg_subtrans at RecentGlobalXmin, which should largely eliminate
any fear of bloat.  Along the way, eliminate SubTransXidsHaveCommonAncestor,
which isn't really needed and could not give a trustworthy result anyway
under the lookback restriction.
In an unrelated but nearby change, #ifdef out GetUndoRecPtr, which has
been dead code since 2001 and seems unlikely to ever be resurrected.
2004-08-22 02:41:58 +00:00
Tom Lane
1a3de15a3a Dept. of further reflection: I looked around to see if any other callers
of XLogInsert had the same sort of checkpoint interlock problem as
RecordTransactionCommit, and indeed I found some.  Btree index build
and ALTER TABLE SET TABLESPACE write data outside the friendly confines
of the buffer manager, and therefore they have to take their own
responsibility for checkpoint interlock.  The easiest solution seems to
be to force smgrimmedsync at the end of the index build or table copy,
even when the operation is being WAL-logged.  This is sufficient since
the new index or table will be of interest to no one if we don't get
as far as committing the current transaction.
2004-08-15 23:44:46 +00:00
Tom Lane
057ea3471f Xmin calculations should consider only top transaction IDs, and
therefore starting with GetCurrentTransactionId is wrong.  Fixes
miscomputation of RecentGlobalXmin leading to bizarre behavior
reported by Gavin Sherry.
2004-08-15 17:03:36 +00:00
Tom Lane
efcaf1e868 Some mop-up work for savepoints (nested transactions). Store a small
number of active subtransaction XIDs in each backend's PGPROC entry,
and use this to avoid expensive probes into pg_subtrans during
TransactionIdIsInProgress.  Extend EOXactCallback API to allow add-on
modules to get control at subxact start/end.  (This is deliberately
not compatible with the former API, since any uses of that API probably
need manual review anyway.)  Add basic reference documentation for
SAVEPOINT and related commands.  Minor other cleanups to check off some
of the open issues for subtransactions.
Alvaro Herrera and Tom Lane.
2004-08-01 17:32:22 +00:00
Tom Lane
a393fbf937 Restructure error handling as recently discussed. It is now really
possible to trap an error inside a function rather than letting it
propagate out to PostgresMain.  You still have to use AbortCurrentTransaction
to clean up, but at least the error handling itself will cooperate.
2004-07-31 00:45:57 +00:00
Tom Lane
1bf3d61504 Fix subtransaction behavior for large objects, temp namespace, files,
password/group files.  Also allow read-only subtransactions of a read-write
parent, but not vice versa.  These are the reasonably noncontroversial
parts of Alvaro's recent mop-up patch, plus further work on large objects
to minimize use of the TopTransactionResourceOwner.
2004-07-28 14:23:31 +00:00
Tom Lane
cc813fc2b8 Replace nested-BEGIN syntax for subtransactions with spec-compliant
SAVEPOINT/RELEASE/ROLLBACK-TO syntax.  (Alvaro)
Cause COMMIT of a failed transaction to report ROLLBACK instead of
COMMIT in its command tag.  (Tom)
Fix a few loose ends in the nested-transactions stuff.
2004-07-27 05:11:48 +00:00
Tom Lane
2042b3428d Invent WAL timelines, as per recent discussion, to make point-in-time
recovery more manageable.  Also, undo recent change to add FILE_HEADER
and WASTED_SPACE records to XLOG; instead make the XLOG page header
variable-size with extra fields in the first page of an XLOG file.
This should fix the boundary-case bugs observed by Mark Kirkwood.
initdb forced due to change of XLOG representation.
2004-07-21 22:31:26 +00:00
Tom Lane
fe548629c5 Invent ResourceOwner mechanism as per my recent proposal, and use it to
keep track of portal-related resources separately from transaction-related
resources.  This allows cursors to work in a somewhat sane fashion with
nested transactions.  For now, cursor behavior is non-subtransactional,
that is a cursor's state does not roll back if you abort a subtransaction
that fetched from the cursor.  We might want to change that later.
2004-07-17 03:32:14 +00:00
Tom Lane
8801110b20 Move TablespaceCreateDbspace() call into smgrcreate(), which is where it
probably should have been to begin with; this is to cover cases like
needing to recreate the per-db directory during WAL replay.
Also, fix heap_create to force pg_class.reltablespace to be zero instead
of the database's default tablespace; this makes the world safe for
CREATE DATABASE to handle all tables in the default tablespace alike,
as per previous discussion.  And force pg_class.reltablespace to zero
when creating a relation without physical storage (eg, a view); this
avoids possibly having dangling references in this column after a
subsequent DROP TABLESPACE.
2004-07-11 19:52:52 +00:00
Tom Lane
77a436ba55 Fix seriously nasty memory leak in new TransactionIdIsInProgress code. 2004-07-01 03:13:05 +00:00
Tom Lane
573a71a5da Nested transactions. There is still much left to do, especially on the
performance front, but with feature freeze upon us I think it's time to
drive a stake in the ground and say that this will be in 7.5.

Alvaro Herrera, with some help from Tom Lane.
2004-07-01 00:52:04 +00:00
Tom Lane
c1d9dec3e3 Looks like s_lock_test needs <time.h> on some platforms. 2004-06-19 20:31:55 +00:00
Tom Lane
1232878159 s_lock_test requires libpgport to build now. 2004-06-19 19:43:11 +00:00
Tom Lane
2467394ee1 Tablespaces. Alternate database locations are dead, long live tablespaces.
There are various things left to do: contrib dbsize and oid2name modules
need work, and so does the documentation.  Also someone should think about
COMMENT ON TABLESPACE and maybe RENAME TABLESPACE.  Also initlocation is
dead, it just doesn't know it yet.

Gavin Sherry and Tom Lane.
2004-06-18 06:14:31 +00:00
Tom Lane
bbf0ebadaf StrategyDirtyBufferList wasn't being careful to honor max_buffers limit.
Bug is only latent given that sole caller is passing NBuffers, but it
could bite someone in the rear someday.
2004-06-11 17:20:39 +00:00
Tom Lane
e6cba71503 Add some code to Assert that when we release pin on a buffer, we are
not holding the buffer's cntx_lock or io_in_progress_lock.  A recent
report from Litao Wu makes me wonder whether it is ever possible for
us to drop a buffer and forget to release its cntx_lock.  The Assert
does not fire in the regression tests, but that proves little ...
2004-06-11 16:43:24 +00:00
Bruce Momjian
a1ccbb9019 Previous code cleanup was for bufpage.c, not bufmgr.c.
This cleanup just cleans up a comment.
2004-06-09 13:11:34 +00:00
Bruce Momjian
ce04221a1e Stylistic changes in bufmgr.c
Basically replaces (*a).b with a->b as it is everywhere else in
	Postgres.

Manfred Koizar
2004-06-08 14:00:35 +00:00
Tom Lane
c3a153afed Tweak palloc/repalloc to allow zero bytes to be requested, as per recent
proposal.  Eliminate several dozen now-unnecessary hacks to avoid palloc(0).
(It's likely there are more that I didn't find.)
2004-06-05 19:48:09 +00:00
Tom Lane
921d749bd4 Adjust our timezone library to use pg_time_t (typedef'd as int64) in
place of time_t, as per prior discussion.  The behavior does not change
on machines without a 64-bit-int type, but on machines with one, which
is most, we are rid of the bizarre boundary behavior at the edges of
the 32-bit-time_t range (1901 and 2038).  The system will now treat
times over the full supported timestamp range as being in your local
time zone.  It may seem a little bizarre to consider that times in
4000 BC are PST or EST, but this is surely at least as reasonable as
propagating Gregorian calendar rules back that far.

I did not modify the format of the zic timezone database files, which
means that for the moment the system will not know about daylight-savings
periods outside the range 1901-2038.  Given the way the files are set up,
it's not a simple decision like 'widen to 64 bits'; we have to actually
think about the range of years that need to be supported.  We should
probably inquire what the plans of the upstream zic people are before
making any decisions of our own.
2004-06-03 02:08:07 +00:00
Bruce Momjian
e8d9d68ca4 Per previous discussions, here are two functions to send INT and TERM
(cancel and terminate) signals to other backends.   They permit only INT
and TERM, and permits sending only to postgresql backends.

Magnus Hagander
2004-06-02 21:29:29 +00:00
Tom Lane
2095206de1 Adjust btree index build to not use shared buffers, thereby avoiding the
locking conflict against concurrent CHECKPOINT that was discussed a few
weeks ago.  Also, if not using WAL archiving (which is always true ATM
but won't be if PITR makes it into this release), there's no need to
WAL-log the index build process; it's sufficient to force-fsync the
completed index before commit.  This seems to gain about a factor of 2
in my tests, which is consistent with writing half as much data.  I did
not try it with WAL on a separate drive though --- probably the gain would
be a lot less in that scenario.
2004-06-02 17:28:18 +00:00
Tom Lane
91d20ff7aa Additional mop-up for sync-to-fsync changes: avoid issuing fsyncs for
temp tables, and avoid WAL-logging truncations of temp tables.  Do issue
fsync on truncated files (not sure this is necessary but it seems like
a good idea).
2004-05-31 20:31:33 +00:00
Tom Lane
e674707968 Minor code rationalization: FlushRelationBuffers just returns void,
rather than an error code, and does elog(ERROR) not elog(WARNING)
when it detects a problem.  All callers were simply elog(ERROR)'ing on
failure return anyway, and I find it hard to envision a caller that would
not, so we may as well simplify the callers and produce the more useful
error message directly.
2004-05-31 19:24:05 +00:00
Tom Lane
9b178555fc Per previous discussions, get rid of use of sync(2) in favor of
explicitly fsync'ing every (non-temp) file we have written since the
last checkpoint.  In the vast majority of cases, the burden of the
fsyncs should fall on the bgwriter process not on backends.  (To this
end, we assume that an fsync issued by the bgwriter will force out
blocks written to the same file by other processes using other file
descriptors.  Anyone have a problem with that?)  This makes the world
safe for WIN32, which ain't even got sync(2), and really makes the world
safe for Unixen as well, because sync(2) never had the semantics we need:
it offers no way to wait for the requested I/O to finish.

Along the way, fix a bug I recently introduced in xlog recovery:
file truncation replay failed to clear bufmgr buffers for the dropped
blocks, which could result in 'PANIC:  heap_delete_redo: no block'
later on in xlog replay.
2004-05-31 03:48:10 +00:00
Tom Lane
c6719a2784 Implement new PostmasterIsAlive() check for WIN32, per Claudio Natoli.
In passing, align a few error messages with the style guide.
2004-05-30 03:50:15 +00:00
Tom Lane
076a055acf Separate out bgwriter code into a logically separate module, rather
than being random pieces of other files.  Give bgwriter responsibility
for all checkpoint activity (other than a post-recovery checkpoint);
so this child process absorbs the functionality of the former transient
checkpoint and shutdown subprocesses.  While at it, create an actual
include file for postmaster.c, which for some reason never had its own
file before.
2004-05-29 22:48:23 +00:00
Tom Lane
1a321f26d8 Code review for EXEC_BACKEND changes. Reduce the number of #ifdefs by
about a third, make it work on non-Windows platforms again.  (But perhaps
I broke the WIN32 code, since I have no way to test that.)  Fold all the
paths that fork postmaster child processes to go through the single
routine SubPostmasterMain, which takes care of resurrecting the state that
would normally be inherited from the postmaster (including GUC variables).
Clean up some places where there's no particularly good reason for the
EXEC and non-EXEC cases to work differently.  Take care of one or two
FIXMEs that remained in the code.
2004-05-28 05:13:32 +00:00
Tom Lane
ebfc56d3fb Handle impending sinval queue overflow by means of a separate signal
(SIGUSR1, which we have not been using recently) instead of piggybacking
on SIGUSR2-driven NOTIFY processing.  This has several good results:
the processing needed to drain the sinval queue is a lot less than the
processing needed to answer a NOTIFY; there's less contention since we
don't have a bunch of backends all trying to acquire exclusive lock on
pg_listener; backends that are sitting inside a transaction block can
still drain the queue, whereas NOTIFY processing can't run if there's
an open transaction block.  (This last is a fairly serious issue that
I don't think we ever recognized before --- with clients like JDBC that
tend to sit with open transaction blocks, the sinval queue draining
mechanism never really worked as intended, probably resulting in a lot
of useless cache-reset overhead.)  This is the last of several proposed
changes in response to Philip Warner's recent report of sinval-induced
performance problems.
2004-05-23 03:50:45 +00:00
Tom Lane
4af3421161 Get rid of rd_nblocks field in relcache entries. Turns out this was
costing us lots more to maintain than it was worth.  On shared tables
it was of exactly zero benefit because we couldn't trust it to be
up to date.  On temp tables it sometimes saved an lseek, but not often
enough to be worth getting excited about.  And the real problem was that
we forced an lseek on every relcache flush in order to update the field.
So all in all it seems best to lose the complexity.
2004-05-08 19:09:25 +00:00
Neil Conway
0370951347 Tiny assorted fixes: correct a typo in a comment in vacuumlazy.c, remove
some unused #include directives from bufmgr.c, and clarify comments in
bufmgr.h and buf.h
2004-04-25 23:50:58 +00:00
Neil Conway
139abc2896 Make LocalRefCount and PrivateRefCount arrays of int32, rather than long.
This saves a small amount of per-backend memory for LP64 machines.
2004-04-22 07:21:55 +00:00
Tom Lane
95a03e9cdf Another round of code cleanup on bufmgr. Use BM_VALID flag to keep track
of whether we have successfully read data into a buffer; this makes the
error behavior a bit more transparent (IMHO anyway), and also makes it
work correctly for local buffers which don't use Start/TerminateBufferIO.
Collapse three separate functions for writing a shared buffer into one.
This overlaps a bit with cleanups that Neil proposed awhile back, but
seems not to have committed yet.
2004-04-21 18:06:30 +00:00
Tom Lane
011c3e62e7 Code review for ARC patch. Eliminate static variables, improve handling
of VACUUM cases so that VACUUM requests don't affect the ARC state at all,
avoid corner case where BufferSync would uselessly rewrite a buffer that
no longer contains the page that was to be flushed.  Make some minor
other cleanups in and around the bufmgr as well, such as moving PinBuffer
and UnpinBuffer into bufmgr.c where they really belong.
2004-04-19 23:27:17 +00:00
Bruce Momjian
31338352bd * Most changes are to fix warnings issued when compiling win32
* removed a few redundant defines
* get_user_name safe under win32
* rationalized pipe read EOF for win32 (UPDATED PATCH USED)
* changed all backend instances of sleep() to pg_usleep

    - except for the SLEEP_ON_ASSERT in assert.c, as it would exceed a
32-bit long [Note to patcher: If a SLEEP_ON_ASSERT of 2000 seconds is
acceptable, please replace with pg_usleep(2000000000L)]

I added a comment to that part of the code:

    /*
     *  It would be nice to use pg_usleep() here, but only does 2000 sec
     *  or 33 minutes, which seems too short.
     */
    sleep(1000000);

Claudio Natoli
2004-04-19 17:42:59 +00:00
Bruce Momjian
48b2802eee When changing select() calls for delays into pg_usleep(), two comments
in s_lock.c were not updated, and still refers to select. Made my grep
hit the wrong files, so I figured a simple patch was in order.. (other
refs in the same comment block was changed..)

Magnus Hagander
2004-03-23 21:39:46 +00:00
Bruce Momjian
3947f653f9 * postmaster.c: cleanup pmdaemonize under win32; missed failure message
in CreateOptsFile
* s_lock.c: minor comment fix
* findbe.c: variables not used under win32 moved within #ifndef WIN32
case

Claudio Natoli
2004-03-15 16:18:43 +00:00
Bruce Momjian
c672aa823b For application to HEAD, following community review.
* Changes incorrect CYGWIN defines to __CYGWIN__

* Some localtime returns NULL checks (when unchecked cause SEGVs under
Win32
regression tests)

* Rationalized CreateSharedMemoryAndSemaphores and
AttachSharedMemoryAndSemaphores (Bruce, I finally remembered to do it);
requires attention.

Claudio Natoli
2004-02-25 19:41:23 +00:00
Tom Lane
7a57a67278 Replace opendir/closedir calls throughout the backend with AllocateDir
and FreeDir routines modeled on the existing AllocateFile/FreeFile.
Like the latter, these routines will avoid failing on EMFILE/ENFILE
conditions whenever possible, and will prevent leakage of directory
descriptors if an elog() occurs while one is open.
Also, reduce PANIC to ERROR in MoveOfflineLogs() --- this is not
critical code and there is no reason to force a DB restart on failure.
All per recent trouble report from Olivier Hubaut.
2004-02-23 23:03:10 +00:00
Tom Lane
f83356c7f5 Do a direct probe during postmaster startup to determine the maximum
number of openable files and the number already opened.  This eliminates
depending on sysconf(_SC_OPEN_MAX), and allows much saner behavior on
platforms where open-file slots are used up by semaphores.
2004-02-23 20:45:59 +00:00
Bruce Momjian
af3b182a57 Here is a patch that implements setitimer() on win32. With this patch
applied, deadlock detection and statement_timeout now works.

The file timer.c goes into src/backend/port/win32/.

The patch also removes two lines of "printf debugging" accidentally left
in pqsignal.h, in the console control handler.

Magnus Hagander
2004-02-18 16:25:12 +00:00
Tom Lane
da99cce7cd Avoid delaying postmaster shutdown by up to 10 seconds on platforms
where signals do not terminate sleep() delays.
2004-02-12 20:07:26 +00:00
Jan Wieck
fc65a3e1fd Fixed bug where FlushRelationBuffers() did call StrategyInvalidateBuffer()
for already empty buffers because their buffer tag was not cleard out
when the buffers have been invalidated before.

Also removed the misnamed BM_FREE bufhdr flag and replaced the checks,
which effectively ask if the buffer is unpinned, with checks against the
refcount field.

Jan
2004-02-12 15:06:56 +00:00
Tom Lane
c3c09be34b Commit the reasonably uncontroversial parts of J.R. Nield's PITR patch, to
wit: Add a header record to each WAL segment file so that it can be reliably
identified.  Avoid splitting WAL records across segment files (this is not
strictly necessary, but makes it simpler to incorporate the header records).
Make WAL entries for file creation, deletion, and truncation (as foreseen but
never implemented by Vadim).  Also, add support for making XLOG_SEG_SIZE
configurable at compile time, similarly to BLCKSZ.  Fix a couple bugs I
introduced in WAL replay during recent smgr API changes.  initdb is forced
due to changes in pg_control contents.
2004-02-11 22:55:26 +00:00
Tom Lane
58f337a343 Centralize implementation of delay code by creating a pg_usleep()
subroutine in src/port/pgsleep.c.  Remove platform dependencies from
miscadmin.h and put them in port.h where they belong.  Extend recent
vacuum cost-based-delay patch to apply to VACUUM FULL, ANALYZE, and
non-btree index vacuuming.

By the way, where is the documentation for the cost-based-delay patch?
2004-02-10 03:42:45 +00:00
Tom Lane
87bd956385 Restructure smgr API as per recent proposal. smgr no longer depends on
the relcache, and so the notion of 'blind write' is gone.  This should
improve efficiency in bgwriter and background checkpoint processes.
Internal restructuring in md.c to remove the not-very-useful array of
MdfdVec objects --- might as well just use pointers.
Also remove the long-dead 'persistent main memory' storage manager (mm.c),
since it seems quite unlikely to ever get resurrected.
2004-02-10 01:55:27 +00:00
Neil Conway
f06e79525a Win32 signals cleanup. Patch by Magnus Hagander, with input from Claudio
Natoli and Bruce Momjian (and some cosmetic fixes from Neil Conway).
Changes:

    - remove duplicate signal definitions from pqsignal.h

    - replace pqkill() with kill() and redefine kill() in Win32

    - use ereport() in place of fprintf() in some error handling in
      pqsignal.c

    - export pg_queue_signal() and make use of it where necessary

    - add a console control handler for Ctrl-C and similar handling
      on Win32

    - do WaitForSingleObjectEx() in CHECK_FOR_INTERRUPTS() on Win32;
      query cancelling should now work on Win32

    - various other fixes and cleanups
2004-02-08 22:28:57 +00:00
Jan Wieck
f425b605f4 Cost based vacuum delay feature.
Jan
2004-02-06 19:36:18 +00:00
Jan Wieck
8d09e25693 Backing out the background writer sync() option.
Jan
2004-02-04 01:24:53 +00:00
Bruce Momjian
5ee2ae2049 Remove sleep() and use single PG_SLEEP call for Win32 signal handling
and consistency.

Change PG_USLEEP to use SleepEx() for signal interuptability.
2004-01-30 15:57:04 +00:00
Bruce Momjian
50491963cb Here's the latest win32 signals code, this time in the form of a patch
against the latest shapshot. It also includes the replacement of kill()
with pqkill() and sigsetmask() with pqsigsetmask().

Passes all tests fine on my linux machine once applied. Still doesn't
link completely on Win32 - there are a few things still required. But
much closer than before.

At Bruce's request, I'm goint to write up a README file about the method
of signals delivery chosen and why the others were rejected (basically a
summary of the mailinglist discussions). I'll finish that up once/if the
patch is accepted.


Magnus Hagander
2004-01-27 00:45:26 +00:00
Bruce Momjian
eec08b95e7 [all] Removed call to getppid in SendPostmasterSignal, replacing with a
PostmasterPid variable, which gets set (early) in PostmasterMain
getppid would not be the postmaster?

[fork/exec] Implements processCancelRequest by keeping an array of

pid/cancel_key structs in shared mem

[fork/exec] Moves AttachSharedMemoryAndSemaphores call for backends into
SubPostmasterMain

[win32] Implements reaper/waitpid by keeping an arrays of children
pids,handles in postmaster local mem
      - this item is largely untested, for reasons which should be
obvious, but appears sound

[win32/all] Added extern for pgpipe in Win32 case, and changed the second
pipe call (which seems to have been missed earlier) to pgpipe

[win32] #define'd ftruncate to chsize in the Win32 case

[win32] PG_USLEEP for Win32 has a misplaced paren. Fixed.

[win32] DLLIMPORT handling for MingW case


Claudio Natoli
2004-01-26 22:59:54 +00:00