Commit Graph

1167 Commits

Author SHA1 Message Date
Heikki Linnakangas 23dc89d2c3 Improve error messages in md.c. When a filesystem operation like open() or
fsync() fails, say "file" rather than "relation" when printing the filename.

This makes messages that display block numbers a bit confusing. For example,
in message 'could not read block 150000 of file "base/1234/5678.1"', 150000
is the block number from the beginning of the relation, ie. segment 0, not
150000th block within that segment. Per discussion, users aren't usually
interested in the exact location within the file, so we can live with that.

To ease constructing error messages, add FilePathName(File) function to
return the pathname of a virtual fd.
2009-08-05 18:01:54 +00:00
Tom Lane 2487d872e0 Create a multiplexing structure for signals to Postgres child processes.
This patch gets us out from under the Unix limitation of two user-defined
signal types.  We already had done something similar for signals directed to
the postmaster process; this adds multiplexing for signals directed to
backends and auxiliary processes (so long as they're connected to shared
memory).

As proof of concept, replace the former usage of SIGUSR1 and SIGUSR2
for backends with use of the multiplexing mechanism.  There are still some
hard-wired definitions of SIGUSR1 and SIGUSR2 for other process types,
but getting rid of those doesn't seem interesting at the moment.

Fujii Masao
2009-07-31 20:26:23 +00:00
Tom Lane 8504905793 Fix a thinko introduced into CountActiveBackends by a recent patch:
we should ignore NULL array entries, not non-NULL ones.  This had the
effect of disabling commit_delay, and could have caused a crash in the
rare race condition the patch was intended to fix.

Bug report and diagnosis by Jeff Janes, in bug #4952.
2009-07-29 15:57:11 +00:00
Tom Lane 2de48a83e6 Cleanup and code review for the patch that made bgwriter active during
archive recovery.  Invent a separate state variable and inquiry function
for XLogInsertAllowed() to clarify some tests and make the management of
writing the end-of-recovery checkpoint less klugy.  Fix several places
that were incorrectly testing InRecovery when they should be looking at
RecoveryInProgress or XLogInsertAllowed (because they will now be executed
in the bgwriter not startup process).  Clarify handling of bad LSNs passed
to XLogFlush during recovery.  Use a spinlock for setting/testing
SharedRecoveryInProgress.  Improve quite a lot of comments.

Heikki and Tom
2009-06-26 20:29:04 +00:00
Heikki Linnakangas 7e48b77b1c Fix some serious bugs in archive recovery, now that bgwriter is active
during it:

When bgwriter is active, the startup process can't perform mdsync() correctly
because it won't see the fsync requests accumulated in bgwriter's private
pendingOpsTable. Therefore make bgwriter responsible for the end-of-recovery
checkpoint as well, when it's active.

When bgwriter is active (= archive recovery), the startup process must not
accumulate fsync requests to its own pendingOpsTable, since bgwriter won't
see them there when it performs restartpoints. Make startup process drop its
pendingOpsTable when bgwriter is launched to avoid that.

Update minimum recovery point one last time when leaving archive recovery.
It won't be updated by the end-of-recovery checkpoint because XLogFlush()
sees us as out of recovery already.

This fixes bug #4879 reported by Fujii Masao.
2009-06-25 21:36:00 +00:00
Tom Lane 6382448cf9 For bulk write operations (eg COPY IN), use a ring buffer of 16MB instead
of the 256KB limit originally enforced by a patch committed 2008-11-06.
Per recent test results, the smaller size resulted in an undesirable decrease
in bulk data loading speed, due to COPY processing frequently getting blocked
for WAL flushing.  This area might need more tweaking later, but this setting
seems to be good enough for 8.4.
2009-06-22 20:04:28 +00:00
Bruce Momjian d747140279 8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef list
provided by Andrew.
2009-06-11 14:49:15 +00:00
Tom Lane 4616d57dad Fix all the server-side SIGQUIT handlers (grumble ... why so many identical
copies?) to ensure they really don't run proc_exit/shmem_exit callbacks,
as was intended.  I broke this behavior recently by installing atexit
callbacks without thinking about the one case where we truly don't want
to run those callback functions.  Noted in an example from Dave Page.
2009-05-15 15:56:39 +00:00
Tom Lane 249a899f73 Install an atexit(2) callback that ensures that proc_exit's cleanup processing
will still be performed if something in a backend process calls exit()
directly, instead of going through proc_exit() as we prefer.  This is a second
response to the issue that we might load third-party code that doesn't know it
should not call exit().  Such a call will now cause a reasonably graceful
backend shutdown, if possible.  (Of course, if the reason for the exit() call
is out-of-memory or some such, we might not be able to recover, but at least
we will try.)
2009-05-05 20:06:07 +00:00
Tom Lane 969d7cd431 Install a "dead man switch" to allow the postmaster to detect cases where
a backend has done exit(0) or exit(1) without having disengaged itself
from shared memory.  We are at risk for this whenever third-party code is
loaded into a backend, since such code might not know it's supposed to go
through proc_exit() instead.  Also, it is reported that under Windows
there are ways to externally kill a process that cause the status code
returned to the postmaster to be indistinguishable from a voluntary exit
(thank you, Microsoft).  If this does happen then the system is probably
hosed --- for instance, the dead session might still be holding locks.
So the best recovery method is to treat this like a backend crash.

The dead man switch is armed for a particular child process when it
acquires a regular PGPROC, and disarmed when the PGPROC is released;
these should be the first and last touches of shared memory resources
in a backend, or close enough anyway.  This choice means there is no
coverage for auxiliary processes, but I doubt we need that, since they
shouldn't be executing any user-provided code anyway.

This patch also improves the management of the EXEC_BACKEND
ShmemBackendArray array a bit, by reducing search costs.

Although this problem is of long standing, the lack of field complaints
seems to mean it's not critical enough to risk back-patching; at least
not till we get some more testing of this mechanism.
2009-05-05 19:59:00 +00:00
Tom Lane c973051ae6 A session that does not have any live snapshots does not have to be waited for
when we are waiting for old snapshots to go away during a concurrent index
build.  In particular, this rule lets us avoid waiting for
idle-in-transaction sessions.

This logic could be improved further if we had some way to wake up when
the session we are currently waiting for goes idle-in-transaction.  However
that would be a significantly more complex/invasive patch, so it'll have to
wait for some other day.

Simon Riggs, with some improvements by Tom.
2009-04-04 17:40:36 +00:00
Tom Lane 1b2bb33a54 Add a comment documenting the question of whether PrefetchBuffer should
try to protect an already-existing buffer from being evicted.  This was
left as an open issue when the posix_fadvise patch was committed.  I'm
not sure there's any evidence to justify more work in this area, but we
should have some record about it in the source code.
2009-04-03 18:17:43 +00:00
Tom Lane 948d6ec90f Modify the relcache to record the temp status of both local and nonlocal
temp relations; this is no more expensive than before, now that we have
pg_class.relistemp.  Insert tests into bufmgr.c to prevent attempting
to fetch pages from nonlocal temp relations.  This provides a low-level
defense against bugs-of-omission allowing temp pages to be loaded into shared
buffers, as in the contrib/pgstattuple problem reported by Stuart Bishop.
While at it, tweak a bunch of places to use new relcache tests (instead of
expensive probes into pg_namespace) to detect local or nonlocal temp tables.
2009-03-31 22:12:48 +00:00
Heikki Linnakangas eeeb782e60 Fix a rare race condition when commit_siblings > 0 and a transaction commits
at the same instant as a new backend is spawned. Since CountActiveBackends()
doesn't hold ProcArrayLock, it needs to be prepared for the case that a
pointer at the end of the proc array is still NULL even though numProcs says
it should be valid, since it doesn't hold ProcArrayLock. Backpatch to 8.1.
8.0 and earlier had this right, but it was broken in the split of PGPROC and
sinval shared memory arrays.

Per report and proposal by Marko Kreen.
2009-03-31 05:18:33 +00:00
Tom Lane 471913a6a5 More fixes for 8.4 DTrace probes. Remove useless BUFFER_HIT/BUFFER_MISS
probes --- the BUFFER_READ_DONE probe provides the same information and more
besides.  Expand the LOCK_WAIT_START/DONE probe arguments so that there's
actually some chance of telling what is being waited for.  Update and
clean up the documentation.
2009-03-23 01:52:38 +00:00
Tom Lane 44023dc5f5 Add isExtend to the parameters of the buffer_read_start and buffer_read_done
DTrace probes, so that ordinary reads can be distinguished from relation
extension operations.  Move buffer_read_start probe to before the
smgrnblocks() call that's needed in the isExtend case, since really that step
should be charged as part of the time needed for the extension operation.
(This makes it slightly harder to match the read_start with the associated
read_done, since now you can't match them on blockNumber, but it should still
be possible since isExtend operations on the same relation can never be
interleaved.)  Per recent discussion.

In passing, add the page identity (forkNum/blockNum) to the parameters of the
buffer_flush_start/buffer_flush_done probes, which were unaccountably lacking
the info.
2009-03-22 22:39:05 +00:00
Tom Lane d287c9eff0 Restore previous ordering of BUFFER_FLUSH_START probe. I had wanted to
make it include the time for the possible smgropen() call, but that
results in a null pointer dereference :-(.

An alternative solution would be to fetch the buffer tag instead of
looking at *reln, but I'll just put it back as it was for the moment.

BTW, this indicates that DTrace probes evaluate their arguments even
when nominally inactive.  What was that about "zero cost", again?
2009-03-13 17:46:21 +00:00
Tom Lane e04810e8c4 Code review for dtrace probes added (so far) to 8.4. Adjust placement of
some bufmgr probes, take out redundant and memory-leak-inducing path arguments
to smgr__md__read__done and smgr__md__write__done, fix bogus attempt to
recalculate space used in sort__done, clean up formatting in places where
I'm not sure pgindent will do a nice job by itself.
2009-03-11 23:19:25 +00:00
Peter Eisentraut 9add9f95c3 Don't actively violate the system limit of maximum open files (RLIMIT_NOFILE).
This avoids irritating kernel logs (if system overstep violations are enabled)
and also the grsecurity alert when starting PostgreSQL.

original patch by Jacek Drobiecki

References:
http://archives.postgresql.org/pgsql-bugs/2004-05/msg00103.php
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=248967
2009-03-04 09:12:49 +00:00
Heikki Linnakangas cdd46c7654 Start background writer during archive recovery. Background writer now performs
its usual buffer cleaning duties during archive recovery, and it's responsible
for performing restartpoints.

This requires some changes in postmaster. When the startup process has done
all the initialization and is ready to start WAL redo, it signals the
postmaster to launch the background writer. The postmaster is signaled again
when the point in recovery is reached where we know that the database is in
consistent state. Postmaster isn't interested in that at the moment, but
that's the point where we could let other backends in to perform read-only
queries. The postmaster is signaled third time when the recovery has ended,
so that postmaster knows that it's safe to start accepting connections.

The startup process now traps SIGTERM, and performs a "clean" shutdown. If
you do a fast shutdown during recovery, a shutdown restartpoint is performed,
like a shutdown checkpoint, and postmaster kills the processes cleanly. You
still have to continue the recovery at next startup, though.

Currently, the background writer is only launched during archive recovery.
We could launch it during crash recovery as well, but it seems better to keep
that codepath as simple as possible, for the sake of robustness. And it
couldn't do any restartpoints during crash recovery anyway, so it wouldn't be
that useful.

log_restartpoints is gone. Use log_checkpoints instead. This is yet to be
documented.

This whole operation is a pre-requisite for Hot Standby, but has some value of
its own whether the hot standby patch makes 8.4 or not.

Simon Riggs, with lots of modifications by me.
2009-02-18 15:58:41 +00:00
Heikki Linnakangas b2a667b9ee Add a new option to RestoreBkpBlocks() to indicate if a cleanup lock should
be used instead of the normal exclusive lock, and make WAL redo functions
responsible for calling RestoreBkpBlocks(). They know better what kind of a
lock they need.

At the moment, this just moves things around with no functional change, but
makes the hot standby patch that's under review cleaner.
2009-01-20 18:59:37 +00:00
Tom Lane b7b8f0b609 Implement prefetching via posix_fadvise() for bitmap index scans. A new
GUC variable effective_io_concurrency controls how many concurrent block
prefetch requests will be issued.

(The best way to handle this for plain index scans is still under debate,
so that part is not applied yet --- tgl)

Greg Stark
2009-01-12 05:10:45 +00:00
Tom Lane dad75a62bf Create a "shmem_startup_hook" to be called at the end of shared memory
initialization, to give loadable modules a reasonable place to perform
creation of any shared memory areas they need.  This is the logical conclusion
of our previous creation of RequestAddinShmemSpace() and RequestAddinLWLocks().
We don't need an explicit shmem_shutdown_hook, because the existing
on_shmem_exit and on_proc_exit mechanisms serve that need.

Also, adjust SubPostmasterMain so that libraries that got loaded into the
postmaster will be loaded into all child processes, not only regular backends.
This improves consistency with the non-EXEC_BACKEND behavior, and might be
necessary for functionality for some types of add-ons.
2009-01-03 17:08:39 +00:00
Bruce Momjian 511db38ace Update copyright for 2009. 2009-01-01 17:24:05 +00:00
Bruce Momjian 5a90bc1fbe The attached patch contains a couple of fixes in the existing probes and
includes a few new ones.

- Fixed compilation errors on OS X for probes that use typedefs
- Fixed a number of probes to pass ForkNumber per the relation forks
patch
- The new probes are those that were taken out from the previous
submitted patch and required simple fixes. Will submit the other probes
that may require more discussion in a separate patch.

Robert Lor
2008-12-17 01:39:04 +00:00
Tom Lane 55368223cd Tweak the tree descent loop in fsm_search_avail to not look at the
right child if it doesn't need to.  This saves some miniscule number
of cycles, but the ulterior motive is to avoid an optimization bug
known to exist in SCO's C compiler (and perhaps others?)
2008-12-10 17:11:18 +00:00
Heikki Linnakangas dea81a6cf6 Revert SIGUSR1 multiplexing patch, per Tom's objection. 2008-12-09 15:59:39 +00:00
Heikki Linnakangas 7b05b3fa39 Provide support for multiplexing SIGUSR1 signal. The upcoming synchronous
replication patch needs a signal, but we've already used SIGUSR1 and
SIGUSR2 in normal backends. This patch allows reusing SIGUSR1 for that,
and for other purposes too if the need arises.
2008-12-09 14:28:20 +00:00
Alvaro Herrera 7b640b0345 Fix a couple of snapshot management bugs in the new ResourceOwner world:
non-writable large objects need to have their snapshots registered on the
transaction resowner, not the current portal's, because it must persist until
the large object is closed (which the portal does not).  Also, ensure that the
serializable snapshot is recorded by the transaction resource owner too, even
when a subtransaction has changed the current resource owner before
serializable is taken.

Per bug reports from Pavan Deolasee.
2008-12-04 14:51:02 +00:00
Heikki Linnakangas 011fa3662e Small comment fixes. 2008-12-03 12:22:53 +00:00
Heikki Linnakangas 4d6ee26171 Don't force creation of the FSM on searches. It will still be created
as soon as the first page fills up, and is marked as (almost) full,
though.
2008-11-27 13:32:26 +00:00
Heikki Linnakangas 58bece7a60 Fix #ifdeffed debugging code to work with relation forks. 2008-11-27 07:38:01 +00:00
Heikki Linnakangas 9858a8c81c Rely on relcache invalidation to update the cached size of the FSM. 2008-11-26 17:08:58 +00:00
Heikki Linnakangas 3396000684 Rethink the way FSM truncation works. Instead of WAL-logging FSM
truncations in FSM code, call FreeSpaceMapTruncateRel from smgr_redo. To
make that cleaner from modularity point of view, move the WAL-logging one
level up to RelationTruncate, and move RelationTruncate and all the
related WAL-logging to new src/backend/catalog/storage.c file. Introduce
new RelationCreateStorage and RelationDropStorage functions that are used
instead of calling smgrcreate/smgrscheduleunlink directly. Move the
pending rel deletion stuff from smgrcreate/smgrscheduleunlink to the new
functions. This leaves smgr.c as a thin wrapper around md.c; all the
transactional stuff is now in storage.c.

This will make it easier to add new forks with similar truncation logic,
like the visibility map.
2008-11-19 10:34:52 +00:00
Heikki Linnakangas f06b7604ca Fix oversight in previous error-reporting patch; mustn't pfree path string
before passing it to elog.
2008-11-14 11:09:50 +00:00
Tom Lane cad3a26a95 Fix sloppy omission of now-required #include's. 2008-11-11 14:17:02 +00:00
Heikki Linnakangas 7e8b0b9ab1 Change error messages to print the physical path, like
"base/11517/3767_fsm", instead of symbolic names like "1663/11517/3767/1",
per Alvaro's suggestion. I didn't change the messages in the higher-level
index, heap and FSM routines, though, where the fork is implicit.
2008-11-11 13:19:16 +00:00
Tom Lane 6517f377d6 Implement ALTER DATABASE SET TABLESPACE to move a whole database (or at least
as much of it as lives in its default tablespace) to a new tablespace.

Guillaume Lelarge, with some help from Bernd Helmle and Tom Lane
2008-11-07 18:25:07 +00:00
Tom Lane 85e2cedf98 Improve bulk-insert performance by keeping the current target buffer pinned
(but not locked, as that would risk deadlocks).  Also, make it work in a small
ring of buffers to avoid having bulk inserts trash the whole buffer arena.

Robert Haas, after an idea of Simon Riggs'.
2008-11-06 20:51:15 +00:00
Tom Lane b4eae023bb Clean up the messy semantics (not to mention inefficiency) of PageGetTempPage
by splitting it into three functions with better-defined behaviors.

Zdenek Kotala
2008-11-03 20:47:49 +00:00
Tom Lane d7112cfa88 Remove the last vestiges of the MAKE_PTR/MAKE_OFFSET mechanism. We haven't
allowed different processes to have different addresses for the shmem segment
in quite a long time, but there were still a few places left that used the
old coding convention.  Clean them up to reduce confusion and improve the
compiler's ability to detect pointer type mismatches.

Kris Jurka
2008-11-02 21:24:52 +00:00
Tom Lane 902d1cb35f Remove all uses of the deprecated functions heap_formtuple, heap_modifytuple,
and heap_deformtuple in favor of the newer functions heap_form_tuple et al
(which do the same things but use bool control flags instead of arbitrary
char values).  Eliminate the former duplicate coding of these functions,
reducing the deprecated functions to mere wrappers around the newer ones.
We can't get rid of them entirely because add-on modules probably still
contain many instances of the old coding style.

Kris Jurka
2008-11-02 01:45:28 +00:00
Heikki Linnakangas e9816533e3 Update FSM on WAL replay. This is a bit limited; the FSM is only updated
on non-full-page-image WAL records, and quite arbitrarily, only if there's
less than 20% free space on the page after the insert/update (not on HOT
updates, though). The 20% cutoff should avoid most of the overhead, when
replaying a bulk insertion, for example, while ensuring that pages that
are full are marked as full in the FSM.

This is mostly to avoid the nasty worst case scenario, where you replay
from a PITR archive, and the FSM information in the base backup is really
out of date. If there was a lot of pages that the outdated FSM claims to
have free space, but don't actually have any, the first unlucky inserter
after the recovery would traverse through all those pages, just to find
out that they're full. We didn't have this problem with the old FSM
implementation, because we simply threw the FSM information away on a
non-clean shutdown.
2008-10-31 19:40:27 +00:00
Heikki Linnakangas 19c8dc839b Unite ReadBufferWithFork, ReadBufferWithStrategy, and ZeroOrReadBuffer
functions into one ReadBufferExtended function, that takes the strategy
and mode as argument. There's three modes, RBM_NORMAL which is the default
used by plain ReadBuffer(), RBM_ZERO, which replaces ZeroOrReadBuffer, and
a new mode RBM_ZERO_ON_ERROR, which allows callers to read corrupt pages
without throwing an error. The FSM needs the new mode to recover from
corrupt pages, which could happend if we crash after extending an FSM file,
and the new page is "torn".

Add fork number to some error messages in bufmgr.c, that still lacked it.
2008-10-31 15:05:00 +00:00
Alvaro Herrera 089ae3bc9a Properly access a buffer's LSN using existing access macros instead of abusing
knowledge of page layout.

Stolen from Jonah Harris' CRC patch
2008-10-20 21:11:15 +00:00
Tom Lane dd4c165bc3 Improve some of the comments in fsmpage.c. 2008-10-07 21:10:11 +00:00
Heikki Linnakangas 89f373bf5b Index FSMs needs to be vacuumed as well. Report by Jeff Davis. 2008-10-06 08:04:11 +00:00
Tom Lane 68827a7ada Suppress an uninitialized-variable warning (not all versions of gcc
complain here, but some do)
2008-10-01 14:59:23 +00:00
Heikki Linnakangas f06ef2bede Fix WAL redo of FSM truncation. We can't call smgrtruncate() during WAL
replay, because it tries to XLogInsert().
2008-10-01 08:12:14 +00:00
Tom Lane 6ca1b1cd95 Fix compiler warning (unportable sprintf usage) 2008-09-30 14:15:58 +00:00
Heikki Linnakangas 15c121b3ed Rewrite the FSM. Instead of relying on a fixed-size shared memory segment, the
free space information is stored in a dedicated FSM relation fork, with each
relation (except for hash indexes; they don't use FSM).

This eliminates the max_fsm_relations and max_fsm_pages GUC options; remove any
trace of them from the backend, initdb, and documentation.

Rewrite contrib/pg_freespacemap to match the new FSM implementation. Also
introduce a new variant of the get_raw_page(regclass, int4, int4) function in
contrib/pageinspect that let's you to return pages from any relation fork, and
a new fsm_page_contents() function to inspect the new FSM pages.
2008-09-30 10:52:14 +00:00
Alvaro Herrera 5817d861e9 Optimize CleanupTempFiles by having a boolean flag that keeps track of whether
there are FD_XACT_TEMPORARY files to clean up at transaction end.

Per performance profiling results on AWeber's huge systems.

Patch by me after an idea suggested by Simon Riggs.
2008-09-19 04:57:10 +00:00
Tom Lane 35c2a3c3cf Allow ShowBufferUsage() to report the number of reads/writes that have
occurred to temporary files.  This replaces the unused
NDirectFileRead/NDirectFileWrite counters.

Itagaki Takahiro
2008-09-17 13:15:55 +00:00
Heikki Linnakangas 3f0e808c4a Introduce the concept of relation forks. An smgr relation can now consist
of multiple forks, and each fork can be created and grown separately.

The bulk of this patch is about changing the smgr API to include an extra
ForkNumber argument in every smgr function. Also, smgrscheduleunlink and
smgrdounlink no longer implicitly call smgrclose, because other forks might
still exist after unlinking one. The callers of those functions have been
modified to call smgrclose instead.

This patch in itself doesn't have any user-visible effect, but provides the
infrastructure needed for upcoming patches. The additional forks envisioned
are a rewritten FSM implementation that doesn't rely on a fixed-size shared
memory block, and a visibility map to allow skipping portions of a table in
VACUUM that have no dead tuples.
2008-08-11 11:05:11 +00:00
Tom Lane d8b04d5fac In ReadOrZeroBuffer (and related entry points), don't bother to call
PageHeaderIsValid when we zero the buffer instead of reading the page in.
The actual performance improvement is probably marginal since this function
isn't very heavily used, but a cycle saved is a cycle earned.

Zdenek Kotala
2008-08-05 15:09:04 +00:00
Tom Lane 4abd7b49f1 Improve CREATE/DROP/RENAME DATABASE so that when failing because the source
or target database is being accessed by other users, it tells you whether
the "other users" are live sessions or uncommitted prepared transactions.
(Indeed, it tells you exactly how many of each, but that's mostly just
because it was easy to do so.)  This should help forestall the gotcha of
not realizing that a prepared transaction is what's blocking the command.
Per discussion.
2008-08-04 18:03:46 +00:00
Alvaro Herrera e36e6b1cab Add a few more DTrace probes to the backend.
Robert Lor
2008-08-01 13:16:09 +00:00
Tom Lane dc02a4814a Fix a race condition that I introduced into sinvaladt.c during the recent
rewrite.  When called from SIInsertDataEntries, SICleanupQueue releases
the write lock if it has to issue a kill() to signal some laggard backend.
That still seems like a good idea --- but it's possible that by the time
we get the lock back, there are no longer enough free message slots to
satisfy SIInsertDataEntries' requirement.  Must recheck, and repeat the
whole SICleanupQueue process if not.  Noted while reading code.
2008-07-18 14:45:48 +00:00
Tom Lane 6816577a78 Change the PageGetContents() macro to guarantee its result is maxalign'd,
thereby forestalling any problems with alignment of the data structure placed
there.  Since SizeOfPageHeaderData is maxalign'd anyway in 8.3 and HEAD, this
does not actually change anything right now, but it is foreseeable that the
header size will change again someday.  I had to fix a couple of places that
were assuming that the content offset is just SizeOfPageHeaderData rather than
MAXALIGN(SizeOfPageHeaderData).  Per discussion of Zdenek's page-macros patch.
2008-07-13 21:50:04 +00:00
Tom Lane 9d035f4254 Clean up the use of some page-header-access macros: principally, use
SizeOfPageHeaderData instead of sizeof(PageHeaderData) in places where that
makes the code clearer, and avoid casting between Page and PageHeader where
possible.  Zdenek Kotala, with some additional cleanup by Heikki Linnakangas.

I did not apply the parts of the proposed patch that would have resulted in
slightly changing the on-disk format of hash indexes; it seems to me that's
not a win as long as there's any chance of having in-place upgrade for 8.4.
2008-07-13 20:45:47 +00:00
Alvaro Herrera 110147653a Make sure we only try to free snapshots that have been passed through
CopySnapshot, per Neil Conway.  Also add a comment about the assumption in
GetSnapshotData that the argument is statically allocated.

Also, fix some more typos in comments in snapmgr.c.
2008-07-11 02:10:14 +00:00
Tom Lane 5b965bf08b Teach autovacuum how to determine whether a temp table belongs to a crashed
backend.  If so, send a LOG message to the postmaster log, and if the table
is beyond the vacuum-for-wraparound horizon, forcibly drop it.  Per recent
discussions.  Perhaps we ought to back-patch this, but it probably needs
to age a bit in HEAD first.
2008-07-01 02:09:34 +00:00
Tom Lane dab421d2f0 Seems I was too optimistic in supposing that sinval's maxMsgNum could be
read and written without a lock.  The value itself is atomic, sure, but on
processors with weak memory ordering it's possible for a reader to see the
value change before it sees the associated message written into the buffer
array.  Fix by introducing a spinlock that's used just to read and write
maxMsgNum.  (We could do this with less overhead if we recognized a concept
of "memory access barrier"; is it worth introducing such a thing?  At the
moment probably not --- I can't measure any clear slowdown from adding the
spinlock, so this solution is probably fine.)  Per buildfarm results.
2008-06-20 00:24:53 +00:00
Tom Lane fad153ec45 Rewrite the sinval messaging mechanism to reduce contention and avoid
unnecessary cache resets.  The major changes are:

* When the queue overflows, we only issue a cache reset to the specific
backend or backends that still haven't read the oldest message, rather
than resetting everyone as in the original coding.

* When we observe backend(s) falling well behind, we signal SIGUSR1
to only one backend, the one that is furthest behind and doesn't already
have a signal outstanding for it.  When it finishes catching up, it will
in turn signal SIGUSR1 to the next-furthest-back guy, if there is one that
is far enough behind to justify a signal.  The PMSIGNAL_WAKEN_CHILDREN
mechanism is removed.

* We don't attempt to clean out dead messages after every message-receipt
operation; rather, we do it on the insertion side, and only when the queue
fullness passes certain thresholds.

* Split SInvalLock into SInvalReadLock and SInvalWriteLock so that readers
don't block writers nor vice versa (except during the infrequent queue
cleanout operations).

* Transfer multiple sinval messages for each acquisition of a read or
write lock.
2008-06-19 21:32:56 +00:00
Alvaro Herrera a3540b0f65 Improve our #include situation by moving pointer types away from the
corresponding struct definitions.  This allows other headers to avoid including
certain highly-loaded headers such as rel.h and relscan.h, instead using just
relcache.h, heapam.h or genam.h, which are more lightweight and thus cause less
unnecessary dependencies.
2008-06-19 00:46:06 +00:00
Tom Lane 86fdb32bd0 Remove freeBackends counter from the sinval shared memory area. We used to
use it to help enforce superuser_reserved_backends, but since 8.1 it's
just been dead weight.
2008-06-17 20:07:08 +00:00
Heikki Linnakangas a213f1ee6c Refactor XLogOpenRelation() and XLogReadBuffer() in preparation for relation
forks. XLogOpenRelation() and the associated light-weight relation cache in
xlogutils.c is gone, and XLogReadBuffer() now takes a RelFileNode as argument,
instead of Relation.

For functions that still need a Relation struct during WAL replay, there's a
new function called CreateFakeRelcacheEntry() that returns a fake entry like
XLogOpenRelation() used to.
2008-06-12 09:12:31 +00:00
Neil Conway 8374246054 Further tweak for comment in CheckDeadLock(), per Tom. 2008-06-09 18:23:05 +00:00
Neil Conway da80a4b97e Fix typo in comment. 2008-06-09 06:55:34 +00:00
Alvaro Herrera cc87402d6e Move BufferGetPageSize and BufferGetPage from bufpage.h to bufmgr.h. It is
more logical that way, and also it reduces the amount of unnecessary includes
in bufpage.h, which is widely used.

Zdenek Kotala.

My previous patch to bufpage.h should also have credited him as author, but I
forgot (sorry about that).
2008-06-08 22:00:48 +00:00
Bruce Momjian d82a1d582c This is the patch replace offnum++ by OffsetNumberNext, to be
consistent.  OffsetNumberNext() has some casting that makes it useful.

Fujii Masao
2008-05-13 15:44:08 +00:00
Alvaro Herrera 5da9da71c4 Improve snapshot manager by keeping explicit track of snapshots.
There are two ways to track a snapshot: there's the "registered" list, which
is used for arbitrary long-lived snapshots; and there's the "active stack",
which is used for the snapshot that is considered "active" at any time.
This also allows users of snapshots to stop worrying about snapshot memory
allocation and freeing, and about using PG_TRY blocks around ActiveSnapshot
assignment.  This is all done automatically now.

As a consequence, this allows us to reset MyProc->xmin when there are no
more snapshots registered in the current backend, reducing the impact that
long-running transactions have on VACUUM.
2008-05-12 20:02:02 +00:00
Alvaro Herrera 9084399782 Put back bufmgr.h in bufpage.h -- it is needed by some macros.
Remove #include bufmgr.h from (most?) source files which already include
bufpage.h.
2008-05-12 16:06:10 +00:00
Alvaro Herrera f8c4d7db60 Restructure some header files a bit, in particular heapam.h, by removing some
unnecessary #include lines in it.  Also, move some tuple routine prototypes and
macros to htup.h, which allows removal of heapam.h inclusion from some .c
files.

For this to work, a new header file access/sysattr.h needed to be created,
initially containing attribute numbers of system columns, for pg_dump usage.

While at it, make contrib ltree, intarray and hstore header files more
consistent with our header style.
2008-05-12 00:00:54 +00:00
Tom Lane 3c6248a828 Remove the recently added USE_SEGMENTED_FILES option, and indeed remove all
support for a nonsegmented mode from md.c.  Per recent discussions, there
doesn't seem to be much value in a "never segment" option as opposed to
segmenting with a suitably large segment size.  So instead provide a
configure-time switch to set the desired segment size in units of gigabytes.
While at it, expose a configure switch for BLCKSZ as well.

Zdenek Kotala
2008-05-02 01:08:27 +00:00
Heikki Linnakangas 9cb91f90c9 Fix two race conditions between the pending unlink mechanism that was put in
place to prevent reusing relation OIDs before next checkpoint, and DROP
DATABASE. First, if a database was dropped, bgwriter would still try to unlink
the files that the rmtree() call by the DROP DATABASE command has already
deleted, or is just about to delete. Second, if a database is dropped, and
another database is created with the same OID, bgwriter would in the worst
case delete a relation in the new database that happened to get the same OID
as a dropped relation in the old database.

To fix these race conditions:
- make rmtree() ignore ENOENT errors. This fixes the 1st race condition.
- make ForgetDatabaseFsyncRequests forget unlink requests as well.
- force checkpoint on in dropdb on all platforms

Since ForgetDatabaseFsyncRequests() is asynchronous, the 2nd change isn't
enough on its own to fix the problem of dropping and creating a database with
same OID, but forcing a checkpoint on DROP DATABASE makes it sufficient.

Per Tom Lane's bug report and proposal. Backpatch to 8.3.
2008-04-18 06:48:38 +00:00
Tom Lane d1cbd26ded Repair two places where SIGTERM exit could leave shared memory state
corrupted.  (Neither is very important if SIGTERM is used to shut down the
whole database cluster together, but there's a problem if someone tries to
SIGTERM individual backends.)  To do this, introduce new infrastructure
macros PG_ENSURE_ERROR_CLEANUP/PG_END_ENSURE_ERROR_CLEANUP that take care
of transiently pushing an on_shmem_exit cleanup hook.  Also use this method
for createdb cleanup --- that wasn't a shared-memory-corruption problem,
but SIGTERM abort of createdb could leave orphaned files lying around.

Backpatch as far as 8.2.  The shmem corruption cases don't exist in 8.1,
and the createdb usage doesn't seem important enough to risk backpatching
further.
2008-04-16 23:59:40 +00:00
Tom Lane ec498cdcbb Create new routines systable_beginscan_ordered, systable_getnext_ordered,
systable_endscan_ordered that have API similar to systable_beginscan etc
(in particular, the passed-in scankeys have heap not index attnums),
but guarantee ordered output, unlike the existing functions.  For the moment
these are just very thin wrappers around index_beginscan/index_getnext/etc.
Someday they might need to get smarter; but for now this is just a code
refactoring exercise to reduce the number of direct callers of index_getnext,
in preparation for changing that function's API.

In passing, remove index_getnext_indexitem, which has been dead code for
quite some time, and will have even less use than that in the presence
of run-time-lossy indexes.
2008-04-12 23:14:21 +00:00
Alvaro Herrera 73b0300b2a Move the HTSU_Result enum definition into snapshot.h, to avoid including
tqual.h into heapam.h.  This makes all inclusion of tqual.h explicit.

I also sorted alphabetically the includes on some source files.
2008-03-26 21:10:39 +00:00
Alvaro Herrera 78f02ca1f5 Rename snapmgmt.c/h to snapmgr.c/h, for consistency with other files.
Per complaint from Tom Lane.
2008-03-26 18:48:59 +00:00
Alvaro Herrera d43b085d57 Separate snapshot management code from tuple visibility code, create a
snapmgmt.c file for the former.  The header files have also been reorganized
in three parts: the most basic snapshot definitions are now in a new file
snapshot.h, and the also new snapmgmt.h keeps the definitions for snapmgmt.c.
tqual.h has been reduced to the bare minimum.

This patch is just a first step towards managing live snapshots within a
transaction; there is no functionality change.

Per my proposal to pgsql-patches on 20080318191940.GB27458@alvh.no-ip.org and
subsequent discussion.
2008-03-26 16:20:48 +00:00
Tom Lane 9b8e1eb375 Adjust the recent patch for reporting of deadlocked queries so that we report
query texts only to the server log.  This eliminates the issue of possible
leaking of security-sensitive data in other sessions' queries.  Since the
log is presumed secure, we can now log the queries of all sessions involved
in the deadlock, whether or not they belong to the same user as the one
reporting the failure.
2008-03-24 18:22:36 +00:00
Tom Lane 4b7ae4afae Report the current queries of all backends involved in a deadlock
(if they'd be visible to the current user in pg_stat_activity).

This might look like it's subject to race conditions, but it's actually
pretty safe because at the time DeadLockReport() is constructing the
report, we haven't yet aborted our transaction and so we can expect that
everyone else involved in the deadlock is still blocked on some lock.
(There are corner cases where that might not be true, such as a statement
timeout triggering in another backend before we finish reporting; but at
worst we'd report a misleading activity string, so it seems acceptable
considering the usefulness of reporting the queries.)

Original patch by Itagaki Takahiro, heavily modified by me.
2008-03-21 21:08:31 +00:00
Bruce Momjian fca9fff41b More README src cleanups. 2008-03-21 13:23:29 +00:00
Bruce Momjian 4e228447aa Make source code READMEs more consistent. Add CVS tags to all README files. 2008-03-20 17:55:15 +00:00
Alvaro Herrera d54bb24cdd Move elog(DEBUG4) call outside the locked area, per suggestion from Tom Lane. 2008-03-18 12:36:43 +00:00
Peter Eisentraut a7b7b07af3 Enable probes to work with Mac OS X Leopard and other OSes that will
support DTrace in the future.

Switch from using DTRACE_PROBEn macros to the dynamically generated macros.
Use "dtrace -h" to create a header file that contains the dynamically
generated macros to be used in the source code instead of the DTRACE_PROBEn
macros.  A dummy header file is generated for builds without DTrace support.

Author: Robert Lor <Robert.Lor@sun.com>
2008-03-17 19:44:41 +00:00
Alvaro Herrera 23057f51f5 Move ProcState definition into sinvaladt.c from sinvaladt.h, since it's not
needed anywhere after my previous patch.  Noticed by Tom Lane.

Also, remove #include <signal.h> from sinval.c.
2008-03-17 11:50:27 +00:00
Alvaro Herrera ec6550c6c0 Modify interactions between sinval.c and sinvaladt.c. The code that actually
deals with the queue, including locking etc, is all in sinvaladt.c.  This means
that the struct definition of the queue, and the queue pointer, are now
internal "implementation details" inside sinvaladt.c.

Per my proposal dated 25-Jun-2007 and followup discussion.
2008-03-16 19:47:34 +00:00
Tom Lane 611b4393f2 Make TransactionIdIsInProgress check transam.c's single-item XID status cache
before it goes groveling through the ProcArray.  In situations where the same
recently-committed transaction ID is checked repeatedly by tqual.c, this saves
a lot of shared-memory searches.  And it's cheap enough that it shouldn't
hurt noticeably when it doesn't help.
Concept and patch by Simon, some minor tweaking and comment-cleanup by Tom.
2008-03-11 20:20:35 +00:00
Tom Lane f0828b2fc3 Provide a build-time option to store large relations as single files, rather
than dividing them into 1GB segments as has been our longtime practice.  This
requires working support for large files in the operating system; at least for
the time being, it won't be the default.

Zdenek Kotala
2008-03-10 20:06:27 +00:00
Tom Lane 3fcc7e8e18 Reduce memory consumption during VACUUM of large relations, by using
FSMPageData (6 bytes) instead of PageFreeSpaceInfo (8 or 16 bytes)
for the temporary array of page-free-space information.

Itagaki Takahiro
2008-03-10 02:04:10 +00:00
Tom Lane 7d6e6e2e97 Fix PREPARE TRANSACTION to reject the case where the transaction has dropped a
temporary table; we can't support that because there's no way to clean up the
source backend's internal state if the eventual COMMIT PREPARED is done by
another backend.  This was checked correctly in 8.1 but I broke it in 8.2 :-(.
Patch by Heikki Linnakangas, original trouble report by John Smith.
2008-03-04 19:54:06 +00:00
Tom Lane d50e256b67 Fix another place that was assuming that a local variable declared as
"struct varlena" would be at least word-aligned.  Per buildfarm results
from gypsy_moth.  I did a little bit of trawling for other instances of
this coding pattern, and didn't find any; but if we turn up any more
of them I think we'd better revert the "char [4]" patch and find another
way of making tuptoaster.c alignment-safe.
2008-03-01 19:26:22 +00:00
Peter Eisentraut 0474dcb608 Refactor backend makefiles to remove lots of duplicate code 2008-02-19 10:30:09 +00:00
Tom Lane 082aca9ec2 Fix PageGetExactFreeSpace() so that it actually behaves sensibly
if pd_lower > pd_upper, rather than merely claiming to.  This would
only matter if the page header were corrupt, which shouldn't occur,
but ...
2008-02-10 20:39:08 +00:00
Tom Lane 6f906905b1 Fix WaitOnLock() to ensure that the process's "waiting" flag is reset after
erroring out of a wait.  We can use a PG_TRY block for this, but add a comment
explaining why it'd be a bad idea to use it for any other state cleanup.

Back-patch to 8.2.  Prior releases had the same issue, but only with respect
to the process title, which is likely to get reset almost immediately anyway
after the transaction aborts, so it seems not worth changing them.  In 8.2
and HEAD, the pg_stat_activity "waiting" flag could remain set incorrectly
for a long time.

Per report from Gurjeet Singh.
2008-02-02 22:26:17 +00:00
Tom Lane 6322e84430 Change StatementCancelHandler() to check the DoingCommandRead flag to decide
whether to execute an immediate interrupt, rather than testing whether
LockWaitCancel() cancelled a lock wait.  The old way misclassified the case
where we were blocked in ProcWaitForSignal(), and arguably would misclassify
any other future additions of new ImmediateInterruptOK states too.  This
allows reverting the old kluge that gave LockWaitCancel() a return value,
since no callers care anymore.  Improve comments in the various
implementations of PGSemaphoreLock() to explain that on some platforms, the
assumption that semop() exits after a signal is wrong, and so we must ensure
that the signal handler itself throws elog if we want cancel or die interrupts
to be effective.  Per testing related to bug #3883, though this patch doesn't
solve those problems fully.

Perhaps this change should be back-patched, but since pre-8.3 branches aren't
really relying on autovacuum to respond to SIGINT, it doesn't seem critical
for them.
2008-01-26 19:55:08 +00:00
Tom Lane ceb9360067 Fix CREATE INDEX CONCURRENTLY to not deadlock against an automatic or manual
VACUUM that is blocked waiting to get lock on the table being indexed.
Per report and fix suggestion from Greg Stark.
2008-01-09 21:52:36 +00:00
Tom Lane da3df47c84 lmgr.c:DescribeLockTag was never taught about virtual xids, per Greg Stark.
Also a couple of minor tweaks to try to future-proof the code a bit better
against future locktag additions.
2008-01-08 23:18:51 +00:00