Commit Graph

317 Commits

Author SHA1 Message Date
Heikki Linnakangas 3f0e808c4a Introduce the concept of relation forks. An smgr relation can now consist
of multiple forks, and each fork can be created and grown separately.

The bulk of this patch is about changing the smgr API to include an extra
ForkNumber argument in every smgr function. Also, smgrscheduleunlink and
smgrdounlink no longer implicitly call smgrclose, because other forks might
still exist after unlinking one. The callers of those functions have been
modified to call smgrclose instead.

This patch in itself doesn't have any user-visible effect, but provides the
infrastructure needed for upcoming patches. The additional forks envisioned
are a rewritten FSM implementation that doesn't rely on a fixed-size shared
memory block, and a visibility map to allow skipping portions of a table in
VACUUM that have no dead tuples.
2008-08-11 11:05:11 +00:00
Tom Lane 9d035f4254 Clean up the use of some page-header-access macros: principally, use
SizeOfPageHeaderData instead of sizeof(PageHeaderData) in places where that
makes the code clearer, and avoid casting between Page and PageHeader where
possible.  Zdenek Kotala, with some additional cleanup by Heikki Linnakangas.

I did not apply the parts of the proposed patch that would have resulted in
slightly changing the on-disk format of hash indexes; it seems to me that's
not a win as long as there's any chance of having in-place upgrade for 8.4.
2008-07-13 20:45:47 +00:00
Bruce Momjian 6b797c852b Fix recovery.conf boolean variables to take the same range of string
values as postgresql.conf.
2008-06-30 22:10:43 +00:00
Heikki Linnakangas a213f1ee6c Refactor XLogOpenRelation() and XLogReadBuffer() in preparation for relation
forks. XLogOpenRelation() and the associated light-weight relation cache in
xlogutils.c is gone, and XLogReadBuffer() now takes a RelFileNode as argument,
instead of Relation.

For functions that still need a Relation struct during WAL replay, there's a
new function called CreateFakeRelcacheEntry() that returns a fake entry like
XLogOpenRelation() used to.
2008-06-12 09:12:31 +00:00
Alvaro Herrera cc87402d6e Move BufferGetPageSize and BufferGetPage from bufpage.h to bufmgr.h. It is
more logical that way, and also it reduces the amount of unnecessary includes
in bufpage.h, which is widely used.

Zdenek Kotala.

My previous patch to bufpage.h should also have credited him as author, but I
forgot (sorry about that).
2008-06-08 22:00:48 +00:00
Magnus Hagander 8eee526c19 Set hidden field for guc enum missed in previous commit. 2008-05-28 15:22:05 +00:00
Tom Lane 1a604b4e31 Fix a subtle bug exposed by recent wal_sync_method rearrangements.
Formerly, the default value of wal_sync_method was determined inside xlog.c,
but now it is determined inside guc.c.  guc.c was reading xlogdefs.h
without having read <fcntl.h>, leading to wrong determination of
DEFAULT_SYNC_METHOD.  Obviously xlogdefs.h needs to include <fcntl.h>
for itself to ensure stable results.
2008-05-17 17:24:57 +00:00
Tom Lane 8a2f5d221b Reduce unnecessary PANIC to ERROR, improve a couple of comments. 2008-05-16 19:15:05 +00:00
Magnus Hagander 9bf1db04c0 Remove the special variable for open_sync_bit used in O_SYNC and O_DSYNC
modes, replacing it with a call to a function that derives it from the
sync_method variable, now that it has distinct values for these two cases.

This means that assign_xlog_sync_method() no longer changes any settings,
thus fixing the bug introduced in the change to use a guc enum for
wal_sync_method.
2008-05-14 14:02:57 +00:00
Magnus Hagander 72e2db86b9 Don't try to close negative file descriptors, since this can cause
crashes on certain platforms. In particular, the MSVC runtime is known
to do this.

Fixes bug #4162, reported and diagnosed by Javier Pimas
2008-05-13 20:53:52 +00:00
Magnus Hagander aa82790fca Fix breakage by the wal_sync_method patch in installations that use
O_DSYNC (specifically this broke all the Windows buildfarm members)
2008-05-12 19:45:23 +00:00
Alvaro Herrera 9084399782 Put back bufmgr.h in bufpage.h -- it is needed by some macros.
Remove #include bufmgr.h from (most?) source files which already include
bufpage.h.
2008-05-12 16:06:10 +00:00
Magnus Hagander 2739a4e1d2 Report which WAL sync method we are trying to change *to* when it fails,
not which one we had before (that worked, and thus is completley irrelevant)
2008-05-12 14:27:47 +00:00
Magnus Hagander f99760c19f Convert wal_sync_method to guc enum. 2008-05-12 08:35:05 +00:00
Alvaro Herrera f8c4d7db60 Restructure some header files a bit, in particular heapam.h, by removing some
unnecessary #include lines in it.  Also, move some tuple routine prototypes and
macros to htup.h, which allows removal of heapam.h inclusion from some .c
files.

For this to work, a new header file access/sysattr.h needed to be created,
initially containing attribute numbers of system columns, for pg_dump usage.

While at it, make contrib ltree, intarray and hstore header files more
consistent with our header style.
2008-05-12 00:00:54 +00:00
Heikki Linnakangas c5f42ce8d5 Fix Assert introduced in previous patch. 2008-05-09 15:27:17 +00:00
Heikki Linnakangas f0eb3e5e58 Fix incorrect archive truncation point calculation in the %r recovery_command
parameter. This fixes bug 4137 reported by Wojciech Strzalka, where a WAL
file is deleted too early when starting the recovery of a warm standby server.

Also add a sanity check in pg_standby so that it will refuse to delete anything
earlier than the file being restored, and improve the debug message in case
nothing is deleted.

Simon Riggs. Backpatch to 8.3, which is where %r was introduced.
2008-05-09 14:27:47 +00:00
Magnus Hagander 380d1ee69e Update error messages, per notes from Tom.
Laurenz Albe
2008-04-24 14:23:43 +00:00
Magnus Hagander c979a1fefa Prevent shutdown in normal mode if online backup is running, and
have pg_ctl warn about this.

Cancel running online backups (by renaming the backup_label file,
thus rendering the backup useless) when shutting down in fast mode.

Laurenz Albe
2008-04-23 13:44:59 +00:00
Tom Lane 8472bf7a73 Allow float8, int8, and related datatypes to be passed by value on machines
where Datum is 8 bytes wide.  Since this will break old-style C functions
(those still using version 0 calling convention) that have arguments or
results of these types, provide a configure option to disable it and retain
the old pass-by-reference behavior.  Likewise, provide a configure option
to disable the recently-committed float4 pass-by-value change.

Zoltan Boszormenyi, plus configurability stuff by me.
2008-04-21 00:26:47 +00:00
Tom Lane d1cbd26ded Repair two places where SIGTERM exit could leave shared memory state
corrupted.  (Neither is very important if SIGTERM is used to shut down the
whole database cluster together, but there's a problem if someone tries to
SIGTERM individual backends.)  To do this, introduce new infrastructure
macros PG_ENSURE_ERROR_CLEANUP/PG_END_ENSURE_ERROR_CLEANUP that take care
of transiently pushing an on_shmem_exit cleanup hook.  Also use this method
for createdb cleanup --- that wasn't a shared-memory-corruption problem,
but SIGTERM abort of createdb could leave orphaned files lying around.

Backpatch as far as 8.2.  The shmem corruption cases don't exist in 8.1,
and the createdb usage doesn't seem important enough to risk backpatching
further.
2008-04-16 23:59:40 +00:00
Bruce Momjian 2a1cf97c22 Have pg_stop_backup() wait for all archive files to be sent, rather than
returing right away.  This guarantees that when pg_stop_backup()
returns, you have a valid backup.

Simon Riggs
2008-04-05 01:34:06 +00:00
Tom Lane 220db7ccd8 Simplify and standardize conversions between TEXT datums and ordinary C
strings.  This patch introduces four support functions cstring_to_text,
cstring_to_text_with_len, text_to_cstring, and text_to_cstring_buffer, and
two macros CStringGetTextDatum and TextDatumGetCString.  A number of
existing macros that provided variants on these themes were removed.

Most of the places that need to make such conversions now require just one
function or macro call, in place of the multiple notational layers that used
to be needed.  There are no longer any direct calls of textout or textin,
and we got most of the places that were using handmade conversions via
memcpy (there may be a few still lurking, though).

This commit doesn't make any serious effort to eliminate transient memory
leaks caused by detoasting toasted text objects before they reach
text_to_cstring.  We changed PG_GETARG_TEXT_P to PG_GETARG_TEXT_PP in a few
places where it was easy, but much more could be done.

Brendan Jurd and Tom Lane
2008-03-25 22:42:46 +00:00
Tom Lane 2fc2795456 Remove no-longer-used XLogCacheByte field of XLogCtl.
Itagaki Takahiro
2008-03-10 02:13:22 +00:00
Tom Lane cd00406774 Replace time_t with pg_time_t (same values, but always int64) in on-disk
data structures and backend internal APIs.  This solves problems we've seen
recently with inconsistent layout of pg_control between machines that have
32-bit time_t and those that have already migrated to 64-bit time_t.  Also,
we can get out from under the problem that Windows' Unix-API emulation is not
consistent about the width of time_t.

There are a few remaining places where local time_t variables are used to hold
the current or recent result of time(NULL).  I didn't bother changing these
since they do not affect any cross-module APIs and surely all platforms will
have 64-bit time_t before overflow becomes an actual risk.  time_t should
be avoided for anything visible to extension modules, however.
2008-02-17 02:09:32 +00:00
Peter Eisentraut 6f8f8d2daa Provide a clearer error message if the pg_control version number looks
wrong because of mismatched byte ordering.
2008-01-21 11:17:46 +00:00
Bruce Momjian 9098ab9e32 Update copyrights in source tree to 2008. 2008-01-01 19:46:01 +00:00
Bruce Momjian f6e8730d11 Re-run pgindent with updated list of typedefs. (Updated README should
avoid this problem in the future.)
2007-11-15 22:25:18 +00:00
Peter Eisentraut b30769ee54 When logging the recovery.conf parameters, show them quoted as they would
appear in the configuration file.
2007-11-15 22:02:12 +00:00
Bruce Momjian fdf5a5efb7 pgindent run for 8.3. 2007-11-15 21:14:46 +00:00
Tom Lane 6cc4451b5c Prevent re-use of a deleted relation's relfilenode until after the next
checkpoint.  This guards against an unlikely data-loss scenario in which
we re-use the relfilenode, then crash, then replay the deletion and
recreation of the file.  Even then we'd be OK if all insertions into the
new relation had been WAL-logged ... but that's not guaranteed given all
the no-WAL-logging optimizations that have recently been added.

Patch by Heikki Linnakangas, per a discussion last month.
2007-11-15 20:36:40 +00:00
Tom Lane 5c8eb929e6 When telling the bgwriter that we need a checkpoint because too much xlog
has been consumed, recheck against the latest value of RedoRecPtr before
really sending the signal.  This avoids useless checkpoint activity if
XLogWrite is executed when we have a very stale local copy of RedoRecPtr.
The potential for useless checkpoint is very much worse in 8.3 because of
the walwriter process (which never does XLogInsert), so while this behavior
was intentional, it needs to be changed.  Per report from Itagaki Takahiro.
2007-10-12 19:39:59 +00:00
Tom Lane ab051bd293 Adjust recovery PS display as agreed with Simon: 'waiting for XXX'
while the restore_command does its thing, then 'recovering XXX' while
processing the segment file.  These operations are heavyweight enough
that an extra PS display set shouldn't bother anyone.
2007-09-30 17:28:56 +00:00
Tom Lane 77ccbe64dd Make recovery show the current input WAL segment name in the startup
process' PS display.  After a suggestion by Simon (not exactly his
patch though).
2007-09-29 18:32:56 +00:00
Tom Lane b46bd55a6c Make archive recovery always start a new timeline, rather than only when a
recovery stop time was used.  This avoids a corner-case risk of trying to
overwrite an existing archived copy of the last WAL segment, and seems
simpler and cleaner all around than the original definition.  Per example
from Jon Colverson and subsequent analysis by Simon.
2007-09-29 01:36:10 +00:00
Tom Lane f18dfc4835 Minor improvements in backup and recovery:
- create a separate archive_mode GUC, on which archive_command is dependent

- %r option in recovery.conf sends last restartpoint to recovery command

- %r used in pg_standby, updated README

- minor other code cleanup in pg_standby

- doc on Warm Standby now mentions pg_standby and %r

- log_restartpoints recovery option emits LOG message at each restartpoint

- end of recovery now displays last transaction end time, as requested
  by Warren Little; also shown at each restartpoint

- restart archiver if needed to carry away WAL files at shutdown

Simon Riggs
2007-09-26 22:36:30 +00:00
Tom Lane 6bd4f401b0 Replace the former method of determining snapshot xmax --- to wit, calling
ReadNewTransactionId from GetSnapshotData --- with a "latestCompletedXid"
variable that is updated during transaction commit or abort.  Since
latestCompletedXid is written only in places that had to lock ProcArrayLock
exclusively anyway, and is read only in places that had to lock ProcArrayLock
shared anyway, it adds no new locking requirements to the system despite being
cluster-wide.  Moreover, removing ReadNewTransactionId from snapshot
acquisition eliminates the need to take both XidGenLock and ProcArrayLock at
the same time.  Since XidGenLock is sometimes held across I/O this can be a
significant win.  Some preliminary benchmarking suggested that this patch has
no effect on average throughput but can significantly improve the worst-case
transaction times seen in pgbench.  Concept by Florian Pflug, implementation
by Tom Lane.
2007-09-08 20:31:15 +00:00
Tom Lane 295e63983d Implement lazy XID allocation: transactions that do not modify any database
rows will normally never obtain an XID at all.  We already did things this way
for subtransactions, but this patch extends the concept to top-level
transactions.  In applications where there are lots of short read-only
transactions, this should improve performance noticeably; not so much from
removal of the actual XID-assignments, as from reduction of overhead that's
driven by the rate of XID consumption.  We add a concept of a "virtual
transaction ID" so that active transactions can be uniquely identified even
if they don't have a regular XID.  This is a much lighter-weight concept:
uniqueness of VXIDs is only guaranteed over the short term, and no on-disk
record is made about them.

Florian Pflug, with some editorialization by Tom.
2007-09-05 18:10:48 +00:00
Tom Lane a52e4408b9 Add a debug logging message when a resource manager rejects an attempted
restart point.  Per suggestion from Simon Riggs.
2007-08-28 23:17:47 +00:00
Tom Lane 647fd9a108 Fix two bugs induced in VACUUM FULL by async-commit patch.
First, we cannot assume that XLogAsyncCommitFlush guarantees hint bits will be
settable, because clog.c's inexact LSN bookkeeping results in windows where a
previously flushed transaction is considered unhintable because it shares an
LSN slot with a later unflushed transaction.  But repair_frag requires
XMIN_COMMITTED to be correct so that it can distinguish tuples moved by the
current vacuum.  Since not being able to set the bit is an uncommon corner
case, the most practical way of dealing with it seems to be to abandon
shrinking (ie, don't invoke repair_frag) when we find a non-dead tuple whose
XMIN_COMMITTED bit couldn't be set.

Second, it is possible for the same reason that a RECENTLY_DEAD tuple does not
get its XMAX_COMMITTED bit set during scan_heap.  But by the time repair_frag
examines the tuple it might be possible to set the bit.  We therefore must
take buffer content lock when calling HeapTupleSatisfiesVacuum a second time,
else we can get an Assert failure in SetBufferCommitInfoNeedsSave.  This
latter bug is latent in existing releases, but I think it cannot actually
occur without async commit, since the first HeapTupleSatisfiesVacuum call
should always have set the bit.  So I'm not going to back-patch it.

In passing, reduce the existing "cannot shrink relation" messages from NOTICE
to LOG level.  The new message must be no higher than LOG if we don't want
unpredictable regression test failures, and consistency seems like a good
idea.  Also arrange that only one such message is reported per VACUUM FULL;
in typical scenarios you could get spammed with many such messages, which
seems a bit useless.
2007-08-13 19:08:26 +00:00
Tom Lane bdd6b62245 Switch over to using the src/timezone functions for formatting timestamps
displayed in the postmaster log.  This avoids Windows-specific problems with
localized time zone names that are in the wrong encoding, and generally seems
like a good idea to forestall other potential platform-dependent issues.
To preserve the existing behavior that all backends will log in the same time
zone, create a new GUC variable log_timezone that can only be changed on a
system-wide basis, and reference log-related calculations to that zone instead
of the TimeZone variable.

This fixes the issue reported by Hiroshi Saito that timestamps printed by
xlog.c startup could be improperly localized on Windows.  We still need a
simpler patch for that problem in the back branches, however.
2007-08-04 01:26:54 +00:00
Tom Lane 4a78cdeb6b Support an optional asynchronous commit mode, in which we don't flush WAL
before reporting a transaction committed.  Data consistency is still
guaranteed (unlike setting fsync = off), but a crash may lose the effects
of the last few transactions.  Patch by Simon, some editorialization by Tom.
2007-08-01 22:45:09 +00:00
Tom Lane ad4295728e Create a new dedicated Postgres process, "wal writer", which exists to write
and fsync WAL at convenient intervals.  For the moment it just tries to
offload this work from backends, but soon it will be responsible for
guaranteeing a maximum delay before asynchronously-committed transactions
will be flushed to disk.

This is a portion of Simon Riggs' async-commit patch, committed to CVS
separately because a background WAL writer seems like it might be a good idea
independently of the async-commit feature.  I rebased walwriter.c on
bgwriter.c because it seemed like a more appropriate way of handling signals;
while the startup/shutdown logic in postmaster.c is more like autovac because
we want walwriter to quit before we start the shutdown checkpoint.
2007-07-24 04:54:09 +00:00
Tom Lane 9fc25c0511 Improve logging of checkpoints. Patch by Greg Smith, worked over
by Heikki and a little bit by me.
2007-06-30 19:12:02 +00:00
Tom Lane 867e2c91a0 Implement "distributed" checkpoints in which the checkpoint I/O is spread
over a fairly long period of time, rather than being spat out in a burst.
This happens only for background checkpoints carried out by the bgwriter;
other cases, such as a shutdown checkpoint, are still done at full speed.

Remove the "all buffers" scan in the bgwriter, and associated stats
infrastructure, since this seems no longer very useful when the checkpoint
itself is properly throttled.

Original patch by Itagaki Takahiro, reworked by Heikki Linnakangas,
and some minor API editorialization by me.
2007-06-28 00:02:40 +00:00
Peter Eisentraut 7ce9b3683e Make some messages more consistent 2007-05-31 15:13:06 +00:00
Peter Eisentraut 71fb7b9014 Downgrade some low-level startup messages to DEBUG1. 2007-05-31 07:36:12 +00:00
Tom Lane d526575f89 Make large sequential scans and VACUUMs work in a limited-size "ring" of
buffers, rather than blowing out the whole shared-buffer arena.  Aside from
avoiding cache spoliation, this fixes the problem that VACUUM formerly tended
to cause a WAL flush for every page it modified, because we had it hacked to
use only a single buffer.  Those flushes will now occur only once per
ring-ful.  The exact ring size, and the threshold for seqscans to switch into
the ring usage pattern, remain under debate; but the infrastructure seems
done.  The key bit of infrastructure is a new optional BufferAccessStrategy
object that can be passed to ReadBuffer operations; this replaces the former
StrategyHintVacuum API.

This patch also changes the buffer usage-count methodology a bit: we now
advance usage_count when first pinning a buffer, rather than when last
unpinning it.  To preserve the behavior that a buffer's lifetime starts to
decrease when it's released, the clock sweep code is modified to not decrement
usage_count of pinned buffers.

Work not done in this commit: teach GiST and GIN indexes to use the vacuum
BufferAccessStrategy for vacuum-driven fetches.

Original patch by Simon, reworked by Heikki and again by Tom.
2007-05-30 20:12:03 +00:00
Tom Lane a8d539f124 To support external compression of archived WAL data, add a flag bit to
WAL records that shows whether it is safe to remove full-page images
(ie, whether or not an on-line backup was in progress when the WAL entry
was made).  Also make provision for an XLOG_NOOP record type that can be
used to fill in the extra space when decompressing the data for restore.

This is the portion of Koichi Suzuki's "full page writes" patch that
has to go into the core database.  The remainder of that work is two
external compression and decompression programs, which for the time being
will undergo separate development on pgfoundry.  Per discussion.

Also, twiddle the handling of BTREE_SPLIT records to ensure it'll be
possible to compress them (the previous coding caused essential info
to be omitted).  The other commonly-used record types seem OK already,
with the possible exception of GIN and GIST WAL records, which I don't
understand well enough to opine on.
2007-05-20 21:08:19 +00:00
Tom Lane c432061963 Change the timestamps recorded in transaction commit/abort xlog records
from time_t to TimestampTz representation.  This provides full gettimeofday()
resolution of the timestamps, which might be useful when attempting to
do point-in-time recovery --- previously it was not possible to specify
the stop point with sub-second resolution.  But mostly this is to get
rid of TimestampTz-to-time_t conversion overhead during commit.  Per my
proposal of a day or two back.
2007-04-30 21:01:53 +00:00