Commit Graph

1339 Commits

Author SHA1 Message Date
Tom Lane 6d6d14b6d5 Redefine IsTransactionState() to only return true for TRANS_INPROGRESS state,
which is the only state in which it's safe to initiate database queries.
It turns out that all but two of the callers thought that's what it meant;
and the other two were using it as a proxy for "will GetTopTransactionId()
return a nonzero XID"?  Since it was in fact an unreliable guide to that,
make those two just invoke GetTopTransactionId() always, then deal with a
zero result if they get one.
2007-06-07 21:45:59 +00:00
Teodor Sigaev f74426283d Move call of MarkBufferDirty() before XLogInsert() as required.
Many thanks to Heikki Linnakangas <heikki@enterprisedb.com> for his
sharp eyes.
2007-06-05 12:47:49 +00:00
Teodor Sigaev 853d1c3103 Fix bundle bugs of GIN:
- Fix possible deadlock between UPDATE and VACUUM queries. Bug never was
  observed in 8.2, but it still exist there. HEAD is more sensitive to
  bug after recent "ring" of buffer improvements.
- Fix WAL creation: if parent page is stored as is after split then
  incomplete split isn't removed during replay. This happens rather rare, only
  on large tables with a lot of updates/inserts.
- Fix WAL replay: there was wrong test of XLR_BKP_BLOCK_* for left
  page after deletion of page. That causes wrong rightlink field: it pointed
  to deleted page.
- add checking of match of clearing incomplete split
- cleanup incomplete split list after proceeding

All of this chages doesn't change on-disk storage, so backpatch...
But second point may be an issue for replaying logs from previous version.
2007-06-04 15:56:28 +00:00
Peter Eisentraut f4a3789b39 Clarify some error messages about duplicate things. 2007-06-03 22:16:03 +00:00
Tom Lane 1f559b7d3a Fix several hash functions that were taking chintzy shortcuts instead of
delivering a well-randomized hash value.  I got religion on this after
observing that performance of multi-batch hash join degrades terribly if the
higher-order bits of hash values aren't random, as indeed was true for say
hashes of small integer values.  It's now expected and documented that hash
functions should use hash_any or some comparable method to ensure that all
bits of their output are about equally random.

initdb forced because this change invalidates existing hash indexes.  For the
same reason, this isn't back-patchable; the hash join performance problem
will get a band-aid fix in the back branches.
2007-06-01 15:33:19 +00:00
Peter Eisentraut 7ce9b3683e Make some messages more consistent 2007-05-31 15:13:06 +00:00
Teodor Sigaev 54af876593 Replace ReadBuffer to ReadBufferWithStrategy in all vacuum-involved places
to implement limited-size "ring" of buffers for VACUUM for GIN & GIST
2007-05-31 14:03:09 +00:00
Peter Eisentraut 71fb7b9014 Downgrade some low-level startup messages to DEBUG1. 2007-05-31 07:36:12 +00:00
Tom Lane fa0e318f94 Fix overly-strict sanity check in BeginInternalSubTransaction that made it
fail when used in a deferred trigger.  Bug goes back to 8.0; no doubt the
reason it hadn't been noticed is that we've been discouraging use of
user-defined constraint triggers.  Per report from Frank van Vugt.
2007-05-30 21:01:39 +00:00
Tom Lane d526575f89 Make large sequential scans and VACUUMs work in a limited-size "ring" of
buffers, rather than blowing out the whole shared-buffer arena.  Aside from
avoiding cache spoliation, this fixes the problem that VACUUM formerly tended
to cause a WAL flush for every page it modified, because we had it hacked to
use only a single buffer.  Those flushes will now occur only once per
ring-ful.  The exact ring size, and the threshold for seqscans to switch into
the ring usage pattern, remain under debate; but the infrastructure seems
done.  The key bit of infrastructure is a new optional BufferAccessStrategy
object that can be passed to ReadBuffer operations; this replaces the former
StrategyHintVacuum API.

This patch also changes the buffer usage-count methodology a bit: we now
advance usage_count when first pinning a buffer, rather than when last
unpinning it.  To preserve the behavior that a buffer's lifetime starts to
decrease when it's released, the clock sweep code is modified to not decrement
usage_count of pinned buffers.

Work not done in this commit: teach GiST and GIN indexes to use the vacuum
BufferAccessStrategy for vacuum-driven fetches.

Original patch by Simon, reworked by Heikki and again by Tom.
2007-05-30 20:12:03 +00:00
Tom Lane 77947c51c0 Fix up pgstats counting of live and dead tuples to recognize that committed
and aborted transactions have different effects; also teach it not to assume
that prepared transactions are always committed.

Along the way, simplify the pgstats API by tying counting directly to
Relations; I cannot detect any redeeming social value in having stats
pointers in HeapScanDesc and IndexScanDesc structures.  And fix a few
corner cases in which counts might be missed because the relation's
pgstat_info pointer hadn't been set.
2007-05-27 03:50:39 +00:00
Tom Lane a8d539f124 To support external compression of archived WAL data, add a flag bit to
WAL records that shows whether it is safe to remove full-page images
(ie, whether or not an on-line backup was in progress when the WAL entry
was made).  Also make provision for an XLOG_NOOP record type that can be
used to fill in the extra space when decompressing the data for restore.

This is the portion of Koichi Suzuki's "full page writes" patch that
has to go into the core database.  The remainder of that work is two
external compression and decompression programs, which for the time being
will undergo separate development on pgfoundry.  Per discussion.

Also, twiddle the handling of BTREE_SPLIT records to ensure it'll be
possible to compress them (the previous coding caused essential info
to be omitted).  The other commonly-used record types seem OK already,
with the possible exception of GIN and GIST WAL records, which I don't
understand well enough to opine on.
2007-05-20 21:08:19 +00:00
Alvaro Herrera 3b0347b36e Move the tuple freezing point in CLUSTER to a point further back in the past,
to avoid losing useful Xid information in not-so-old tuples.  This makes
CLUSTER behave the same as VACUUM as far a tuple-freezing behavior goes
(though CLUSTER does not yet advance the table's relfrozenxid).

While at it, move the actual freezing operation in rewriteheap.c to a more
appropriate place, and document it thoroughly.  This part of the patch from
Tom Lane.
2007-05-17 15:28:29 +00:00
Alvaro Herrera dfed0012bc Have the rewriteheap code freeze old tuples. This is safe because it is only
applied to live tuples older than a recent Xmin, not to tuples that may be part
of an update chain.  Those still keep their original markings.

This patch makes it possible for CLUSTER to advance relfrozenxid, thus avoiding
the need of vacuuming the table for Xid wraparound purposes.  That will be
patched separately.

Patch from Heikki Linnakangas.
2007-05-16 16:36:56 +00:00
Tom Lane 0fef38da21 Tweak hash index AM to use the new ReadOrZeroBuffer bufmgr API when fetching
pages it intends to zero immediately.  Just to show there is some use for that
function besides WAL recovery :-).
Along the way, fold _hash_checkpage and _hash_pageinit calls into _hash_getbuf
and friends, instead of expecting callers to do that separately.
2007-05-03 16:45:58 +00:00
Tom Lane 8c3cc86e7b During WAL recovery, when reading a page that we intend to overwrite completely
from the WAL data, don't bother to physically read it; just have bufmgr.c
return a zeroed-out buffer instead.  This speeds recovery significantly,
and also avoids unnecessary failures when a page-to-be-overwritten has corrupt
page headers on disk.  This replaces a former kluge that accomplished the
latter by pretending zero_damaged_pages was always ON during WAL recovery;
which was OK when the kluge was put in, but is unsafe when restoring a WAL
log that was written with full_page_writes off.

Heikki Linnakangas
2007-05-02 23:18:03 +00:00
Tom Lane c432061963 Change the timestamps recorded in transaction commit/abort xlog records
from time_t to TimestampTz representation.  This provides full gettimeofday()
resolution of the timestamps, which might be useful when attempting to
do point-in-time recovery --- previously it was not possible to specify
the stop point with sub-second resolution.  But mostly this is to get
rid of TimestampTz-to-time_t conversion overhead during commit.  Per my
proposal of a day or two back.
2007-04-30 21:01:53 +00:00
Tom Lane 957d08c81f Implement rate-limiting logic on how often backends will attempt to send
messages to the stats collector.  This avoids the problem that enabling
stats_row_level for autovacuum has a significant overhead for short
read-only transactions, as noted by Arjen van der Meijden.  We can avoid
an extra gettimeofday call by piggybacking on the one done for WAL-logging
xact commit or abort (although that doesn't help read-only transactions,
since they don't WAL-log anything).

In my proposal for this, I noted that we could change the WAL log entries
for commit/abort to record full TimestampTz precision, instead of only
time_t as at present.  That's not done in this patch, but will be committed
separately.
2007-04-30 03:23:49 +00:00
Tom Lane a2e923a652 Fix dynahash.c to suppress hash bucket splits while a hash_seq_search() scan
is in progress on the same hashtable.  This seems the least invasive way to
fix the recently-recognized problem that a split could cause the scan to
visit entries twice or (with much lower probability) miss them entirely.
The only field-reported problem caused by this is the "failed to re-find
shared lock object" PANIC in COMMIT PREPARED reported by Michel Dorochevsky,
which was caused by multiply visited entries.  However, it seems certain
that mdsync() is vulnerable to missing required fsync's due to missed
entries, and I am fearful that RelationCacheInitializePhase2() might be at
risk as well.  Because of that and the generalized hazard presented by this
bug, back-patch all the supported branches.

Along the way, fix pg_prepared_statement() and pg_cursor() to not assume
that the hashtables they are examining will stay static between calls.
This is risky regardless of the newly noted dynahash problem, because
hash_seq_search() has never promised to cope with deletion of table entries
other than the just-returned one.  There may be no bug here because the only
supported way to call these functions is via ExecMakeTableFunctionResult()
which will cycle them to completion before doing anything very interesting,
but it seems best to get rid of the assumption.  This affects 8.2 and HEAD
only, since those functions weren't there earlier.
2007-04-26 23:24:46 +00:00
Tom Lane 9d37c038fc Repair PANIC condition in hash indexes when a previous index extension attempt
failed (due to lock conflicts or out-of-space).  We might have already
extended the index's filesystem EOF before failing, causing the EOF to be
beyond what the metapage says is the last used page.  Hence the invariant
maintained by the code needs to be "EOF is at or beyond last used page",
not "EOF is exactly the last used page".  Problem was created by my patch
of 2006-11-19 that attempted to repair bug #2737.  Since that was
back-patched to 7.4, this needs to be as well.  Per report and test case
from Vlastimil Krejcir.
2007-04-19 20:24:04 +00:00
Tom Lane 836feeda9c Fix condition for whether end_heap_rewrite must fsync, per Heikki. 2007-04-17 21:29:31 +00:00
Tom Lane 4942ee656a Don't assume rd_smgr stays open across all of a rewriteheap operation;
doing so can result in crash if an sinval reset occurs meanwhile.
I believe this explains intermittent buildfarm failures in cluster test.
2007-04-17 20:49:39 +00:00
Tom Lane 226a100568 Code review for btree page split WAL reduction patch. Make it actually work
(original code *always* created a full-page image for the left page, thus
leaving the intended savings unrealized), avoid risk of not having enough room
on the page during xlog restore, squeeze out another couple bytes in the xlog
record, clean up neglected comments.
2007-04-11 20:47:38 +00:00
Tom Lane 56218fbc48 Minor tweaking of index special-space definitions so that the various
index types can be reliably distinguished by examining the special space
on an index page.  Per my earlier proposal, plus the realization that
there's no need for btree's vacuum cycle ID to cycle through every possible
16-bit value.  Restricting its range a little costs nearly nothing and
eliminates the possibility of collisions.
Memo to self: remember to make bitmap indexes play along with this scheme,
assuming that patch ever gets accepted.
2007-04-09 22:04:08 +00:00
Tom Lane 7b78474da3 Make CLUSTER MVCC-safe. Heikki Linnakangas 2007-04-08 01:26:33 +00:00
Tom Lane f02a82b6ad Make 'col IS NULL' clauses be indexable conditions.
Teodor Sigaev, with some kibitzing from Tom Lane.
2007-04-06 22:33:43 +00:00
Tom Lane 3e23b68dac Support varlena fields with single-byte headers and unaligned storage.
This commit breaks any code that assumes that the mere act of forming a tuple
(without writing it to disk) does not "toast" any fields.  While all available
regression tests pass, I'm not totally sure that we've fixed every nook and
cranny, especially in contrib.

Greg Stark with some help from Tom Lane
2007-04-06 04:21:44 +00:00
Tom Lane 9c9b619473 Remove the CheckpointStartLock in favor of having backends show whether they
are in their commit critical sections via flags in the ProcArray.  Checkpoint
can watch the ProcArray to determine when it's safe to proceed.  This is
a considerably better solution to the original problem of race conditions
between checkpoint and transaction commit: it speeds up commit, since there's
one less lock to fool with, and it prevents the problem of checkpoint being
delayed indefinitely when there's a constant flow of commits.  Heikki, with
some kibitzing from Tom.
2007-04-03 16:34:36 +00:00
Tom Lane b3005276eb Decouple the values of TOAST_TUPLE_THRESHOLD and TOAST_MAX_CHUNK_SIZE.
Add the latter to the values checked in pg_control, since it can't be changed
without invalidating toast table content.  This commit in itself shouldn't
change any behavior, but it lays some necessary groundwork for experimentation
with these toast-control numbers.

Note: while TOAST_TUPLE_THRESHOLD can now be changed without initdb, some
thought still needs to be given to needs_toast_table() in toasting.c before
unleashing random changes.
2007-04-03 04:14:26 +00:00
Tom Lane 57690c6803 Support enum data types. Along the way, use macros for the values of
pg_type.typtype whereever practical.  Tom Dunstan, with some kibitzing
from Tom Lane.
2007-04-02 03:49:42 +00:00
Tom Lane 8875d0987d Fix oversight in coding of _bt_start_vacuum: we can't assume that the LWLock
will be released by transaction abort before _bt_end_vacuum gets called.
If either of these "can't happen" errors actually happened, we'd freeze up
trying to acquire an already-held lock.  Latest word is that this does
not explain Martin Pitt's trouble report, but it still looks like a bug.
2007-03-30 00:12:59 +00:00
Tom Lane fba8113c1b Teach CLUSTER to skip writing WAL if not needed (ie, not using archiving)
--- Simon.
Also, code review and cleanup for the previous COPY-no-WAL patches --- Tom.
2007-03-29 00:15:39 +00:00
Tom Lane e85a01df67 Clean up the representation of special snapshots by including a "method
pointer" in every Snapshot struct.  This allows removal of the case-by-case
tests in HeapTupleSatisfiesVisibility, which should make it a bit faster
(I didn't try any performance tests though).  More importantly, we are no
longer violating portable C practices by assuming that small integers are
distinct from all pointer values, and HeapTupleSatisfiesDirty no longer
has a non-reentrant API involving side-effects on a global variable.

There were a couple of places calling HeapTupleSatisfiesXXX routines
directly rather than through the HeapTupleSatisfiesVisibility macro.
Since these places had to be changed anyway, I chose to make them go
through the macro for uniformity.

Along the way I renamed HeapTupleSatisfiesSnapshot to HeapTupleSatisfiesMVCC
to emphasize that it's only used with MVCC-type snapshots.  I was sorely
tempted to rename HeapTupleSatisfiesVisibility to HeapTupleSatisfiesSnapshot,
but forebore for the moment to avoid confusion and reduce the likelihood that
this patch breaks some of the pending patches.  Might want to reconsider
doing that later.
2007-03-25 19:45:14 +00:00
Tom Lane 4f896dac17 Arrange for PreventTransactionChain to reject commands submitted as part
of a multi-statement simple-Query message.  This bug goes all the way
back, but unfortunately is not nearly so easy to fix in existing releases;
it is only the recent ProcessUtility API change that makes it fixable in
HEAD.  Per report from William Garrison.
2007-03-22 19:55:04 +00:00
Peter Eisentraut f4ee82e3d3 Reverted waiting for further fixes:
Make configuration parameters fall back to their default values when they
are removed from the configuration file.

Joachim Wieland
2007-03-13 14:32:25 +00:00
Tom Lane b9527e9840 First phase of plan-invalidation project: create a plan cache management
module and teach PREPARE and protocol-level prepared statements to use it.
In service of this, rearrange utility-statement processing so that parse
analysis does not assume table schemas can't change before execution for
utility statements (necessary because we don't attempt to re-acquire locks
for utility statements when reusing a stored plan).  This requires some
refactoring of the ProcessUtility API, but it ends up cleaner anyway,
for instance we can get rid of the QueryContext global.

Still to do: fix up SPI and related code to use the plan cache; I'm tempted to
try to make SQL functions use it too.  Also, there are at least some aspects
of system state that we want to ensure remain the same during a replan as in
the original processing; search_path certainly ought to behave that way for
instance, and perhaps there are others.
2007-03-13 00:33:44 +00:00
Peter Eisentraut f84308f195 Make configuration parameters fall back to their default values when they
are removed from the configuration file.

Joachim Wieland
2007-03-12 22:09:28 +00:00
Neil Conway e1d8deb918 Fix a typo in a comment. Heikki Linnakangas. 2007-03-05 14:13:12 +00:00
Bruce Momjian bc292937ae Split _bt_insertonpg to two functions.
Heikki Linnakangas
2007-03-03 20:13:06 +00:00
Bruce Momjian ae35867a39 Remove undo information from pg_controldata --- never used.
Florian G. Pflug
2007-03-03 20:02:27 +00:00
Tom Lane 234a02b2a8 Replace direct assignments to VARATT_SIZEP(x) with SET_VARSIZE(x, len).
Get rid of VARATT_SIZE and VARATT_DATA, which were simply redundant with
VARSIZE and VARDATA, and as a consequence almost no code was using the
longer names.  Rename the length fields of struct varlena and various
derived structures to catch anyplace that was accessing them directly;
and clean up various places so caught.  In itself this patch doesn't
change any behavior at all, but it is necessary infrastructure if we hope
to play any games with the representation of varlena headers.
Greg Stark and Tom Lane
2007-02-27 23:48:10 +00:00
Bruce Momjian 6f519ad01c btree source code cleanups:
I refactored findsplitloc and checksplitloc so that the division of
labor is more clear IMO. I pushed all the space calculation inside the
loop to checksplitloc.

I also fixed the off by 4 in free space calculation caused by
PageGetFreeSpace subtracting sizeof(ItemIdData), even though it was
harmless, because it was distracting and I felt it might come back to
bite us in the future if we change the page layout or alignments.
There's now a new function PageGetExactFreeSpace that doesn't do the
subtraction.

findsplitloc now tries the "just the new item to right page" split as
well. If people don't like the refactoring, I can write a patch to just
add that.

Heikki Linnakangas
2007-02-21 20:02:17 +00:00
Alvaro Herrera 1820650934 Restructure autovacuum in two processes: a dummy process, which runs
continuously, and requests vacuum runs of "autovacuum workers" to postmaster.
The workers do the actual vacuum work.  This allows for future improvements,
like allowing multiple autovacuum jobs running in parallel.

For now, the code keeps the original behavior of having a single autovac
process at any time by sleeping until the previous worker has finished.
2007-02-15 23:23:23 +00:00
Bruce Momjian a9eb53969a Move fsync method macro defines into /include/access/xlogdefs.h so they
can be used by src/tools/fsync/test_fsync.c.
2007-02-14 05:00:40 +00:00
Tom Lane caf2b64a75 Disallow committing a prepared transaction unless we are in the same database
it was executed in.  Someday it might be nice to allow cross-DB commits, but
work would be needed in NOTIFY and perhaps other places.  Per Heikki.
2007-02-13 19:39:42 +00:00
Peter Eisentraut c138b966d4 Replace useless uses of := by = in makefiles. 2007-02-09 15:56:00 +00:00
Tom Lane c398300330 Combine cmin and cmax fields of HeapTupleHeaders into a single field, by
keeping private state in each backend that has inserted and deleted the same
tuple during its current top-level transaction.  This is sufficient since
there is no need to be able to determine the cmin/cmax from any other
transaction.  This gets us back down to 23-byte headers, removing a penalty
paid in 8.0 to support subtransactions.  Patch by Heikki Linnakangas, with
minor revisions by moi, following a design hashed out awhile back on the
pghackers list.
2007-02-09 03:35:35 +00:00
Alvaro Herrera f8ebab901b Fix reference-after-free in the new btree page split code, as reported by
the buildfarm via Stefan Kaltenbrunner.

Patch from Heikki Linnakangas.
2007-02-08 13:52:55 +00:00
Peter Eisentraut 086c189456 Normalize fgets() calls to use sizeof() for calculating the buffer size
where possible, and fix some sites that apparently thought that fgets()
will overwrite the buffer by one byte.

Also add some strlcpy() to eliminate some weird memory handling.
2007-02-08 11:10:27 +00:00
Bruce Momjian b79575ce45 Reduce WAL activity for page splits:
> Currently, an index split writes all the data on the split page to
> WAL. That's a lot of WAL traffic. The tuples that are copied to the
> right page need to be WAL logged, but the tuples that stay on the
> original page don't.

Heikki Linnakangas
2007-02-08 05:05:53 +00:00
Tom Lane aec4cf1c8c Add a function pg_stat_clear_snapshot() that discards any statistics snapshot
already collected in the current transaction; this allows plpgsql functions to
watch for stats updates even though they are confined to a single transaction.
Use this instead of the previous kluge involving pg_stat_file() to wait for
the stats collector to update in the stats regression test.  Internally,
decouple storage of stats snapshots from transaction boundaries; they'll
now stick around until someone calls pgstat_clear_snapshot --- which xact.c
still does at transaction end, to maintain the previous behavior.  This makes
the logic a lot cleaner, at the price of a couple dozen cycles per transaction
exit.
2007-02-07 23:11:30 +00:00
Tom Lane 78d1216160 Remove the xlog-centric "database system is ready" message and replace it with
"database system is ready to accept connections", which is issued by the
postmaster when it really is ready to accept connections.  Per proposal from
Markus Schiltknecht and subsequent discussion.
2007-02-07 16:44:48 +00:00
Tom Lane c76ed81513 Remove some dead code, per Heikki. 2007-02-06 14:55:11 +00:00
Tom Lane 23c4978e6c Rename MaxTupleSize to MaxHeapTupleSize to clarify that it's not meant to
describe the maximum size of index tuples (which is typically AM-dependent
anyway); and consequently remove the bogus deduction for "special space"
that was built into it.

Adjust TOAST_TUPLE_THRESHOLD and TOAST_MAX_CHUNK_SIZE to avoid wasting two
bytes per toast chunk, and to ensure that the calculation correctly tracks any
future changes in page header size.  The computation had been inaccurate in a
way that didn't cause any harm except space wastage, but future changes could
have broken it more drastically.

Fix the calculation of BTMaxItemSize, which was formerly computed as 1 byte
more than it could safely be.  This didn't cause any harm in practice because
it's only compared against maxalign'd lengths, but future changes in the size
of page headers or btree special space could have exposed the problem.

initdb forced because of change in TOAST_MAX_CHUNK_SIZE, which alters the
storage of toast tables.
2007-02-05 04:22:18 +00:00
Tom Lane a2e092e1c7 Don't MAXALIGN in the checks to decide whether a tuple is over TOAST's
threshold for tuple length.  On 4-byte-MAXALIGN machines, the toast code
creates tuples that have t_len exactly TOAST_TUPLE_THRESHOLD ... but this
number is not itself maxaligned, so if heap_insert maxaligns t_len before
comparing to TOAST_TUPLE_THRESHOLD, it'll uselessly recurse back to
tuptoaster.c, wasting cycles.  (It turns out that this does not happen on
8-byte-MAXALIGN machines, because for them the outer MAXALIGN in the
TOAST_MAX_CHUNK_SIZE macro reduces TOAST_MAX_CHUNK_SIZE so that toast tuples
will be less than TOAST_TUPLE_THRESHOLD in size.  That MAXALIGN is really
incorrect, but we can't remove it now, see below.)  There isn't any particular
value in maxaligning before comparing to the thresholds, so just don't do
that, which saves a small number of cycles in itself.

These numbers should be rejiggered to minimize wasted space on toast-relation
pages, but we can't do that in the back branches because changing
TOAST_MAX_CHUNK_SIZE would force an initdb (by changing the contents of toast
tables).  We can move the toast decision thresholds a bit, though, which is
what this patch effectively does.

Thanks to Pavan Deolasee for discovering the unintended recursion.

Back-patch into 8.2, but not further, pending more testing.  (HEAD is about
to get a further patch modifying the thresholds, so it won't help much
for testing this form of the patch.)
2007-02-04 20:00:37 +00:00
Bruce Momjian 8b4ff8b6a1 Wording cleanup for error messages. Also change can't -> cannot.
Standard English uses "may", "can", and "might" in different ways:

        may - permission, "You may borrow my rake."

        can - ability, "I can lift that log."

        might - possibility, "It might rain today."

Unfortunately, in conversational English, their use is often mixed, as
in, "You may use this variable to do X", when in fact, "can" is a better
choice.  Similarly, "It may crash" is better stated, "It might crash".
2007-02-01 19:10:30 +00:00
Neil Conway dbcaee49b5 Fix a few typos in comments in GiN. 2007-02-01 04:16:08 +00:00
Teodor Sigaev d4c6da1527 Allow GIN's extractQuery method to signal that nothing can satisfy the query.
In this case extractQuery should returns -1 as nentries. This changes
prototype of extractQuery method to use int32* instead of uint32* for
nentries argument.
Based on that gincostestimate may see two corner cases: nothing will be found
or seqscan should be used.

Per proposal at http://archives.postgresql.org/pgsql-hackers/2007-01/msg01581.php

PS tsearch_core patch should be sightly modified to support changes, but I'm
waiting a verdict about reviewing of tsearch_core patch.
2007-01-31 15:09:45 +00:00
Tom Lane a635c08fa1 Add support for cross-type hashing in hash index searches and hash joins.
Hashing for aggregation purposes still needs work, so it's not time to
mark any cross-type operators as hashable for general use, but these cases
work if the operators are so marked by hand in the system catalogs.
2007-01-30 01:33:36 +00:00
Tom Lane e8cd6f14a2 Add comment noting that hashm_procid in a hash index's metapage isn't
actually used for anything.
2007-01-29 23:22:59 +00:00
Tom Lane 6cefacd7c8 Correct an old logic error in btree page splitting: when considering a split
exactly at the point where we need to insert a new item, the calculation used
the wrong size for the "high key" of the new left page.  This could lead to
choosing an unworkable split, resulting in "PANIC: failed to add item to the
left sibling" (or "right sibling") failure.  Although this bug has been there
a long time, it's very difficult to trigger a failure before 8.2, since there
was generally a lot of free space on both sides of a chosen split.  In 8.2,
where the user-selected fill factor determines how much free space the code
tries to leave, an unworkable split is much more likely.  Report by Joe
Conway, diagnosis and fix by Heikki Linnakangas.
2007-01-27 20:53:30 +00:00
Bruce Momjian ef65f6f7a4 Prevent WAL logging when COPY is done in the same transation that
created it.

Simon Riggs
2007-01-25 02:17:26 +00:00
Neil Conway 2b7334d487 Refactor the index AM API slightly: move currentItemData and
currentMarkData from IndexScanDesc to the opaque structs for the
AMs that need this information (currently gist and hash).

Patch from Heikki Linnakangas, fixes by Neil Conway.
2007-01-20 18:43:35 +00:00
Peter Eisentraut 2cc01004c6 Remove remains of old depend target. 2007-01-20 17:16:17 +00:00
Alvaro Herrera eb63cc3da8 Arrange for autovacuum to be killed when another operation wants to be alone
accessing it, like DROP DATABASE.  This allows the regression tests to pass
with autovacuum enabled, which open the gates for finally enabling autovacuum
by default.
2007-01-16 13:28:57 +00:00
Tom Lane d83235415b Add some notes about the basic mathematical laws that the system presumes
hold true for operators in a btree operator family.  This is mostly to
clarify my own thinking about what the planner can assume for optimization
purposes.  (blowing dust off an old abstract-algebra textbook...)
2007-01-12 17:04:54 +00:00
Bruce Momjian 40f797be03 Enable another five tuple status bits by using the high bits of the
nattr field, and rename the field.

Heikki Linnakangas
2007-01-09 22:01:00 +00:00
Tom Lane 69db009163 Add a citation to Seltzer and Yigit's Usenix '91 paper about hash table
management.  The paper clearly describes many of the ideas embodied in
our current hashing code, but as far as I could find out there is not
a direct code heritage.  (Mike Olsen recalls discussion of this paper
at Postgres meetings but believes it "informed the Postgres implementation
probably just at the design level".  Margo herself says she wasn't
involved with Postgres' hash code.)  Credit where credit is due 'n all
that, even if fifteen years after the fact.
2007-01-09 07:30:49 +00:00
Tom Lane 4431758229 Support ORDER BY ... NULLS FIRST/LAST, and add ASC/DESC/NULLS FIRST/NULLS LAST
per-column options for btree indexes.  The planner's support for this is still
pretty rudimentary; it does not yet know how to plan mergejoins with
nondefault ordering options.  The documentation is pretty rudimentary, too.
I'll work on improving that stuff later.

Note incompatible change from prior behavior: ORDER BY ... USING will now be
rejected if the operator is not a less-than or greater-than member of some
btree opclass.  This prevents less-than-sane behavior if an operator that
doesn't actually define a proper sort ordering is selected.
2007-01-09 02:14:16 +00:00
Bruce Momjian 29dccf5fe0 Update CVS HEAD for 2007 copyright. Back branches are typically not
back-stamped for this.
2007-01-05 22:20:05 +00:00
Tom Lane 7c8927bf08 Fix some small typos in comments. Greg Stark 2007-01-04 16:29:42 +00:00
Tom Lane ef07221997 Clean up smgr.c/md.c APIs as per discussion a couple months ago. Instead of
having md.c return a success/failure boolean to smgr.c, which was just going
to elog anyway, let md.c issue the elog messages itself.  This allows better
error reporting, particularly in cases such as "short read" or "short write"
which Peter was complaining of.  Also, remove the kluge of allowing mdread()
to return zeroes from a read-beyond-EOF: this is now an error condition
except when InRecovery or zero_damaged_pages = true.  (Hash indexes used to
require that behavior, but no more.)  Also, enforce that mdwrite() is to be
used for rewriting existing blocks while mdextend() is to be used for
extending the relation EOF.  This restriction lets us get rid of the old
ad-hoc defense against creating huge files by an accidental reference to
a bogus block number: we'll only create new segments in mdextend() not
mdwrite() or mdread().  (Again, when InRecovery we allow it anyway, since
we need to allow updates of blocks that were later truncated away.)
Also, clean up the original makeshift patch for bug #2737: move the
responsibility for padding relation segments to full length into md.c.
2007-01-03 18:11:01 +00:00
Tom Lane 5725b9d9af Support type modifiers for user-defined types, and pull most knowledge
about typmod representation for standard types out into type-specific
typmod I/O functions.  Teodor Sigaev, with some editorialization by
Tom Lane.
2006-12-30 21:21:56 +00:00
Tom Lane 9aefd56669 Fix up btree's initial scankey processing to be able to detect redundant
or contradictory keys even in cross-data-type scenarios.  This is another
benefit of the opfamily rewrite: we can find the needed comparison
operators now.
2006-12-28 23:16:39 +00:00
Tom Lane a78fcfb512 Restructure operator classes to allow improved handling of cross-data-type
cases.  Operator classes now exist within "operator families".  While most
families are equivalent to a single class, related classes can be grouped
into one family to represent the fact that they are semantically compatible.
Cross-type operators are now naturally adjunct parts of a family, without
having to wedge them into a particular opclass as we had done originally.

This commit restructures the catalogs and cleans up enough of the fallout so
that everything still works at least as well as before, but most of the work
needed to actually improve the planner's behavior will come later.  Also,
there are not yet CREATE/DROP/ALTER OPERATOR FAMILY commands; the only way
to create a new family right now is to allow CREATE OPERATOR CLASS to make
one by default.  I owe some more documentation work, too.  But that can all
be done in smaller pieces once this infrastructure is in place.
2006-12-23 00:43:13 +00:00
Tom Lane 0cb91ccba9 Remove the logId/logSeg fields from pg_control, because they are not needed
in normal operation, and we can avoid rewriting pg_control at every log
segment switch if we don't insist that these values be valid.  Reducing
the number of pg_control updates is a good idea for both performance and
reliability.  It does make pg_resetxlog's life a bit harder, but that seems
a good tradeoff; and anyway the change to pg_resetxlog amounts to automating
something people formerly needed to do by hand, namely look at the existing
pg_xlog files to make sure the new WAL start point was past them.

In passing, change the wording of xlog.c's "database system was interrupted"
messages: describe the pg_control timestamp as "last known up at" rather than
implying it is the exact time of service interruption.  With this change the
timestamp will generally be the time of the last checkpoint, which could be
many minutes before the failure; and we've already seen indications that
people tend to misinterpret the old wording.

initdb forced due to change in pg_control layout.  Simon Riggs and Tom Lane
2006-12-08 19:50:53 +00:00
Neil Conway 886a02d1cb Add a txn_start column to pg_stat_activity. This makes it easier to
identify long-running transactions. Since we already need to record
the transaction-start time (e.g. for now()), we don't need any
additional system calls to report this information.

Catversion bumped, initdb required.
2006-12-06 18:06:48 +00:00
Tom Lane 5f60086e10 Minor adjustments to make failures in startup/shutdown behave more cleanly.
StartupXLOG and ShutdownXLOG no longer need to be critical sections, because
in all contexts where they are invoked, elog(ERROR) would be translated to
elog(FATAL) anyway.  (One change in bgwriter.c is needed to make this true:
set ExitOnAnyError before trying to exit.  This is a good fix anyway since
the existing code would have gone into an infinite loop on elog(ERROR) during
shutdown.)  That avoids a misleading report of PANIC during semi-orderly
failures.  Modify the postmaster to include the startup process in the set of
processes that get SIGTERM when a fast shutdown is requested, and also fix it
to not try to restart the bgwriter if the bgwriter fails while trying to write
the shutdown checkpoint.  Net result is that "pg_ctl stop -m fast" does
something reasonable for a system in warm standby mode, and so should Unix
system shutdown (ie, universal SIGTERM).  Per gripe from Stephen Harris and
some corner-case testing of my own.
2006-11-30 18:29:12 +00:00
Teodor Sigaev ef148d6b85 Fix bug with page deletion. If inner page is removed and it tries to
remove page on next level linked from next inner page, ginScanToDelete()
wrongly sets parent page. Bug reveals when many item pointers from index
was deleted ( several hundred thousands).

Bug is discovered by hubert depesz lubaczewski <depesz@gmail.com>

Suppose, we need rc2 before release...
2006-11-30 16:22:32 +00:00
Neil Conway 546d6848ca Add a comment noting that heap_copytuple_with_tuple() results in a
HeapTuple that is no longer allocated as a single palloc() block; if
used carelessly, this might result in a subsequent memory leak after
heap_freetuple().
2006-11-23 05:27:18 +00:00
Tom Lane 395249ecbe Several changes to reduce the probability of running out of memory during
AbortTransaction, which would lead to recursion and eventual PANIC exit
as illustrated in recent report from Jeff Davis.  First, in xact.c create
a special dedicated memory context for AbortTransaction to run in.  This
solves the problem as long as AbortTransaction doesn't need more than 32K
(or whatever other size we create the context with).  But in corner cases
it might.  Second, in trigger.c arrange to keep pending after-trigger event
records in separate contexts that can be freed near the beginning of
AbortTransaction, rather than having them persist until CleanupTransaction
as before.  Third, in portalmem.c arrange to free executor state data
earlier as well.  These two changes should result in backing off the
out-of-memory condition before AbortTransaction needs any significant
amount of memory, at least in typical cases such as memory overrun due
to too many trigger events or too big an executor hash table.  And all
the same for subtransaction abort too, of course.
2006-11-23 01:14:59 +00:00
Tom Lane 3ad0728c81 On systems that have setsid(2) (which should be just about everything except
Windows), arrange for each postmaster child process to be its own process
group leader, and deliver signals SIGINT, SIGTERM, SIGQUIT to the whole
process group not only the direct child process.  This provides saner behavior
for archive and recovery scripts; in particular, it's possible to shut down a
warm-standby recovery server using "pg_ctl stop -m immediate", since delivery
of SIGQUIT to the startup subprocess will result in killing the waiting
recovery_command.  Also, this makes Query Cancel and statement_timeout apply
to scripts being run from backends via system().  (There is no support in the
core backend for that, but it's widely done using untrusted PLs.)  Per gripe
from Stephen Harris and subsequent discussion.
2006-11-21 20:59:53 +00:00
Tom Lane d68efb3f8d Repair problems with hash indexes that span multiple segments: the hash code's
preference for filling pages out-of-order tends to confuse the sanity checks
in md.c, as per report from Balazs Nagy in bug #2737.  The fix is to ensure
that the smgr-level code always has the same idea of the logical EOF as the
hash index code does, by using ReadBuffer(P_NEW) where we are adding a single
page to the end of the index, and using smgrextend() to reserve a large batch
of pages when creating a new splitpoint.  The patch is a bit ugly because it
avoids making any changes in md.c, which seems the most prudent approach for a
backpatchable beta-period fix.  After 8.3 development opens, I'll take a look
at a cleaner but more invasive patch, in particular getting rid of the now
unnecessary hack to allow reading beyond EOF in mdread().

Backpatch as far as 7.4.  The bug likely exists in 7.3 as well, but because
of the magnitude of the 7.3-to-7.4 changes in hash, the later-version patch
doesn't even begin to apply.  Given the other known bugs in the 7.3-era hash
code, it does not seem worth trying to develop a separate patch for 7.3.
2006-11-19 21:33:23 +00:00
Tom Lane 4f335a3d7f Repair two related errors in heap_lock_tuple: it was failing to recognize
cases where we already hold the desired lock "indirectly", either via
membership in a MultiXact or because the lock was originally taken by a
different subtransaction of the current transaction.  These cases must be
accounted for to avoid needless deadlocks and/or inappropriate replacement of
an exclusive lock with a shared lock.  Per report from Clarence Gardner and
subsequent investigation.
2006-11-17 18:00:15 +00:00
Peter Eisentraut e138b80996 String fix 2006-11-16 14:28:41 +00:00
Neil Conway dc10387eb1 Fix some typos in comments. 2006-11-12 06:55:54 +00:00
Tom Lane a46ca619f8 Suppress a few 'uninitialized variable' warnings that gcc emits only at
-O3 or higher (presumably because it inlines more things).  Per gripe
from Mark Mielke.
2006-11-11 01:14:19 +00:00
Tom Lane 792d6edd5b Clean up some misleading references to %p being a full path, per Simon. 2006-11-10 22:32:20 +00:00
Tom Lane dcbdf9b1d4 Change Windows rename and unlink substitutes so that they time out after
30 seconds instead of retrying forever.  Also modify xlog.c so that if
it fails to rename an old xlog segment up to a future slot, it will
unlink the segment instead.  Per discussion of bug #2712, in which it
became apparent that Windows can handle unlinking a file that's being
held open, but not renaming it.
2006-11-08 20:12:05 +00:00
Tom Lane 48188e1621 Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios.  We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases.  Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId.  Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done.  Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database.  initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs.  Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 22:42:10 +00:00
Tom Lane 70ce5c9082 Fix "failed to re-find parent key" btree VACUUM failure by revising page
deletion code to avoid the case where an upper-level btree page remains "half
dead" for a significant period of time, and to block insertions into a key
range that is in process of being re-assigned to the right sibling of the
deleted page's parent.  This prevents the scenario reported by Ed L. wherein
index keys could become out-of-order in the grandparent index level.

Since this is a moderately invasive fix, I'm applying it only to HEAD.
The bug exists back to 7.4, but the back branches will get a different patch.
2006-11-01 19:43:17 +00:00
Tom Lane 1e758d5263 Add some code to CREATE DATABASE to check for pre-existing subdirectories
that conflict with the OID that we want to use for the new database.
This avoids the risk of trying to remove files that maybe we shouldn't
remove.  Per gripe from Jon Lapham and subsequent discussion of 27-Sep.
2006-10-18 22:44:12 +00:00
Peter Eisentraut b9b4f10b5b Message style improvements 2006-10-06 17:14:01 +00:00
Tom Lane 378c79dc78 Cleanup for pglz_compress code: remove dead code, const-ify API of
remaining functions, simplify pglz_compress's API to not require a useless
data copy when compression fails.  Also add a check in pglz_decompress that
the expected amount of data was decompressed.
2006-10-05 23:33:33 +00:00
Tom Lane e378f82e00 Make use of qsort_arg in several places that were formerly using klugy
static variables.  This avoids any risk of potential non-reentrancy,
and in particular offers a much cleaner workaround for the Intel compiler
bug that was affecting ginutil.c.
2006-10-05 17:57:40 +00:00
Bruce Momjian f99a569a2e pgindent run for 8.2. 2006-10-04 00:30:14 +00:00
Bruce Momjian 45c8ed96b9 Make some sentences consistent with similar ones.
Euler Taveira de Oliveira
2006-10-03 21:21:36 +00:00
Alvaro Herrera 4650c4fdb9 Degrade the transaction-id wraparound point message from LOG to DEBUG1, per
discussion.

Patch from Simon Riggs.
2006-09-26 17:21:39 +00:00
Tom Lane 9e936693a9 Fix free space map to correctly track the total amount of FSM space needed
even when a single relation requires more than max_fsm_pages pages.  Also,
make VACUUM emit a warning in this case, since it likely means that VACUUM
FULL or other drastic corrective measure is needed.  Per reports from Jeff
Frost and others of unexpected changes in the claimed max_fsm_pages need.
2006-09-21 20:31:22 +00:00
Teodor Sigaev deb66e013c Improve error message. Per discussion
http://archives.postgresql.org/pgsql-general/2006-09/msg00186.php
2006-09-14 11:26:49 +00:00