Commit Graph

1783 Commits

Author SHA1 Message Date
Magnus Hagander 380d1ee69e Update error messages, per notes from Tom.
Laurenz Albe
2008-04-24 14:23:43 +00:00
Magnus Hagander c979a1fefa Prevent shutdown in normal mode if online backup is running, and
have pg_ctl warn about this.

Cancel running online backups (by renaming the backup_label file,
thus rendering the backup useless) when shutting down in fast mode.

Laurenz Albe
2008-04-23 13:44:59 +00:00
Teodor Sigaev cf23b75b4d Fix using too many LWLocks bug, reported by Craig Ringer
<craig@postnewspapers.com.au>.
It was my mistake, I missed limitation of number of held locks, now GIN doesn't
use continiuous locks, but still hold buffers pinned to prevent interference
with vacuum's deletion algorithm.

Backpatch is needed.
2008-04-22 17:52:43 +00:00
Tom Lane 8472bf7a73 Allow float8, int8, and related datatypes to be passed by value on machines
where Datum is 8 bytes wide.  Since this will break old-style C functions
(those still using version 0 calling convention) that have arguments or
results of these types, provide a configure option to disable it and retain
the old pass-by-reference behavior.  Likewise, provide a configure option
to disable the recently-committed float4 pass-by-value change.

Zoltan Boszormenyi, plus configurability stuff by me.
2008-04-21 00:26:47 +00:00
Alvaro Herrera 2f0f7b4bce Clean up a few places where Datums were being treated as pointers (and vice
versa) without going through DatumGetPointer.

Gavin Sherry, with Feng Tian.
2008-04-17 21:37:28 +00:00
Tom Lane d1cbd26ded Repair two places where SIGTERM exit could leave shared memory state
corrupted.  (Neither is very important if SIGTERM is used to shut down the
whole database cluster together, but there's a problem if someone tries to
SIGTERM individual backends.)  To do this, introduce new infrastructure
macros PG_ENSURE_ERROR_CLEANUP/PG_END_ENSURE_ERROR_CLEANUP that take care
of transiently pushing an on_shmem_exit cleanup hook.  Also use this method
for createdb cleanup --- that wasn't a shared-memory-corruption problem,
but SIGTERM abort of createdb could leave orphaned files lying around.

Backpatch as far as 8.2.  The shmem corruption cases don't exist in 8.1,
and the createdb usage doesn't seem important enough to risk backpatching
further.
2008-04-16 23:59:40 +00:00
Tom Lane 9b5c8d45f6 Push index operator lossiness determination down to GIST/GIN opclass
"consistent" functions, and remove pg_amop.opreqcheck, as per recent
discussion.  The main immediate benefit of this is that we no longer need
8.3's ugly hack of requiring @@@ rather than @@ to test weight-using tsquery
searches on GIN indexes.  In future it should be possible to optimize some
other queries better than is done now, by detecting at runtime whether the
index match is exact or not.

Tom Lane, after an idea of Heikki's, and with some help from Teodor.
2008-04-14 17:05:34 +00:00
Tom Lane 24558da14a Phase 2 of project to make index operator lossiness be determined at runtime
instead of plan time.  Extend the amgettuple API so that the index AM returns
a boolean indicating whether the indexquals need to be rechecked, and make
that rechecking happen in nodeIndexscan.c (currently the only place where
it's expected to be needed; other callers of index_getnext are just erroring
out for now).  For the moment, GIN and GIST have stub logic that just always
sets the recheck flag to TRUE --- I'm hoping to get Teodor to handle pushing
that control down to the opclass consistent() functions.  The planner no
longer pays any attention to amopreqcheck, and that catalog column will go
away in due course.
2008-04-13 19:18:14 +00:00
Tom Lane ec498cdcbb Create new routines systable_beginscan_ordered, systable_getnext_ordered,
systable_endscan_ordered that have API similar to systable_beginscan etc
(in particular, the passed-in scankeys have heap not index attnums),
but guarantee ordered output, unlike the existing functions.  For the moment
these are just very thin wrappers around index_beginscan/index_getnext/etc.
Someday they might need to get smarter; but for now this is just a code
refactoring exercise to reduce the number of direct callers of index_getnext,
in preparation for changing that function's API.

In passing, remove index_getnext_indexitem, which has been dead code for
quite some time, and will have even less use than that in the presence
of run-time-lossy indexes.
2008-04-12 23:14:21 +00:00
Tom Lane 4e82a95476 Replace "amgetmulti" AM functions with "amgetbitmap", in which the whole
indexscan always occurs in one call, and the results are returned in a
TIDBitmap instead of a limited-size array of TIDs.  This should improve
speed a little by reducing AM entry/exit overhead, and it is necessary
infrastructure if we are ever to support bitmap indexes.

In an only slightly related change, add support for TIDBitmaps to preserve
(somewhat lossily) the knowledge that particular TIDs reported by an index
need to have their quals rechecked when the heap is visited.  This facility
is not really used yet; we'll need to extend the forced-recheck feature to
plain indexscans before it's useful, and that hasn't been coded yet.
The intent is to use it to clean up 8.3's horrid @@@ kluge for text search
with weighted queries.  There might be other uses in future, but that one
alone is sufficient reason.

Heikki Linnakangas, with some adjustments by me.
2008-04-10 22:25:26 +00:00
Tom Lane 2604359251 Improve hash_any() to use word-wide fetches when hashing suitably aligned
data.  This makes for a significant speedup at the cost that the results
now vary between little-endian and big-endian machines; which forces us
to add explicit ORDER BYs in a couple of regression tests to preserve
machine-independent comparison results.  Also, force initdb by bumping
catversion, since the contents of hash indexes will change (at least on
big-endian machines).

Kenneth Marshall and Tom Lane, based on work from Bob Jenkins.  This commit
does not adopt Bob's new faster mix() algorithm, however, since we still need
to convince ourselves that that doesn't degrade the quality of the hashing.
2008-04-06 16:54:49 +00:00
Bruce Momjian 2a1cf97c22 Have pg_stop_backup() wait for all archive files to be sent, rather than
returing right away.  This guarantees that when pg_stop_backup()
returns, you have a valid backup.

Simon Riggs
2008-04-05 01:34:06 +00:00
Tom Lane b03271590e Remove heap_release_fetch, which is no longer used anywhere; this simplifies
heap_fetch a little.
2008-04-03 17:12:27 +00:00
Alvaro Herrera 73b0300b2a Move the HTSU_Result enum definition into snapshot.h, to avoid including
tqual.h into heapam.h.  This makes all inclusion of tqual.h explicit.

I also sorted alphabetically the includes on some source files.
2008-03-26 21:10:39 +00:00
Alvaro Herrera 78f02ca1f5 Rename snapmgmt.c/h to snapmgr.c/h, for consistency with other files.
Per complaint from Tom Lane.
2008-03-26 18:48:59 +00:00
Alvaro Herrera d43b085d57 Separate snapshot management code from tuple visibility code, create a
snapmgmt.c file for the former.  The header files have also been reorganized
in three parts: the most basic snapshot definitions are now in a new file
snapshot.h, and the also new snapmgmt.h keeps the definitions for snapmgmt.c.
tqual.h has been reduced to the bare minimum.

This patch is just a first step towards managing live snapshots within a
transaction; there is no functionality change.

Per my proposal to pgsql-patches on 20080318191940.GB27458@alvh.no-ip.org and
subsequent discussion.
2008-03-26 16:20:48 +00:00
Tom Lane 220db7ccd8 Simplify and standardize conversions between TEXT datums and ordinary C
strings.  This patch introduces four support functions cstring_to_text,
cstring_to_text_with_len, text_to_cstring, and text_to_cstring_buffer, and
two macros CStringGetTextDatum and TextDatumGetCString.  A number of
existing macros that provided variants on these themes were removed.

Most of the places that need to make such conversions now require just one
function or macro call, in place of the multiple notational layers that used
to be needed.  There are no longer any direct calls of textout or textin,
and we got most of the places that were using handmade conversions via
memcpy (there may be a few still lurking, though).

This commit doesn't make any serious effort to eliminate transient memory
leaks caused by detoasting toasted text objects before they reach
text_to_cstring.  We changed PG_GETARG_TEXT_P to PG_GETARG_TEXT_PP in a few
places where it was easy, but much more could be done.

Brendan Jurd and Tom Lane
2008-03-25 22:42:46 +00:00
Bruce Momjian fca9fff41b More README src cleanups. 2008-03-21 13:23:29 +00:00
Bruce Momjian 4e228447aa Make source code READMEs more consistent. Add CVS tags to all README files. 2008-03-20 17:55:15 +00:00
Peter Eisentraut a7b7b07af3 Enable probes to work with Mac OS X Leopard and other OSes that will
support DTrace in the future.

Switch from using DTRACE_PROBEn macros to the dynamically generated macros.
Use "dtrace -h" to create a header file that contains the dynamically
generated macros to be used in the source code instead of the DTRACE_PROBEn
macros.  A dummy header file is generated for builds without DTrace support.

Author: Robert Lor <Robert.Lor@sun.com>
2008-03-17 19:44:41 +00:00
Tom Lane 32846f8152 Fix TransactionIdIsCurrentTransactionId() to use binary search instead of
linear search when checking child-transaction XIDs.  This makes for an
important speedup in transactions that have large numbers of children,
as in a recent example from Craig Ringer.  We can also get rid of an
ugly kluge that represented lists of TransactionIds as lists of OIDs.

Heikki Linnakangas
2008-03-17 02:18:55 +00:00
Tom Lane 787eba734b When creating a large hash index, pre-sort the index entries by estimated
bucket number, so as to ensure locality of access to the index during the
insertion step.  Without this, building an index significantly larger than
available RAM takes a very long time because of thrashing.  On the other
hand, sorting is just useless overhead when the index does fit in RAM.
We choose to sort when the initial index size exceeds effective_cache_size.

This is a revised version of work by Tom Raney and Shreya Bhargava.
2008-03-16 23:15:08 +00:00
Tom Lane c9a1cc694a Change hash index creation so that rather than always establishing exactly
two buckets at the start, we create a number of buckets appropriate for the
estimated size of the table.  This avoids a lot of expensive bucket-split
actions during initial index build on an already-populated table.

This is one of the two core ideas of Tom Raney and Shreya Bhargava's patch
to reduce hash index build time.  I'm committing it separately to make it
easier for people to test the effects of this separately from the effects
of their other core idea (pre-sorting the index entries by bucket number).
2008-03-15 20:46:31 +00:00
Tom Lane 3e701a04fe Fix heap_page_prune's problem with failing to send cache invalidation
messages if the calling transaction aborts later on.  Collapsing out line
pointer redirects is a done deal as soon as we complete the page update,
so syscache *must* be notified even if the VACUUM FULL as a whole doesn't
complete.  To fix, add some functionality to inval.c to allow the pending
inval messages to be sent immediately while heap_page_prune is still
running.  The implementation is a bit chintzy: it will only work in the
context of VACUUM FULL.  But that's all we need now, and it can always be
extended later if needed.  Per my trouble report of a week ago.
2008-03-13 18:00:32 +00:00
Tom Lane 611b4393f2 Make TransactionIdIsInProgress check transam.c's single-item XID status cache
before it goes groveling through the ProcArray.  In situations where the same
recently-committed transaction ID is checked repeatedly by tqual.c, this saves
a lot of shared-memory searches.  And it's cheap enough that it shouldn't
hurt noticeably when it doesn't help.
Concept and patch by Simon, some minor tweaking and comment-cleanup by Tom.
2008-03-11 20:20:35 +00:00
Tom Lane 2fc2795456 Remove no-longer-used XLogCacheByte field of XLogCtl.
Itagaki Takahiro
2008-03-10 02:13:22 +00:00
Tom Lane 6f10eb2111 Refactor heap_page_prune so that instead of changing item states on-the-fly,
it accumulates the set of changes to be made and then applies them.  It had
to accumulate the set of changes anyway to prepare a WAL record for the
pruning action, so this isn't an enormous change; the only new complexity is
to not doubly mark tuples that are visited twice in the scan.  The main
advantage is that we can substantially reduce the scope of the critical
section in which the changes are applied, thus avoiding PANIC in foreseeable
cases like running out of memory in inval.c.  A nice secondary advantage is
that it is now far clearer that WAL replay will actually do the same thing
that the original pruning did.

This commit doesn't do anything about the open problem that
CacheInvalidateHeapTuple doesn't have the right semantics for a CTID change
caused by collapsing out a redirect pointer.  But whatever we do about that,
it'll be a good idea to not do it inside a critical section.
2008-03-08 21:57:59 +00:00
Tom Lane ad434473eb This patch addresses some issues in TOAST compression strategy that
were discussed last year, but we felt it was too late in the 8.3 cycle to
change the code immediately.  Specifically, the patch:

* Reduces the minimum datum size to be considered for compression from
256 to 32 bytes, as suggested by Greg Stark.

* Increases the required compression rate for compressed storage from
20% to 25%, again per Greg's suggestion.

* Replaces force_input_size (size above which compression is forced)
with a maximum size to be considered for compression.  It was agreed
that allowing large inputs to escape the minimum-compression-rate
requirement was not bright, and that indeed we'd rather have a knob
that acted in the other direction.  I set this value to 1MB for the
moment, but it could use some performance studies to tune it.

* Adds an early-failure path to the compressor as suggested by Jan:
if it's been unable to find even one compressible substring in the
first 1KB (parameterizable), assume we're looking at incompressible
input and give up.  (Possibly this logic can be improved, but I'll
commit it as-is for now.)

* Improves the toasting heuristics so that when we have very large
fields with attstorage 'x' or 'e', we will push those out to toast
storage before considering inline compression of shorter fields.
This also responds to a suggestion of Greg's, though my original
proposal for a solution was a bit off base because it didn't fix
the problem for large 'e' fields.

There was some discussion in the earlier threads of exposing some
of the compression knobs to users, perhaps even on a per-column
basis.  I have not done anything about that here.  It seems to me
that if we are changing around the parameters, we'd better get some
experience and be sure we are happy with the design before we set
things in stone by providing user-visible knobs.
2008-03-07 23:20:21 +00:00
Tom Lane 6a17826621 Change hashscan.c to keep its list of active hash index scans in
TopMemoryContext, rather than scattered through executor per-query contexts.
This poses no danger of memory leak since the ResourceOwner mechanism
guarantees release of no-longer-needed items.  It is needed because the
per-query context might already be released by the time we try to clean up
the hash scan list.  Report by ykhuang, diagnosis by Heikki.

Back-patch to 8.0, where the ResourceOwner-based cleanup was introduced.
The given test case does not fail before 8.2, probably because we rearranged
transaction abort processing somehow; but this coding is undoubtedly risky
so I'll patch 8.0 and 8.1 anyway.
2008-03-07 15:59:03 +00:00
Tom Lane 7d6e6e2e97 Fix PREPARE TRANSACTION to reject the case where the transaction has dropped a
temporary table; we can't support that because there's no way to clean up the
source backend's internal state if the eventual COMMIT PREPARED is done by
another backend.  This was checked correctly in 8.1 but I broke it in 8.2 :-(.
Patch by Heikki Linnakangas, original trouble report by John Smith.
2008-03-04 19:54:06 +00:00
Tom Lane c67f6f2f57 Reducing the assumed alignment of struct varlena means that the compiler
is also licensed to put a local variable declared that way at an unaligned
address.  Which will not work if the variable is then manipulated with
SET_VARSIZE or other macros that assume alignment.  So the previous patch
is not an unalloyed good, but on balance I think it's still a win, since
we have very few places that do that sort of thing.  Fix the one place in
tuptoaster.c that does it.  Per buildfarm results from gypsy_moth
(I'm a bit surprised that only one machine showed a failure).
2008-02-29 17:47:41 +00:00
Tom Lane 9713c06319 Change the declaration of struct varlena so that the length word is
represented as "char ...[4]" not "int32".  Since the length word is never
supposed to be accessed via this struct member anyway, this won't break
any existing code that is following the rules.  The advantage is that C
compilers will no longer assume that a pointer to struct varlena is
word-aligned, which prevents incorrect optimizations in TOAST-pointer
access and perhaps other places.  gcc doesn't seem to do this (at least
not at -O2), but the problem is demonstrable on some other compilers.

I changed struct inet as well, but didn't bother to touch a lot of other
struct definitions in which it wouldn't make any difference because there
were other fields forcing int alignment anyway.  Hopefully none of those
struct definitions are used for accessing unaligned Datums.
2008-02-23 19:11:45 +00:00
Peter Eisentraut 1f4a587fc3 Remove another target I forgot during the refactoring 2008-02-19 11:49:12 +00:00
Peter Eisentraut 0474dcb608 Refactor backend makefiles to remove lots of duplicate code 2008-02-19 10:30:09 +00:00
Tom Lane cd00406774 Replace time_t with pg_time_t (same values, but always int64) in on-disk
data structures and backend internal APIs.  This solves problems we've seen
recently with inconsistent layout of pg_control between machines that have
32-bit time_t and those that have already migrated to 64-bit time_t.  Also,
we can get out from under the problem that Windows' Unix-API emulation is not
consistent about the width of time_t.

There are a few remaining places where local time_t variables are used to hold
the current or recent result of time(NULL).  I didn't bother changing these
since they do not affect any cross-module APIs and surely all platforms will
have 64-bit time_t before overflow becomes an actual risk.  time_t should
be avoided for anything visible to extension modules, however.
2008-02-17 02:09:32 +00:00
Tom Lane 47df4f6688 Add a GUC variable "synchronize_seqscans" to allow clients to disable the new
synchronized-scanning behavior, and make pg_dump disable sync scans so that
it will reliably preserve row ordering.  Per recent discussions.
2008-01-30 18:35:55 +00:00
Peter Eisentraut 6f8f8d2daa Provide a clearer error message if the pg_control version number looks
wrong because of mismatched byte ordering.
2008-01-21 11:17:46 +00:00
Tom Lane ac12412ede Revise memory management for libxml calls. Instead of keeping libxml's data
in whichever context happens to be current during a call of an xml.c function,
use a dedicated context that will not go away until we explicitly delete it
(which we do at transaction end or subtransaction abort).  This makes recovery
after an error much simpler --- we don't have to individually delete the data
structures created by libxml.  Also, we need to initialize and cleanup libxml
only once per transaction (if there's no error) instead of once per function
call, so it should be a bit faster.  We'll need to keep an eye out for
intra-transaction memory leaks, though.  Alvaro and Tom.
2008-01-15 18:57:00 +00:00
Tom Lane d3b1b1f9d8 Fix CREATE INDEX CONCURRENTLY so that it won't use synchronized scan for
its second pass over the table.  It has to start at block zero, else the
"merge join" logic for detecting which TIDs are already in the index
doesn't work.  Hence, extend heapam.c's API so that callers can enable or
disable syncscan.  (I put in an option to disable buffer access strategy,
too, just in case somebody needs it.)  Per report from Hannes Dorbath.
2008-01-14 01:39:09 +00:00
Tom Lane eedb068c0a Make standard maintenance operations (including VACUUM, ANALYZE, REINDEX,
and CLUSTER) execute as the table owner rather than the calling user, using
the same privilege-switching mechanism already used for SECURITY DEFINER
functions.  The purpose of this change is to ensure that user-defined
functions used in index definitions cannot acquire the privileges of a
superuser account that is performing routine maintenance.  While a function
used in an index is supposed to be IMMUTABLE and thus not able to do anything
very interesting, there are several easy ways around that restriction; and
even if we could plug them all, there would remain a risk of reading sensitive
information and broadcasting it through a covert channel such as CPU usage.

To prevent bypassing this security measure, execution of SET SESSION
AUTHORIZATION and SET ROLE is now forbidden within a SECURITY DEFINER context.

Thanks to Itagaki Takahiro for reporting this vulnerability.

Security: CVE-2007-6600
2008-01-03 21:23:15 +00:00
Bruce Momjian 9098ab9e32 Update copyrights in source tree to 2008. 2008-01-01 19:46:01 +00:00
Tom Lane ac1ae9f2fa Improve a number of elog messages for not-supposed-to-happen cases in btrees,
since these seem to happen after all in corrupted indexes.  Make sure we
supply the index name in all cases, and provide relevant block numbers where
available.  Also consistently identify the index name as such.

Back-patch to 8.2, in hopes that this might help Mason Hale figure out his
problem.
2007-12-31 04:52:05 +00:00
Tom Lane 265f904d8f Code review for LIKE ... INCLUDING INDEXES patch. Fix failure to propagate
constraint status of copied indexes (bug #3774), as well as various other
small bugs such as failure to pstrdup when needed.  Allow INCLUDING INDEXES
indexes to be merged with identical declared indexes (perhaps not real useful,
but the code is there and having it not apply to LIKE indexes seems pretty
unorthogonal).  Avoid useless work in generateClonedIndexStmt().  Undo some
poorly chosen API changes, and put a couple of routines in modules that seem
to be better places for them.
2007-12-01 23:44:44 +00:00
Tom Lane 895a94de6d Avoid incrementing the CommandCounter when CommandCounterIncrement is called
but no database changes have been made since the last CommandCounterIncrement.
This should result in a significant improvement in the number of "commands"
that can typically be performed within a transaction before hitting the 2^32
CommandId size limit.  In particular this buys back (and more) the possible
adverse consequences of my previous patch to fix plan caching behavior.

The implementation requires tracking whether the current CommandCounter
value has been "used" to mark any tuples.  CommandCounter values stored into
snapshots are presumed not to be used for this purpose.  This requires some
small executor changes, since the executor used to conflate the curcid of
the snapshot it was using with the command ID to mark output tuples with.
Separating these concepts allows some small simplifications in executor APIs.

Something for the TODO list: look into having CommandCounterIncrement not do
AcceptInvalidationMessages.  It seems fairly bogus to be doing it there,
but exactly where to do it instead isn't clear, and I'm disinclined to mess
with asynchronous behavior during late beta.
2007-11-30 21:22:54 +00:00
Tom Lane 6f4cfe48ac Improve GIN index build's tracking of memory usage by using
GetMemoryChunkSpace, not just the palloc request size.  This brings the
allocatedMemory counter close enough to reality (as measured by
MemoryContextStats printouts) that I think we can get rid of the arbitrary
factor-of-2 adjustment that was put into the code initially.  Given the
sensitivity of GIN build to work memory size, not using as much of work
memory as we're allowed to seems a pretty bad idea.
2007-11-16 21:55:59 +00:00
Tom Lane 93190c3098 Repair still another bug in the btree page split WAL reduction patch:
it failed for splits of non-leaf pages because in such pages the first
data key on a page is suppressed, and so we can't just copy the first
key from the right page to reconstitute the left page's high key.
Problem found by Koichi Suzuki, patch by Heikki.
2007-11-16 19:53:50 +00:00
Bruce Momjian f639df0d61 Small comment spacing improvement. 2007-11-16 01:51:22 +00:00
Bruce Momjian 7d4c99b414 Fix pgindent to properly handle 'else' and single-line comments on the
same line;  previous fix was only partial.  Re-run pgindent on files
that need it.
2007-11-15 23:23:44 +00:00
Bruce Momjian f6e8730d11 Re-run pgindent with updated list of typedefs. (Updated README should
avoid this problem in the future.)
2007-11-15 22:25:18 +00:00
Peter Eisentraut b30769ee54 When logging the recovery.conf parameters, show them quoted as they would
appear in the configuration file.
2007-11-15 22:02:12 +00:00
Bruce Momjian fdf5a5efb7 pgindent run for 8.3. 2007-11-15 21:14:46 +00:00
Tom Lane 6cc4451b5c Prevent re-use of a deleted relation's relfilenode until after the next
checkpoint.  This guards against an unlikely data-loss scenario in which
we re-use the relfilenode, then crash, then replay the deletion and
recreation of the file.  Even then we'd be OK if all insertions into the
new relation had been WAL-logged ... but that's not guaranteed given all
the no-WAL-logging optimizations that have recently been added.

Patch by Heikki Linnakangas, per a discussion last month.
2007-11-15 20:36:40 +00:00
Tom Lane b40c0a4bb0 Clean up some stray references to tsearch2. 2007-11-13 23:36:26 +00:00
Tom Lane 0bd4da23a4 Ensure that typmod decoration on a datatype name is validated in all cases,
even in code paths where we don't pay any subsequent attention to the typmod
value.  This seems needed in view of the fact that 8.3's generalized typmod
support will accept a lot of bogus syntax, such as "timestamp(foo)" or
"record(int, 42)" --- if we allow such things to pass without comment,
users will get confused.  Per a recent example from Greg Stark.

To implement this in a way that's not very vulnerable to future
bugs-of-omission, refactor the API of parse_type.c's TypeName lookup routines
so that typmod validation is folded into the base lookup operation.  Callers
can still choose not to receive the encoded typmod, but we'll check the
decoration anyway if it's present.
2007-11-11 19:22:49 +00:00
Bruce Momjian 82748bc253 Reduce error level of ROLLBACK outside a transaction from WARNING to
NOTICE.
2007-11-10 14:36:44 +00:00
Peter Eisentraut 5f9869d0ee Use "alternative" instead of "alternate" where it is clearer. 2007-11-07 12:24:24 +00:00
Teodor Sigaev bf5ccf382c - Add check of already changed page while replay WAL. This touches only
ginRedoInsert(), because other ginRedo* functions rewrite whole page or
make changes which could be applied several times without consistent's loss

- Remove check of identifying of corresponding split record:
it's possible that replaying of WAL starts after actual page split, but before
removing of that split from incomplete splits list. In this case, that check
cause FATAL error.

Per stress test which reproduces bug reported by Craig McElroy
<craig.mcelroy@contegix.com>
2007-10-29 19:26:57 +00:00
Teodor Sigaev 85376c6f7d Fix coredump during replay WAL after crash. Change entrySplitPage() to prevent
usage of any information from system catalog, because it could be called during
replay of WAL.

Per bug report from Craig McElroy <craig.mcelroy@contegix.com>. Patch doesn't
change on-disk storage.
2007-10-29 13:49:21 +00:00
Alvaro Herrera 745c1b2c2a Rearrange vacuum-related bits in PGPROC as a bitmask, to better support
having several of them.  Add two more flags: whether the process is
executing an ANALYZE, and whether a vacuum is for Xid wraparound (which
is obviously only set by autovacuum).

Sneakily move the worker's recently-acquired PostAuthDelay to a more useful
place.
2007-10-24 20:55:36 +00:00
Tom Lane 9226ba817b Keep heap_page_prune from marking the buffer dirty when it didn't
really change anything.  Per report from Itagaki Takahiro.  Fix by
Pavan Deolasee.
2007-10-24 13:05:57 +00:00
Tom Lane 56303abff0 Tweak toast-related logic in heapam.c so that the toaster is only invoked
when relkind = RELKIND_RELATION.  This syncs these tests with the Asserts
in tuptoaster.c, and ensures that we won't ever try to, for example,
compress a sequence's tuple.  Problem found by Greg Stark while stress-testing
with much-smaller-than-normal page sizes.
2007-10-16 17:05:26 +00:00
Tom Lane 5c8eb929e6 When telling the bgwriter that we need a checkpoint because too much xlog
has been consumed, recheck against the latest value of RedoRecPtr before
really sending the signal.  This avoids useless checkpoint activity if
XLogWrite is executed when we have a very stale local copy of RedoRecPtr.
The potential for useless checkpoint is very much worse in 8.3 because of
the walwriter process (which never does XLogInsert), so while this behavior
was intentional, it needs to be changed.  Per report from Itagaki Takahiro.
2007-10-12 19:39:59 +00:00
Tom Lane 56b7695cf5 Remove incorrect use of VARSIZE() on a toasted datum. We can just remove it
instead of fix it, since once we've set toast_action[i] to 'p' it no longer
matters what toast_sizes[i] is.  Greg Stark
2007-10-11 18:19:58 +00:00
Tom Lane b526462f9e Avoid assuming that struct varattrib_pointer doesn't get padded by the
compiler --- at least on ARM, it does.  I suspect that the varvarlena patch
has been creating larger-than-intended toast pointers all along on ARM,
but it wasn't exposed until the latest tweak added some Asserts that
calculated the expected size in a different way.  We could probably have
fixed this by adding __attribute__((packed)) as is done for ItemPointerData,
but struct varattrib_pointer isn't really all that useful anyway, so it
seems cleanest to just get rid of it and have only struct varattrib_1b_e.
Per results from buildfarm member quagga.
2007-10-01 16:25:56 +00:00
Tom Lane 27b8922221 Add an extra header byte to TOAST-pointer datums to represent their size
explicitly.  This means a TOAST pointer takes 18 bytes instead of 17 --- still
smaller than in 8.2 --- which seems a good tradeoff to ensure we won't have
painted ourselves into a corner if we want to support multiple types of TOAST
pointer later on.  Per discussion with Greg Stark.
2007-09-30 19:54:58 +00:00
Tom Lane ab051bd293 Adjust recovery PS display as agreed with Simon: 'waiting for XXX'
while the restore_command does its thing, then 'recovering XXX' while
processing the segment file.  These operations are heavyweight enough
that an extra PS display set shouldn't bother anyone.
2007-09-30 17:28:56 +00:00
Tom Lane 77ccbe64dd Make recovery show the current input WAL segment name in the startup
process' PS display.  After a suggestion by Simon (not exactly his
patch though).
2007-09-29 18:32:56 +00:00
Tom Lane b46bd55a6c Make archive recovery always start a new timeline, rather than only when a
recovery stop time was used.  This avoids a corner-case risk of trying to
overwrite an existing archived copy of the last WAL segment, and seems
simpler and cleaner all around than the original definition.  Per example
from Jon Colverson and subsequent analysis by Simon.
2007-09-29 01:36:10 +00:00
Tom Lane 84fe8990ae Some small tuptoaster improvements from Greg Stark. Avoid unnecessary
decompression of an already-compressed external value when we have to copy
it; save a few cycles when a value is too short for compression; and
annotate various lines that are currently unreachable.
2007-09-26 23:29:10 +00:00
Tom Lane f18dfc4835 Minor improvements in backup and recovery:
- create a separate archive_mode GUC, on which archive_command is dependent

- %r option in recovery.conf sends last restartpoint to recovery command

- %r used in pg_standby, updated README

- minor other code cleanup in pg_standby

- doc on Warm Standby now mentions pg_standby and %r

- log_restartpoints recovery option emits LOG message at each restartpoint

- end of recovery now displays last transaction end time, as requested
  by Warren Little; also shown at each restartpoint

- restart archiver if needed to carry away WAL files at shutdown

Simon Riggs
2007-09-26 22:36:30 +00:00
Tom Lane 7583f9a7ca Fix regex, LIKE, and some other second-rank text-manipulation functions
to not cause needless copying of text datums that have 1-byte headers.
Greg Stark, in response to performance gripe from Guillaume Smet and
ITAGAKI Takahiro.
2007-09-21 22:52:52 +00:00
Tom Lane cc59049daf Improve handling of prune/no-prune decisions by storing a page's oldest
unpruned XMAX in its header.  At the cost of 4 bytes per page, this keeps us
from performing heap_page_prune when there's no chance of pruning anything.
Seems to be necessary per Heikki's preliminary performance testing.
2007-09-21 21:25:42 +00:00
Tom Lane bd0af827da Fix comments that misspelled TransactionIdIsInProgress, per Heikki. 2007-09-21 16:32:19 +00:00
Tom Lane 282d2a03dd HOT updates. When we update a tuple without changing any of its indexed
columns, and the new version can be stored on the same heap page, we no longer
generate extra index entries for the new version.  Instead, index searches
follow the HOT-chain links to ensure they find the correct tuple version.

In addition, this patch introduces the ability to "prune" dead tuples on a
per-page basis, without having to do a complete VACUUM pass to recover space.
VACUUM is still needed to clean up dead index entries, however.

Pavan Deolasee, with help from a bunch of other people.
2007-09-20 17:56:33 +00:00
Bruce Momjian 63490ddf1e Remove GIN interface section, which is now documented in SGML.
Heikki Linnakangas
2007-09-14 16:28:17 +00:00
Tom Lane 6889303531 Redefine the lp_flags field of item pointers as having four states, rather
than two independent bits (one of which was never used in heap pages anyway,
or at least hadn't been in a very long time).  This gives us flexibility to
add the HOT notions of redirected and dead item pointers without requiring
anything so klugy as magic values of lp_off and lp_len.  The state values
are chosen so that for the states currently in use (pre-HOT) there is no
change in the physical representation.
2007-09-12 22:10:26 +00:00
Tom Lane ef4d38c86c Rename recently-added pg_stat_activity column from txn_start to xact_start,
for consistency with other column names such as in pg_stat_database.
2007-09-11 03:28:05 +00:00
Tom Lane 6bd4f401b0 Replace the former method of determining snapshot xmax --- to wit, calling
ReadNewTransactionId from GetSnapshotData --- with a "latestCompletedXid"
variable that is updated during transaction commit or abort.  Since
latestCompletedXid is written only in places that had to lock ProcArrayLock
exclusively anyway, and is read only in places that had to lock ProcArrayLock
shared anyway, it adds no new locking requirements to the system despite being
cluster-wide.  Moreover, removing ReadNewTransactionId from snapshot
acquisition eliminates the need to take both XidGenLock and ProcArrayLock at
the same time.  Since XidGenLock is sometimes held across I/O this can be a
significant win.  Some preliminary benchmarking suggested that this patch has
no effect on average throughput but can significantly improve the worst-case
transaction times seen in pgbench.  Concept by Florian Pflug, implementation
by Tom Lane.
2007-09-08 20:31:15 +00:00
Tom Lane 0a51e7073c Don't take ProcArrayLock while exiting a transaction that has no XID; there is
no need for serialization against snapshot-taking because the xact doesn't
affect anyone else's snapshot anyway.  Per discussion.  Also, move various
info about the interlocking of transactions and snapshots out of code comments
and into a hopefully-more-cohesive discussion in access/transam/README.

Also, remove a couple of now-obsolete comments about having to force some WAL
to be written to persuade RecordTransactionCommit to do its thing.
2007-09-07 20:59:26 +00:00
Teodor Sigaev 0392ea5097 Improve page split in rtree emulation. Now if splitted result has
big misalignement, then it tries to split page basing on distribution
of boxe's centers.

Per report from  Dolafi, Tom <dolafit@janelia.hhmi.org>

 Backpatch is needed, change doesn't affect on-disk storage.
2007-09-07 17:04:26 +00:00
Tom Lane 4bf2dfb9a2 Quick hack to make the VXID of a prepared transaction be -1/XID,
so that different prepared xacts can be told apart in the pg_locks
view.  Per suggestion from Florian.
2007-09-05 20:53:17 +00:00
Tom Lane 295e63983d Implement lazy XID allocation: transactions that do not modify any database
rows will normally never obtain an XID at all.  We already did things this way
for subtransactions, but this patch extends the concept to top-level
transactions.  In applications where there are lots of short read-only
transactions, this should improve performance noticeably; not so much from
removal of the actual XID-assignments, as from reduction of overhead that's
driven by the rate of XID consumption.  We add a concept of a "virtual
transaction ID" so that active transactions can be uniquely identified even
if they don't have a regular XID.  This is a much lighter-weight concept:
uniqueness of VXIDs is only guaranteed over the short term, and no on-disk
record is made about them.

Florian Pflug, with some editorialization by Tom.
2007-09-05 18:10:48 +00:00
Tom Lane 2abae34a2e Implement function-local GUC parameter settings, as per recent discussion.
There are still some loose ends: I didn't do anything about the SET FROM
CURRENT idea yet, and it's not real clear whether we are happy with the
interaction of SET LOCAL with function-local settings.  The documentation
is a bit spartan, too.
2007-09-03 00:39:26 +00:00
Tom Lane a52e4408b9 Add a debug logging message when a resource manager rejects an attempted
restart point.  Per suggestion from Simon Riggs.
2007-08-28 23:17:47 +00:00
Tom Lane 140d4ebcb4 Tsearch2 functionality migrates to core. The bulk of this work is by
Oleg Bartunov and Teodor Sigaev, but I did a lot of editorializing,
so anything that's broken is probably my fault.

Documentation is nonexistent as yet, but let's land the patch so we can
get some portability testing done.
2007-08-21 01:11:32 +00:00
Tom Lane 67f99d216a Fix oversight in async-commit patch: there were some places in heapam.c
that still thought they could set HEAP_XMAX_COMMITTED immediately after
seeing the other transaction commit.  Make them use the same logic as
tqual.c does to determine if the hint bit can be set yet.
2007-08-14 17:35:18 +00:00
Tom Lane 647fd9a108 Fix two bugs induced in VACUUM FULL by async-commit patch.
First, we cannot assume that XLogAsyncCommitFlush guarantees hint bits will be
settable, because clog.c's inexact LSN bookkeeping results in windows where a
previously flushed transaction is considered unhintable because it shares an
LSN slot with a later unflushed transaction.  But repair_frag requires
XMIN_COMMITTED to be correct so that it can distinguish tuples moved by the
current vacuum.  Since not being able to set the bit is an uncommon corner
case, the most practical way of dealing with it seems to be to abandon
shrinking (ie, don't invoke repair_frag) when we find a non-dead tuple whose
XMIN_COMMITTED bit couldn't be set.

Second, it is possible for the same reason that a RECENTLY_DEAD tuple does not
get its XMAX_COMMITTED bit set during scan_heap.  But by the time repair_frag
examines the tuple it might be possible to set the bit.  We therefore must
take buffer content lock when calling HeapTupleSatisfiesVacuum a second time,
else we can get an Assert failure in SetBufferCommitInfoNeedsSave.  This
latter bug is latent in existing releases, but I think it cannot actually
occur without async commit, since the first HeapTupleSatisfiesVacuum call
should always have set the bit.  So I'm not going to back-patch it.

In passing, reduce the existing "cannot shrink relation" messages from NOTICE
to LOG level.  The new message must be no higher than LOG if we don't want
unpredictable regression test failures, and consistency seems like a good
idea.  Also arrange that only one such message is reported per VACUUM FULL;
in typical scenarios you could get spammed with many such messages, which
seems a bit useless.
2007-08-13 19:08:26 +00:00
Tom Lane bdd6b62245 Switch over to using the src/timezone functions for formatting timestamps
displayed in the postmaster log.  This avoids Windows-specific problems with
localized time zone names that are in the wrong encoding, and generally seems
like a good idea to forestall other potential platform-dependent issues.
To preserve the existing behavior that all backends will log in the same time
zone, create a new GUC variable log_timezone that can only be changed on a
system-wide basis, and reference log-related calculations to that zone instead
of the TimeZone variable.

This fixes the issue reported by Hiroshi Saito that timestamps printed by
xlog.c startup could be improperly localized on Windows.  We still need a
simpler patch for that problem in the back branches, however.
2007-08-04 01:26:54 +00:00
Tom Lane 4a78cdeb6b Support an optional asynchronous commit mode, in which we don't flush WAL
before reporting a transaction committed.  Data consistency is still
guaranteed (unlike setting fsync = off), but a crash may lose the effects
of the last few transactions.  Patch by Simon, some editorialization by Tom.
2007-08-01 22:45:09 +00:00
Tom Lane ad4295728e Create a new dedicated Postgres process, "wal writer", which exists to write
and fsync WAL at convenient intervals.  For the moment it just tries to
offload this work from backends, but soon it will be responsible for
guaranteeing a maximum delay before asynchronously-committed transactions
will be flushed to disk.

This is a portion of Simon Riggs' async-commit patch, committed to CVS
separately because a background WAL writer seems like it might be a good idea
independently of the async-commit feature.  I rebased walwriter.c on
bgwriter.c because it seemed like a more appropriate way of handling signals;
while the startup/shutdown logic in postmaster.c is more like autovac because
we want walwriter to quit before we start the shutdown checkpoint.
2007-07-24 04:54:09 +00:00
Tom Lane 9fc25c0511 Improve logging of checkpoints. Patch by Greg Smith, worked over
by Heikki and a little bit by me.
2007-06-30 19:12:02 +00:00
Tom Lane 867e2c91a0 Implement "distributed" checkpoints in which the checkpoint I/O is spread
over a fairly long period of time, rather than being spat out in a burst.
This happens only for background checkpoints carried out by the bgwriter;
other cases, such as a shutdown checkpoint, are still done at full speed.

Remove the "all buffers" scan in the bgwriter, and associated stats
infrastructure, since this seems no longer very useful when the checkpoint
itself is properly throttled.

Original patch by Itagaki Takahiro, reworked by Heikki Linnakangas,
and some minor API editorialization by me.
2007-06-28 00:02:40 +00:00
Tom Lane 85d72f0516 Teach heapam code to know the difference between a real seqscan and the
pseudo HeapScanDesc created for a bitmap heap scan.  This avoids some useless
overhead during a bitmap scan startup, in particular invoking the syncscan
code.  (We might someday want to do that, but right now it's merely useless
contention for shared memory, to say nothing of possibly pushing useful
entries out of syncscan's small LRU list.)  This also allows elimination of
ugly pgstat_discount_heap_scan() kluge.
2007-06-09 18:49:55 +00:00
Tom Lane a04a423599 Arrange for large sequential scans to synchronize with each other, so that
when multiple backends are scanning the same relation concurrently, each page
is (ideally) read only once.

Jeff Davis, with review by Heikki and Tom.
2007-06-08 18:23:53 +00:00
Tom Lane 6d6d14b6d5 Redefine IsTransactionState() to only return true for TRANS_INPROGRESS state,
which is the only state in which it's safe to initiate database queries.
It turns out that all but two of the callers thought that's what it meant;
and the other two were using it as a proxy for "will GetTopTransactionId()
return a nonzero XID"?  Since it was in fact an unreliable guide to that,
make those two just invoke GetTopTransactionId() always, then deal with a
zero result if they get one.
2007-06-07 21:45:59 +00:00
Teodor Sigaev f74426283d Move call of MarkBufferDirty() before XLogInsert() as required.
Many thanks to Heikki Linnakangas <heikki@enterprisedb.com> for his
sharp eyes.
2007-06-05 12:47:49 +00:00
Teodor Sigaev 853d1c3103 Fix bundle bugs of GIN:
- Fix possible deadlock between UPDATE and VACUUM queries. Bug never was
  observed in 8.2, but it still exist there. HEAD is more sensitive to
  bug after recent "ring" of buffer improvements.
- Fix WAL creation: if parent page is stored as is after split then
  incomplete split isn't removed during replay. This happens rather rare, only
  on large tables with a lot of updates/inserts.
- Fix WAL replay: there was wrong test of XLR_BKP_BLOCK_* for left
  page after deletion of page. That causes wrong rightlink field: it pointed
  to deleted page.
- add checking of match of clearing incomplete split
- cleanup incomplete split list after proceeding

All of this chages doesn't change on-disk storage, so backpatch...
But second point may be an issue for replaying logs from previous version.
2007-06-04 15:56:28 +00:00
Peter Eisentraut f4a3789b39 Clarify some error messages about duplicate things. 2007-06-03 22:16:03 +00:00
Tom Lane 1f559b7d3a Fix several hash functions that were taking chintzy shortcuts instead of
delivering a well-randomized hash value.  I got religion on this after
observing that performance of multi-batch hash join degrades terribly if the
higher-order bits of hash values aren't random, as indeed was true for say
hashes of small integer values.  It's now expected and documented that hash
functions should use hash_any or some comparable method to ensure that all
bits of their output are about equally random.

initdb forced because this change invalidates existing hash indexes.  For the
same reason, this isn't back-patchable; the hash join performance problem
will get a band-aid fix in the back branches.
2007-06-01 15:33:19 +00:00
Peter Eisentraut 7ce9b3683e Make some messages more consistent 2007-05-31 15:13:06 +00:00
Teodor Sigaev 54af876593 Replace ReadBuffer to ReadBufferWithStrategy in all vacuum-involved places
to implement limited-size "ring" of buffers for VACUUM for GIN & GIST
2007-05-31 14:03:09 +00:00
Peter Eisentraut 71fb7b9014 Downgrade some low-level startup messages to DEBUG1. 2007-05-31 07:36:12 +00:00
Tom Lane fa0e318f94 Fix overly-strict sanity check in BeginInternalSubTransaction that made it
fail when used in a deferred trigger.  Bug goes back to 8.0; no doubt the
reason it hadn't been noticed is that we've been discouraging use of
user-defined constraint triggers.  Per report from Frank van Vugt.
2007-05-30 21:01:39 +00:00
Tom Lane d526575f89 Make large sequential scans and VACUUMs work in a limited-size "ring" of
buffers, rather than blowing out the whole shared-buffer arena.  Aside from
avoiding cache spoliation, this fixes the problem that VACUUM formerly tended
to cause a WAL flush for every page it modified, because we had it hacked to
use only a single buffer.  Those flushes will now occur only once per
ring-ful.  The exact ring size, and the threshold for seqscans to switch into
the ring usage pattern, remain under debate; but the infrastructure seems
done.  The key bit of infrastructure is a new optional BufferAccessStrategy
object that can be passed to ReadBuffer operations; this replaces the former
StrategyHintVacuum API.

This patch also changes the buffer usage-count methodology a bit: we now
advance usage_count when first pinning a buffer, rather than when last
unpinning it.  To preserve the behavior that a buffer's lifetime starts to
decrease when it's released, the clock sweep code is modified to not decrement
usage_count of pinned buffers.

Work not done in this commit: teach GiST and GIN indexes to use the vacuum
BufferAccessStrategy for vacuum-driven fetches.

Original patch by Simon, reworked by Heikki and again by Tom.
2007-05-30 20:12:03 +00:00
Tom Lane 77947c51c0 Fix up pgstats counting of live and dead tuples to recognize that committed
and aborted transactions have different effects; also teach it not to assume
that prepared transactions are always committed.

Along the way, simplify the pgstats API by tying counting directly to
Relations; I cannot detect any redeeming social value in having stats
pointers in HeapScanDesc and IndexScanDesc structures.  And fix a few
corner cases in which counts might be missed because the relation's
pgstat_info pointer hadn't been set.
2007-05-27 03:50:39 +00:00
Tom Lane a8d539f124 To support external compression of archived WAL data, add a flag bit to
WAL records that shows whether it is safe to remove full-page images
(ie, whether or not an on-line backup was in progress when the WAL entry
was made).  Also make provision for an XLOG_NOOP record type that can be
used to fill in the extra space when decompressing the data for restore.

This is the portion of Koichi Suzuki's "full page writes" patch that
has to go into the core database.  The remainder of that work is two
external compression and decompression programs, which for the time being
will undergo separate development on pgfoundry.  Per discussion.

Also, twiddle the handling of BTREE_SPLIT records to ensure it'll be
possible to compress them (the previous coding caused essential info
to be omitted).  The other commonly-used record types seem OK already,
with the possible exception of GIN and GIST WAL records, which I don't
understand well enough to opine on.
2007-05-20 21:08:19 +00:00
Alvaro Herrera 3b0347b36e Move the tuple freezing point in CLUSTER to a point further back in the past,
to avoid losing useful Xid information in not-so-old tuples.  This makes
CLUSTER behave the same as VACUUM as far a tuple-freezing behavior goes
(though CLUSTER does not yet advance the table's relfrozenxid).

While at it, move the actual freezing operation in rewriteheap.c to a more
appropriate place, and document it thoroughly.  This part of the patch from
Tom Lane.
2007-05-17 15:28:29 +00:00
Alvaro Herrera dfed0012bc Have the rewriteheap code freeze old tuples. This is safe because it is only
applied to live tuples older than a recent Xmin, not to tuples that may be part
of an update chain.  Those still keep their original markings.

This patch makes it possible for CLUSTER to advance relfrozenxid, thus avoiding
the need of vacuuming the table for Xid wraparound purposes.  That will be
patched separately.

Patch from Heikki Linnakangas.
2007-05-16 16:36:56 +00:00
Tom Lane 0fef38da21 Tweak hash index AM to use the new ReadOrZeroBuffer bufmgr API when fetching
pages it intends to zero immediately.  Just to show there is some use for that
function besides WAL recovery :-).
Along the way, fold _hash_checkpage and _hash_pageinit calls into _hash_getbuf
and friends, instead of expecting callers to do that separately.
2007-05-03 16:45:58 +00:00
Tom Lane 8c3cc86e7b During WAL recovery, when reading a page that we intend to overwrite completely
from the WAL data, don't bother to physically read it; just have bufmgr.c
return a zeroed-out buffer instead.  This speeds recovery significantly,
and also avoids unnecessary failures when a page-to-be-overwritten has corrupt
page headers on disk.  This replaces a former kluge that accomplished the
latter by pretending zero_damaged_pages was always ON during WAL recovery;
which was OK when the kluge was put in, but is unsafe when restoring a WAL
log that was written with full_page_writes off.

Heikki Linnakangas
2007-05-02 23:18:03 +00:00
Tom Lane c432061963 Change the timestamps recorded in transaction commit/abort xlog records
from time_t to TimestampTz representation.  This provides full gettimeofday()
resolution of the timestamps, which might be useful when attempting to
do point-in-time recovery --- previously it was not possible to specify
the stop point with sub-second resolution.  But mostly this is to get
rid of TimestampTz-to-time_t conversion overhead during commit.  Per my
proposal of a day or two back.
2007-04-30 21:01:53 +00:00
Tom Lane 957d08c81f Implement rate-limiting logic on how often backends will attempt to send
messages to the stats collector.  This avoids the problem that enabling
stats_row_level for autovacuum has a significant overhead for short
read-only transactions, as noted by Arjen van der Meijden.  We can avoid
an extra gettimeofday call by piggybacking on the one done for WAL-logging
xact commit or abort (although that doesn't help read-only transactions,
since they don't WAL-log anything).

In my proposal for this, I noted that we could change the WAL log entries
for commit/abort to record full TimestampTz precision, instead of only
time_t as at present.  That's not done in this patch, but will be committed
separately.
2007-04-30 03:23:49 +00:00
Tom Lane a2e923a652 Fix dynahash.c to suppress hash bucket splits while a hash_seq_search() scan
is in progress on the same hashtable.  This seems the least invasive way to
fix the recently-recognized problem that a split could cause the scan to
visit entries twice or (with much lower probability) miss them entirely.
The only field-reported problem caused by this is the "failed to re-find
shared lock object" PANIC in COMMIT PREPARED reported by Michel Dorochevsky,
which was caused by multiply visited entries.  However, it seems certain
that mdsync() is vulnerable to missing required fsync's due to missed
entries, and I am fearful that RelationCacheInitializePhase2() might be at
risk as well.  Because of that and the generalized hazard presented by this
bug, back-patch all the supported branches.

Along the way, fix pg_prepared_statement() and pg_cursor() to not assume
that the hashtables they are examining will stay static between calls.
This is risky regardless of the newly noted dynahash problem, because
hash_seq_search() has never promised to cope with deletion of table entries
other than the just-returned one.  There may be no bug here because the only
supported way to call these functions is via ExecMakeTableFunctionResult()
which will cycle them to completion before doing anything very interesting,
but it seems best to get rid of the assumption.  This affects 8.2 and HEAD
only, since those functions weren't there earlier.
2007-04-26 23:24:46 +00:00
Tom Lane 9d37c038fc Repair PANIC condition in hash indexes when a previous index extension attempt
failed (due to lock conflicts or out-of-space).  We might have already
extended the index's filesystem EOF before failing, causing the EOF to be
beyond what the metapage says is the last used page.  Hence the invariant
maintained by the code needs to be "EOF is at or beyond last used page",
not "EOF is exactly the last used page".  Problem was created by my patch
of 2006-11-19 that attempted to repair bug #2737.  Since that was
back-patched to 7.4, this needs to be as well.  Per report and test case
from Vlastimil Krejcir.
2007-04-19 20:24:04 +00:00
Tom Lane 836feeda9c Fix condition for whether end_heap_rewrite must fsync, per Heikki. 2007-04-17 21:29:31 +00:00
Tom Lane 4942ee656a Don't assume rd_smgr stays open across all of a rewriteheap operation;
doing so can result in crash if an sinval reset occurs meanwhile.
I believe this explains intermittent buildfarm failures in cluster test.
2007-04-17 20:49:39 +00:00
Tom Lane 226a100568 Code review for btree page split WAL reduction patch. Make it actually work
(original code *always* created a full-page image for the left page, thus
leaving the intended savings unrealized), avoid risk of not having enough room
on the page during xlog restore, squeeze out another couple bytes in the xlog
record, clean up neglected comments.
2007-04-11 20:47:38 +00:00
Tom Lane 56218fbc48 Minor tweaking of index special-space definitions so that the various
index types can be reliably distinguished by examining the special space
on an index page.  Per my earlier proposal, plus the realization that
there's no need for btree's vacuum cycle ID to cycle through every possible
16-bit value.  Restricting its range a little costs nearly nothing and
eliminates the possibility of collisions.
Memo to self: remember to make bitmap indexes play along with this scheme,
assuming that patch ever gets accepted.
2007-04-09 22:04:08 +00:00
Tom Lane 7b78474da3 Make CLUSTER MVCC-safe. Heikki Linnakangas 2007-04-08 01:26:33 +00:00
Tom Lane f02a82b6ad Make 'col IS NULL' clauses be indexable conditions.
Teodor Sigaev, with some kibitzing from Tom Lane.
2007-04-06 22:33:43 +00:00
Tom Lane 3e23b68dac Support varlena fields with single-byte headers and unaligned storage.
This commit breaks any code that assumes that the mere act of forming a tuple
(without writing it to disk) does not "toast" any fields.  While all available
regression tests pass, I'm not totally sure that we've fixed every nook and
cranny, especially in contrib.

Greg Stark with some help from Tom Lane
2007-04-06 04:21:44 +00:00
Tom Lane 9c9b619473 Remove the CheckpointStartLock in favor of having backends show whether they
are in their commit critical sections via flags in the ProcArray.  Checkpoint
can watch the ProcArray to determine when it's safe to proceed.  This is
a considerably better solution to the original problem of race conditions
between checkpoint and transaction commit: it speeds up commit, since there's
one less lock to fool with, and it prevents the problem of checkpoint being
delayed indefinitely when there's a constant flow of commits.  Heikki, with
some kibitzing from Tom.
2007-04-03 16:34:36 +00:00
Tom Lane b3005276eb Decouple the values of TOAST_TUPLE_THRESHOLD and TOAST_MAX_CHUNK_SIZE.
Add the latter to the values checked in pg_control, since it can't be changed
without invalidating toast table content.  This commit in itself shouldn't
change any behavior, but it lays some necessary groundwork for experimentation
with these toast-control numbers.

Note: while TOAST_TUPLE_THRESHOLD can now be changed without initdb, some
thought still needs to be given to needs_toast_table() in toasting.c before
unleashing random changes.
2007-04-03 04:14:26 +00:00
Tom Lane 57690c6803 Support enum data types. Along the way, use macros for the values of
pg_type.typtype whereever practical.  Tom Dunstan, with some kibitzing
from Tom Lane.
2007-04-02 03:49:42 +00:00
Tom Lane 8875d0987d Fix oversight in coding of _bt_start_vacuum: we can't assume that the LWLock
will be released by transaction abort before _bt_end_vacuum gets called.
If either of these "can't happen" errors actually happened, we'd freeze up
trying to acquire an already-held lock.  Latest word is that this does
not explain Martin Pitt's trouble report, but it still looks like a bug.
2007-03-30 00:12:59 +00:00
Tom Lane fba8113c1b Teach CLUSTER to skip writing WAL if not needed (ie, not using archiving)
--- Simon.
Also, code review and cleanup for the previous COPY-no-WAL patches --- Tom.
2007-03-29 00:15:39 +00:00
Tom Lane e85a01df67 Clean up the representation of special snapshots by including a "method
pointer" in every Snapshot struct.  This allows removal of the case-by-case
tests in HeapTupleSatisfiesVisibility, which should make it a bit faster
(I didn't try any performance tests though).  More importantly, we are no
longer violating portable C practices by assuming that small integers are
distinct from all pointer values, and HeapTupleSatisfiesDirty no longer
has a non-reentrant API involving side-effects on a global variable.

There were a couple of places calling HeapTupleSatisfiesXXX routines
directly rather than through the HeapTupleSatisfiesVisibility macro.
Since these places had to be changed anyway, I chose to make them go
through the macro for uniformity.

Along the way I renamed HeapTupleSatisfiesSnapshot to HeapTupleSatisfiesMVCC
to emphasize that it's only used with MVCC-type snapshots.  I was sorely
tempted to rename HeapTupleSatisfiesVisibility to HeapTupleSatisfiesSnapshot,
but forebore for the moment to avoid confusion and reduce the likelihood that
this patch breaks some of the pending patches.  Might want to reconsider
doing that later.
2007-03-25 19:45:14 +00:00
Tom Lane 4f896dac17 Arrange for PreventTransactionChain to reject commands submitted as part
of a multi-statement simple-Query message.  This bug goes all the way
back, but unfortunately is not nearly so easy to fix in existing releases;
it is only the recent ProcessUtility API change that makes it fixable in
HEAD.  Per report from William Garrison.
2007-03-22 19:55:04 +00:00
Peter Eisentraut f4ee82e3d3 Reverted waiting for further fixes:
Make configuration parameters fall back to their default values when they
are removed from the configuration file.

Joachim Wieland
2007-03-13 14:32:25 +00:00
Tom Lane b9527e9840 First phase of plan-invalidation project: create a plan cache management
module and teach PREPARE and protocol-level prepared statements to use it.
In service of this, rearrange utility-statement processing so that parse
analysis does not assume table schemas can't change before execution for
utility statements (necessary because we don't attempt to re-acquire locks
for utility statements when reusing a stored plan).  This requires some
refactoring of the ProcessUtility API, but it ends up cleaner anyway,
for instance we can get rid of the QueryContext global.

Still to do: fix up SPI and related code to use the plan cache; I'm tempted to
try to make SQL functions use it too.  Also, there are at least some aspects
of system state that we want to ensure remain the same during a replan as in
the original processing; search_path certainly ought to behave that way for
instance, and perhaps there are others.
2007-03-13 00:33:44 +00:00
Peter Eisentraut f84308f195 Make configuration parameters fall back to their default values when they
are removed from the configuration file.

Joachim Wieland
2007-03-12 22:09:28 +00:00
Neil Conway e1d8deb918 Fix a typo in a comment. Heikki Linnakangas. 2007-03-05 14:13:12 +00:00
Bruce Momjian bc292937ae Split _bt_insertonpg to two functions.
Heikki Linnakangas
2007-03-03 20:13:06 +00:00
Bruce Momjian ae35867a39 Remove undo information from pg_controldata --- never used.
Florian G. Pflug
2007-03-03 20:02:27 +00:00
Tom Lane 234a02b2a8 Replace direct assignments to VARATT_SIZEP(x) with SET_VARSIZE(x, len).
Get rid of VARATT_SIZE and VARATT_DATA, which were simply redundant with
VARSIZE and VARDATA, and as a consequence almost no code was using the
longer names.  Rename the length fields of struct varlena and various
derived structures to catch anyplace that was accessing them directly;
and clean up various places so caught.  In itself this patch doesn't
change any behavior at all, but it is necessary infrastructure if we hope
to play any games with the representation of varlena headers.
Greg Stark and Tom Lane
2007-02-27 23:48:10 +00:00
Bruce Momjian 6f519ad01c btree source code cleanups:
I refactored findsplitloc and checksplitloc so that the division of
labor is more clear IMO. I pushed all the space calculation inside the
loop to checksplitloc.

I also fixed the off by 4 in free space calculation caused by
PageGetFreeSpace subtracting sizeof(ItemIdData), even though it was
harmless, because it was distracting and I felt it might come back to
bite us in the future if we change the page layout or alignments.
There's now a new function PageGetExactFreeSpace that doesn't do the
subtraction.

findsplitloc now tries the "just the new item to right page" split as
well. If people don't like the refactoring, I can write a patch to just
add that.

Heikki Linnakangas
2007-02-21 20:02:17 +00:00
Alvaro Herrera 1820650934 Restructure autovacuum in two processes: a dummy process, which runs
continuously, and requests vacuum runs of "autovacuum workers" to postmaster.
The workers do the actual vacuum work.  This allows for future improvements,
like allowing multiple autovacuum jobs running in parallel.

For now, the code keeps the original behavior of having a single autovac
process at any time by sleeping until the previous worker has finished.
2007-02-15 23:23:23 +00:00
Bruce Momjian a9eb53969a Move fsync method macro defines into /include/access/xlogdefs.h so they
can be used by src/tools/fsync/test_fsync.c.
2007-02-14 05:00:40 +00:00
Tom Lane caf2b64a75 Disallow committing a prepared transaction unless we are in the same database
it was executed in.  Someday it might be nice to allow cross-DB commits, but
work would be needed in NOTIFY and perhaps other places.  Per Heikki.
2007-02-13 19:39:42 +00:00
Peter Eisentraut c138b966d4 Replace useless uses of := by = in makefiles. 2007-02-09 15:56:00 +00:00
Tom Lane c398300330 Combine cmin and cmax fields of HeapTupleHeaders into a single field, by
keeping private state in each backend that has inserted and deleted the same
tuple during its current top-level transaction.  This is sufficient since
there is no need to be able to determine the cmin/cmax from any other
transaction.  This gets us back down to 23-byte headers, removing a penalty
paid in 8.0 to support subtransactions.  Patch by Heikki Linnakangas, with
minor revisions by moi, following a design hashed out awhile back on the
pghackers list.
2007-02-09 03:35:35 +00:00
Alvaro Herrera f8ebab901b Fix reference-after-free in the new btree page split code, as reported by
the buildfarm via Stefan Kaltenbrunner.

Patch from Heikki Linnakangas.
2007-02-08 13:52:55 +00:00
Peter Eisentraut 086c189456 Normalize fgets() calls to use sizeof() for calculating the buffer size
where possible, and fix some sites that apparently thought that fgets()
will overwrite the buffer by one byte.

Also add some strlcpy() to eliminate some weird memory handling.
2007-02-08 11:10:27 +00:00
Bruce Momjian b79575ce45 Reduce WAL activity for page splits:
> Currently, an index split writes all the data on the split page to
> WAL. That's a lot of WAL traffic. The tuples that are copied to the
> right page need to be WAL logged, but the tuples that stay on the
> original page don't.

Heikki Linnakangas
2007-02-08 05:05:53 +00:00
Tom Lane aec4cf1c8c Add a function pg_stat_clear_snapshot() that discards any statistics snapshot
already collected in the current transaction; this allows plpgsql functions to
watch for stats updates even though they are confined to a single transaction.
Use this instead of the previous kluge involving pg_stat_file() to wait for
the stats collector to update in the stats regression test.  Internally,
decouple storage of stats snapshots from transaction boundaries; they'll
now stick around until someone calls pgstat_clear_snapshot --- which xact.c
still does at transaction end, to maintain the previous behavior.  This makes
the logic a lot cleaner, at the price of a couple dozen cycles per transaction
exit.
2007-02-07 23:11:30 +00:00
Tom Lane 78d1216160 Remove the xlog-centric "database system is ready" message and replace it with
"database system is ready to accept connections", which is issued by the
postmaster when it really is ready to accept connections.  Per proposal from
Markus Schiltknecht and subsequent discussion.
2007-02-07 16:44:48 +00:00
Tom Lane c76ed81513 Remove some dead code, per Heikki. 2007-02-06 14:55:11 +00:00
Tom Lane 23c4978e6c Rename MaxTupleSize to MaxHeapTupleSize to clarify that it's not meant to
describe the maximum size of index tuples (which is typically AM-dependent
anyway); and consequently remove the bogus deduction for "special space"
that was built into it.

Adjust TOAST_TUPLE_THRESHOLD and TOAST_MAX_CHUNK_SIZE to avoid wasting two
bytes per toast chunk, and to ensure that the calculation correctly tracks any
future changes in page header size.  The computation had been inaccurate in a
way that didn't cause any harm except space wastage, but future changes could
have broken it more drastically.

Fix the calculation of BTMaxItemSize, which was formerly computed as 1 byte
more than it could safely be.  This didn't cause any harm in practice because
it's only compared against maxalign'd lengths, but future changes in the size
of page headers or btree special space could have exposed the problem.

initdb forced because of change in TOAST_MAX_CHUNK_SIZE, which alters the
storage of toast tables.
2007-02-05 04:22:18 +00:00
Tom Lane a2e092e1c7 Don't MAXALIGN in the checks to decide whether a tuple is over TOAST's
threshold for tuple length.  On 4-byte-MAXALIGN machines, the toast code
creates tuples that have t_len exactly TOAST_TUPLE_THRESHOLD ... but this
number is not itself maxaligned, so if heap_insert maxaligns t_len before
comparing to TOAST_TUPLE_THRESHOLD, it'll uselessly recurse back to
tuptoaster.c, wasting cycles.  (It turns out that this does not happen on
8-byte-MAXALIGN machines, because for them the outer MAXALIGN in the
TOAST_MAX_CHUNK_SIZE macro reduces TOAST_MAX_CHUNK_SIZE so that toast tuples
will be less than TOAST_TUPLE_THRESHOLD in size.  That MAXALIGN is really
incorrect, but we can't remove it now, see below.)  There isn't any particular
value in maxaligning before comparing to the thresholds, so just don't do
that, which saves a small number of cycles in itself.

These numbers should be rejiggered to minimize wasted space on toast-relation
pages, but we can't do that in the back branches because changing
TOAST_MAX_CHUNK_SIZE would force an initdb (by changing the contents of toast
tables).  We can move the toast decision thresholds a bit, though, which is
what this patch effectively does.

Thanks to Pavan Deolasee for discovering the unintended recursion.

Back-patch into 8.2, but not further, pending more testing.  (HEAD is about
to get a further patch modifying the thresholds, so it won't help much
for testing this form of the patch.)
2007-02-04 20:00:37 +00:00
Bruce Momjian 8b4ff8b6a1 Wording cleanup for error messages. Also change can't -> cannot.
Standard English uses "may", "can", and "might" in different ways:

        may - permission, "You may borrow my rake."

        can - ability, "I can lift that log."

        might - possibility, "It might rain today."

Unfortunately, in conversational English, their use is often mixed, as
in, "You may use this variable to do X", when in fact, "can" is a better
choice.  Similarly, "It may crash" is better stated, "It might crash".
2007-02-01 19:10:30 +00:00
Neil Conway dbcaee49b5 Fix a few typos in comments in GiN. 2007-02-01 04:16:08 +00:00
Teodor Sigaev d4c6da1527 Allow GIN's extractQuery method to signal that nothing can satisfy the query.
In this case extractQuery should returns -1 as nentries. This changes
prototype of extractQuery method to use int32* instead of uint32* for
nentries argument.
Based on that gincostestimate may see two corner cases: nothing will be found
or seqscan should be used.

Per proposal at http://archives.postgresql.org/pgsql-hackers/2007-01/msg01581.php

PS tsearch_core patch should be sightly modified to support changes, but I'm
waiting a verdict about reviewing of tsearch_core patch.
2007-01-31 15:09:45 +00:00
Tom Lane a635c08fa1 Add support for cross-type hashing in hash index searches and hash joins.
Hashing for aggregation purposes still needs work, so it's not time to
mark any cross-type operators as hashable for general use, but these cases
work if the operators are so marked by hand in the system catalogs.
2007-01-30 01:33:36 +00:00
Tom Lane e8cd6f14a2 Add comment noting that hashm_procid in a hash index's metapage isn't
actually used for anything.
2007-01-29 23:22:59 +00:00
Tom Lane 6cefacd7c8 Correct an old logic error in btree page splitting: when considering a split
exactly at the point where we need to insert a new item, the calculation used
the wrong size for the "high key" of the new left page.  This could lead to
choosing an unworkable split, resulting in "PANIC: failed to add item to the
left sibling" (or "right sibling") failure.  Although this bug has been there
a long time, it's very difficult to trigger a failure before 8.2, since there
was generally a lot of free space on both sides of a chosen split.  In 8.2,
where the user-selected fill factor determines how much free space the code
tries to leave, an unworkable split is much more likely.  Report by Joe
Conway, diagnosis and fix by Heikki Linnakangas.
2007-01-27 20:53:30 +00:00
Bruce Momjian ef65f6f7a4 Prevent WAL logging when COPY is done in the same transation that
created it.

Simon Riggs
2007-01-25 02:17:26 +00:00
Neil Conway 2b7334d487 Refactor the index AM API slightly: move currentItemData and
currentMarkData from IndexScanDesc to the opaque structs for the
AMs that need this information (currently gist and hash).

Patch from Heikki Linnakangas, fixes by Neil Conway.
2007-01-20 18:43:35 +00:00
Peter Eisentraut 2cc01004c6 Remove remains of old depend target. 2007-01-20 17:16:17 +00:00
Alvaro Herrera eb63cc3da8 Arrange for autovacuum to be killed when another operation wants to be alone
accessing it, like DROP DATABASE.  This allows the regression tests to pass
with autovacuum enabled, which open the gates for finally enabling autovacuum
by default.
2007-01-16 13:28:57 +00:00
Tom Lane d83235415b Add some notes about the basic mathematical laws that the system presumes
hold true for operators in a btree operator family.  This is mostly to
clarify my own thinking about what the planner can assume for optimization
purposes.  (blowing dust off an old abstract-algebra textbook...)
2007-01-12 17:04:54 +00:00
Bruce Momjian 40f797be03 Enable another five tuple status bits by using the high bits of the
nattr field, and rename the field.

Heikki Linnakangas
2007-01-09 22:01:00 +00:00
Tom Lane 69db009163 Add a citation to Seltzer and Yigit's Usenix '91 paper about hash table
management.  The paper clearly describes many of the ideas embodied in
our current hashing code, but as far as I could find out there is not
a direct code heritage.  (Mike Olsen recalls discussion of this paper
at Postgres meetings but believes it "informed the Postgres implementation
probably just at the design level".  Margo herself says she wasn't
involved with Postgres' hash code.)  Credit where credit is due 'n all
that, even if fifteen years after the fact.
2007-01-09 07:30:49 +00:00
Tom Lane 4431758229 Support ORDER BY ... NULLS FIRST/LAST, and add ASC/DESC/NULLS FIRST/NULLS LAST
per-column options for btree indexes.  The planner's support for this is still
pretty rudimentary; it does not yet know how to plan mergejoins with
nondefault ordering options.  The documentation is pretty rudimentary, too.
I'll work on improving that stuff later.

Note incompatible change from prior behavior: ORDER BY ... USING will now be
rejected if the operator is not a less-than or greater-than member of some
btree opclass.  This prevents less-than-sane behavior if an operator that
doesn't actually define a proper sort ordering is selected.
2007-01-09 02:14:16 +00:00
Bruce Momjian 29dccf5fe0 Update CVS HEAD for 2007 copyright. Back branches are typically not
back-stamped for this.
2007-01-05 22:20:05 +00:00
Tom Lane 7c8927bf08 Fix some small typos in comments. Greg Stark 2007-01-04 16:29:42 +00:00
Tom Lane ef07221997 Clean up smgr.c/md.c APIs as per discussion a couple months ago. Instead of
having md.c return a success/failure boolean to smgr.c, which was just going
to elog anyway, let md.c issue the elog messages itself.  This allows better
error reporting, particularly in cases such as "short read" or "short write"
which Peter was complaining of.  Also, remove the kluge of allowing mdread()
to return zeroes from a read-beyond-EOF: this is now an error condition
except when InRecovery or zero_damaged_pages = true.  (Hash indexes used to
require that behavior, but no more.)  Also, enforce that mdwrite() is to be
used for rewriting existing blocks while mdextend() is to be used for
extending the relation EOF.  This restriction lets us get rid of the old
ad-hoc defense against creating huge files by an accidental reference to
a bogus block number: we'll only create new segments in mdextend() not
mdwrite() or mdread().  (Again, when InRecovery we allow it anyway, since
we need to allow updates of blocks that were later truncated away.)
Also, clean up the original makeshift patch for bug #2737: move the
responsibility for padding relation segments to full length into md.c.
2007-01-03 18:11:01 +00:00
Tom Lane 5725b9d9af Support type modifiers for user-defined types, and pull most knowledge
about typmod representation for standard types out into type-specific
typmod I/O functions.  Teodor Sigaev, with some editorialization by
Tom Lane.
2006-12-30 21:21:56 +00:00
Tom Lane 9aefd56669 Fix up btree's initial scankey processing to be able to detect redundant
or contradictory keys even in cross-data-type scenarios.  This is another
benefit of the opfamily rewrite: we can find the needed comparison
operators now.
2006-12-28 23:16:39 +00:00
Tom Lane a78fcfb512 Restructure operator classes to allow improved handling of cross-data-type
cases.  Operator classes now exist within "operator families".  While most
families are equivalent to a single class, related classes can be grouped
into one family to represent the fact that they are semantically compatible.
Cross-type operators are now naturally adjunct parts of a family, without
having to wedge them into a particular opclass as we had done originally.

This commit restructures the catalogs and cleans up enough of the fallout so
that everything still works at least as well as before, but most of the work
needed to actually improve the planner's behavior will come later.  Also,
there are not yet CREATE/DROP/ALTER OPERATOR FAMILY commands; the only way
to create a new family right now is to allow CREATE OPERATOR CLASS to make
one by default.  I owe some more documentation work, too.  But that can all
be done in smaller pieces once this infrastructure is in place.
2006-12-23 00:43:13 +00:00
Tom Lane 0cb91ccba9 Remove the logId/logSeg fields from pg_control, because they are not needed
in normal operation, and we can avoid rewriting pg_control at every log
segment switch if we don't insist that these values be valid.  Reducing
the number of pg_control updates is a good idea for both performance and
reliability.  It does make pg_resetxlog's life a bit harder, but that seems
a good tradeoff; and anyway the change to pg_resetxlog amounts to automating
something people formerly needed to do by hand, namely look at the existing
pg_xlog files to make sure the new WAL start point was past them.

In passing, change the wording of xlog.c's "database system was interrupted"
messages: describe the pg_control timestamp as "last known up at" rather than
implying it is the exact time of service interruption.  With this change the
timestamp will generally be the time of the last checkpoint, which could be
many minutes before the failure; and we've already seen indications that
people tend to misinterpret the old wording.

initdb forced due to change in pg_control layout.  Simon Riggs and Tom Lane
2006-12-08 19:50:53 +00:00
Neil Conway 886a02d1cb Add a txn_start column to pg_stat_activity. This makes it easier to
identify long-running transactions. Since we already need to record
the transaction-start time (e.g. for now()), we don't need any
additional system calls to report this information.

Catversion bumped, initdb required.
2006-12-06 18:06:48 +00:00
Tom Lane 5f60086e10 Minor adjustments to make failures in startup/shutdown behave more cleanly.
StartupXLOG and ShutdownXLOG no longer need to be critical sections, because
in all contexts where they are invoked, elog(ERROR) would be translated to
elog(FATAL) anyway.  (One change in bgwriter.c is needed to make this true:
set ExitOnAnyError before trying to exit.  This is a good fix anyway since
the existing code would have gone into an infinite loop on elog(ERROR) during
shutdown.)  That avoids a misleading report of PANIC during semi-orderly
failures.  Modify the postmaster to include the startup process in the set of
processes that get SIGTERM when a fast shutdown is requested, and also fix it
to not try to restart the bgwriter if the bgwriter fails while trying to write
the shutdown checkpoint.  Net result is that "pg_ctl stop -m fast" does
something reasonable for a system in warm standby mode, and so should Unix
system shutdown (ie, universal SIGTERM).  Per gripe from Stephen Harris and
some corner-case testing of my own.
2006-11-30 18:29:12 +00:00
Teodor Sigaev ef148d6b85 Fix bug with page deletion. If inner page is removed and it tries to
remove page on next level linked from next inner page, ginScanToDelete()
wrongly sets parent page. Bug reveals when many item pointers from index
was deleted ( several hundred thousands).

Bug is discovered by hubert depesz lubaczewski <depesz@gmail.com>

Suppose, we need rc2 before release...
2006-11-30 16:22:32 +00:00
Neil Conway 546d6848ca Add a comment noting that heap_copytuple_with_tuple() results in a
HeapTuple that is no longer allocated as a single palloc() block; if
used carelessly, this might result in a subsequent memory leak after
heap_freetuple().
2006-11-23 05:27:18 +00:00
Tom Lane 395249ecbe Several changes to reduce the probability of running out of memory during
AbortTransaction, which would lead to recursion and eventual PANIC exit
as illustrated in recent report from Jeff Davis.  First, in xact.c create
a special dedicated memory context for AbortTransaction to run in.  This
solves the problem as long as AbortTransaction doesn't need more than 32K
(or whatever other size we create the context with).  But in corner cases
it might.  Second, in trigger.c arrange to keep pending after-trigger event
records in separate contexts that can be freed near the beginning of
AbortTransaction, rather than having them persist until CleanupTransaction
as before.  Third, in portalmem.c arrange to free executor state data
earlier as well.  These two changes should result in backing off the
out-of-memory condition before AbortTransaction needs any significant
amount of memory, at least in typical cases such as memory overrun due
to too many trigger events or too big an executor hash table.  And all
the same for subtransaction abort too, of course.
2006-11-23 01:14:59 +00:00
Tom Lane 3ad0728c81 On systems that have setsid(2) (which should be just about everything except
Windows), arrange for each postmaster child process to be its own process
group leader, and deliver signals SIGINT, SIGTERM, SIGQUIT to the whole
process group not only the direct child process.  This provides saner behavior
for archive and recovery scripts; in particular, it's possible to shut down a
warm-standby recovery server using "pg_ctl stop -m immediate", since delivery
of SIGQUIT to the startup subprocess will result in killing the waiting
recovery_command.  Also, this makes Query Cancel and statement_timeout apply
to scripts being run from backends via system().  (There is no support in the
core backend for that, but it's widely done using untrusted PLs.)  Per gripe
from Stephen Harris and subsequent discussion.
2006-11-21 20:59:53 +00:00
Tom Lane d68efb3f8d Repair problems with hash indexes that span multiple segments: the hash code's
preference for filling pages out-of-order tends to confuse the sanity checks
in md.c, as per report from Balazs Nagy in bug #2737.  The fix is to ensure
that the smgr-level code always has the same idea of the logical EOF as the
hash index code does, by using ReadBuffer(P_NEW) where we are adding a single
page to the end of the index, and using smgrextend() to reserve a large batch
of pages when creating a new splitpoint.  The patch is a bit ugly because it
avoids making any changes in md.c, which seems the most prudent approach for a
backpatchable beta-period fix.  After 8.3 development opens, I'll take a look
at a cleaner but more invasive patch, in particular getting rid of the now
unnecessary hack to allow reading beyond EOF in mdread().

Backpatch as far as 7.4.  The bug likely exists in 7.3 as well, but because
of the magnitude of the 7.3-to-7.4 changes in hash, the later-version patch
doesn't even begin to apply.  Given the other known bugs in the 7.3-era hash
code, it does not seem worth trying to develop a separate patch for 7.3.
2006-11-19 21:33:23 +00:00
Tom Lane 4f335a3d7f Repair two related errors in heap_lock_tuple: it was failing to recognize
cases where we already hold the desired lock "indirectly", either via
membership in a MultiXact or because the lock was originally taken by a
different subtransaction of the current transaction.  These cases must be
accounted for to avoid needless deadlocks and/or inappropriate replacement of
an exclusive lock with a shared lock.  Per report from Clarence Gardner and
subsequent investigation.
2006-11-17 18:00:15 +00:00
Peter Eisentraut e138b80996 String fix 2006-11-16 14:28:41 +00:00
Neil Conway dc10387eb1 Fix some typos in comments. 2006-11-12 06:55:54 +00:00
Tom Lane a46ca619f8 Suppress a few 'uninitialized variable' warnings that gcc emits only at
-O3 or higher (presumably because it inlines more things).  Per gripe
from Mark Mielke.
2006-11-11 01:14:19 +00:00
Tom Lane 792d6edd5b Clean up some misleading references to %p being a full path, per Simon. 2006-11-10 22:32:20 +00:00
Tom Lane dcbdf9b1d4 Change Windows rename and unlink substitutes so that they time out after
30 seconds instead of retrying forever.  Also modify xlog.c so that if
it fails to rename an old xlog segment up to a future slot, it will
unlink the segment instead.  Per discussion of bug #2712, in which it
became apparent that Windows can handle unlinking a file that's being
held open, but not renaming it.
2006-11-08 20:12:05 +00:00
Tom Lane 48188e1621 Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios.  We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases.  Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId.  Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done.  Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database.  initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs.  Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 22:42:10 +00:00
Tom Lane 70ce5c9082 Fix "failed to re-find parent key" btree VACUUM failure by revising page
deletion code to avoid the case where an upper-level btree page remains "half
dead" for a significant period of time, and to block insertions into a key
range that is in process of being re-assigned to the right sibling of the
deleted page's parent.  This prevents the scenario reported by Ed L. wherein
index keys could become out-of-order in the grandparent index level.

Since this is a moderately invasive fix, I'm applying it only to HEAD.
The bug exists back to 7.4, but the back branches will get a different patch.
2006-11-01 19:43:17 +00:00
Tom Lane 1e758d5263 Add some code to CREATE DATABASE to check for pre-existing subdirectories
that conflict with the OID that we want to use for the new database.
This avoids the risk of trying to remove files that maybe we shouldn't
remove.  Per gripe from Jon Lapham and subsequent discussion of 27-Sep.
2006-10-18 22:44:12 +00:00
Peter Eisentraut b9b4f10b5b Message style improvements 2006-10-06 17:14:01 +00:00
Tom Lane 378c79dc78 Cleanup for pglz_compress code: remove dead code, const-ify API of
remaining functions, simplify pglz_compress's API to not require a useless
data copy when compression fails.  Also add a check in pglz_decompress that
the expected amount of data was decompressed.
2006-10-05 23:33:33 +00:00
Tom Lane e378f82e00 Make use of qsort_arg in several places that were formerly using klugy
static variables.  This avoids any risk of potential non-reentrancy,
and in particular offers a much cleaner workaround for the Intel compiler
bug that was affecting ginutil.c.
2006-10-05 17:57:40 +00:00
Bruce Momjian f99a569a2e pgindent run for 8.2. 2006-10-04 00:30:14 +00:00
Bruce Momjian 45c8ed96b9 Make some sentences consistent with similar ones.
Euler Taveira de Oliveira
2006-10-03 21:21:36 +00:00
Alvaro Herrera 4650c4fdb9 Degrade the transaction-id wraparound point message from LOG to DEBUG1, per
discussion.

Patch from Simon Riggs.
2006-09-26 17:21:39 +00:00
Tom Lane 9e936693a9 Fix free space map to correctly track the total amount of FSM space needed
even when a single relation requires more than max_fsm_pages pages.  Also,
make VACUUM emit a warning in this case, since it likely means that VACUUM
FULL or other drastic corrective measure is needed.  Per reports from Jeff
Frost and others of unexpected changes in the claimed max_fsm_pages need.
2006-09-21 20:31:22 +00:00
Teodor Sigaev deb66e013c Improve error message. Per discussion
http://archives.postgresql.org/pgsql-general/2006-09/msg00186.php
2006-09-14 11:26:49 +00:00
Bruce Momjian d18768867e Remove unnecessary brace pair. 2006-09-10 23:33:22 +00:00
Tom Lane f5b4d9a9e0 If we're going to advertise the array overlap/containment operators,
we probably should make them work reliably for all arrays.  Fix code
to handle NULLs and multidimensional arrays, move it into arrayfuncs.c.
GIN is still restricted to indexing arrays with no null elements, however.
2006-09-10 20:14:20 +00:00
Tom Lane ba920e1c91 Rename contains/contained-by operators to @> and <@, per discussion that
agreed these symbols are less easily confused.  I made new pg_operator
entries (with new OIDs) for the old names, so as to provide backward
compatibility while making it pretty easy to remove the old names in
some future release cycle.  This commit only touches the core datatypes,
contrib will be fixed separately.
2006-09-10 00:29:35 +00:00
Teodor Sigaev 889ec4b998 Fix Intel compiler bug. Per discussion
'GIN FailedAssertions on Itanium2 with Intel compiler' in pgsql-hackers,
http://archives.postgresql.org/pgsql-hackers/2006-08/msg01914.php
2006-09-05 18:25:10 +00:00
Tom Lane 8fad2e3ff4 Arrange for GetSnapshotData to copy live-subtransaction XIDs from the
PGPROC array into snapshots, and use this information to avoid visits
to pg_subtrans in HeapTupleSatisfiesSnapshot.  This appears to solve
the pg_subtrans-related context swap storm problem that's been reported
by several people for 8.1.  While at it, modify GetSnapshotData to not
take an exclusive lock on ProcArrayLock, as closer analysis shows that
shared lock is always sufficient.
Itagaki Takahiro and Tom Lane
2006-09-03 15:59:39 +00:00
Teodor Sigaev b681bfdd59 Fix BUG #2594: Gin Indexes cause server to crash when it builds on empty table 2006-08-29 14:05:44 +00:00
Tom Lane ca1fd0ea5b Move xact.c's partial support for Lists of TransactionIds into pg_list.h.
Needed because lock.c is now going to use the same type of list.
2006-08-27 19:11:46 +00:00
Tom Lane e093dcdd28 Add the ability to create indexes 'concurrently', that is, without
blocking concurrent writes to the table.  Greg Stark, with a little help
from Tom Lane.
2006-08-25 04:06:58 +00:00
Tom Lane 08ae5edc5c Optimize the case where a btree indexscan has current and mark positions
on the same index page; we can avoid data copying as well as buffer refcount
manipulations in this common case.  Makes for a small but noticeable
improvement in mergejoin speed.

Heikki Linnakangas
2006-08-24 01:18:34 +00:00
Tom Lane 35af5422f6 Make the server track an 'XID epoch', that is, maintain higher-order bits
of the transaction ID counter.  Nothing is done with the epoch except to
store it in checkpoint records, but this provides a foundation with which
add-on code can pretend that XIDs never wrap around.  This is a severely
trimmed and rewritten version of the xxid patch submitted by Marko Kreen.
Per discussion, the epoch counter seems the only part of xxid that really
needs to be in the core server.
2006-08-21 16:16:31 +00:00
Tom Lane 7aa772f03e Now that we've rearranged relation open to get a lock before touching
the rel, it's easy to get rid of the narrow race-condition window that
used to exist in VACUUM and CLUSTER.  Did some minor code-beautification
work in the same area, too.
2006-08-18 16:09:13 +00:00
Tom Lane e8ea9e9587 Implement archive_timeout feature to force xlog file switches to occur no more
than N seconds apart.  This allows a simple, if not very high performance,
means of guaranteeing that a PITR archive is no more than N seconds behind
real time.  Also make pg_current_xlog_location return the WAL Write pointer,
add pg_current_xlog_insert_location to return the Insert pointer, and fix
pg_xlogfile_name_offset to return its results as a two-element record instead
of a smashed-together string, as per recent discussion.

Simon Riggs
2006-08-17 23:04:10 +00:00
Tom Lane 7a3e30e608 Add INSERT/UPDATE/DELETE RETURNING, with basic docs and regression tests.
plpgsql support to come later.  Along the way, convert execMain's
SELECT INTO support into a DestReceiver, in order to eliminate some ugly
special cases.

Jonah Harris and Tom Lane
2006-08-12 02:52:06 +00:00
Tom Lane e002836913 Make recovery from WAL be restartable, by executing a checkpoint-like
operation every so often.  This improves the usefulness of PITR log
shipping for hot standby: formerly, if the standby server crashed, it
was necessary to restart it from the last base backup and replay all
the WAL since then.  Now it will only need to reread about the same
amount of WAL as the master server would.  The behavior might also
come in handy during a long PITR replay sequence.  Simon Riggs,
with some editorialization by Tom Lane.
2006-08-07 16:57:57 +00:00
Tom Lane 704ddaaa09 Add support for forcing a switch to a new xlog file; cause such a switch
to happen automatically during pg_stop_backup().  Add some functions for
interrogating the current xlog insertion point and for easily extracting
WAL filenames from the hex WAL locations displayed by pg_stop_backup
and friends.  Simon Riggs with some editorialization by Tom Lane.
2006-08-06 03:53:44 +00:00
Tom Lane bc8ac3ce40 Add missing pgstat_count_index_scan(), per Andreas Seltenreich. 2006-08-03 15:22:09 +00:00
Tom Lane 09d3670df3 Change the relation_open protocol so that we obtain lock on a relation
(table or index) before trying to open its relcache entry.  This fixes
race conditions in which someone else commits a change to the relation's
catalog entries while we are in process of doing relcache load.  Problems
of that ilk have been reported sporadically for years, but it was not
really practical to fix until recently --- for instance, the recent
addition of WAL-log support for in-place updates helped.

Along the way, remove pg_am.amconcurrent: all AMs are now expected to support
concurrent update.
2006-07-31 20:09:10 +00:00
Alvaro Herrera 92c2ecc130 Modify snapshot definition so that lazy vacuums are ignored by other
vacuums.  This allows a OLTP-like system with big tables to continue
regular vacuuming on small-but-frequently-updated tables while the
big tables are being vacuumed.

Original patch from Hannu Krossing, rewritten by Tom Lane and updated
by me.
2006-07-30 02:07:18 +00:00
Tom Lane e6284649b9 Modify btree to delete known-dead index entries without an actual VACUUM.
When we are about to split an index page to do an insertion, first look
to see if any entries marked LP_DELETE exist on the page, and if so remove
them to try to make enough space for the desired insert.  This should reduce
index bloat in heavily-updated tables, although of course you still need
VACUUM eventually to clean up the heap.

Junji Teramoto
2006-07-25 19:13:00 +00:00
Peter Eisentraut e9b4969062 DTrace support, with a small initial set of probes
by Robert Lor
2006-07-24 16:32:45 +00:00
Tom Lane 9dc842f083 Don't try to truncate multixact SLRU files in checkpoints done during xlog
recovery.  In the first place, it doesn't work because slru's
latest_page_number isn't set up yet (this is why we've been hearing reports
of strange "apparent wraparound" log messages during crash recovery, but
only from people who'd managed to advance their next-mxact counters some
considerable distance from 0).  In the second place, it seems a bit unwise
to be throwing away data during crash recovery anwyway.  This latter
consideration convinces me to just disable truncation during recovery,
rather than computing latest_page_number and pushing ahead.
2006-07-20 00:46:42 +00:00
Tom Lane c36418be40 Fix getDatumCopy(): don't use store_att_byval to copy into a Datum
variable (this accounts for regression failures on PPC64, and in fact
won't work on any big-endian machine).  Get rid of hardwired knowledge
about datum size rules; make it look just like datumCopy().
2006-07-16 00:54:22 +00:00
Tom Lane e040ab44e4 Improve error message wording. 2006-07-16 00:52:05 +00:00
Tom Lane 98bac16e4d Fix misguided removal of access/tuptoaster.h inclusion, per Kris Jurka.
I'm going to insist on reversion of this entire patch unless pgrminclude
is upgraded to a less broken state, but in the meantime let's get contrib
passing regression again.
2006-07-14 19:05:52 +00:00
Bruce Momjian e0522505bd Remove 576 references of include files that were not needed. 2006-07-14 14:52:27 +00:00
Bruce Momjian a22d76d96a Allow include files to compile own their own.
Strip unused include files out unused include files, and add needed
includes to C files.

The next step is to remove unused include files in C files.
2006-07-13 16:49:20 +00:00
Tom Lane d29b66882a Tweak fillfactor code as per my recent proposal. Fix nbtsort.c so that
it can handle small fillfactors for ordinary-sized index entries without
failing on large ones; fix nbtinsert.c to distinguish leaf and nonleaf
pages; change the minimum fillfactor to 10% for all index types.
2006-07-11 21:05:57 +00:00
Teodor Sigaev 001d30ee6b Add support to GIN for =(anyarray,anyarray) operation 2006-07-11 19:49:14 +00:00
Bruce Momjian ac230e7431 Alphabetically order reference to include files, "S"-"Z". 2006-07-11 18:26:11 +00:00
Bruce Momjian 0ff3461bcc Alphabetically order reference to include files, "N" - "S". 2006-07-11 17:26:59 +00:00
Bruce Momjian 3a534ade39 Alphabetically order reference to include files, "G" - "M". 2006-07-11 17:04:13 +00:00
Teodor Sigaev 234163649e GIN improvements
- Replace sorted array of entries in maintenance_work_mem to binary tree,
  this should improve create performance.
- More precisely calculate allocated memory, eliminate leaks
  with user-defined extractValue()
- Improve wordings in tsearch2
2006-07-11 16:55:34 +00:00
Alvaro Herrera d4cef0aa2a Improve vacuum code to track minimum Xids per table instead of per database.
To this end, add a couple of columns to pg_class, relminxid and relvacuumxid,
based on which we calculate the pg_database columns after each vacuum.

We now force all databases to be vacuumed, even template ones.  A backend
noticing too old a database (meaning pg_database.datminxid is in danger of
falling behind Xid wraparound) will signal the postmaster, which in turn will
start an autovacuum iteration to process the offending database.  In principle
this is only there to cope with frozen (non-connectable) databases without
forcing users to set them to connectable, but it could force regular user
database to go through a database-wide vacuum at any time.  Maybe we should
warn users about this somehow.  Of course the real solution will be to use
autovacuum all the time ;-)

There are some additional improvements we could have in this area: for example
the vacuum code could be smarter about not updating pg_database for each table
when called by autovacuum, and do it only once the whole autovacuum iteration
is done.

I updated the system catalogs documentation, but I didn't modify the
maintenance section.  Also having some regression tests for this would be nice
but it's not really a very straightforward thing to do.

Catalog version bumped due to system catalog changes.
2006-07-10 16:20:52 +00:00
Tom Lane b7b78d24f7 Code review for FILLFACTOR patch. Change WITH grammar as per earlier
discussion (including making def_arg allow reserved words), add missed
opt_definition for UNIQUE case.  Put the reloptions support code in a less
random place (I chose to make a new file access/common/reloptions.c).
Eliminate header inclusion creep.  Make the index options functions safely
user-callable (seems like client apps might like to be able to test validity
of options before trying to make an index).  Reduce overhead for normal case
with no options by allowing rd_options to be NULL.  Fix some unmaintainably
klugy code, including getting rid of Natts_pg_class_fixed at long last.
Some stylistic cleanup too, and pay attention to keeping comments in sync
with code.

Documentation still needs work, though I did fix the omissions in
catalogs.sgml and indexam.sgml.
2006-07-03 22:45:41 +00:00
Bruce Momjian 277807bd9e Add FILLFACTOR to CREATE INDEX.
ITAGAKI Takahiro
2006-07-02 02:23:23 +00:00
Teodor Sigaev 783a73168b Forget to add new file :(( 2006-06-28 12:08:35 +00:00
Teodor Sigaev 1f7ef548ec Changes
* new split algorithm (as proposed in http://archives.postgresql.org/pgsql-hackers/2006-06/msg00254.php)
  * possible call pickSplit() for second and below columns
  * add spl_(l|r)datum_exists to GIST_SPLITVEC -
    pickSplit should check its values to use already defined
    spl_(l|r)datum for splitting. pickSplit should set
    spl_(l|r)datum_exists to 'false' (if they was 'true') to
    signal to caller about using spl_(l|r)datum.
  * support for old pickSplit(): not very optimal
    but correct split
* remove 'bytes' field from GISTENTRY: in any case size of
  value is defined by it's type.
* split GIST_SPLITVEC to two structures: one for using in picksplit
  and second - for internal use.
* some code refactoring
* support of subsplit to rtree opclasses

TODO: add support of subsplit to contrib modules
2006-06-28 12:00:14 +00:00
Tom Lane 3c71244b74 Put #ifdef NOT_USED around posix_fadvise call. We may want to resurrect
this someday, but right now it seems that posix_fadvise is immature to
the point of being broken on many platforms ... and we don't have any
benchmark evidence proving it's worth spending time on.
2006-06-27 18:59:17 +00:00
Tom Lane cdd5178c69 Extend the MinimalTuple concept to tuplesort.c, thereby reducing the
per-tuple space overhead for sorts in memory.  I chose to replace the
previous patch that tried to write out the bare minimum amount of data
when sorting on disk; instead, just dump the MinimalTuples as-is.  This
wastes 3 to 10 bytes per tuple depending on architecture and null-bitmap
length, but the simplification in the writetup/readtup routines seems
worth it.
2006-06-27 16:53:02 +00:00
Tom Lane 3f50ba27cf Create infrastructure for 'MinimalTuple' representation of in-memory
tuples with less header overhead than a regular HeapTuple, per my
recent proposal.  Teach TupleTableSlot code how to deal with these.
As proof of concept, change tuplestore.c to store MinimalTuples instead
of HeapTuples.  Future patches will expand the concept to other places
where it is useful.
2006-06-27 02:51:40 +00:00
Tom Lane 3a04f53e7f pg_stop_backup was calling XLogArchiveNotify() twice for the newly created
backup history file.  Bug introduced by the 8.1 change to make pg_stop_backup
delete older history files.  Per report from Masao Fujii.
2006-06-22 20:42:57 +00:00
Tom Lane 27c3e3de09 Remove redundant gettimeofday() calls to the extent practical without
changing semantics too much.  statement_timestamp is now set immediately
upon receipt of a client command message, and the various places that used
to do their own gettimeofday() calls to mark command startup are referenced
to that instead.  I have also made stats_command_string use that same
value for pg_stat_activity.query_start for both the command itself and
its eventual replacement by <IDLE> or <idle in transaction>.  There was
some debate about that, but no argument that seemed convincing enough to
justify an extra gettimeofday() call.
2006-06-20 22:52:00 +00:00
Tom Lane 1e8ae13640 Don't try to call posix_fadvise() unless <fcntl.h> supplies a declaration
for it.  Hopefully will fix core dump evidenced by some buildfarm members
since fadvise patch went in.  The actual definition of the function is not
ABI-compatible with compiler's default assumption in the absence of any
declaration, so it's clearly unsafe to try to call it without seeing a
declaration.
2006-06-18 18:30:21 +00:00
Tom Lane 06e10abc0b Fix problems with cached tuple descriptors disappearing while still in use
by creating a reference-count mechanism, similar to what we did a long time
ago for catcache entries.  The back branches have an ugly solution involving
lots of extra copies, but this way is more efficient.  Reference counting is
only applied to tupdescs that are actually in caches --- there seems no need
to use it for tupdescs that are generated in the executor, since they'll go
away during plan shutdown by virtue of being in the per-query memory context.
Neil Conway and Tom Lane
2006-06-16 18:42:24 +00:00
Bruce Momjian 40bc06fa16 Test for POSIX_FADV_DONTNEED to use posix_fadvise(). 2006-06-16 04:11:48 +00:00
Bruce Momjian 94a5c4a01b Use posix_fadvise() to avoid kernel caching of WAL contents on WAL file
close.

ITAGAKI Takahiro
2006-06-15 19:15:00 +00:00
Teodor Sigaev b32000eda4 Som improve page split in multicolumn GiST index.
If user picksplit on n-th column generate equals
left and right unions then it calls picksplit on n+1-th
column.
2006-05-29 12:50:06 +00:00
Teodor Sigaev 0a6fde5a26 Correct cheking in findParents(). i
From Andreas Seltenreich <andreas+pg@gate450.dyndns.org>
2006-05-29 08:39:44 +00:00
Alvaro Herrera 3d58a1c168 Remove traces of otherwise unused RELKIND_SPECIAL symbol. Leave the psql bits
in place though, so that it plays nicely with older servers.

Per discussion.
2006-05-28 02:27:08 +00:00
Teodor Sigaev 5d1a066e64 Fix findParents() in case of multiple levels to find.
By Andreas Seltenreich <andreas+pg@gate450.dyndns.org>
2006-05-26 08:01:17 +00:00
Teodor Sigaev d2158b0281 * Add support NULL to GiST.
* some refactoring and simplify code int gistutil.c and gist.c
* now in some cases it can be called used-defined
  picksplit method for non-first column in index, but here
	is a place to do more.
* small fix of docs related to support NULL.
2006-05-24 11:01:39 +00:00
Teodor Sigaev 09518fbdf4 Call MarkBufferDirty() before XLogInsert() during completion of insert 2006-05-19 17:15:41 +00:00
Teodor Sigaev 420cbff881 Simplify gistSplit() and some refactoring related code. 2006-05-19 16:15:17 +00:00
Teodor Sigaev 5890790b4a Rework completion of incomplete inserts. Now it writes
WAL log during inserts.
2006-05-19 11:10:25 +00:00
Teodor Sigaev 8876e37d07 Reduce size of critial section during vacuum full, critical
sections now isn't nested. All user-defined functions now is
called outside critsections. Small improvements in WAL
protocol.

TODO: improve XLOG replay
2006-05-17 16:34:59 +00:00
Tom Lane 3fdeb189e9 Clean up code associated with updating pg_class statistics columns
(relpages/reltuples).  To do this, create formal support in heapam.c for
"overwrite" tuple updates (including xlog replay capability) and use that
instead of the ad-hoc overwrites we'd been using in VACUUM and CREATE INDEX.
Take the responsibility for updating stats during CREATE INDEX out of the
individual index AMs, and do it where it belongs, in catalog/index.c.  Aside
from being more modular, this avoids having to update the same tuple twice in
some paths through CREATE INDEX.  It's probably not measurably faster, but
for sure it's a lot cleaner than before.
2006-05-10 23:18:39 +00:00
Teodor Sigaev 10dd8df68e Reduce size of critical section and remove call of user-defined functions in
insertion and deletion, modify gistSplit() to do not use buffers.

 TODO: gistvacuumcleanup and XLOG
2006-05-10 09:19:54 +00:00
Tom Lane 5749f6ef0c Rewrite btree vacuuming to fold the former bulkdelete and cleanup operations
into a single mostly-physical-order scan of the index.  This requires some
ticklish interlocking considerations, but should create no material
performance impact on normal index operations (at least given the
already-committed changes to make scans work a page at a time).  VACUUM
itself should get significantly faster in any index that's degenerated to a
very nonlinear page order.  Also, we save one pass over the index entirely,
except in the case where there were no deletions to do and so only one pass
happened anyway.

Original patch by Heikki Linnakangas, rework by Tom Lane.
2006-05-08 00:00:17 +00:00
Tom Lane 09cb5c0e7d Rewrite btree index scans to work a page at a time in all cases (both
btgettuple and btgetmulti).  This eliminates the problem of "re-finding" the
exact stopping point, since the stopping point is effectively always a page
boundary, and index items are never moved across pre-existing page boundaries.
A small penalty is that the keys_are_unique optimization is effectively
disabled (and, therefore, is removed in this patch), causing us to apply
_bt_checkkeys() to at least one more tuple than necessary when looking up a
unique key.  However, the advantages for non-unique cases seem great enough to
accept this tradeoff.  Aside from simplifying and (sometimes) speeding up the
indexscan code, this will allow us to reimplement btbulkdelete as a largely
sequential scan instead of index-order traversal, thereby significantly
reducing the cost of VACUUM.  Those changes will come in a separate patch.

Original patch by Heikki Linnakangas, rework by Tom Lane.
2006-05-07 01:21:30 +00:00
Teodor Sigaev 2a58f3bff6 Fix typo noticed by Alvaro Herrera 2006-05-03 06:56:47 +00:00
Tom Lane e57345975c Clean up API for ambulkdelete/amvacuumcleanup as per today's discussion.
This formulation requires every AM to provide amvacuumcleanup, unlike before,
but it's surely a whole lot cleaner.  Also, add an 'amstorage' column to
pg_am so that we can get rid of hardwired knowledge in DefineOpClass().
2006-05-02 22:25:10 +00:00
Tom Lane 67030eec1e Suppress some gcc warnings. 2006-05-02 15:48:11 +00:00
Teodor Sigaev 8a3631f8d8 GIN: Generalized Inverted iNdex.
text[], int4[], Tsearch2 support for GIN.
2006-05-02 11:28:56 +00:00
Tom Lane d2896a9ed1 Arrange to cache btree metapage data in the relcache entry for the index,
thereby saving a visit to the metapage in most index searches/updates.
This wouldn't actually save any I/O (since in the old regime the metapage
generally stayed in cache anyway), but it does provide a useful decrease
in bufmgr traffic in high-contention scenarios.  Per my recent proposal.
2006-04-25 22:46:05 +00:00
Bruce Momjian e6004f0151 Add statement_timestamp(), clock_timestamp(), and
transaction_timestamp() (just like now()).

Also update statement_timeout() to mention it is statement arrival time
that is measured.

Catalog version updated.
2006-04-25 00:25:22 +00:00
Tom Lane eac825aa68 Ensure that we validate the page header of the first page of a WAL file
whenever we start to read within that file.  The first page carries
extra identification information that really ought to be checked, but
as the code stood, this was only checked when we switched sequentially
into a new WAL file, or if by chance the starting checkpoint record was
within the first page.  This patch ensures that we will detect bogus
'long header' information before we start replaying the WAL sequence.
2006-04-20 04:07:38 +00:00
Tom Lane 0a87394956 Fix the torn-page hazard for PITR base backups by forcing full page writes
to occur between pg_start_backup() and pg_stop_backup(), even if the GUC
setting full_page_writes is OFF.  Per discussion, doing this in combination
with the already-existing checkpoint during pg_start_backup() should ensure
safety against partial page updates being included in the backup.  We do
not have to force full page writes to occur during normal PITR operation,
as I had first feared.
2006-04-17 18:55:05 +00:00
Tom Lane defe93463c Make the world safe for full_page_writes. Allow XLOG records that try to
update no-longer-existing pages to fall through as no-ops, but make a note
of each page number referenced by such records.  If we don't see a later
XLOG entry dropping the table or truncating away the page, complain at
the end of XLOG replay.  Since this fixes the known failure mode for
full_page_writes = off, revert my previous band-aid patch that disabled
that GUC variable.
2006-04-14 20:27:24 +00:00
Tom Lane 49a7610c36 Fix an ancient oversight in btree xlog replay. When trying to determine if an
upper-level insertion completes a previously-seen split, we cannot simply grab
the downlink block number out of the buffer, because the buffer could contain
a later state of the page --- or perhaps the page doesn't even exist at all
any more, due to relation truncation.  These possibilities have been masked up
to now because the use of full_page_writes effectively ensured that no xlog
replay routine ever actually saw a page state newer than its own change.
Since we're deprecating full_page_writes in 8.1.*, there's no need to fix this
in existing release branches, but we need a fix in HEAD if we want to have any
hope of re-allowing full_page_writes.  Accordingly, adjust the contents of
btree WAL records so that we can always get the downlink block number from the
WAL record rather than having to depend on buffer contents.  Per report from
Kevin Grittner and Peter Brant.

Improve a few comments in related code while at it.
2006-04-13 03:53:05 +00:00
Tom Lane 7fdb4305db Fix a bunch of problems with domains by making them use special input functions
that apply the necessary domain constraint checks immediately.  This fixes
cases where domain constraints went unchecked for statement parameters,
PL function local variables and results, etc.  We can also eliminate existing
special cases for domains in places that had gotten it right, eg COPY.

Also, allow domains over domains (base of a domain is another domain type).
This almost worked before, but was disallowed because the original patch
hadn't gotten it quite right.
2006-04-05 22:11:58 +00:00
Tom Lane 09b5271ebd Add a field to the first page of each WAL file to indicate the
XLOG_BLCKSZ.  This ought to help in preventing configuration mismatch
problems if anyone tries to ship PITR files between servers compiled
with different XLOG_BLCKSZ settings.  Simon Riggs
2006-04-05 03:34:05 +00:00
Tom Lane e6140d9052 Don't use BLCKSZ for the physical length of the pg_control file, but
instead a dedicated symbol.  This probably makes no functional difference
for likely values of BLCKSZ, but it makes the intent clearer.
Simon Riggs, minor editorialization by Tom Lane.
2006-04-04 22:39:59 +00:00
Tom Lane 147d4bf3e5 Modify all callers of datatype input and receive functions so that if these
functions are not strict, they will be called (passing a NULL first parameter)
during any attempt to input a NULL value of their datatype.  Currently, all
our input functions are strict and so this commit does not change any
behavior.  However, this will make it possible to build domain input functions
that centralize checking of domain constraints, thereby closing numerous holes
in our domain support, as per previous discussion.

While at it, I took the opportunity to introduce convenience functions
InputFunctionCall, OutputFunctionCall, etc to use in code that calls I/O
functions.  This eliminates a lot of grotty-looking casts, but the main
motivation is to make it easier to grep for these places if we ever need
to touch them again.
2006-04-04 19:35:37 +00:00
Tom Lane eaef111396 Define a separately configurable XLOG_BLCKSZ symbol for the page size
used within WAL files.  Historically this was the same as the data file
BLCKSZ, but there's no necessary connection, and it's possible that
performance gains might ensue from reducing XLOG_BLCKSZ.  In any case
distinguishing two symbols should improve code clarity.  This commit
does not actually change the page size, only provide the infrastructure
to make it possible to do so.  initdb forced because of addition of a
field to pg_control.
Mark Wong, with some help from Simon Riggs and Tom Lane.
2006-04-03 23:35:05 +00:00
Tom Lane c9a2b6d4ca Fix thinko in gistRedoPageUpdateRecord: if XLR_BKP_BLOCK_1 is set, we
don't have anything to do to the page, but we still have to adjust the
incomplete_inserts list that we're maintaining in memory.
2006-04-03 16:45:50 +00:00
Teodor Sigaev 8d02b15e33 Eliminate ajust scan code. Since concurrent GiST it doesn't
do real work. That was missed during concurrence development.
2006-04-03 13:44:33 +00:00
Tom Lane 89bda95d82 Remove the 'slow' path for btree index build, which built the btree
incrementally by successive inserts rather than by sorting the data.
We were only using the slow path during bootstrap, apparently because
when first written it failed during bootstrap --- but it works fine now
AFAICT.  Removing it saves a hundred or so lines of code and produces
noticeably (~10%) smaller initial states of the system catalog indexes.
While that won't make much difference for heavily-modified catalogs,
for the more static ones there may be a useful long-term performance
improvement.
2006-04-01 03:03:37 +00:00
Tom Lane a8b8f4db23 Clean up WAL/buffer interactions as per my recent proposal. Get rid of the
misleadingly-named WriteBuffer routine, and instead require routines that
change buffer pages to call MarkBufferDirty (which does exactly what it says).
We also require that they do so before calling XLogInsert; this takes care of
the synchronization requirement documented in SyncOneBuffer.  Note that
because bufmgr takes the buffer content lock (in shared mode) while writing
out any buffer, it doesn't matter whether MarkBufferDirty is executed before
the buffer content change is complete, so long as the content change is
completed before releasing exclusive lock on the buffer.  So it's OK to set
the dirtybit before we fill in the LSN.
This eliminates the former kluge of needing to set the dirtybit in LockBuffer.
Aside from making the code more transparent, we can also add some new
debugging assertions, in particular that the caller of MarkBufferDirty must
hold the buffer content lock, not merely a pin.
2006-03-31 23:32:07 +00:00
Tom Lane 89395bfa6f Improve gist XLOG code to follow the coding rules needed to prevent
torn-page problems.  This introduces some issues of its own, mainly
that there are now some critical sections of unreasonably broad scope,
but it's a step forward anyway.  Further cleanup will require some
code refactoring that I'd prefer to get Oleg and Teodor involved in.
2006-03-30 23:03:10 +00:00
Tom Lane 6d61cdec07 Clean up and document the API for XLogOpenRelation and XLogReadBuffer.
This commit doesn't make much functional change, but it does eliminate some
duplicated code --- for instance, PageIsNew tests are now done inside
XLogReadBuffer rather than by each caller.
The GIST xlog code still needs a lot of love, but I'll worry about that
separately.
2006-03-29 21:17:39 +00:00
Tom Lane 0a971e2f20 Disable full_page_writes, because turning it off risks causing crash-recovery
failures even when the hardware and OS did nothing wrong.  Per recent analysis
of a problem report from Alex Bahdushka.

For the moment I've just diked out the test of the parameter, rather than
removing the GUC infrastructure and documentation, in case we conclude that
there's something salvageable there.  There seems no chance of it being
resurrected in the 8.1 branch though.
2006-03-28 22:01:16 +00:00
Tom Lane 288551fc60 Repair longstanding error in btree xlog replay: XLogReadBuffer should be
passed extend = true whenever we are reading a page we intend to reinitialize
completely, even if we think the page "should exist".  This is because it
might indeed not exist, if the relation got truncated sometime after the
current xlog record was made and before the crash we're trying to recover
from.  These two thinkos appear to explain both of the old bug reports
discussed here:
http://archives.postgresql.org/pgsql-hackers/2005-05/msg01369.php
2006-03-28 21:17:23 +00:00
Tom Lane 0a20207060 Arrange to emit a description of the current XLOG record as error context
when an error occurs during xlog replay.  Also, replace the former risky
'write into a fixed-size buffer with no overflow detection' API for XLOG
record description routines; use an expansible StringInfo instead.  (The
latter accounts for most of the patch bulk.)

Qingqing Zhou
2006-03-24 04:32:13 +00:00
Tom Lane 2316013961 Clean up representation of function RTEs for functions returning RECORD.
The original coding stored the raw parser output (ColumnDef and TypeName
nodes) which was ugly, bulky, and wrong because it failed to create any
dependency on the referenced datatype --- and in fact would not track type
renamings and suchlike.  Instead store a list of column type OIDs in the
RTE.

Also fix up general failure of recordDependencyOnExpr to do anything sane
about recording dependencies on datatypes.  While there are many cases where
there will be an indirect dependency (eg if an operator returns a datatype,
the dependency on the operator is enough), we do have to record the datatype
as a separate dependency in examples like CoerceToDomain.

initdb forced because of change of stored rules.
2006-03-16 00:31:55 +00:00
Tom Lane 20ab467d76 Improve parser so that we can show an error cursor position for errors
during parse analysis, not only errors detected in the flex/bison stages.
This is per my earlier proposal.  This commit includes all the basic
infrastructure, but locations are only tracked and reported for errors
involving column references, function calls, and operators.  More could
be done later but this seems like a good set to start with.  I've also
moved the ReportSyntaxErrorPosition logic out of psql and into libpq,
which should make it available to more people --- even within psql this
is an improvement because warnings weren't handled by ReportSyntaxErrorPosition.
2006-03-14 22:48:25 +00:00
Tom Lane 9f6192490e Add a CHECK_FOR_INTERRUPTS() in _bt_buildadd(). This fixes problem
with not responding to query cancel during the last stage of btree index
creation.
2006-03-10 20:18:15 +00:00
Bruce Momjian f2f5b05655 Update copyright for 2006. Update scripts. 2006-03-05 15:59:11 +00:00
Neil Conway 8e5a10d46c This patch makes the error message strings throughout the backend
more compliant with the error message style guide. In particular,
errdetail should begin with a capital letter and end with a period,
whereas errmsg should not. I also fixed a few related issues in
passing, such as fixing the repeated misspelling of "lexeme" in
contrib/tsearch2 (per Tom's suggestion).
2006-03-01 06:30:32 +00:00
Neil Conway 737651f6be Cleanup the usage of ScanDirection: use the symbolic names for the
possible ScanDirection alternatives rather than magic numbers
(-1, 0, 1).  Also, use the ScanDirection macros in a few places
rather than directly checking whether `dir == ForwardScanDirection'
and the like. Per patch from James William Pye. His patch also
changed ScanDirection to be a "char" rather than an enum, which
I haven't applied.
2006-02-21 23:01:54 +00:00
Tom Lane 2d7f694729 Move btbulkdelete's vacuum_delay_point() call to a place in the loop where
we are not holding a buffer content lock; where it was, InterruptHoldoffCount
is positive and so we'd not respond to cancel signals as intended.  Also
add missing vacuum_delay_point() call in btvacuumcleanup.  This should fix
complaint from Evgeny Gridasov about failure to respond to SIGINT/SIGTERM
in a timely fashion (bug #2257).
2006-02-14 17:20:01 +00:00
Tom Lane 49758f4703 Add some missing vacuum_delay_point calls in GIST vacuuming. 2006-02-14 16:39:32 +00:00
Tom Lane d52a57fc30 Actually there's a better way to do this, which is to count tuples
during the vacuumcleanup scan that we're going to do anyway.  Should
save a few cycles (one calculation per page, not per tuple) as well
as not having to depend on assumptions about heap and index being
in step.
I think this could probably be made to work for GIST too, but that
code looks messy enough that I'm disinclined to try right now.
2006-02-12 00:18:17 +00:00
Tom Lane fd267c1ebc Skip ambulkdelete scan if there's nothing to delete and the index is not
partial.  None of the existing AMs do anything useful except counting
tuples when there's nothing to delete, and we can get a tuple count
from the heap as long as it's not a partial index.  (hash actually can
skip anyway because it maintains a tuple count in the index metapage.)
GIST is not currently able to exploit this optimization because, due to
failure to index NULLs, GIST is always effectively partial.  Possibly
we should fix that sometime.
Simon Riggs w/ some review by Tom Lane.
2006-02-11 23:31:34 +00:00
Bruce Momjian 77bb65d3fc Revert based on Tom's recommendation:
> Allow VACUUM to complete faster by avoiding scanning the indexes when no
> rows were removed from the heap by the VACUUM.
2006-02-11 17:14:09 +00:00
Bruce Momjian bf324946b3 Allow VACUUM to complete faster by avoiding scanning the indexes when no
rows were removed from the heap by the VACUUM.

Simon Riggs
2006-02-11 16:59:09 +00:00
Tom Lane 5997386a0a Remove the no-longer-useful HashItem/HashItemData level of structure.
Same motivation as for BTItem.
2006-01-25 23:26:11 +00:00
Tom Lane c389760c32 Remove the no-longer-useful BTItem/BTItemData level of structure, and
just refer to btree index entries as plain IndexTuples, which is what
they have been for a very long time.  This is mostly just an exercise
in removing extraneous notation, but it does save a palloc/pfree cycle
per index insertion.
2006-01-25 23:04:21 +00:00
Tom Lane 3a0a16cb7e Allow row comparisons to be used as indexscan qualifications.
This completes the project to upgrade our handling of row comparisons.
2006-01-25 20:29:24 +00:00
Tom Lane 7ccaf13a06 Instead of using a numberOfRequiredKeys count to distinguish required
and non-required keys in a btree index scan, mark the required scankeys
with private flag bits SK_BT_REQFWD and/or SK_BT_REQBKWD.  This seems
at least marginally clearer to me, and it eliminates a wired-into-the-
data-structure assumption that required keys are consecutive.  Even though
that assumption will remain true for the foreseeable future, having it
in there makes the code seem more complex than necessary.
2006-01-23 22:31:41 +00:00
Tom Lane c89a0dd3bb Repair longstanding bug in slru/clog logic: it is possible for two backends
to try to create a log segment file concurrently, but the code erroneously
specified O_EXCL to open(), resulting in a needless failure.  Before 7.4,
it was even a PANIC condition :-(.  Correct code is actually simpler than
what we had, because we can just say O_CREAT to start with and not need a
second open() call.  I believe this accounts for several recent reports of
hard-to-reproduce "could not create file ...: File exists" errors in both
pg_clog and pg_subtrans.
2006-01-21 04:38:21 +00:00
Tom Lane 73e3566078 Improve comments about btree's use of ScanKey data structures: there
are two basically different kinds of scankeys, and we ought to try harder
to indicate which is used in each place in the code.  I've chosen the names
"search scankey" and "insertion scankey", though you could make about
as good an argument for "operator scankey" and "comparison function
scankey".
2006-01-17 00:09:01 +00:00
Tom Lane f7ea931287 Some minor code cleanup, falling out from the removal of rtree. SK_NEGATE
isn't being used anywhere anymore, and there seems no point in a generic
index_keytest() routine when two out of three remaining access methods
aren't using it.  Also, add a comment documenting a convention for
letting access methods define private flag bits in ScanKey sk_flags.
There are no such flags at the moment but I'm thinking about changing
btree's handling of "required keys" to use flag bits in the keys
rather than a count of required key positions.  Also, if some AM did
still want SK_NEGATE then it would be reasonable to treat it as a private
flag bit.
2006-01-14 22:03:35 +00:00
Neil Conway fb627b76cc Cosmetic code cleanup: fix a bunch of places that used "return (expr);"
rather than "return expr;" -- the latter style is used in most of the
tree. I kept the parentheses when they were necessary or useful because
the return expression was complex.
2006-01-11 08:43:13 +00:00
Tom Lane afa8f1971a Add RelationOpenSmgr() calls to ensure rd_smgr is valid when we try to
use it.  While it normally has been opened earlier during btree index
build, testing shows that it's possible for the link to be closed again
if an sinval reset occurs while the index is being built.
2006-01-07 22:45:41 +00:00
Tom Lane 2d0475e480 Convert Assert checking for empty page into a regular test and elog.
The consequences of overwriting a non-empty page are bad enough that
we should not omit this test in production builds.
2006-01-06 00:15:50 +00:00
Peter Eisentraut 86c23a6eb2 Make all command-line options of postmaster and postgres the same. See
http://archives.postgresql.org/pgsql-hackers/2006-01/msg00151.php for the
complete plan.
2006-01-05 10:07:46 +00:00
Tom Lane 195f164228 Get rid of the SpinLockAcquire/SpinLockAcquire_NoHoldoff distinction
in favor of having just one set of macros that don't do HOLD/RESUME_INTERRUPTS
(hence, these correspond to the old SpinLockAcquire_NoHoldoff case).
Given our coding rules for spinlock use, there is no reason to allow
CHECK_FOR_INTERRUPTS to be done while holding a spinlock, and also there
is no situation where ImmediateInterruptOK will be true while holding a
spinlock.  Therefore doing HOLD/RESUME_INTERRUPTS while taking/releasing a
spinlock is just a waste of cycles.  Qingqing Zhou and Tom Lane.
2005-12-29 18:08:05 +00:00
Tom Lane ab51bbaa06 Arrange to set the LC_XXX environment variables to match our locale
setup.  This protects against undesired changes in locale behavior
if someone carelessly does setlocale(LC_ALL, "") (and we know who
you are, perl guys).
2005-12-28 23:22:51 +00:00
Bruce Momjian 261114a23f I have added these macros to c.h:
#define HIGHBIT                 (0x80)
        #define IS_HIGHBIT_SET(ch)      ((unsigned char)(ch) & HIGHBIT)

and removed CSIGNBIT and mapped it uses to HIGHBIT.  I have also added
uses for IS_HIGHBIT_SET where appropriate.  This change is
purely for code clarity.
2005-12-25 02:14:19 +00:00
Tom Lane 656beff590 Adjust string comparison so that only bitwise-equal strings are considered
equal: if strcoll claims two strings are equal, check it with strcmp, and
sort according to strcmp if not identical.  This fixes inconsistent
behavior under glibc's hu_HU locale, and probably under some other locales
as well.  Also, take advantage of the now-well-defined behavior to speed up
texteq, textne, bpchareq, bpcharne: they may as well just do a bitwise
comparison and not bother with strcoll at all.

NOTE: affected databases may need to REINDEX indexes on text columns to be
sure they are self-consistent.
2005-12-22 22:50:00 +00:00
Tom Lane ec0baf949e Divide the lock manager's shared state into 'partitions', so as to
reduce contention for the former single LockMgrLock.  Per my recent
proposal.  I set it up for 16 partitions, but on a pgbench test this
gives only a marginal further improvement over 4 partitions --- we need
to test more scenarios to choose the number of partitions.
2005-12-11 21:02:18 +00:00
Tom Lane cefcbbf1fd Push the responsibility for handling ignore_killed_tuples down into
_bt_checkkeys(), instead of checking it in the top-level nbtree.c routines
as formerly.  This saves a little bit of loop overhead, but more importantly
it lets us skip performing the index key comparisons for dead tuples.
2005-12-07 19:37:53 +00:00
Tom Lane f1b059af12 A couple of tiny performance hacks in _bt_step(). Remove PageIsEmpty
checks, which were once needed because PageGetMaxOffsetNumber would
fail on empty pages, but are now just redundant.  Also, don't set up
local variables that aren't needed in the fast path --- most of the
time, we only need to advance offnum and not step across a page boundary.
Motivated by noticing _bt_step at the top of OProfile profile for a
pgbench run.
2005-12-07 18:03:48 +00:00
Tom Lane 887a7c61f6 Get rid of slru.c's hardwired insistence on a fixed number of slots per
SLRU area.  The number of slots is still a compile-time constant (someday
we might want to change that), but at least it's a different constant for
each SLRU area.  Increase number of subtrans buffers to 32 based on
experimentation with a heavily subtrans-bashing test case, and increase
number of multixact member buffers to 16, since it's obviously silly for
it not to be at least twice the number of multixact offset buffers.
2005-12-06 23:08:34 +00:00
Tom Lane a615acf555 Arrange for read-only accesses to SLRU page buffers to take only a shared
lock, not exclusive, if the desired page is already in memory.  This can
be demonstrated to be a significant win on the pg_subtrans cache when there
is a large window of open transactions.  It should be useful for pg_clog
as well.  I didn't try to make GetMultiXactIdMembers() use the code, as
that would have taken some restructuring, and what with the local cache
for multixact contents it probably wouldn't really make a difference.
Per my recent proposal.
2005-12-06 18:10:06 +00:00
Tom Lane a98871b7ac Tweak indexscan machinery to avoid taking an AccessShareLock on an index
if we already have a stronger lock due to the index's table being the
update target table of the query.  Same optimization I applied earlier
at the table level.  There doesn't seem to be much interest in the more
radical idea of not locking indexes at all, so do what we can ...
2005-12-03 05:51:03 +00:00
Tom Lane 4c4eb57154 Some marginal additional hacking to shave a few more cycles off
heapgettup.
2005-11-26 05:03:06 +00:00
Tom Lane 70f1482de3 Change seqscan logic so that we check visibility of all tuples on a page
when we first read the page, rather than checking them one at a time.
This allows us to take and release the buffer content lock just once
per page, instead of once per tuple.  Since it's a shared lock the
contention penalty for holding the lock longer shouldn't be too bad.
We can safely do this only when using an MVCC snapshot; else the
assumption that visibility won't change over time is uncool.  Therefore
there are now two code paths depending on the snapshot type.  I also
made the same change in nodeBitmapHeapscan.c, where it can be done always
because we only support MVCC snapshots for bitmap scans anyway.
Also make some incidental cleanups in the APIs of these functions.
Per a suggestion from Qingqing Zhou.
2005-11-26 03:03:07 +00:00
Bruce Momjian 436a2956d8 Re-run pgindent, fixing a problem where comment lines after a blank
comment line where output as too long, and update typedefs for /lib
directory.  Also fix case where identifiers were used as variable names
in the backend, but as typedefs in ecpg (favor the backend for
indenting).

Backpatch to 8.1.X.
2005-11-22 18:17:34 +00:00
Tom Lane dd218ae7b0 Remove the t_datamcxt field of HeapTupleData. This was introduced for
the convenience of tuptoaster.c and is no longer needed, so may as well
get rid of some small amount of overhead.
2005-11-20 19:49:08 +00:00
Tom Lane 40314f2dac Modify tuptoaster's API so that it does not try to modify the passed
tuple in-place, but instead passes back an all-new tuple structure if
any changes are needed.  This is a much cleaner and more robust solution
for the bug discovered by Alexey Beschiokov; accordingly, revert the
quick hack I installed yesterday.
With this change, HeapTupleData.t_datamcxt is no longer needed; will
remove it in a separate commit in HEAD only.
2005-11-20 18:38:20 +00:00
Tom Lane 2a8d3d83ef R-tree is dead ... long live GiST. 2005-11-07 17:36:47 +00:00
Tom Lane 6236991143 Add simple sanity checks on newly-read pages to GiST, too. 2005-11-06 22:39:21 +00:00
Tom Lane 766dc45d9f Add defenses to btree and hash index AMs to do simple sanity checks
on every index page they read; in particular to catch the case of an
all-zero page, which PageHeaderIsValid allows to pass.  It turns out
hash already had this idea, but it was just Assert()ing things rather
than doing a straight error check, and the Asserts were partially
redundant with PageHeaderIsValid anyway.  Per recent failure example
from Jim Nasby.  (gist still needs the same treatment.)
2005-11-06 19:29:01 +00:00
Tom Lane 18691d8ee3 Clean up representation of SLRU page state. This is the cleaner fix
for the SLRU race condition that I posted a few days ago, but we decided
not to use in 8.1 and older branches.
2005-11-05 21:19:47 +00:00
Alvaro Herrera 902377c465 Rename the members of CommandDest enum so they don't collide with other uses of
those names.  (Debug and None were pretty bad names anyway.)  I hope I catched
all uses of the names in comments too.
2005-11-03 17:11:40 +00:00
Tom Lane 99d48695d4 Fix longstanding race condition in transaction log management: there was a
very narrow window in which SimpleLruReadPage or SimpleLruWritePage could
think that I/O was needed when it wasn't (and indeed the buffer had already
been assigned to another page).  This would result in an Assert failure if
Asserts were enabled, and probably in silent data corruption if not.
Reported independently by Jim Nasby and Robert Creager.

I intend a more extensive fix when 8.2 development starts, but this is a
reasonably low-impact patch for the existing branches.
2005-11-03 00:23:36 +00:00
Peter Eisentraut 07bb9f086b Message corrections 2005-10-29 00:31:52 +00:00
Tom Lane a037926295 Reorder code so that we don't have to hold a critical section while
reserving SLRU space for a new MultiXact.  The original coding would have
treated out-of-disk-space as a PANIC condition, which is unnecessary.
2005-10-28 19:00:19 +00:00
Tom Lane 1986ca5ce5 Fix race condition in multixact code: it's possible to try to read a
multixact's starting offset before the offset has been stored into the
SLRU file.  A simple fix would be to hold the MultiXactGenLock until the
offset has been stored, but that looks like a big concurrency hit.  Instead
rely on knowledge that unset offsets will be zero, and loop when we see
a zero.  This requires a little extra hacking to ensure that zero is never
a valid value for the offset.  Problem reported by Matteo Beccati, fix
ideas from Martijn van Oosterhout, Alvaro Herrera, and Tom Lane.
2005-10-28 17:27:29 +00:00
Tom Lane 6d6c3722fb Make code for selecting default WAL sync method less confusing. 2005-10-22 20:27:17 +00:00
Tom Lane d9cb48786e Better solution to the problem of labeling whole-row Datums that are
generated from subquery outputs: use the type info stored in the Var
itself.  To avoid making ExecEvalVar and slot_getattr more complex
and slower, I split out the whole-row case into a separate ExecEval routine.
2005-10-19 22:30:30 +00:00
Tom Lane 07908c9c37 Ensure that the Datum generated from a whole-row Var contains valid
type ID information even when it's a record type.  This is needed to
handle whole-row Vars referencing subquery outputs.  Per example from
Richard Huxton.
2005-10-19 18:18:33 +00:00
Tom Lane 23836fb1fb A few trivial code cleanups motivated by reading warnings generated
by a recent HP C compiler.  Mostly, get rid of useless local variables
that are assigned to but never used.
2005-10-18 01:06:24 +00:00
Bruce Momjian 1dc3498251 Standard pgindent run for 8.1. 2005-10-15 02:49:52 +00:00
Bruce Momjian 1d028537a2 This makes the error messages for PREPARE TRANSACTION, COMMIT PREPARED
etc. match the docs, which talk about "transaction identifier" not
"gid" or "global transaction identifier".

Steve Woodcock
2005-10-13 22:55:55 +00:00
Bruce Momjian 84cc9a4bb3 Back out this because of fear of changing error strings:
This makes the error messages for PREPARE TRANSACTION, COMMIT PREPARED
etc. match the docs, which talk about "transaction identifier" not
"gid" or "global transaction identifier".

Steve Woodcock
2005-10-13 17:57:57 +00:00
Bruce Momjian 90c22c9206 This makes the error messages for PREPARE TRANSACTION, COMMIT PREPARED
etc. match the docs, which talk about "transaction identifier" not
"gid" or "global transaction identifier".

Steve Woodcock
2005-10-13 17:57:17 +00:00
Tom Lane e952ae1268 Fix longstanding bug found by Atsushi Ogawa: _bt_check_unique would mark
the wrong buffer dirty when trying to kill a dead index entry that's on
a page after the one it started on.  No risk of data corruption, just
inefficiency, but still a bug.
2005-10-12 17:18:03 +00:00
Tom Lane cb8b6618ce Revise pgstats stuff to fix the problems with not counting accesses
generated by bitmap index scans.  Along the way, simplify and speed up
the code for counting sequential and index scans; it was both confusing
and inefficient to be taking care of that in the per-tuple loops, IMHO.
initdb forced because of internal changes in pg_stat view definitions.
2005-10-06 02:29:23 +00:00
Tom Lane 64eea6c21d Expand pg_control information so that we can verify that the database
was created on a machine with alignment rules and floating-point format
similar to the current machine.  Per recent discussion, this seems like
a good idea with the increasing prevalence of 32/64 bit environments.
2005-10-03 00:28:43 +00:00
Tom Lane 303e089df5 Clean up possibly-uninitialized-variable warnings reported by gcc 4.x. 2005-09-24 22:54:44 +00:00
Bruce Momjian b3364fc81b pgindent new GIST index code, per request from Tom. 2005-09-22 20:44:36 +00:00
Tom Lane 08817bdb76 Adjust GiST error messages to conform to message style guidelines. 2005-09-22 18:49:45 +00:00
Teodor Sigaev f4516f8732 Small fixes 2005-09-16 14:40:54 +00:00
Neil Conway 1dd9b09332 Copy-editing for GiST README. 2005-09-15 17:44:27 +00:00
Teodor Sigaev 79fae4a764 Readme about GiST's algorithms 2005-09-15 16:39:15 +00:00
Tom Lane 35e9b1cc1e Clean up a couple of ad-hoc computations of the maximum number of tuples
on a page, as suggested by ITAGAKI Takahiro.  Also, change a few places
that were using some other estimates of max-items-per-page to consistently
use MaxOffsetNumber.  This is conservatively large --- we could have used
the new MaxHeapTuplesPerPage macro, or a similar one for index tuples ---
but those places are simply declaring a fixed-size buffer and assuming it
will work, rather than actively testing for overrun.  It seems safer to
size these buffers in a way that can't overflow even if the page is
corrupt.
2005-09-02 19:02:20 +00:00
Tom Lane 037709e0b3 Reduce default value of max_prepared_transactions from 50 to 5. This
saves nearly 700kB in the default shared memory segment size, which seems
worthwhile, and it is a feature that many users won't use anyway.  Per
Heikki's argument, there is no point in a compromise value --- those who
are using 2PC at all will probably want it at least equal to max_connections.
But we can't set it to zero by default without breaking the prepared_xacts
regression test.
2005-08-29 21:38:18 +00:00
Tom Lane 9052537325 Rewrite gather-write patch into something less obviously bolted on
after the fact.  Fix bug with incorrect test for whether we are at end
of logfile segment.  Arrange for writes triggered by XLogInsert's
is-cache-more-than-half-full test to synchronize with the cache boundaries,
so that in long transactions we tend to write alternating halves of the
cache rather than randomly chosen portions of it; this saves one more
write syscall per cache load.
2005-08-22 23:59:04 +00:00
Bruce Momjian 8ad3965a11 Improve xid wraparound message (the server isn't really shut down, just
not accepting queries).

         errmsg("database is not accepting queries to avoid
	 wraparound data loss in database \"%s\"",
         errhint("Stop the postmaster and use a standalone
	 backend to VACUUM database \"%s\".",
2005-08-22 16:59:47 +00:00
Tom Lane d0096a41fa Fix some inconsistent choices of datatypes in xlog.c. Make buffer
indexes all be int, rather than variously int, uint16 and uint32;
add some casts where necessary to support large buffer arrays.
2005-08-22 00:41:28 +00:00
Tom Lane f39f6b500f Seems that the childXids list would be better based on Oid lists than
integer lists.
2005-08-20 23:45:08 +00:00
Tom Lane 0007490e09 Convert the arithmetic for shared memory size calculation from 'int'
to 'Size' (that is, size_t), and install overflow detection checks in it.
This allows us to remove the former arbitrary restrictions on NBuffers
etc.  It won't make any difference in a 32-bit machine, but in a 64-bit
machine you could theoretically have terabytes of shared buffers.
(How efficiently we could manage 'em remains to be seen.)  Similarly,
num_temp_buffers, work_mem, and maintenance_work_mem can be set above
2Gb on a 64-bit machine.  Original patch from Koichi Suzuki, additional
work by moi.
2005-08-20 23:26:37 +00:00
Tatsuo Ishii ba2fc7eb4b Make GetMultiXactIdMembers() a public function. 2005-08-20 01:29:27 +00:00
Tom Lane f57e3f4cf3 Repair problems with VACUUM destroying t_ctid chains too soon, and with
insufficient paranoia in code that follows t_ctid links.  (We must do both
because even with VACUUM doing it properly, the intermediate state with
a dangling t_ctid link is visible concurrently during lazy VACUUM, and
could be seen afterwards if either type of VACUUM crashes partway through.)
Also try to improve documentation about what's going on.  Patch is a bit
bulky because passing the XMAX information around required changing the
APIs of some low-level heapam.c routines, but it's not conceptually very
complicated.  Per trouble report from Teodor and subsequent analysis.
This needs to be back-patched, but I'll do that after 8.1 beta is out.
2005-08-20 00:40:32 +00:00
Tom Lane f8d0a82bf9 Avoid an Assert failure if OuterUserId hasn't been set yet during
AbortTransaction.  This can happen if a backend's InitPostgres transaction
fails (eg, because the given username is invalid).  Per Alvaro.
2005-08-17 22:14:34 +00:00
Tom Lane 6ea05c16a4 Change a couple of "can't happen" error messages to be a shade more
verbose when they do happen.  The "left link changed unexpectedly"
one in particular has been seen more than once in the field.
2005-08-12 14:34:14 +00:00
Tom Lane 721e53785d Solve the problem of OID collisions by probing for duplicate OIDs
whenever we generate a new OID.  This prevents occasional duplicate-OID
errors that can otherwise occur once the OID counter has wrapped around.
Duplicate relfilenode values are also checked for when creating new
physical files.  Per my recent proposal.
2005-08-12 01:36:05 +00:00
Tom Lane d90c531188 Autovacuum loose end mop-up. Provide autovacuum-specific vacuum cost
delay and limit, both as global GUCs and as table-specific entries in
pg_autovacuum.  stats_reset_on_server_start is now OFF by default,
but a reset is forced if we did WAL replay.  XID-wrap vacuums do not
ANALYZE, but do FREEZE if it's a template database.  Alvaro Herrera
2005-08-11 21:11:50 +00:00
Bruce Momjian 949ebbd55e Mention MD5 function index for indexing long values. 2005-08-11 13:22:33 +00:00
Tom Lane 24ff62d76f Make new hints follow style guide. 2005-08-10 22:39:00 +00:00
Bruce Momjian 237be3cc29 Add hints to cases where indexes fail because of values that are too long. 2005-08-10 21:36:46 +00:00
Tom Lane 4568e0f791 Modify AtEOXact_CatCache and AtEOXact_RelationCache to assume that the
ResourceOwner mechanism already released all reference counts for the
cache entries; therefore, we do not need to scan the catcache or relcache
at transaction end, unless we want to do it as a debugging crosscheck.
Do the crosscheck only in Assert mode.  This is the same logic we had
previously installed in AtEOXact_Buffers to avoid overhead with large
numbers of shared buffers.  I thought it'd be a good idea to do it here
too, in view of Kari Lavikka's recent report showing a real-world case
where AtEOXact_CatCache is taking a significant fraction of runtime.
2005-08-08 19:17:23 +00:00
Tom Lane 0001e98d54 Code and docs review for pg_column_size() patch. 2005-08-02 16:11:57 +00:00
Tom Lane 2a4fad1a0e Add NOWAIT option to SELECT FOR UPDATE/SHARE.
Original patch by Hans-Juergen Schoenig, revisions by Karel Zak
and Tom Lane.
2005-08-01 20:31:16 +00:00
Tom Lane d42cf5a42a Add per-user and per-database connection limit options.
This patch also includes preliminary update of pg_dumpall for roles.
Petr Jelinek, with review by Bruce Momjian and Tom Lane.
2005-07-31 17:19:22 +00:00
Bruce Momjian 5b0bfec414 Fix compile for no O_SYNC, but introduced with O_DIRECT. 2005-07-30 14:15:44 +00:00
Tom Lane 5d5f1a79e6 Clean up a number of autovacuum loose ends. Make the stats collector
track shared relations in a separate hashtable, so that operations done
from different databases are counted correctly.  Add proper support for
anti-XID-wraparound vacuuming, even in databases that are never connected
to and so have no stats entries.  Miscellaneous other bug fixes.
Alvaro Herrera, some additional fixes by Tom Lane.
2005-07-29 19:30:09 +00:00
Bruce Momjian c6b1724c67 Update O_DIRECT comment. 2005-07-29 03:25:53 +00:00
Bruce Momjian c34bb00581 Use O_DIRECT if available when using O_SYNC for wal_sync_method.
Also, write multiple WAL buffers out in one write() operation.

ITAGAKI Takahiro

---------------------------------------------------------------------------

> If we disable writeback-cache and use open_sync, the per-page writing
> behavior in WAL module will show up as bad result. O_DIRECT is similar
> to O_DSYNC (at least on linux), so that the benefit of it will disappear
> behind the slow disk revolution.
>
> In the current source, WAL is written as:
>     for (i = 0; i < N; i++) { write(&buffers[i], BLCKSZ); }
> Is this intentional? Can we rewrite it as follows?
>    write(&buffers[0], N * BLCKSZ);
>
> In order to achieve it, I wrote a 'gather-write' patch (xlog.gw.diff).
> Aside from this, I'll also send the fixed direct io patch (xlog.dio.diff).
> These two patches are independent, so they can be applied either or both.
>
>
> I tested them on my machine and the results as follows. It shows that
> direct-io and gather-write is the best choice when writeback-cache is off.
> Are these two patches worth trying if they are used together?
>
>
>             | writeback | fsync= | fdata | open_ | fsync_ | open_
> patch       | cache     |  false |  sync |  sync | direct | direct
> ------------+-----------+--------+-------+-------+--------+---------
> direct io   | off       |  124.2 | 105.7 |  48.3 |   48.3 |  48.2
> direct io   | on        |  129.1 | 112.3 | 114.1 |  142.9 | 144.5
> gather-write| off       |  124.3 | 108.7 | 105.4 |  (N/A) | (N/A)
> both        | off       |  131.5 | 115.5 | 114.4 |  145.4 | 145.2
>
> - 20runs * pgbench -s 100 -c 50 -t 200
>    - with tuning (wal_buffers=64, commit_delay=500, checkpoint_segments=8)
> - using 2 ATA disks:
>    - hda(reiserfs) includes system and wal.
>    - hdc(jfs) includes database files. writeback-cache is always on.
>
> ---
> ITAGAKI Takahiro
2005-07-29 03:22:33 +00:00
Tom Lane e5d6b91220 Add SET ROLE. This is a partial commit of Stephen Frost's recent patch;
I'm still working on the has_role function and information_schema changes.
2005-07-25 22:12:34 +00:00
Bruce Momjian 9af9d674c6 Remove unintended code addition. 2005-07-23 15:31:16 +00:00
Bruce Momjian 4098c8867d Macro alignment cleanup. 2005-07-23 15:29:47 +00:00
Tom Lane f2bf2d2dc5 Fix a couple of bogus comments, per Alvaro. 2005-07-13 22:46:09 +00:00
Tom Lane d7207cfc6b Even though I'd like to see full_page_writes go away before 8.1,
a minimum requirement is that it not completely break the system
meanwhile.  Put the test in the right place.
2005-07-08 04:07:26 +00:00
Bruce Momjian a923602855 Add pg_column_size() to return storage size of a column, including
possible compression.

Mark Kirkwood
2005-07-06 19:02:54 +00:00
Bruce Momjian 326a7a0788 Add GUC full_page_writes to control writing full pages to WAL. 2005-07-05 23:18:10 +00:00
Tom Lane eb5949d190 Arrange for the postmaster (and standalone backends, initdb, etc) to
chdir into PGDATA and subsequently use relative paths instead of absolute
paths to access all files under PGDATA.  This seems to give a small
performance improvement, and it should make the system more robust
against naive DBAs doing things like moving a database directory that
has a live postmaster in it.  Per recent discussion.
2005-07-04 04:51:52 +00:00
Tom Lane e7e1694295 Migrate rtree_gist functionality into the core system, and add some
basic regression tests for GiST to the standard regression tests.
I took the opportunity to add an rtree-equivalent gist opclass for
circles; the contrib version only covered boxes and polygons, but
indexing circles is very handy for distance searches.
2005-07-01 19:19:05 +00:00
Teodor Sigaev 5350216156 Improve error messages and add comment 2005-07-01 13:18:17 +00:00
Teodor Sigaev 898a7bd13b Bug fixes for GiST crash recovery.
- add forgotten check of lsn for insert completion
- remove level of pages: hard to check in recovery
- some cleanups
2005-06-30 17:52:14 +00:00
Tom Lane 401de9c8be Improve the checkpoint signaling mechanism so that the bgwriter can tell
the difference between checkpoints forced due to WAL segment consumption
and checkpoints forced for other reasons (such as CREATE DATABASE).  Avoid
generating 'checkpoints are occurring too frequently' messages when the
checkpoint wasn't caused by WAL segment consumption.  Per gripe from
Chris K-L.
2005-06-30 00:00:52 +00:00
Tom Lane b5f7cff84f Clean up the rather historically encumbered interface to now() and
current time: provide a GetCurrentTimestamp() function that returns
current time in the form of a TimestampTz, instead of separate time_t
and microseconds fields.  This is what all the callers really want
anyway, and it eliminates low-level dependencies on AbsoluteTime,
which is a deprecated datatype that will have to disappear eventually.
2005-06-29 22:51:57 +00:00
Teodor Sigaev 4523e0b63a Cleanup, remove unneeded pallocs 2005-06-29 14:06:14 +00:00
Teodor Sigaev 88b49cdc95 Code cleanup. gistfillbuffer accepts InvalidOffsetNumber. 2005-06-28 15:51:00 +00:00
Tom Lane 7762619e95 Replace pg_shadow and pg_group by new role-capable catalogs pg_authid
and pg_auth_members.  There are still many loose ends to finish in this
patch (no documentation, no regression tests, no pg_dump support for
instance).  But I'm going to commit it now anyway so that Alvaro can
make some progress on shared dependencies.  The catalog changes should
be pretty much done.
2005-06-28 05:09:14 +00:00
Teodor Sigaev e8cab5fe49 Concurrency for GiST
- full concurrency for insert/update/select/vacuum:
        - select and vacuum never locks more than one page simultaneously
        - select (gettuple) hasn't any lock across it's calls
        - insert never locks more than two page simultaneously:
                - during search of leaf to insert it locks only one page
                  simultaneously
                - while walk upward to the root it locked only parent (may be
                  non-direct parent) and child. One of them X-lock, another may
                  be S- or X-lock
- 'vacuum full' locks index
- improve gistgetmulti
- simplify XLOG records

Fix bug in index_beginscan_internal: LockRelation may clean
  rd_aminfo structure, so move GET_REL_PROCEDURE after LockRelation
2005-06-27 12:45:23 +00:00
Tom Lane b90f8f20f0 Extend r-tree operator classes to handle Y-direction tests equivalent
to the existing X-direction tests.  An rtree class now includes 4 actual
2-D tests, 4 1-D X-direction tests, and 4 1-D Y-direction tests.
This involved adding four new Y-direction test operators for each of
box and polygon; I followed the PostGIS project's lead as to the names
of these operators.
NON BACKWARDS COMPATIBLE CHANGE: the poly_overleft (&<) and poly_overright
(&>) operators now have semantics comparable to box_overleft and box_overright.
This is necessary to make r-tree indexes work correctly on polygons.
Also, I changed circle_left and circle_right to agree with box_left and
box_right --- formerly they allowed the boundaries to touch.  This isn't
actually essential given the lack of any r-tree opclass for circles, but
it seems best to sync all the definitions while we are at it.
2005-06-24 20:53:34 +00:00
Tom Lane 9a09248edd Fix rtree and contrib/rtree_gist search behavior for the 1-D box and
polygon operators (<<, &<, >>, &>).  Per ideas originally put forward
by andrew@supernews and later rediscovered by moi.  This patch just
fixes the existing opclasses, and does not add any new behavior as I
proposed earlier; that can be sorted out later.  In principle this
could be back-patched, since it changes only search behavior and not
system catalog entries nor rtree index contents.  I'm not currently
planning to do that, though, since I think it could use more testing.
2005-06-24 00:18:52 +00:00
Tom Lane e98edb5555 Fix the mechanism for reporting the original table OID and column number
of columns of a query result so that it can "see through" cursors and
prepared statements.  Per gripe a couple months back from John DeSoi.
2005-06-22 17:45:46 +00:00
Tom Lane b95ae32b41 Avoid WAL-logging individual tuple insertions during CREATE TABLE AS
(a/k/a SELECT INTO).  Instead, flush and fsync the whole relation before
committing.  We do still need the WAL log when PITR is active, however.
Simon Riggs and Tom Lane.
2005-06-20 18:37:02 +00:00
Teodor Sigaev 1bfdd1a893 fix founded hole in recovery after crash, add vacuum_delay_point() 2005-06-20 15:22:38 +00:00
Teodor Sigaev d544ec8bbd 1. full functional WAL for GiST
2. improve vacuum for gist
   - use FSM
   - full vacuum:
      - reforms parent tuple if it's needed
        ( tuples was deleted on child page or parent tuple remains invalid
          after crash recovery )
      - truncate index file if possible
3. fixes bugs and mistakes
2005-06-20 10:29:37 +00:00
Tom Lane d961a56899 Avoid unnecessary palloc overhead in _bt_first(). The temporary
scankeys arrays that it needs can never have more than INDEX_MAX_KEYS
entries, so it's reasonable to just allocate them as fixed-size local
arrays, and save the cost of palloc/pfree.  Not a huge savings, but
a cycle saved is a cycle earned ...
2005-06-19 22:41:00 +00:00
Tom Lane fc654583ab Need #include <time.h> on some platforms. 2005-06-19 22:34:56 +00:00
Tom Lane 3f749924f8 Simplify uses of readdir() by creating a function ReadDir() that
includes error checking and an appropriate ereport(ERROR) message.
This gets rid of rather tedious and error-prone manipulation of errno,
as well as a Windows-specific bug workaround, at more than a dozen
call sites.  After an idea in a recent patch by Heikki Linnakangas.
2005-06-19 21:34:03 +00:00
Tom Lane e26b0abda3 Arrange to fsync two-phase-commit state files only during checkpoints;
given reasonably short lifespans for prepared transactions, this should
mean that only a small minority of state files ever need to be fsynced
at all.  Per discussion with Heikki Linnakangas.
2005-06-19 20:00:39 +00:00
Tom Lane a8d1075f27 Add a time-of-preparation column to the pg_prepared_xacts view, per an
old suggestion by Oliver Jowett.  Also, add a transaction column to the
pg_locks view to show the xid of each transaction holding or awaiting
locks; this allows prepared transactions to be properly associated with
the locks they own.  There was already a column named 'transaction',
and I chose to rename it to 'transactionid' --- since this column is
new in the current devel cycle there should be no backwards compatibility
issue to worry about.
2005-06-18 19:33:42 +00:00
Tom Lane 66b098492e Dept. of second thoughts: regular COMMIT deletes deletable files before
releasing locks, so COMMIT PREPARED should too.
2005-06-18 05:21:09 +00:00
Tom Lane d0a89683a3 Two-phase commit. Original patch by Heikki Linnakangas, with additional
hacking by Alvaro Herrera and Tom Lane.
2005-06-17 22:32:51 +00:00
Bruce Momjian f4d907ca85 Remove old *.backup files when we do pg_stop_backup(). This
prevents a large number of *.backup files from existing in pg_xlog/
2005-06-15 01:36:08 +00:00
Teodor Sigaev 37c839365c WAL for GiST. It work for online backup and so on, but on
recovery after crash (power loss etc) it may say that it can't restore
index and index should be reindexed.

Some refactoring code.
2005-06-14 11:45:14 +00:00
Tom Lane c186c93148 Change the planner to allow indexscan qualification clauses to use
nonconsecutive columns of a multicolumn index, as per discussion around
mid-May (pghackers thread "Best way to scan on-disk bitmaps").  This
turns out to require only minimal changes in btree, and so far as I can
see none at all in GiST.  btcostestimate did need some work, but its
original assumption that index selectivity == heap selectivity was
quite bogus even before this.
2005-06-13 23:14:49 +00:00
Bruce Momjian 51746c4549 Free buffer allocated via malloc (process is short-lived, but fix it anyway). 2005-06-09 22:36:27 +00:00
Tom Lane dce83e76e0 Add missing #include -- mea culpa. 2005-06-09 21:01:25 +00:00
Tom Lane 82358ca936 Put a critical section around update of hash index metapage. Per
discussion with Qingqing Zhou.
2005-06-09 18:23:50 +00:00
Tom Lane f5b2f60bd1 Change WAL-logging scheme for multixacts to be more like regular
transaction IDs, rather than like subtrans; in particular, the information
now survives a database restart.  Per previous discussion, this is
essential for PITR log shipping and for 2PC.
2005-06-08 15:50:28 +00:00
Tom Lane ee7ac7b11e Modify XLogInsert API to make callers specify whether pages to be backed
up have the standard layout with unused space between pd_lower and pd_upper.
When this is set, XLogInsert will omit the unused space without bothering
to scan it to see if it's zero.  That saves time in XLogInsert, and also
allows reversion of my earlier patch to make PageRepairFragmentation et al
explicitly re-zero freed space.  Per suggestion by Heikki Linnakangas.
2005-06-06 20:22:58 +00:00
Tom Lane 4c8495a1f2 Remove the mostly-stubbed-out-anyway support routines for WAL UNDO.
That code is never going to be used in the foreseeable future, and
where it's more than a stub it's making the redo routines harder to
read.
2005-06-06 17:01:25 +00:00
Tom Lane 21fda22ec4 Change CRCs in WAL records from 64bit to 32bit for performance reasons.
Instead of a separate CRC on each backup block, include backup blocks
in their parent WAL record's CRC; this is important to ensure that the
backup block really goes with the WAL record, ie there was not a page
tear right at the start of the backup block.  Implement a simple form
of compression of backup blocks: drop any run of zeroes starting at
pd_lower, so as not to store the unused 'hole' that commonly exists in
PG heap and index pages.  Tweak PageRepairFragmentation and related
routines to ensure they keep the unused space zeroed, so that the above
compression method remains effective.  All per recent discussions.
2005-06-02 05:55:29 +00:00
Tom Lane a91fa39028 Add test to WAL replay to verify that xl_prev points back to the previous
WAL record; this is necessary to be sure we recognize stale WAL records
when a WAL page was only partially written during a system crash.
2005-05-31 19:10:28 +00:00
Tom Lane e92a88272e Modify hash_search() API to prevent future occurrences of the error
spotted by Qingqing Zhou.  The HASH_ENTER action now automatically
fails with elog(ERROR) on out-of-memory --- which incidentally lets
us eliminate duplicate error checks in quite a bunch of places.  If
you really need the old return-NULL-on-out-of-memory behavior, you
can ask for HASH_ENTER_NULL.  But there is now an Assert in that path
checking that you aren't hoping to get that behavior in a palloc-based
hash table.
Along the way, remove the old HASH_FIND_SAVE/HASH_REMOVE_SAVED actions,
which were not being used anywhere anymore, and were surely too ugly
and unsafe to want to see revived again.
2005-05-29 04:23:07 +00:00
Tom Lane 32e8fc4a28 Arrange to cache fmgr lookup information for an index's access method
routines in the index's relcache entry, instead of doing a fresh fmgr_info
on every index access.  We were already doing this for the index's opclass
support functions; not sure why we didn't think to do it for the AM
functions too.  This supersedes the former method of caching (only)
amgettuple in indexscan scan descriptors; it's an improvement because the
function lookup can be amortized across multiple statements instead of
being repeated for each statement.  Even though lookup for builtin
functions is pretty cheap, this seems to drop a percent or two off some
simple benchmarks.
2005-05-27 23:31:21 +00:00
Bruce Momjian b492c3accc Add parentheses to macros when args are used in computations. Without
them, the executation behavior could be unexpected.
2005-05-25 21:40:43 +00:00
Bruce Momjian 6dc7760ac3 Add support for wal_fsync_writethrough for Darwin, and restructure the
code to better handle writethrough.

Chris Campbell
2005-05-20 14:53:26 +00:00
Tom Lane ff0c143a3b Make a comment pgindent-proof, per suggestion from Alvaro. 2005-05-19 23:58:51 +00:00
Tom Lane ee3b71f6bc Split the shared-memory array of PGPROC pointers out of the sinval
communication structure, and make it its own module with its own lock.
This should reduce contention at least a little, and it definitely makes
the code seem cleaner.  Per my recent proposal.
2005-05-19 21:35:48 +00:00
Neil Conway c891e05f26 Cleanup GiST header files. Since GiST extensions are often written as
external projects, we should be careful about what parts of the GiST
API are considered implementation details, and which are part of the
public API. Therefore, I've moved internal-only declarations into
gist_private.h -- future backward-incompatible changes to gist.h should
be made with care, to avoid needlessly breaking external GiST extensions.

Also did some related header cleanup: remove some unnecessary #includes
from gist.h, and remove some unused definitions: isAttByVal(), _gistdump(),
and GISTNStrategies.
2005-05-17 03:34:18 +00:00
Neil Conway eda6dd32d1 GiST improvements:
- make sure we always invoke user-supplied GiST methods in a short-lived
  memory context. This means the backend isn't exposed to any memory leaks
  that be in those methods (in fact, it is probably a net loss for most
  GiST methods to bother manually freeing memory now). This also means
  we can do away with a lot of ugly manual memory management in the
  GiST code itself.

- keep the current page of a GiST index scan pinned, rather than doing a
  ReadBuffer() for each tuple produced by the scan. Since ReadBuffer() is
  expensive, this is a perf. win

- implement dead tuple killing for GiST indexes (which is easy to do, now
  that we keep a pin on the current scan page). Now all the builtin indexes
  implement dead tuple killing.

- cleanup a lot of ugly code in GiST
2005-05-17 00:59:30 +00:00
Tom Lane 2ef172a2a4 Fix latent bug in ExecSeqRestrPos: it leaves the plan node's result slot
in an inconsistent state.  (This is only latent because in reality
ExecSeqRestrPos is dead code at the moment ... but someday maybe it won't
be.)  Add some comments about what the API for plan node mark/restore
actually is, because it's not immediately obvious.
2005-05-15 21:19:55 +00:00
Neil Conway eb0d00a9be Various style cleanups for GiST; no changes to functionality. 2005-05-15 04:08:29 +00:00
Neil Conway 3140437495 This patch refactors away some duplicated code in the index AM build
methods: they all invoke UpdateStats() since they have computed the
number of heap tuples, so I created a function in catalog/index.c that
each AM now calls.
2005-05-11 06:24:55 +00:00
Neil Conway f38e413b20 Code cleanup: in C89, there is no point casting the first argument to
memset() or MemSet() to a char *. For one, memset()'s first argument is
a void *, and further void * can be implicitly coerced to/from any other
pointer type.
2005-05-11 01:26:02 +00:00
Bruce Momjian 35e1651508 Back out check for unreferenced files.
Heikki Linnakangas
2005-05-10 22:27:30 +00:00
Neil Conway dc5ebcfcce Fix typo in comment. 2005-05-10 05:15:07 +00:00
Tom Lane 30f540be43 Repair very-low-probability race condition between relation extension
and VACUUM: in the interval between adding a new page to the relation
and formatting it, it was possible for VACUUM to come along and decide
it should format the page too.  Though not harmful in itself, this would
cause data loss if a third transaction were able to insert tuples into
the vacuumed page before the original extender got control back.
2005-05-07 21:32:24 +00:00
Tom Lane 4cd4ed0cc2 Fix case in which a debug printout would print already-pfreed data. 2005-05-07 18:14:25 +00:00
Tom Lane 278bd0cc22 For some reason access/tupmacs.h has been #including utils/memutils.h,
which is neither needed by nor related to that header.  Remove the bogus
inclusion and instead include the header in those C files that actually
need it.  Also fix unnecessary inclusions and bad inclusion order in
tsearch2 files.
2005-05-06 17:24:55 +00:00
Tom Lane 126eaef651 Clean up MultiXactIdExpand's API by separating out the case where we
are creating a new MultiXactId from two regular XIDs.  The original
coding was unnecessarily complicated and didn't save any code anyway.
2005-05-03 19:42:41 +00:00
Bruce Momjian 76668e6eb4 Check the file system on postmaster startup and report any unreferenced
files in the server log.

Heikki Linnakangas
2005-05-02 18:26:54 +00:00
Tom Lane 6c412f0605 Change CREATE TYPE to require datatype output and send functions to have
only one argument.  (Per recent discussion, the option to accept multiple
arguments is pretty useless for user-defined types, and would be a likely
source of security holes if it was used.)  Simplify call sites of
output/send functions to not bother passing more than one argument.
2005-05-01 18:56:19 +00:00
Tom Lane 93b2477278 Use the standard lock manager to establish priority order when there
is contention for a tuple-level lock.  This solves the problem of a
would-be exclusive locker being starved out by an indefinite succession
of share-lockers.  Per recent discussion with Alvaro.
2005-04-30 19:03:33 +00:00
Tom Lane 3a694bb0a1 Restructure LOCKTAG as per discussions of a couple months ago.
Essentially, we shoehorn in a lockable-object-type field by taking
a byte away from the lockmethodid, which can surely fit in one byte
instead of two.  This allows less artificial definitions of all the
other fields of LOCKTAG; we can get rid of the special pg_xactlock
pseudo-relation, and also support locks on individual tuples and
general database objects (including shared objects).  None of those
possibilities are actually exploited just yet, however.

I removed pg_xactlock from pg_class, but did not force initdb for
that change.  At this point, relkind 's' (SPECIAL) is unused and
could be removed entirely.
2005-04-29 22:28:24 +00:00
Tom Lane bedb78d386 Implement sharable row-level locks, and use them for foreign key references
to eliminate unnecessary deadlocks.  This commit adds SELECT ... FOR SHARE
paralleling SELECT ... FOR UPDATE.  The implementation uses a new SLRU
data structure (managed much like pg_subtrans) to represent multiple-
transaction-ID sets.  When more than one transaction is holding a shared
lock on a particular row, we create a MultiXactId representing that set
of transactions and store its ID in the row's XMAX.  This scheme allows
an effectively unlimited number of row locks, just as we did before,
while not costing any extra overhead except when a shared lock actually
has to be shared.   Still TODO: use the regular lock manager to control
the grant order when multiple backends are waiting for a row lock.

Alvaro Herrera and Tom Lane.
2005-04-28 21:47:18 +00:00
Tom Lane 19d127548c Add comment about checkpoint panic behavior during shutdown, per
suggestion from Qingqing Zhou.
2005-04-23 18:49:54 +00:00
Tom Lane 3842892492 Recent changes got the sense of the notnull bit backwards in the 2.0
protocol output routines.  Mea culpa :-(.  Per report from Kris Jurka.
2005-04-23 17:45:35 +00:00
Bruce Momjian 1a6ad669fb Fix comment typo. 2005-04-17 03:04:29 +00:00
Tom Lane 5f0a974ea9 Reduce PANIC to ERROR in several xlog routines that are used in both
critical and noncritical contexts (an example of noncritical being
post-checkpoint removal of dead xlog segments).  In the critical cases
the CRIT_SECTION mechanism will cause ERROR to be promoted to PANIC
anyway, and in the noncritical cases we shouldn't let an error take
down the entire database.  Arguably there should be *no* explicit
PANIC errors in this module, only more START/END_CRIT_SECTION calls,
but I didn't go that far.  (Yet.)
2005-04-15 22:19:48 +00:00
Tom Lane 61b861421b Modify MoveOfflineLogs/InstallXLogFileSegment to avoid O(N^2) behavior
when recycling a large number of xlog segments during checkpoint.
The former behavior searched from the same start point each time,
requiring O(checkpoint_segments^2) stat() calls to relocate all the
segments.  Instead keep track of where we stopped last time through.
2005-04-15 18:48:10 +00:00
Tom Lane 8e14408028 Make equalTupleDescs() compare attlen/attbyval/attalign rather than
assuming comparison of atttypid is sufficient.  In a dropped column
atttypid will be 0, and we'd better check the physical-storage data
to make sure the tupdescs are physically compatible.
I do not believe there is a real risk before 8.0, since before that
we only used this routine to compare successive states of the tupdesc
for a particular relation.  But 8.0's typcache.c might be comparing
arbitrary tupdescs so we'd better play it safer.
2005-04-14 22:34:48 +00:00
Tom Lane 162bd08b3f Completion of project to use fixed OIDs for all system catalogs and
indexes.  Replace all heap_openr and index_openr calls by heap_open
and index_open.  Remove runtime lookups of catalog OID numbers in
various places.  Remove relcache's support for looking up system
catalogs by name.  Bulky but mostly very boring patch ...
2005-04-14 20:03:27 +00:00
Tom Lane 2193a856a2 Simplify initdb-time assignment of OIDs as I proposed yesterday, and
avoid encroaching on the 'user' range of OIDs by allowing automatic
OID assignment to use values below 16k until we reach normal operation.

initdb not forced since this doesn't make any incompatible change;
however a lot of stuff will have different OIDs after your next initdb.
2005-04-13 18:54:57 +00:00
Tom Lane c3294f1cbf Fix interaction between materializing holdable cursors and firing
deferred triggers: either one can create more work for the other,
so we have to loop till it's all gone.  Per example from andrew@supernews.
Add a regression test to help spot trouble in this area in future.
2005-04-11 19:51:16 +00:00
Tom Lane ad161bcc8a Merge Resdom nodes into TargetEntry nodes to simplify code and save a
few palloc's.  I also chose to eliminate the restype and restypmod fields
entirely, since they are redundant with information stored in the node's
contained expression; re-examining the expression at need seems simpler
and more reliable than trying to keep restype/restypmod up to date.

initdb forced due to change in contents of stored rules.
2005-04-06 16:34:07 +00:00
Tom Lane 47888fe842 First phase of OUT-parameters project. We can now define and use SQL
functions with OUT parameters.  The various PLs still need work, as does
pg_dump.  Rudimentary docs and regression tests included.
2005-03-31 22:46:33 +00:00
Tom Lane 8c85a34a3b Officially decouple FUNC_MAX_ARGS from INDEX_MAX_KEYS, and set the
former to 100 by default.  Clean up some of the less necessary
dependencies on FUNC_MAX_ARGS; however, the biggie (FunctionCallInfoData)
remains.
2005-03-29 03:01:32 +00:00
Tom Lane 70c9763d48 Convert oidvector and int2vector into variable-length arrays. This
change saves a great deal of space in pg_proc and its primary index,
and it eliminates the former requirement that INDEX_MAX_KEYS and
FUNC_MAX_ARGS have the same value.  INDEX_MAX_KEYS is still embedded
in the on-disk representation (because it affects index tuple header
size), but FUNC_MAX_ARGS is not.  I believe it would now be possible
to increase FUNC_MAX_ARGS at little cost, but haven't experimented yet.
There are still a lot of vestigial references to FUNC_MAX_ARGS, which
I will clean up in a separate pass.  However, getting rid of it
altogether would require changing the FunctionCallInfoData struct,
and I'm not sure I want to buy into that.
2005-03-29 00:17:27 +00:00
Tom Lane 119191609c Remove dead push/pop rollback code. Vadim once planned to implement
transaction rollback via UNDO but I think that's highly unlikely to
happen, so we may as well remove the stubs.  (Someday we ought to
rip out the stub xxx_undo routines, too.)  Per Alvaro.
2005-03-28 01:50:34 +00:00
Tom Lane bf3dbb5881 First steps towards index scans with heap access decoupled from index
access: define new index access method functions 'amgetmulti' that can
fetch multiple TIDs per call.  (The functions exist but are totally
untested as yet.)  Since I was modifying pg_am anyway, remove the
no-longer-needed 'rel' parameter from amcostestimate functions, and
also remove the vestigial amowner column that was creating useless
work for Alvaro's shared-object-dependencies project.
Initdb forced due to changes in pg_am.
2005-03-27 23:53:05 +00:00
Tom Lane 617dd33b6e Eliminate duplicate hasnulls bit testing in index tuple access, and
clean up itup.h a little bit.
2005-03-27 18:38:27 +00:00
Bruce Momjian b1f57d88f5 Change Win32 O_SYNC method to O_DSYNC because that is what the method
currently does.  This is now the default Win32 wal sync method because
we perfer o_datasync to fsync.

Also, change Win32 fsync to a new wal sync method called
fsync_writethrough because that is the behavior of _commit, which is
what is used for fsync on Win32.

Backpatch to 8.0.X.
2005-03-24 04:36:20 +00:00
Tom Lane 94e03330cb Create a routine PageIndexMultiDelete() that replaces a loop around
PageIndexTupleDelete() with a single pass of compactification ---
logic mostly lifted from PageRepairFragmentation.  I noticed while
profiling that a VACUUM that's cleaning up a whole lot of deleted
tuples would spend as much as a third of its CPU time in
PageIndexTupleDelete; not too surprising considering the loop method
was roughly O(N^2) in the number of tuples involved.
2005-03-22 06:17:03 +00:00
Tom Lane ee4ddac137 Convert index-related tuple handling routines from char 'n'/' ' to bool
convention for isnull flags.  Also, remove the useless InsertIndexResult
return struct from index AM aminsert calls --- there is no reason for
the caller to know where in the index the tuple was inserted, and we
were wasting a palloc cycle per insert to deliver this uninteresting
value (plus nontrivial complexity in some AMs).
I forced initdb because of the change in the signature of the aminsert
routines, even though nothing really looks at those pg_proc entries...
2005-03-21 01:24:04 +00:00
Neil Conway fe7015f5e8 Change the return value of HeapTupleSatisfiesUpdate() to be an enum,
rather than an integer, and fix the associated fallout. From Alvaro
Herrera.
2005-03-20 23:40:34 +00:00
Tom Lane 354049c709 Remove unnecessary calls of FlushRelationBuffers: there is no need
to write out data that we are about to tell the filesystem to drop.
smgr_internal_unlink already had a DropRelFileNodeBuffers call to
get rid of dead buffers without a write after it's no longer possible
to roll back the deleting transaction.  Adding a similar call in
smgrtruncate simplifies callers and makes the overall division of
labor clearer.  This patch removes the former behavior that VACUUM
would write all dirty buffers of a relation unconditionally.
2005-03-20 22:00:54 +00:00