Commit Graph

1539 Commits

Author SHA1 Message Date
Tom Lane aec4cf1c8c Add a function pg_stat_clear_snapshot() that discards any statistics snapshot
already collected in the current transaction; this allows plpgsql functions to
watch for stats updates even though they are confined to a single transaction.
Use this instead of the previous kluge involving pg_stat_file() to wait for
the stats collector to update in the stats regression test.  Internally,
decouple storage of stats snapshots from transaction boundaries; they'll
now stick around until someone calls pgstat_clear_snapshot --- which xact.c
still does at transaction end, to maintain the previous behavior.  This makes
the logic a lot cleaner, at the price of a couple dozen cycles per transaction
exit.
2007-02-07 23:11:30 +00:00
Tom Lane 78d1216160 Remove the xlog-centric "database system is ready" message and replace it with
"database system is ready to accept connections", which is issued by the
postmaster when it really is ready to accept connections.  Per proposal from
Markus Schiltknecht and subsequent discussion.
2007-02-07 16:44:48 +00:00
Tom Lane c76ed81513 Remove some dead code, per Heikki. 2007-02-06 14:55:11 +00:00
Tom Lane 23c4978e6c Rename MaxTupleSize to MaxHeapTupleSize to clarify that it's not meant to
describe the maximum size of index tuples (which is typically AM-dependent
anyway); and consequently remove the bogus deduction for "special space"
that was built into it.

Adjust TOAST_TUPLE_THRESHOLD and TOAST_MAX_CHUNK_SIZE to avoid wasting two
bytes per toast chunk, and to ensure that the calculation correctly tracks any
future changes in page header size.  The computation had been inaccurate in a
way that didn't cause any harm except space wastage, but future changes could
have broken it more drastically.

Fix the calculation of BTMaxItemSize, which was formerly computed as 1 byte
more than it could safely be.  This didn't cause any harm in practice because
it's only compared against maxalign'd lengths, but future changes in the size
of page headers or btree special space could have exposed the problem.

initdb forced because of change in TOAST_MAX_CHUNK_SIZE, which alters the
storage of toast tables.
2007-02-05 04:22:18 +00:00
Tom Lane a2e092e1c7 Don't MAXALIGN in the checks to decide whether a tuple is over TOAST's
threshold for tuple length.  On 4-byte-MAXALIGN machines, the toast code
creates tuples that have t_len exactly TOAST_TUPLE_THRESHOLD ... but this
number is not itself maxaligned, so if heap_insert maxaligns t_len before
comparing to TOAST_TUPLE_THRESHOLD, it'll uselessly recurse back to
tuptoaster.c, wasting cycles.  (It turns out that this does not happen on
8-byte-MAXALIGN machines, because for them the outer MAXALIGN in the
TOAST_MAX_CHUNK_SIZE macro reduces TOAST_MAX_CHUNK_SIZE so that toast tuples
will be less than TOAST_TUPLE_THRESHOLD in size.  That MAXALIGN is really
incorrect, but we can't remove it now, see below.)  There isn't any particular
value in maxaligning before comparing to the thresholds, so just don't do
that, which saves a small number of cycles in itself.

These numbers should be rejiggered to minimize wasted space on toast-relation
pages, but we can't do that in the back branches because changing
TOAST_MAX_CHUNK_SIZE would force an initdb (by changing the contents of toast
tables).  We can move the toast decision thresholds a bit, though, which is
what this patch effectively does.

Thanks to Pavan Deolasee for discovering the unintended recursion.

Back-patch into 8.2, but not further, pending more testing.  (HEAD is about
to get a further patch modifying the thresholds, so it won't help much
for testing this form of the patch.)
2007-02-04 20:00:37 +00:00
Bruce Momjian 8b4ff8b6a1 Wording cleanup for error messages. Also change can't -> cannot.
Standard English uses "may", "can", and "might" in different ways:

        may - permission, "You may borrow my rake."

        can - ability, "I can lift that log."

        might - possibility, "It might rain today."

Unfortunately, in conversational English, their use is often mixed, as
in, "You may use this variable to do X", when in fact, "can" is a better
choice.  Similarly, "It may crash" is better stated, "It might crash".
2007-02-01 19:10:30 +00:00
Neil Conway dbcaee49b5 Fix a few typos in comments in GiN. 2007-02-01 04:16:08 +00:00
Teodor Sigaev d4c6da1527 Allow GIN's extractQuery method to signal that nothing can satisfy the query.
In this case extractQuery should returns -1 as nentries. This changes
prototype of extractQuery method to use int32* instead of uint32* for
nentries argument.
Based on that gincostestimate may see two corner cases: nothing will be found
or seqscan should be used.

Per proposal at http://archives.postgresql.org/pgsql-hackers/2007-01/msg01581.php

PS tsearch_core patch should be sightly modified to support changes, but I'm
waiting a verdict about reviewing of tsearch_core patch.
2007-01-31 15:09:45 +00:00
Tom Lane a635c08fa1 Add support for cross-type hashing in hash index searches and hash joins.
Hashing for aggregation purposes still needs work, so it's not time to
mark any cross-type operators as hashable for general use, but these cases
work if the operators are so marked by hand in the system catalogs.
2007-01-30 01:33:36 +00:00
Tom Lane e8cd6f14a2 Add comment noting that hashm_procid in a hash index's metapage isn't
actually used for anything.
2007-01-29 23:22:59 +00:00
Tom Lane 6cefacd7c8 Correct an old logic error in btree page splitting: when considering a split
exactly at the point where we need to insert a new item, the calculation used
the wrong size for the "high key" of the new left page.  This could lead to
choosing an unworkable split, resulting in "PANIC: failed to add item to the
left sibling" (or "right sibling") failure.  Although this bug has been there
a long time, it's very difficult to trigger a failure before 8.2, since there
was generally a lot of free space on both sides of a chosen split.  In 8.2,
where the user-selected fill factor determines how much free space the code
tries to leave, an unworkable split is much more likely.  Report by Joe
Conway, diagnosis and fix by Heikki Linnakangas.
2007-01-27 20:53:30 +00:00
Bruce Momjian ef65f6f7a4 Prevent WAL logging when COPY is done in the same transation that
created it.

Simon Riggs
2007-01-25 02:17:26 +00:00
Neil Conway 2b7334d487 Refactor the index AM API slightly: move currentItemData and
currentMarkData from IndexScanDesc to the opaque structs for the
AMs that need this information (currently gist and hash).

Patch from Heikki Linnakangas, fixes by Neil Conway.
2007-01-20 18:43:35 +00:00
Peter Eisentraut 2cc01004c6 Remove remains of old depend target. 2007-01-20 17:16:17 +00:00
Alvaro Herrera eb63cc3da8 Arrange for autovacuum to be killed when another operation wants to be alone
accessing it, like DROP DATABASE.  This allows the regression tests to pass
with autovacuum enabled, which open the gates for finally enabling autovacuum
by default.
2007-01-16 13:28:57 +00:00
Tom Lane d83235415b Add some notes about the basic mathematical laws that the system presumes
hold true for operators in a btree operator family.  This is mostly to
clarify my own thinking about what the planner can assume for optimization
purposes.  (blowing dust off an old abstract-algebra textbook...)
2007-01-12 17:04:54 +00:00
Bruce Momjian 40f797be03 Enable another five tuple status bits by using the high bits of the
nattr field, and rename the field.

Heikki Linnakangas
2007-01-09 22:01:00 +00:00
Tom Lane 69db009163 Add a citation to Seltzer and Yigit's Usenix '91 paper about hash table
management.  The paper clearly describes many of the ideas embodied in
our current hashing code, but as far as I could find out there is not
a direct code heritage.  (Mike Olsen recalls discussion of this paper
at Postgres meetings but believes it "informed the Postgres implementation
probably just at the design level".  Margo herself says she wasn't
involved with Postgres' hash code.)  Credit where credit is due 'n all
that, even if fifteen years after the fact.
2007-01-09 07:30:49 +00:00
Tom Lane 4431758229 Support ORDER BY ... NULLS FIRST/LAST, and add ASC/DESC/NULLS FIRST/NULLS LAST
per-column options for btree indexes.  The planner's support for this is still
pretty rudimentary; it does not yet know how to plan mergejoins with
nondefault ordering options.  The documentation is pretty rudimentary, too.
I'll work on improving that stuff later.

Note incompatible change from prior behavior: ORDER BY ... USING will now be
rejected if the operator is not a less-than or greater-than member of some
btree opclass.  This prevents less-than-sane behavior if an operator that
doesn't actually define a proper sort ordering is selected.
2007-01-09 02:14:16 +00:00
Bruce Momjian 29dccf5fe0 Update CVS HEAD for 2007 copyright. Back branches are typically not
back-stamped for this.
2007-01-05 22:20:05 +00:00
Tom Lane 7c8927bf08 Fix some small typos in comments. Greg Stark 2007-01-04 16:29:42 +00:00
Tom Lane ef07221997 Clean up smgr.c/md.c APIs as per discussion a couple months ago. Instead of
having md.c return a success/failure boolean to smgr.c, which was just going
to elog anyway, let md.c issue the elog messages itself.  This allows better
error reporting, particularly in cases such as "short read" or "short write"
which Peter was complaining of.  Also, remove the kluge of allowing mdread()
to return zeroes from a read-beyond-EOF: this is now an error condition
except when InRecovery or zero_damaged_pages = true.  (Hash indexes used to
require that behavior, but no more.)  Also, enforce that mdwrite() is to be
used for rewriting existing blocks while mdextend() is to be used for
extending the relation EOF.  This restriction lets us get rid of the old
ad-hoc defense against creating huge files by an accidental reference to
a bogus block number: we'll only create new segments in mdextend() not
mdwrite() or mdread().  (Again, when InRecovery we allow it anyway, since
we need to allow updates of blocks that were later truncated away.)
Also, clean up the original makeshift patch for bug #2737: move the
responsibility for padding relation segments to full length into md.c.
2007-01-03 18:11:01 +00:00
Tom Lane 5725b9d9af Support type modifiers for user-defined types, and pull most knowledge
about typmod representation for standard types out into type-specific
typmod I/O functions.  Teodor Sigaev, with some editorialization by
Tom Lane.
2006-12-30 21:21:56 +00:00
Tom Lane 9aefd56669 Fix up btree's initial scankey processing to be able to detect redundant
or contradictory keys even in cross-data-type scenarios.  This is another
benefit of the opfamily rewrite: we can find the needed comparison
operators now.
2006-12-28 23:16:39 +00:00
Tom Lane a78fcfb512 Restructure operator classes to allow improved handling of cross-data-type
cases.  Operator classes now exist within "operator families".  While most
families are equivalent to a single class, related classes can be grouped
into one family to represent the fact that they are semantically compatible.
Cross-type operators are now naturally adjunct parts of a family, without
having to wedge them into a particular opclass as we had done originally.

This commit restructures the catalogs and cleans up enough of the fallout so
that everything still works at least as well as before, but most of the work
needed to actually improve the planner's behavior will come later.  Also,
there are not yet CREATE/DROP/ALTER OPERATOR FAMILY commands; the only way
to create a new family right now is to allow CREATE OPERATOR CLASS to make
one by default.  I owe some more documentation work, too.  But that can all
be done in smaller pieces once this infrastructure is in place.
2006-12-23 00:43:13 +00:00
Tom Lane 0cb91ccba9 Remove the logId/logSeg fields from pg_control, because they are not needed
in normal operation, and we can avoid rewriting pg_control at every log
segment switch if we don't insist that these values be valid.  Reducing
the number of pg_control updates is a good idea for both performance and
reliability.  It does make pg_resetxlog's life a bit harder, but that seems
a good tradeoff; and anyway the change to pg_resetxlog amounts to automating
something people formerly needed to do by hand, namely look at the existing
pg_xlog files to make sure the new WAL start point was past them.

In passing, change the wording of xlog.c's "database system was interrupted"
messages: describe the pg_control timestamp as "last known up at" rather than
implying it is the exact time of service interruption.  With this change the
timestamp will generally be the time of the last checkpoint, which could be
many minutes before the failure; and we've already seen indications that
people tend to misinterpret the old wording.

initdb forced due to change in pg_control layout.  Simon Riggs and Tom Lane
2006-12-08 19:50:53 +00:00
Neil Conway 886a02d1cb Add a txn_start column to pg_stat_activity. This makes it easier to
identify long-running transactions. Since we already need to record
the transaction-start time (e.g. for now()), we don't need any
additional system calls to report this information.

Catversion bumped, initdb required.
2006-12-06 18:06:48 +00:00
Tom Lane 5f60086e10 Minor adjustments to make failures in startup/shutdown behave more cleanly.
StartupXLOG and ShutdownXLOG no longer need to be critical sections, because
in all contexts where they are invoked, elog(ERROR) would be translated to
elog(FATAL) anyway.  (One change in bgwriter.c is needed to make this true:
set ExitOnAnyError before trying to exit.  This is a good fix anyway since
the existing code would have gone into an infinite loop on elog(ERROR) during
shutdown.)  That avoids a misleading report of PANIC during semi-orderly
failures.  Modify the postmaster to include the startup process in the set of
processes that get SIGTERM when a fast shutdown is requested, and also fix it
to not try to restart the bgwriter if the bgwriter fails while trying to write
the shutdown checkpoint.  Net result is that "pg_ctl stop -m fast" does
something reasonable for a system in warm standby mode, and so should Unix
system shutdown (ie, universal SIGTERM).  Per gripe from Stephen Harris and
some corner-case testing of my own.
2006-11-30 18:29:12 +00:00
Teodor Sigaev ef148d6b85 Fix bug with page deletion. If inner page is removed and it tries to
remove page on next level linked from next inner page, ginScanToDelete()
wrongly sets parent page. Bug reveals when many item pointers from index
was deleted ( several hundred thousands).

Bug is discovered by hubert depesz lubaczewski <depesz@gmail.com>

Suppose, we need rc2 before release...
2006-11-30 16:22:32 +00:00
Neil Conway 546d6848ca Add a comment noting that heap_copytuple_with_tuple() results in a
HeapTuple that is no longer allocated as a single palloc() block; if
used carelessly, this might result in a subsequent memory leak after
heap_freetuple().
2006-11-23 05:27:18 +00:00
Tom Lane 395249ecbe Several changes to reduce the probability of running out of memory during
AbortTransaction, which would lead to recursion and eventual PANIC exit
as illustrated in recent report from Jeff Davis.  First, in xact.c create
a special dedicated memory context for AbortTransaction to run in.  This
solves the problem as long as AbortTransaction doesn't need more than 32K
(or whatever other size we create the context with).  But in corner cases
it might.  Second, in trigger.c arrange to keep pending after-trigger event
records in separate contexts that can be freed near the beginning of
AbortTransaction, rather than having them persist until CleanupTransaction
as before.  Third, in portalmem.c arrange to free executor state data
earlier as well.  These two changes should result in backing off the
out-of-memory condition before AbortTransaction needs any significant
amount of memory, at least in typical cases such as memory overrun due
to too many trigger events or too big an executor hash table.  And all
the same for subtransaction abort too, of course.
2006-11-23 01:14:59 +00:00
Tom Lane 3ad0728c81 On systems that have setsid(2) (which should be just about everything except
Windows), arrange for each postmaster child process to be its own process
group leader, and deliver signals SIGINT, SIGTERM, SIGQUIT to the whole
process group not only the direct child process.  This provides saner behavior
for archive and recovery scripts; in particular, it's possible to shut down a
warm-standby recovery server using "pg_ctl stop -m immediate", since delivery
of SIGQUIT to the startup subprocess will result in killing the waiting
recovery_command.  Also, this makes Query Cancel and statement_timeout apply
to scripts being run from backends via system().  (There is no support in the
core backend for that, but it's widely done using untrusted PLs.)  Per gripe
from Stephen Harris and subsequent discussion.
2006-11-21 20:59:53 +00:00
Tom Lane d68efb3f8d Repair problems with hash indexes that span multiple segments: the hash code's
preference for filling pages out-of-order tends to confuse the sanity checks
in md.c, as per report from Balazs Nagy in bug #2737.  The fix is to ensure
that the smgr-level code always has the same idea of the logical EOF as the
hash index code does, by using ReadBuffer(P_NEW) where we are adding a single
page to the end of the index, and using smgrextend() to reserve a large batch
of pages when creating a new splitpoint.  The patch is a bit ugly because it
avoids making any changes in md.c, which seems the most prudent approach for a
backpatchable beta-period fix.  After 8.3 development opens, I'll take a look
at a cleaner but more invasive patch, in particular getting rid of the now
unnecessary hack to allow reading beyond EOF in mdread().

Backpatch as far as 7.4.  The bug likely exists in 7.3 as well, but because
of the magnitude of the 7.3-to-7.4 changes in hash, the later-version patch
doesn't even begin to apply.  Given the other known bugs in the 7.3-era hash
code, it does not seem worth trying to develop a separate patch for 7.3.
2006-11-19 21:33:23 +00:00
Tom Lane 4f335a3d7f Repair two related errors in heap_lock_tuple: it was failing to recognize
cases where we already hold the desired lock "indirectly", either via
membership in a MultiXact or because the lock was originally taken by a
different subtransaction of the current transaction.  These cases must be
accounted for to avoid needless deadlocks and/or inappropriate replacement of
an exclusive lock with a shared lock.  Per report from Clarence Gardner and
subsequent investigation.
2006-11-17 18:00:15 +00:00
Peter Eisentraut e138b80996 String fix 2006-11-16 14:28:41 +00:00
Neil Conway dc10387eb1 Fix some typos in comments. 2006-11-12 06:55:54 +00:00
Tom Lane a46ca619f8 Suppress a few 'uninitialized variable' warnings that gcc emits only at
-O3 or higher (presumably because it inlines more things).  Per gripe
from Mark Mielke.
2006-11-11 01:14:19 +00:00
Tom Lane 792d6edd5b Clean up some misleading references to %p being a full path, per Simon. 2006-11-10 22:32:20 +00:00
Tom Lane dcbdf9b1d4 Change Windows rename and unlink substitutes so that they time out after
30 seconds instead of retrying forever.  Also modify xlog.c so that if
it fails to rename an old xlog segment up to a future slot, it will
unlink the segment instead.  Per discussion of bug #2712, in which it
became apparent that Windows can handle unlinking a file that's being
held open, but not renaming it.
2006-11-08 20:12:05 +00:00
Tom Lane 48188e1621 Fix recently-understood problems with handling of XID freezing, particularly
in PITR scenarios.  We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases.  Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId.  Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done.  Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database.  initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs.  Heikki Linnakangas, Simon Riggs, and Tom Lane.
2006-11-05 22:42:10 +00:00
Tom Lane 70ce5c9082 Fix "failed to re-find parent key" btree VACUUM failure by revising page
deletion code to avoid the case where an upper-level btree page remains "half
dead" for a significant period of time, and to block insertions into a key
range that is in process of being re-assigned to the right sibling of the
deleted page's parent.  This prevents the scenario reported by Ed L. wherein
index keys could become out-of-order in the grandparent index level.

Since this is a moderately invasive fix, I'm applying it only to HEAD.
The bug exists back to 7.4, but the back branches will get a different patch.
2006-11-01 19:43:17 +00:00
Tom Lane 1e758d5263 Add some code to CREATE DATABASE to check for pre-existing subdirectories
that conflict with the OID that we want to use for the new database.
This avoids the risk of trying to remove files that maybe we shouldn't
remove.  Per gripe from Jon Lapham and subsequent discussion of 27-Sep.
2006-10-18 22:44:12 +00:00
Peter Eisentraut b9b4f10b5b Message style improvements 2006-10-06 17:14:01 +00:00
Tom Lane 378c79dc78 Cleanup for pglz_compress code: remove dead code, const-ify API of
remaining functions, simplify pglz_compress's API to not require a useless
data copy when compression fails.  Also add a check in pglz_decompress that
the expected amount of data was decompressed.
2006-10-05 23:33:33 +00:00
Tom Lane e378f82e00 Make use of qsort_arg in several places that were formerly using klugy
static variables.  This avoids any risk of potential non-reentrancy,
and in particular offers a much cleaner workaround for the Intel compiler
bug that was affecting ginutil.c.
2006-10-05 17:57:40 +00:00
Bruce Momjian f99a569a2e pgindent run for 8.2. 2006-10-04 00:30:14 +00:00
Bruce Momjian 45c8ed96b9 Make some sentences consistent with similar ones.
Euler Taveira de Oliveira
2006-10-03 21:21:36 +00:00
Alvaro Herrera 4650c4fdb9 Degrade the transaction-id wraparound point message from LOG to DEBUG1, per
discussion.

Patch from Simon Riggs.
2006-09-26 17:21:39 +00:00
Tom Lane 9e936693a9 Fix free space map to correctly track the total amount of FSM space needed
even when a single relation requires more than max_fsm_pages pages.  Also,
make VACUUM emit a warning in this case, since it likely means that VACUUM
FULL or other drastic corrective measure is needed.  Per reports from Jeff
Frost and others of unexpected changes in the claimed max_fsm_pages need.
2006-09-21 20:31:22 +00:00
Teodor Sigaev deb66e013c Improve error message. Per discussion
http://archives.postgresql.org/pgsql-general/2006-09/msg00186.php
2006-09-14 11:26:49 +00:00
Bruce Momjian d18768867e Remove unnecessary brace pair. 2006-09-10 23:33:22 +00:00
Tom Lane f5b4d9a9e0 If we're going to advertise the array overlap/containment operators,
we probably should make them work reliably for all arrays.  Fix code
to handle NULLs and multidimensional arrays, move it into arrayfuncs.c.
GIN is still restricted to indexing arrays with no null elements, however.
2006-09-10 20:14:20 +00:00
Tom Lane ba920e1c91 Rename contains/contained-by operators to @> and <@, per discussion that
agreed these symbols are less easily confused.  I made new pg_operator
entries (with new OIDs) for the old names, so as to provide backward
compatibility while making it pretty easy to remove the old names in
some future release cycle.  This commit only touches the core datatypes,
contrib will be fixed separately.
2006-09-10 00:29:35 +00:00
Teodor Sigaev 889ec4b998 Fix Intel compiler bug. Per discussion
'GIN FailedAssertions on Itanium2 with Intel compiler' in pgsql-hackers,
http://archives.postgresql.org/pgsql-hackers/2006-08/msg01914.php
2006-09-05 18:25:10 +00:00
Tom Lane 8fad2e3ff4 Arrange for GetSnapshotData to copy live-subtransaction XIDs from the
PGPROC array into snapshots, and use this information to avoid visits
to pg_subtrans in HeapTupleSatisfiesSnapshot.  This appears to solve
the pg_subtrans-related context swap storm problem that's been reported
by several people for 8.1.  While at it, modify GetSnapshotData to not
take an exclusive lock on ProcArrayLock, as closer analysis shows that
shared lock is always sufficient.
Itagaki Takahiro and Tom Lane
2006-09-03 15:59:39 +00:00
Teodor Sigaev b681bfdd59 Fix BUG #2594: Gin Indexes cause server to crash when it builds on empty table 2006-08-29 14:05:44 +00:00
Tom Lane ca1fd0ea5b Move xact.c's partial support for Lists of TransactionIds into pg_list.h.
Needed because lock.c is now going to use the same type of list.
2006-08-27 19:11:46 +00:00
Tom Lane e093dcdd28 Add the ability to create indexes 'concurrently', that is, without
blocking concurrent writes to the table.  Greg Stark, with a little help
from Tom Lane.
2006-08-25 04:06:58 +00:00
Tom Lane 08ae5edc5c Optimize the case where a btree indexscan has current and mark positions
on the same index page; we can avoid data copying as well as buffer refcount
manipulations in this common case.  Makes for a small but noticeable
improvement in mergejoin speed.

Heikki Linnakangas
2006-08-24 01:18:34 +00:00
Tom Lane 35af5422f6 Make the server track an 'XID epoch', that is, maintain higher-order bits
of the transaction ID counter.  Nothing is done with the epoch except to
store it in checkpoint records, but this provides a foundation with which
add-on code can pretend that XIDs never wrap around.  This is a severely
trimmed and rewritten version of the xxid patch submitted by Marko Kreen.
Per discussion, the epoch counter seems the only part of xxid that really
needs to be in the core server.
2006-08-21 16:16:31 +00:00
Tom Lane 7aa772f03e Now that we've rearranged relation open to get a lock before touching
the rel, it's easy to get rid of the narrow race-condition window that
used to exist in VACUUM and CLUSTER.  Did some minor code-beautification
work in the same area, too.
2006-08-18 16:09:13 +00:00
Tom Lane e8ea9e9587 Implement archive_timeout feature to force xlog file switches to occur no more
than N seconds apart.  This allows a simple, if not very high performance,
means of guaranteeing that a PITR archive is no more than N seconds behind
real time.  Also make pg_current_xlog_location return the WAL Write pointer,
add pg_current_xlog_insert_location to return the Insert pointer, and fix
pg_xlogfile_name_offset to return its results as a two-element record instead
of a smashed-together string, as per recent discussion.

Simon Riggs
2006-08-17 23:04:10 +00:00
Tom Lane 7a3e30e608 Add INSERT/UPDATE/DELETE RETURNING, with basic docs and regression tests.
plpgsql support to come later.  Along the way, convert execMain's
SELECT INTO support into a DestReceiver, in order to eliminate some ugly
special cases.

Jonah Harris and Tom Lane
2006-08-12 02:52:06 +00:00
Tom Lane e002836913 Make recovery from WAL be restartable, by executing a checkpoint-like
operation every so often.  This improves the usefulness of PITR log
shipping for hot standby: formerly, if the standby server crashed, it
was necessary to restart it from the last base backup and replay all
the WAL since then.  Now it will only need to reread about the same
amount of WAL as the master server would.  The behavior might also
come in handy during a long PITR replay sequence.  Simon Riggs,
with some editorialization by Tom Lane.
2006-08-07 16:57:57 +00:00
Tom Lane 704ddaaa09 Add support for forcing a switch to a new xlog file; cause such a switch
to happen automatically during pg_stop_backup().  Add some functions for
interrogating the current xlog insertion point and for easily extracting
WAL filenames from the hex WAL locations displayed by pg_stop_backup
and friends.  Simon Riggs with some editorialization by Tom Lane.
2006-08-06 03:53:44 +00:00
Tom Lane bc8ac3ce40 Add missing pgstat_count_index_scan(), per Andreas Seltenreich. 2006-08-03 15:22:09 +00:00
Tom Lane 09d3670df3 Change the relation_open protocol so that we obtain lock on a relation
(table or index) before trying to open its relcache entry.  This fixes
race conditions in which someone else commits a change to the relation's
catalog entries while we are in process of doing relcache load.  Problems
of that ilk have been reported sporadically for years, but it was not
really practical to fix until recently --- for instance, the recent
addition of WAL-log support for in-place updates helped.

Along the way, remove pg_am.amconcurrent: all AMs are now expected to support
concurrent update.
2006-07-31 20:09:10 +00:00
Alvaro Herrera 92c2ecc130 Modify snapshot definition so that lazy vacuums are ignored by other
vacuums.  This allows a OLTP-like system with big tables to continue
regular vacuuming on small-but-frequently-updated tables while the
big tables are being vacuumed.

Original patch from Hannu Krossing, rewritten by Tom Lane and updated
by me.
2006-07-30 02:07:18 +00:00
Tom Lane e6284649b9 Modify btree to delete known-dead index entries without an actual VACUUM.
When we are about to split an index page to do an insertion, first look
to see if any entries marked LP_DELETE exist on the page, and if so remove
them to try to make enough space for the desired insert.  This should reduce
index bloat in heavily-updated tables, although of course you still need
VACUUM eventually to clean up the heap.

Junji Teramoto
2006-07-25 19:13:00 +00:00
Peter Eisentraut e9b4969062 DTrace support, with a small initial set of probes
by Robert Lor
2006-07-24 16:32:45 +00:00
Tom Lane 9dc842f083 Don't try to truncate multixact SLRU files in checkpoints done during xlog
recovery.  In the first place, it doesn't work because slru's
latest_page_number isn't set up yet (this is why we've been hearing reports
of strange "apparent wraparound" log messages during crash recovery, but
only from people who'd managed to advance their next-mxact counters some
considerable distance from 0).  In the second place, it seems a bit unwise
to be throwing away data during crash recovery anwyway.  This latter
consideration convinces me to just disable truncation during recovery,
rather than computing latest_page_number and pushing ahead.
2006-07-20 00:46:42 +00:00
Tom Lane c36418be40 Fix getDatumCopy(): don't use store_att_byval to copy into a Datum
variable (this accounts for regression failures on PPC64, and in fact
won't work on any big-endian machine).  Get rid of hardwired knowledge
about datum size rules; make it look just like datumCopy().
2006-07-16 00:54:22 +00:00
Tom Lane e040ab44e4 Improve error message wording. 2006-07-16 00:52:05 +00:00
Tom Lane 98bac16e4d Fix misguided removal of access/tuptoaster.h inclusion, per Kris Jurka.
I'm going to insist on reversion of this entire patch unless pgrminclude
is upgraded to a less broken state, but in the meantime let's get contrib
passing regression again.
2006-07-14 19:05:52 +00:00
Bruce Momjian e0522505bd Remove 576 references of include files that were not needed. 2006-07-14 14:52:27 +00:00
Bruce Momjian a22d76d96a Allow include files to compile own their own.
Strip unused include files out unused include files, and add needed
includes to C files.

The next step is to remove unused include files in C files.
2006-07-13 16:49:20 +00:00
Tom Lane d29b66882a Tweak fillfactor code as per my recent proposal. Fix nbtsort.c so that
it can handle small fillfactors for ordinary-sized index entries without
failing on large ones; fix nbtinsert.c to distinguish leaf and nonleaf
pages; change the minimum fillfactor to 10% for all index types.
2006-07-11 21:05:57 +00:00
Teodor Sigaev 001d30ee6b Add support to GIN for =(anyarray,anyarray) operation 2006-07-11 19:49:14 +00:00
Bruce Momjian ac230e7431 Alphabetically order reference to include files, "S"-"Z". 2006-07-11 18:26:11 +00:00
Bruce Momjian 0ff3461bcc Alphabetically order reference to include files, "N" - "S". 2006-07-11 17:26:59 +00:00
Bruce Momjian 3a534ade39 Alphabetically order reference to include files, "G" - "M". 2006-07-11 17:04:13 +00:00
Teodor Sigaev 234163649e GIN improvements
- Replace sorted array of entries in maintenance_work_mem to binary tree,
  this should improve create performance.
- More precisely calculate allocated memory, eliminate leaks
  with user-defined extractValue()
- Improve wordings in tsearch2
2006-07-11 16:55:34 +00:00
Alvaro Herrera d4cef0aa2a Improve vacuum code to track minimum Xids per table instead of per database.
To this end, add a couple of columns to pg_class, relminxid and relvacuumxid,
based on which we calculate the pg_database columns after each vacuum.

We now force all databases to be vacuumed, even template ones.  A backend
noticing too old a database (meaning pg_database.datminxid is in danger of
falling behind Xid wraparound) will signal the postmaster, which in turn will
start an autovacuum iteration to process the offending database.  In principle
this is only there to cope with frozen (non-connectable) databases without
forcing users to set them to connectable, but it could force regular user
database to go through a database-wide vacuum at any time.  Maybe we should
warn users about this somehow.  Of course the real solution will be to use
autovacuum all the time ;-)

There are some additional improvements we could have in this area: for example
the vacuum code could be smarter about not updating pg_database for each table
when called by autovacuum, and do it only once the whole autovacuum iteration
is done.

I updated the system catalogs documentation, but I didn't modify the
maintenance section.  Also having some regression tests for this would be nice
but it's not really a very straightforward thing to do.

Catalog version bumped due to system catalog changes.
2006-07-10 16:20:52 +00:00
Tom Lane b7b78d24f7 Code review for FILLFACTOR patch. Change WITH grammar as per earlier
discussion (including making def_arg allow reserved words), add missed
opt_definition for UNIQUE case.  Put the reloptions support code in a less
random place (I chose to make a new file access/common/reloptions.c).
Eliminate header inclusion creep.  Make the index options functions safely
user-callable (seems like client apps might like to be able to test validity
of options before trying to make an index).  Reduce overhead for normal case
with no options by allowing rd_options to be NULL.  Fix some unmaintainably
klugy code, including getting rid of Natts_pg_class_fixed at long last.
Some stylistic cleanup too, and pay attention to keeping comments in sync
with code.

Documentation still needs work, though I did fix the omissions in
catalogs.sgml and indexam.sgml.
2006-07-03 22:45:41 +00:00
Bruce Momjian 277807bd9e Add FILLFACTOR to CREATE INDEX.
ITAGAKI Takahiro
2006-07-02 02:23:23 +00:00
Teodor Sigaev 783a73168b Forget to add new file :(( 2006-06-28 12:08:35 +00:00
Teodor Sigaev 1f7ef548ec Changes
* new split algorithm (as proposed in http://archives.postgresql.org/pgsql-hackers/2006-06/msg00254.php)
  * possible call pickSplit() for second and below columns
  * add spl_(l|r)datum_exists to GIST_SPLITVEC -
    pickSplit should check its values to use already defined
    spl_(l|r)datum for splitting. pickSplit should set
    spl_(l|r)datum_exists to 'false' (if they was 'true') to
    signal to caller about using spl_(l|r)datum.
  * support for old pickSplit(): not very optimal
    but correct split
* remove 'bytes' field from GISTENTRY: in any case size of
  value is defined by it's type.
* split GIST_SPLITVEC to two structures: one for using in picksplit
  and second - for internal use.
* some code refactoring
* support of subsplit to rtree opclasses

TODO: add support of subsplit to contrib modules
2006-06-28 12:00:14 +00:00
Tom Lane 3c71244b74 Put #ifdef NOT_USED around posix_fadvise call. We may want to resurrect
this someday, but right now it seems that posix_fadvise is immature to
the point of being broken on many platforms ... and we don't have any
benchmark evidence proving it's worth spending time on.
2006-06-27 18:59:17 +00:00
Tom Lane cdd5178c69 Extend the MinimalTuple concept to tuplesort.c, thereby reducing the
per-tuple space overhead for sorts in memory.  I chose to replace the
previous patch that tried to write out the bare minimum amount of data
when sorting on disk; instead, just dump the MinimalTuples as-is.  This
wastes 3 to 10 bytes per tuple depending on architecture and null-bitmap
length, but the simplification in the writetup/readtup routines seems
worth it.
2006-06-27 16:53:02 +00:00
Tom Lane 3f50ba27cf Create infrastructure for 'MinimalTuple' representation of in-memory
tuples with less header overhead than a regular HeapTuple, per my
recent proposal.  Teach TupleTableSlot code how to deal with these.
As proof of concept, change tuplestore.c to store MinimalTuples instead
of HeapTuples.  Future patches will expand the concept to other places
where it is useful.
2006-06-27 02:51:40 +00:00
Tom Lane 3a04f53e7f pg_stop_backup was calling XLogArchiveNotify() twice for the newly created
backup history file.  Bug introduced by the 8.1 change to make pg_stop_backup
delete older history files.  Per report from Masao Fujii.
2006-06-22 20:42:57 +00:00
Tom Lane 27c3e3de09 Remove redundant gettimeofday() calls to the extent practical without
changing semantics too much.  statement_timestamp is now set immediately
upon receipt of a client command message, and the various places that used
to do their own gettimeofday() calls to mark command startup are referenced
to that instead.  I have also made stats_command_string use that same
value for pg_stat_activity.query_start for both the command itself and
its eventual replacement by <IDLE> or <idle in transaction>.  There was
some debate about that, but no argument that seemed convincing enough to
justify an extra gettimeofday() call.
2006-06-20 22:52:00 +00:00
Tom Lane 1e8ae13640 Don't try to call posix_fadvise() unless <fcntl.h> supplies a declaration
for it.  Hopefully will fix core dump evidenced by some buildfarm members
since fadvise patch went in.  The actual definition of the function is not
ABI-compatible with compiler's default assumption in the absence of any
declaration, so it's clearly unsafe to try to call it without seeing a
declaration.
2006-06-18 18:30:21 +00:00
Tom Lane 06e10abc0b Fix problems with cached tuple descriptors disappearing while still in use
by creating a reference-count mechanism, similar to what we did a long time
ago for catcache entries.  The back branches have an ugly solution involving
lots of extra copies, but this way is more efficient.  Reference counting is
only applied to tupdescs that are actually in caches --- there seems no need
to use it for tupdescs that are generated in the executor, since they'll go
away during plan shutdown by virtue of being in the per-query memory context.
Neil Conway and Tom Lane
2006-06-16 18:42:24 +00:00
Bruce Momjian 40bc06fa16 Test for POSIX_FADV_DONTNEED to use posix_fadvise(). 2006-06-16 04:11:48 +00:00
Bruce Momjian 94a5c4a01b Use posix_fadvise() to avoid kernel caching of WAL contents on WAL file
close.

ITAGAKI Takahiro
2006-06-15 19:15:00 +00:00
Teodor Sigaev b32000eda4 Som improve page split in multicolumn GiST index.
If user picksplit on n-th column generate equals
left and right unions then it calls picksplit on n+1-th
column.
2006-05-29 12:50:06 +00:00
Teodor Sigaev 0a6fde5a26 Correct cheking in findParents(). i
From Andreas Seltenreich <andreas+pg@gate450.dyndns.org>
2006-05-29 08:39:44 +00:00
Alvaro Herrera 3d58a1c168 Remove traces of otherwise unused RELKIND_SPECIAL symbol. Leave the psql bits
in place though, so that it plays nicely with older servers.

Per discussion.
2006-05-28 02:27:08 +00:00
Teodor Sigaev 5d1a066e64 Fix findParents() in case of multiple levels to find.
By Andreas Seltenreich <andreas+pg@gate450.dyndns.org>
2006-05-26 08:01:17 +00:00
Teodor Sigaev d2158b0281 * Add support NULL to GiST.
* some refactoring and simplify code int gistutil.c and gist.c
* now in some cases it can be called used-defined
  picksplit method for non-first column in index, but here
	is a place to do more.
* small fix of docs related to support NULL.
2006-05-24 11:01:39 +00:00
Teodor Sigaev 09518fbdf4 Call MarkBufferDirty() before XLogInsert() during completion of insert 2006-05-19 17:15:41 +00:00
Teodor Sigaev 420cbff881 Simplify gistSplit() and some refactoring related code. 2006-05-19 16:15:17 +00:00
Teodor Sigaev 5890790b4a Rework completion of incomplete inserts. Now it writes
WAL log during inserts.
2006-05-19 11:10:25 +00:00
Teodor Sigaev 8876e37d07 Reduce size of critial section during vacuum full, critical
sections now isn't nested. All user-defined functions now is
called outside critsections. Small improvements in WAL
protocol.

TODO: improve XLOG replay
2006-05-17 16:34:59 +00:00
Tom Lane 3fdeb189e9 Clean up code associated with updating pg_class statistics columns
(relpages/reltuples).  To do this, create formal support in heapam.c for
"overwrite" tuple updates (including xlog replay capability) and use that
instead of the ad-hoc overwrites we'd been using in VACUUM and CREATE INDEX.
Take the responsibility for updating stats during CREATE INDEX out of the
individual index AMs, and do it where it belongs, in catalog/index.c.  Aside
from being more modular, this avoids having to update the same tuple twice in
some paths through CREATE INDEX.  It's probably not measurably faster, but
for sure it's a lot cleaner than before.
2006-05-10 23:18:39 +00:00
Teodor Sigaev 10dd8df68e Reduce size of critical section and remove call of user-defined functions in
insertion and deletion, modify gistSplit() to do not use buffers.

 TODO: gistvacuumcleanup and XLOG
2006-05-10 09:19:54 +00:00
Tom Lane 5749f6ef0c Rewrite btree vacuuming to fold the former bulkdelete and cleanup operations
into a single mostly-physical-order scan of the index.  This requires some
ticklish interlocking considerations, but should create no material
performance impact on normal index operations (at least given the
already-committed changes to make scans work a page at a time).  VACUUM
itself should get significantly faster in any index that's degenerated to a
very nonlinear page order.  Also, we save one pass over the index entirely,
except in the case where there were no deletions to do and so only one pass
happened anyway.

Original patch by Heikki Linnakangas, rework by Tom Lane.
2006-05-08 00:00:17 +00:00
Tom Lane 09cb5c0e7d Rewrite btree index scans to work a page at a time in all cases (both
btgettuple and btgetmulti).  This eliminates the problem of "re-finding" the
exact stopping point, since the stopping point is effectively always a page
boundary, and index items are never moved across pre-existing page boundaries.
A small penalty is that the keys_are_unique optimization is effectively
disabled (and, therefore, is removed in this patch), causing us to apply
_bt_checkkeys() to at least one more tuple than necessary when looking up a
unique key.  However, the advantages for non-unique cases seem great enough to
accept this tradeoff.  Aside from simplifying and (sometimes) speeding up the
indexscan code, this will allow us to reimplement btbulkdelete as a largely
sequential scan instead of index-order traversal, thereby significantly
reducing the cost of VACUUM.  Those changes will come in a separate patch.

Original patch by Heikki Linnakangas, rework by Tom Lane.
2006-05-07 01:21:30 +00:00
Teodor Sigaev 2a58f3bff6 Fix typo noticed by Alvaro Herrera 2006-05-03 06:56:47 +00:00
Tom Lane e57345975c Clean up API for ambulkdelete/amvacuumcleanup as per today's discussion.
This formulation requires every AM to provide amvacuumcleanup, unlike before,
but it's surely a whole lot cleaner.  Also, add an 'amstorage' column to
pg_am so that we can get rid of hardwired knowledge in DefineOpClass().
2006-05-02 22:25:10 +00:00
Tom Lane 67030eec1e Suppress some gcc warnings. 2006-05-02 15:48:11 +00:00
Teodor Sigaev 8a3631f8d8 GIN: Generalized Inverted iNdex.
text[], int4[], Tsearch2 support for GIN.
2006-05-02 11:28:56 +00:00
Tom Lane d2896a9ed1 Arrange to cache btree metapage data in the relcache entry for the index,
thereby saving a visit to the metapage in most index searches/updates.
This wouldn't actually save any I/O (since in the old regime the metapage
generally stayed in cache anyway), but it does provide a useful decrease
in bufmgr traffic in high-contention scenarios.  Per my recent proposal.
2006-04-25 22:46:05 +00:00
Bruce Momjian e6004f0151 Add statement_timestamp(), clock_timestamp(), and
transaction_timestamp() (just like now()).

Also update statement_timeout() to mention it is statement arrival time
that is measured.

Catalog version updated.
2006-04-25 00:25:22 +00:00
Tom Lane eac825aa68 Ensure that we validate the page header of the first page of a WAL file
whenever we start to read within that file.  The first page carries
extra identification information that really ought to be checked, but
as the code stood, this was only checked when we switched sequentially
into a new WAL file, or if by chance the starting checkpoint record was
within the first page.  This patch ensures that we will detect bogus
'long header' information before we start replaying the WAL sequence.
2006-04-20 04:07:38 +00:00
Tom Lane 0a87394956 Fix the torn-page hazard for PITR base backups by forcing full page writes
to occur between pg_start_backup() and pg_stop_backup(), even if the GUC
setting full_page_writes is OFF.  Per discussion, doing this in combination
with the already-existing checkpoint during pg_start_backup() should ensure
safety against partial page updates being included in the backup.  We do
not have to force full page writes to occur during normal PITR operation,
as I had first feared.
2006-04-17 18:55:05 +00:00
Tom Lane defe93463c Make the world safe for full_page_writes. Allow XLOG records that try to
update no-longer-existing pages to fall through as no-ops, but make a note
of each page number referenced by such records.  If we don't see a later
XLOG entry dropping the table or truncating away the page, complain at
the end of XLOG replay.  Since this fixes the known failure mode for
full_page_writes = off, revert my previous band-aid patch that disabled
that GUC variable.
2006-04-14 20:27:24 +00:00
Tom Lane 49a7610c36 Fix an ancient oversight in btree xlog replay. When trying to determine if an
upper-level insertion completes a previously-seen split, we cannot simply grab
the downlink block number out of the buffer, because the buffer could contain
a later state of the page --- or perhaps the page doesn't even exist at all
any more, due to relation truncation.  These possibilities have been masked up
to now because the use of full_page_writes effectively ensured that no xlog
replay routine ever actually saw a page state newer than its own change.
Since we're deprecating full_page_writes in 8.1.*, there's no need to fix this
in existing release branches, but we need a fix in HEAD if we want to have any
hope of re-allowing full_page_writes.  Accordingly, adjust the contents of
btree WAL records so that we can always get the downlink block number from the
WAL record rather than having to depend on buffer contents.  Per report from
Kevin Grittner and Peter Brant.

Improve a few comments in related code while at it.
2006-04-13 03:53:05 +00:00
Tom Lane 7fdb4305db Fix a bunch of problems with domains by making them use special input functions
that apply the necessary domain constraint checks immediately.  This fixes
cases where domain constraints went unchecked for statement parameters,
PL function local variables and results, etc.  We can also eliminate existing
special cases for domains in places that had gotten it right, eg COPY.

Also, allow domains over domains (base of a domain is another domain type).
This almost worked before, but was disallowed because the original patch
hadn't gotten it quite right.
2006-04-05 22:11:58 +00:00
Tom Lane 09b5271ebd Add a field to the first page of each WAL file to indicate the
XLOG_BLCKSZ.  This ought to help in preventing configuration mismatch
problems if anyone tries to ship PITR files between servers compiled
with different XLOG_BLCKSZ settings.  Simon Riggs
2006-04-05 03:34:05 +00:00
Tom Lane e6140d9052 Don't use BLCKSZ for the physical length of the pg_control file, but
instead a dedicated symbol.  This probably makes no functional difference
for likely values of BLCKSZ, but it makes the intent clearer.
Simon Riggs, minor editorialization by Tom Lane.
2006-04-04 22:39:59 +00:00
Tom Lane 147d4bf3e5 Modify all callers of datatype input and receive functions so that if these
functions are not strict, they will be called (passing a NULL first parameter)
during any attempt to input a NULL value of their datatype.  Currently, all
our input functions are strict and so this commit does not change any
behavior.  However, this will make it possible to build domain input functions
that centralize checking of domain constraints, thereby closing numerous holes
in our domain support, as per previous discussion.

While at it, I took the opportunity to introduce convenience functions
InputFunctionCall, OutputFunctionCall, etc to use in code that calls I/O
functions.  This eliminates a lot of grotty-looking casts, but the main
motivation is to make it easier to grep for these places if we ever need
to touch them again.
2006-04-04 19:35:37 +00:00
Tom Lane eaef111396 Define a separately configurable XLOG_BLCKSZ symbol for the page size
used within WAL files.  Historically this was the same as the data file
BLCKSZ, but there's no necessary connection, and it's possible that
performance gains might ensue from reducing XLOG_BLCKSZ.  In any case
distinguishing two symbols should improve code clarity.  This commit
does not actually change the page size, only provide the infrastructure
to make it possible to do so.  initdb forced because of addition of a
field to pg_control.
Mark Wong, with some help from Simon Riggs and Tom Lane.
2006-04-03 23:35:05 +00:00
Tom Lane c9a2b6d4ca Fix thinko in gistRedoPageUpdateRecord: if XLR_BKP_BLOCK_1 is set, we
don't have anything to do to the page, but we still have to adjust the
incomplete_inserts list that we're maintaining in memory.
2006-04-03 16:45:50 +00:00
Teodor Sigaev 8d02b15e33 Eliminate ajust scan code. Since concurrent GiST it doesn't
do real work. That was missed during concurrence development.
2006-04-03 13:44:33 +00:00
Tom Lane 89bda95d82 Remove the 'slow' path for btree index build, which built the btree
incrementally by successive inserts rather than by sorting the data.
We were only using the slow path during bootstrap, apparently because
when first written it failed during bootstrap --- but it works fine now
AFAICT.  Removing it saves a hundred or so lines of code and produces
noticeably (~10%) smaller initial states of the system catalog indexes.
While that won't make much difference for heavily-modified catalogs,
for the more static ones there may be a useful long-term performance
improvement.
2006-04-01 03:03:37 +00:00
Tom Lane a8b8f4db23 Clean up WAL/buffer interactions as per my recent proposal. Get rid of the
misleadingly-named WriteBuffer routine, and instead require routines that
change buffer pages to call MarkBufferDirty (which does exactly what it says).
We also require that they do so before calling XLogInsert; this takes care of
the synchronization requirement documented in SyncOneBuffer.  Note that
because bufmgr takes the buffer content lock (in shared mode) while writing
out any buffer, it doesn't matter whether MarkBufferDirty is executed before
the buffer content change is complete, so long as the content change is
completed before releasing exclusive lock on the buffer.  So it's OK to set
the dirtybit before we fill in the LSN.
This eliminates the former kluge of needing to set the dirtybit in LockBuffer.
Aside from making the code more transparent, we can also add some new
debugging assertions, in particular that the caller of MarkBufferDirty must
hold the buffer content lock, not merely a pin.
2006-03-31 23:32:07 +00:00
Tom Lane 89395bfa6f Improve gist XLOG code to follow the coding rules needed to prevent
torn-page problems.  This introduces some issues of its own, mainly
that there are now some critical sections of unreasonably broad scope,
but it's a step forward anyway.  Further cleanup will require some
code refactoring that I'd prefer to get Oleg and Teodor involved in.
2006-03-30 23:03:10 +00:00
Tom Lane 6d61cdec07 Clean up and document the API for XLogOpenRelation and XLogReadBuffer.
This commit doesn't make much functional change, but it does eliminate some
duplicated code --- for instance, PageIsNew tests are now done inside
XLogReadBuffer rather than by each caller.
The GIST xlog code still needs a lot of love, but I'll worry about that
separately.
2006-03-29 21:17:39 +00:00
Tom Lane 0a971e2f20 Disable full_page_writes, because turning it off risks causing crash-recovery
failures even when the hardware and OS did nothing wrong.  Per recent analysis
of a problem report from Alex Bahdushka.

For the moment I've just diked out the test of the parameter, rather than
removing the GUC infrastructure and documentation, in case we conclude that
there's something salvageable there.  There seems no chance of it being
resurrected in the 8.1 branch though.
2006-03-28 22:01:16 +00:00
Tom Lane 288551fc60 Repair longstanding error in btree xlog replay: XLogReadBuffer should be
passed extend = true whenever we are reading a page we intend to reinitialize
completely, even if we think the page "should exist".  This is because it
might indeed not exist, if the relation got truncated sometime after the
current xlog record was made and before the crash we're trying to recover
from.  These two thinkos appear to explain both of the old bug reports
discussed here:
http://archives.postgresql.org/pgsql-hackers/2005-05/msg01369.php
2006-03-28 21:17:23 +00:00
Tom Lane 0a20207060 Arrange to emit a description of the current XLOG record as error context
when an error occurs during xlog replay.  Also, replace the former risky
'write into a fixed-size buffer with no overflow detection' API for XLOG
record description routines; use an expansible StringInfo instead.  (The
latter accounts for most of the patch bulk.)

Qingqing Zhou
2006-03-24 04:32:13 +00:00
Tom Lane 2316013961 Clean up representation of function RTEs for functions returning RECORD.
The original coding stored the raw parser output (ColumnDef and TypeName
nodes) which was ugly, bulky, and wrong because it failed to create any
dependency on the referenced datatype --- and in fact would not track type
renamings and suchlike.  Instead store a list of column type OIDs in the
RTE.

Also fix up general failure of recordDependencyOnExpr to do anything sane
about recording dependencies on datatypes.  While there are many cases where
there will be an indirect dependency (eg if an operator returns a datatype,
the dependency on the operator is enough), we do have to record the datatype
as a separate dependency in examples like CoerceToDomain.

initdb forced because of change of stored rules.
2006-03-16 00:31:55 +00:00
Tom Lane 20ab467d76 Improve parser so that we can show an error cursor position for errors
during parse analysis, not only errors detected in the flex/bison stages.
This is per my earlier proposal.  This commit includes all the basic
infrastructure, but locations are only tracked and reported for errors
involving column references, function calls, and operators.  More could
be done later but this seems like a good set to start with.  I've also
moved the ReportSyntaxErrorPosition logic out of psql and into libpq,
which should make it available to more people --- even within psql this
is an improvement because warnings weren't handled by ReportSyntaxErrorPosition.
2006-03-14 22:48:25 +00:00
Tom Lane 9f6192490e Add a CHECK_FOR_INTERRUPTS() in _bt_buildadd(). This fixes problem
with not responding to query cancel during the last stage of btree index
creation.
2006-03-10 20:18:15 +00:00
Bruce Momjian f2f5b05655 Update copyright for 2006. Update scripts. 2006-03-05 15:59:11 +00:00
Neil Conway 8e5a10d46c This patch makes the error message strings throughout the backend
more compliant with the error message style guide. In particular,
errdetail should begin with a capital letter and end with a period,
whereas errmsg should not. I also fixed a few related issues in
passing, such as fixing the repeated misspelling of "lexeme" in
contrib/tsearch2 (per Tom's suggestion).
2006-03-01 06:30:32 +00:00
Neil Conway 737651f6be Cleanup the usage of ScanDirection: use the symbolic names for the
possible ScanDirection alternatives rather than magic numbers
(-1, 0, 1).  Also, use the ScanDirection macros in a few places
rather than directly checking whether `dir == ForwardScanDirection'
and the like. Per patch from James William Pye. His patch also
changed ScanDirection to be a "char" rather than an enum, which
I haven't applied.
2006-02-21 23:01:54 +00:00
Tom Lane 2d7f694729 Move btbulkdelete's vacuum_delay_point() call to a place in the loop where
we are not holding a buffer content lock; where it was, InterruptHoldoffCount
is positive and so we'd not respond to cancel signals as intended.  Also
add missing vacuum_delay_point() call in btvacuumcleanup.  This should fix
complaint from Evgeny Gridasov about failure to respond to SIGINT/SIGTERM
in a timely fashion (bug #2257).
2006-02-14 17:20:01 +00:00
Tom Lane 49758f4703 Add some missing vacuum_delay_point calls in GIST vacuuming. 2006-02-14 16:39:32 +00:00
Tom Lane d52a57fc30 Actually there's a better way to do this, which is to count tuples
during the vacuumcleanup scan that we're going to do anyway.  Should
save a few cycles (one calculation per page, not per tuple) as well
as not having to depend on assumptions about heap and index being
in step.
I think this could probably be made to work for GIST too, but that
code looks messy enough that I'm disinclined to try right now.
2006-02-12 00:18:17 +00:00
Tom Lane fd267c1ebc Skip ambulkdelete scan if there's nothing to delete and the index is not
partial.  None of the existing AMs do anything useful except counting
tuples when there's nothing to delete, and we can get a tuple count
from the heap as long as it's not a partial index.  (hash actually can
skip anyway because it maintains a tuple count in the index metapage.)
GIST is not currently able to exploit this optimization because, due to
failure to index NULLs, GIST is always effectively partial.  Possibly
we should fix that sometime.
Simon Riggs w/ some review by Tom Lane.
2006-02-11 23:31:34 +00:00
Bruce Momjian 77bb65d3fc Revert based on Tom's recommendation:
> Allow VACUUM to complete faster by avoiding scanning the indexes when no
> rows were removed from the heap by the VACUUM.
2006-02-11 17:14:09 +00:00
Bruce Momjian bf324946b3 Allow VACUUM to complete faster by avoiding scanning the indexes when no
rows were removed from the heap by the VACUUM.

Simon Riggs
2006-02-11 16:59:09 +00:00
Tom Lane 5997386a0a Remove the no-longer-useful HashItem/HashItemData level of structure.
Same motivation as for BTItem.
2006-01-25 23:26:11 +00:00
Tom Lane c389760c32 Remove the no-longer-useful BTItem/BTItemData level of structure, and
just refer to btree index entries as plain IndexTuples, which is what
they have been for a very long time.  This is mostly just an exercise
in removing extraneous notation, but it does save a palloc/pfree cycle
per index insertion.
2006-01-25 23:04:21 +00:00
Tom Lane 3a0a16cb7e Allow row comparisons to be used as indexscan qualifications.
This completes the project to upgrade our handling of row comparisons.
2006-01-25 20:29:24 +00:00
Tom Lane 7ccaf13a06 Instead of using a numberOfRequiredKeys count to distinguish required
and non-required keys in a btree index scan, mark the required scankeys
with private flag bits SK_BT_REQFWD and/or SK_BT_REQBKWD.  This seems
at least marginally clearer to me, and it eliminates a wired-into-the-
data-structure assumption that required keys are consecutive.  Even though
that assumption will remain true for the foreseeable future, having it
in there makes the code seem more complex than necessary.
2006-01-23 22:31:41 +00:00
Tom Lane c89a0dd3bb Repair longstanding bug in slru/clog logic: it is possible for two backends
to try to create a log segment file concurrently, but the code erroneously
specified O_EXCL to open(), resulting in a needless failure.  Before 7.4,
it was even a PANIC condition :-(.  Correct code is actually simpler than
what we had, because we can just say O_CREAT to start with and not need a
second open() call.  I believe this accounts for several recent reports of
hard-to-reproduce "could not create file ...: File exists" errors in both
pg_clog and pg_subtrans.
2006-01-21 04:38:21 +00:00
Tom Lane 73e3566078 Improve comments about btree's use of ScanKey data structures: there
are two basically different kinds of scankeys, and we ought to try harder
to indicate which is used in each place in the code.  I've chosen the names
"search scankey" and "insertion scankey", though you could make about
as good an argument for "operator scankey" and "comparison function
scankey".
2006-01-17 00:09:01 +00:00
Tom Lane f7ea931287 Some minor code cleanup, falling out from the removal of rtree. SK_NEGATE
isn't being used anywhere anymore, and there seems no point in a generic
index_keytest() routine when two out of three remaining access methods
aren't using it.  Also, add a comment documenting a convention for
letting access methods define private flag bits in ScanKey sk_flags.
There are no such flags at the moment but I'm thinking about changing
btree's handling of "required keys" to use flag bits in the keys
rather than a count of required key positions.  Also, if some AM did
still want SK_NEGATE then it would be reasonable to treat it as a private
flag bit.
2006-01-14 22:03:35 +00:00
Neil Conway fb627b76cc Cosmetic code cleanup: fix a bunch of places that used "return (expr);"
rather than "return expr;" -- the latter style is used in most of the
tree. I kept the parentheses when they were necessary or useful because
the return expression was complex.
2006-01-11 08:43:13 +00:00
Tom Lane afa8f1971a Add RelationOpenSmgr() calls to ensure rd_smgr is valid when we try to
use it.  While it normally has been opened earlier during btree index
build, testing shows that it's possible for the link to be closed again
if an sinval reset occurs while the index is being built.
2006-01-07 22:45:41 +00:00
Tom Lane 2d0475e480 Convert Assert checking for empty page into a regular test and elog.
The consequences of overwriting a non-empty page are bad enough that
we should not omit this test in production builds.
2006-01-06 00:15:50 +00:00
Peter Eisentraut 86c23a6eb2 Make all command-line options of postmaster and postgres the same. See
http://archives.postgresql.org/pgsql-hackers/2006-01/msg00151.php for the
complete plan.
2006-01-05 10:07:46 +00:00
Tom Lane 195f164228 Get rid of the SpinLockAcquire/SpinLockAcquire_NoHoldoff distinction
in favor of having just one set of macros that don't do HOLD/RESUME_INTERRUPTS
(hence, these correspond to the old SpinLockAcquire_NoHoldoff case).
Given our coding rules for spinlock use, there is no reason to allow
CHECK_FOR_INTERRUPTS to be done while holding a spinlock, and also there
is no situation where ImmediateInterruptOK will be true while holding a
spinlock.  Therefore doing HOLD/RESUME_INTERRUPTS while taking/releasing a
spinlock is just a waste of cycles.  Qingqing Zhou and Tom Lane.
2005-12-29 18:08:05 +00:00
Tom Lane ab51bbaa06 Arrange to set the LC_XXX environment variables to match our locale
setup.  This protects against undesired changes in locale behavior
if someone carelessly does setlocale(LC_ALL, "") (and we know who
you are, perl guys).
2005-12-28 23:22:51 +00:00
Bruce Momjian 261114a23f I have added these macros to c.h:
#define HIGHBIT                 (0x80)
        #define IS_HIGHBIT_SET(ch)      ((unsigned char)(ch) & HIGHBIT)

and removed CSIGNBIT and mapped it uses to HIGHBIT.  I have also added
uses for IS_HIGHBIT_SET where appropriate.  This change is
purely for code clarity.
2005-12-25 02:14:19 +00:00
Tom Lane 656beff590 Adjust string comparison so that only bitwise-equal strings are considered
equal: if strcoll claims two strings are equal, check it with strcmp, and
sort according to strcmp if not identical.  This fixes inconsistent
behavior under glibc's hu_HU locale, and probably under some other locales
as well.  Also, take advantage of the now-well-defined behavior to speed up
texteq, textne, bpchareq, bpcharne: they may as well just do a bitwise
comparison and not bother with strcoll at all.

NOTE: affected databases may need to REINDEX indexes on text columns to be
sure they are self-consistent.
2005-12-22 22:50:00 +00:00
Tom Lane ec0baf949e Divide the lock manager's shared state into 'partitions', so as to
reduce contention for the former single LockMgrLock.  Per my recent
proposal.  I set it up for 16 partitions, but on a pgbench test this
gives only a marginal further improvement over 4 partitions --- we need
to test more scenarios to choose the number of partitions.
2005-12-11 21:02:18 +00:00
Tom Lane cefcbbf1fd Push the responsibility for handling ignore_killed_tuples down into
_bt_checkkeys(), instead of checking it in the top-level nbtree.c routines
as formerly.  This saves a little bit of loop overhead, but more importantly
it lets us skip performing the index key comparisons for dead tuples.
2005-12-07 19:37:53 +00:00
Tom Lane f1b059af12 A couple of tiny performance hacks in _bt_step(). Remove PageIsEmpty
checks, which were once needed because PageGetMaxOffsetNumber would
fail on empty pages, but are now just redundant.  Also, don't set up
local variables that aren't needed in the fast path --- most of the
time, we only need to advance offnum and not step across a page boundary.
Motivated by noticing _bt_step at the top of OProfile profile for a
pgbench run.
2005-12-07 18:03:48 +00:00
Tom Lane 887a7c61f6 Get rid of slru.c's hardwired insistence on a fixed number of slots per
SLRU area.  The number of slots is still a compile-time constant (someday
we might want to change that), but at least it's a different constant for
each SLRU area.  Increase number of subtrans buffers to 32 based on
experimentation with a heavily subtrans-bashing test case, and increase
number of multixact member buffers to 16, since it's obviously silly for
it not to be at least twice the number of multixact offset buffers.
2005-12-06 23:08:34 +00:00
Tom Lane a615acf555 Arrange for read-only accesses to SLRU page buffers to take only a shared
lock, not exclusive, if the desired page is already in memory.  This can
be demonstrated to be a significant win on the pg_subtrans cache when there
is a large window of open transactions.  It should be useful for pg_clog
as well.  I didn't try to make GetMultiXactIdMembers() use the code, as
that would have taken some restructuring, and what with the local cache
for multixact contents it probably wouldn't really make a difference.
Per my recent proposal.
2005-12-06 18:10:06 +00:00
Tom Lane a98871b7ac Tweak indexscan machinery to avoid taking an AccessShareLock on an index
if we already have a stronger lock due to the index's table being the
update target table of the query.  Same optimization I applied earlier
at the table level.  There doesn't seem to be much interest in the more
radical idea of not locking indexes at all, so do what we can ...
2005-12-03 05:51:03 +00:00
Tom Lane 4c4eb57154 Some marginal additional hacking to shave a few more cycles off
heapgettup.
2005-11-26 05:03:06 +00:00
Tom Lane 70f1482de3 Change seqscan logic so that we check visibility of all tuples on a page
when we first read the page, rather than checking them one at a time.
This allows us to take and release the buffer content lock just once
per page, instead of once per tuple.  Since it's a shared lock the
contention penalty for holding the lock longer shouldn't be too bad.
We can safely do this only when using an MVCC snapshot; else the
assumption that visibility won't change over time is uncool.  Therefore
there are now two code paths depending on the snapshot type.  I also
made the same change in nodeBitmapHeapscan.c, where it can be done always
because we only support MVCC snapshots for bitmap scans anyway.
Also make some incidental cleanups in the APIs of these functions.
Per a suggestion from Qingqing Zhou.
2005-11-26 03:03:07 +00:00
Bruce Momjian 436a2956d8 Re-run pgindent, fixing a problem where comment lines after a blank
comment line where output as too long, and update typedefs for /lib
directory.  Also fix case where identifiers were used as variable names
in the backend, but as typedefs in ecpg (favor the backend for
indenting).

Backpatch to 8.1.X.
2005-11-22 18:17:34 +00:00
Tom Lane dd218ae7b0 Remove the t_datamcxt field of HeapTupleData. This was introduced for
the convenience of tuptoaster.c and is no longer needed, so may as well
get rid of some small amount of overhead.
2005-11-20 19:49:08 +00:00
Tom Lane 40314f2dac Modify tuptoaster's API so that it does not try to modify the passed
tuple in-place, but instead passes back an all-new tuple structure if
any changes are needed.  This is a much cleaner and more robust solution
for the bug discovered by Alexey Beschiokov; accordingly, revert the
quick hack I installed yesterday.
With this change, HeapTupleData.t_datamcxt is no longer needed; will
remove it in a separate commit in HEAD only.
2005-11-20 18:38:20 +00:00
Tom Lane 2a8d3d83ef R-tree is dead ... long live GiST. 2005-11-07 17:36:47 +00:00
Tom Lane 6236991143 Add simple sanity checks on newly-read pages to GiST, too. 2005-11-06 22:39:21 +00:00
Tom Lane 766dc45d9f Add defenses to btree and hash index AMs to do simple sanity checks
on every index page they read; in particular to catch the case of an
all-zero page, which PageHeaderIsValid allows to pass.  It turns out
hash already had this idea, but it was just Assert()ing things rather
than doing a straight error check, and the Asserts were partially
redundant with PageHeaderIsValid anyway.  Per recent failure example
from Jim Nasby.  (gist still needs the same treatment.)
2005-11-06 19:29:01 +00:00
Tom Lane 18691d8ee3 Clean up representation of SLRU page state. This is the cleaner fix
for the SLRU race condition that I posted a few days ago, but we decided
not to use in 8.1 and older branches.
2005-11-05 21:19:47 +00:00
Alvaro Herrera 902377c465 Rename the members of CommandDest enum so they don't collide with other uses of
those names.  (Debug and None were pretty bad names anyway.)  I hope I catched
all uses of the names in comments too.
2005-11-03 17:11:40 +00:00
Tom Lane 99d48695d4 Fix longstanding race condition in transaction log management: there was a
very narrow window in which SimpleLruReadPage or SimpleLruWritePage could
think that I/O was needed when it wasn't (and indeed the buffer had already
been assigned to another page).  This would result in an Assert failure if
Asserts were enabled, and probably in silent data corruption if not.
Reported independently by Jim Nasby and Robert Creager.

I intend a more extensive fix when 8.2 development starts, but this is a
reasonably low-impact patch for the existing branches.
2005-11-03 00:23:36 +00:00
Peter Eisentraut 07bb9f086b Message corrections 2005-10-29 00:31:52 +00:00
Tom Lane a037926295 Reorder code so that we don't have to hold a critical section while
reserving SLRU space for a new MultiXact.  The original coding would have
treated out-of-disk-space as a PANIC condition, which is unnecessary.
2005-10-28 19:00:19 +00:00
Tom Lane 1986ca5ce5 Fix race condition in multixact code: it's possible to try to read a
multixact's starting offset before the offset has been stored into the
SLRU file.  A simple fix would be to hold the MultiXactGenLock until the
offset has been stored, but that looks like a big concurrency hit.  Instead
rely on knowledge that unset offsets will be zero, and loop when we see
a zero.  This requires a little extra hacking to ensure that zero is never
a valid value for the offset.  Problem reported by Matteo Beccati, fix
ideas from Martijn van Oosterhout, Alvaro Herrera, and Tom Lane.
2005-10-28 17:27:29 +00:00
Tom Lane 6d6c3722fb Make code for selecting default WAL sync method less confusing. 2005-10-22 20:27:17 +00:00
Tom Lane d9cb48786e Better solution to the problem of labeling whole-row Datums that are
generated from subquery outputs: use the type info stored in the Var
itself.  To avoid making ExecEvalVar and slot_getattr more complex
and slower, I split out the whole-row case into a separate ExecEval routine.
2005-10-19 22:30:30 +00:00
Tom Lane 07908c9c37 Ensure that the Datum generated from a whole-row Var contains valid
type ID information even when it's a record type.  This is needed to
handle whole-row Vars referencing subquery outputs.  Per example from
Richard Huxton.
2005-10-19 18:18:33 +00:00
Tom Lane 23836fb1fb A few trivial code cleanups motivated by reading warnings generated
by a recent HP C compiler.  Mostly, get rid of useless local variables
that are assigned to but never used.
2005-10-18 01:06:24 +00:00
Bruce Momjian 1dc3498251 Standard pgindent run for 8.1. 2005-10-15 02:49:52 +00:00
Bruce Momjian 1d028537a2 This makes the error messages for PREPARE TRANSACTION, COMMIT PREPARED
etc. match the docs, which talk about "transaction identifier" not
"gid" or "global transaction identifier".

Steve Woodcock
2005-10-13 22:55:55 +00:00
Bruce Momjian 84cc9a4bb3 Back out this because of fear of changing error strings:
This makes the error messages for PREPARE TRANSACTION, COMMIT PREPARED
etc. match the docs, which talk about "transaction identifier" not
"gid" or "global transaction identifier".

Steve Woodcock
2005-10-13 17:57:57 +00:00
Bruce Momjian 90c22c9206 This makes the error messages for PREPARE TRANSACTION, COMMIT PREPARED
etc. match the docs, which talk about "transaction identifier" not
"gid" or "global transaction identifier".

Steve Woodcock
2005-10-13 17:57:17 +00:00
Tom Lane e952ae1268 Fix longstanding bug found by Atsushi Ogawa: _bt_check_unique would mark
the wrong buffer dirty when trying to kill a dead index entry that's on
a page after the one it started on.  No risk of data corruption, just
inefficiency, but still a bug.
2005-10-12 17:18:03 +00:00
Tom Lane cb8b6618ce Revise pgstats stuff to fix the problems with not counting accesses
generated by bitmap index scans.  Along the way, simplify and speed up
the code for counting sequential and index scans; it was both confusing
and inefficient to be taking care of that in the per-tuple loops, IMHO.
initdb forced because of internal changes in pg_stat view definitions.
2005-10-06 02:29:23 +00:00
Tom Lane 64eea6c21d Expand pg_control information so that we can verify that the database
was created on a machine with alignment rules and floating-point format
similar to the current machine.  Per recent discussion, this seems like
a good idea with the increasing prevalence of 32/64 bit environments.
2005-10-03 00:28:43 +00:00
Tom Lane 303e089df5 Clean up possibly-uninitialized-variable warnings reported by gcc 4.x. 2005-09-24 22:54:44 +00:00
Bruce Momjian b3364fc81b pgindent new GIST index code, per request from Tom. 2005-09-22 20:44:36 +00:00
Tom Lane 08817bdb76 Adjust GiST error messages to conform to message style guidelines. 2005-09-22 18:49:45 +00:00
Teodor Sigaev f4516f8732 Small fixes 2005-09-16 14:40:54 +00:00
Neil Conway 1dd9b09332 Copy-editing for GiST README. 2005-09-15 17:44:27 +00:00
Teodor Sigaev 79fae4a764 Readme about GiST's algorithms 2005-09-15 16:39:15 +00:00
Tom Lane 35e9b1cc1e Clean up a couple of ad-hoc computations of the maximum number of tuples
on a page, as suggested by ITAGAKI Takahiro.  Also, change a few places
that were using some other estimates of max-items-per-page to consistently
use MaxOffsetNumber.  This is conservatively large --- we could have used
the new MaxHeapTuplesPerPage macro, or a similar one for index tuples ---
but those places are simply declaring a fixed-size buffer and assuming it
will work, rather than actively testing for overrun.  It seems safer to
size these buffers in a way that can't overflow even if the page is
corrupt.
2005-09-02 19:02:20 +00:00
Tom Lane 037709e0b3 Reduce default value of max_prepared_transactions from 50 to 5. This
saves nearly 700kB in the default shared memory segment size, which seems
worthwhile, and it is a feature that many users won't use anyway.  Per
Heikki's argument, there is no point in a compromise value --- those who
are using 2PC at all will probably want it at least equal to max_connections.
But we can't set it to zero by default without breaking the prepared_xacts
regression test.
2005-08-29 21:38:18 +00:00
Tom Lane 9052537325 Rewrite gather-write patch into something less obviously bolted on
after the fact.  Fix bug with incorrect test for whether we are at end
of logfile segment.  Arrange for writes triggered by XLogInsert's
is-cache-more-than-half-full test to synchronize with the cache boundaries,
so that in long transactions we tend to write alternating halves of the
cache rather than randomly chosen portions of it; this saves one more
write syscall per cache load.
2005-08-22 23:59:04 +00:00
Bruce Momjian 8ad3965a11 Improve xid wraparound message (the server isn't really shut down, just
not accepting queries).

         errmsg("database is not accepting queries to avoid
	 wraparound data loss in database \"%s\"",
         errhint("Stop the postmaster and use a standalone
	 backend to VACUUM database \"%s\".",
2005-08-22 16:59:47 +00:00
Tom Lane d0096a41fa Fix some inconsistent choices of datatypes in xlog.c. Make buffer
indexes all be int, rather than variously int, uint16 and uint32;
add some casts where necessary to support large buffer arrays.
2005-08-22 00:41:28 +00:00
Tom Lane f39f6b500f Seems that the childXids list would be better based on Oid lists than
integer lists.
2005-08-20 23:45:08 +00:00
Tom Lane 0007490e09 Convert the arithmetic for shared memory size calculation from 'int'
to 'Size' (that is, size_t), and install overflow detection checks in it.
This allows us to remove the former arbitrary restrictions on NBuffers
etc.  It won't make any difference in a 32-bit machine, but in a 64-bit
machine you could theoretically have terabytes of shared buffers.
(How efficiently we could manage 'em remains to be seen.)  Similarly,
num_temp_buffers, work_mem, and maintenance_work_mem can be set above
2Gb on a 64-bit machine.  Original patch from Koichi Suzuki, additional
work by moi.
2005-08-20 23:26:37 +00:00
Tatsuo Ishii ba2fc7eb4b Make GetMultiXactIdMembers() a public function. 2005-08-20 01:29:27 +00:00
Tom Lane f57e3f4cf3 Repair problems with VACUUM destroying t_ctid chains too soon, and with
insufficient paranoia in code that follows t_ctid links.  (We must do both
because even with VACUUM doing it properly, the intermediate state with
a dangling t_ctid link is visible concurrently during lazy VACUUM, and
could be seen afterwards if either type of VACUUM crashes partway through.)
Also try to improve documentation about what's going on.  Patch is a bit
bulky because passing the XMAX information around required changing the
APIs of some low-level heapam.c routines, but it's not conceptually very
complicated.  Per trouble report from Teodor and subsequent analysis.
This needs to be back-patched, but I'll do that after 8.1 beta is out.
2005-08-20 00:40:32 +00:00
Tom Lane f8d0a82bf9 Avoid an Assert failure if OuterUserId hasn't been set yet during
AbortTransaction.  This can happen if a backend's InitPostgres transaction
fails (eg, because the given username is invalid).  Per Alvaro.
2005-08-17 22:14:34 +00:00
Tom Lane 6ea05c16a4 Change a couple of "can't happen" error messages to be a shade more
verbose when they do happen.  The "left link changed unexpectedly"
one in particular has been seen more than once in the field.
2005-08-12 14:34:14 +00:00
Tom Lane 721e53785d Solve the problem of OID collisions by probing for duplicate OIDs
whenever we generate a new OID.  This prevents occasional duplicate-OID
errors that can otherwise occur once the OID counter has wrapped around.
Duplicate relfilenode values are also checked for when creating new
physical files.  Per my recent proposal.
2005-08-12 01:36:05 +00:00
Tom Lane d90c531188 Autovacuum loose end mop-up. Provide autovacuum-specific vacuum cost
delay and limit, both as global GUCs and as table-specific entries in
pg_autovacuum.  stats_reset_on_server_start is now OFF by default,
but a reset is forced if we did WAL replay.  XID-wrap vacuums do not
ANALYZE, but do FREEZE if it's a template database.  Alvaro Herrera
2005-08-11 21:11:50 +00:00
Bruce Momjian 949ebbd55e Mention MD5 function index for indexing long values. 2005-08-11 13:22:33 +00:00
Tom Lane 24ff62d76f Make new hints follow style guide. 2005-08-10 22:39:00 +00:00
Bruce Momjian 237be3cc29 Add hints to cases where indexes fail because of values that are too long. 2005-08-10 21:36:46 +00:00
Tom Lane 4568e0f791 Modify AtEOXact_CatCache and AtEOXact_RelationCache to assume that the
ResourceOwner mechanism already released all reference counts for the
cache entries; therefore, we do not need to scan the catcache or relcache
at transaction end, unless we want to do it as a debugging crosscheck.
Do the crosscheck only in Assert mode.  This is the same logic we had
previously installed in AtEOXact_Buffers to avoid overhead with large
numbers of shared buffers.  I thought it'd be a good idea to do it here
too, in view of Kari Lavikka's recent report showing a real-world case
where AtEOXact_CatCache is taking a significant fraction of runtime.
2005-08-08 19:17:23 +00:00
Tom Lane 0001e98d54 Code and docs review for pg_column_size() patch. 2005-08-02 16:11:57 +00:00
Tom Lane 2a4fad1a0e Add NOWAIT option to SELECT FOR UPDATE/SHARE.
Original patch by Hans-Juergen Schoenig, revisions by Karel Zak
and Tom Lane.
2005-08-01 20:31:16 +00:00
Tom Lane d42cf5a42a Add per-user and per-database connection limit options.
This patch also includes preliminary update of pg_dumpall for roles.
Petr Jelinek, with review by Bruce Momjian and Tom Lane.
2005-07-31 17:19:22 +00:00
Bruce Momjian 5b0bfec414 Fix compile for no O_SYNC, but introduced with O_DIRECT. 2005-07-30 14:15:44 +00:00
Tom Lane 5d5f1a79e6 Clean up a number of autovacuum loose ends. Make the stats collector
track shared relations in a separate hashtable, so that operations done
from different databases are counted correctly.  Add proper support for
anti-XID-wraparound vacuuming, even in databases that are never connected
to and so have no stats entries.  Miscellaneous other bug fixes.
Alvaro Herrera, some additional fixes by Tom Lane.
2005-07-29 19:30:09 +00:00
Bruce Momjian c6b1724c67 Update O_DIRECT comment. 2005-07-29 03:25:53 +00:00
Bruce Momjian c34bb00581 Use O_DIRECT if available when using O_SYNC for wal_sync_method.
Also, write multiple WAL buffers out in one write() operation.

ITAGAKI Takahiro

---------------------------------------------------------------------------

> If we disable writeback-cache and use open_sync, the per-page writing
> behavior in WAL module will show up as bad result. O_DIRECT is similar
> to O_DSYNC (at least on linux), so that the benefit of it will disappear
> behind the slow disk revolution.
>
> In the current source, WAL is written as:
>     for (i = 0; i < N; i++) { write(&buffers[i], BLCKSZ); }
> Is this intentional? Can we rewrite it as follows?
>    write(&buffers[0], N * BLCKSZ);
>
> In order to achieve it, I wrote a 'gather-write' patch (xlog.gw.diff).
> Aside from this, I'll also send the fixed direct io patch (xlog.dio.diff).
> These two patches are independent, so they can be applied either or both.
>
>
> I tested them on my machine and the results as follows. It shows that
> direct-io and gather-write is the best choice when writeback-cache is off.
> Are these two patches worth trying if they are used together?
>
>
>             | writeback | fsync= | fdata | open_ | fsync_ | open_
> patch       | cache     |  false |  sync |  sync | direct | direct
> ------------+-----------+--------+-------+-------+--------+---------
> direct io   | off       |  124.2 | 105.7 |  48.3 |   48.3 |  48.2
> direct io   | on        |  129.1 | 112.3 | 114.1 |  142.9 | 144.5
> gather-write| off       |  124.3 | 108.7 | 105.4 |  (N/A) | (N/A)
> both        | off       |  131.5 | 115.5 | 114.4 |  145.4 | 145.2
>
> - 20runs * pgbench -s 100 -c 50 -t 200
>    - with tuning (wal_buffers=64, commit_delay=500, checkpoint_segments=8)
> - using 2 ATA disks:
>    - hda(reiserfs) includes system and wal.
>    - hdc(jfs) includes database files. writeback-cache is always on.
>
> ---
> ITAGAKI Takahiro
2005-07-29 03:22:33 +00:00
Tom Lane e5d6b91220 Add SET ROLE. This is a partial commit of Stephen Frost's recent patch;
I'm still working on the has_role function and information_schema changes.
2005-07-25 22:12:34 +00:00
Bruce Momjian 9af9d674c6 Remove unintended code addition. 2005-07-23 15:31:16 +00:00
Bruce Momjian 4098c8867d Macro alignment cleanup. 2005-07-23 15:29:47 +00:00
Tom Lane f2bf2d2dc5 Fix a couple of bogus comments, per Alvaro. 2005-07-13 22:46:09 +00:00
Tom Lane d7207cfc6b Even though I'd like to see full_page_writes go away before 8.1,
a minimum requirement is that it not completely break the system
meanwhile.  Put the test in the right place.
2005-07-08 04:07:26 +00:00
Bruce Momjian a923602855 Add pg_column_size() to return storage size of a column, including
possible compression.

Mark Kirkwood
2005-07-06 19:02:54 +00:00
Bruce Momjian 326a7a0788 Add GUC full_page_writes to control writing full pages to WAL. 2005-07-05 23:18:10 +00:00
Tom Lane eb5949d190 Arrange for the postmaster (and standalone backends, initdb, etc) to
chdir into PGDATA and subsequently use relative paths instead of absolute
paths to access all files under PGDATA.  This seems to give a small
performance improvement, and it should make the system more robust
against naive DBAs doing things like moving a database directory that
has a live postmaster in it.  Per recent discussion.
2005-07-04 04:51:52 +00:00
Tom Lane e7e1694295 Migrate rtree_gist functionality into the core system, and add some
basic regression tests for GiST to the standard regression tests.
I took the opportunity to add an rtree-equivalent gist opclass for
circles; the contrib version only covered boxes and polygons, but
indexing circles is very handy for distance searches.
2005-07-01 19:19:05 +00:00
Teodor Sigaev 5350216156 Improve error messages and add comment 2005-07-01 13:18:17 +00:00
Teodor Sigaev 898a7bd13b Bug fixes for GiST crash recovery.
- add forgotten check of lsn for insert completion
- remove level of pages: hard to check in recovery
- some cleanups
2005-06-30 17:52:14 +00:00
Tom Lane 401de9c8be Improve the checkpoint signaling mechanism so that the bgwriter can tell
the difference between checkpoints forced due to WAL segment consumption
and checkpoints forced for other reasons (such as CREATE DATABASE).  Avoid
generating 'checkpoints are occurring too frequently' messages when the
checkpoint wasn't caused by WAL segment consumption.  Per gripe from
Chris K-L.
2005-06-30 00:00:52 +00:00
Tom Lane b5f7cff84f Clean up the rather historically encumbered interface to now() and
current time: provide a GetCurrentTimestamp() function that returns
current time in the form of a TimestampTz, instead of separate time_t
and microseconds fields.  This is what all the callers really want
anyway, and it eliminates low-level dependencies on AbsoluteTime,
which is a deprecated datatype that will have to disappear eventually.
2005-06-29 22:51:57 +00:00
Teodor Sigaev 4523e0b63a Cleanup, remove unneeded pallocs 2005-06-29 14:06:14 +00:00
Teodor Sigaev 88b49cdc95 Code cleanup. gistfillbuffer accepts InvalidOffsetNumber. 2005-06-28 15:51:00 +00:00
Tom Lane 7762619e95 Replace pg_shadow and pg_group by new role-capable catalogs pg_authid
and pg_auth_members.  There are still many loose ends to finish in this
patch (no documentation, no regression tests, no pg_dump support for
instance).  But I'm going to commit it now anyway so that Alvaro can
make some progress on shared dependencies.  The catalog changes should
be pretty much done.
2005-06-28 05:09:14 +00:00
Teodor Sigaev e8cab5fe49 Concurrency for GiST
- full concurrency for insert/update/select/vacuum:
        - select and vacuum never locks more than one page simultaneously
        - select (gettuple) hasn't any lock across it's calls
        - insert never locks more than two page simultaneously:
                - during search of leaf to insert it locks only one page
                  simultaneously
                - while walk upward to the root it locked only parent (may be
                  non-direct parent) and child. One of them X-lock, another may
                  be S- or X-lock
- 'vacuum full' locks index
- improve gistgetmulti
- simplify XLOG records

Fix bug in index_beginscan_internal: LockRelation may clean
  rd_aminfo structure, so move GET_REL_PROCEDURE after LockRelation
2005-06-27 12:45:23 +00:00
Tom Lane b90f8f20f0 Extend r-tree operator classes to handle Y-direction tests equivalent
to the existing X-direction tests.  An rtree class now includes 4 actual
2-D tests, 4 1-D X-direction tests, and 4 1-D Y-direction tests.
This involved adding four new Y-direction test operators for each of
box and polygon; I followed the PostGIS project's lead as to the names
of these operators.
NON BACKWARDS COMPATIBLE CHANGE: the poly_overleft (&<) and poly_overright
(&>) operators now have semantics comparable to box_overleft and box_overright.
This is necessary to make r-tree indexes work correctly on polygons.
Also, I changed circle_left and circle_right to agree with box_left and
box_right --- formerly they allowed the boundaries to touch.  This isn't
actually essential given the lack of any r-tree opclass for circles, but
it seems best to sync all the definitions while we are at it.
2005-06-24 20:53:34 +00:00
Tom Lane 9a09248edd Fix rtree and contrib/rtree_gist search behavior for the 1-D box and
polygon operators (<<, &<, >>, &>).  Per ideas originally put forward
by andrew@supernews and later rediscovered by moi.  This patch just
fixes the existing opclasses, and does not add any new behavior as I
proposed earlier; that can be sorted out later.  In principle this
could be back-patched, since it changes only search behavior and not
system catalog entries nor rtree index contents.  I'm not currently
planning to do that, though, since I think it could use more testing.
2005-06-24 00:18:52 +00:00
Tom Lane e98edb5555 Fix the mechanism for reporting the original table OID and column number
of columns of a query result so that it can "see through" cursors and
prepared statements.  Per gripe a couple months back from John DeSoi.
2005-06-22 17:45:46 +00:00
Tom Lane b95ae32b41 Avoid WAL-logging individual tuple insertions during CREATE TABLE AS
(a/k/a SELECT INTO).  Instead, flush and fsync the whole relation before
committing.  We do still need the WAL log when PITR is active, however.
Simon Riggs and Tom Lane.
2005-06-20 18:37:02 +00:00
Teodor Sigaev 1bfdd1a893 fix founded hole in recovery after crash, add vacuum_delay_point() 2005-06-20 15:22:38 +00:00
Teodor Sigaev d544ec8bbd 1. full functional WAL for GiST
2. improve vacuum for gist
   - use FSM
   - full vacuum:
      - reforms parent tuple if it's needed
        ( tuples was deleted on child page or parent tuple remains invalid
          after crash recovery )
      - truncate index file if possible
3. fixes bugs and mistakes
2005-06-20 10:29:37 +00:00
Tom Lane d961a56899 Avoid unnecessary palloc overhead in _bt_first(). The temporary
scankeys arrays that it needs can never have more than INDEX_MAX_KEYS
entries, so it's reasonable to just allocate them as fixed-size local
arrays, and save the cost of palloc/pfree.  Not a huge savings, but
a cycle saved is a cycle earned ...
2005-06-19 22:41:00 +00:00
Tom Lane fc654583ab Need #include <time.h> on some platforms. 2005-06-19 22:34:56 +00:00
Tom Lane 3f749924f8 Simplify uses of readdir() by creating a function ReadDir() that
includes error checking and an appropriate ereport(ERROR) message.
This gets rid of rather tedious and error-prone manipulation of errno,
as well as a Windows-specific bug workaround, at more than a dozen
call sites.  After an idea in a recent patch by Heikki Linnakangas.
2005-06-19 21:34:03 +00:00
Tom Lane e26b0abda3 Arrange to fsync two-phase-commit state files only during checkpoints;
given reasonably short lifespans for prepared transactions, this should
mean that only a small minority of state files ever need to be fsynced
at all.  Per discussion with Heikki Linnakangas.
2005-06-19 20:00:39 +00:00
Tom Lane a8d1075f27 Add a time-of-preparation column to the pg_prepared_xacts view, per an
old suggestion by Oliver Jowett.  Also, add a transaction column to the
pg_locks view to show the xid of each transaction holding or awaiting
locks; this allows prepared transactions to be properly associated with
the locks they own.  There was already a column named 'transaction',
and I chose to rename it to 'transactionid' --- since this column is
new in the current devel cycle there should be no backwards compatibility
issue to worry about.
2005-06-18 19:33:42 +00:00
Tom Lane 66b098492e Dept. of second thoughts: regular COMMIT deletes deletable files before
releasing locks, so COMMIT PREPARED should too.
2005-06-18 05:21:09 +00:00
Tom Lane d0a89683a3 Two-phase commit. Original patch by Heikki Linnakangas, with additional
hacking by Alvaro Herrera and Tom Lane.
2005-06-17 22:32:51 +00:00
Bruce Momjian f4d907ca85 Remove old *.backup files when we do pg_stop_backup(). This
prevents a large number of *.backup files from existing in pg_xlog/
2005-06-15 01:36:08 +00:00
Teodor Sigaev 37c839365c WAL for GiST. It work for online backup and so on, but on
recovery after crash (power loss etc) it may say that it can't restore
index and index should be reindexed.

Some refactoring code.
2005-06-14 11:45:14 +00:00
Tom Lane c186c93148 Change the planner to allow indexscan qualification clauses to use
nonconsecutive columns of a multicolumn index, as per discussion around
mid-May (pghackers thread "Best way to scan on-disk bitmaps").  This
turns out to require only minimal changes in btree, and so far as I can
see none at all in GiST.  btcostestimate did need some work, but its
original assumption that index selectivity == heap selectivity was
quite bogus even before this.
2005-06-13 23:14:49 +00:00
Bruce Momjian 51746c4549 Free buffer allocated via malloc (process is short-lived, but fix it anyway). 2005-06-09 22:36:27 +00:00
Tom Lane dce83e76e0 Add missing #include -- mea culpa. 2005-06-09 21:01:25 +00:00
Tom Lane 82358ca936 Put a critical section around update of hash index metapage. Per
discussion with Qingqing Zhou.
2005-06-09 18:23:50 +00:00
Tom Lane f5b2f60bd1 Change WAL-logging scheme for multixacts to be more like regular
transaction IDs, rather than like subtrans; in particular, the information
now survives a database restart.  Per previous discussion, this is
essential for PITR log shipping and for 2PC.
2005-06-08 15:50:28 +00:00
Tom Lane ee7ac7b11e Modify XLogInsert API to make callers specify whether pages to be backed
up have the standard layout with unused space between pd_lower and pd_upper.
When this is set, XLogInsert will omit the unused space without bothering
to scan it to see if it's zero.  That saves time in XLogInsert, and also
allows reversion of my earlier patch to make PageRepairFragmentation et al
explicitly re-zero freed space.  Per suggestion by Heikki Linnakangas.
2005-06-06 20:22:58 +00:00
Tom Lane 4c8495a1f2 Remove the mostly-stubbed-out-anyway support routines for WAL UNDO.
That code is never going to be used in the foreseeable future, and
where it's more than a stub it's making the redo routines harder to
read.
2005-06-06 17:01:25 +00:00
Tom Lane 21fda22ec4 Change CRCs in WAL records from 64bit to 32bit for performance reasons.
Instead of a separate CRC on each backup block, include backup blocks
in their parent WAL record's CRC; this is important to ensure that the
backup block really goes with the WAL record, ie there was not a page
tear right at the start of the backup block.  Implement a simple form
of compression of backup blocks: drop any run of zeroes starting at
pd_lower, so as not to store the unused 'hole' that commonly exists in
PG heap and index pages.  Tweak PageRepairFragmentation and related
routines to ensure they keep the unused space zeroed, so that the above
compression method remains effective.  All per recent discussions.
2005-06-02 05:55:29 +00:00
Tom Lane a91fa39028 Add test to WAL replay to verify that xl_prev points back to the previous
WAL record; this is necessary to be sure we recognize stale WAL records
when a WAL page was only partially written during a system crash.
2005-05-31 19:10:28 +00:00
Tom Lane e92a88272e Modify hash_search() API to prevent future occurrences of the error
spotted by Qingqing Zhou.  The HASH_ENTER action now automatically
fails with elog(ERROR) on out-of-memory --- which incidentally lets
us eliminate duplicate error checks in quite a bunch of places.  If
you really need the old return-NULL-on-out-of-memory behavior, you
can ask for HASH_ENTER_NULL.  But there is now an Assert in that path
checking that you aren't hoping to get that behavior in a palloc-based
hash table.
Along the way, remove the old HASH_FIND_SAVE/HASH_REMOVE_SAVED actions,
which were not being used anywhere anymore, and were surely too ugly
and unsafe to want to see revived again.
2005-05-29 04:23:07 +00:00
Tom Lane 32e8fc4a28 Arrange to cache fmgr lookup information for an index's access method
routines in the index's relcache entry, instead of doing a fresh fmgr_info
on every index access.  We were already doing this for the index's opclass
support functions; not sure why we didn't think to do it for the AM
functions too.  This supersedes the former method of caching (only)
amgettuple in indexscan scan descriptors; it's an improvement because the
function lookup can be amortized across multiple statements instead of
being repeated for each statement.  Even though lookup for builtin
functions is pretty cheap, this seems to drop a percent or two off some
simple benchmarks.
2005-05-27 23:31:21 +00:00
Bruce Momjian b492c3accc Add parentheses to macros when args are used in computations. Without
them, the executation behavior could be unexpected.
2005-05-25 21:40:43 +00:00
Bruce Momjian 6dc7760ac3 Add support for wal_fsync_writethrough for Darwin, and restructure the
code to better handle writethrough.

Chris Campbell
2005-05-20 14:53:26 +00:00
Tom Lane ff0c143a3b Make a comment pgindent-proof, per suggestion from Alvaro. 2005-05-19 23:58:51 +00:00
Tom Lane ee3b71f6bc Split the shared-memory array of PGPROC pointers out of the sinval
communication structure, and make it its own module with its own lock.
This should reduce contention at least a little, and it definitely makes
the code seem cleaner.  Per my recent proposal.
2005-05-19 21:35:48 +00:00
Neil Conway c891e05f26 Cleanup GiST header files. Since GiST extensions are often written as
external projects, we should be careful about what parts of the GiST
API are considered implementation details, and which are part of the
public API. Therefore, I've moved internal-only declarations into
gist_private.h -- future backward-incompatible changes to gist.h should
be made with care, to avoid needlessly breaking external GiST extensions.

Also did some related header cleanup: remove some unnecessary #includes
from gist.h, and remove some unused definitions: isAttByVal(), _gistdump(),
and GISTNStrategies.
2005-05-17 03:34:18 +00:00
Neil Conway eda6dd32d1 GiST improvements:
- make sure we always invoke user-supplied GiST methods in a short-lived
  memory context. This means the backend isn't exposed to any memory leaks
  that be in those methods (in fact, it is probably a net loss for most
  GiST methods to bother manually freeing memory now). This also means
  we can do away with a lot of ugly manual memory management in the
  GiST code itself.

- keep the current page of a GiST index scan pinned, rather than doing a
  ReadBuffer() for each tuple produced by the scan. Since ReadBuffer() is
  expensive, this is a perf. win

- implement dead tuple killing for GiST indexes (which is easy to do, now
  that we keep a pin on the current scan page). Now all the builtin indexes
  implement dead tuple killing.

- cleanup a lot of ugly code in GiST
2005-05-17 00:59:30 +00:00
Tom Lane 2ef172a2a4 Fix latent bug in ExecSeqRestrPos: it leaves the plan node's result slot
in an inconsistent state.  (This is only latent because in reality
ExecSeqRestrPos is dead code at the moment ... but someday maybe it won't
be.)  Add some comments about what the API for plan node mark/restore
actually is, because it's not immediately obvious.
2005-05-15 21:19:55 +00:00
Neil Conway eb0d00a9be Various style cleanups for GiST; no changes to functionality. 2005-05-15 04:08:29 +00:00
Neil Conway 3140437495 This patch refactors away some duplicated code in the index AM build
methods: they all invoke UpdateStats() since they have computed the
number of heap tuples, so I created a function in catalog/index.c that
each AM now calls.
2005-05-11 06:24:55 +00:00
Neil Conway f38e413b20 Code cleanup: in C89, there is no point casting the first argument to
memset() or MemSet() to a char *. For one, memset()'s first argument is
a void *, and further void * can be implicitly coerced to/from any other
pointer type.
2005-05-11 01:26:02 +00:00
Bruce Momjian 35e1651508 Back out check for unreferenced files.
Heikki Linnakangas
2005-05-10 22:27:30 +00:00
Neil Conway dc5ebcfcce Fix typo in comment. 2005-05-10 05:15:07 +00:00
Tom Lane 30f540be43 Repair very-low-probability race condition between relation extension
and VACUUM: in the interval between adding a new page to the relation
and formatting it, it was possible for VACUUM to come along and decide
it should format the page too.  Though not harmful in itself, this would
cause data loss if a third transaction were able to insert tuples into
the vacuumed page before the original extender got control back.
2005-05-07 21:32:24 +00:00
Tom Lane 4cd4ed0cc2 Fix case in which a debug printout would print already-pfreed data. 2005-05-07 18:14:25 +00:00
Tom Lane 278bd0cc22 For some reason access/tupmacs.h has been #including utils/memutils.h,
which is neither needed by nor related to that header.  Remove the bogus
inclusion and instead include the header in those C files that actually
need it.  Also fix unnecessary inclusions and bad inclusion order in
tsearch2 files.
2005-05-06 17:24:55 +00:00
Tom Lane 126eaef651 Clean up MultiXactIdExpand's API by separating out the case where we
are creating a new MultiXactId from two regular XIDs.  The original
coding was unnecessarily complicated and didn't save any code anyway.
2005-05-03 19:42:41 +00:00
Bruce Momjian 76668e6eb4 Check the file system on postmaster startup and report any unreferenced
files in the server log.

Heikki Linnakangas
2005-05-02 18:26:54 +00:00
Tom Lane 6c412f0605 Change CREATE TYPE to require datatype output and send functions to have
only one argument.  (Per recent discussion, the option to accept multiple
arguments is pretty useless for user-defined types, and would be a likely
source of security holes if it was used.)  Simplify call sites of
output/send functions to not bother passing more than one argument.
2005-05-01 18:56:19 +00:00
Tom Lane 93b2477278 Use the standard lock manager to establish priority order when there
is contention for a tuple-level lock.  This solves the problem of a
would-be exclusive locker being starved out by an indefinite succession
of share-lockers.  Per recent discussion with Alvaro.
2005-04-30 19:03:33 +00:00
Tom Lane 3a694bb0a1 Restructure LOCKTAG as per discussions of a couple months ago.
Essentially, we shoehorn in a lockable-object-type field by taking
a byte away from the lockmethodid, which can surely fit in one byte
instead of two.  This allows less artificial definitions of all the
other fields of LOCKTAG; we can get rid of the special pg_xactlock
pseudo-relation, and also support locks on individual tuples and
general database objects (including shared objects).  None of those
possibilities are actually exploited just yet, however.

I removed pg_xactlock from pg_class, but did not force initdb for
that change.  At this point, relkind 's' (SPECIAL) is unused and
could be removed entirely.
2005-04-29 22:28:24 +00:00
Tom Lane bedb78d386 Implement sharable row-level locks, and use them for foreign key references
to eliminate unnecessary deadlocks.  This commit adds SELECT ... FOR SHARE
paralleling SELECT ... FOR UPDATE.  The implementation uses a new SLRU
data structure (managed much like pg_subtrans) to represent multiple-
transaction-ID sets.  When more than one transaction is holding a shared
lock on a particular row, we create a MultiXactId representing that set
of transactions and store its ID in the row's XMAX.  This scheme allows
an effectively unlimited number of row locks, just as we did before,
while not costing any extra overhead except when a shared lock actually
has to be shared.   Still TODO: use the regular lock manager to control
the grant order when multiple backends are waiting for a row lock.

Alvaro Herrera and Tom Lane.
2005-04-28 21:47:18 +00:00
Tom Lane 19d127548c Add comment about checkpoint panic behavior during shutdown, per
suggestion from Qingqing Zhou.
2005-04-23 18:49:54 +00:00
Tom Lane 3842892492 Recent changes got the sense of the notnull bit backwards in the 2.0
protocol output routines.  Mea culpa :-(.  Per report from Kris Jurka.
2005-04-23 17:45:35 +00:00
Bruce Momjian 1a6ad669fb Fix comment typo. 2005-04-17 03:04:29 +00:00
Tom Lane 5f0a974ea9 Reduce PANIC to ERROR in several xlog routines that are used in both
critical and noncritical contexts (an example of noncritical being
post-checkpoint removal of dead xlog segments).  In the critical cases
the CRIT_SECTION mechanism will cause ERROR to be promoted to PANIC
anyway, and in the noncritical cases we shouldn't let an error take
down the entire database.  Arguably there should be *no* explicit
PANIC errors in this module, only more START/END_CRIT_SECTION calls,
but I didn't go that far.  (Yet.)
2005-04-15 22:19:48 +00:00
Tom Lane 61b861421b Modify MoveOfflineLogs/InstallXLogFileSegment to avoid O(N^2) behavior
when recycling a large number of xlog segments during checkpoint.
The former behavior searched from the same start point each time,
requiring O(checkpoint_segments^2) stat() calls to relocate all the
segments.  Instead keep track of where we stopped last time through.
2005-04-15 18:48:10 +00:00
Tom Lane 8e14408028 Make equalTupleDescs() compare attlen/attbyval/attalign rather than
assuming comparison of atttypid is sufficient.  In a dropped column
atttypid will be 0, and we'd better check the physical-storage data
to make sure the tupdescs are physically compatible.
I do not believe there is a real risk before 8.0, since before that
we only used this routine to compare successive states of the tupdesc
for a particular relation.  But 8.0's typcache.c might be comparing
arbitrary tupdescs so we'd better play it safer.
2005-04-14 22:34:48 +00:00
Tom Lane 162bd08b3f Completion of project to use fixed OIDs for all system catalogs and
indexes.  Replace all heap_openr and index_openr calls by heap_open
and index_open.  Remove runtime lookups of catalog OID numbers in
various places.  Remove relcache's support for looking up system
catalogs by name.  Bulky but mostly very boring patch ...
2005-04-14 20:03:27 +00:00
Tom Lane 2193a856a2 Simplify initdb-time assignment of OIDs as I proposed yesterday, and
avoid encroaching on the 'user' range of OIDs by allowing automatic
OID assignment to use values below 16k until we reach normal operation.

initdb not forced since this doesn't make any incompatible change;
however a lot of stuff will have different OIDs after your next initdb.
2005-04-13 18:54:57 +00:00
Tom Lane c3294f1cbf Fix interaction between materializing holdable cursors and firing
deferred triggers: either one can create more work for the other,
so we have to loop till it's all gone.  Per example from andrew@supernews.
Add a regression test to help spot trouble in this area in future.
2005-04-11 19:51:16 +00:00
Tom Lane ad161bcc8a Merge Resdom nodes into TargetEntry nodes to simplify code and save a
few palloc's.  I also chose to eliminate the restype and restypmod fields
entirely, since they are redundant with information stored in the node's
contained expression; re-examining the expression at need seems simpler
and more reliable than trying to keep restype/restypmod up to date.

initdb forced due to change in contents of stored rules.
2005-04-06 16:34:07 +00:00
Tom Lane 47888fe842 First phase of OUT-parameters project. We can now define and use SQL
functions with OUT parameters.  The various PLs still need work, as does
pg_dump.  Rudimentary docs and regression tests included.
2005-03-31 22:46:33 +00:00
Tom Lane 8c85a34a3b Officially decouple FUNC_MAX_ARGS from INDEX_MAX_KEYS, and set the
former to 100 by default.  Clean up some of the less necessary
dependencies on FUNC_MAX_ARGS; however, the biggie (FunctionCallInfoData)
remains.
2005-03-29 03:01:32 +00:00
Tom Lane 70c9763d48 Convert oidvector and int2vector into variable-length arrays. This
change saves a great deal of space in pg_proc and its primary index,
and it eliminates the former requirement that INDEX_MAX_KEYS and
FUNC_MAX_ARGS have the same value.  INDEX_MAX_KEYS is still embedded
in the on-disk representation (because it affects index tuple header
size), but FUNC_MAX_ARGS is not.  I believe it would now be possible
to increase FUNC_MAX_ARGS at little cost, but haven't experimented yet.
There are still a lot of vestigial references to FUNC_MAX_ARGS, which
I will clean up in a separate pass.  However, getting rid of it
altogether would require changing the FunctionCallInfoData struct,
and I'm not sure I want to buy into that.
2005-03-29 00:17:27 +00:00
Tom Lane 119191609c Remove dead push/pop rollback code. Vadim once planned to implement
transaction rollback via UNDO but I think that's highly unlikely to
happen, so we may as well remove the stubs.  (Someday we ought to
rip out the stub xxx_undo routines, too.)  Per Alvaro.
2005-03-28 01:50:34 +00:00
Tom Lane bf3dbb5881 First steps towards index scans with heap access decoupled from index
access: define new index access method functions 'amgetmulti' that can
fetch multiple TIDs per call.  (The functions exist but are totally
untested as yet.)  Since I was modifying pg_am anyway, remove the
no-longer-needed 'rel' parameter from amcostestimate functions, and
also remove the vestigial amowner column that was creating useless
work for Alvaro's shared-object-dependencies project.
Initdb forced due to changes in pg_am.
2005-03-27 23:53:05 +00:00
Tom Lane 617dd33b6e Eliminate duplicate hasnulls bit testing in index tuple access, and
clean up itup.h a little bit.
2005-03-27 18:38:27 +00:00
Bruce Momjian b1f57d88f5 Change Win32 O_SYNC method to O_DSYNC because that is what the method
currently does.  This is now the default Win32 wal sync method because
we perfer o_datasync to fsync.

Also, change Win32 fsync to a new wal sync method called
fsync_writethrough because that is the behavior of _commit, which is
what is used for fsync on Win32.

Backpatch to 8.0.X.
2005-03-24 04:36:20 +00:00
Tom Lane 94e03330cb Create a routine PageIndexMultiDelete() that replaces a loop around
PageIndexTupleDelete() with a single pass of compactification ---
logic mostly lifted from PageRepairFragmentation.  I noticed while
profiling that a VACUUM that's cleaning up a whole lot of deleted
tuples would spend as much as a third of its CPU time in
PageIndexTupleDelete; not too surprising considering the loop method
was roughly O(N^2) in the number of tuples involved.
2005-03-22 06:17:03 +00:00
Tom Lane ee4ddac137 Convert index-related tuple handling routines from char 'n'/' ' to bool
convention for isnull flags.  Also, remove the useless InsertIndexResult
return struct from index AM aminsert calls --- there is no reason for
the caller to know where in the index the tuple was inserted, and we
were wasting a palloc cycle per insert to deliver this uninteresting
value (plus nontrivial complexity in some AMs).
I forced initdb because of the change in the signature of the aminsert
routines, even though nothing really looks at those pg_proc entries...
2005-03-21 01:24:04 +00:00
Neil Conway fe7015f5e8 Change the return value of HeapTupleSatisfiesUpdate() to be an enum,
rather than an integer, and fix the associated fallout. From Alvaro
Herrera.
2005-03-20 23:40:34 +00:00
Tom Lane 354049c709 Remove unnecessary calls of FlushRelationBuffers: there is no need
to write out data that we are about to tell the filesystem to drop.
smgr_internal_unlink already had a DropRelFileNodeBuffers call to
get rid of dead buffers without a write after it's no longer possible
to roll back the deleting transaction.  Adding a similar call in
smgrtruncate simplifies callers and makes the overall division of
labor clearer.  This patch removes the former behavior that VACUUM
would write all dirty buffers of a relation unconditionally.
2005-03-20 22:00:54 +00:00
Tom Lane f97aebd162 Revise TupleTableSlot code to avoid unnecessary construction and disassembly
of tuples when passing data up through multiple plan nodes.  A slot can now
hold either a normal "physical" HeapTuple, or a "virtual" tuple consisting
of Datum/isnull arrays.  Upper plan levels can usually just copy the Datum
arrays, avoiding heap_formtuple() and possible subsequent nocachegetattr()
calls to extract the data again.  This work extends Atsushi Ogawa's earlier
patch, which provided the key idea of adding Datum arrays to TupleTableSlots.
(I believe however that something like this was foreseen way back in Berkeley
days --- see the old comment on ExecProject.)  A test case involving many
levels of join of fairly wide tables (about 80 columns altogether) showed
about 3x overall speedup, though simple queries will probably not be
helped very much.

I have also duplicated some code in heaptuple.c in order to provide versions
of heap_formtuple and friends that use "bool" arrays to indicate null
attributes, instead of the old convention of "char" arrays containing either
'n' or ' '.  This provides a better match to the convention used by
ExecEvalExpr.  While I have not made a concerted effort to get rid of uses
of the old routines, I think they should be deprecated and eventually removed.
2005-03-16 21:38:10 +00:00
Tom Lane a9b05bdc83 Avoid O(N^2) overhead in repeated nocachegetattr calls when columns of
a tuple are being accessed via ExecEvalVar and the attcacheoff shortcut
isn't usable (due to nulls and/or varlena columns).  To do this, cache
Datums extracted from a tuple in the associated TupleTableSlot.
Also some code cleanup in and around the TupleTable handling.
Atsushi Ogawa with some kibitzing by Tom Lane.
2005-03-14 04:41:13 +00:00
Tom Lane a52b4fb131 Adjust creation/destruction of TupleDesc data structure to reduce the
number of palloc calls.  This has a salutory impact on plpgsql operations
with record variables (which create and destroy tupdescs constantly)
and probably helps a bit in some other cases too.
2005-03-07 04:42:17 +00:00
Tom Lane 4aefe75553 Remove some no-longer-needed kluges for bootstrapping, in particular
the AMI_OVERRIDE flag.  The fact that TransactionLogFetch treats
BootstrapTransactionId as always committed is sufficient to make
bootstrap work, and getting rid of extra tests in heavily used code
paths seems like a win.  The files produced by initdb are demonstrably
the same after this change.
2005-02-20 21:46:50 +00:00
Tom Lane 60b2444cc3 Add code to prevent transaction ID wraparound by enforcing a safe limit
in GetNewTransactionId().  Since the limit value has to be computed
before we run any real transactions, this requires adding code to database
startup to scan pg_database and determine the oldest datfrozenxid.
This can conveniently be combined with the first stage of an attack on
the problem that the 'flat file' copies of pg_shadow and pg_group are
not properly updated during WAL recovery.  The code I've added to
startup resides in a new file src/backend/utils/init/flatfiles.c, and
it is responsible for rewriting the flat files as well as initializing
the XID wraparound limit value.  This will eventually allow us to get
rid of GetRawDatabaseInfo too, but we'll need an initdb so we can add
a trigger to pg_database.
2005-02-20 02:22:07 +00:00
Bruce Momjian 7c44e57331 Move plpgsql DEBUG from DEBUG2 to DEBUG1 because it is a user-requested
DEBUG.

Fix a few places where DEBUG1 crept in that should have been DEBUG2.
2005-02-12 23:53:42 +00:00
Tom Lane 12179c99b1 Marginal hack to merge adjacent ReleaseBuffer/ReadBuffer calls into
ReleaseAndReadBuffer during GIST index searches.  We already did this
in btree and rtree, might as well do it here too.
2005-02-05 19:38:58 +00:00
Neil Conway a885ecd6ef Change heap_modifytuple() to require a TupleDesc rather than a
Relation. Patch from Alvaro Herrera, minor editorializing by
Neil Conway.
2005-01-27 23:24:11 +00:00
Tom Lane 0ffe9f7946 Fix memory leak in rtdosplit, per report from Clive Page. 2005-01-24 02:47:26 +00:00
Neil Conway b4297c177c This patch makes some improvements to the rtree index implementation:
(1) Keep a pin on the scan's current buffer and mark buffer. This
avoids the need to do a ReadBuffer() for each tuple produced by the
scan. Since ReadBuffer() is expensive, this is a significant win.

(2) Convert a ReleaseBuffer(); ReadBuffer() pair into
ReleaseAndReadBuffer(). Surely not a huge win, but it saves a lock
acquire/release...

(3) Remove a bunch of duplicated code in rtget.c; make rtnext() handle
both the "initial result" and "subsequent result" cases.

(4) Add support for index tuple killing

(5) Remove rtscancache(): it is dead code, for the same reason that
gistscancache() is dead code (an index scan ought not be invoked with
NoMovementScanDirection).

The end result is about a 10% improvement in rtree index scan perf,
according to contrib/rtree_gist/bench.
2005-01-18 23:25:55 +00:00
Tom Lane 0ce4d56924 Phase 1 of fix for 'SMgrRelation hashtable corrupted' problem. This
is the minimum required fix.  I want to look next at taking advantage of
it by simplifying the message semantics in the shared inval message queue,
but that part can be held over for 8.1 if it turns out too ugly.
2005-01-10 20:02:24 +00:00
Bruce Momjian 2daed8c5b3 Update copyrights that were missed. 2005-01-01 05:43:09 +00:00
PostgreSQL Daemon 2ff501590b Tag appropriate files for rc3
Also performed an initial run through of upgrading our Copyright date to
extend to 2005 ... first run here was very simple ... change everything
where: grep 1996-2004 && the word 'Copyright' ... scanned through the
generated list with 'less' first, and after, to make sure that I only
picked up the right entries ...
2004-12-31 22:04:05 +00:00
Tom Lane bfa5f30481 Awhile back I added some code to StartupCLOG() to forcibly zero out
the remainder of the current clog page during system startup.  While
this was a good idea, it turns out the code fails if nextXid is
exactly at a page boundary, because we won't have created the "current"
clog page yet in that case.  Since the page will be correctly zeroed
when we execute the first transaction on it, the solution is just to
do nothing when exactly at a page boundary.  Per trouble report from
Dave Hartwig.
2004-12-22 18:45:49 +00:00
Tom Lane ff5a354ece Fix is-it-time-for-a-checkpoint logic so that checkpoint_segments can
usefully be larger than 255.  Per gripe from Simon Riggs.
2004-12-17 00:10:36 +00:00
Tom Lane c3d6c7d8f9 Calculation of keys_are_unique flag was wrong for cases involving
redundant cross-datatype comparisons.  Per example from Merlin Moncure.
2004-12-15 19:16:39 +00:00
Tom Lane 5374d097de Change planner to use the current true disk file size as its estimate of
a relation's number of blocks, rather than the possibly-obsolete value
in pg_class.relpages.  Scale the value in pg_class.reltuples correspondingly
to arrive at a hopefully more accurate number of rows.  When pg_class
contains 0/0, estimate a tuple width from the column datatypes and divide
that into current file size to estimate number of rows.  This improved
methodology allows us to jettison the ancient hacks that put bogus default
values into pg_class when a table is first created.  Also, per a suggestion
from Simon, make VACUUM (but not VACUUM FULL or ANALYZE) adjust the value
it puts into pg_class.reltuples to try to represent the mean tuple density
instead of the minimal density that actually prevails just after VACUUM.
These changes alter the plans selected for certain regression tests, so
update the expected files accordingly.  (I removed join_1.out because
it's not clear if it still applies; we can add back any variant versions
as they are shown to be needed.)
2004-12-01 19:00:56 +00:00
Tom Lane 37d693033d Minor adjustment of message style. 2004-11-17 16:26:59 +00:00
Neil Conway 5d1dd2bc55 Micro-optimization of markpos() and restrpos() in btree and hash indexes.
Rather than using ReadBuffer() to increment the reference count on an
already-pinned buffer, we should use IncrBufferRefCount() as it is
faster and does not require acquiring the BufMgrLock.
2004-11-17 03:13:38 +00:00
Neil Conway b25d23e1e6 Don't allow pg_start_backup() to be invoked if archive_command has not
been defined. Patch from Gavin Sherry, editorializing by Neil Conway.
2004-11-17 02:22:54 +00:00
Neil Conway a236dd9536 There is no need for ReadBuffer() call sites to check that the returned
buffer is valid, as ReadBuffer() will elog on error. Most of the call
sites of ReadBuffer() got this right, but this patch fixes those call
sites that did not.
2004-11-14 02:04:14 +00:00
Neil Conway 4d0f669f3c Remove obsolete comment from btbuild() and hashbuild(): we no longer use
a global variable to control building indexes.
2004-11-11 00:32:50 +00:00
Peter Eisentraut 0ed3c7665e Small message clarifications 2004-11-05 17:11:34 +00:00
Tom Lane 88868d4fbc Change COMMIT back to the old behavior of emitting command tag COMMIT,
not ROLLBACK, for the case of COMMIT outside a transaction block.
Alvaro Herrera
2004-10-30 20:44:43 +00:00
Tom Lane 23f264d125 Rearrange order of pre-commit operations: must close cursors before doing
ON COMMIT actions.  Per bug report from Michael Guerin.
2004-10-29 22:19:53 +00:00
Tom Lane ee69be44d5 Add DEBUG1-level logging of checkpoint start and end. Also, reduce the
'recycled log files' and 'removed log files' messages from DEBUG1 to
DEBUG2, replacing them with a count of files added/removed/recycled in
the checkpoint end message, as per suggestion from Simon Riggs.
2004-10-29 00:16:08 +00:00
Tom Lane 83cd2d8b0f Make heap_fetch API more consistent by having the buffer remain pinned
in all cases when keep_buf = true.  This allows ANALYZE's inner loop to
use heap_release_fetch, which saves multiple buffer lookups for the same
page and avoids overestimation of cost by the vacuum cost mechanism.
2004-10-26 16:05:03 +00:00
Tom Lane fb22b32095 Allow functions returning void or cstring to appear in FROM clause,
to make life cushy for the JDBC driver.  Centralize the decision-making
that affects this by inventing a get_type_func_class() function, rather
than adding special cases in half a dozen places.
2004-10-20 16:04:50 +00:00
Tom Lane fdd13f1568 Give the ResourceOwner mechanism full responsibility for releasing buffer
pins at end of transaction, and reduce AtEOXact_Buffers to an Assert
cross-check that this was done correctly.  When not USE_ASSERT_CHECKING,
AtEOXact_Buffers is a complete no-op.  This gets rid of an O(NBuffers)
bottleneck during transaction commit/abort, which recent testing has shown
becomes significant above a few tens of thousands of shared buffers.
2004-10-16 18:57:26 +00:00
Tom Lane 9ffc8ed58b Repair possible failure to update hint bits back to disk, per
http://archives.postgresql.org/pgsql-hackers/2004-10/msg00464.php.
This fix is intended to be permanent: it moves the responsibility for
calling SetBufferCommitInfoNeedsSave() into the tqual.c routines,
eliminating the requirement for callers to test whether t_infomask changed.
Also, tighten validity checking on buffer IDs in bufmgr.c --- several
routines were paranoid about out-of-range shared buffer numbers but not
about out-of-range local ones, which seems a tad pointless.
2004-10-15 22:40:29 +00:00
Bruce Momjian 5c267325ec Add 'int' cast for getpid() because some Solaris releases return long
for getpid().
2004-10-14 20:23:46 +00:00
Peter Eisentraut 0fd37839d9 Message style revisions 2004-10-12 21:54:45 +00:00
Bruce Momjian 67608a393b Make getpid() use %d consistently for printing. 2004-10-09 02:46:42 +00:00
Bruce Momjian a5d7ba773d Adjust comments previously moved to column 1 by pgident. 2004-10-07 15:21:58 +00:00
Tom Lane 4c77cbb272 PortalRun must guard against the possibility that the portal it's
running contains VACUUM or a similar command that will internally start
and commit transactions.  In such a case, the original caller values of
CurrentMemoryContext and CurrentResourceOwner will point to objects that
will be destroyed by the internal commit.  We must restore these pointers
to point to the newly-manufactured transaction context and resource owner,
rather than possibly pointing to deleted memory.
Also tweak xact.c so that AbortTransaction and AbortSubTransaction
forcibly restore a sane value for CurrentResourceOwner, much as they
have always done for CurrentMemoryContext.  I'm not certain this is
necessary but I'm feeling paranoid today.
Responds to Sean Chittenden's bug report of 4-Oct.
2004-10-04 21:52:15 +00:00
Tom Lane 4c5e810fcd Code review for NOWAIT patch: downgrade NOWAIT from fully reserved keyword
to unreserved keyword, use ereport not elog, assign a separate error code
for 'could not obtain lock' so that applications will be able to detect
that case cleanly.
2004-10-01 16:40:05 +00:00
Tom Lane d2af5f8a3e Adjust index locking rules as per my proposal of earlier today. You
now are supposed to take some kind of lock on an index whenever you
are going to access the index contents, rather than relying only on a
lock on the parent table.
2004-09-30 23:21:26 +00:00
Neil Conway 0ed07d49d5 Code cleanup: don't bother casting the argument to pfree() to void *
from another pointer type. Per C89, this is unnecessary, and it is common
practice throughout the rest of the tree anyway.
2004-09-27 04:01:23 +00:00
Tom Lane 054b78ba38 Now that xmax and cmin are distinct fields again, we should zero xmax when
creating a new tuple.  This is just for debugging sanity, though, since
nothing should be paying any attention to xmax when the HEAP_XMAX_INVALID
bit is set.
2004-09-17 18:09:55 +00:00
Tom Lane 257cccbe5e Add some marginal tweaks to eliminate memory leakages associated with
subtransactions.  Trivial subxacts (such as a plpgsql exception block
containing no database access) now demonstrably leak zero bytes.
2004-09-16 20:17:49 +00:00
Tom Lane 86fff990b2 RecentXmin is too recent to use as the cutoff point for accessing
pg_subtrans --- what we need is the oldest xmin of any snapshot in use
in the current top transaction.  Introduce a new variable TransactionXmin
to play this role.  Fixes intermittent regression failure reported by
Neil Conway.
2004-09-16 18:35:23 +00:00
Tom Lane 8f9f198603 Restructure subtransaction handling to reduce resource consumption,
as per recent discussions.  Invent SubTransactionIds that are managed like
CommandIds (ie, counter is reset at start of each top transaction), and
use these instead of TransactionIds to keep track of subtransaction status
in those modules that need it.  This means that a subtransaction does not
need an XID unless it actually inserts/modifies rows in the database.
Accordingly, don't assign it an XID nor take a lock on the XID until it
tries to do that.  This saves a lot of overhead for subtransactions that
are only used for error recovery (eg plpgsql exceptions).  Also, arrange
to release a subtransaction's XID lock as soon as the subtransaction
exits, in both the commit and abort cases.  This avoids holding many
unique locks after a long series of subtransactions.  The price is some
additional overhead in XactLockTableWait, but that seems acceptable.
Finally, restructure the state machine in xact.c to have a more orthogonal
set of states for subtransactions.
2004-09-16 16:58:44 +00:00
Tom Lane b2c4071299 Redesign query-snapshot timing so that volatile functions in READ COMMITTED
mode see a fresh snapshot for each command in the function, rather than
using the latest interactive command's snapshot.  Also, suppress fresh
snapshots as well as CommandCounterIncrement inside STABLE and IMMUTABLE
functions, instead using the snapshot taken for the most closely nested
regular query.  (This behavior is only sane for read-only functions, so
the patch also enforces that such functions contain only SELECT commands.)
As per my proposal of 6-Sep-2004; I note that I floated essentially the
same proposal on 19-Jun-2002, but that discussion tailed off without any
action.  Since 8.0 seems like the right place to be taking possibly
nontrivial backwards compatibility hits, let's get it done now.
2004-09-13 20:10:13 +00:00
Tom Lane 493f72606b Renumber SnapshotNow and the other special snapshot codes so that
((Snapshot) NULL) can no longer be confused with a valid snapshot,
as per my recent suggestion.  Define a macro InvalidSnapshot for 0.
Use InvalidSnapshot instead of SnapshotAny as the do-nothing special
case for heap_update and heap_delete crosschecks; this seems a little
cleaner even though the behavior is really the same.
2004-09-11 18:28:34 +00:00