Commit Graph

2106 Commits

Author SHA1 Message Date
Tom Lane ee678fe30c Make NOTIFY_PAYLOAD_MAX_LENGTH depend explicitly on BLCKSZ and
NAMEDATALEN, so this code doesn't go nuts with smaller than default
BLCKSZ or larger than default NAMEDATALEN.  The standard value is
still exactly 8000.
2010-02-17 00:52:09 +00:00
Tom Lane d1e027221d Replace the pg_listener-based LISTEN/NOTIFY mechanism with an in-memory queue.
In addition, add support for a "payload" string to be passed along with
each notify event.

This implementation should be significantly more efficient than the old one,
and is also more compatible with Hot Standby usage.  There is not yet any
facility for HS slaves to receive notifications generated on the master,
although such a thing is possible in future.

Joachim Wieland, reviewed by Jeff Davis; also hacked on by me.
2010-02-16 22:34:57 +00:00
Andrew Dunstan fc5173ad51 Add query text to auto_explain output.
Still to be done: fix docs and fix regression failures under auto_explain.
2010-02-16 22:19:59 +00:00
Greg Stark 27cb626f7a revert to showing buffer counts in explain (buffers) 2010-02-16 20:07:13 +00:00
Alvaro Herrera dc11595193 Fix typo in comment 2010-02-15 16:10:34 +00:00
Greg Stark 34ebccddcd Display explain buffers measurements in memory units rather than blocks. Also show "Total Buffer Usage" to hint that these are totals not averages per loop 2010-02-15 02:36:26 +00:00
Robert Haas e26c539e9f Wrap calls to SearchSysCache and related functions using macros.
The purpose of this change is to eliminate the need for every caller
of SearchSysCache, SearchSysCacheCopy, SearchSysCacheExists,
GetSysCacheOid, and SearchSysCacheList to know the maximum number
of allowable keys for a syscache entry (currently 4).  This will
make it far easier to increase the maximum number of keys in a
future release should we choose to do so, and it makes the code
shorter, too.

Design and review by Tom Lane.
2010-02-14 18:42:19 +00:00
Tom Lane cbe9d6beb4 Fix up rickety handling of relation-truncation interlocks.
Move rd_targblock, rd_fsm_nblocks, and rd_vm_nblocks from relcache to the smgr
relation entries, so that they will get reset to InvalidBlockNumber whenever
an smgr-level flush happens.  Because we now send smgr invalidation messages
immediately (not at end of transaction) when a relation truncation occurs,
this ensures that other backends will reset their values before they next
access the relation.  We no longer need the unreliable assumption that a
VACUUM that's doing a truncation will hold its AccessExclusive lock until
commit --- in fact, we can intentionally release that lock as soon as we've
completed the truncation.  This patch therefore reverts (most of) Alvaro's
patch of 2009-11-10, as well as my marginal hacking on it yesterday.  We can
also get rid of assorted no-longer-needed relcache flushes, which are far more
expensive than an smgr flush because they kill a lot more state.

In passing this patch fixes smgr_redo's failure to perform visibility-map
truncation, and cleans up some rather dubious assumptions in freespace.c and
visibilitymap.c about when rd_fsm_nblocks and rd_vm_nblocks can be out of
date.
2010-02-09 21:43:30 +00:00
Tom Lane 95289e4a58 Rearrange lazy-vacuum code a little bit to reduce the window between
truncating the table and transaction commit.  This isn't really making
it safe, but at least there is no good reason to do free space map
cleanup within the risk window.  Don't lock out cancel interrupts
until we have to, either.
2010-02-09 00:28:30 +00:00
Tom Lane 9184cc7dab Fix serious performance bug in new implementation of VACUUM FULL:
cluster_rel necessarily builds an all-new toast table, so it's useless to
then go and VACUUM FULL the toast table.
2010-02-08 16:50:21 +00:00
Tom Lane 0a469c8769 Remove old-style VACUUM FULL (which was known for a little while as
VACUUM FULL INPLACE), along with a boatload of subsidiary code and complexity.
Per discussion, the use case for this method of vacuuming is no longer large
enough to justify maintaining it; not to mention that we don't wish to invest
the work that would be needed to make it play nicely with Hot Standby.

Aside from the code directly related to old-style VACUUM FULL, this commit
removes support for certain WAL record types that could only be generated
within VACUUM FULL, redirect-pointer removal in heap_page_prune, and
nontransactional generation of cache invalidation sinval messages (the last
being the sticking point for Hot Standby).

We still have to retain all code that copes with finding HEAP_MOVED_OFF and
HEAP_MOVED_IN flag bits on existing tuples.  This can't be removed as long
as we want to support in-place update from pre-9.0 databases.
2010-02-08 04:33:55 +00:00
Tom Lane 1ddc2703a9 Work around deadlock problems with VACUUM FULL/CLUSTER on system catalogs,
as per my recent proposal.

First, teach IndexBuildHeapScan to not wait for INSERT_IN_PROGRESS or
DELETE_IN_PROGRESS tuples to commit unless the index build is checking
uniqueness/exclusion constraints.  If it isn't, there's no harm in just
indexing the in-doubt tuple.

Second, modify VACUUM FULL/CLUSTER to suppress reverifying
uniqueness/exclusion constraint properties while rebuilding indexes of
the target relation.  This is reasonable because these commands aren't
meant to deal with corrupted-data situations.  Constraint properties
will still be rechecked when an index is rebuilt by a REINDEX command.

This gets us out of the problem that new-style VACUUM FULL would often
wait for other transactions while holding exclusive lock on a system
catalog, leading to probable deadlock because those other transactions
need to look at the catalogs too.  Although the real ultimate cause of
the problem is a debatable choice to release locks early after modifying
system catalogs, changing that choice would require pretty serious
analysis and is not something to be undertaken lightly or on a tight
schedule.  The present patch fixes the problem in a fairly reasonable
way and should also improve the speed of VACUUM FULL/CLUSTER a little bit.
2010-02-07 22:40:33 +00:00
Tom Lane b9b8831ad6 Create a "relation mapping" infrastructure to support changing the relfilenodes
of shared or nailed system catalogs.  This has two key benefits:

* The new CLUSTER-based VACUUM FULL can be applied safely to all catalogs.

* We no longer have to use an unsafe reindex-in-place approach for reindexing
  shared catalogs.

CLUSTER on nailed catalogs now works too, although I left it disabled on
shared catalogs because the resulting pg_index.indisclustered update would
only be visible in one database.

Since reindexing shared system catalogs is now fully transactional and
crash-safe, the former special cases in REINDEX behavior have been removed;
shared catalogs are treated the same as non-shared.

This commit does not do anything about the recently-discussed problem of
deadlocks between VACUUM FULL/CLUSTER on a system catalog and other
concurrent queries; will address that in a separate patch.  As a stopgap,
parallel_schedule has been tweaked to run vacuum.sql by itself, to avoid
such failures during the regression tests.
2010-02-07 20:48:13 +00:00
Tom Lane 9727c583fe Restructure CLUSTER/newstyle VACUUM FULL/ALTER TABLE support so that swapping
of old and new toast tables can be done either at the logical level (by
swapping the heaps' reltoastrelid links) or at the physical level (by swapping
the relfilenodes of the toast tables and their indexes).  This is necessary
infrastructure for upcoming changes to support CLUSTER/VAC FULL on shared
system catalogs, where we cannot change reltoastrelid.  The physical swap
saves a few catalog updates too.

We unfortunately have to keep the logical-level swap logic because in some
cases we will be adding or deleting a toast table, so there's no possibility
of a physical swap.  However, that only happens as a consequence of schema
changes in the table, which we do not need to support for system catalogs,
so such cases aren't an obstacle for that.

In passing, refactor the cluster support functions a little bit to eliminate
unnecessarily-duplicated code; and fix the problem that while CLUSTER had
been taught to rename the final toast table at need, ALTER TABLE had not.
2010-02-04 00:09:14 +00:00
Heikki Linnakangas 9de778b24b Move the responsibility of writing a "unlogged WAL operation" record from
heap_sync() to the callers, because heap_sync() is sometimes called even
if the operation itself is WAL-logged. This eliminates the bogus unlogged
records from CLUSTER that Simon Riggs reported, patch by Fujii Masao.
2010-02-03 10:01:30 +00:00
Tom Lane 70a2b05a59 Assorted cleanups in preparation for using a map file to support altering
the relfilenode of currently-not-relocatable system catalogs.

1. Get rid of inval.c's dependency on relfilenode, by not having it emit
smgr invalidations as a result of relcache flushes.  Instead, smgr sinval
messages are sent directly from smgr.c when an actual relation delete or
truncate is done.  This makes considerably more structural sense and allows
elimination of a large number of useless smgr inval messages that were
formerly sent even in cases where nothing was changing at the
physical-relation level.  Note that this reintroduces the concept of
nontransactional inval messages, but that's okay --- because the messages
are sent by smgr.c, they will be sent in Hot Standby slaves, just from a
lower logical level than before.

2. Move setNewRelfilenode out of catalog/index.c, where it never logically
belonged, into relcache.c; which is a somewhat debatable choice as well but
better than before.  (I considered catalog/storage.c, but that seemed too
low level.)  Rename to RelationSetNewRelfilenode.

3. Cosmetic cleanups of some other relfilenode manipulations.
2010-02-03 01:14:17 +00:00
Tom Lane c98157d693 CLUSTER specified the wrong namespace when renaming toast tables of temporary
relations (they don't live in pg_toast).  This caused an Assert failure in
assert-enabled builds.  So far as I can see, in a non-assert build it would
only have messed up the checks for conflicting names, so a failure would be
quite improbable but perhaps not impossible.
2010-02-02 19:12:29 +00:00
Robert Haas 63f9282f6e Tighten integrity checks on ALTER TABLE ... ALTER COLUMN ... RENAME.
When a column is renamed, we recursively rename the same column in
all descendent tables.  But if one of those tables also inherits that
column from a table outside the inheritance hierarchy rooted at the
named table, we must throw an error.  The previous coding correctly
prohibited the rename when the parent had inherited the column from
elsewhere, but overlooked the case where the parent was OK but a child
table also inherited the same column from a second, unrelated parent.

For now, not backpatched due to lack of complaints from the field.

KaiGai Kohei, with further changes by me.
Reviewed by Bernd Helme and Tom Lane.
2010-02-01 19:28:56 +00:00
Robert Haas 42a8ab0a14 Augment EXPLAIN output with more details on Hash nodes.
We show the number of buckets, the number of batches (and also the original
number if it has changed), and the peak space used by the hash table.  Minor
executor changes to track peak space used.
2010-02-01 15:43:36 +00:00
Tom Lane 034fffbf31 Fix memory leak created by deferrable-index-constraints patches.
We need to free the OID list returned by ExecInsertIndexTuples to avoid
a query-lifespan memory leak.  When many rows require rechecking, this
can be a significant leak --- it's even more than the space used for the
queued trigger events.

Dean Rasheed
2010-01-31 18:15:39 +00:00
Peter Eisentraut e7b3349a8a Type table feature
This adds the CREATE TABLE name OF type command, per SQL standard.
2010-01-28 23:21:13 +00:00
Heikki Linnakangas e0e8b96345 Change a few remaining calls of XLogArchivingActive() to use
XLogIsNeeded() instead, to determine if an otherwise non-logged operation
needs to be logged in WAL for standby servers.

Fujii Masao
2010-01-28 07:31:42 +00:00
Tom Lane d879697cd2 Remove the default_do_language parameter, instead making DO use a hardwired
default of "plpgsql".  This is more reasonable than it was when the DO patch
was written, because we have since decided that plpgsql should be installed
by default.  Per discussion, having a parameter for this doesn't seem useful
enough to justify the risk of application breakage if the value is changed
unexpectedly.
2010-01-26 16:33:40 +00:00
Tom Lane 875353b99f Fix assorted core dumps and Assert failures that could occur during
AbortTransaction or AbortSubTransaction, when trying to clean up after an
error that prevented (sub)transaction start from completing:
* access to TopTransactionResourceOwner that might not exist
* assert failure in AtEOXact_GUC, if AtStart_GUC not called yet
* assert failure or core dump in AfterTriggerEndSubXact, if
  AfterTriggerBeginSubXact not called yet

Per testing by injecting elog(ERROR) at successive steps in StartTransaction
and StartSubTransaction.  It's not clear whether all of these cases could
really occur in the field, but at least one of them is easily exposed by
simple stress testing, as per my accidental discovery yesterday.
2010-01-24 21:49:17 +00:00
Robert Haas 76a47c0e74 Replace ALTER TABLE ... SET STATISTICS DISTINCT with a more general mechanism.
Attributes can now have options, just as relations and tablespaces do, and
the reloptions code is used to parse, validate, and store them.  For
simplicity and because these options are not performance critical, we store
them in a separate cache rather than the main relcache.

Thanks to Alex Hunsaker for the review.
2010-01-22 16:40:19 +00:00
Heikki Linnakangas 09b115f706 Write a WAL record whenever we perform an operation without WAL-logging
that would've been WAL-logged if archiving was enabled. If we encounter
such records in archive recovery anyway, we know that some data is
missing from the log. A WARNING is emitted in that case.

Original patch by Fujii Masao, with changes by me.
2010-01-20 19:43:40 +00:00
Peter Eisentraut eb210ce85a Before attempting to create a composite type, check whether a type of that
name already exists, so we'd get an error message about a "type" instead
of about a "relation", because the composite type code shares code with
relation creation.
2010-01-20 05:47:09 +00:00
Tom Lane 9a915e596f Improve the handling of SET CONSTRAINTS commands by having them search
pg_constraint before searching pg_trigger.  This allows saner handling of
corner cases; in particular we now say "constraint is not deferrable"
rather than "constraint does not exist" when the command is applied to
a constraint that's inherently non-deferrable.  Per a gripe several months
ago from hubert depesz lubaczewski.

To make this work without breaking user-defined constraint triggers,
we have to add entries for them to pg_constraint.  However, in return
we can remove the pgconstrname column from pg_constraint, which represents
a fairly sizable space savings.  I also replaced the tgisconstraint column
with tgisinternal; the old meaning of tgisconstraint can now be had by
testing for nonzero tgconstraint, while there is no other way to get
the old meaning of nonzero tgconstraint, namely that the trigger was
internally generated rather than being user-created.

In passing, fix an old misstatement in the docs and comments, namely that
pg_trigger.tgdeferrable is exactly redundant with pg_constraint.condeferrable.
Actually, we mark RI action triggers as nondeferrable even when they belong to
a nominally deferrable FK constraint.  The SET CONSTRAINTS code now relies on
that instead of hard-coding a list of exception OIDs.
2010-01-17 22:56:23 +00:00
Simon Riggs 8bfd1a8848 Lock database while running drop database in Hot Standby to protect
against concurrent reconnection. Failure during testing showed issue
was possible, even though earlier analysis seemed to indicate it
would not be required. Use LockSharedObjectForSession() before
ResolveRecoveryConflictWithDatabase() and hold lock until end of
processing for that WAL record. Simple approach to avoid introducing
further bugs at this stage of development on an improbable issue.
2010-01-16 14:16:31 +00:00
Tom Lane 08f8d478eb Do parse analysis of an EXPLAIN's contained statement during the normal
parse analysis phase, rather than at execution time.  This makes parameter
handling work the same as it does in ordinary plannable queries, and in
particular fixes the incompatibility that Pavel pointed out with plpgsql's
new handling of variable references.  plancache.c gets a little bit
grottier, but the alternatives seem worse.
2010-01-15 22:36:35 +00:00
Heikki Linnakangas 40f908bdcd Introduce Streaming Replication.
This includes two new kinds of postmaster processes, walsenders and
walreceiver. Walreceiver is responsible for connecting to the primary server
and streaming WAL to disk, while walsender runs in the primary server and
streams WAL from disk to the client.

Documentation still needs work, but the basics are there. We will probably
pull the replication section to a new chapter later on, as well as the
sections describing file-based replication. But let's do that as a separate
patch, so that it's easier to see what has been added/changed. This patch
also adds a new section to the chapter about FE/BE protocol, documenting the
protocol used by walsender/walreceivxer.

Bump catalog version because of two new functions,
pg_last_xlog_receive_location() and pg_last_xlog_replay_location(), for
monitoring the progress of replication.

Fujii Masao, with additional hacking by me
2010-01-15 09:19:10 +00:00
Simon Riggs e99767bc28 First part of refactoring of code for ResolveRecoveryConflict. Purposes
of this are to centralise the conflict code to allow further change,
as well as to allow passing through the full reason for the conflict
through to the conflicting backends. Backend state alters how we
can handle different types of conflict so this is now required.
As originally suggested by Heikki, no longer optional.
2010-01-14 11:08:02 +00:00
Bruce Momjian 228170410d Please tablespace directories in their own subdirectory so pg_migrator
can upgrade clusters without renaming the tablespace directories.  New
directory structure format is, e.g.:

	$PGDATA/pg_tblspc/20981/PG_8.5_201001061/719849/83292814
2010-01-12 02:42:52 +00:00
Simon Riggs 3bfcccc295 During Hot Standby, fix drop database when sessions idle.
Previously we only cancelled sessions that were in-transaction.

Simple fix is to just cancel all sessions without waiting. Doing
it this way avoids complicating common code paths, which would
not be worth the trouble to cover this rare case.

Problem report and fix by Andres Freund, edited somewhat by me
2010-01-10 15:44:28 +00:00
Bruce Momjian c282b36dd2 More tablespace.c comment improvements. 2010-01-07 04:10:39 +00:00
Bruce Momjian 85fcbd8655 Clarify tablespace.c::TablespaceCreateDbspace() comments. 2010-01-07 04:05:39 +00:00
Bruce Momjian a6f56efc35 PG_MAJORVERSION:
For simplicity, use PG_MAJORVERSION rather than PG_VERSION for creation
of the PG_VERSION file.
2010-01-06 23:23:51 +00:00
Itagaki Takahiro ee0b602425 Silence compiler warning about uninitialized variables. This initialization
is not necessary needed, but some compilers complain about it.
2010-01-06 11:25:39 +00:00
Itagaki Takahiro 946cf229e8 Support rewritten-based full vacuum as VACUUM FULL. Traditional
VACUUM FULL was renamed to VACUUM FULL INPLACE. Also added a new
option -i, --inplace for vacuumdb to perform FULL INPLACE vacuuming.

Since the new VACUUM FULL uses CLUSTER infrastructure, we cannot
use it for system tables. VACUUM FULL for system tables always
fall back into VACUUM FULL INPLACE silently.

Itagaki Takahiro, reviewed by Jeff Davis and Simon Riggs.
2010-01-06 05:31:14 +00:00
Bruce Momjian f98fbc78c3 Preserve relfilenodes:
Add support to pg_dump --binary-upgrade to preserve all relfilenodes,
for use by pg_migrator.
2010-01-06 03:04:03 +00:00
Bruce Momjian 5c82ccb1dd Use OIDCHARS:
Use OIDCHARS for oid character length, rather than '10', in tablespace
code.
2010-01-06 01:48:09 +00:00
Robert Haas d86d51a958 Support ALTER TABLESPACE name SET/RESET ( tablespace_options ).
This patch only supports seq_page_cost and random_page_cost as parameters,
but it provides the infrastructure to scalably support many more.
In particular, we may want to add support for effective_io_concurrency,
but I'm leaving that as future work for now.

Thanks to Tom Lane for design help and Alvaro Herrera for the review.
2010-01-05 21:54:00 +00:00
Bruce Momjian 0239800893 Update copyright for the year 2010. 2010-01-02 16:58:17 +00:00
Tom Lane e6df063cf2 Dept of second thoughts: recursive case in ANALYZE shouldn't emit a
pgstats message.  This might need to be done differently later, but
with the current logic that's what should happen.
2009-12-30 21:21:33 +00:00
Tom Lane 48c192c15e Revise pgstat's tracking of tuple changes to improve the reliability of
decisions about when to auto-analyze.

The previous code depended on n_live_tuples + n_dead_tuples - last_anl_tuples,
where all three of these numbers could be bad estimates from ANALYZE itself.
Even worse, in the presence of a steady flow of HOT updates and matching
HOT-tuple reclamations, auto-analyze might never trigger at all, even if all
three numbers are exactly right, because n_dead_tuples could hold steady.

To fix, replace last_anl_tuples with an accurately tracked count of the total
number of committed tuple inserts + updates + deletes since the last ANALYZE
on the table.  This can still be compared to the same threshold as before, but
it's much more trustworthy than the old computation.  Tracking this requires
one more intra-transaction counter per modified table within backends, but no
additional memory space in the stats collector.  There probably isn't any
measurable speed difference; if anything it might be a bit faster than before,
since I was able to eliminate some per-tuple arithmetic operations in favor of
adding sums once per (sub)transaction.

Also, simplify the logic around pgstat vacuum and analyze reporting messages
by not trying to fold VACUUM ANALYZE into a single pgstat message.

The original thought behind this patch was to allow scheduling of analyzes
on parent tables by artificially inflating their changes_since_analyze count.
I've left that for a separate patch since this change seems to stand on its
own merit.
2009-12-30 20:32:14 +00:00
Tom Lane 540e69a061 Add an index on pg_inherits.inhparent, and use it to avoid seqscans in
find_inheritance_children().  This is a complete no-op in databases without
any inheritance.  In databases where there are just a few entries in
pg_inherits, it could conceivably be a small loss.  However, in databases with
many inheritance parents, it can be a big win.
2009-12-29 22:00:14 +00:00
Tom Lane 649b5ec7c8 Add the ability to store inheritance-tree statistics in pg_statistic,
and teach ANALYZE to compute such stats for tables that have subclasses.
Per my proposal of yesterday.

autovacuum still needs to be taught about running ANALYZE on parent tables
when their subclasses change, but the feature is useful even without that.
2009-12-29 20:11:45 +00:00
Heikki Linnakangas 84d723b6ce Previous fix for temporary file management broke returning a set from
PL/pgSQL function within an exception handler. Make sure we use the right
resource owner when we create the tuplestore to hold returned tuples.

Simplify tuplestore API so that the caller doesn't need to be in the right
memory context when calling tuplestore_put* functions. tuplestore.c
automatically switches to the memory context used when the tuplestore was
created. Tuplesort was already modified like this earlier. This patch also
removes the now useless MemoryContextSwitch calls from callers.

Report by Aleksei on pgsql-bugs on Dec 22 2009. Backpatch to 8.1, like
the previous patch that broke this.
2009-12-29 17:40:59 +00:00
Bruce Momjian 0d399d57d5 Remove PGDLLIMPORT used for binary upgrade; must be on the externs, per Tom. 2009-12-28 18:49:05 +00:00
Bruce Momjian 3687b2e0eb Add PGDLLIMPORT for binary_upgrade global variables so shared object
libraries can access them.
2009-12-28 18:39:03 +00:00