table, by allocating just enough for a hardcoded number of dead tuples per
page. The current estimate is 200 dead tuples per page.
Per reports from Jeff Amiel, Erik Jones and Marko Kreen, and subsequent
discussion.
CVS: ----------------------------------------------------------------------
CVS: Enter Log. Lines beginning with `CVS:' are removed automatically
CVS:
CVS: Committing in .
CVS:
CVS: Modified Files:
CVS: commands/vacuumlazy.c
CVS: ----------------------------------------------------------------------
* stats_start_collector goes away; we always start the collector process,
unless prevented by a problem with setting up the stats UDP socket.
* stats_reset_on_server_start goes away; it seems useless in view of the
availability of pg_stat_reset().
* stats_block_level and stats_row_level are merged into a single variable
"track_counts", which controls all reports sent to the collector process.
* stats_command_string is renamed to track_activities.
* log_autovacuum is renamed to log_autovacuum_min_duration to better reflect
its meaning.
The log_autovacuum change is not a compatibility issue since it didn't exist
before 8.3 anyway. The other changes need to be release-noted.
columns, and the new version can be stored on the same heap page, we no longer
generate extra index entries for the new version. Instead, index searches
follow the HOT-chain links to ensure they find the correct tuple version.
In addition, this patch introduces the ability to "prune" dead tuples on a
per-page basis, without having to do a complete VACUUM pass to recover space.
VACUUM is still needed to clean up dead index entries, however.
Pavan Deolasee, with help from a bunch of other people.
no-longer-needed pages at the end of a table. We thought we could throw away
pages containing HEAPTUPLE_DEAD tuples; but this is not so, because such
tuples very likely have index entries pointing at them, and we wouldn't have
removed the index entries. The problem only emerges in a somewhat unlikely
race condition: the dead tuples have to have been inserted by a transaction
that later aborted, and this has to have happened between VACUUM's initial
scan of the page and then rechecking it for empty in count_nondeletable_pages.
But that timespan will include an index-cleaning pass, so it's not all that
hard to hit. This seems to explain a couple of previously unsolved bug
reports.
than two independent bits (one of which was never used in heap pages anyway,
or at least hadn't been in a very long time). This gives us flexibility to
add the HOT notions of redirected and dead item pointers without requiring
anything so klugy as magic values of lp_off and lp_len. The state values
are chosen so that for the states currently in use (pre-HOT) there is no
change in the physical representation.
an exclusive lock on the table at this point, which we want to release as soon
as possible. This is called in the phase of lazy vacuum where we truncate the
empty pages at the end of the table.
An alternative solution would be to lower the vacuum delay settings before
starting the truncating phase, but this doesn't work very well in autovacuum
due to the autobalancing code (which can cause other processes to change our
cost delay settings). This case could be considered in the balancing code, but
it is simpler this way.
ReadNewTransactionId from GetSnapshotData --- with a "latestCompletedXid"
variable that is updated during transaction commit or abort. Since
latestCompletedXid is written only in places that had to lock ProcArrayLock
exclusively anyway, and is read only in places that had to lock ProcArrayLock
shared anyway, it adds no new locking requirements to the system despite being
cluster-wide. Moreover, removing ReadNewTransactionId from snapshot
acquisition eliminates the need to take both XidGenLock and ProcArrayLock at
the same time. Since XidGenLock is sometimes held across I/O this can be a
significant win. Some preliminary benchmarking suggested that this patch has
no effect on average throughput but can significantly improve the worst-case
transaction times seen in pgbench. Concept by Florian Pflug, implementation
by Tom Lane.
no need for serialization against snapshot-taking because the xact doesn't
affect anyone else's snapshot anyway. Per discussion. Also, move various
info about the interlocking of transactions and snapshots out of code comments
and into a hopefully-more-cohesive discussion in access/transam/README.
Also, remove a couple of now-obsolete comments about having to force some WAL
to be written to persuade RecordTransactionCommit to do its thing.
null::char(3) to a simple Const node. (It already worked for non-null values,
but not when we skipped evaluation of a strict coercion function.) This
prevents loss of typmod knowledge in situations such as exhibited in bug
#3598. Unfortunately there seems no good way to fix that bug in 8.1 and 8.2,
because they simply don't carry a typmod for a plain Const node.
In passing I made all the other callers of makeNullConst supply "real" typmod
values too, though I think it probably doesn't matter anywhere else.
rows will normally never obtain an XID at all. We already did things this way
for subtransactions, but this patch extends the concept to top-level
transactions. In applications where there are lots of short read-only
transactions, this should improve performance noticeably; not so much from
removal of the actual XID-assignments, as from reduction of overhead that's
driven by the rate of XID consumption. We add a concept of a "virtual
transaction ID" so that active transactions can be uniquely identified even
if they don't have a regular XID. This is a much lighter-weight concept:
uniqueness of VXIDs is only guaranteed over the short term, and no on-disk
record is made about them.
Florian Pflug, with some editorialization by Tom.
(Actually, it works as a plain statement too, but I didn't document that
because it seems a bit useless.) Unify VariableResetStmt with
VariableSetStmt, and clean up some ancient cruft in the representation of
same.
There are still some loose ends: I didn't do anything about the SET FROM
CURRENT idea yet, and it's not real clear whether we are happy with the
interaction of SET LOCAL with function-local settings. The documentation
is a bit spartan, too.
namespace isn't necessarily first in the search path (there could be implicit
schemas ahead of it). Examples are
test=# set search_path TO s1;
test=# create view pg_timezone_names as select * from pg_timezone_names();
ERROR: "pg_timezone_names" is already a view
test=# create table pg_class (f1 int primary key);
ERROR: permission denied: "pg_class" is a system catalog
You'd expect these commands to create the requested objects in s1, since
names beginning with pg_ aren't supposed to be reserved anymore. What is
happening is that we create the requested base table and then execute
additional commands (here, CREATE RULE or CREATE INDEX), and that code is
passed the same RangeVar that was in the original command. Since that
RangeVar has schemaname = NULL, the secondary commands think they should do a
path search, and that means they find system catalogs that are implicitly in
front of s1 in the search path.
This is perilously close to being a security hole: if the secondary command
failed to apply a permission check then it'd be possible for unprivileged
users to make schema modifications to system catalogs. But as far as I can
find, there is no code path in which a check doesn't occur. Which makes it
just a weird corner-case bug for people who are silly enough to want to
name their tables the same as a system catalog.
The relevant code has changed quite a bit since 8.2, which means this patch
wouldn't work as-is in the back branches. Since it's a corner case no one
has reported from the field, I'm not going to bother trying to back-patch.
relcache entry after having heap_close'd it. This could lead to misbehavior
if a relcache flush wiped out the cache entry meanwhile. In 8.2 there is a
very real risk of CREATE INDEX CONCURRENTLY using the wrong relid for locking
and waiting purposes. I think the bug is only cosmetic in 8.0 and 8.1,
because their transgression is limited to using RelationGetRelationName(rel)
in an ereport message immediately after heap_close, and there's no way (except
with special debugging options) for a cache flush to occur in that interval.
Not quite sure that it's cosmetic in 7.4, but seems best to patch anyway.
Found by trying to run the regression tests with CLOBBER_CACHE_ALWAYS enabled.
Maybe we should try to do that on a regular basis --- it's awfully slow,
but perhaps some fast buildfarm machine could do it once in awhile.
This prevents needing to do complex and poorly-defined updates of the
mapping table if the new parser has different token types than the old.
Per discussion.
init options of the template as top-level options in the syntax. This also
makes ALTER a bit easier to use, since options can be replaced individually.
I also made these statements verify that the tmplinit method will accept
the new settings before they get stored; in the original coding you didn't
find out about mistakes until the dictionary got invoked.
Under the hood, init methods now get options as a List of DefElem instead
of a raw text string --- that lets tsearch use existing options-pushing code
instead of duplicating functionality.
'with map' parameter; as things now stand there's really not much point
in specifying a config-to-copy if you don't copy its map. Also, use
COPY instead of TEMPLATE as the key word for a config-to-copy, so as
to avoid confusion with text search templates. Per discussion; the
just-committed reference page for the command already describes it
this way.
Oleg Bartunov and Teodor Sigaev, but I did a lot of editorializing,
so anything that's broken is probably my fault.
Documentation is nonexistent as yet, but let's land the patch so we can
get some portability testing done.
are not one of the query's defined result relations, but nonetheless have
triggers fired against them while the query is active. This was formerly
impossible but can now occur because of my recent patch to fix the firing
order for RI triggers. Caching a ResultRelInfo avoids duplicating work by
repeatedly opening and closing the same relation, and also allows EXPLAIN
ANALYZE to "see" and report on these extra triggers. Use the same mechanism
to cache open relations when firing deferred triggers at transaction shutdown;
this replaces the former one-element-cache strategy used in that case, and
should improve performance a bit when there are deferred triggers on a number
of relations.
row within one query: we were firing check triggers before all the updates
were done, leading to bogus failures. Fix by making the triggers queued by
an RI update go at the end of the outer query's trigger event list, thereby
effectively making the processing "breadth-first". This was indeed how it
worked pre-8.0, so the bug does not occur in the 7.x branches.
Per report from Pavel Stehule.
First, we cannot assume that XLogAsyncCommitFlush guarantees hint bits will be
settable, because clog.c's inexact LSN bookkeeping results in windows where a
previously flushed transaction is considered unhintable because it shares an
LSN slot with a later unflushed transaction. But repair_frag requires
XMIN_COMMITTED to be correct so that it can distinguish tuples moved by the
current vacuum. Since not being able to set the bit is an uncommon corner
case, the most practical way of dealing with it seems to be to abandon
shrinking (ie, don't invoke repair_frag) when we find a non-dead tuple whose
XMIN_COMMITTED bit couldn't be set.
Second, it is possible for the same reason that a RECENTLY_DEAD tuple does not
get its XMAX_COMMITTED bit set during scan_heap. But by the time repair_frag
examines the tuple it might be possible to set the bit. We therefore must
take buffer content lock when calling HeapTupleSatisfiesVacuum a second time,
else we can get an Assert failure in SetBufferCommitInfoNeedsSave. This
latter bug is latent in existing releases, but I think it cannot actually
occur without async commit, since the first HeapTupleSatisfiesVacuum call
should always have set the bit. So I'm not going to back-patch it.
In passing, reduce the existing "cannot shrink relation" messages from NOTICE
to LOG level. The new message must be no higher than LOG if we don't want
unpredictable regression test failures, and consistency seems like a good
idea. Also arrange that only one such message is reported per VACUUM FULL;
in typical scenarios you could get spammed with many such messages, which
seems a bit useless.
displayed in the postmaster log. This avoids Windows-specific problems with
localized time zone names that are in the wrong encoding, and generally seems
like a good idea to forestall other potential platform-dependent issues.
To preserve the existing behavior that all backends will log in the same time
zone, create a new GUC variable log_timezone that can only be changed on a
system-wide basis, and reference log-related calculations to that zone instead
of the TimeZone variable.
This fixes the issue reported by Hiroshi Saito that timestamps printed by
xlog.c startup could be improperly localized on Windows. We still need a
simpler patch for that problem in the back branches, however.
before reporting a transaction committed. Data consistency is still
guaranteed (unlike setting fsync = off), but a crash may lose the effects
of the last few transactions. Patch by Simon, some editorialization by Tom.
referencing table does not change the tuple's FK column(s), we don't bother
to check the PK table since the constraint was presumably already valid.
However, the check is still necessary if the tuple was inserted by our own
transaction, since in that case the INSERT trigger will conclude it need not
make the check (since its version of the tuple has been deleted). We got this
right for simple cases, but not when the insert and update are in different
subtransactions of the current top-level transaction; in such cases the FK
check would never be made at all. (Hence, problem dates back to 8.0 when
subtransactions were added --- it's actually the subtransaction version of a
bug fixed in 7.3.5.) Fix, and add regression test cases. Report and fix by
Affan Salman.
Sequences and views could previously be renamed using ALTER TABLE, but
this was a repeated source of confusion for users. Update the docs,
and psql tab completion. Patch from David Fetter; various minor fixes
by myself.
that are fired at end-of-statement (as is the normal case for foreign keys,
for example). In this situation the per-subxact deferred trigger context
is always empty when subtransaction exit is reached; so we could free it,
but were not doing so, leading to an intratransaction leak of 8K or more
per subtransaction. Per off-list example from Viatcheslav Kalinin
subsequent to bug #3418 (his original bug report omitted a foreign key
constraint needed to cause this leak).
Back-patch to 8.2; prior versions were not using per-subxact contexts
for deferred triggers, so did not have this leak.
over a fairly long period of time, rather than being spat out in a burst.
This happens only for background checkpoints carried out by the bgwriter;
other cases, such as a shutdown checkpoint, are still done at full speed.
Remove the "all buffers" scan in the bgwriter, and associated stats
infrastructure, since this seems no longer very useful when the checkpoint
itself is properly throttled.
Original patch by Itagaki Takahiro, reworked by Heikki Linnakangas,
and some minor API editorialization by me.
(which now deals only in optimizable statements), and put that code
into a new file parser/parse_utilcmd.c. This helps clarify and enforce
the design rule that utility statements shouldn't be processed during
the regular parse analysis phase; all interpretation of their meaning
should happen after they are given to ProcessUtility to execute.
(We need this because we don't retain any locks for a utility statement
that's in a plan cache, nor have any way to detect that it's stale.)
We are also able to simplify the API for parse_analyze() and related
routines, because they will now always return exactly one Query structure.
In passing, fix bug #3403 concerning trying to add a serial column to
an existing temp table (this is largely Heikki's work, but we needed
all that restructuring to make it safe).
profiling that CopyAttributeOutText was taking an unreasonable fraction of
the backend run time (like 66%!) on the following trivial test case:
$ time psql -c "copy (select repeat('xyzzy',50) from generate_series(1,10000000)) to stdout" regression >/dev/null
The time is all being spent on scanning the string for characters to be
escaped, which most of the time there aren't any of. Some tweaking to take
as many tests as possible out of the inner loop reduced the runtime of this
example by more than 10%. In a real-world case it wouldn't be as useful
a speedup, but it still seems worth adding a few lines here.
an array of strings rather than an array of integers, and allow any simple
constant or identifier to be used in typmods; for example
create table foo (f1 widget(42,'23skidoo',point));
Of course the typmodin function has still got to pack this info into a
non-negative int32 for storage, but it's still a useful improvement in
flexibility, especially considering that you can do nearly anything if you
are willing to keep the info in a side table. We can get away with this
change since we have not yet released a version providing user-definable
typmods. Per discussion.
for each temp file, rather than once per sort or hashjoin; this allows
spreading the data of a large sort or join across multiple tablespaces.
(I remain dubious that this will make any difference in practice, but certain
people insisted.) Arrange to cache the results of parsing the GUC variable
instead of recomputing from scratch on every demand, and push usage of the
cache down to the bottommost fd.c level.
tablespace(s) in which to store temp tables and temporary files. This is a
list to allow spreading the load across multiple tablespaces (a random list
element is chosen each time a temp object is to be created). Temp files are
not stored in per-database pgsql_tmp/ directories anymore, but per-tablespace
directories.
Jaime Casanova and Albert Cervera, with review by Bernd Helmle and Tom Lane.
will exit before failing because of conflicting DB usage. Per discussion,
this seems a good idea to help mask the fact that backend exit takes nonzero
time. Remove a couple of thereby-obsoleted sleeps in contrib and PL
regression test sequences.
buffers, rather than blowing out the whole shared-buffer arena. Aside from
avoiding cache spoliation, this fixes the problem that VACUUM formerly tended
to cause a WAL flush for every page it modified, because we had it hacked to
use only a single buffer. Those flushes will now occur only once per
ring-ful. The exact ring size, and the threshold for seqscans to switch into
the ring usage pattern, remain under debate; but the infrastructure seems
done. The key bit of infrastructure is a new optional BufferAccessStrategy
object that can be passed to ReadBuffer operations; this replaces the former
StrategyHintVacuum API.
This patch also changes the buffer usage-count methodology a bit: we now
advance usage_count when first pinning a buffer, rather than when last
unpinning it. To preserve the behavior that a buffer's lifetime starts to
decrease when it's released, the clock sweep code is modified to not decrement
usage_count of pinned buffers.
Work not done in this commit: teach GiST and GIN indexes to use the vacuum
BufferAccessStrategy for vacuum-driven fetches.
Original patch by Simon, reworked by Heikki and again by Tom.
and/or create plans for hypothetical situations; in particular, investigate
plans that would be generated using hypothetical indexes. This is a
heavily-rewritten version of the hooks proposed by Gurjeet Singh for his
Index Advisor project. In this formulation, the index advisor can be
entirely a loadable module instead of requiring a significant part to be
in the core backend, and plans can be generated for hypothetical indexes
without requiring the creation and rolling-back of system catalog entries.
The index advisor patch as-submitted is not compatible with these hooks,
but it needs significant work anyway due to other 8.2-to-8.3 planner
changes. With these hooks in the core backend, development of the advisor
can proceed as a pgfoundry project.
FreezeXid introduced in a recent commit, so there isn't any data loss in this
approach.
Doing it causes ALTER TABLE (or rather, the forms of it that cause a full table
rewrite) to be affected as well. In this case, the frozen point is RecentXmin,
because after the rewrite all the tuples are relabeled with the rewriting
transaction's Xid.
TOAST tables are fixed automatically as well, as fallout of the way they were
already being handled in the respective code paths.
With this patch, there is no longer need to VACUUM tables for Xid wraparound
purposes that have been cleaned up via TRUNCATE or CLUSTER.
to avoid losing useful Xid information in not-so-old tuples. This makes
CLUSTER behave the same as VACUUM as far a tuple-freezing behavior goes
(though CLUSTER does not yet advance the table's relfrozenxid).
While at it, move the actual freezing operation in rewriteheap.c to a more
appropriate place, and document it thoroughly. This part of the patch from
Tom Lane.
avoid a later needless VACUUM for Xid-wraparound purposes. We can do this
since the table is known to be left empty, so no Xid remains on it.
Per discussion.
there's an indirect dependency on the owner via the parent table. We were
already handling indexes that way, but not toast tables for some reason.
Saves a little catalog space and cuts down the verbosity of checkSharedDependencies
reports.
named foo, would work but the other ordering would not. If a user-specified
type or table name collides with an existing auto-generated array name, just
rename the array type out of the way by prepending more underscores. This
should not create any backward-compatibility issues, since the cases in which
this will happen would have failed outright in prior releases.
Also fix an oversight in the arrays-of-composites patch: ALTER TABLE RENAME
renamed the table's rowtype but not its array type.
needs to check the new constraint against columns of derived domains too.
Also, make it error out if the domain to be modified is used within any
composite-type columns. Eventually we should support that case, but it seems
a bit painful, and not suitable for a back-patch. For the moment just let the
user know we can't do it.
Backpatch to 8.2, which is the only released version that allows nested
domains. Possibly the other part should be back-patched further.
and views (but not system catalogs, nor sequences or toast tables). Get rid
of the hardwired convention that a type's array type is named exactly "_type",
instead using a new column pg_type.typarray to provide the linkage. (It still
will be named "_type", though, except in odd corner cases such as
maximum-length type names.)
Along the way, make tracking of owner and schema dependencies for types more
uniform: a type directly created by the user has these dependencies, while a
table rowtype or auto-generated array type does not have them, but depends on
its parent object instead.
David Fetter, Andrew Dunstan, Tom Lane
true at the very end of its processing, the update is broadcast via a
shared-cache-inval message for the index; without this, existing backends that
already have relcache entries for the index might never see it become valid.
Also, force a relcache inval on the index's parent table at the same time,
so that any cached plans for that table are re-planned; this ensures that
the newly valid index will be used if appropriate. Aside from making
C.I.C. behave more reasonably, this is necessary infrastructure for some
aspects of the HOT patch. Pavan Deolasee, with a little further stuff from
me.
messages to the stats collector. This avoids the problem that enabling
stats_row_level for autovacuum has a significant overhead for short
read-only transactions, as noted by Arjen van der Meijden. We can avoid
an extra gettimeofday call by piggybacking on the one done for WAL-logging
xact commit or abort (although that doesn't help read-only transactions,
since they don't WAL-log anything).
In my proposal for this, I noted that we could change the WAL log entries
for commit/abort to record full TimestampTz precision, instead of only
time_t as at present. That's not done in this patch, but will be committed
separately.
types of unspecified parameters when submitted via extended query protocol.
This worked in 8.2 but I had broken it during plancache changes. DECLARE
CURSOR is now treated almost exactly like a plain SELECT through parse
analysis, rewrite, and planning; only just before sending to the executor
do we divert it away to ProcessUtility. This requires a special-case check
in a number of places, but practically all of them were already special-casing
SELECT INTO, so it's not too ugly. (Maybe it would be a good idea to merge
the two by treating IntoClause as a form of utility statement? Not going to
worry about that now, though.) That approach doesn't work for EXPLAIN,
however, so for that I punted and used a klugy solution of running parse
analysis an extra time if under extended query protocol.
is in progress on the same hashtable. This seems the least invasive way to
fix the recently-recognized problem that a split could cause the scan to
visit entries twice or (with much lower probability) miss them entirely.
The only field-reported problem caused by this is the "failed to re-find
shared lock object" PANIC in COMMIT PREPARED reported by Michel Dorochevsky,
which was caused by multiply visited entries. However, it seems certain
that mdsync() is vulnerable to missing required fsync's due to missed
entries, and I am fearful that RelationCacheInitializePhase2() might be at
risk as well. Because of that and the generalized hazard presented by this
bug, back-patch all the supported branches.
Along the way, fix pg_prepared_statement() and pg_cursor() to not assume
that the hashtables they are examining will stay static between calls.
This is risky regardless of the newly noted dynahash problem, because
hash_seq_search() has never promised to cope with deletion of table entries
other than the just-returned one. There may be no bug here because the only
supported way to call these functions is via ExecMakeTableFunctionResult()
which will cycle them to completion before doing anything very interesting,
but it seems best to get rid of the assumption. This affects 8.2 and HEAD
only, since those functions weren't there earlier.
RESET SESSION, RESET PLANS, and RESET TEMP are now DISCARD ALL,
DISCARD PLANS, and DISCARD TEMP, respectively. This is to avoid
confusion with the pre-existing RESET variants: the DISCARD
commands are not actually similar to RESET. Patch from Marko
Kreen, with some minor editorialization.
processes to be running simultaneously. Also, now autovacuum processes do not
count towards the max_connections limit; they are counted separately from
regular processes, and are limited by the new GUC variable
autovacuum_max_workers.
The launcher now has intelligence to launch workers on each database every
autovacuum_naptime seconds, limited only on the max amount of worker slots
available.
Also, the global worker I/O utilization is limited by the vacuum cost-based
delay feature. Workers are "balanced" so that the total I/O consumption does
not exceed the established limit. This part of the patch was contributed by
ITAGAKI Takahiro.
Per discussion.
a replan. I had originally thought this was not necessary, but the new
SPI facilities create a path whereby queries planned with non-default
options can get into the cache, so it is necessary.
access to the planner's cursor-related planning options, and provide new
FETCH/MOVE routines that allow access to the full power of those commands.
Small refactoring of planner(), pg_plan_query(), and pg_plan_queries()
APIs to make it convenient to pass the planning options down from SPI.
This is the core-code portion of Pavel Stehule's patch for scrollable
cursor support in plpgsql; I'll review and apply the plpgsql changes
separately.
reviewed by Neil Conway. This patch adds the following DDL command
variants: RESET SESSION, RESET TEMP, RESET PLANS, CLOSE ALL, and
DEALLOCATE ALL. RESET SESSION is intended for use by connection
pool software and the like, in order to reset a client session
to something close to its initial state.
Note that while most of these command variants can be executed
inside a transaction block (but are not transaction-aware!),
RESET SESSION cannot. While this is inconsistent, it is intended
to catch programmer mistakes: RESET SESSION in an open transaction
block is probably unintended.
This commit breaks any code that assumes that the mere act of forming a tuple
(without writing it to disk) does not "toast" any fields. While all available
regression tests pass, I'm not totally sure that we've fixed every nook and
cranny, especially in contrib.
Greg Stark with some help from Tom Lane
A DBA is allowed to create a language in his database if it's marked
"tmpldbacreate" in pg_pltemplate. The factory default is that this is set
for all standard trusted languages, but of course a superuser may adjust
the settings. In service of this, add the long-foreseen owner column to
pg_language; renaming, dropping, and altering owner of a PL now follow
normal ownership rules instead of being superuser-only.
Jeremy Drake, with some editorialization by Tom Lane.
search_path that was active when the plan was first made. To do this,
improve namespace.c to support a stack of "override" search path settings
(we must have a stack since nested replan events are entirely possible).
This facility replaces the "special namespace" hack formerly used by
CREATE SCHEMA, and should be able to support per-function search path
settings as well.
doesn't exist. This allows DROP to be used to clean out the pg_tablespace
catalog entry in a situation where a previous DROP attempt failed before
committing but after having removed the directories and symlink.
Per report from William Garrison. Even though his test case depends on an
unrelated bug in PreventTransactionChain, it's certainly possible for this
situation to arise due to other problems, eg a system crash at just the
right time.
rules to be defined with different, per session controllable, behaviors
for replication purposes.
This will allow replication systems like Slony-I and, as has been stated
on pgsql-hackers, other products to control the firing mechanism of
triggers and rewrite rules without modifying the system catalog directly.
The firing mechanisms are controlled by a new superuser-only GUC
variable, session_replication_role, together with a change to
pg_trigger.tgenabled and a new column pg_rewrite.ev_enabled. Both
columns are a single char data type now (tgenabled was a bool before).
The possible values in these attributes are:
'O' - Trigger/Rule fires when session_replication_role is "origin"
(default) or "local". This is the default behavior.
'D' - Trigger/Rule is disabled and fires never
'A' - Trigger/Rule fires always regardless of the setting of
session_replication_role
'R' - Trigger/Rule fires when session_replication_role is "replica"
The GUC variable can only be changed as long as the system does not have
any cached query plans. This will prevent changing the session role and
accidentally executing stored procedures or functions that have plans
cached that expand to the wrong query set due to differences in the rule
firing semantics.
The SQL syntax for changing a triggers/rules firing semantics is
ALTER TABLE <tabname> <when> TRIGGER|RULE <name>;
<when> ::= ENABLE | ENABLE ALWAYS | ENABLE REPLICA | DISABLE
psql's \d command as well as pg_dump are extended in a backward
compatible fashion.
Jan
did not expect that a DEAD tuple could follow a RECENTLY_DEAD tuple in an
update chain, but because the OldestXmin rule for determining deadness is a
simplification of reality, it is possible for this situation to occur
(implying that the RECENTLY_DEAD tuple is in fact dead to all observers,
but this patch does not attempt to exploit that). The code would follow a
chain forward all the way, but then stop before a DEAD tuple when backing
up, meaning that not all of the chain got moved. This could lead to copying
the chain multiple times (resulting in duplicate copies of the live tuple at
its end), or leaving dangling index entries behind (which, aside from
generating warnings from later vacuums, creates a risk of wrong query
results or bogus duplicate-key errors once the heap slot the index entry
points to is repopulated).
The fix is to recheck HeapTupleSatisfiesVacuum while following a chain
forward, and to stop if a DEAD tuple is reached. Each contiguous group
of RECENTLY_DEAD tuples will therefore be copied as a separate chain.
The patch also adds a couple of extra sanity checks to verify correct
behavior.
Per report and test case from Pavan Deolasee.
module and teach PREPARE and protocol-level prepared statements to use it.
In service of this, rearrange utility-statement processing so that parse
analysis does not assume table schemas can't change before execution for
utility statements (necessary because we don't attempt to re-acquire locks
for utility statements when reusing a stored plan). This requires some
refactoring of the ProcessUtility API, but it ends up cleaner anyway,
for instance we can get rid of the QueryContext global.
Still to do: fix up SPI and related code to use the plan cache; I'm tempted to
try to make SQL functions use it too. Also, there are at least some aspects
of system state that we want to ensure remain the same during a replan as in
the original processing; search_path certainly ought to behave that way for
instance, and perhaps there are others.
even if none of the fields in the pg_class row change. This behavior is
necessary to ensure other backends flush rd_targblock values that might
point to truncated-away pages. We got this right pre-8.2 but it was broken
by overoptimistic change to not write out the pg_class row if unchanged.
Per report from Pavan Deolasee.
fixup various places in the tree that were clearing a StringInfo by hand.
Making this function a part of the API simplifies client code slightly,
and avoids needlessly peeking inside the StringInfo interface.
drill down into subplan targetlists to print the referent expression for an
OUTER or INNER var in an upper plan node. Hence, make it do that always, and
banish the old hack of showing "?columnN?" when things got too complicated.
Along the way, fix an EXPLAIN bug I introduced by suppressing subqueries from
execution-time range tables: get_name_for_var_field() assumed it could look at
rte->subquery to find out the real type of a RECORD var. That doesn't work
anymore, but instead we can look at the input plan of the SubqueryScan plan
node.
and quals have varno OUTER, rather than zero, to indicate a reference to
an output of their lefttree subplan. This is consistent with the way
that every other upper-level node type does it, and allows some simplifications
in setrefs.c and EXPLAIN.
useless substructure for its RangeTblEntry nodes. (I chose to keep using the
same struct node type and just zero out the link fields for unneeded info,
rather than making a separate ExecRangeTblEntry type --- it seemed too
fragile to have two different rangetable representations.)
Along the way, put subplans into a list in the toplevel PlannedStmt node,
and have SubPlan nodes refer to them by list index instead of direct pointers.
Vadim wanted to do that years ago, but I never understood what he was on about
until now. It makes things a *whole* lot more robust, because we can stop
worrying about duplicate processing of subplans during expression tree
traversals. That's been a constant source of bugs, and it's finally gone.
There are some consequent simplifications yet to be made, like not using
a separate EState for subplans in the executor, but I'll tackle that later.
storing mostly-redundant Query trees in prepared statements, portals, etc.
To replace Query, a new node type called PlannedStmt is inserted by the
planner at the top of a completed plan tree; this carries just the fields of
Query that are still needed at runtime. The statement lists kept in portals
etc. now consist of intermixed PlannedStmt and bare utility-statement nodes
--- no Query. This incidentally allows us to remove some fields from Query
and Plan nodes that shouldn't have been there in the first place.
Still to do: simplify the execution-time range table; at the moment the
range table passed to the executor still contains Query trees for subqueries.
initdb forced due to change of stored rules.
plan nodes, so that the executor does not need to get these items from
the range table at runtime. This will avoid needing to include these
fields in the compact range table I'm expecting to make the executor use.
an opclass for a generic type such as ANYARRAY. The original coding failed
to check that PK and FK columns were of the same array type. Per discussion
with Tom Dunstan. Also, make the code a shade more readable by not trying
to economize on variables.
continuously, and requests vacuum runs of "autovacuum workers" to postmaster.
The workers do the actual vacuum work. This allows for future improvements,
like allowing multiple autovacuum jobs running in parallel.
For now, the code keeps the original behavior of having a single autovac
process at any time by sleeping until the previous worker has finished.
equality checks it applies, instead of a random dependence on whatever
operators might be named "=". The equality operators will now be selected
from the opfamily of the unique index that the FK constraint depends on to
enforce uniqueness of the referenced columns; therefore they are certain to be
consistent with that index's notion of equality. Among other things this
should fix the problem noted awhile back that pg_dump may fail for foreign-key
constraints on user-defined types when the required operators aren't in the
search path. This also means that the former warning condition about "foreign
key constraint will require costly sequential scans" is gone: if the
comparison condition isn't indexable then we'll reject the constraint
entirely. All per past discussions.
Along the way, make the RI triggers look into pg_constraint for their
information, instead of using pg_trigger.tgargs; and get rid of the always
error-prone fixed-size string buffers in ri_triggers.c in favor of building up
the RI queries in StringInfo buffers.
initdb forced due to columns added to pg_constraint and pg_trigger.
entries for the victim database go away sooner rather than later. We already
did the equivalent thing at the per-relation level, not sure why it's not
been done for whole databases. With this change, pgstat_vacuum_tabstat
should usually not find anything to do; though we still need it as a backstop
in case DROPDB or TABPURGE messages get lost under load.
thought that it didn't have to reposition the underlying tuplestore if the
portal is atEnd. But this is not so, because tuplestores have separate read
and write cursors ... and the read cursor hasn't moved from the start.
This mistake explains bug #2970 from William Zhang.
Note: the coding here is pretty inefficient, but given that no one has noticed
this bug until now, I'd say hardly anyone uses the case where the cursor has
been advanced before being persisted. So maybe it's not worth worrying about.
describe the maximum size of index tuples (which is typically AM-dependent
anyway); and consequently remove the bogus deduction for "special space"
that was built into it.
Adjust TOAST_TUPLE_THRESHOLD and TOAST_MAX_CHUNK_SIZE to avoid wasting two
bytes per toast chunk, and to ensure that the calculation correctly tracks any
future changes in page header size. The computation had been inaccurate in a
way that didn't cause any harm except space wastage, but future changes could
have broken it more drastically.
Fix the calculation of BTMaxItemSize, which was formerly computed as 1 byte
more than it could safely be. This didn't cause any harm in practice because
it's only compared against maxalign'd lengths, but future changes in the size
of page headers or btree special space could have exposed the problem.
initdb forced because of change in TOAST_MAX_CHUNK_SIZE, which alters the
storage of toast tables.
made query plan. Use of ALTER COLUMN TYPE creates a hazard for cached
query plans: they could contain Vars that claim a column has a different
type than it now has. Fix this by checking during plan startup that Vars
at relation scan level match the current relation tuple descriptor. Since
at that point we already have at least AccessShareLock, we can be sure the
column type will not change underneath us later in the query. However,
since a backend's locks do not conflict against itself, there is still a
hole for an attacker to exploit: he could try to execute ALTER COLUMN TYPE
while a query is in progress in the current backend. Seal that hole by
rejecting ALTER TABLE whenever the target relation is already open in
the current backend.
This is a significant security hole: not only can one trivially crash the
backend, but with appropriate misuse of pass-by-reference datatypes it is
possible to read out arbitrary locations in the server process's memory,
which could allow retrieving database content the user should not be able
to see. Our thanks to Jeff Trout for the initial report.
Security: CVE-2007-0556
Standard English uses "may", "can", and "might" in different ways:
may - permission, "You may borrow my rake."
can - ability, "I can lift that log."
might - possibility, "It might rain today."
Unfortunately, in conversational English, their use is often mixed, as
in, "You may use this variable to do X", when in fact, "can" is a better
choice. Similarly, "It may crash" is better stated, "It might crash".
FAMILY; and add FAMILY option to CREATE OPERATOR CLASS to allow adding a
class to a pre-existing family. Per previous discussion. Man, what a
tedious lot of cutting and pasting ...
columns procost and prorows, to allow simple user adjustment of the estimated
cost of a function call, as well as control of the estimated number of rows
returned by a set-returning function. We might eventually wish to extend this
to allow function-specific estimation routines, but there seems to be
consensus that we should try a simple constant estimate first. In particular
this provides a relatively simple way to control the order in which different
WHERE clauses are applied in a plan node, which is a Good Thing in view of the
fact that the recent EquivalenceClass planner rewrite made that much less
predictable than before.
provide just a boolean 'amcanorder', instead of fields that specify the
sort operator strategy numbers. We have decided to require ordering-capable
AMs to use btree-compatible strategy numbers, so the old fields are
overkill (and indeed misleading about what's allowed).
accessing it, like DROP DATABASE. This allows the regression tests to pass
with autovacuum enabled, which open the gates for finally enabling autovacuum
by default.
per-column options for btree indexes. The planner's support for this is still
pretty rudimentary; it does not yet know how to plan mergejoins with
nondefault ordering options. The documentation is pretty rudimentary, too.
I'll work on improving that stuff later.
Note incompatible change from prior behavior: ORDER BY ... USING will now be
rejected if the operator is not a less-than or greater-than member of some
btree opclass. This prevents less-than-sane behavior if an operator that
doesn't actually define a proper sort ordering is selected.
having md.c return a success/failure boolean to smgr.c, which was just going
to elog anyway, let md.c issue the elog messages itself. This allows better
error reporting, particularly in cases such as "short read" or "short write"
which Peter was complaining of. Also, remove the kluge of allowing mdread()
to return zeroes from a read-beyond-EOF: this is now an error condition
except when InRecovery or zero_damaged_pages = true. (Hash indexes used to
require that behavior, but no more.) Also, enforce that mdwrite() is to be
used for rewriting existing blocks while mdextend() is to be used for
extending the relation EOF. This restriction lets us get rid of the old
ad-hoc defense against creating huge files by an accidental reference to
a bogus block number: we'll only create new segments in mdextend() not
mdwrite() or mdread(). (Again, when InRecovery we allow it anyway, since
we need to allow updates of blocks that were later truncated away.)
Also, clean up the original makeshift patch for bug #2737: move the
responsibility for padding relation segments to full length into md.c.
cases. Operator classes now exist within "operator families". While most
families are equivalent to a single class, related classes can be grouped
into one family to represent the fact that they are semantically compatible.
Cross-type operators are now naturally adjunct parts of a family, without
having to wedge them into a particular opclass as we had done originally.
This commit restructures the catalogs and cleans up enough of the fallout so
that everything still works at least as well as before, but most of the work
needed to actually improve the planner's behavior will come later. Also,
there are not yet CREATE/DROP/ALTER OPERATOR FAMILY commands; the only way
to create a new family right now is to allow CREATE OPERATOR CLASS to make
one by default. I owe some more documentation work, too. But that can all
be done in smaller pieces once this infrastructure is in place.
operator strategy numbers, ie, GiST and GIN. This is almost cosmetic
enough to not need a catversion bump, but since the opr_sanity regression
test has to change in sync with the catalog entry, I figured I'd better
do one.
AbortTransaction, which would lead to recursion and eventual PANIC exit
as illustrated in recent report from Jeff Davis. First, in xact.c create
a special dedicated memory context for AbortTransaction to run in. This
solves the problem as long as AbortTransaction doesn't need more than 32K
(or whatever other size we create the context with). But in corner cases
it might. Second, in trigger.c arrange to keep pending after-trigger event
records in separate contexts that can be freed near the beginning of
AbortTransaction, rather than having them persist until CleanupTransaction
as before. Third, in portalmem.c arrange to free executor state data
earlier as well. These two changes should result in backing off the
out-of-memory condition before AbortTransaction needs any significant
amount of memory, at least in typical cases such as memory overrun due
to too many trigger events or too big an executor hash table. And all
the same for subtransaction abort too, of course.
in PITR scenarios. We now WAL-log the replacement of old XIDs with
FrozenTransactionId, so that such replacement is guaranteed to propagate to
PITR slave databases. Also, rather than relying on hint-bit updates to be
preserved, pg_clog is not truncated until all instances of an XID are known to
have been replaced by FrozenTransactionId. Add new GUC variables and
pg_autovacuum columns to allow management of the freezing policy, so that
users can trade off the size of pg_clog against the amount of freezing work
done. Revise the already-existing code that forces autovacuum of tables
approaching the wraparound point to make it more bulletproof; also, revise the
autovacuum logic so that anti-wraparound vacuuming is done per-table rather
than per-database. initdb forced because of changes in pg_class, pg_database,
and pg_autovacuum catalogs. Heikki Linnakangas, Simon Riggs, and Tom Lane.
that conflict with the OID that we want to use for the new database.
This avoids the risk of trying to remove files that maybe we shouldn't
remove. Per gripe from Jon Lapham and subsequent discussion of 27-Sep.
static variables. This avoids any risk of potential non-reentrancy,
and in particular offers a much cleaner workaround for the Intel compiler
bug that was affecting ginutil.c.
to performance. (A wholesale effort to get rid of strncpy should be
undertaken sometime, but not during beta.) This commit also fixes dynahash.c
to correctly truncate overlength string keys for hashtables, so that its
callers don't have to anymore.
even when a single relation requires more than max_fsm_pages pages. Also,
make VACUUM emit a warning in this case, since it likely means that VACUUM
FULL or other drastic corrective measure is needed. Per reports from Jeff
Frost and others of unexpected changes in the claimed max_fsm_pages need.
the table being analyzed. This prevents two ANALYZEs from running
concurrently on the same table and possibly suffering concurrent-update
failures while trying to store their results into pg_statistic. The
downside is that a database-wide ANALYZE executed within a transaction
block will hold ShareUpdateExclusiveLock on many tables simultaneously,
which could lead to concurrency issues or even deadlock against another
such ANALYZE. However, this seems a corner case of less importance
than getting unexpected errors from a foreground ANALYZE when autovacuum
elects to analyze the same table concurrently. Per discussion.
after an error during VACUUM. We have a PG_TRY block anyway around the only
call sites, so just reset it in the CATCH clause instead of having
AtEOXact_Buffers blindly do it during xact end. I think the old code was
actively wrong for the case of a failure during ANALYZE inside a
subtransaction --- the flag wouldn't get cleared until main transaction end.
Probably not worth back-patching though.
proposal. Parameter logging works even for binary-format parameters, and
logging overhead is avoided when disabled.
log_statement = all output for the src/test/examples/testlibpq3.c example
now looks like
LOG: statement: execute <unnamed>: SELECT * FROM test1 WHERE t = $1
DETAIL: parameters: $1 = 'joe''s place'
LOG: statement: execute <unnamed>: SELECT * FROM test1 WHERE i = $1::int4
DETAIL: parameters: $1 = '2'
and log_min_duration_statement = 0 results in
LOG: duration: 2.431 ms parse <unnamed>: SELECT * FROM test1 WHERE t = $1
LOG: duration: 2.335 ms bind <unnamed> to <unnamed>: SELECT * FROM test1 WHERE t = $1
DETAIL: parameters: $1 = 'joe''s place'
LOG: duration: 0.394 ms execute <unnamed>: SELECT * FROM test1 WHERE t = $1
DETAIL: parameters: $1 = 'joe''s place'
LOG: duration: 1.251 ms parse <unnamed>: SELECT * FROM test1 WHERE i = $1::int4
LOG: duration: 0.566 ms bind <unnamed> to <unnamed>: SELECT * FROM test1 WHERE i = $1::int4
DETAIL: parameters: $1 = '2'
LOG: duration: 0.173 ms execute <unnamed>: SELECT * FROM test1 WHERE i = $1::int4
DETAIL: parameters: $1 = '2'
(This example demonstrates the folly of ignoring parse/bind steps for duration
logging purposes, BTW.)
Along the way, create a less ad-hoc mechanism for determining which commands
are logged by log_statement = mod and log_statement = ddl. The former coding
was actually missing quite a few things that look like ddl to me, and it
did not handle EXECUTE or extended query protocol correctly at all.
This commit does not do anything about the question of whether log_duration
should be removed or made less redundant with log_min_duration_statement.
that has parameters is always planned afresh for each Bind command,
treating the parameter values as constants in the planner. This removes
the performance penalty formerly often paid for using out-of-line
parameters --- with this definition, the planner can do constant folding,
LIKE optimization, etc. After a suggestion by Andrew@supernews.
can create or modify rules for the table. Do setRuleCheckAsUser() while
loading rules into the relcache, rather than when defining a rule. This
ensures that permission checks for tables referenced in a rule are done with
respect to the current owner of the rule's table, whereas formerly ALTER TABLE
OWNER would fail to update the permission checking for associated rules.
Removal of separate RULE privilege is needed to prevent various scenarios
in which a grantee of RULE privilege could effectively have any privilege
of the table owner. For backwards compatibility, GRANT/REVOKE RULE is still
accepted, but it doesn't do anything. Per discussion here:
http://archives.postgresql.org/pgsql-hackers/2006-04/msg01138.php
the target relation(s). There might be some cases where we could discard
the pending event instead, but for the moment a conservative approach
seems sufficient. Per report from Markus Schiltknecht and subsequent
discussion.
optionally bind. I re-added the "statement:" label so people will
understand why the line is being printed (it is log_*statement
behavior).
Use single quotes for bind values, instead of double quotes, and double
literal single quotes in bind values (and document that). I also made
use of the DETAIL line to have much cleaner output.
locks that would conflict with a specified lock request, without
actually trying to get that lock. Use this instead of the former ad hoc
method of doing the first wait step in CREATE INDEX CONCURRENTLY.
Fixes problem with undetected deadlock and in many cases will allow the
index creation to proceed sooner than it otherwise could've. Per
discussion with Greg Stark.
by abandoning the idea that it should say SERIAL in the dump. Instead,
dump serial sequences and column defaults just like regular ones.
Add a new backend command ALTER SEQUENCE OWNED BY to let pg_dump recreate
the sequence-to-column dependency that was formerly created "behind the
scenes" by SERIAL. This restores SERIAL to being truly "just a macro"
consisting of component operations that can be stated explicitly in SQL.
Furthermore, the new command allows sequence ownership to be reassigned,
so that old mistakes can be cleaned up.
Also, downgrade the OWNED-BY dependency from INTERNAL to AUTO, since there
is no longer any very compelling argument why the sequence couldn't be
dropped while keeping the column. (This forces initdb, to be sure the
right kinds of dependencies are in there.)
Along the way, add checks to prevent ALTER OWNER or SET SCHEMA on an
owned sequence; you can now only do this indirectly by changing the
owning table's owner or schema. This is an oversight in previous
releases, but probably not worth back-patching.
the rel, it's easy to get rid of the narrow race-condition window that
used to exist in VACUUM and CLUSTER. Did some minor code-beautification
work in the same area, too.
cannot assume that there's exactly one Query in the Portal, as we can for
ONE_SELECT mode, because non-SELECT queries might have extra queries added
during rule rewrites. Fix things up so that we'll use ONE_RETURNING mode
when a Portal contains one primary (canSetTag) query and that query has
a RETURNING list. This appears to be a second showstopper reason for running
the Portal to completion before we start to hand anything back --- we want
to be sure that the rule-added queries get run too.
merely a matter of fixing the error check, since the underlying Portal
infrastructure already handles it. This in turn allows these statements
to be used in some existing plpgsql and plperl contexts, such as a
plpgsql FOR loop. Also, do some marginal code cleanup in places that
were being sloppy about distinguishing SELECT from SELECT INTO.
plpgsql support to come later. Along the way, convert execMain's
SELECT INTO support into a DestReceiver, in order to eliminate some ugly
special cases.
Jonah Harris and Tom Lane
o print user name for all
o print portal name if defined for all
o print query for all
o reduce log_statement header to single keyword
o print bind parameters as DETAIL if text mode
the DROP pass rather than the ADD_CONSTR pass. On examining the code I
think this was just an oversight rather than intentional, and it seems
to satisfy the principle of least surprise better than the alternative
solution that was discussed. Add an example to the ref page showing how
to do ALTER TYPE and update the default in one command. Per gripe from
Markus Bertheau that that wasn't possible.
(e.g. "INSERT ... VALUES (...), (...), ...") and elsewhere as allowed
by the spec. (e.g. similar to a FROM clause subselect). initdb required.
Joe Conway and Tom Lane.
(table or index) before trying to open its relcache entry. This fixes
race conditions in which someone else commits a change to the relation's
catalog entries while we are in process of doing relcache load. Problems
of that ilk have been reported sporadically for years, but it was not
really practical to fix until recently --- for instance, the recent
addition of WAL-log support for in-place updates helped.
Along the way, remove pg_am.amconcurrent: all AMs are now expected to support
concurrent update.
created in the bootstrap phase proper, rather than added after-the-fact
by initdb. This is cleaner than before because it allows us to retire the
undocumented ALTER TABLE ... CREATE TOAST TABLE command, but the real reason
I'm doing it is so that toast tables of shared catalogs will now have
predetermined OIDs. This will allow a reasonably clean solution to the
problem of locking tables before we load their relcache entries, to appear
in a forthcoming patch.
vacuums. This allows a OLTP-like system with big tables to continue
regular vacuuming on small-but-frequently-updated tables while the
big tables are being vacuumed.
Original patch from Hannu Krossing, rewritten by Tom Lane and updated
by me.
the opportunity to treat COUNT(*) as a zero-argument aggregate instead
of the old hack that equated it to COUNT(1); this is materially cleaner
(no more weird ANYOID cases) and ought to be at least a tiny bit faster.
Original patch by Sergey Koposov; review, documentation, simple regression
tests, pg_dump and psql support by moi.
a table. Otherwise a USING clause that yields NULL can leave the table
violating its constraint (possibly there are other cases too). Per report
from Alexander Pravking.
To this end, add a couple of columns to pg_class, relminxid and relvacuumxid,
based on which we calculate the pg_database columns after each vacuum.
We now force all databases to be vacuumed, even template ones. A backend
noticing too old a database (meaning pg_database.datminxid is in danger of
falling behind Xid wraparound) will signal the postmaster, which in turn will
start an autovacuum iteration to process the offending database. In principle
this is only there to cope with frozen (non-connectable) databases without
forcing users to set them to connectable, but it could force regular user
database to go through a database-wide vacuum at any time. Maybe we should
warn users about this somehow. Of course the real solution will be to use
autovacuum all the time ;-)
There are some additional improvements we could have in this area: for example
the vacuum code could be smarter about not updating pg_database for each table
when called by autovacuum, and do it only once the whole autovacuum iteration
is done.
I updated the system catalogs documentation, but I didn't modify the
maintenance section. Also having some regression tests for this would be nice
but it's not really a very straightforward thing to do.
Catalog version bumped due to system catalog changes.
discussion (including making def_arg allow reserved words), add missed
opt_definition for UNIQUE case. Put the reloptions support code in a less
random place (I chose to make a new file access/common/reloptions.c).
Eliminate header inclusion creep. Make the index options functions safely
user-callable (seems like client apps might like to be able to test validity
of options before trying to make an index). Reduce overhead for normal case
with no options by allowing rd_options to be NULL. Fix some unmaintainably
klugy code, including getting rid of Natts_pg_class_fixed at long last.
Some stylistic cleanup too, and pay attention to keeping comments in sync
with code.
Documentation still needs work, though I did fix the omissions in
catalogs.sgml and indexam.sgml.
the read lock we hold on the table's parent relation until commit.
Update equalfuncs.c for the new field in AlterTableCmd. Various
improvements to comments, variable names, and error reporting.
There is room for further improvement here, but this is at least
a step in the right direction.
Open items:
There were a few tangentially related issues that have come up that I think
are TODOs. I'm likely to tackle one or two of these next so I'm interested in
hearing feedback on them as well.
. Constraints currently do not know anything about inheritance. Tom suggested
adding a coninhcount and conislocal like attributes have to track their
inheritance status.
. Foreign key constraints currently do not get copied to new children (and
therefore my code doesn't verify them). I don't think it would be hard to
add them and treat them like CHECK constraints.
. No constraints at all are copied to tables defined with LIKE. That makes it
hard to use LIKE to define new partitions. The standard defines LIKE and
specifically says it does not copy constraints. But the standard already has
an option called INCLUDING DEFAULTS; we could always define a non-standard
extension LIKE table INCLUDING CONSTRAINTS that gives the user the option to
request a copy including constraints.
. Personally, I think the whole attislocal thing is bunk. The decision about
whether to drop a column from children tables or not is something that
should be up to the user and trying to DWIM based on whether there was ever
a local definition or the column was acquired purely through inheritance is
hardly ever going to match up with user expectations.
. And of course there's the whole unique and primary key constraint issue. I
think to get any traction at all on this you have a prerequisite of a real
partitioned table implementation where the system knows what the partition
key is so it can recognize when it's a leading part of an index key.
Greg Stark
tuples with less header overhead than a regular HeapTuple, per my
recent proposal. Teach TupleTableSlot code how to deal with these.
As proof of concept, change tuplestore.c to store MinimalTuples instead
of HeapTuples. Future patches will expand the concept to other places
where it is useful.
changing semantics too much. statement_timestamp is now set immediately
upon receipt of a client command message, and the various places that used
to do their own gettimeofday() calls to mark command startup are referenced
to that instead. I have also made stats_command_string use that same
value for pg_stat_activity.query_start for both the command itself and
its eventual replacement by <IDLE> or <idle in transaction>. There was
some debate about that, but no argument that seemed convincing enough to
justify an extra gettimeofday() call.
libpq/md5.h, so that there's a clear separation between backend-only
definitions and shared frontend/backend definitions. (Turns out this
is reversing a bad decision from some years ago...) Fix up references
to crypt.h as needed. I looked into moving the code into src/port, but
the headers in src/include/libpq are sufficiently intertwined that it
seems more work than it's worth to do that.
by creating a reference-count mechanism, similar to what we did a long time
ago for catcache entries. The back branches have an ugly solution involving
lots of extra copies, but this way is more efficient. Reference counting is
only applied to tupdescs that are actually in caches --- there seems no need
to use it for tupdescs that are generated in the executor, since they'll go
away during plan shutdown by virtue of being in the per-query memory context.
Neil Conway and Tom Lane
per-call overhead is quite significant, at least on Linux: whatever
it's doing is more than just shoving the bytes into a buffer. Buffering
the data so we can call fwrite() just once per row seems to be a win.
characters in all cases. Formerly we mostly just threw warnings for invalid
input, and failed to detect it at all if no encoding conversion was required.
The tighter check is needed to defend against SQL-injection attacks as per
CVE-2006-2313 (further details will be published after release). Embedded
zero (null) bytes will be rejected as well. The checks are applied during
input to the backend (receipt from client or COPY IN), so it no longer seems
necessary to check in textin() and related routines; any string arriving at
those functions will already have been validated. Conversion failure
reporting (for characters with no equivalent in the destination encoding)
has been cleaned up and made consistent while at it.
Also, fix a few longstanding errors in little-used encoding conversion
routines: win1251_to_iso, win866_to_iso, euc_tw_to_big5, euc_tw_to_mic,
mic_to_euc_tw were all broken to varying extents.
Patches by Tatsuo Ishii and Tom Lane. Thanks to Akio Ishida and Yasuo Ohgaki
for identifying the security issues.
(relpages/reltuples). To do this, create formal support in heapam.c for
"overwrite" tuple updates (including xlog replay capability) and use that
instead of the ad-hoc overwrites we'd been using in VACUUM and CREATE INDEX.
Take the responsibility for updating stats during CREATE INDEX out of the
individual index AMs, and do it where it belongs, in catalog/index.c. Aside
from being more modular, this avoids having to update the same tuple twice in
some paths through CREATE INDEX. It's probably not measurably faster, but
for sure it's a lot cleaner than before.
The former approach used ExclusiveLock on pg_database, which being a
cluster-wide lock meant only one of these operations could proceed at
a time; worse, it also blocked all incoming connections in ReverifyMyDatabase.
Now that we have LockSharedObject(), we can use locks of different types
applied to databases considered as objects. This allows much more
flexible management of the interlocking: two CREATE DATABASEs need not
block each other, and need not block connections except to the template
database being used. Similarly DROP DATABASE doesn't block unrelated
operations. The locking used in flatfiles.c is also much narrower in
scope than before. Per recent proposal.
in various places that were previously doing ad hoc pg_database searches.
This may speed up database-related privilege checks a little bit, but
the main motivation is to eliminate the performance reason for having
ReverifyMyDatabase do such a lot of stuff (viz, avoiding repeat scans
of pg_database during backend startup). The locking reason for having
that routine is about to go away, and it'd be good to have the option
to break it up.
This formulation requires every AM to provide amvacuumcleanup, unlike before,
but it's surely a whole lot cleaner. Also, add an 'amstorage' column to
pg_am so that we can get rid of hardwired knowledge in DefineOpClass().
not named ones, and replace linear searches of the list with array indexing.
The named-parameter support has been dead code for many years anyway,
and recent profiling suggests that the searching was costing a noticeable
amount of performance for complex queries.
CREATE AGGREGATE aggname (input_type) (parameter_list)
along with the old syntax where the input type was named in the parameter
list. This fits more naturally with the way that the aggregate is identified
in DROP AGGREGATE and other utility commands; furthermore it has a natural
extension to handle multiple-input aggregates, where the basetype-parameter
method would get ugly. In fact, this commit fixes the grammar and all the
utility commands to support multiple-input aggregates; but DefineAggregate
rejects it because the executor isn't fixed yet.
I didn't do anything about treating agg(*) as a zero-input aggregate instead
of artificially making it a one-input aggregate, but that should be considered
in combination with supporting multi-input aggregates.
when trying to locate the referent of a RECORD variable. This fixes the
'record type has not been registered' failure reported by Stefan
Kaltenbrunner about a month ago. A side effect of the way I chose to
fix it is that most variable references in join conditions will now be
properly labeled with the variable's source table name, instead of the
not-too-helpful 'outer' or 'inner' we used to use.
that apply the necessary domain constraint checks immediately. This fixes
cases where domain constraints went unchecked for statement parameters,
PL function local variables and results, etc. We can also eliminate existing
special cases for domains in places that had gotten it right, eg COPY.
Also, allow domains over domains (base of a domain is another domain type).
This almost worked before, but was disallowed because the original patch
hadn't gotten it quite right.
functions are not strict, they will be called (passing a NULL first parameter)
during any attempt to input a NULL value of their datatype. Currently, all
our input functions are strict and so this commit does not change any
behavior. However, this will make it possible to build domain input functions
that centralize checking of domain constraints, thereby closing numerous holes
in our domain support, as per previous discussion.
While at it, I took the opportunity to introduce convenience functions
InputFunctionCall, OutputFunctionCall, etc to use in code that calls I/O
functions. This eliminates a lot of grotty-looking casts, but the main
motivation is to make it easier to grep for these places if we ever need
to touch them again.
misleadingly-named WriteBuffer routine, and instead require routines that
change buffer pages to call MarkBufferDirty (which does exactly what it says).
We also require that they do so before calling XLogInsert; this takes care of
the synchronization requirement documented in SyncOneBuffer. Note that
because bufmgr takes the buffer content lock (in shared mode) while writing
out any buffer, it doesn't matter whether MarkBufferDirty is executed before
the buffer content change is complete, so long as the content change is
completed before releasing exclusive lock on the buffer. So it's OK to set
the dirtybit before we fill in the LSN.
This eliminates the former kluge of needing to set the dirtybit in LockBuffer.
Aside from making the code more transparent, we can also add some new
debugging assertions, in particular that the caller of MarkBufferDirty must
hold the buffer content lock, not merely a pin.
This commit doesn't make much functional change, but it does eliminate some
duplicated code --- for instance, PageIsNew tests are now done inside
XLogReadBuffer rather than by each caller.
The GIST xlog code still needs a lot of love, but I'll worry about that
separately.
have symlinks (ie, Windows). Although it'll never be called on to do anything
useful during normal operation on such a platform, it's still needed to
re-create dropped directories during WAL replay.
when an error occurs during xlog replay. Also, replace the former risky
'write into a fixed-size buffer with no overflow detection' API for XLOG
record description routines; use an expansible StringInfo instead. (The
latter accounts for most of the patch bulk.)
Qingqing Zhou
during parse analysis, not only errors detected in the flex/bison stages.
This is per my earlier proposal. This commit includes all the basic
infrastructure, but locations are only tracked and reported for errors
involving column references, function calls, and operators. More could
be done later but this seems like a good set to start with. I've also
moved the ReportSyntaxErrorPosition logic out of psql and into libpq,
which should make it available to more people --- even within psql this
is an improvement because warnings weren't handled by ReportSyntaxErrorPosition.
relations are still checked for permissions etc as soon as they are
opened. The original form of the patch could hold exclusive lock for a
long time on relations that the user doesn't even have permissions to
access, let alone truncate.
are unnecessarily allocated on the heap rather than the stack. If the
StringInfo doesn't outlive the stack frame in which it is created,
there is no need to allocate it on the heap via makeStringInfo() --
stack allocation is faster. While it's not a big deal unless the
code is in a critical path, I don't see a reason not to save a few
cycles -- using stack allocation is not less readable.
I also cleaned up a bit of code along the way: moved variable
declarations into a more tightly-enclosing scope where possible,
fixed some pointless copying of strings in dblink, etc.
creation of a shell type. This allows a less hacky way of dealing with
the mutual dependency between a datatype and its I/O functions: make a
shell type, then make the functions, then define the datatype fully.
We should fix pg_dump to handle things this way, but this commit just deals
with the backend.
Martijn van Oosterhout, with some corrections by Tom Lane.
bits indicating which optional capabilities can actually be exercised
at runtime. This will allow Sort and Material nodes, and perhaps later
other nodes, to avoid unnecessary overhead in common cases.
This commit just adds the infrastructure and arranges to pass the correct
flag values down to plan nodes; none of the actual optimizations are here
yet. I'm committing this separately in case anyone wants to measure the
added overhead. (It should be negligible.)
Simon Riggs and Tom Lane
id (CVE-2006-0553). Also fix related bug in SET SESSION AUTHORIZATION that
allows unprivileged users to crash the server, if it has been compiled with
Asserts enabled. The escalation-of-privilege risk exists only in 8.1.0-8.1.2.
However, the Assert-crash risk exists in all releases back to 7.3.
Thanks to Akio Ishida for reporting this problem.
comments on cluster global objects like databases, tablespaces, and
roles.
It touches a lot of places, but not much in the way of big changes. The
only design decision I made was to duplicate the query and manipulation
functions rather than to try and have them handle both shared and local
comments. I believe this is simpler for the code and not an issue for
callers because they know what type of object they are dealing with.
This has resulted in a shobj_description function analagous to
obj_description and backend functions [Create/Delete]SharedComments
mirroring the existing [Create/Delete]Comments functions.
pg_shdescription.h goes into src/include/catalog/
Kris Jurka
partial. None of the existing AMs do anything useful except counting
tuples when there's nothing to delete, and we can get a tuple count
from the heap as long as it's not a partial index. (hash actually can
skip anyway because it maintains a tuple count in the index metapage.)
GIST is not currently able to exploit this optimization because, due to
failure to index NULLs, GIST is always effectively partial. Possibly
we should fix that sometime.
Simon Riggs w/ some review by Tom Lane.
regardless of the current schema search path. Since CREATE OPERATOR CLASS
only allows one default opclass per datatype regardless of schemas, this
should have minimal impact, and it fixes problems with failure to find a
desired opclass while restoring dump files. Per discussion at
http://archives.postgresql.org/pgsql-hackers/2006-02/msg00284.php.
Remove now-redundant-or-unused code in typcache.c and namespace.c,
and backpatch as far as 8.0.
relations: fix the executor so that we can have an Append plan on the
inside of a nestloop and still pass down outer index keys to index scans
within the Append, then generate such plans as if they were regular
inner indexscans. This avoids the need to evaluate the outer relation
multiple times.
Continue to support GRANT ON [TABLE] for sequences for backward
compatibility; issue warning for invalid sequence permissions.
[Backward compatibility warning message.]
Add USAGE permission for sequences that allows only currval() and
nextval(), not setval().
Mention object name in grant/revoke warnings because of possible
multi-object operations.
occurs when it tries to heap_open pg_tablespace. When control returns to
smgrcreate, that routine will be holding a dangling pointer to a closed
SMgrRelation, resulting in mayhem. This is of course a consequence of
the violation of proper module layering inherent in having smgr.c call
a tablespace command routine, but the simplest fix seems to be to change
the locking mechanism. There's no real need for TablespaceCreateDbspace
to touch pg_tablespace at all --- it's only opening it as a way of locking
against a parallel DROP TABLESPACE command. A much better answer is to
create a special-purpose LWLock to interlock these two operations.
This drops TablespaceCreateDbspace quite a few layers down the food chain
and makes it something reasonably safe for smgr to call.
files: avoid creating stats hashtable entries for tables that aren't being
touched except by vacuum/analyze, ensure that entries for dropped tables are
removed promptly, and tweak the data layout to avoid storing useless struct
padding. Also improve the performance of pgstat_vacuum_tabstat(), and make
sure that autovacuum invokes it exactly once per autovac cycle rather than
multiple times or not at all. This should cure recent complaints about 8.1
showing much higher stats I/O volume than was seen in 8.0. It'd still be a
good idea to revisit the design with an eye to not re-writing the entire
stats dataset every half second ... but that would be too much to backpatch,
I fear.
cursors. Patch from Joachim Wieland, review and ediorialization by Neil
Conway. The view lists cursors defined by DECLARE CURSOR, using SPI, or
via the Bind message of the frontend/backend protocol. This means the
view does not list the unnamed portal or the portal created to implement
EXECUTE. Because we do list SPI portals, there might be more rows in
this view than you might expect if you are using SPI implicitly (e.g.
via a procedural language).
Per recent discussion on -hackers, the query string included in the
view for cursors defined by DECLARE CURSOR is based on
debug_query_string. That means it is not accurate if multiple queries
separated by semicolons are submitted as one query string. However,
there doesn't seem a trivial fix for that: debug_query_string
is better than nothing. I also changed SPI_cursor_open() to include
the source text for the portal it creates: AFAICS there is no reason
not to do this.
Update the documentation and regression tests, bump the catversion.
an array of regtype, rather than an array of OIDs. This is likely to
be more useful to user, and the type OID can easily be obtained by
casting a regtype value to OID. Per suggestion from Tom.
Update the documentation and regression tests, and bump the catversion.
permissions on the functions and operators contained in the opclass.
Since we already require superuser privilege to create an operator class,
there's no expansion-of-privilege hazard here, but if someone were to get
the idea of building an opclass containing functions that need security
restrictions, we'd better warn them off. Also, change the permission
checks from have-execute-privilege to have-ownership, and then comment
them all out since they're dead code anyway under the superuser restriction.
type definition. Because use of a type's I/O conversion functions isn't
access-checked, CREATE TYPE amounts to granting public execute permissions
on the functions, and so allowing it to anybody means that someone could
theoretically gain access to a function he's not supposed to be able to
execute. The parameter-type restrictions already enforced by CREATE TYPE
make it fairly unlikely that this oversight is meaningful in practice,
but still it seems like a good idea to plug the hole going forward.
Also, document the implicit grant just in case anybody gets the idea of
building I/O functions that might need security restrictions.
our own command (or more generally, xmin = our xact and cmin >= current
command ID) should not be seen as good. Else we may try to update rows
we already updated. This error was inserted last August while fixing the
even bigger problem that the old coding wouldn't see *any* tuples inserted
by our own transaction as good. Per report from Euler Taveira de Oliveira.
access information about the prepared statements that are available
in the current session. Original patch from Joachim Wieland, various
improvements by Neil Conway.
The "statement" column of the view contains the literal query string
sent by the client, without any rewriting or pretty printing. This
means that prepared statements created via SQL will be prefixed with
"PREPARE ... AS ", whereas those prepared via the FE/BE protocol will
not. That is unfortunate, but discussion on -patches did not yield an
efficient way to improve this, and there is some merit in returning
exactly what the client sent to the backend.
Catalog version bumped, regression tests updated.
if (c == '\\' && cstate->line_buf.len == 0)
The problem with that is the because of the input and _output_
buffering, cstate->line_buf.len could be zero even if we are not on the
first character of a line. In fact, for a typical line, it is zero for
all characters on the line. The proper solution is to introduce a
boolean, first_char_in_line, that we set as we enter the loop and clear
once we process a character.
I have restructured the line-reading code in copy.c by:
o merging the CSV/non-CSV functions into a single function
o used macros to centralize and clarify the buffering code
o updated comments
o renamed client_encoding_only to encoding_embeds_ascii
o added a high-bit test to the encoding_embeds_ascii test for
performance
o in CSV mode, allow a backslash followed by a non-period to
continue being processed as a data value
There should be no performance impact from this patch because it is
functionally equivalent. If you apply the patch you will see copy.c is
much clearer in this area now and might suggest additional
optimizations.
I have also attached a 8.1-only patch to fix the CSV \. handling bug
with no code restructuring.
messages, when client attempts to execute these outside a transaction (start
one) or in a failed transaction (reject message, except for COMMIT/ROLLBACK
statements which we can handle). Per report from Francisco Figueiredo Jr.
if we already have a stronger lock due to the index's table being the
update target table of the query. Same optimization I applied earlier
at the table level. There doesn't seem to be much interest in the more
radical idea of not locking indexes at all, so do what we can ...
"ctid IN (list)" will still work after we convert IN to ScalarArrayOpExpr.
Make some minor efficiency improvements while at it, such as ensuring that
multiple TIDs are fetched in physical heap order. And fix EXPLAIN so that
it shows what's really going on for a TID scan.
comment line where output as too long, and update typedefs for /lib
directory. Also fix case where identifiers were used as variable names
in the backend, but as typedefs in ecpg (favor the backend for
indenting).
Backpatch to 8.1.X.
process of dropping roles by dropping objects owned by them and privileges
granted to them, or giving the owned objects to someone else, through the
use of the data stored in the new pg_shdepend catalog.
Some refactoring of the GRANT/REVOKE code was needed, as well as ALTER OWNER
code. Further cleanup of code duplication in the GRANT code seems necessary.
Implemented by me after an idea from Tom Lane, who also provided various kind
of implementation advice.
Regression tests pass. Some tests for the new functionality are also added,
as well as rudimentary documentation.
create circularity of role memberships. This is a minimum-impact fix
for the problem reported by Florian Pflug. I thought about removing
the superuser_arg test from is_member_of_role() altogether, as it seems
redundant for many of the callers --- but not all, and it's way too late
in the 8.1 cycle to be making large changes. Perhaps reconsider this
later.
properly advancing the CommandCounter between multiple sub-queries
generated by rules, we forgot to update the snapshot being used, so
that the successive sub-queries didn't actually see each others'
results. This is still not *exactly* like the semantics of normal
execution of the same queries, in that we don't take new transaction
snapshots and hence don't see changes from concurrently committed
commands, but I think that's OK and probably even preferable for
EXPLAIN ANALYZE.
ie removing shared-dependency entries, should happen before non-rollbackable
ones. That way a failure during the rollbackable part doesn't leave us
with inconsistent state.
argument as a 'regclass' value instead of a text string. The frontend
conversion of text string to pg_class OID is now encapsulated as an
implicitly-invocable coercion from text to regclass. This provides
backwards compatibility to the old behavior when the sequence argument
is explicitly typed as 'text'. When the argument is just an unadorned
literal string, it will be taken as 'regclass', which means that the
stored representation will be an OID. This solves longstanding problems
with renaming sequences that are referenced in default expressions, as
well as new-in-8.1 problems with renaming such sequences' schemas or
moving them to another schema. All per recent discussion.
Along the way, fix some rather serious problems in dbmirror's support
for mirroring sequence operations (int4 vs int8 confusion for instance).
in which invalid page data could be transiently written to disk by
concurrent bgwriter activity. There doesn't seem any risk of loss of
actual user data, but an empty page could possibly be left corrupt if a
crash occurs before the correct data gets written out. Pointed out by
Alvaro Herrera.
for procedural languages. This replaces the hard-wired table I had
originally proposed as a stopgap solution. For the moment, the initial
contents only include languages shipped with the core distribution.
as per my recent proposal. For now the template data is hard-wired in
proclang.c --- this should be replaced later by a new shared system
catalog, but we don't want to force initdb during 8.1 beta. This change
lets us cleanly load existing dump files even if they contain outright
wrong information about a PL's support functions, such as a wrong path
to the shared library or a missing validator function. Also, we can
revert the recent kluges to make pg_dump dump PL support functions that
are stored in pg_catalog.
While at it, I removed the code in pg_regress that replaced $libdir
with a hardcoded path for temporary installations. This is no longer
needed given our support for relocatable installations.
on a page, as suggested by ITAGAKI Takahiro. Also, change a few places
that were using some other estimates of max-items-per-page to consistently
use MaxOffsetNumber. This is conservatively large --- we could have used
the new MaxHeapTuplesPerPage macro, or a similar one for index tuples ---
but those places are simply declaring a fixed-size buffer and assuming it
will work, rather than actively testing for overrun. It seems safer to
size these buffers in a way that can't overflow even if the page is
corrupt.
the parent table, even if the command that creates them is executed by
someone else (such as a superuser or a member of the owning role).
Per gripe from Michael Fuhr.
use these instead of its previous hack of changing pg_class.reltriggers.
Documentation is lacking, will add that later.
Patch by Satoshi Nagayasu, review and some extra work by Tom Lane.
erroring out as it has done for the last couple weeks. Document that this
form is now ignored because indexes can't usefully have different owners
from their parent tables. Fix pg_dump to not generate ALTER OWNER commands
for indexes.
discussion of getting around this by relaxing the checks made for regular
users, but I'm disinclined to toy with the security model right now,
so just special-case it for superusers where needed.
to 'Size' (that is, size_t), and install overflow detection checks in it.
This allows us to remove the former arbitrary restrictions on NBuffers
etc. It won't make any difference in a 32-bit machine, but in a 64-bit
machine you could theoretically have terabytes of shared buffers.
(How efficiently we could manage 'em remains to be seen.) Similarly,
num_temp_buffers, work_mem, and maintenance_work_mem can be set above
2Gb on a 64-bit machine. Original patch from Koichi Suzuki, additional
work by moi.
insufficient paranoia in code that follows t_ctid links. (We must do both
because even with VACUUM doing it properly, the intermediate state with
a dangling t_ctid link is visible concurrently during lazy VACUUM, and
could be seen afterwards if either type of VACUUM crashes partway through.)
Also try to improve documentation about what's going on. Patch is a bit
bulky because passing the XMAX information around required changing the
APIs of some low-level heapam.c routines, but it's not conceptually very
complicated. Per trouble report from Teodor and subsequent analysis.
This needs to be back-patched, but I'll do that after 8.1 beta is out.
whenever we generate a new OID. This prevents occasional duplicate-OID
errors that can otherwise occur once the OID counter has wrapped around.
Duplicate relfilenode values are also checked for when creating new
physical files. Per my recent proposal.
character, tighten the inner loops of CopyReadLine and CopyReadAttribute,
arrange to parse out all the attributes of a line in just one call instead
of one CopyReadAttribute call per attribute, be smarter about which client
encodings require slow pg_encoding_mblen() loops. Also, clean up the
mishmash of static variables and overly-long parameter lists in favor of
passing around a single CopyState struct containing all the state data.
Original patch by Alon Goldshuv, reworked by Tom Lane.
This was not especially critical before, but it is now that we track
ownership dependencies --- the dependency for the rowtype *must* shift
to the new owner. Spotted by Bernd Helmle.
Also fix a problem introduced by recent change to allow non-superusers
to do ALTER OWNER in some cases: if the table had a toast table, ALTER
OWNER failed *even for superusers*, because the test being applied would
conclude that the new would-be owner had no create rights on pg_toast.
A side-effect of the fix is to disallow changing the ownership of indexes
or toast tables separately from their parent table, which seems a good
idea on the whole.
of special case for Windows port. Put a PG_TRY around most of createdb()
to ensure that we remove copied subdirectories on failure, even if the
failure happens while creating the pg_database row. (I think this explains
Oliver Siegmar's recent report.) Having done that, there's no need for
the fragile assumption that copydir() mustn't ereport(ERROR), so simplify
its API. Eliminate the old code that used system("cp ...") to copy
subdirectories, in favor of using copydir() on all platforms. This not
only should allow much better error reporting, but allows us to fsync
the created files before trusting that the copy has succeeded.
track shared relations in a separate hashtable, so that operations done
from different databases are counted correctly. Add proper support for
anti-XID-wraparound vacuuming, even in databases that are never connected
to and so have no stats entries. Miscellaneous other bug fixes.
Alvaro Herrera, some additional fixes by Tom Lane.