copies?) to ensure they really don't run proc_exit/shmem_exit callbacks,
as was intended. I broke this behavior recently by installing atexit
callbacks without thinking about the one case where we truly don't want
to run those callback functions. Noted in an example from Dave Page.
is run at the end of archive recovery, providing a chance to do external
cleanup. Modify pg_standby so that it no longer removes the trigger file;
that is now to be done using recovery_end_command.
Provide a "smart" failover mode in pg_standby, where we don't fail over
immediately, but only after recovering all unapplied WAL from the archive.
That gives you zero data loss assuming all WAL was archived before
failover, which is what most users of pg_standby actually want.
recovery_end_command by Simon Riggs, pg_standby changes by Fujii Masao and
myself.
that aren't RELKIND_RELATION or RELKIND_VIEW, and to disallow attaching rules
to system relations unless allowSystemTableMods is on. This is to make the
behavior of CREATE RULE more like CREATE TRIGGER, which disallows the
comparable cases. Per discussion of bug #4808.
redirecting libxml's allocations into a Postgres context. Instead, just let
it use malloc directly, and add PG_TRY blocks as needed to be sure we release
libxml data structures in error recovery code paths. This is ugly but seems
much more likely to play nicely with third-party uses of libxml, as seen in
recent trouble reports about using Perl XML facilities in pl/perl and bug
#4774 about contrib/xml2.
I left the code for allocation redirection in place, but it's only
built/used if you #define USE_LIBXMLCONTEXT. This is because I found it
useful to corral libxml's allocations in a palloc context when hunting
for libxml memory leaks, and we're surely going to have more of those
in the future with this type of approach. But we don't want it turned on
in a normal build because it breaks exactly what we need to fix.
I have not re-indented most of the code sections that are now wrapped
by PG_TRY(); that's for ease of review. pg_indent will fix it.
This is a pre-existing bug in 8.3, but I don't dare back-patch this change
until it's gotten a reasonable amount of field testing.
xml_parse, all arising from the same sloppy usage of parse_xml_decl.
The original coding had that function returning its output string
parameters in the libxml context, which is long-lived, and all but one
of its callers neglected to free the strings afterwards. The easiest
and most bulletproof fix is to return the strings in the local palloc
context instead, since that's short-lived. This was only costing a
dozen or two bytes per function call, but that adds up fast if the
function is called repeatedly ...
Noted while poking at the more general problem of what to do with our
libxml memory allocation hooks. Back-patch to 8.3, which has the
identical coding.
errors when tables are concurrently dropped. To do this we must take lock
on each relation before we check its privileges. The old code was trying
to do that the other way around, which is a bit pointless when there are lots
of other commands that lock relations before checking privileges. I did keep
it checking each relation's privilege before locking the next relation, which
is a detail that ALTER TABLE isn't too picky about.
ability to lock relations as they scan pg_inherits, and to ignore any
relations that have disappeared by the time we get lock on them. This
makes uses of these functions safe against concurrent DROP operations
on child tables: we will effectively ignore any just-dropped child,
rather than possibly throwing an error as in recent bug report from
Thomas Johansson (and similar past complaints). The behavior should
not change otherwise, since the code was acquiring those same locks
anyway, just a little bit later.
An exception is LockTableCommand(), which is still behaving unsafely;
but that seems to require some more discussion before we change it.
find_inheritance_children() and find_all_inheritors(). I got annoyed that
these are buried inside the planner but mostly used elsewhere. So, create
a new file catalog/pg_inherits.c and put them there, along with a couple
of other functions that search pg_inherits.
The code that modifies pg_inherits is (still) in tablecmds.c --- it's
kind of entangled with unrelated code that modifies pg_depend and other
stuff, so pulling it out seemed like a bigger change than I wanted to make
right now. But this file provides a natural home for it if anyone ever
gets around to that.
This commit just moves code around; it doesn't change anything, except
I succumbed to the temptation to make a couple of trivial optimizations
in typeInheritsFrom().
of AND/OR clause branches that predtest.c would attempt to deal with. As
noted in bug #4721, that change disabled proof attempts for sizes of problems
that people are actually expecting it to work for. The original complaint
it was trying to solve was O(N^2) behavior for long IN-lists, so let's try
applying the limit to just ScalarArrayOpExprs rather than everything.
Another case of "foolish consistency" I fear.
Back-patch to 8.2, same as the previous patch was.
predicate_refuted_by: if either top-level input is a single-element list,
reduce it to its lone member before proceeding. This avoids
a useless level of AND-recursion within the recursive proof routines.
It's worth doing because, for example, if the clause is a 100-element
list and the predicate is a 1-element list then we'd otherwise strip
the predicate's list structure 100 times as we iterate through the clause.
It's only needed at top level because there won't be any trivial ANDs below
that --- this situation is an artifact of the decision to represent even
single-item conditions as Lists in the "implicit AND" format, and that format
is only used at the top level of any predicate or restriction condition.
joins a bit better, ie, understand the differing cost functions for matched
and unmatched outer tuples. There is more that could be done in cost_hashjoin
but this already helps a great deal. Per discussions with Robert Haas.
a toast table to be built, even if the sum-of-column-widths calculation
indicates one isn't needed. This is needed by pg_migrator because if the
old table has a toast table, we have to migrate over the toast table since
it might contain some live data, even though subsequent column drops could
mean that no recently-added rows could require toasting.
restrictions specified for semijoins in optimizer/README, to wit that
you can't reassociate outer joins into or out of the RHS of a semijoin.
Per report from Heikki.
you can end up with an unrecoverable backup if you start a new base backup
right after finishing archive recovery. In that scenario, the redo pointer of
the checkpoint that pg_start_backup() writes points to the XLOG segment where
the timeline-changing end-of-archive-recovery checkpoint is. The beginning
of that segment contains pages with the old timeline ID, and we don't accept
that in recovery unless we find a history file covering the old timeline ID.
If you omit pg_xlog from the base backup and clear the archive directory
before starting the backup, there will be no such history file available.
The bug is present in all versions since PITR was introduced in 8.0, but I'm
back-patching only back to 8.2. Earlier versions didn't have XLOG switch
records, making this fix unfeasible. Given the lack of reports until now,
it doesn't seem worthwhile to spend more effort to fix 8.0 and 8.1.
Per report and suggestion by Mikael Krantz.
can be pushed to the top of the join tree, we update both the relids and
qualscope variables to keep them in sync. This prevents a possible later
failure of an Assert clause, and affects nothing else since qualscope isn't
used later except for that Assert. At the moment the Assert shouldn't be
reachable when we've pushed the qual up; but this is cheap insurance, and
it's more sensible anyway in terms of the overall logic of the routine.
Per analysis of a bug report from Stefan Huehner.
I'm not back-patching this since it's just future-proofing; but if anyone
gets tempted to change check_outerjoin_delay again in the back branches,
this might be needed.
must be used for the new database, except when copying from template0.
This is the same rule that we now enforce for locale settings, and it has
the same motivation: databases other than template0 might contain data that
would be invalid according to a different setting. This represents another
step in a continuing process of locking down ways in which encoding violations
could occur inside the backend. Per discussion of a few days ago.
In passing, fix pre-existing breakage of mbregress.sh, and fix up a couple
of ereport() calls in dbcommands.c that failed to specify sqlstate codes.
will still be performed if something in a backend process calls exit()
directly, instead of going through proc_exit() as we prefer. This is a second
response to the issue that we might load third-party code that doesn't know it
should not call exit(). Such a call will now cause a reasonably graceful
backend shutdown, if possible. (Of course, if the reason for the exit() call
is out-of-memory or some such, we might not be able to recover, but at least
we will try.)
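
A minimal sketch of the general idea, not the actual backend code: register
the cleanup routine with atexit() so that it still runs when something calls
exit() directly, and make it idempotent so the preferred explicit path does
not run it twice. All names here (run_cleanup, install_exit_hook,
graceful_exit) are hypothetical. Note that _exit() bypasses atexit() handlers
entirely.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the process's shutdown bookkeeping. */
bool cleanup_done = false;
bool cleanup_registered = false;

/* Idempotent cleanup: safe to reach both explicitly and via atexit(). */
void
run_cleanup(void)
{
    if (cleanup_done)
        return;                 /* already ran via the preferred path */
    cleanup_done = true;
    /* ... release shared resources here ... */
}

bool
cleanup_has_run(void)
{
    return cleanup_done;
}

/*
 * Register the hook once, early in process startup.
 * Returns true only when a new registration was made.
 */
bool
install_exit_hook(void)
{
    if (cleanup_registered)
        return false;
    if (atexit(run_cleanup) != 0)
        return false;
    cleanup_registered = true;
    return true;
}

/* Preferred exit path: clean up explicitly, then exit. */
void
graceful_exit(int code)
{
    run_cleanup();
    exit(code);                 /* atexit() hook fires but finds nothing to do */
}
```

With this arrangement a stray exit() in loaded code reaches run_cleanup()
through the atexit() hook, while graceful_exit() remains the normal route.
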
a backend has done exit(0) or exit(1) without having disengaged itself
from shared memory. We are at risk for this whenever third-party code is
loaded into a backend, since such code might not know it's supposed to go
through proc_exit() instead. Also, it is reported that under Windows
there are ways to externally kill a process that cause the status code
returned to the postmaster to be indistinguishable from a voluntary exit
(thank you, Microsoft). If this does happen then the system is probably
hosed --- for instance, the dead session might still be holding locks.
So the best recovery method is to treat this like a backend crash.
The dead man switch is armed for a particular child process when it
acquires a regular PGPROC, and disarmed when the PGPROC is released;
these should be the first and last touches of shared memory resources
in a backend, or close enough anyway. This choice means there is no
coverage for auxiliary processes, but I doubt we need that, since they
shouldn't be executing any user-provided code anyway.
This patch also improves the management of the EXEC_BACKEND
ShmemBackendArray array a bit, by reducing search costs.
Although this problem is of long standing, the lack of field complaints
seems to mean it's not critical enough to risk back-patching; at least
not till we get some more testing of this mechanism.
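
The arm/disarm logic described above can be sketched roughly as follows.
This is an illustration under assumptions, not the postmaster's code: the
slot array, its size, and every function name are hypothetical, and treating
exit statuses other than 0 and 1 as crashes is an assumption of the sketch.

```c
#include <stdbool.h>

#define MAX_CHILDREN 8          /* hypothetical number of child slots */

/* One flag per child slot; a real server would keep this in shared memory. */
bool dead_man_switch[MAX_CHILDREN];

/* Called when the child acquires its regular per-process slot. */
void
arm_switch(int slot)
{
    dead_man_switch[slot] = true;
}

/* Called when the child releases that slot during orderly shutdown. */
void
disarm_switch(int slot)
{
    dead_man_switch[slot] = false;
}

typedef enum
{
    CHILD_EXIT_CLEAN,
    CHILD_EXIT_CRASH
} ChildExitKind;

/*
 * Parent-side check after reaping a child.  An apparently voluntary exit
 * (status 0 or 1) with the switch still armed is treated like a crash.
 */
ChildExitKind
classify_child_exit(int slot, int exit_status)
{
    bool        still_armed = dead_man_switch[slot];

    dead_man_switch[slot] = false;      /* slot may now be reused */

    if (exit_status != 0 && exit_status != 1)
        return CHILD_EXIT_CRASH;        /* assumption: other codes are crashes */
    return still_armed ? CHILD_EXIT_CRASH : CHILD_EXIT_CLEAN;
}
```
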
points where we step right or left to the next page. This should ensure
reasonable response time to a query cancel request during an unsuccessful
index scan, as seen in recent gripe from Marc Cousin. It's a bit trickier
than it might seem at first glance, because CHECK_FOR_INTERRUPTS() is a no-op
if executed while holding a buffer lock. So we have to do it just at the
point where we've dropped one page lock and not yet acquired the next.
Remove CHECK_FOR_INTERRUPTS calls at the top level of btgetbitmap and
hashgetbitmap, since they're pointless given the added checks.
I think that GIST is okay already --- at least, there's a CHECK_FOR_INTERRUPTS
at a plausible-looking place in gistnext(). I don't claim to know GIN well
enough to try to poke it for this, if indeed it has a problem at all.
This is a pre-existing issue, but in view of the lack of prior complaints
I'm not going to risk back-patching.
ANALYZE's total sample. The original coding is at risk of overflow for
statistics targets exceeding about 2675; this was not a problem before
8.4 but it is now. Per bug #4793 from Dennis Noordsij.
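
The message does not show the expression involved. As a sketch only: if one
assumes a sample of 300 rows per unit of statistics target and a 32-bit
product of a position index (up to the target) with the sample size, the
quoted threshold of about 2675 falls out of the arithmetic. Both assumptions
and all names below are illustrative; widening the intermediate to 64 bits is
one generic remedy, not necessarily the one used in the patch.

```c
#include <stdint.h>

/*
 * Position of the i'th of num_hist boundaries among nvals sorted sample
 * values, computed with a 64-bit intermediate so the product cannot wrap.
 */
int
hist_position(int i, int nvals, int num_hist)
{
    int64_t     pos = ((int64_t) i * (nvals - 1)) / (num_hist - 1);

    return (int) pos;
}

/* Would the product i * (nvals - 1) have fit in a signed 32-bit int? */
int
product_fits_int32(int i, int nvals)
{
    return (int64_t) i * (nvals - 1) <= INT32_MAX;
}
```

Under those assumptions, product_fits_int32(2675, 300 * 2675) holds while
product_fits_int32(2676, 300 * 2676) does not.
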
it fails because the shared memory segment already exists. This
means it can take up to 10 seconds before it reports the error
if it *does* exist, but hopefully it will make the system capable
of restarting even when the server is under high load.
to make sure that the error code is reset, as a precaution in
case the API doesn't properly reset it on success. This could
be necessary, since in specific success cases we check the error value
even when the function doesn't fail.
error message if the installation directory layout is messed up (or at least,
something more useful than the behavior exhibited in bug #4787). During
postmaster startup, check that get_pkglib_path resolves as a readable
directory; and if ParseTzFile() fails to open the expected timezone
abbreviation file, check the possibility that the directory is missing rather
than just the specified file. In case of either failure, issue a hint
suggesting that the installation is broken. These two checks cover the lib/
and share/ trees of a full installation, which should take care of most
scenarios where a sysadmin decides to get cute.
part that rounds up to exactly 1.0 second. The previous coding rejected input
like "00:12:57.9999999999999999999999999999", with the exact number of nines
needed to cause failure varying depending on float-timestamp option and
possibly on platform. Obviously this should round up to the next integral
second, if we don't have enough precision to distinguish the value from that.
Per bug #4789 from Robert Kruus.
In passing, fix a missed check for fractional seconds in one copy of the
"is it greater than 24:00:00" code.
Broken all the way back, so patch all the way back.
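
A rough illustration of the carry, assuming microsecond resolution and a
non-negative fraction: when the fractional part rounds to a full second, bump
the integral seconds instead of rejecting the input. The SecParts type and
round_seconds() are hypothetical names for this sketch, not the datetime
code's own.

```c
#include <stdint.h>

#define USECS_PER_SEC 1000000

/* Hypothetical result type: whole seconds plus microseconds. */
typedef struct
{
    int         sec;
    int32_t     usec;
} SecParts;

/*
 * Round a non-negative fractional second to microseconds.  If rounding
 * reaches a full second, carry into sec rather than failing.
 */
SecParts
round_seconds(int sec, double frac)
{
    SecParts    r;

    r.sec = sec;
    r.usec = (int32_t) (frac * USECS_PER_SEC + 0.5);
    if (r.usec >= USECS_PER_SEC)
    {
        r.sec += 1;
        r.usec -= USECS_PER_SEC;
    }
    return r;
}
```

A caller would still need to carry a resulting sec of 60 into the minutes
field; that step is left out of the sketch.
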
PlaceHolderVar nodes in join quals appearing in or below the lowest
outer join that could null the subquery being pulled up. This improves
the planner's ability to recognize constant join quals, and probably
helps with detection of common sort keys (equivalence classes) as well.
aggregate function. By definition, such a sub-SELECT cannot reference any
variables of query levels between itself and the aggregate's semantic level
(else the aggregate would've been assigned to that lower level instead).
So the correct, most efficient implementation is to treat the sub-SELECT as
being a sub-select of that outer query level, not the level the aggregate
syntactically appears in. Not doing so also confuses the heck out of our
parameter-passing logic, as illustrated in bug report from Daniel Grace.
Fortunately, we were already copying the whole Aggref expression up to the
outer query level, so all that's needed is to delay SS_process_sublinks
processing of the sub-SELECT until control returns to the outer level.
This has been broken since we introduced spec-compliant treatment of
outer aggregates in 7.4; so patch all the way back.
any negative or positive number, not just -1 or 1. Fix comment on
varstr_cmp and citext test case accordingly.
As pointed out by Zdenek Kotala, and buildfarm member gothic moth.
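
For illustration, a caller that wants to stay correct under this contract
tests only the sign of a comparison result, or normalizes it first. The
helper names are hypothetical; strcmp() is used merely as a familiar
comparator with the same "any negative or positive value" contract.

```c
#include <string.h>

/* Reduce any comparator result to exactly -1, 0 or 1. */
int
cmp_sign(int c)
{
    return (c > 0) - (c < 0);
}

/* Test the sign, never equality with -1 or 1. */
int
is_less(const char *a, const char *b)
{
    return strcmp(a, b) < 0;
}
```
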
documentation warnings against setting it nonzero unless active use of
prepared transactions is intended and a suitable transaction manager has been
installed. This should help to prevent the type of scenario we've seen
several times now where a prepared transaction is forgotten and eventually
causes severe maintenance problems (or even anti-wraparound shutdown).
The only real reason we had the default be nonzero in the first place was to
support regression testing of the feature. To still be able to do that,
tweak pg_regress to force a nonzero value during "make check". Since we
cannot force a nonzero value in "make installcheck", add a variant regression
test "expected" file that shows the results that will be obtained when
max_prepared_transactions is zero.
Also, extend the HINT messages for transaction wraparound warnings to mention
the possibility that old prepared transactions are causing the problem.
All per today's discussion.
using the system functions all the time. (These files are now just copies
of the osf.* files.) The homebrew functions were not getting used anyway
on AIX versions that have dlopen(), that is 4.3 and up, so they are not
needed on any AIX that is even remotely supported by the vendor anymore.
We'd have probably left them here anyway, except some questions were
raised about the copyright.
fact that this is breaking the MSVC build, it's probably not really a good
idea to expand the dependencies of gram.h any further than the core parser;
for instance the value of SCONST might depend on which bison version you'd
built with. Better to expose an additional call point in parser.c, so
move what I had put into pl_funcs.c into parser.c. Also PGDLLIMPORT'ify
the reference to standard_conforming_strings, per buildfarm results.
Stefan Kaltenbrunner. The most reasonable behavior (at least for the near
term) seems to be to ignore the PlaceHolderVar and examine its argument
instead. In support of this, change the API of pull_var_clause() to allow
callers to request recursion into PlaceHolderVars. Currently
estimate_num_groups() is the only customer for that behavior, but where
there's one there may be others.
constants through full joins, as in
    select * from tenk1 a full join tenk1 b using (unique1)
    where unique1 = 42;
which should generate a fairly cheap plan where we apply the constraint
unique1 = 42 in each relation scan. This had been broken by my patch of
2008-06-27, which is now reverted in favor of a more invasive but hopefully
less incorrect approach. That patch was meant to prevent incorrect extraction
of OR'd indexclauses from OR conditions above an outer join. To do that
correctly we need more information than the outerjoin_delay flag can provide,
so add a nullable_relids field to RestrictInfo that records exactly which
relations are nulled by outer joins that are underneath a particular qual
clause. A side benefit is that we can make the test in create_or_index_quals
more specific: it is now smart enough to extract an OR'd indexclause into the
outer side of an outer join, even though it must not do so in the inner side.
The old coding couldn't distinguish these cases so it could not do either.
select u&42 from table-with-a-u-column;
Also fix missing SET_YYLLOC() in the {dolqfailed} production that I suppose
this was based on. The latter is a pre-existing bug, but the only effect
is to misplace the error cursor by one token, so probably not worth
backpatching.
how this ought to behave for multi-dimensional arrays. Per discussion,
not having it at all seems better than having it with what might prove
to be the wrong behavior. We can always add it later when we have consensus
on the correct behavior.
already did that on Windows, but it's needed on other platforms too when
LC_CTYPE=C. With other locales, we enforce (or trust) that the codeset of
the locale matches the server encoding so we don't need to bind it
explicitly. It should do no harm in that case either, but I don't have
full faith in the PG encoding -> OS codeset mapping table yet. Per recent
discussion on pgsql-hackers.
the checkpoint in immediate or lazy mode. This is to address complaints
that pg_start_backup() takes a long time even when there's no need to minimize
its I/O consumption.
alias for array_length(v,1). The efficiency gain here is doubtless
negligible --- what I'm interested in is making sure that if we have
second thoughts about the definition, we will not have to force a
post-beta initdb to change the implementation.
of discovery, rather than reverse order. This doesn't matter functionally
(I suppose the previous coding dates from the time when lcons was markedly
cheaper than lappend). However now that EXPLAIN is labeling subplans with
IDs that are based on order of creation, this may help produce a slightly
less surprising printout.
are individually labeled, rather than just grouped under an "InitPlan"
or "SubPlan" heading. This in turn makes it possible for decompilation of
a subplan reference to usefully identify which subplan it's referencing.
I also made InitPlans identify which parameter symbol(s) they compute,
so that references to those parameters elsewhere in the plan tree can
be connected to the initplan that will be executed. Per a gripe from
Robert Haas about EXPLAIN output of a WITH query being inadequate,
plus some longstanding pet peeves of my own.
are using our own ports of getopt or getopt_long, those will define
the variable for themselves; and if not, we don't need these, because
we never touch the variable anyway.
of adding optional namespace and action fields to DefElem. Having three
node types that do essentially the same thing bloats the code and leads
to errors of confusion, such as in yesterday's bug report from Khee Chin.
when we are waiting for old snapshots to go away during a concurrent index
build. In particular, this rule lets us avoid waiting for
idle-in-transaction sessions.
This logic could be improved further if we had some way to wake up when
the session we are currently waiting for goes idle-in-transaction. However
that would be a significantly more complex/invasive patch, so it'll have to
wait for some other day.
Simon Riggs, with some improvements by Tom.
interval_eq() considers equal. I'm not sure how that fundamental requirement
escaped us through multiple revisions of this hash function, but there it is;
it's been wrong since interval_hash was first written for PG 7.1.
Per bug #4748 from Roman Kononov.
Backpatch to all supported releases.
This patch changes the contents of hash indexes for interval columns. That's
no particular problem for PG 8.4, since we've broken on-disk compatibility
of hash indexes already; but it will require a migration warning note in
the next minor releases of all existing branches: "if you have any hash
indexes on columns of type interval, REINDEX them after updating".
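
The requirement can be sketched like this: derive the hash from the same
normalized quantity the equality test compares, so that equal values cannot
hash differently. The sketch assumes equality treats a month as 30 days and a
day as 24 hours; the struct layout, function names, and the bit-mixing step
are illustrative and are not PostgreSQL's actual hash function.

```c
#include <stdbool.h>
#include <stdint.h>

#define USECS_PER_DAY  INT64_C(86400000000)
#define DAYS_PER_MONTH 30       /* assumption used for comparison purposes */

/* Hypothetical interval layout: time is in microseconds. */
typedef struct
{
    int64_t     time;
    int32_t     day;
    int32_t     month;
} Interval;

/* Collapse the three fields into one comparable span in microseconds. */
int64_t
interval_span(const Interval *iv)
{
    int64_t     days = (int64_t) iv->month * DAYS_PER_MONTH + iv->day;

    return iv->time + days * USECS_PER_DAY;
}

bool
interval_equal(const Interval *a, const Interval *b)
{
    return interval_span(a) == interval_span(b);
}

/* Hash the span, not the raw fields, so equal intervals hash equally. */
uint32_t
interval_hash_sketch(const Interval *iv)
{
    uint64_t    x = (uint64_t) interval_span(iv);

    x ^= x >> 33;               /* arbitrary mixing, stand-in only */
    x *= UINT64_C(0xff51afd7ed558ccd);
    x ^= x >> 33;
    return (uint32_t) x;
}
```

Hashing the raw fields instead would give "1 month" and "30 days" different
hash values even though the equality function above reports them equal.
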
To implement this without almost duplicating the reloption table, treat
relopt_kind as a bitmask instead of an integer value. This decreases the
range of allowed values, but it's not clear that there's a need for that many
values anyway.
This patch also makes heap_reloptions explicitly a no-op for relation kinds
other than heap and TOAST tables.
Patch by ITAGAKI Takahiro with minor edits from me. (In particular I removed
the bit about adding relation kind to an error message, which I intend to
commit separately.)
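
As a sketch of the bitmask approach, with made-up kind names and table
entries rather than the real reloption table: each kind gets its own bit, one
table row can then be shared by several kinds, and the lookup is a mask test.
The cost is that an int-sized mask allows only a few dozen kinds.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical kinds, one bit each so they can be OR'd together. */
typedef enum
{
    KIND_HEAP = 1 << 0,
    KIND_TOAST = 1 << 1,
    KIND_BTREE = 1 << 2
} OptKind;

typedef struct
{
    const char *name;
    int         kinds;          /* bitmask of OptKind values */
} OptDef;

/* A single row serves every kind whose bit is set; no duplicated rows. */
const OptDef opt_table[] = {
    {"fillfactor", KIND_HEAP | KIND_BTREE},
    {"autovacuum_enabled", KIND_HEAP | KIND_TOAST},
};

/* Returns 1 if the named option is defined for the given kind, else 0. */
int
option_applies(const char *name, OptKind kind)
{
    size_t      n = sizeof(opt_table) / sizeof(opt_table[0]);

    for (size_t i = 0; i < n; i++)
    {
        if (strcmp(opt_table[i].name, name) == 0)
            return (opt_table[i].kinds & kind) != 0;
    }
    return 0;
}
```
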
try to protect an already-existing buffer from being evicted. This was
left as an open issue when the posix_fadvise patch was committed. I'm
not sure there's any evidence to justify more work in this area, but we
should have some record about it in the source code.
for simple Var targetlist entries all the time, even when there are other
entries that are not simple Vars. Also, ensure that we prefetch attributes
(with slot_getsomeattrs) for all Vars in the targetlist, even those buried
within expressions. In combination these changes seem to significantly
reduce the runtime for cases where tlists are mostly but not exclusively
Vars. Per my proposal of yesterday.
conversion functions. This allows transaction rollback to revert to a
previous client_encoding setting without doing fresh catalog lookups.
I believe that this explains and fixes the recent report of "failed to commit
client_encoding" failures.
This bug is present in 8.3.x, but it doesn't seem prudent to back-patch
the fix, at least not till it's had some time for field testing in HEAD.
In passing, remove SetDefaultClientEncoding(), which was used nowhere.
we failed to assign, even in "can't happen" cases. Motivated by wondering
what's going on in a recent trouble report where "failed to commit" did
happen.
temp relations; this is no more expensive than before, now that we have
pg_class.relistemp. Insert tests into bufmgr.c to prevent attempting
to fetch pages from nonlocal temp relations. This provides a low-level
defense against bugs-of-omission allowing temp pages to be loaded into shared
buffers, as in the contrib/pgstattuple problem reported by Stuart Bishop.
While at it, tweak a bunch of places to use new relcache tests (instead of
expensive probes into pg_namespace) to detect local or nonlocal temp tables.
relations (including a temp table's indexes and toast table/index), and
false for normal relations. For ease of checking, this commit just adds
the column and fills it correctly --- revising the relation access machinery
to use it will come separately.
at the same instant as a new backend is spawned. Since CountActiveBackends()
doesn't hold ProcArrayLock, it needs to be prepared for the case that a
pointer at the end of the proc array is still NULL even though numProcs says
it should be valid. Backpatch to 8.1.
8.0 and earlier had this right, but it was broken in the split of PGPROC and
sinval shared memory arrays.
Per report and proposal by Marko Kreen.
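
The defensive pattern amounts to something like the following
single-threaded sketch. The Proc struct and count_active() are hypothetical
simplifications; real lock-free code would also need to consider how the
compiler may reorder or repeat the reads.

```c
#include <stddef.h>

/* Hypothetical, simplified per-process entry. */
typedef struct
{
    int         pid;
    int         active;
} Proc;

/*
 * Count active entries without assuming every slot below num_procs is
 * filled in: a slot may still be NULL if a new entry is being added.
 */
int
count_active(Proc *const *procs, int num_procs)
{
    int         count = 0;

    for (int i = 0; i < num_procs; i++)
    {
        Proc       *p = procs[i];   /* read the pointer once */

        if (p == NULL)
            continue;               /* counter advanced before slot was set */
        if (p->pid == 0)
            continue;               /* entry not in use */
        if (p->active)
            count++;
    }
    return count;
}
```
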
TupleTableSlots. We have functions for retrieving a minimal tuple from a slot
after storing a regular tuple in it, or vice versa; but these were implemented
by converting the internal storage from one format to the other. The problem
with that is it invalidates any pass-by-reference Datums that were already
fetched from the slot, since they'll be pointing into the just-freed version
of the tuple. The known problem cases involve fetching both a whole-row
variable and a pass-by-reference value from a slot that is fed from a
tuplestore or tuplesort object. The added regression tests illustrate some
simple cases, but there may be other failure scenarios traceable to the same
bug. Note that the added tests probably only fail on unpatched code if it's
built with --enable-cassert; otherwise the bug leads to fetching from freed
memory, which will not have been overwritten without additional conditions.
Fix by allowing a slot to contain both formats simultaneously, which turns out
not to complicate the logic much at all; if anything it seems less contorted
than before.
Back-patch to 8.2, where minimal tuples were introduced.