Users of asynchronous-commit mode expect there to be a guaranteed maximum
delay before an async commit's WAL records get flushed to disk. The
original version of the walwriter hibernation patch broke that. Add an
extra shared-memory flag to allow async commits to kick the walwriter out
of hibernation mode, without adding any noticeable overhead in cases where
no action is needed.
Latch-ify the stats collector, so that it does not need an arbitrary wakeup
cycle to check for postmaster death. The incremental savings in idle power
is pretty marginal, since we only had it waking every two seconds; but I
believe that this patch may also improve the collector's performance under
load, by reducing the number of kernel calls made per message when messages
are arriving constantly (we now avoid a select/poll call except when we
need to sleep). The change also reduces the time needed for a normal
database shutdown on platforms where signals don't interrupt select().
This patch modifies the walwriter process so that, when it has not found
anything useful to do for many consecutive wakeup cycles, it extends its
sleep time to reduce the server's idle power consumption. It reverts to
normal as soon as it's done any successful flushes. It's still true that
during any async commit, backends check for completed, unflushed pages of
WAL and signal the walwriter if there are any; so that in practice the
walwriter can get awakened and returned to normal operation sooner than the
sleep time might suggest.
Also, improve the checkpointer so that it uses a latch and a computed delay
time to not wake up at all except when it has something to do, replacing a
previous hardcoded 0.5 sec wakeup cycle. This also is primarily useful for
reducing the server's power consumption when idle.
In passing, get rid of the dedicated latch for signaling the walwriter in
favor of using its procLatch, since that comports better with possible
generic signal handlers using that latch. Also, fix a pre-existing bug
with failure to save/restore errno in walwriter's signal handlers.
Peter Geoghegan, somewhat simplified by Tom
"Unexpected EOF on client connection" without an open transaction
is mostly noise, so turn it into DEBUG1. With an open transaction it's
still indicating a problem, so keep those as ERROR, and change the message
to indicate that it happened in a transaction.
Commit 62c7bd31c8 had assorted problems, most
visibly that it broke PREPARE TRANSACTION in the presence of session-level
advisory locks (which should be ignored by PREPARE), as per a recent
complaint from Stephen Rees. More abstractly, the patch made the
LockMethodData.transactional flag not merely useless but outright
dangerous, because in point of fact that flag no longer tells you anything
at all about whether a lock is held transactionally. This fix therefore
removes that flag altogether. We now rely entirely on the convention
already in use in lock.c that transactional lock holds must be owned by
some ResourceOwner, while session holds are never so owned. Setting the
locallock struct's owner link to NULL thus denotes a session hold, and
there is no redundant marker for that.
PREPARE TRANSACTION now works again when there are session-level advisory
locks, and it is also able to transfer transactional advisory locks to the
prepared transaction, but for implementation reasons it throws an error if
we hold both types of lock on a single lockable object. Perhaps it will be
worth improving that someday.
Assorted other minor cleanup and documentation editing, as well.
Back-patch to 9.1, except that in the 9.1 branch I did not remove the
LockMethodData.transactional flag for fear of causing an ABI break for
any external code that might be examining those structs.
At the time we check whether the tuple is dead to all running
transactions, we've already verified that it isn't visible to our
scan, setting hint bits if appropriate. So there's no need to
recheck CLOG for the all-dead test we do just a moment later.
So, add HeapTupleIsSurelyDead() to test the appropriate condition
under the assumption that all relevant hit bits are already set.
Review by Tom Lane.
Remove the following ports:
- dgux
- nextstep
- sunos4
- svr4
- ultrix4
- univel
These are obsolete and not worth rescuing. In most cases, there is
circumstantial evidence that they wouldn't work anymore anyway.
This patch adjusts the core statistics views to match the decision already
taken for pg_stat_statements, that values representing elapsed time should
be represented as float8 and measured in milliseconds. By using float8,
we are no longer tied to a specific maximum precision of timing data.
(Internally, it's still microseconds, but we could now change that without
needing changes at the SQL level.)
The columns affected are
pg_stat_bgwriter.checkpoint_write_time
pg_stat_bgwriter.checkpoint_sync_time
pg_stat_database.blk_read_time
pg_stat_database.blk_write_time
pg_stat_user_functions.total_time
pg_stat_user_functions.self_time
pg_stat_xact_user_functions.total_time
pg_stat_xact_user_functions.self_time
The first four of these are new in 9.2, so there is no compatibility issue
from changing them. The others require a release note comment that they
are now double precision (and can show a fractional part) rather than
bigint as before; also their underlying statistics functions now match
the column definitions, instead of returning bigint microseconds.
In ancient times, it was thought that this wouldn't work because of
TrapMacro/AssertMacro, but changing those to use a comma operator
appears to work without compiler warnings.
Normally whole-row Vars are printed as "tabname.*". However, that does not
work at top level of a targetlist, because per SQL standard the parser will
think that the "*" should result in column-by-column expansion; which is
not at all what a whole-row Var implies. We used to just print the table
name in such cases, which works most of the time; but it fails if the table
name matches a column name available anywhere in the FROM clause. This
could lead for instance to a view being interpreted differently after dump
and reload. Adding parentheses doesn't fix it, but there is a reasonably
simple kluge we can use instead: attach a no-op cast, so that the "*" isn't
syntactically at top level anymore. This makes the printing of such
whole-row Vars a lot more consistent with other Vars, and may indeed fix
more cases than just the reported one; I'm suspicious that cases involving
schema qualification probably didn't work properly before, either.
Per bug report and fix proposal from Abbas Butt, though this patch is quite
different in detail from his.
Back-patch to all supported versions.
If it fails to open a new log file, the syslogger assumes there's something
wrong with its parameters (such as log_directory), and stops attempting
automatic time-based or size-based log file rotations. Sending it SIGHUP
is supposed to start that up again. However, the original coding for that
was really bogus, involving clobbering a couple of GUC variables and hoping
that SIGHUP processing would restore them. Get rid of that technique in
favor of maintaining a separate flag showing we've turned rotation off.
Per report from Mark Kirkwood.
Also, the syslogger will automatically attempt to create the log_directory
directory if it doesn't exist, but that was only happening at startup.
For consistency and ease of use, it should do the same whenever the value
of log_directory is changed by SIGHUP.
Back-patch to all supported branches.
The alternative of disallowing index-only scans in HS operation was
discussed, but the consensus was that it was better to treat marking
a page all-visible as a recovery conflict for snapshots that could still
fail to see XIDs on that page. We may in the future try to soften this,
so that we simply force index scans to do heap fetches in cases where
this may be an issue, rather than throwing a hard conflict.
bitmap_scan_cost_est() has to be able to cope with a BitmapOrPath, but
I'd taken a shortcut that didn't work for that case. Noted by Heikki.
Add some regression tests since this area is evidently under-covered.
setrefs.c failed to do "rtoffset" adjustment of Vars in RETURNING lists,
which meant they were left with the wrong varnos when the RETURNING list
was in a subquery. That was never possible before writable CTEs, of
course, but now it's broken. The executor fails to notice any problem
because ExecEvalVar just references the ecxt_scantuple for any normal
varno; but EXPLAIN breaks when the varno is wrong, as illustrated in a
recent complaint from Bartosz Dmytrak.
Since the eventual rtoffset of the subquery is not known at the time
we are preparing its plan node, the previous scheme of executing
set_returning_clause_references() at that time cannot handle this
adjustment. Fortunately, it turns out that we don't really need to do it
that way, because all the needed information is available during normal
setrefs.c execution; we just have to dig it out of the ModifyTable node.
So, do that, and get rid of the kluge of early setrefs processing of
RETURNING lists. (This is a little bit of a cheat in the case of inherited
UPDATE/DELETE, because we are not passing a "root" struct that corresponds
exactly to what the subplan was built with. But that doesn't matter, and
anyway this is less ugly than early setrefs processing was.)
Back-patch to 9.1, where the problem became possible to hit.
Prohibiting this outright would break dumps taken from older versions
that contain such casts, which would create far more pain than is
justified here.
Per report by Jaime Casanova and subsequent discussion.
We must set the visibility map bit before releasing our exclusive lock
on the heap page; otherwise, someone might clear the heap page bit
before we set the visibility map bit, leading to a situation where the
visibility map thinks the page is all-visible but it's really not.
This problem has existed since 8.4, but it wasn't critical before we
had index-only scans, since the worst case scenario was that the page
wouldn't get vacuumed until the next scan_all vacuum.
Along the way, a couple of minor, related improvements: (1) if we
pause the heap scan to do an index vac cycle, release any visibility
map page we're holding, since really long-running pins are not good
for a variety of reasons; and (2) warn if we see a page that's marked
all-visible in the visibility map but not on the page level, since
that should never happen any more (it was allowed in previous
releases, but not in 9.2).
Instead of an exact cost comparison, use a fuzzy comparison with 1e-10
delta after all other path metrics have proved equal. This is to avoid
having platform-specific roundoff behaviors determine the choice when
two paths are really the same to our cost estimators. Adjust the
recently-added test case that made it obvious we had a problem here.
The original syntax wasn't universally loved, and it didn't allow its
usage in CREATE TABLE, only ALTER TABLE. It now works everywhere, and
it also allows using ALTER TABLE ONLY to add an uninherited CHECK
constraint, per discussion.
The pg_constraint column has accordingly been renamed connoinherit.
This commit partly reverts some of the changes in
61d81bd28d, particularly some pg_dump and
psql bits, because now pg_get_constraintdef includes the necessary NO
INHERIT within the constraint definition.
Author: Nikhil Sontakke
Some tweaks by me
For an initial relation that lacks any join clauses (that is, it has to be
cartesian-product-joined to the rest of the query), we considered only
cartesian joins with initial rels appearing later in the initial-relations
list. This creates an undesirable dependency on FROM-list order. We would
never fail to find a plan, but perhaps we might not find the best available
plan. Noted while discussing the logic with Amit Kapila.
Improve the comments a bit in this area, too.
Arguably this is a bug fix, but given the lack of complaints from the
field I'll refrain from back-patching.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate. We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.
In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage. This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.
To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing. This is required at both base scans and joins. It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree. Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
This has been wrong for a really long time. We don't use two-phase
locking to protect against serialization anomalies.
Per discussion on pgsql-hackers about 2011-03-07; original report
by Dan Ports.
The previous code could cause a backend crash after BEGIN; SAVEPOINT a;
LOCK TABLE foo (interrupted by ^C or statement timeout); ROLLBACK TO
SAVEPOINT a; LOCK TABLE foo, and might have leaked strong-lock counts
in other situations.
Report by Zoltán Böszörményi; patch review by Jeff Davis.
Previously, we used SetBufferCommitInfoNeedsSave, but that's really
intended for dirty-marks we can theoretically afford to lose, such as
hint bits. As for 9.2, the PD_ALL_VISIBLE mustn't be lost in this
way, since we could then end up with a heap page that isn't
all-visible and a visibility map page that is all visible, causing
index-only scans to return wrong answers.
Mostly, this consists of adding support for fields which exist in the
structure but aren't handled by copy/equal/outfuncs; but the create
foreign table case can actually produce garbage output.
Noah Misch
When using synchronous replication, we waited for the commit record to be
replicated, but if we our transaction didn't write any other WAL records,
that's not required because we don't even flush the WAL locally to disk in
that case. This lead to long waits when committing a transaction that only
modified a temporary table. Bug spotted by Thom Brown.
The output of the new pg_xlog_location_diff function is of type numeric,
since it could theoretically overflow an int8 due to signedness; this
provides a convenient way to format such values.
Fujii Masao, with some beautification by me.
So far as I can tell, it is no longer possible for this heuristic to do
anything useful, because the new weaker definition of
have_relevant_joinclause means that any relation with a joinclause must be
considered joinable to at least one other relation. It would still be
possible for the code block to be entered, for example if there are join
order restrictions that prevent any join of the current level from being
formed; but in that case it's just a waste of cycles to attempt to form
cartesian joins, since the restrictions will still apply.
Furthermore, IMO the existence of this code path can mask bugs elsewhere;
we would have noticed the problem with cartesian joins a lot sooner if
this code hadn't compensated for it in the simplest case.
Accordingly, let's remove it and see what happens. I'm committing this
separately from the prerequisite changes in have_relevant_joinclause,
just to make the question easier to revisit if there is some fault in
my logic.
We should be willing to cross-join two small relations if that allows us
to use an inner indexscan on a large relation (that is, the potential
indexqual for the large table requires both smaller relations). This
worked in simple cases but fell apart as soon as there was a join clause
to a fourth relation, because the existence of any two-relation join clause
caused the planner to not consider clauseless joins between other base
relations. The added regression test shows an example case adapted from
a recent complaint from Benoit Delbosc.
Adjust have_relevant_joinclause, have_relevant_eclass_joinclause, and
has_relevant_eclass_joinclause to consider that a join clause mentioning
three or more relations is sufficient grounds for joining any subset of
those relations, even if we have to do so via a cartesian join. Since such
clauses are relatively uncommon, this shouldn't affect planning speed on
typical queries; in fact it should help a bit, because the latter two
functions in particular get significantly simpler.
Although this is arguably a bug fix, I'm not going to risk back-patching
it, since it might have currently-unforeseen consequences.
Per mailing list discussion, we would like to keep the bytea functions
parallel to the text functions, so rename bytea_agg to string_agg,
which already exists for text.
Also, to satisfy the rule that we don't want aggregate functions of
the same name with a different number of arguments, add a delimiter
argument, just like string_agg for text already has.
cost_index's method for estimating per-tuple costs of evaluating filter
conditions (a/k/a qpquals) was completely wrong in the presence of derived
indexable conditions, such as range conditions derived from a LIKE clause.
This was largely masked in common cases as a result of all simple operator
clauses having about the same costs, but it could show up in a big way when
dealing with functional indexes containing expensive functions, as seen for
example in bug #6579 from Istvan Endredy. Rejigger the calculation to give
sane answers when the indexquals aren't a subset of the baserestrictinfo
list. As a side benefit, we now do the calculation properly for cases
involving join clauses (ie, parameterized indexscans), which we always
overestimated before.
There are still cases where this is an oversimplification, such as clauses
that can be dropped because they are implied by a partial index's
predicate. But we've never accounted for that in cost estimates before,
and I'm not convinced it's worth the cycles to try to do so.
Previously we attempted to throw an error or at least warning for missing
schemas, but this was done inconsistently because of implementation
restrictions (in many cases, GUC settings are applied outside transactions
so that we can't do system catalog lookups). Furthermore, there were
exceptions to the rule even in the beginning, and we'd been poking more
and more holes in it as time went on, because it turns out that there are
lots of use-cases for having some irrelevant items in a common search_path
value. It seems better to just adopt a philosophy similar to what's always
been done with Unix PATH settings, wherein nonexistent or unreadable
directories are silently ignored.
This commit also fixes the documentation to point out that schemas for
which the user lacks USAGE privilege are silently ignored. That's always
been true but was previously not documented.
This is mostly in response to Robert Haas' complaint that 9.1 started to
throw errors or warnings for missing schemas in cases where prior releases
had not. We won't adopt such a significant behavioral change in a back
branch, so something different will be needed in 9.1.
This definition is convenient when applying the function to the
reltablespace column of pg_class, since that's what zero means there;
and it doesn't interfere with any other plausible use of the function.
Per gripe from Bruce Momjian.
This patch reverts commit 191ef2b407
and thereby restores the pre-7.3 behavior of EXTRACT(EPOCH FROM
timestamp-without-tz). Per discussion, the more recent behavior was
misguided on a couple of grounds: it makes it hard to get a
non-timezone-aware epoch value for a timestamp, and it makes this one
case dependent on the value of the timezone GUC, which is incompatible
with having timestamp_part() labeled as immutable.
The other behavior is still available (in all releases) by explicitly
casting the timestamp to timestamp with time zone before applying EXTRACT.
This will need to be called out as an incompatible change in the 9.2
release notes. Although having mutable behavior in a function marked
immutable is clearly a bug, we're not going to back-patch such a change.
estimate_num_groups() gets unhappy with
create table empty();
select * from empty except select * from empty e2;
I can't see any actual use-case for such a query (and the table is illegal
per SQL spec), but it seems like a good idea that it not cause an assert
failure.
The case could not arise when this code was originally written, but it can
now (since we made zero-column MergeJoins work for the benefit of FULL JOIN
ON TRUE). I don't think there is any actual bug here, but we might as well
treat it consistently with other uses of COPY_POINTER_FIELD(). Per comment
from Ashutosh Bapat.
This was a thinko in previous commit. Now that stack base pointer is now set
in PostmasterMain and SubPostmasterMain, it doesn't need to be set in
PostgresMain anymore.
We used to only initialize the stack base pointer when starting up a regular
backend, not in other processes. In particular, autovacuum workers can run
arbitrary user code, and without stack-depth checking, infinite recursion
in e.g an index expression will bring down the whole cluster.
The comment about PL/Java using set_stack_base() is not yet true. As the
code stands, PL/java still modifies the stack_base_ptr variable directly.
However, it's been discussed in the PL/Java mailing list that it should be
changed to use the function, because PL/Java is currently oblivious to the
register stack used on Itanium. There's another issues with PL/Java, namely
that the stack base pointer it sets is not really the base of the stack, it
could be something close to the bottom of the stack. That's a separate issue
that might need some further changes to this code, but that's a different
story.
Backpatch to all supported releases.
XLOG_GIN_UPDATE_META_PAGE and XLOG_GIN_DELETE_LISTPAGE records were printed
with a list link field labeled as "blkno", which was confusing, especially
when the link was empty (InvalidBlockNumber). Print the metapage block
number instead, since that's what's actually being updated. We could
include the link values too as a separate field, but not clear it's worth
the trouble.
Back-patch to 8.4 where the dubious code was added.
Commit 337b6f5ecf contained the entirely
fanciful assumption that it had made comparetup_datum unreachable.
Reported and patched by Takashi Yamamoto.
Fix up some not terribly accurate/useful comments from that commit, too.
If we make the initially-called function return the table physical-size
estimate, acquire_inherited_sample_rows will be able to use that to
allocate numbers of samples among child tables, when the day comes that
we want to support foreign tables in inheritance trees.
ANALYZE now accepts foreign tables and allows the table's FDW to control
how the sample rows are collected. (But only manual ANALYZEs will touch
foreign tables, for the moment, since among other things it's not very
clear how to handle remote permissions checks in an auto-analyze.)
contrib/file_fdw is extended to support this.
Etsuro Fujita, reviewed by Shigeru Hanada, some further tweaking by me.
Ants Aasma's original patch to add timing information for buffer I/O
requests exposed this data at the relation level, which was judged too
costly. I've here exposed it at the database level instead.
The original coding of the syslogger had an arbitrary limit of 20 large
messages concurrently in progress, after which it would just punt and dump
message fragments to the output file separately. Our ambitions are a bit
higher than that now, so allow the data structure to expand as necessary.
Reported and patched by Andrew Dunstan; some editing by Tom
Initialise ckptXidEpoch from starting checkpoint and maintain the correct
value as we roll forwards. This allows GetNextXidAndEpoch() to return the
correct epoch when executed during recovery. Backpatch to 9.0 when the
problem is first observable by a user.
Bug report from Daniel Farina
Postmaster sets max_safe_fds by testing how many open file descriptors it
can open, and that is normally inherited by all child processes at fork().
Not so on EXEC_BACKEND, ie. Windows, however. Because of that, we
effectively ignored max_files_per_process on Windows, and always assumed
a conservative default of 32 simultaneous open files. That could have an
impact on performance, if you need to access a lot of different files
in a query. After this patch, the value is passed to child processes by
save/restore_backend_variables() among many other global variables.
It has been like this forever, but given the lack of complaints about it,
I'm not backpatching this.
Generally, the parse location assigned to a multiple-token construct is
the location of its leftmost token. This commit breaks that rule for
the syntaxes TYPENAME 'LITERAL' and CAST(CONSTANT AS TYPENAME) --- the
resulting Const will have the location of the literal string, not the
typename or CAST keyword. The cases where this matters are pretty thin on
the ground (no error messages in the regression tests change, for example),
and it's unlikely that any user would be confused anyway by an error cursor
pointing at the literal. But still it's less than consistent. The reason
for changing it is that contrib/pg_stat_statements wants to know the parse
location of the original literal, and it was agreed that this is the least
unpleasant way to preserve that information through parse analysis.
Peter Geoghegan
Add a queryId field to Query and PlannedStmt. This is not used by the
core backend, except for being copied around at appropriate times.
It's meant to allow plug-ins to track a particular query forward from
parse analysis to execution.
The queryId is intentionally not dumped into stored rules (and hence this
commit doesn't bump catversion). You could argue that choice either way,
but it seems better that stored rule strings not have any dependency
on plug-ins that might or might not be present.
Also, add a post_parse_analyze_hook that gets invoked at the end of
parse analysis (but only for top-level analysis of complete queries,
not cases such as analyzing a domain's default-value expression).
This is mainly meant to be used to compute and assign a queryId,
but it could have other applications.
Peter Geoghegan
Currently, the only way to see the numbers this gathers is via
EXPLAIN (ANALYZE, BUFFERS), but the plan is to add visibility through
the stats collector and pg_stat_statements in subsequent patches.
Ants Aasma, reviewed by Greg Smith, with some further changes by me.
It used to be case that lazy vacuum could call this function with only
a shared lock on the buffer, but neither lazy vacuum nor any other
code path does that any more. Simplify the code accordingly and clean
up some related, obsolete comments.
The COPY documentation says "COPY FROM matches the input against the null
string before removing backslashes". It is therefore reasonable to presume
that null markers like E'\\0' will work ... and they did, until someone put
the tests in the wrong order during microoptimization-driven rewrites.
Since then, we've been failing if the null marker is something that would
de-escape to an invalidly-encoded string. Since null markers generally
need to be something that can't appear in the data, this represents a
nontrivial loss of functionality; surprising nobody noticed it earlier.
Per report from Jeff Davis. Backpatch to 8.4 where this got broken.
setlocale() accepts locale name "" as meaning "the locale specified by the
process's environment variables". Historically we've accepted that for
Postgres' locale settings, too. However, it's fairly unsafe to store an
empty string in a new database's pg_database.datcollate or datctype fields,
because then the interpretation could vary across postmaster restarts,
possibly resulting in index corruption and other unpleasantness.
Instead, we should expand "" to whatever it means at the moment of calling
CREATE DATABASE, which we can do by saving the value returned by
setlocale().
For consistency, make initdb set up the initial lc_xxx parameter values the
same way. initdb was already doing the right thing for empty locale names,
but it did not replace non-empty names with setlocale results. On a
platform where setlocale chooses to canonicalize the spellings of locale
names, this would result in annoying inconsistency. (It seems that popular
implementations of setlocale don't do such canonicalization, which is a
pity, but the POSIX spec certainly allows it to be done.) The same risk
of inconsistency leads me to not venture back-patching this, although it
could certainly be seen as a longstanding bug.
Per report from Jeff Davis, though this is not his proposed patch.
For some reason, in the original coding of the PlaceHolderVar mechanism
I had supposed that PlaceHolderVars couldn't propagate into subqueries.
That is of course entirely possible. When it happens, we need to treat
an outer-level PlaceHolderVar much like an outer Var or Aggref, that is
SS_replace_correlation_vars() needs to replace the PlaceHolderVar with
a Param, and then when building the finished SubPlan we have to provide
the PlaceHolderVar expression as an actual parameter for the SubPlan.
The handling of the contained expression is a bit delicate but it can be
treated exactly like an Aggref's expression.
In addition to the missing logic in subselect.c, prepjointree.c was failing
to search subqueries for PlaceHolderVars that need their relids adjusted
during subquery pullup. It looks like everyplace else that touches
PlaceHolderVars got it right, though.
Per report from Mark Murawski. In 9.1 and HEAD, queries affected by this
oversight would fail with "ERROR: Upper-level PlaceHolderVar found where
not expected". But in 9.0 and 8.4, you'd silently get possibly-wrong
answers, since the value transmitted into the subquery wouldn't go to null
when it should.
We were doing the recursive simplification of function/operator arguments
in half a dozen different places, with rather baroque logic to ensure it
didn't get done multiple times on some arguments. This patch improves that
by postponing argument simplification until after we've dealt with named
parameters and added any needed default expressions.
Marti Raudsepp, somewhat hacked on by me
Fix loss of previous expression-simplification work when a transform
function fires: we must not simply revert to untransformed input tree.
Instead build a dummy FuncExpr node to pass to the transform function.
This has the additional advantage of providing a simpler, more uniform
API for transform functions.
Move documentation to a somewhat less buried spot, relocate some
poorly-placed code, be more wary of null constants and invalid typmod
values, add an opr_sanity check on protransform function signatures,
and some other minor cosmetic adjustments.
Note: although this patch touches pg_proc.h, no need for catversion
bump, because the changes are cosmetic and don't actually change the
intended catalog contents.
For those variables only used when asserts are enabled, use a new
macro PG_USED_FOR_ASSERTS_ONLY, which expands to
__attribute__((unused)) when asserts are not enabled.
This restores the pre-9.0 situation that it's possible to add new indexes
on pg_class and other mapped-but-not-shared catalogs, so long as you broke
the glass and flipped the big red Dont-Touch-Me switch. As before, there
are a lot of gotchas, and you'd have to be pretty desperate to try this
on a production database; but there doesn't seem to be a reason for
relmapper.c to be preventing such things all by itself. Per
experimentation with a case suggested by Cody Cutrer.
I broke this in commit 337b6f5ecf, which
among other things arranged for quicksorts to CHECK_FOR_INTERRUPTS()
slightly less frequently. Sadly, it also arranged for heapsorts to
CHECK_FOR_INTERRUPTS() much less frequently. Repair.
Making this operation look like a utility statement seems generally a good
idea, and particularly so in light of the desire to provide command
triggers for utility statements. The original choice of representing it as
SELECT with an IntoClause appendage had metastasized into rather a lot of
places, unfortunately, so that this patch is a great deal more complicated
than one might at first expect.
In particular, keeping EXPLAIN working for SELECT INTO and CREATE TABLE AS
subcommands required restructuring some EXPLAIN-related APIs. Add-on code
that calls ExplainOnePlan or ExplainOneUtility, or uses
ExplainOneQuery_hook, will need adjustment.
Also, the cases PREPARE ... SELECT INTO and CREATE RULE ... SELECT INTO,
which formerly were accepted though undocumented, are no longer accepted.
The PREPARE case can be replaced with use of CREATE TABLE AS EXECUTE.
The CREATE RULE case doesn't seem to have much real-world use (since the
rule would work only once before failing with "table already exists"),
so we'll not bother with that one.
Both SELECT INTO and CREATE TABLE AS still return a command tag of
"SELECT nnnn". There was some discussion of returning "CREATE TABLE nnnn",
but for the moment backwards compatibility wins the day.
Andres Freund and Tom Lane
For a little while there I thought match_pathkeys_to_index() was broken
because it wasn't trying to match index columns to pathkeys in order.
Actually that's correct, because GiST can support ordering operators
on any random collection of index columns, but it sure needs a comment.
In commit 57664ed25e I tried to fix a bug
reported by Teodor Sigaev by making non-simple-Var output columns distinct
(by wrapping their expressions with dummy PlaceHolderVar nodes). This did
not work too well. Commit b28ffd0fcc fixed
some ensuing problems with matching to child indexes, but per a recent
report from Claus Stadler, constraint exclusion of UNION ALL subqueries was
still broken, because constant-simplification didn't handle the injected
PlaceHolderVars well either. On reflection, the original patch was quite
misguided: there is no reason to expect that EquivalenceClass child members
will be distinct. So instead of trying to make them so, we should ensure
that we can cope with the situation when they're not.
Accordingly, this patch reverts the code changes in the above-mentioned
commits (though the regression test cases they added stay). Instead, I've
added assorted defenses to make sure that duplicate EC child members don't
cause any problems. Teodor's original problem ("MergeAppend child's
targetlist doesn't match MergeAppend") is addressed more directly by
revising prepare_sort_from_pathkeys to let the parent MergeAppend's sort
list guide creation of each child's sort list.
In passing, get rid of add_sort_column; as far as I can tell, testing for
duplicate sort keys at this stage is dead code. Certainly it doesn't
trigger often enough to be worth expending cycles on in ordinary queries.
And keeping the test would've greatly complicated the new logic in
prepare_sort_from_pathkeys, because comparing pathkey list entries against
a previous output array requires that we not skip any entries in the list.
Back-patch to 9.1, like the previous patches. The only known issue in
this area that wasn't caused by the ill-advised previous patches was the
MergeAppend planning failure, which of course is not relevant before 9.1.
It's possible that we need some of the new defenses against duplicate child
EC entries in older branches, but until there's some clear evidence of that
I'm going to refrain from back-patching further.
This is for tools such as Coverity that don't know that the grammar
enforces that the case of not having a relation (but instead a query)
cannot happen in the FROM case.
A leaf tuple that we need to delete could get moved as a consequence of an
insertion happening concurrently with the VACUUM scan. If it moves from a
page past the current scan point to a page before, we'll miss it, which is
not acceptable. Hence, when we see a leaf-page REDIRECT that could have
been made since our scan started, chase down the redirection pointer much
as if we were doing a normal index search, and be sure to vacuum every page
it leads to. This fixes the issue because, if the tuple was on page N at
the instant we start our scan, we will surely find it as a consequence of
chasing the redirect from page N, no matter how much it moves around in
between. Problem noted by Takashi Yamamoto.
We have always created a whole-table dependency for the target relation,
but that's not really good enough, as it doesn't prevent scenarios such
as dropping an individual target column or altering its type. So we
have to create an individual dependency for each target column, as well.
Per report from Bill MacArthur of a rule containing UPDATE breaking
after such an alteration. Note that this patch doesn't try to make
such cases work, only to ensure that the attempted ALTER TABLE throws
an error telling you it can't cope with adjusting the rule.
This is a long-standing bug, but given the lack of prior reports
I'm not going to risk back-patching it. A back-patch wouldn't do
anything to fix existing rules' dependency lists, anyway.
This patch fixes the other major compatibility-breaking limitation of
SPGiST, that it didn't store anything for null values of the indexed
column, and so could not support whole-index scans or "x IS NULL"
tests. The approach is to create a wholly separate search tree for
the null entries, and use fixed "allTheSame" insertion and search
rules when processing this tree, instead of calling the index opclass
methods. This way the opclass methods do not need to worry about
dealing with nulls.
Catversion bump is for pg_am updates as well as the change in on-disk
format of SPGiST indexes; there are some tweaks in SPGiST WAL records
as well.
Heavily rewritten version of a patch by Oleg Bartunov and Teodor Sigaev.
(The original also stored nulls separately, but it reused GIN code to do
so; which required undesirable compromises in the on-disk format, and
would likely lead to bugs due to the GIN code being required to work in
two very different contexts.)
It now prints the argument that was at fault.
Also fix a small misbehavior where the error message issued by
getopt() would complain about a program named "--single", because
that's what argv[0] is in the server process.
The original API definition was incapable of supporting whole-index scans
because there was no way to invoke leaf-value reconstruction without
checking any qual conditions. Also, it was inefficient for
multiple-qual-condition scans because value reconstruction got done over
again for each qual condition, and because other internal work in the
consistent functions likewise had to be done for each qual. To fix these
issues, pass the whole scankey array to the opclass consistent functions,
instead of only letting them see one item at a time. (Essentially, the
loop over scankey entries is now inside the consistent functions not
outside them. This makes the consistent functions a bit more complicated,
but not unreasonably so.)
In itself this commit does nothing except save a few cycles in
multiple-qual-condition index scans, since we can't support whole-index
scans on SPGiST indexes until nulls are included in the index. However,
I consider this a must-fix for 9.2 because once we release it will get
very much harder to change the opclass API definition.
This allows loadable modules to get control at drop time, perhaps for the
purpose of performing additional security checks or to log the event.
The initial purpose of this code is to support sepgsql, but other
applications should be possible as well.
KaiGai Kohei, reviewed by me.
Further reflection shows that a single callback isn't very workable if we
desire to let FDWs generate multiple Paths, because that forces the FDW to
do all work necessary to generate a valid Plan node for each Path. Instead
split the former PlanForeignScan API into three steps: GetForeignRelSize,
GetForeignPaths, GetForeignPlan. We had already bit the bullet of breaking
the 9.1 FDW API for 9.2, so this shouldn't cause very much additional pain,
and it's substantially more flexible for complex FDWs.
Add an fdw_private field to RelOptInfo so that the new functions can save
state there rather than possibly having to recalculate information two or
three times.
In addition, we'd not thought through what would be needed to allow an FDW
to set up subexpressions of its choice for runtime execution. We could
treat ForeignScan.fdw_private as an executable expression but that seems
likely to break existing FDWs unnecessarily (in particular, it would
restrict the set of node types allowable in fdw_private to those supported
by expression_tree_walker). Instead, invent a separate field fdw_exprs
which will receive the postprocessing appropriate for expression trees.
(One field is enough since it can be a list of expressions; also, we assume
the corresponding expression state tree(s) will be held within fdw_state,
so we don't need to add anything to ForeignScanState.)
Per review of Hanada Shigeru's pgsql_fdw patch. We may need to tweak this
further as we continue to work on that patch, but to me it feels a lot
closer to being right now.
Phil Sorber reported that a rewriting ALTER TABLE within an extension
update script failed, because it creates and then drops a placeholder
table; the drop was being disallowed because the table was marked as an
extension member. We could hack that specific case but it seems likely
that there might be related cases now or in the future, so the most
practical solution seems to be to create an exception to the general rule
that extension member objects can only be dropped by dropping the owning
extension. To wit: if the DROP is issued within the extension's own
creation or update scripts, we'll allow it, implicitly performing an
"ALTER EXTENSION DROP object" first. This will simplify cases such as
extension downgrade scripts anyway.
No docs change since we don't seem to have documented the idea that you
would need ALTER EXTENSION DROP for such an action to begin with.
Also, arrange for explicitly temporary tables to not get linked as
extension members in the first place, and the same for the magic
pg_temp_nnn schemas that are created to hold them. This prevents assorted
unpleasant results if an extension script creates a temp table: the forced
drop at session end would either fail or remove the entire extension, and
neither of those outcomes is desirable. Note that this doesn't fix the
ALTER TABLE scenario, since the placeholder table is not temp (unless the
table being rewritten is).
Back-patch to 9.1.
In constructs such as "x IN (1,2,3,4)" and "x <> ALL(ARRAY[1,2,3,4])",
we formerly always used a general-purpose assumption that the probability
of success is independent for each comparison of "x" to an array element.
But in real-world usage of these constructs, that's a pretty poor
assumption; it's much saner to assume that the array elements are distinct
and so the match probabilities are disjoint. Apply that assumption if the
operator appears to behave as equality (for ANY) or inequality (for ALL).
But fall back to the normal independent-probabilities calculation if this
yields an impossible result, ie probability > 1 or < 0. We could protect
ourselves against bad estimates even more by explicitly checking for equal
array elements, but that is expensive and doesn't seem worthwhile: doing
it would amount to optimizing for poorly-written queries at the expense
of well-written ones.
Daniele Varrazzo and Tom Lane, after a suggestion by Ants Aasma
GetForeignColumnOptions provides some abstraction for accessing
column-specific FDW options, on a par with the access functions that were
already provided here for other FDW-related information.
Adjust file_fdw.c to use GetForeignColumnOptions instead of equivalent
hand-rolled code.
In addition, add some SGML documentation for the functions exported by
foreign.c that are meant for use by FDW authors.
(This is the fdw_helper portion of the proposed pgsql_fdw patch.)
Hanada Shigeru, reviewed by KaiGai Kohei
Now that cache invalidation callbacks get only a hash value, and not a
tuple TID (per commits 632ae6829f and
b5282aa893), the only way they can restrict
what they invalidate is to know what the hash values mean. setrefs.c was
doing this via a hard-wired assumption but that seems pretty grotty, and
it'll only get worse as more cases come up. So let's expose a calculation
function that takes the same parameters as SearchSysCache. Per complaint
from Marko Kreen.
Use-cases for this include custom log filtering rules and custom log
message transmission mechanisms (for instance, lossy log message
collection, which has been discussed several times recently).
As is our common practice for hooks, there's no regression test nor
user-facing documentation for this, though the author did exhibit a
sample module using the hook.
Martin Pihlak, reviewed by Marti Raudsepp
This simplifies the code a little bit. The new rule is that to update
XLogCtl->LogwrtResult, you must hold both WALWriteLock and info_lck, whereas
before we had two copies, one that was protected by WALWriteLock and another
protected by info_lck. The code that updates them was already holding both
locks, so merging the two is trivial.
The third copy, XLogCtl->Insert.LogwrtResult, was not totally redundant, it
was used in AdvanceXLInsertBuffer to update the backend-local copy, before
acquiring the info_lck to read the up-to-date value. But the value of that
seems dubious; at best it's saving one spinlock acquisition per completed
WAL page, which is not significant compared to all the other work involved.
And in practice, it's probably not saving even that much.
It's harmless to do full page writes even when not strictly necessary, so
when turning full_page_writes on, we can set the global flag first, and then
call XLogInsert. Likewise, when turning it off, we can write the WAL record
first, and then clear the flag. This way XLogInsert doesn't need any special
handling of the XLOG_FPW_CHANGE record type. XLogInsert is complicated
enough already, so anything we can keep away from there is a good thing.
Actually I don't think the atomicity of the shared memory flag matters,
anyway, because we only write the XLOG_FPW_CHANGE at the end of recovery,
when there are no concurrent WAL insertions going on. But might as well make
it safe, in case we allow changing full_page_writes on the fly in the
future.
The original API specification only allowed an FDW to create a single
access path, which doesn't seem like a terribly good idea in hindsight.
Instead, move the responsibility for building the Path node and calling
add_path() into the FDW's PlanForeignScan function. Now, it can do that
more than once if appropriate. There is no longer any need for the
transient FdwPlan struct, so get rid of that.
Etsuro Fujita, Shigeru Hanada, Tom Lane
This patch installs significantly smarter penalty and picksplit functions
for ranges, making GiST indexes for them smaller and faster to search.
There is no on-disk format change, so no catversion bump, but you'd need
to REINDEX to get the benefits for any existing index.
Alexander Korotkov, reviewed by Jeff Davis
The code in this function that tried to cope with a missing count histogram
was quite ineffective for anything except a perfectly flat distribution.
Furthermore, since we were already punting for missing MCELEM slot, it's
rather useless to sweat over missing DECHIST: there are no cases where
ANALYZE will create the first but not the second. So just simplify the
code by punting rather than pretending we can do something useful.
Do "frac" arithmetic in int64 to prevent overflow with large statistics
targets, and improve the comments so people have some chance of
understanding how it works.
Alexander Korotkov and Tom Lane
Now that we have validate_xlog_location, call it from the previously
existing functions taking xlog locatoins as a string input.
Suggested by Fujii Masao
Comparing two xlog locations are useful for example when calculating
replication lag.
Euler Taveira de Oliveira, reviewed by Fujii Masao, and some cleanups
from me
This patch improves selectivity estimation for the array <@, &&, and @>
(containment and overlaps) operators. It enables collection of statistics
about individual array element values by ANALYZE, and introduces
operator-specific estimators that use these stats. In addition,
ScalarArrayOpExpr constructs of the forms "const = ANY/ALL (array_column)"
and "const <> ANY/ALL (array_column)" are estimated by treating them as
variants of the containment operators.
Since we still collect scalar-style stats about the array values as a
whole, the pg_stats view is expanded to show both these stats and the
array-style stats in separate columns. This creates an incompatible change
in how stats for tsvector columns are displayed in pg_stats: the stats
about lexemes are now displayed in the array-related columns instead of the
original scalar-related columns.
There are a few loose ends here, notably that it'd be nice to be able to
suppress either the scalar-style stats or the array-element stats for
columns for which they're not useful. But the patch is in good enough
shape to commit for wider testing.
Alexander Korotkov, reviewed by Noah Misch and Nathan Boley
The only reason this didn't work before was that parserOpenTable()
rejects composite types. So use relation_openrv() directly and
manually do the errposition() setup that parserOpenTable() does.
This fixes an oversight in commit 11cad29c91,
which introduced MergeAppend plans. Before that happened, we never
particularly cared about the sort ordering of scans of inheritance child
relations, since appending their outputs together would destroy any
ordering anyway. But now it's important to be able to match child relation
sort orderings to those of the surrounding query. The original coding of
add_child_rel_equivalences skipped ec_has_const EquivalenceClasses, on the
originally-correct grounds that adding child expressions to them was
useless. The effect of this is that when a parent variable is equated to
a constant, we can't recognize that index columns on the equivalent child
variables are not sort-significant; that is, we can't recognize that a
child index on, say, (x, y) is able to generate output in "ORDER BY y"
order when there is a clause "WHERE x = constant". Adding child
expressions to the (x, constant) EquivalenceClass fixes this, without any
downside that I can see other than a few more planner cycles expended on
such queries.
Per recent gripe from Robert McGehee. Back-patch to 9.1 where MergeAppend
was introduced.
Previously it was thought that it's impossible as the code stands, because
insertions create buffers as tuples are cascaded downwards, and index
split also creaters buffers eagerly for all halves. But the example from
Jay Levitt demonstrates that it can happen, when the root page is split.
It's in fact OK if the buffer doesn't exist, so we just need to remove the
sanity check. In fact, we've been discussing the possibility of destroying
empty buffers to conserve memory, which would render the sanity check
completely useless anyway.
Fix by Alexander Korotkov
The only toastable column now is datacl, but we don't really support
long ACLs anyway. The TOAST table should have been removed when the
pg_db_role_setting catalog was introduced in commit
2eda8dfb52, but I forgot to do that.
Per -hackers discussion on March 2011.
A prepared transaction can get new conflicts in and out after preparing, so
we cannot rely on the in- and out-flags stored in the statefile at prepare-
time. As a quick fix, make the conservative assumption that after a restart,
all prepared transactions are considered to have both in- and out-conflicts.
That can lead to unnecessary rollbacks after a crash, but that shouldn't be
a big problem in practice; you don't want prepared transactions to hang
around for a long time anyway.
Dan Ports
This makes it much more convenient to build tools for Postgres that are
separately compiled and require a matching CRC implementation.
To prevent multiple copies of the CRC polynomial tables being introduced
into the postgres binaries, they are now included in the static library
libpgport that is mainly meant for replacement system functions. That
seems like a bit of a kludge, but there's no better place.
This cleans up building of the tools pg_controldata and pg_resetxlog,
which previously had to build their own copies of pg_crc.o.
In the future, external programs that need access to the CRC tables can
include the tables directly from the new header file pg_crc_tables.h.
Daniel Farina, reviewed by Abhijit Menon-Sen and Tom Lane
We don't need to constrain the other side of an indexable join clause to
not be below an outer join; an example here is
SELECT FROM t1 LEFT JOIN t2 ON t1.a = t2.b LEFT JOIN t3 ON t2.c = t3.d;
We can consider an inner indexscan on t3.d using c = d as indexqual, even
though t2.c is potentially nulled by a previous outer join. The comparable
logic in orindxpath.c has always worked that way, but I was being overly
cautious here.
We already skip rewriting the table in these cases, but we still force a
whole table scan to validate the data. This can be skipped, and thus
we can make the whole ALTER TABLE operation just do some catalog touches
instead of scanning the table, when these two conditions hold:
(a) Old and new pg_constraint.conpfeqop match exactly. This is actually
stronger than needed; we could loosen things by way of operator
families, but it'd require a lot more effort.
(b) The functions, if any, implementing a cast from the foreign type to
the primary opcintype are the same. For this purpose, we can consider a
binary coercion equivalent to an exact type match. When the opcintype
is polymorphic, require that the old and new foreign types match
exactly. (Since ri_triggers.c does use the executor, the stronger check
for polymorphic types is no mere future-proofing. However, no core type
exercises its necessity.)
Author: Noah Misch
Committer's note: catalog version bumped due to change of the Constraint
node. I can't actually find any way to have such a node in a stored
rule, but given that we have "out" support for them, better be safe.
In commit 4016bdef8a I fixed a bunch of
ginxlog.c bugs having to do with not handling XLogReadBuffer failures
correctly. However, in ginRedoUpdateMetapage and ginRedoDeleteListPages,
I unaccountably thought that failure to read the metapage would be
impossible and just put in an elog(PANIC) call. This is of course wrong:
failure is exactly what will happen if the index got dropped (or rebuilt)
between creation of the WAL record and the crash we're trying to recover
from. I believe this explains Nicholas Wilson's recent report of these
errors getting reached.
Also, fix memory leak in forgetIncompleteSplit. This wasn't of much
concern when the code was written, but in a long-running standby server
page split records could be expected to accumulate indefinitely.
Back-patch to 8.4 --- before that, GIN didn't have a metapage.
Claiming that the typevar argument to DefineCompositeType() is const
was a plain lie. A similar case in DefineVirtualRelation() was
already changed in passing in commit 1575fbcb. Also clean up the now
unnecessary casts that used to cast away the const.
The "uncomplicated" case isn't materially less complicated than the full
case, certainly not enough so to justify duplicating nearly 500 lines
of code. The only extra work being done in the full path is zaptreesubs,
which is very cheap compared to everything else being done here, and
besides that I'm less than convinced that it's not needed in some cases
even without backrefs.
In nested sub-regex trees, lower-level nodes created DFAs and then
destroyed them again before exiting, which is a bit dumb considering that
the recursive search is likely to call those nodes again later. Instead
cache each created DFA until the end of pg_regexec(). This is basically a
space for time tradeoff, in that it might increase the maximum memory
usage. However, in most regex patterns there are not all that many subre
nodes, so not that many DFAs --- and in any case, the peak usage occurs
when reaching the bottom recursion level, and except for alternation cases
that's going to be the same anyway.
Apparently some primordial version of Spencer's engine needed cdissect()
and child functions to be able to continue matching from a previous
position when re-called. That is dead code, though, since trivial
inspection shows that cdissect can never be entered without having
previously done zapmem which resets the relevant retry counter. I have
also verified experimentally that no case in the Tcl regression tests
reaches cdissect with a nonzero retry value. Accordingly, remove that
logic. This doesn't really save any noticeable number of cycles in itself,
but it is one step towards making dissect() and cdissect() equivalent,
which will allow removing hundreds of lines of near-duplicated code.
Since struct subre's "retry" field is no longer particularly related to
any kind of retry, rename it to "id". As of this commit it's only used
for identifying a subre node in debug printouts, so you might think we
should get rid of the field entirely; but I have a plan for another use.
Cases where a back-reference is part of a larger subexpression that
is quantified have never worked in Spencer's regex engine, because
he used a compile-time transformation that neglected the need to
check the back-reference match in iterations before the last one.
(That was okay for capturing parens, and we still do it if the
regex has *only* capturing parens ... but it's not okay for backrefs.)
To make this work properly, we have to add an "iteration" node type
to the regex engine's vocabulary of sub-regex nodes. Since this is a
moderately large change with a fair risk of introducing new bugs of its
own, apply to HEAD only, even though it's a fix for a longstanding bug.
Both libpq and the backend would truncate a common name extracted from a
certificate at 32 bytes. Replace that fixed-size buffer with dynamically
allocated string so that there is no hard limit. While at it, remove the
code for extracting peer_dn, which we weren't using for anything; and
don't bother to store peer_cn longer than we need it in libpq.
This limit was not so terribly unreasonable when the code was written,
because we weren't using the result for anything critical, just logging it.
But now that there are options for checking the common name against the
server host name (in libpq) or using it as the user's name (in the server),
this could result in undesirable failures. In the worst case it even seems
possible to spoof a server name or user name, if the correct name is
exactly 32 bytes and the attacker can persuade a trusted CA to issue a
certificate in which that string is a prefix of the certificate's common
name. (To exploit this for a server name, he'd also have to send the
connection astray via phony DNS data or some such.) The case that this is
a realistic security threat is a bit thin, but nonetheless we'll treat it
as one.
Back-patch to 8.4. Older releases contain the faulty code, but it's not
a security problem because the common name wasn't used for anything
interesting.
Reported and patched by Heikki Linnakangas
Security: CVE-2012-0867
This check was overlooked when we added function execute permissions to the
system years ago. For an ordinary trigger function it's not a big deal,
since trigger functions execute with the permissions of the table owner,
so they couldn't do anything the user issuing the CREATE TRIGGER couldn't
have done anyway. However, if a trigger function is SECURITY DEFINER,
that is not the case. The lack of checking would allow another user to
install it on his own table and then invoke it with, essentially, forged
input data; which the trigger function is unlikely to realize, so it might
do something undesirable, for instance insert false entries in an audit log
table.
Reported by Dinesh Kumar, patch by Robert Haas
Security: CVE-2012-0866
This allows changing the location of the files that were previously
hard-coded to server.crt, server.key, root.crt, root.crl.
server.crt and server.key continue to be the default settings and are
thus required to be present by default if SSL is enabled. But the
settings for the server-side CA and CRL are now empty by default, and
if they are set, the files are required to be present. This replaces
the previous behavior of ignoring the functionality if the files were
not found.
When "vacuuming" a single btree page by removing LP_DEAD tuples, we are not
actually within a vacuum operation, but rather in an ordinary insertion
process that could well be running concurrently with a vacuum. So clearing
the cycleid is incorrect, and could cause the concurrent vacuum to miss
removing tuples that it needs to remove. This is a longstanding bug
introduced by commit e6284649b9 of
2006-07-25. I believe it explains Maxim Boguk's recent report of index
corruption, and probably some other previously unexplained reports.
In 9.0 and up this is a one-line fix; before that we need to introduce a
flag to tell _bt_delitems what to do.
First, as noted by Itagaki Takahiro, a datum of type JSON doesn't
need to be escaped. Second, ensure that numeric output not in
the form of a legal JSON number is quoted and escaped.
The syntax "\n*", that is a backref with a * quantifier directly applied
to it, has never worked correctly in Spencer's library. This has been an
open bug in the Tcl bug tracker since 2005:
https://sourceforge.net/tracker/index.php?func=detail&aid=1115587&group_id=10894&atid=110894
The core of the problem is in parseqatom(), which first changes "\n*" to
"\n+|" and then applies repeat() to the NFA representing the backref atom.
repeat() thinks that any arc leading into its "rp" argument is part of the
sub-NFA to be repeated. Unfortunately, since parseqatom() already created
the arc that was intended to represent the empty bypass around "\n+", this
arc gets moved too, so that it now leads into the state loop created by
repeat(). Thus, what was supposed to be an "empty" bypass gets turned into
something that represents zero or more repetitions of the NFA representing
the backref atom. In the original example, in place of
^([bc])\1*$
we now have something that acts like
^([bc])(\1+|[bc]*)$
At runtime, the branch involving the actual backref fails, as it's supposed
to, but then the other branch succeeds anyway.
We could no doubt fix this by some rearrangement of the operations in
parseqatom(), but that code is plenty ugly already, and what's more the
whole business of converting "x*" to "x+|" probably needs to go away to fix
another problem I'll mention in a moment. Instead, this patch suppresses
the *-conversion when the target is a simple backref atom, leaving the case
of m == 0 to be handled at runtime. This makes the patch in regcomp.c a
one-liner, at the cost of having to tweak cbrdissect() a little. In the
event I went a bit further than that and rewrote cbrdissect() to check all
the string-length-related conditions before it starts comparing characters.
It seems a bit stupid to possibly iterate through many copies of an
n-character backreference, only to fail at the end because the target
string's length isn't a multiple of n --- we could have found that out
before starting. The existing coding could only be a win if integer
division is hugely expensive compared to character comparison, but I don't
know of any modern machine where that might be true.
This does not fix all the problems with quantified back-references. In
particular, the code is still broken for back-references that appear within
a larger expression that is quantified (so that direct insertion of the
quantification limits into the BACKREF node doesn't apply). I think fixing
that will take some major surgery on the NFA code, specifically introducing
an explicit iteration node type instead of trying to transform iteration
into concatenation of modified regexps.
Back-patch to all supported branches. In HEAD, also add a regression test
case for this. (It may seem a bit silly to create a regression test file
for just one test case; but I'm expecting that we will soon import a whole
bunch of regex regression tests from Tcl, so might as well create the
infrastructure now.)
While this doesn't save a huge amount of runtime, it still seems worth
doing, especially since I realized that the data copying I did in my first
draft was quite unnecessary. In this version, once we have the results
cached, getting them back for re-use is really very cheap.
Also, remove the hard-wired limitation to not consider wctype.h results for
character codes above 255. It turns out that we can't push the limit as
far up as I'd originally hoped, because the regex colormap code is not
efficient enough to cope very well with character classes containing many
thousand letters, which a Unicode locale is entirely capable of producing.
Still, we can push it up to U+7FF (which I chose as the limit of 2-byte
UTF8 characters), which will at least make Eastern Europeans happy pending
a better solution. Thus, this commit resolves the specific complaint in
bug #6457, but not the more general issue that letters of non-western
alphabets are mostly not recognized as matching [[:alpha:]].
Create src/backend/regex/README to hold an implementation overview of
the regex package, and fill it in with some preliminary notes about
the code's DFA/NFA processing and colormap management. Much more to
do there of course.
Also, improve some code comments around the colormap and cvec code.
No functional changes except to add one missing assert.
Some line feeds are added to target lists and from lists to make
them more readable. By default they wrap at 80 columns if possible,
but the wrap column is also selectable - if 0 it wraps after every
item.
Andrew Dunstan, reviewed by Hitoshi Harada.
Sync our regex code with upstream changes since last time we did this,
which was Tcl 8.5.0 (see commit df1e965e12).
There are no functional changes here; the main point is just to lay down
a commit-log marker that somebody has looked at this recently, and to do
what we can to keep the two codebases comparable.
Formerly, we just punted when trying to estimate stats for variables coming
out of sub-queries using DISTINCT, on the grounds that whatever stats we
might have for underlying table columns would be inapplicable. But if the
sub-query has only one DISTINCT column, we can consider its output variable
as being unique, which is useful information all by itself. The scope of
this improvement is pretty narrow, but it costs nearly nothing, so we might
as well do it. Per discussion with Andres Freund.
This patch differs from the draft I submitted yesterday in updating various
comments about vardata.isunique (to reflect its extended meaning) and in
tweaking the interaction with security_barrier views. There does not seem
to be a reason why we can't use this sort of knowledge even when the
sub-query is such a view.
This extends the changes of commit 6252c4f9e2
so that we run the cleanup hook earlier for failure cases as well as
success cases. As before, the point is to avoid an assertion failure from
an Assert I added in commit a874fe7b4c, which
was meant to check that no user-written code can be called during portal
cleanup. This fixes a case reported by Pavan Deolasee in which the Assert
could be triggered during backend exit (see the new regression test case),
and also prevents the possibility that the cleanup hook is run after
portions of the portal's state have already been recycled. That doesn't
really matter in current usage, but it foreseeably could matter in the
future.
Back-patch to 9.1 where the Assert in question was added.
Per recent work by Peter Geoghegan, it's significantly faster to
tuplesort on a single sortkey if ApplySortComparator is inlined into
quicksort rather reached via a function pointer. It's also faster
in general to have a version of quicksort which is specialized for
sorting SortTuple objects rather than objects of arbitrary size and
type. This requires a couple of additional copies of the quicksort
logic, which in this patch are generate using a Perl script. There
might be some benefit in adding further specializations here too,
but thus far it's not clear that those gains are worth their weight
in code footprint.
The hstore and json datatypes both have record-conversion functions that
pay attention to column names in the composite values they're handed.
We used to not worry about inserting correct field names into tuple
descriptors generated at runtime, but given these examples it seems
useful to do so. Observe the nicer-looking results in the regression
tests whose results changed.
catversion bump because there is a subtle change in requirements for stored
rule parsetrees: RowExprs from ROW() constructs now have to include field
names.
Andrew Dunstan and Tom Lane
We don't normally allow quals to be pushed down into a view created
with the security_barrier option, but functions without side effects
are an exception: they're OK. This allows much better performance in
common cases, such as when using an equality operator (that might
even be indexable).
There is an outstanding issue here with the CREATE FUNCTION / ALTER
FUNCTION syntax: there's no way to use ALTER FUNCTION to unset the
leakproof flag. But I'm committing this as-is so that it doesn't
have to be rebased again; we can fix up the grammar in a future
commit.
KaiGai Kohei, with some wordsmithing by me.
If tuples were toasted, heap_multi_insert didn't update the ctid on the
original tuples. This caused a failure if there was an after trigger
(including a foreign key), on the table, and a tuple got toasted.
Per off-list report and test case from Ted Phelps
It's not entirely evident how the logic here relates to the
interval_transform function, so let's clue people in that they need to
check that if the rules change.
These were added to kwlist.h as unreserved keywords in separate patches,
but authors forgot to add them to the corresponding list in gram.y.
Because of that, even though they were supposed to be unreserved keywords,
they could not be used as identifiers. src/tools/check_keywords.pl is your
friend.
If somebody puts a window function in WHERE, we should complain about that
in so many words. The previous coding tended to complain about the window
function's arguments instead, which is likely to be misleading to users who
are unclear on the semantics of window functions; as seen for example in
bug #6440 from Matyas Novak.
Just another example of how "add new code at the end" is frequently a bad
heuristic.
This enables ALTER TABLE to skip table and index rebuilds when the
new type is unconstraint varbit, or when the allowable number of bits
is not decreasing.
Noah Misch, with review and a fix for an OID collision by me.
This enables ALTER TABLE to skip table and index rebuilds when a column
is changed to an unconstrained numeric, or when the scale is unchanged
and the precision does not decrease.
Noah Misch, with a few stylistic changes and a fix for an OID
collision by me.
Sometimes it may be useful to get actual row counts out of EXPLAIN
(ANALYZE) without paying the cost of timing every node entry/exit.
With this patch, you can say EXPLAIN (ANALYZE, TIMING OFF) to get that.
Tomas Vondra, reviewed by Eric Theise, with minor doc changes by me.
If LWLockWaitUntilFree was called before the first LWLockAcquire call, you
would either crash because of access to uninitialized array or account the
acquisition incorrectly. LWLockConditionalAcquire doesn't have this problem
because it doesn't update the lwlock stats.
In practice, this never happens because there is no codepath where you would
call LWLockWaitUntilfree before LWLockAcquire after a new process is
launched. But that's just accidental, there's no guarantee that that's
always going to be true in the future.
Spotted by Jeff Janes.
The postmaster was coded to treat any unexpected exit of the startup
process (i.e., the WAL replay process) as a catastrophic crash, and not try
to restart it. This was OK so long as the startup process could not have
any sibling postmaster children. However, if a hot-standby backend
crashes, we SIGQUIT the startup process along with everything else, and the
resulting exit is hardly "unexpected". Treating it as such meant we failed
to restart a standby server after any child crash at all, not only a crash
of the WAL replay process as intended. Adjust that. Back-patch to 9.0
where hot standby was introduced.
Although we will not even issue an XLOG_TBLSPC_DROP WAL record unless
removal of the tablespace's directories succeeds, that does not guarantee
that the same operation will succeed during WAL replay. Foreseeable
reasons for it to fail include temp files created in the tablespace by Hot
Standby backends, wrong directory permissions on a standby server, etc etc.
The original coding threw ERROR if replay failed to remove the directories,
but that is a serious overreaction. Throwing an error aborts recovery,
and worse means that manual intervention will be needed to get the database
to start again, since otherwise the same error will recur on subsequent
attempts to replay the same WAL record. And the consequence of failing to
remove the directories is only that some probably-small amount of disk
space is wasted, so it hardly seems justified to throw an error.
Accordingly, arrange to report such failures as LOG messages and keep going
when a failure occurs during replay.
Back-patch to 9.0 where Hot Standby was introduced. In principle such
problems can occur in earlier releases, but Hot Standby increases the odds
of trouble significantly. Given the lack of field reports of such issues,
I'm satisfied with patching back as far as the patch applies easily.
Originally, most of this code assumed that no Postgres backends could be
running concurrently with it, and so no locking could be needed. That
assumption fails in Hot Standby. While it's still true that Hot Standby
backends should never change values like nextXid, they can examine them,
and consistency is important in some cases such as when computing a
snapshot. Therefore, prudence requires that WAL replay code obtain the
relevant locks when modifying such variables, even though it can examine
them without taking a lock. We were following that coding rule in some
places but not all. This commit applies the coding rule uniformly to all
updates of ShmemVariableCache and MultiXactState fields; a search of the
replay routines did not find any other cases that seemed to be at risk.
In addition, this commit fixes a longstanding thinko in replay of NEXTOID
and checkpoint records: we tried to advance nextOid only if it was behind
the value in the WAL record, but the comparison would draw the wrong
conclusion if OID wraparound had occurred since the previous value.
Better to just unconditionally assign the new value, since OID assignment
shouldn't be happening during replay anyway.
The additional locking seems to be more in the nature of future-proofing
than fixing any live bug, so I am not going to back-patch it. The NEXTOID
fix will be back-patched separately.
RestoreBkpBlocks was in the habit of zeroing and refilling the target
buffer; which was perfectly safe when the code was written, but is unsafe
during Hot Standby operation. The reason is that we have coding rules
that allow backends to continue accessing a tuple in a heap relation while
holding only a pin on its buffer. Such a backend could see transiently
zeroed data, if WAL replay had occasion to change other data on the page.
This has been shown to be the cause of bug #6425 from Duncan Rance (who
deserves kudos for developing a sufficiently-reproducible test case) as
well as Bridget Frey's re-report of bug #6200. It most likely explains the
original report as well, though we don't yet have confirmation of that.
To fix, change the code so that only bytes that are supposed to change will
change, even transiently. This actually saves cycles in RestoreBkpBlocks,
since it's not writing the same bytes twice.
Also fix seq_redo, which has the same disease, though it has to work a bit
harder to meet the requirement.
So far as I can tell, no other WAL replay routines have this type of bug.
In particular, the index-related replay routines, which would certainly be
broken if they had to meet the same standard, are not at risk because we
do not have coding rules that allow access to an index page when not
holding a buffer lock on it.
Back-patch to 9.0 where Hot Standby was added.
All other WAL redo routines either call RestoreBkpBlocks() or Assert that
they haven't been passed any backup blocks. Make this one do likewise.
Also, fix incorrect routine name in its failure message.
When testing bits (but not when setting or clearing them), we now
won't check whether the map has been extended. This significantly
improves performance in the case where the visibility map doesn't
exist yet, by avoiding an extra system call per tuple. To make
sure backends notice eventually, send an smgr inval on VM extension.
Dean Rasheed, with minor modifications by me.
On some platforms, strtod() reports ERANGE for a denormalized value (ie,
one that can be represented as distinct from zero, but is too small to have
full precision). On others, it doesn't. It seems better to try to accept
these values consistently, so add a test to see if the result value
indicates a true out-of-range condition. This should be okay per Single
Unix Spec. On machines where the underlying math isn't IEEE standard, the
behavior for such small numbers may not be very consistent, but then it
wouldn't be anyway.
Marti Raudsepp, after a proposal by Jeroen Vermeulen
Like the XML data type, we simply store JSON data as text, after checking
that it is valid. More complex operations such as canonicalization and
comparison may come later, but this is enough for not.
There are a few open issues here, such as whether we should attempt to
detect UTF-8 surrogate pairs represented as \uXXXX\uYYYY, but this gets
the basic framework in place.
If there was a wait-until-free process in the head of the wait queue,
followed by an exclusive locker, the exclusive locker was not be woken up
as it should.
The sequence USAGE privilege is sufficiently similar to the SQL
standard that it seems reasonable to show in the information schema.
Also add some compatibility notes about it on the GRANT reference
page.
When a backend needs to flush the WAL, and someone else is already flushing
the WAL, wait until it releases the WALInsertLock and check if we still need
to do the flush or if the other backend already did the work for us, before
acquiring WALInsertLock. This helps group commit, because when the WAL flush
finishes, all the backends that were waiting for it can be woken up in one
go, and the can all concurrently observe that they're done, rather than
waking them up one by one in a cascading fashion.
This is based on a new LWLock function, LWLockWaitUntilFree(), which has
peculiar semantics. If the lock is immediately free, it grabs the lock and
returns true. If it's not free, it waits until it is released, but then
returns false without grabbing the lock. This is used in XLogFlush(), so
that when the lock is acquired, the backend flushes the WAL, but if it's
not, the backend first checks the current flush location before retrying.
Original patch and benchmarking by Peter Geoghegan and Simon Riggs, although
this patch as committed ended up being very different from that.
When default_text_search_config, default_tablespace, or temp_tablespaces
setting is set per-user or per-database, with an "ALTER USER/DATABASE SET
..." statement, don't throw an error if the text search configuration or
tablespace does not exist. In case of text search configuration, even if
it doesn't exist in the current database, it might exist in another
database, where the setting is intended to have its effect. This behavior
is now the same as search_path's.
Tablespaces are cluster-wide, so the same argument doesn't hold for
tablespaces, but there's a problem with pg_dumpall: it dumps "ALTER USER
SET ..." statements before the "CREATE TABLESPACE" statements. Arguably
that's pg_dumpall's fault - it should dump the statements in such an order
that the tablespace is created first and then the "ALTER USER SET
default_tablespace ..." statements after that - but it seems better to be
consistent with search_path and default_text_search_config anyway. Besides,
you could still create a dump that throws an error, by creating the
tablespace, running "ALTER USER SET default_tablespace", then dropping the
tablespace and running pg_dumpall on that.
Backpatch to all supported versions.
btcostestimate() makes an estimate of the number of index tuples that will
be visited based on knowledge of which index clauses can actually bound the
scan within nbtree. However, it forgot to account for partial indexes in
this calculation, with the result that the cost of the index scan could be
significantly overestimated for a partial index. Fix that by merging the
predicate with the abbreviated indexclause list, in the same way as we do
with the full list to estimate how many heap tuples will be visited.
Also, slightly increase the "fudge factor" that's meant to give preference
to smaller indexes over larger ones. While this is applied to all indexes,
it's most important for partial indexes since it can be the only factor
that makes a partial index look cheaper than a similar full index.
Experimentation shows that the existing value is so small as to easily get
swamped by noise such as page-boundary-roundoff behavior. I'm tempted to
kick it up more than this, but will refrain for now.
Per report from Ruben Blanco. These are long-standing issues, but given
the lack of prior complaints I'm not going to risk changing planner
behavior in back branches by back-patching.
In commit 57664ed25e, I made the planner
wrap non-simple-variable outputs of appendrel children (IOW, child SELECTs
of UNION ALL subqueries) inside PlaceHolderVars, in order to solve some
issues with EquivalenceClass processing. However, this means that any
upper-level WHERE clauses mentioning such outputs will now contain
PlaceHolderVars after they're pushed down into the appendrel child,
and that prevents indxpath.c from recognizing that they could be matched
to index expressions. To fix, add explicit stripping of PlaceHolderVars
from index operands, same as we have long done for RelabelType nodes.
Add a regression test covering both this and the plain-UNION case (which
is a totally different code path, but should also be able to do it).
Per bug #6416 from Matteo Beccati. Back-patch to 9.1, same as the
previous change.
Formerly we passed an empty list to each per-child-table invocation of
grouping_planner, and then merged the results into the global list.
However, that fails if there's a CTE attached to the statement, because
create_ctescan_plan uses the list to find the plan referenced by a CTE
reference; so it was unable to find any CTEs attached to the outer UPDATE
or DELETE. But there's no real reason not to use the same list throughout
the process, and doing so is simpler and faster anyway.
Per report from Josh Berkus of "could not find plan for CTE" failures.
Back-patch to 9.1 where we added support for WITH attached to UPDATE or
DELETE. Add some regression test cases, too.
We can't just skip initializing such subplans, because the referencing CTE
node will expect to find the subplan available when it initializes. That
in turn means that ExecInitModifyTable must allow the case (which actually
it needed to do anyway, since there's no guarantee that ModifyTable is
exactly at the top of the CTE plan tree). So move the complaint about not
being allowed in EvalPlanQual mode to execution instead of initialization.
Testing turned up yet another problem, which is that we'd try to
re-initialize the result relation's index list, leading to leaks and
dangling pointers.
Per report from Phil Sorber. Back-patch to 9.1 where data-modifying CTEs
were introduced.
After the planner was fixed to convert some IN/EXISTS subqueries into
semijoins or antijoins, we had to prevent it from doing that in some
cases where the plans risked getting much worse. The reason the plans
got worse was that in the unoptimized implementation, subqueries could
reference parameters from the outer query at any join level, and so
full table scans could be avoided even if they were one or more levels
of join below where the semi/anti join would be. Now that we have
sufficient mechanism in the planner to handle such cases properly,
it should no longer be necessary to play dumb here.
This reverts commits 07b9936a0f and
cd1f0d04bf. The latter was a stopgap
fix that wasn't really sufficiently analyzed at the time. Rather
than just restricting ourselves to cases where the new join can be
stacked on the right-hand input, we should also consider whether it
can be stacked on the left-hand input.
This patch fixes the planner so that it can generate nestloop-with-
inner-indexscan plans even with one or more levels of joining between
the indexscan and the nestloop join that is supplying the parameter.
The executor was fixed to handle such cases some time ago, but the
planner was not ready. This should improve our plans in many situations
where join ordering restrictions formerly forced complete table scans.
There is probably a fair amount of tuning work yet to be done, because
of various heuristics that have been added to limit the number of
parameterized paths considered. However, we are not going to find out
what needs to be adjusted until the code gets some real-world use, so
it's time to get it in there where it can be tested easily.
Note API change for index AM amcostestimate functions. I'm not aware of
any non-core index AMs, but if there are any, they will need minor
adjustments.
Hitherto, the information schema only showed explicitly granted
privileges that were visible in the *acl catalog columns. If no
privileges had been granted, the implicit privileges were not shown.
To fix that, add an SQL-accessible version of the acldefault()
function, and use that inside the aclexplode() calls to substitute the
catalog-specific default privilege set for null values.
reviewed by Abhijit Menon-Sen
In e5e2fc842c, blank lines were removed
after a comment block, which now looks as though the comment refers to
the immediately following code, but it actually refers to the
preceding code. So put the blank lines back.
This has been the behavior already in most cases, but through
omission, ALTER DOMAIN / OWNER TO and ALTER DOMAIN / SET SCHEMA would
silently work on non-domain types as well.
Those fields only appear in the structs so that genbki.pl can create
the BKI bootstrap files for the catalogs. But they are not actually
usable from C. So hiding them can prevent coding mistakes, saves
stack space, and can help the compiler.
In certain catalogs, the first variable-length field has been kept
visible after manual inspection. These exceptions are noted in C
comments.
reviewed by Tom Lane
Normally, accessing variable-length members of catalog structures past
the first one doesn't work at all. Here, it happened to work because
indnatts was checked to be 1, and so the defined FormData_pg_index
layout, using int2vector[1] and oidvector[1] for variable-length
arrays, happened to match the actual memory layout. But it's a very
fragile assumption, and it's not in a performance-critical path, so
code it properly using heap_getattr() instead.
bug analysis by Tom Lane
Our own qsort_arg() implementation doesn't have the defect previously
observed to affect only QNX 4, so it seems sufficiently to assert that
it isn't broken rather than retesting. Also, update a few comments to
clarify why it's valuable to retain a tie-break rule based on CTID
during index builds.
Peter Geoghegan, with slight tweaks by me.
We now use the same error message for ALTER TABLE .. ADD COLUMN or
ALTER TABLE .. RENAME COLUMN that we do for CREATE TABLE. The old
message was accurate, but might be confusing to users not aware of our
system columns.
Vik Reykja, with some changes by me, and further proofreading by Tom Lane
To make it wake up promptly when activity starts again, backends nudge it
by setting a latch in MarkBufferDirty(). The latch is kept set while
bgwriter is active, so there is very little overhead from that when the
system is busy. It is only armed before going into longer sleep.
Peter Geoghegan, with some changes by me.
This doesn't do anything useful just yet, but is intended as supporting
infrastructure for allowing sepgsql to sensibly check DROP permissions.
KaiGai Kohei and Robert Haas
Add counters for number and size of temporary files used
for spill-to-disk queries for each database to the
pg_stat_database view.
Tomas Vondra, review by Magnus Hagander
Rip out a regression test that doesn't play well with settings put in
place by the build farm, and rewrite the code in CheckIndexCompatible
in a hopefully more transparent style.
This gives up the "don't rewrite the index" behavior in a couple of
relatively unimportant cases, such as changing between an array type
and an unconstrained domain over that array type, in return for
making this code more future-proof.
Noah Misch
Base backup follows recommended procedure, plus goes to great
lengths to ensure that partial page writes are avoided.
Jun Ishizuka and Fujii Masao, with minor modifications
This reports the depth level of triggers currently in execution, or zero
if not called from inside a trigger.
No catversion bump in this patch, but you have to initdb if you want
access to the new function.
Author: Kevin Grittner
Replication occurs only to memory on standby, not to disk,
so provides additional performance if user wishes to
reduce durability level slightly. Adds concept of multiple
independent sync rep queues.
Fujii Masao and Simon Riggs
We log AccessExclusiveLocks for replay onto standby nodes,
but because of timing issues on ProcArray it is possible to
log a lock that is still held by a just committed transaction
that is very soon to be removed. To avoid any timing issue we
avoid applying locks made by transactions with InvalidXid.
Simon Riggs, bug report Tom Lane, diagnosis Pavan Deolasee
This separates the state (running/idle/idleintransaction etc) into
it's own field ("state"), and leaves the query field containing just
query text.
The query text will now mean "current query" when a query is running
and "last query" in other states. Accordingly,the field has been
renamed from current_query to query.
Since backwards compatibility was broken anyway to make that, the procpid
field has also been renamed to pid - along with the same field in
pg_stat_replication for consistency.
Scott Mead and Magnus Hagander, review work from Greg Smith
That avoids errors when the functions are used in queries like "SELECT
pg_relation_size(oid) FROM pg_class", and a table is dropped concurrently.
Phil Sorber
When the only remaining active transactions are READ ONLY, we do a "partial
cleanup" of committed transactions because certain types of conflicts
aren't possible anymore. For committed r/w transactions, we release the
SIREAD locks but keep the SERIALIZABLEXACT. However, for committed r/o
transactions, we can go further and release the SERIALIZABLEXACT too. The
problem was with the latter case: we were returning the SERIALIZABLEXACT to
the free list without removing it from the finished list.
The only real change in the patch is the SHMQueueDelete line, but I also
reworked some of the surrounding code to make it obvious that r/o and r/w
transactions are handled differently -- the existing code felt a bit too
clever.
Dan Ports
This prevents the postmaster from unexpectedly croaking if postgresql.conf
contains something like:
include 'invalid_directory_name'
Noah Misch. Reviewed by Tom Lane and myself.