As with initdb these programs need to run with a restricted token, and
if they don't pg_upgrade will fail when run as a user with Adminstrator
privileges.
Backpatch to all live branches. On the development branch the code is
reorganized so that the restricted token code is now in a single
location. On the stable bramches a less invasive change is made by
simply copying the relevant code to pg_upgrade.c and pg_resetxlog.c.
Patches and bug report from Muhammad Asif Naeem, reviewed by Michael
Paquier, slightly edited by me.
_hash_splitbucket() obtained the base page of the new bucket by calling
_hash_getnewbuf(), but it held no exclusive lock that would prevent some
other process from calling _hash_getnewbuf() at the same time. This is
contrary to _hash_getnewbuf()'s API spec and could in fact cause failures.
In practice, we must only call that function while holding write lock on
the hash index's metapage.
An additional problem was that we'd already modified the metapage's bucket
mapping data, meaning that failure to extend the index would leave us with
a corrupt index.
Fix both issues by moving the _hash_getnewbuf() call to just before we
modify the metapage in _hash_expandtable().
Unfortunately there's still a large problem here, which is that we could
also incur ENOSPC while trying to get an overflow page for the new bucket.
That would leave the index corrupt in a more subtle way, namely that some
index tuples that should be in the new bucket might still be in the old
one. Fixing that seems substantially more difficult; even preallocating as
many pages as we could possibly need wouldn't entirely guarantee that the
bucket split would complete successfully. So for today let's just deal
with the base case.
Per report from Antonin Houska. Back-patch to all active branches.
... and rename it and its sibling array_offsets to array_position and
array_positions, to account for the changed behavior.
Having the functions return subscripts better matches existing practice,
and is better suited to using the result value as a subscript into the
array directly. For one-based arrays, the new definition is identical
to what was originally committed.
(We use the term "subscript" in the documentation, which is what we use
whenever we talk about arrays; but the functions themselves are named
using the word "position" to match the standard-defined POSITION()
functions.)
Author: Pavel Stěhule
Behavioral problem noted by Dean Rasheed.
ReindexIndex() trusts a parser-built RangeVar with the persistence to
use for the new copy of the index; but the parser naturally does not
know what's the persistence of the original index. To find out the
correct persistence, grab it from relcache.
This bug was introduced by commit 85b506bbfc, and therefore no
backpatch is necessary.
Bug reported by Thom Brown, analysis and patch by Michael Paquier; test
case provided by Fabrízio de Royes Mello.
The previous coding in get_const_expr() tried to avoid quoting integer,
float, and numeric literals if at all possible. While that looks nice,
it means that dumped expressions might re-parse to something that's
semantically equivalent but not the exact same parsetree; for example
a FLOAT8 constant would re-parse as a NUMERIC constant with a cast to
FLOAT8. Though the result would be the same after constant-folding,
this is problematic in certain contexts. In particular, Jeff Davis
pointed out that this could cause unexpected failures in ALTER INHERIT
operations because of child tables having not-exactly-equivalent CHECK
expressions. Therefore, favor correctness over legibility and dump
such constants in quotes except in the limited cases where they'll
be interpreted as the same type even without any casting.
This results in assorted small changes in the regression test outputs,
and will affect display of user-defined views and rules similarly.
The odds of that causing problems in the field seem non-negligible;
given the lack of previous complaints, it seems best not to change
this in the back branches.
BackendIdGetTransactionIds() neglected the possibility that the PROC
pointer in a ProcState array entry is null. In current usage, this could
only crash if the other backend had exited since pgstat_read_current_status
saw it as active, which is a pretty narrow window. But it's reachable in
the field, per bug #12918 from Vladimir Borodin.
Back-patch to 9.4 where the faulty code was introduced.
Bugs all spotted by Coverity, including wrong realloc() size request
and memory leaks. Cosmetic improvements by me.
The usage of the global variable "filemap" here is still pretty awful,
but at least I got rid of the gratuitous aliasing in several routines
(which was helping to annoy Coverity, as well as being a bug risk).
Slow functions in index expressions might cause this loop to take long
enough to make it worth being cancellable. Probably it would be enough
to call CHECK_FOR_INTERRUPTS here, but for consistency with other
per-sample-row loops in this file, let's use vacuum_delay_point.
Report and patch by Jeff Janes. Back-patch to all supported branches.
Previously the funcCtx was a child of the tmpCtx, but that was broken
by commit eaa5808e8e, which made
MemoryContextReset() delete, not reset, child contexts. The behavior of
having a tmpCtx reset also clear the other context seems rather dubious
anyway, so let's just disentangle them. Per report from Erik Rijkers.
In passing, fix badly-inaccurate comments about these contexts.
If set, the pager will not be used unless this many lines are to be
displayed, even if that is more than the screen depth. Default is zero,
meaning it's disabled.
There is probably more work to be done in giving the user control over
when the pager is used, particularly when wide output forces use of the
pager regardless of how many lines there are, but this is a start.
We cannot use the index's tuple descriptor directly to describe the index
tuples returned in an index-only scan. That's because the index might use
a different datatype for the values stored on disk than the type originally
indexed. As long as they were both pass-by-ref, it worked, but will not work
for pass-by-value types of different sizes. I noticed this as a crash when I
started hacking a patch to add fetch methods to btree_gist.
This improves on commit bbfd7edae5 by
making two simple changes:
* pg_attribute_noreturn now takes parentheses, ie pg_attribute_noreturn().
Likewise pg_attribute_unused(), pg_attribute_packed(). This reduces
pgindent's tendency to misformat declarations involving them.
* attributes are now always attached to function declarations, not
definitions. Previously some places were taking creative shortcuts,
which were not merely candidates for bad misformatting by pgindent
but often were outright wrong anyway. (It does little good to put a
noreturn annotation where callers can't see it.) In any case, if
we would like to believe that these macros can be used with non-gcc
compilers, we should avoid gratuitous variance in usage patterns.
I also went through and manually improved the formatting of a lot of
declarations, and got rid of excessively repetitive (and now obsolete
anyway) comments informing the reader what pg_attribute_printf is for.
This adds a new GiST opclass method, 'fetch', which is used to reconstruct
the original Datum from the value stored in the index. Also, the 'canreturn'
index AM interface function gains a new 'attno' argument. That makes it
possible to use index-only scans on a multi-column index where some of the
opclasses support index-only scans but some do not.
This patch adds support in the box and point opclasses. Other opclasses
can added later as follow-on patches (btree_gist would be particularly
interesting).
Anastasia Lubennikova, with additional fixes and modifications by me.
Several submitted and even committed patches have run into the problem
that C89, our baseline, does not provide minimum/maximum values for
various integer datatypes. C99's stdint.h does, but we can't rely on
it.
Several parts of the code defined limits locally, so instead centralize
the definitions to c.h.
This patch also changes the more obvious usages of literal limit values;
there's more places that could be changed, but it's less clear whether
it's beneficial to change those.
Author: Andrew Gierth
Discussion: 87619tc5wc.fsf@news-spur.riddles.org.uk
Since commit a2e35b53c3, most CREATE and ALTER commands return the
ObjectAddress of the affected object. This is useful for event triggers
to try to figure out exactly what happened. This patch extends this
idea a bit further to cover ALTER TABLE as well: an auxiliary
ObjectAddress is returned for each of several subcommands of ALTER
TABLE. This makes it possible to decode with precision what happened
during execution of any ALTER TABLE command; for instance, which
constraint was added by ALTER TABLE ADD CONSTRAINT, or which parent got
dropped from the parents list by ALTER TABLE NO INHERIT.
As with the previous patch, there is no immediate user-visible change
here.
This is all really just continuing what c504513f83 started.
Reviewed by Stephen Frost.
The POSIX spec says that rint() rounds halfway cases to nearest even.
Our substitute implementation failed to do that, rather rounding halfway
cases away from zero; and it also got some other cases (such as minus
zero) wrong. This led to observable cross-platform differences, as
reported in bug #12885 from Rich Schaaf; in particular, casting from
float to int didn't honor round-to-nearest-even on builds using rint.c.
Implement something that attempts to cover all cases per spec, and add
some simple regression tests so that we'll notice if any platforms still
get this wrong.
Although this is a bug fix, no back-patch, as a behavioral change in
the back branches was agreed not to be a good idea.
Pedro Gimeno Fortea, reviewed by Michael Paquier and myself
Even though the main benefit of the Lehman and Yao algorithm for
btrees is that no locks need be held between page reads in an
index search, we were holding a buffer pin on each leaf page after
it was read until we were ready to read the next one. The reason
was so that we could treat this as a weak lock to create an
"interlock" with vacuum's deletion of heap line pointers, even
though our README file pointed out that this was not necessary for
a scan using an MVCC snapshot.
The main goal of this patch is to reduce the blocking of vacuum
processes by in-progress btree index scans (including a cursor
which is idle), but the code rearrangement also allows for one
less buffer content lock to be taken when a forward scan steps from
one page to the next, which results in a small but consistent
performance improvement in many workloads.
This patch leaves behavior unchanged for some cases, which can be
addressed separately so that each case can be evaluated on its own
merits. These unchanged cases are when a scan uses a non-MVCC
snapshot, an index-only scan, and a scan of a btree index for which
modifications are not WAL-logged. If later patches allow all of
these cases to drop the buffer pin after reading a leaf page, then
the btree vacuum process can be simplified; it will no longer need
the "super-exclusive" lock to delete tuples from a page.
Reviewed by Heikki Linnakangas and Kyotaro Horiguchi
I failed to realize that server names reported in the object args array
would get quoted, which is wrong; remove that, making sure that it's
only quoted in the string-formatted identity.
This bug was introduced by my commit cf34e373, which was backpatched,
but since object name/args arrays are new in commit a676201490, there
is no need to backpatch this any further.
ExecOpenScanRelation assumed that any relation listed in the ExecRowMark
list has been locked by InitPlan; but this is not true if the rel's
markType is ROW_MARK_COPY, which is possible if it's a foreign table.
In most (possibly all) cases, failure to acquire a lock here isn't really
problematic because the parser, planner, or plancache would have taken the
appropriate lock already. In principle though it might leave us vulnerable
to working with a relation that we hold no lock on, and in any case if the
executor isn't depending on previously-taken locks otherwise then it should
not do so for ROW_MARK_COPY relations.
Noted by Etsuro Fujita. Back-patch to all active versions, since the
inconsistency has been there a long time. (It's almost certainly
irrelevant in 9.0, since that predates foreign tables, but the code's
still wrong on its own terms.)
Previously, CHECK constraints of the same scope were checked in whatever
order they happened to be read from pg_constraint. (Usually, but not
reliably, this would be creation order for domain constraints and reverse
creation order for table constraints, because of differing implementation
details.) Nondeterministic results of this sort are problematic at least
for testing purposes, and in discussion it was agreed to be a violation of
the principle of least astonishment. Therefore, borrow the principle
already established for triggers, and apply such checks in name order
(using strcmp() sort rules). This lets users control the check order
if they have a mind to.
Domain CHECK constraints still follow the rule of checking lower nested
domains' constraints first; the name sort only applies to multiple
constraints attached to the same domain.
In passing, I failed to resist the temptation to wordsmith a bit in
create_domain.sgml.
Apply to HEAD only, since this could result in a behavioral change in
existing applications, and the potential regression test failures have
not actually been observed in our buildfarm.
It worked in my Windows VM with VS2013, but buildfarm animal mastodon,
running MSVC 2005, was not happy. Amit Kapila also reported a similar error
earlier in his environment. Let's see if this helps.
Earlier versions of this tool were available (and still are) on github.
Thanks to Michael Paquier, Alvaro Herrera, Peter Eisentraut, Amit Kapila,
and Satoshi Nagayasu for review.
Recovery delays are implemented by waiting on a latch, and latches take
milliseconds as a parameter. The required amount of waiting was computed
using microsecond resolution though and the wait loop's abort condition
was checking the delay in microseconds as well. This could lead to
short spurts of busy looping when the overall wait time was below a
millisecond, but above 0 microseconds.
Instead just formulate the wait loop's abort condition in millisecond
granularity as well. Given that that's recovery_min_apply_delay
resolution, it seems harmless to not wait for less than a millisecond.
Backpatch to 9.4 where recovery_min_apply_delay was introduced.
Discussion: 20150323141819.GH26995@alap3.anarazel.de
dsm_control->nitems never decreases, so this is testing whether the
server has *ever* run out of DSM segments, not whether it is
*currently* out of DSM segments.
Reported off-list by Amit Kapila.
Revert "to_char(float4/8): zero pad to specified length". There are
too many platform-specific problems, and the proper rounding is missing.
Also revert companion patch 9d61b9953c.
Foreign tables can now be inheritance children, or parents. Much of the
system was already ready for this, but we had to fix a few things of
course, mostly in the area of planner and executor handling of row locks.
As side effects of this, allow foreign tables to have NOT VALID CHECK
constraints (and hence to accept ALTER ... VALIDATE CONSTRAINT), and to
accept ALTER SET STORAGE and ALTER SET WITH/WITHOUT OIDS. Continuing to
disallow these things would've required bizarre and inconsistent special
cases in inheritance behavior. Since foreign tables don't enforce CHECK
constraints anyway, a NOT VALID one is a complete no-op, but that doesn't
mean we shouldn't allow it. And it's possible that some FDWs might have
use for SET STORAGE or SET WITH OIDS, though doubtless they will be no-ops
for most.
An additional change in support of this is that when a ModifyTable node
has multiple target tables, they will all now be explicitly identified
in EXPLAIN output, for example:
Update on pt1 (cost=0.00..321.05 rows=3541 width=46)
Update on pt1
Foreign Update on ft1
Foreign Update on ft2
Update on child3
-> Seq Scan on pt1 (cost=0.00..0.00 rows=1 width=46)
-> Foreign Scan on ft1 (cost=100.00..148.03 rows=1170 width=46)
-> Foreign Scan on ft2 (cost=100.00..148.03 rows=1170 width=46)
-> Seq Scan on child3 (cost=0.00..25.00 rows=1200 width=46)
This was done mainly to provide an unambiguous place to attach "Remote SQL"
fields, but it is useful for inherited updates even when no foreign tables
are involved.
Shigeru Hanada and Etsuro Fujita, reviewed by Ashutosh Bapat and Kyotaro
Horiguchi, some additional hacking by me
Previously, zero padding was limited to the internal length, rather than
the specified length. This allows it to match to_char(int/numeric), which
always padded to the specified length.
Regression tests added.
BACKWARD INCOMPATIBILITY
Instead of copying xlogreader.c and *desc.c files into the source directory,
build them where they are. That's what we do for other binaries that need to
compile and link in files from elsewhere in the source tree.
The commit history suggests that it was done this way because of issues with
older versions of MSVC. I think this should work, but we'll see if the
buildfarm complains.
On platforms where we support 128bit integers, use them to implement
faster transition functions for sum(int8), avg(int8),
var_*(int2/int4),stdev_*(int2/int4). Where not supported continue to use
numeric as a transition type.
In some synthetic benchmarks this has been shown to provide significant
speedups.
Bumps catversion.
Discussion: 544BB5F1.50709@proxel.se
Author: Andreas Karlsson
Reviewed-By: Peter Geoghegan, Petr Jelinek, Andres Freund,
Oskari Saarenmaa, David Rowley
We will, for the foreseeable future, not expose 128 bit datatypes to
SQL. But being able to use 128bit math will allow us, in a later patch,
to use 128bit accumulators for some aggregates; leading to noticeable
speedups over using numeric.
So far we only detect a gcc/clang extension that supports 128bit math,
but no 128bit literals, and no *printf support. We might want to expand
this in the future to further compilers; if there are any that that
provide similar support.
Discussion: 544BB5F1.50709@proxel.se
Author: Andreas Karlsson, with significant editorializing by me
Reviewed-By: Peter Geoghegan, Oskari Saarenmaa
The pg_stat and pg_signal-related functions have been using GetUserId()
instead of has_privs_of_role() for checking if the current user should
be able to see details in pg_stat_activity or signal other processes,
requiring a user to do 'SET ROLE' for inheirited roles for a permissions
check, unlike other permissions checks.
This patch changes that behavior to, instead, act like most other
permission checks and use has_privs_of_role(), removing the 'SET ROLE'
need. Documentation and error messages updated accordingly.
Per discussion with Alvaro, Peter, Adam (though not using Adam's patch),
and Robert.
Reviewed by Jeevan Chalke.
Right now, there's only one flag, DSM_CREATE_NULL_IF_MAXSEGMENTS,
which suppresses the error that would normally be thrown when the
maximum number of segments already exists, instead returning NULL.
It might be useful to add more flags in the future, such as one to
ignore allocation errors, but I haven't done that here.
Previously, GetBackgroundWorkerPid() would return BGWH_NOT_YET_STARTED
if the slot used for the worker registration had not been reused by
unrelated activity, and BGWH_STOPPED if it had. Either way, a process
that had requested notification when the state of one of its
background workers changed did not receive such notifications. Fix
things so that GetBackgroundWorkerPid() always returns BGWH_STOPPED in
this situation, so that we do not erroneously give waiters the
impression that the worker will eventually be started; and send
notifications just as we would if the process terminated after having
been started, so that it's possible to wait for the postmaster to
process a worker termination request without polling.
Discovered by Amit Kapila during testing of parallel sequential scan.
Analysis and fix by me. Back-patch to 9.4; there may not be anyone
relying on this interface yet, but if anyone is, the new behavior is a
clear improvement.
Since commit cb4a3b04 we were already doing this for the Cygwin/mingw
toolchains, but MSVC had not been updated to do it. At Install.pm time,
the Makefile (or GNUmakefile) is inspected, and if a line matching
SO_MAJOR_VERSION is found (indicating a shared library is being built),
then files with the .dll extension are set to be installed in bin/
rather than lib/, while files with .lib extension are installed in lib/.
This makes the MSVC toolchain up to date with cygwin/mingw.
This removes ad-hoc hacks that were copying files into bin/ or lib/
manually (libpq.dll in particular was already being copied into bin).
So while this is a rather ugly kludge, it's still cleaner than what was
there before.
Author: Michael Paquier
Reviewed by: Asif Naeem
We were involving the parser too much in setting up initial vacuuming
parameters. This patch moves that responsibility elsewhere to simplify
code, and also to make future additions easier. To do this, create a
new struct VacuumParams which is filled just prior to vacuum execution,
instead of at parse time; for user-invoked vacuuming this is set up in a
new function ExecVacuum, while autovacuum sets it up by itself.
While at it, add a new member VACOPT_SKIPTOAST to enum VacuumOption,
only set by autovacuum, which is used to disable vacuuming of the toast
table instead of the old do_toast parameter; this relieves the argument
list of vacuum() and some callees a bit. This partially makes up for
having added more arguments in an effort to avoid having autovacuum from
constructing a VacuumStmt parse node.
Author: Michael Paquier. Some tweaks by Álvaro
Reviewed by: Robert Haas, Stephen Frost, Álvaro Herrera
Since the array length check is using a post-increment operator, the
compiler complains that there's a potential write to one element beyond
the end of the array. This is not possible currently: the only path to
this function is through pg_get_object_address(), which already verifies
that the input array is no more than two elements in length. Still, a
bug is a bug.
No idea why my compiler doesn't complain about this ...
Pointed out by Dead Rasheed and Peter Eisentraut
In the spirit of 890192e99a and 4464303405f: have get_object_address
understand individual pg_amop and pg_amproc objects. There is no way to
refer to such objects directly in the grammar -- rather, they are almost
always considered an integral part of the opfamily that contains them.
(The only case that deals with them individually is ALTER OPERATOR
FAMILY ADD/DROP, which carries the opfamily address separately and thus
does not need it to be part of each added/dropped element's address.)
In event triggers it becomes possible to become involved with individual
amop/amproc elements, and this commit enables pg_get_object_address to
do so as well.
To make the overall coding simpler, this commit also slightly changes
the get_object_address representation for opclasses and opfamilies:
instead of having the AM name in the objargs array, I moved it as the
first element of the objnames array. This enables the new code to use
objargs for the type names used by pg_amop and pg_amproc.
Reviewed by: Stephen Frost
This patch fixes two inadequacies of the PlanRowMark representation.
First, that the original LockingClauseStrength isn't stored (and cannot be
inferred for foreign tables, which always get ROW_MARK_COPY). Since some
PlanRowMarks are created out of whole cloth and don't actually have an
ancestral RowMarkClause, this requires adding a dummy LCS_NONE value to
enum LockingClauseStrength, which is fairly annoying but the alternatives
seem worse. This fix allows getting rid of the use of get_parse_rowmark()
in FDWs (as per the discussion around commits 462bd95705 and
8ec8760fc8), and it simplifies some things elsewhere.
Second, that the representation assumed that all child tables in an
inheritance hierarchy would use the same RowMarkType. That's true today
but will soon not be true. We add an "allMarkTypes" field that identifies
the union of mark types used in all a parent table's children, and use
that where appropriate (currently, only in preprocess_targetlist()).
In passing fix a couple of minor infelicities left over from the SKIP
LOCKED patch, notably that _outPlanRowMark still thought waitPolicy
is a bool.
Catversion bump is required because the numeric values of enum
LockingClauseStrength can appear in on-disk rules.
Extracted from a much larger patch to support foreign table inheritance;
it seemed worth breaking this out, since it's a separable concern.
Shigeru Hanada and Etsuro Fujita, somewhat modified by me
Commit df630b0dd5 moved enum LockWaitPolicy
into its very own header file utils/lockwaitpolicy.h, which does not seem
like a great idea from here. First, it's still a node-related declaration,
and second, a file named like that can never sensibly be used for anything
else. I do not think we want to encourage a one-typedef-per-header-file
approach. The upcoming foreign table inheritance patch was doubling down
on this bad idea by moving enum LockClauseStrength into its *own*
can-never-be-used-for-anything-else file. Instead, let's put them both in
a file named nodes/lockoptions.h. (They do seem to need a separate header
file because we need them in both parsenodes.h and plannodes.h, and we
don't want either of those including the other. Past practice might
suggest adding them to nodes/nodes.h, but they don't seem sufficiently
globally useful to justify that.)
Committed separately since there's no functional change here, just some
header-file refactoring.
Since 465883b0a two versions of commit records have existed. A compact
version that was used when no cache invalidations, smgr unlinks and
similar were needed, and a full version that could deal with all
that. Additionally the full version was embedded into twophase commit
records.
That resulted in a measurable reduction in the size of the logged WAL in
some workloads. But more recently additions like logical decoding, which
e.g. needs information about the database something was executed on,
made it applicable in fewer situations. The static split generally made
it hard to expand the commit record, because concerns over the size made
it hard to add anything to the compact version.
Additionally it's not particularly pretty to have twophase.c insert
RM_XACT records.
Rejigger things so that the commit and abort records only have one form
each, including the twophase equivalents. The presence of the various
optional (in the sense of not being in every record) pieces is indicated
by a bits in the 'xinfo' flag. That flag previously was not included in
compact commit records. To prevent an increase in size due to its
presence, it's only included if necessary; signalled by a bit in the
xl_info bits available for xact.c, similar to heapam.c's
XLOG_HEAP_OPMASK/XLOG_HEAP_INIT_PAGE.
Twophase commit/aborts are now the same as their normal
counterparts. The original transaction's xid is included in an optional
data field.
This means that commit records generally are smaller, except in the case
of a transaction with subtransactions, but no other special cases; the
increase there is four bytes, which seems acceptable given that the more
common case of not having subtransactions shrank. The savings are
especially measurable for twophase commits, which previously always used
the full version; but will in practice only infrequently have required
that.
The motivation for this work are not the space savings and and
deduplication though; it's that it makes it easier to extend commit
records with additional information. That's just a few lines of code
now; without impacting the common case where that information is not
needed.
Discussion: 20150220152150.GD4149@awork2.anarazel.de,
235610.92468.qm%40web29004.mail.ird.yahoo.com
Reviewed-By: Heikki Linnakangas, Simon Riggs
The introduction of min_wal_size & max_wal_size in 88e9823026 makes it
feasible to increase the default upper bound in checkpoint
size. Previously raising the default would lead to a increased disk
footprint, even if more segments weren't beneficial. The low default of
checkpoint size is one of common performance problem users have thus
increasing the default makes sense. Setups where the increase in
maximum disk usage is a problem will very likely have to run with a
modified configuration anyway.
Discussion: 54F4EFB8.40202@agliodbs.com,
CA+TgmoZEAgX5oMGJOHVj8L7XOkAe05Gnf45rP40m-K3FhZRVKg@mail.gmail.com
Author: Josh Berkus, after a discussion involving lots of people.
The new recovery_target_action (introduced in aedccb1f6/b8e33a85d4)
replaces it's functionality. Having both seems likely to cause more
confusion than it saves worry due to the incompatibility.
Discussion: 5484FC53.2060903@2ndquadrant.com
Author: Petr Jelinek
Obsoleted by commit 21dcda2713, but I missed
seeing the cross-reference in the comments for exec_eval_integer().
Also improve the cross-reference in the comments for exec_eval_cleanup().
Since commit ba7c5975ad, port/dirmod.c
has contained only Windows-specific functions. Most platforms don't
seem to mind uselessly building an empty file, but OS X for one issues
warnings. Hence, treat dirmod.c as a Windows-specific file selected
by configure rather than one that's always built. We can revert this
change if dirmod.c ever gains any non-Windows functionality again.
Back-patch to 9.4 where the mentioned commit appeared.
GNU readline defines the return value of write_history() as "zero if OK,
else an errno code". libedit's version of that function used to have a
different definition (to wit, "-1 if error, else the number of lines
written to the file"). We tried to work around that by checking whether
errno had become nonzero, but this method has never been kosher according
to the published API of either library. It's reportedly completely broken
in recent Ubuntu releases: psql bleats about "No such file or directory"
when saving ~/.psql_history, even though the write worked fine.
However, libedit has been following the readline definition since somewhere
around 2006, so it seems all right to finally break compatibility with
ancient libedit releases and trust that the return value is what readline
specifies. (I'm not sure when the various Linux distributions incorporated
this fix, but I did find that OS X has been shipping fixed versions since
10.5/Leopard.)
If anyone is still using such an ancient libedit, they will find that psql
complains it can't write ~/.psql_history at exit, even when the file was
written correctly. This is no worse than the behavior we're fixing for
current releases.
Back-patch to all supported branches.
The message tries to tell the replication apply delay which fails if
the first WAL record is not applied yet. Fix is, instead of telling
overflowed minus numeric, showing "N/A" which indicates that the delay
data is not yet available. Problem reported by me and patch by
Fabrízio de Royes Mello.
Back patched to 9.4, 9.3 and 9.2 stable branches (9.1 and 9.0 do not
have the debug message).
The ROW_MARK_COPY path in EvalPlanQualFetchRowMarks() was just setting
tableoid to InvalidOid, I think on the assumption that the referenced
RTE must be a subquery or other case without a meaningful OID. However,
foreign tables also use this code path, and they do have meaningful
table OIDs; so failure to set the tuple field can lead to user-visible
misbehavior. Fix that by fetching the appropriate OID from the range
table.
There's still an issue about whether CTID can ever have a meaningful
value in this case; at least with postgres_fdw foreign tables, it does.
But that is a different problem that seems to require a significantly
different patch --- it's debatable whether postgres_fdw really wants to
use this code path at all.
Simplified version of a patch by Etsuro Fujita, who also noted the
problem to begin with. The issue can be demonstrated in all versions
having FDWs, so back-patch to 9.1.
We can't handle this in the general case due to limitations of the
planner's data representations; but we can allow it in many useful cases,
by being careful to flatten only when we are pulling a single-row subquery
up into a FROM (or, equivalently, inner JOIN) node that will still have at
least one remaining relation child. Per discussion of an example from
Kyotaro Horiguchi.
While poking at David Kubečka's issue I noticed an ancient logic error
in get_loop_count(): it used 1.0 as a "no data yet" indicator, but since
that is actually a valid rowcount estimate, this doesn't work. If we
have one input relation with 1.0 as rowcount and then another one with
a larger rowcount, we should use 1.0 as the result, but we picked the
larger rowcount instead. (I think when I coded this, I recognized the
conflict, but mistakenly thought that the logic would pick the desired
count anyway.)
Fixing this changed the plan for one existing regression test case.
Since the point of that test is to exercise creation of a particular
shape of nestloop plan, I tweaked the query a little bit so it still
results in the same plan choice.
This is definitely a bug, but I'm hesitant to back-patch since it might
change plan choices unexpectedly, and anyway failure to implement a
heuristic precisely as intended is a pretty low-grade bug.
If we have a semijoin, say
SELECT * FROM x WHERE x1 IN (SELECT y1 FROM y)
and we're estimating the cost of a parameterized indexscan on x, the number
of repetitions of the indexscan should not be taken as the size of y; it'll
really only be the number of distinct values of y1, because the only valid
plan with y on the outside of a nestloop would require y to be unique-ified
before joining it to x. Most of the time this doesn't make that much
difference, but sometimes it can lead to drastically underestimating the
cost of the indexscan and hence choosing a bad plan, as pointed out by
David Kubečka.
Fixing this is a bit difficult because parameterized indexscans are costed
out quite early in the planning process, before we have the information
that would be needed to call estimate_num_groups() and thereby estimate the
number of distinct values of the join column(s). However we can move the
code that extracts a semijoin RHS's unique-ification columns, so that it's
done in initsplan.c rather than on-the-fly in create_unique_path(). That
shouldn't make any difference speed-wise and it's really a bit cleaner too.
The other bit of information we need is the size of the semijoin RHS,
which is easy if it's a single relation (we make those estimates before
considering indexscan costs) but problematic if it's a join relation.
The solution adopted here is just to use the product of the sizes of the
join component rels. That will generally be an overestimate, but since
estimate_num_groups() only uses this input as a clamp, an overestimate
shouldn't hurt us too badly. In any case we don't allow this new logic
to produce a value larger than we would have chosen before, so that at
worst an overestimate leaves us no wiser than we were before.
In the spirit of 890192e99a, this time add support for the things
living in the pg_default_acl catalog. These are not really "objects",
but they show up as such in event triggers.
There is no "DROP DEFAULT PRIVILEGES" or similar command, so it doesn't
look like the new representation given would be useful anywhere else, so
I didn't try to use it outside objectaddress.c. (That might be a bug in
itself, but that would be material for another commit.)
Reviewed by Stephen Frost.
Since commit 72dd233d3e we were trying to obtain object addressing
information in sql_drop event triggers, but that caused failures when
the drops involved user mappings. This addition enables that to work
again. Naturally, pg_get_object_address can work with these objects
now, too.
I toyed with the idea of removing DropUserMappingStmt as a node and
using DropStmt instead in the DropUserMappingStmt grammar production,
but that didn't go very well: for one thing the messages thrown by the
specific code are specialized (you get "server not found" if you specify
the wrong server, instead of a generic "user mapping for ... not found"
which you'd get it we were to merge this with RemoveObjects --- unless
we added even more special cases). For another thing, it would require
to pass RoleSpec nodes through the objname/objargs representation used
by RemoveObjects, which works in isolation, but gets messy when
pg_get_object_address is involved. So I dropped this part for now.
Reviewed by Stephen Frost.
PL/Python uses str() to convert Python values back to PostgreSQL, but
str() is lossy for float values, so use repr() instead in that case.
Author: Marko Kreen <markokr@gmail.com>
While the SQL standard is pretty vague on the overall topic of operator
precedence (because it never presents a unified BNF for all expressions),
it does seem reasonable to conclude from the spec for <boolean value
expression> that OR has the lowest precedence, then AND, then NOT, then IS
tests, then the six standard comparison operators, then everything else
(since any non-boolean operator in a WHERE clause would need to be an
argument of one of these).
We were only sort of on board with that: most notably, while "<" ">" and
"=" had properly low precedence, "<=" ">=" and "<>" were treated as generic
operators and so had significantly higher precedence. And "IS" tests were
even higher precedence than those, which is very clearly wrong per spec.
Another problem was that "foo NOT SOMETHING bar" constructs, such as
"x NOT LIKE y", were treated inconsistently because of a bison
implementation artifact: they had the documented precedence with respect
to operators to their right, but behaved like NOT (i.e., very low priority)
with respect to operators to their left.
Fixing the precedence issues is just a small matter of rearranging the
precedence declarations in gram.y, except for the NOT problem, which
requires adding an additional lookahead case in base_yylex() so that we
can attach a different token precedence to NOT LIKE and allied two-word
operators.
The bulk of this patch is not the bug fix per se, but adding logic to
parse_expr.c to allow giving warnings if an expression has changed meaning
because of these precedence changes. These warnings are off by default
and are enabled by the new GUC operator_precedence_warning. It's believed
that very few applications will be affected by these changes, but it was
agreed that a warning mechanism is essential to help debug any that are.
setup_param_list() was allocating a fresh ParamListInfo for each query or
expression evaluation requested by a plpgsql function. There was probably
once good reason to do it like that, but for a long time we've had a
convention that there's a one-to-one mapping between the function's
PLpgSQL_datum array and the ParamListInfo slots, which means that a single
ParamListInfo can serve all the function's evaluation requests: the data
that would need to be passed is the same anyway.
In this patch, we retain the pattern of zeroing out the ParamListInfo
contents during each setup_param_list() call, because some of the slots may
be stale and we don't know exactly which ones. So this patch only saves a
palloc/pfree per evaluation cycle and nothing more; still, that seems to be
good for a couple percent overall speedup on simple-arithmetic type
statements. In future, though, we might be able to improve matters still
more by managing the param array contents more carefully.
Also, unify the former use of estate->cur_expr with that of
paramLI->parserSetupArg; they both were used to point to the active
expression, so we can combine the variables into just one.
Error messages informing the user that no such column exists can
sometimes provoke a perplexed response. This often happens due to
a subtle typo in the column name or, perhaps less likely, in the
alias name. To speed discovery of what the real issue is in such
cases, we'll now search the range table for approximate matches.
If there are one or two such matches that are good enough to think
that they might be what the user intended to type, and better than
all other approximate matches, we'll issue a hint suggesting that
the user might have intended to reference those columns.
Peter Geoghegan and Robert Haas
Until now __attribute__() was defined to be empty for all compilers but
gcc. That's problematic because it prevents using it in other compilers;
which is necessary e.g. for atomics portability. It's also just
generally dubious to do so in a header as widely included as c.h.
Instead add pg_attribute_format_arg, pg_attribute_printf,
pg_attribute_noreturn macros which are implemented in the compilers that
understand them. Also add pg_attribute_noreturn and pg_attribute_packed,
but don't provide fallbacks, since they can affect functionality.
This means that external code that, possibly unwittingly, relied on
__attribute__ defined to be empty on !gcc compilers may now run into
warnings or errors on those compilers. But there shouldn't be many
occurances of that and it's hard to work around...
Discussion: 54B58BA3.8040302@ohmu.fi
Author: Oskari Saarenmaa, with some minor changes by me.
When newly-added GUC parameter, wal_compression, is on, the PostgreSQL server
compresses a full page image written to WAL when full_page_writes is on or
during a base backup. A compressed page image will be decompressed during WAL
replay. Turning this parameter on can reduce the WAL volume without increasing
the risk of unrecoverable data corruption, but at the cost of some extra CPU
spent on the compression during WAL logging and on the decompression during
WAL replay.
This commit changes the WAL format (so bumping WAL version number) so that
the one-byte flag indicating whether a full page image is compressed or not is
included in its header information. This means that the commit increases the
WAL volume one-byte per a full page image even if WAL compression is not used
at all. We can save that one-byte by borrowing one-bit from the existing field
like hole_offset in the header and using it as the flag, for example. But which
would reduce the code readability and the extensibility of the feature.
Per discussion, it's not worth paying those prices to save only one-byte, so we
decided to add the one-byte flag to the header.
This commit doesn't introduce any new compression algorithm like lz4.
Currently a full page image is compressed using the existing PGLZ algorithm.
Per discussion, we decided to use it at least in the first version of the
feature because there were no performance reports showing that its compression
ratio is unacceptably lower than that of other algorithm. Of course,
in the future, it's worth considering the support of other compression
algorithm for the better compression.
Rahila Syed and Michael Paquier, reviewed in various versions by myself,
Andres Freund, Robert Haas, Abhijit Menon-Sen and many others.
Commit 865f14a2d3 was quite a few bricks
shy of a load: psql, ecpg, and plpgsql were all left out-of-step with
the core lexer. Of these only the last was likely to be a fatal
problem; but still, a minimal amount of grepping, or even just reading
the comments adjacent to the places that were changed, would have found
the other places that needed to be changed.
... which is the usual convention among AMs, so that pg_filedump and
similar utilities can tell apart pages of different AMs. It was also
the intent of the original code, but I failed to realize that alignment
considerations would move the whole thing to the previous-to-last word
in the page.
The new definition of the associated macro makes surrounding code a bit
leaner, too.
Per note from Heikki at
http://www.postgresql.org/message-id/546A16EF.9070005@vmware.com
SQL has standardized on => as the use of to specify named parameters,
and we've wanted for many years to support the same syntax ourselves,
but this has been complicated by the possible use of => as an operator
name. In PostgreSQL 9.0, we began emitting a warning when an operator
named => was defined, and in PostgreSQL 9.2, we stopped shipping a
=>(text, text) operator as part of hstore. By the time the next major
version of PostgreSQL is released, => will have been deprecated for a
full five years, so hopefully there won't be too many people still
relying on it. We continue to support := for compatibility with
previous PostgreSQL releases.
Pavel Stehule, reviewed by Petr Jelinek, with a few documentation
tweaks by me.
We allow this module to be turned off on restarts, so a restart time
check is enough to activate or deactivate the module; however, if there
is a standby replaying WAL emitted from a master which is restarted, but
the standby isn't, the state in the standby becomes inconsistent and can
easily be crashed.
Fix by activating and deactivating the module during WAL replay on
parameter change as well as on system start.
Problem reported by Fujii Masao in
http://www.postgresql.org/message-id/CAHGQGwFhJ3CnHo1CELEfay18yg_RA-XZT-7D8NuWUoYSZ90r4Q@mail.gmail.com
Author: Petr Jelínek
ALTER DEFAULT PRIVILEGES was trying to decode the list of roles in the
FOR clause as a list of names rather than of RoleSpecs; and the IN
clause in CREATE ROLE was doing the same thing. This was evidenced by
crashes on some buildfarm machines, though on my platform this doesn't
cause a failure by mere chance; I can reproduce the failures only by
adding some padding in struct RoleSpecs.
Fix by dereferencing those lists as being of RoleSpecs, not string
Values.
Commands such as ALTER USER, ALTER GROUP, ALTER ROLE, GRANT, and the
various ALTER OBJECT / OWNER TO, as well as ad-hoc clauses related to
roles such as the AUTHORIZATION clause of CREATE SCHEMA, the FOR clause
of CREATE USER MAPPING, and the FOR ROLE clause of ALTER DEFAULT
PRIVILEGES can now take the keywords CURRENT_USER and SESSION_USER as
user specifiers in place of an explicit user name.
This commit also fixes some quite ugly handling of special standards-
mandated syntax in CREATE USER MAPPING, which in particular would fail
to work in presence of a role named "current_user".
The special role specifiers PUBLIC and NONE also have more consistent
handling now.
Also take the opportunity to add location tracking to user specifiers.
Authors: Kyotaro Horiguchi. Heavily reworked by Álvaro Herrera.
Reviewed by: Rushabh Lathia, Adam Brightwell, Marti Raudsepp.
Commit 5cefbf5a6c introduced an
assumption that this field would always be non-NULL when doing a merge
pass, but that's not true. Without this fix, you can crash the server
by building a hash index that is sufficiently large relative to
maintenance_work_mem, or by triggering a large datum sort.
Commit 5ea86e6e65 changed the comments
for that field to say that it would be set in all cases except for the
hash index case, but that wasn't (and still isn't) true.
The datum-sort failure was spotted by Tomas Vondra; initial analysis
of that failure was by Peter Geoghegan. The remaining issues were
spotted by me during review of the surrounding code, and the patch is
all my fault.
This makes it easier to write frontend programs that needs to understand
the WAL record format of CREATE/DROP DATABASE. dbcommands.h cannot easily
be #included in a frontend program, because it pulls in other header files
that need backend stuff, but the new dbcommands_xlog.h header file has
fewer dependencies.
This is a possibly-vain effort to silence a Coverity warning about
bogus endianness dependency. The code's fine, because it takes care
of endianness issues for itself, but Coverity sees an int64 being
passed to an int* argument and not unreasonably suspects something's
wrong. I'm not sure if putting the void* cast in the way will shut it
up; but it can't hurt and seems better from a documentation standpoint
anyway, since the pointer is not used as an int* in this code path.
Just for a bit of additional safety, verify that the result length
is 8 bytes as expected.
Back-patch to 9.3 where the code in question was added.
This struct is purely a client-side artifact. Perhaps there was once
reason for the server to know it, but any such reason is lost in the
mists of time. We certainly don't need two independent declarations
of it.
The SGML docs claimed that 1-byte integers could be sent or received with
the "isint" options, but no such behavior has ever been implemented in
pqGetInt() or pqPutInt(). The in-code documentation header for PQfn() was
even less in tune with reality, and the code itself used parameter names
matching neither the SGML docs nor its libpq-fe.h declaration. Do a bit
of additional wordsmithing on the SGML docs while at it.
Since the business about 1-byte integers is a clear documentation bug,
back-patch to all supported branches.
By building it unconditionally, libpgport inadvertently replaced any
libc version of the function. This is essentially a code cleanup; any
effect on performance is almost surely too small to notice.
This role attribute is an ancient PostgreSQL feature, but could only be
set by directly updating the system catalogs, and it doesn't have any
clearly defined use.
Author: Adam Brightwell <adam.brightwell@crunchydatasolutions.com>
Commit 7b583b20b1 created an unnecessary
dump failure hazard by applying pg_get_function_identity_arguments()
to every function in the database, even those that won't get dumped.
This could result in snapshot-related problems if concurrent sessions are,
for example, creating and dropping temporary functions, as noted by Marko
Tiikkaja in bug #12832. While this is by no means pg_dump's only such
issue with concurrent DDL, it's unfortunate that we added a new failure
mode for cases that used to work, and even more so that the failure was
created for basically cosmetic reasons (ie, to sort overloaded functions
more deterministically).
To fix, revert that patch and instead sort function arguments using
information that pg_dump has available anyway, namely the names of the
argument types. This will produce a slightly different sort ordering for
overloaded functions than the previous coding; but applying strcmp
directly to the output of pg_get_function_identity_arguments really was
a bit odd anyway. The sorting will still be name-based and hence
independent of possibly-installation-specific OID assignments. A small
additional benefit is that sorting now works regardless of server version.
Back-patch to 9.3, where the previous commit appeared.
We were using "user mapping for user XYZ" as description for user mappings, but
that's ambiguous because users can have mappings on multiple foreign
servers; therefore change it to "for user XYZ on server UVW" instead.
Object identities for user mappings are also updated in the same way, in
branches 9.3 and above.
The incomplete description string was introduced together with the whole
SQL/MED infrastructure by commit cae565e503 of 8.4 era, so backpatch all
the way back.
An OID return value was being used only for a (rather pointless) assert.
Silence by removing the variable and the assert.
Per note from Peter Geoghegan
I had thought that there was no need to maintain separate cache entries
for different source typmods, but further experimentation shows that there
is an advantage to doing so in some cases. In particular, if a domain has
a typmod (say, "CREATE DOMAIN d AS numeric(20,0)"), failing to notice the
source typmod leads to applying a length-coercion step even when the
source has the correct typmod.
This is because can_coerce_type thinks that RECORD can be cast to any
composite type, but coerce_record_to_complex only works for inputs that are
RowExprs or whole-row Vars, so we get a hard failure on a CaseTestExpr.
Perhaps these corner cases ought to be fixed so that coerce_to_target_type
actually returns NULL as per its specification, rather than failing ...
but for the moment an extra check here is the path of least resistance.
plpgsql's historical method for converting datatypes during assignments was
to apply the source type's output function and then the destination type's
input function. Aside from being miserably inefficient in most cases, this
method failed outright in many cases where a user might expect it to work;
an example is that "declare x int; ... x := 3.9;" would fail, not round the
value to 4.
Instead, let's convert by applying the appropriate assignment cast whenever
there is one. To avoid breaking compatibility unnecessarily, fall back to
the I/O conversion method if there is no assignment cast.
So far as I can tell, there is just one case where this method produces a
different result than the old code in a case where the old code would not
have thrown an error. That is assignment of a boolean value to a string
variable (type text, varchar, or bpchar); the old way gave boolean's output
representation, ie 't'/'f', while the new way follows the behavior of the
bool-to-text cast and so gives 'true' or 'false'. This will need to be
called out as an incompatibility in the 9.5 release notes.
Aside from handling many conversion cases more sanely, this method is
often significantly faster than the old way. In part that's because
of more effective caching of the conversion info.
genericcostestimate() and friends used the cost of the entire indexqual
expressions as the charge for initial evaluation of indexscan arguments.
But of course the index column is not evaluated, only the other side
of the qual expression, so this was a bad overestimate if the index
column was an expensive expression.
To fix, refactor the logic in this area so that there's a single routine
charged with deconstructing index quals and figuring out what is the index
column and what is the comparison expression. This is more or less free in
the case of btree indexes, since btcostestimate() was doing equivalent
deconstruction already. It probably adds a bit of new overhead in the cases
of other index types, but not a lot. (In the case of GIN I think I saved
something by getting rid of code that wasn't aware that the index column
associations were already available "for free".)
Per recent gripe from Jeff Janes.
Arguably this is a bug fix, but I'm hesitant to back-patch because of the
possibility of destabilizing plan choices that people may be happy with.
This code relied on pointer equality to identify which restriction clauses
also appear in the indexquals (and, therefore, don't need to be applied as
simple filter conditions). That was okay once upon a time, years ago,
before we introduced the equivalence-class machinery. Now there's about a
50-50 chance that an equality clause appearing in the indexquals will be
the mirror image (commutator) of its mate in the restriction list. When
that happens, we'd erroneously think that the clause would be re-evaluated
at each visited row, and therefore inflate the cost estimate for the
indexscan by the clause's cost.
Add some logic to catch this case. It seems to me that it continues not to
be worthwhile to expend the extra predicate-proof work that createplan.c
will do on the finally-selected plan, but this case is common enough and
cheap enough to handle that we should do so.
This will make a small difference (about one cpu_operator_cost per row)
in simple cases; but in situations where there's an expensive function in
the indexquals, it can make a very large difference, as seen in recent
example from Jeff Janes.
This is a long-standing bug, but I'm hesitant to back-patch because of the
possibility of destabilizing plan choices that people may be happy with.
Passing a NULL pstate wouldn't actually work, because isLockedRefname()
isn't prepared to cope with it; and there hasn't been any in-core code
that tries in over a decade. So just remove the residual NULL handling.
Spotted by Coverity; analysis and patch by Michael Paquier.
The changed routines are mostly those that can be directly called by
ProcessUtilitySlow; the intention is to make the affected object
information more precise, in support for future event trigger changes.
Originally it was envisioned that the OID of the affected object would
be enough, and in most cases that is correct, but upon actually
implementing the event trigger changes it turned out that ObjectAddress
is more widely useful.
Additionally, some command execution routines grew an output argument
that's an object address which provides further info about the executed
command. To wit:
* for ALTER DOMAIN / ADD CONSTRAINT, it corresponds to the address of
the new constraint
* for ALTER OBJECT / SET SCHEMA, it corresponds to the address of the
schema that originally contained the object.
* for ALTER EXTENSION {ADD, DROP} OBJECT, it corresponds to the address
of the object added to or dropped from the extension.
There's no user-visible change in this commit, and no functional change
either.
Discussion: 20150218213255.GC6717@tamriel.snowman.net
Reviewed-By: Stephen Frost, Andres Freund
There's no reason to make users write an explicit cast to store a
json value in a jsonb column or vice versa.
We could probably even make these implicit, but that might open us up
to problems with ambiguous function calls, so for now just do this.
Previously, you could do \set variable operand1 operator operand2, but
nothing more complicated. Now, you can \set variable expression, which
makes it much simpler to do multi-step calculations here. This also
adds support for the modulo operator (%), with the same semantics as in
C.
Robert Haas and Fabien Coelho, reviewed by Álvaro Herrera and
Stephen Frost
Since 9.1, we've provided extensions with a way to denote
"configuration" tables- tables created by an extension which the user
may modify. By marking these as "configuration" tables, the extension
is asking for the data in these tables to be pg_dump'd (tables which
are not marked in this way are assumed to be entirely handled during
CREATE EXTENSION and are not included at all in a pg_dump).
Unfortunately, pg_dump neglected to consider foreign key relationships
between extension configuration tables and therefore could end up
trying to reload the data in an order which would cause FK violations.
This patch teaches pg_dump about these dependencies, so that the data
dumped out is done so in the best order possible. Note that there's no
way to handle circular dependencies, but those have yet to be seen in
the wild.
The release notes for this should include a caution to users that
existing pg_dump-based backups may be invalid due to this issue. The
data is all there, but restoring from it will require extracting the
data for the configuration tables and then loading them in the correct
order by hand.
Discussed initially back in bug #6738, more recently brought up by
Gilles Darold, who provided an initial patch which was further reworked
by Michael Paquier. Further modifications and documentation updates
by me.
Back-patch to 9.1 where we added the concept of extension configuration
tables.
In 6f9bd50eab, we modified
expand_security_quals() to tell expand_security_qual() about when the
current RTE was the targetRelation. Unfortunately, that commit
initialized the targetRelation variable used outside of the loop over
the RTEs instead of at the start of it.
This patch moves the variable and the initialization of it into the
loop, where it should have been to begin with.
Pointed out by Dean Rasheed.
Back-patch to 9.4 as the original commit was.
Previously, we cached domain constraints for the life of a query, or
really for the life of the FmgrInfo struct that was used to invoke
domain_in() or domain_check(). But plpgsql (and probably other places)
are set up to cache such FmgrInfos for the whole lifespan of a session,
which meant they could be enforcing really stale sets of constraints.
On the other hand, searching pg_constraint once per query gets kind of
expensive too: testing says that as much as half the runtime of a
trivial query such as "SELECT 0::domaintype" went into that.
To fix this, delegate the responsibility for tracking a domain's
constraints to the typcache, which has the infrastructure needed to
detect syscache invalidation events that signal possible changes.
This not only removes unnecessary repeat reads of pg_constraint,
but ensures that we never apply stale constraint data: whatever we
use is the current data according to syscache rules.
Unfortunately, the current configuration of the system catalogs means
we have to flush cached domain-constraint data whenever either pg_type
or pg_constraint changes, which happens rather a lot (eg, creation or
deletion of a temp table will do it). It might be worth rearranging
things to split pg_constraint into two catalogs, of which the domain
constraint one would probably be very low-traffic. That's a job for
another patch though, and in any case this patch should improve matters
materially even with that handicap.
This patch makes use of the recently-added memory context reset callback
feature to manage the lifespan of domain constraint caches, so that we
don't risk deleting a cache that might be in the midst of evaluation.
Although this is a bug fix as well as a performance improvement, no
back-patch. There haven't been many if any field complaints about
stale domain constraint checks, so it doesn't seem worth taking the
risk of modifying data structures as basic as MemoryContexts in back
branches.
This makes "ALTER TABLE tabname ALTER tscol TYPE ... USING tscol AT TIME
ZONE 'UTC'" skip rewriting the table when altering from "timestamp" to
"timestamptz" or vice versa. While it would be nicer still to optimize
this in the absence of the USING clause given timezone==UTC, transform
functions must consult IMMUTABLE facts only.
When the library already exists in the build directory, "ar" preserves
members not named on its command line. This mattered when, for example,
a "configure" rerun dropped a file from $(LIBOBJS). libpgport carried
the obsolete member until "make clean". Back-patch to 9.0 (all
supported versions).
Initial experience with this feature suggests that instances of
MemoryContextCallback are likely to propagate into some widely-used headers
over time. As things stood, that would result in pulling memutils.h or
at least memnodes.h into common headers, which does not seem desirable.
Instead, let's decide that this feature is part of the "ordinary palloc
user" API rather than the "specialized context management" API, and as
such should be declared in palloc.h not memutils.h.
The main value of this change is to avoid expensive I/O conversions when
assigning to a variable that has a typmod specification, if the value
to be assigned is already known to have the right typmod. This is
particularly valuable for arrays with typmod specifications; formerly,
in an assignment to an array element the entire array would invariably
get put through double I/O conversion to check the typmod, to absolutely
no purpose since we'd already properly coerced the new element value.
Extracted from my "expanded arrays" patch; this seems worth committing
separately, whatever becomes of that patch, since it's really an
independent issue.
As long as we're changing the function signatures, take the opportunity
to rationalize the argument lists of exec_assign_value, exec_cast_value,
and exec_simple_cast_value; that is, put the arguments into a saner order,
and get rid of the bizarre choice to pass exec_assign_value's isNull flag
by reference.
Part of the intent of the parameterized-path mechanism was to handle
star-schema queries efficiently, but some overly-restrictive search
limiting logic added in commit e2fa76d80b
prevented such cases from working as desired. Fix that and add a
regression test about it. Per gripe from Marc Cousin.
This is arguably a bug rather than a new feature, so back-patch to 9.2
where parameterized paths were introduced.
Add documentation about the new reset callback mechanism.
Also, at long last, recast the existing text so that it describes the
current context mechanisms as established fact rather than something
we're going to implement. Shoulda done that in 2001 or so ...
The type variable must get set on first iteration of the while loop,
but there are reasonably modern gcc versions that don't realize that.
Initialize it with a dummy value. This undoes a removal of initialization
in commit 654809e770.
That is, MemoryContextReset() now means what was formerly meant by
MemoryContextResetAndDeleteChildren(), and the latter is now just a macro
alias for the former. If you really want the functionality that was
formerly provided by MemoryContextReset(), what you have to do is
MemoryContextResetChildren() plus MemoryContextResetOnly() (which is a
new API to reset *only* the named context and not touch its children).
The reason for this change is that near fifteen years of experience has
proven that there is noplace where old-style MemoryContextReset() is
actually what you want. Making that the default behavior has led to lots
of context-leakage bugs, while we've not found anyplace where it's actually
necessary to keep the child contexts; at least the standard regression
tests do not reveal anyplace where this change breaks anything. And there
are upcoming patches that will introduce additional reasons why child
contexts need to be removed.
We could change existing calls of MemoryContextResetAndDeleteChildren to be
just MemoryContextReset, but for the moment I'll leave them alone; they're
not costing anything.
The way that columns are added to a view is by calling
AlterTableInternal with special subtype AT_AddColumnToView; but that
subtype is changed to AT_AddColumnRecurse by ATPrepAddColumn. This has
no visible effect in the current code, since views cannot have
inheritance children (thus the recursion step is a no-op) and adding a
column to a view is executed identically to doing it to a table; but it
does make a difference for future event trigger code keeping track of
commands, because the current situation leads to confusing the case with
a normal ALTER TABLE ADD COLUMN.
Fix the problem by passing a flag to ATPrepAddColumn to prevent it from
changing the command subtype. The event trigger code can then properly
ignore the subcommand. (We could remove the call to ATPrepAddColumn,
since views are never typed, and there is never a need for recursion,
which are the two conditions that are checked by ATPrepAddColumn; but it
seems more future-proof to keep the call in place.)
This allows cleanup actions to be registered to be called just before a
particular memory context's contents are flushed (either by deletion or
MemoryContextReset). The patch in itself has no use-cases for this, but
several likely reasons for wanting this exist.
In passing, per discussion, rearrange some boolean fields in struct
MemoryContextData so as to avoid wasted padding space. For safety,
this requires making allowInCritSection's existence unconditional;
but I think that's a better approach than what was there anyway.
Typo "aggreagate" appeared three times, and the return value of function
JsonbIteratorNext() was being assigned to an int variable in a bunch of
places.
When a composite type being used in a typed table is modified by way
of ALTER TYPE, a table rewrite occurs appearing to come from ALTER TYPE.
The existing event_trigger.c code was unable to cope with that
and raised a spurious error. The fix is just to accept that command
tag for the event, and document this properly.
Noted while fooling with deparsing of DDL commands. This appears to be
an oversight in commit 618c9430a.
Thanks to Mark Wong for documentation wording help.
Commit ab14a73a6c raised an error in these cases and later the
behaviour was copied to jsonb. This is what the XML code, which we
then adopted, does, as the XSD types don't accept infinite values.
However, json dates and timestamps are just strings as far as json is
concerned, so there is no reason not to render these values as
'infinity'.
The json portion of this is backpatched to 9.4 where the behaviour was
introduced. The jsonb portion only affects the development branch.
Per gripe on pgsql-general.
Up to now RecordTransactionCommit() waited for WAL to be flushed (if
synchronous_commit != off) and to be synchronously replicated (if
enabled), even if a transaction did not have a xid assigned. The primary
reason for that is that sequence's nextval() did not assign a xid, but
are worthwhile to wait for on commit.
This can be problematic because sometimes read only transactions do
write WAL, e.g. HOT page prune records. That then could lead to read only
transactions having to wait during commit. Not something people expect
in a read only transaction.
This lead to such strange symptoms as backends being seemingly stuck
during connection establishment when all synchronous replicas are
down. Especially annoying when said stuck connection is the standby
trying to reconnect to allow syncrep again...
This behavior also is involved in a rather complicated <= 9.4 bug where
the transaction started by catchup interrupt processing waited for
syncrep using latches, but didn't get the wakeup because it was already
running inside the same overloaded signal handler. Fix the issue here
doesn't properly solve that issue, merely papers over the problems. In
9.5 catchup interrupts aren't processed out of signal handlers anymore.
To fix all this, make nextval() acquire a top level xid, and only wait for
transaction commit if a transaction both acquired a xid and emitted WAL
records. If only a xid has been assigned we don't uselessly want to
wait just because of writes to temporary/unlogged tables; if only WAL
has been written we don't want to wait just because of HOT prunes.
The xid assignment in nextval() is unlikely to cause overhead in
real-world workloads. For one it only happens SEQ_LOG_VALS/32 values
anyway, for another only usage of nextval() without using the result in
an insert or similar is affected.
Discussion: 20150223165359.GF30784@awork2.anarazel.de,
369698E947874884A77849D8FE3680C2@maumau,
5CF4ABBA67674088B3941894E22A0D25@maumau
Per complaint from maumau and Thom Brown
Backpatch all the way back; 9.0 doesn't have syncrep, but it seems
better to be consistent behavior across all maintained branches.
"RETURN SQLERRM" prompted plpgsql_exec_function() to read from freed
memory. Back-patch to 9.0 (all supported versions). Little code ran
between the premature free and the read, so non-assert builds are
unlikely to witness user-visible consequences.
The RLS patch added a hasRowSecurity field to PlannerGlobal and
PlannedStmt but didn't update nodes/copyfuncs.c and nodes/outfuncs.c to
reflect those additional fields.
Correct that by adding entries to the appropriate functions for those
fields.
Pointed out by Robert.
In expand_security_qual(), we were handling locking correctly when a
PlanRowMark existed, but not when we were working with the target
relation (which doesn't have any PlanRowMarks, but the subquery created
for the security barrier quals still needs to lock the rows under it).
Noted by Etsuro Fujita when working with the Postgres FDW, which wasn't
properly issuing a SELECT ... FOR UPDATE to the remote side under a
DELETE.
Back-patch to 9.4 where updatable security barrier views were
introduced.
Per discussion with Etsuro and Dean Rasheed.
When I rewrote this in commit 56a79a869b,
I forgot that it's possible for the input array type to change from one
call to the next (this can happen when applying the function to
pg_statistic columns, for instance). Fix that.
The "simple" path for printing VALUES clauses doesn't work if we need
to attach nondefault column aliases, because there's noplace to do that
in the minimal VALUES() syntax. So modify get_simple_values_rte() to
detect nondefault aliases and treat that as a non-simple case. This
further exposes that the "non-simple" path never actually worked;
it didn't produce valid syntax. Fix that too. Per bug #12789 from
Curtis McEnroe, and analysis by Andrew Gierth.
Back-patch to all supported branches. Before 9.3, this also requires
back-patching the part of commit 092d7ded29
that created get_simple_values_rte() to begin with; inserting the extra
test into the old factorization of that logic would've been too messy.
There are a couple of places in our grammar that fail to be strict LALR(1),
by requiring more than a single token of lookahead to decide what to do.
Up to now we've dealt with that by using a filter between the lexer and
parser that merges adjacent tokens into one in the places where two tokens
of lookahead are necessary. But that creates a number of user-visible
anomalies, for instance that you can't name a CTE "ordinality" because
"WITH ordinality AS ..." triggers folding of WITH and ORDINALITY into one
token. I realized that there's a better way.
In this patch, we still do the lookahead basically as before, but we never
merge the second token into the first; we replace just the first token by
a special lookahead symbol when one of the lookahead pairs is seen.
This requires a couple extra productions in the grammar, but it involves
fewer special tokens, so that the grammar tables come out a bit smaller
than before. The filter logic is no slower than before, perhaps a bit
faster.
I also fixed the filter logic so that when backing up after a lookahead,
the current token's terminator is correctly restored; this eliminates some
weird behavior in error message issuance, as is shown by the one change in
existing regression test outputs.
I believe that this patch entirely eliminates odd behaviors caused by
lookahead for WITH. It doesn't really improve the situation for NULLS
followed by FIRST/LAST unfortunately: those sequences still act like a
reserved word, even though there are cases where they should be seen as two
ordinary identifiers, eg "SELECT nulls first FROM ...". I experimented
with additional grammar hacks but couldn't find any simple solution for
that. Still, this is better than before, and it seems much more likely
that we *could* somehow solve the NULLS case on the basis of this filter
behavior than the previous one.
The tar format (at least the version we are using), does not support
file names or symlink targets longer than 99 bytes. Until now, the tar
creation code would silently truncate any names that are too long. (Its
original application was pg_dump, where this never happens.) This
creates problems when running base backups over the replication
protocol.
The most important problem is when a tablespace path is longer than 99
bytes, which will result in a truncated tablespace path being backed up.
Less importantly, the basebackup protocol also promises to back up any
other files it happens to find in the data directory, which would also
lead to file name truncation if someone put a file with a long name in
there.
Now both of these cases result in an error during the backup.
Add tests that fail when a too-long file name or symlink is attempted to
be backed up.
Reviewed-by: Robert Hass <robertmhaas@gmail.com>
Use a different A_Expr_Kind for LIKE/ILIKE/SIMILAR TO constructs, so that
they can be distinguished from direct invocation of the underlying
operators. Also, postpone selection of the operator name when transforming
"x IN (select)" to "x = ANY (select)", so that those syntaxes can be told
apart at parse analysis time.
I had originally thought I'd also have to do something special for the
syntaxes IS NOT DISTINCT FROM, IS NOT DOCUMENT, and x NOT IN (SELECT...),
which the grammar translates as though they were NOT (construct).
On reflection though, we can distinguish those cases reliably by noting
whether the parse location shown for the NOT is the same as for its child
node. This only requires tweaking the parse locations for NOT IN, which
I've done here.
These changes should have no effect outside the parser; they're just in
support of being able to give accurate warnings for planned operator
precedence changes.
COMMENT, SECURITY LABEL, and GRANT/REVOKE now also fire
ddl_command_start and ddl_command_end event triggers, when they operate
on database-local objects.
Reviewed-By: Michael Paquier, Andres Freund, Stephen Frost
Instead of having a single knob (checkpoint_segments) that both triggers
checkpoints, and determines how many checkpoints to recycle, they are now
separate concerns. There is still an internal variable called
CheckpointSegments, which triggers checkpoints. But it no longer determines
how many segments to recycle at a checkpoint. That is now auto-tuned by
keeping a moving average of the distance between checkpoints (in bytes),
and trying to keep that many segments in reserve. The advantage of this is
that you can set max_wal_size very high, but the system won't actually
consume that much space if there isn't any need for it. The min_wal_size
sets a floor for that; you can effectively disable the auto-tuning behavior
by setting min_wal_size equal to max_wal_size.
The max_wal_size setting is now the actual target size of WAL at which a
new checkpoint is triggered, instead of the distance between checkpoints.
Previously, you could calculate the actual WAL usage with the formula
"(2 + checkpoint_completion_target) * checkpoint_segments + 1". With this
patch, you set the desired WAL usage with max_wal_size, and the system
calculates the appropriate CheckpointSegments with the reverse of that
formula. That's a lot more intuitive for administrators to set.
Reviewed by Amit Kapila and Venkata Balaji N.
Replace the if-switch-case constructs with two conversion tables,
containing all the supported conversions between human-readable unit
strings and the base units used in GUC variables. This makes the code
easier to read, and makes adding new units simpler.
When LockBufferForCleanup() has to wait for getting a cleanup lock on a
buffer it does so by setting a flag in the buffer header and then wait
for other backends to signal it using ProcWaitForSignal().
Unfortunately LockBufferForCleanup() missed that ProcWaitForSignal() can
return for other reasons than the signal it is hoping for. If such a
spurious signal arrives the wait flags on the buffer header will still
be set. That then triggers "ERROR: multiple backends attempting to wait
for pincount 1".
The fix is simple, unset the flag if still set when retrying. That
implies an additional spinlock acquisition/release, but that's unlikely
to matter given the cost of waiting for a cleanup lock. Alternatively
it'd have been possible to move responsibility for maintaining the
relevant flag to the waiter all together, but that might have had
negative consequences due to possible floods of signals. Besides being
more invasive.
This looks to be a very longstanding bug. The relevant code in
LockBufferForCleanup() hasn't changed materially since its introduction
and ProcWaitForSignal() was documented to return for unrelated reasons
since 8.2. The master only patch series removing ImmediateInterruptOK
made it much easier to hit though, as ProcSendSignal/ProcWaitForSignal
now uses a latch shared with other tasks.
Per discussion with Kevin Grittner, Tom Lane and me.
Backpatch to all supported branches.
Discussion: 11553.1423805224@sss.pgh.pa.us
Previously when the standby server failed to retrieve WAL files from any sources
(i.e., streaming replication, local pg_xlog directory or WAL archive), it always
waited for five seconds (hard-coded) before the next attempt. For example,
this is problematic in warm-standby because restore_command can fail
every five seconds even while new WAL file is expected to be unavailable for
a long time and flood the log files with its error messages.
This commit adds new parameter, wal_retrieve_retry_interval, to control that
wait time.
Alexey Vasiliev and Michael Paquier, reviewed by Andres Freund and me.
If libpq output buffer is full, pqSendSome() function tries to drain any
incoming data. This avoids deadlock, if the server e.g. sends a lot of
NOTICE messages, and blocks until we read them. However, pqSendSome() only
did that in blocking mode. In non-blocking mode, the deadlock could still
happen.
To fix, take a two-pronged approach:
1. Change the documentation to instruct that when PQflush() returns 1, you
should wait for both read- and write-ready, and call PQconsumeInput() if it
becomes read-ready. That fixes the deadlock, but applications are not going
to change overnight.
2. In pqSendSome(), drain the input buffer before returning 1. This
alleviates the problem for applications that only wait for write-ready. In
particular, a slow but steady stream of NOTICE messages during COPY FROM
STDIN will no longer cause a deadlock. The risk remains that the server
attempts to send a large burst of data and fills its output buffer, and at
the same time the client also sends enough data to fill its output buffer.
The application will deadlock if it goes to sleep, waiting for the socket
to become write-ready, before the server's data arrives. In practice,
NOTICE messages and such that the server might be sending are usually
short, so it's highly unlikely that the server would fill its output buffer
so quickly.
Backpatch to all supported versions.
We did not need a location tag on NullTest or BooleanTest before, because
no error messages referred directly to their locations. That's planned
to change though, so add these fields in a separate housekeeping commit.
Catversion bump because stored rules may change.
transformExpr() has for many years had provisions to do nothing when
applied to an already-transformed expression tree. However, this was
always ugly and of dubious reliability, so we'd be much better off without
it. The primary historical reason for it was that gram.y sometimes
returned multiple links to the same subexpression, which is no longer true
as of my BETWEEN fixes. We'd also grown some lazy hacks in CREATE TABLE
LIKE (failing to distinguish between raw and already-transformed index
specifications) and one or two other places.
This patch removes the need for and support for re-transforming already
transformed expressions. The index case is dealt with by adding a flag
to struct IndexStmt to indicate that it's already been transformed;
which has some benefit anyway in that tablecmds.c can now Assert that
transformation has happened rather than just assuming. The other main
reason was some rather sloppy code for array type coercion, which can
be fixed (and its performance improved too) by refactoring.
I did leave transformJoinUsingClause() still constructing expressions
containing untransformed operator nodes being applied to Vars, so that
transformExpr() still has to allow Var inputs. But that's a much narrower,
and safer, special case than before, since Vars will never appear in a raw
parse tree, and they don't have any substructure to worry about.
In passing fix some oversights in the patch that added CREATE INDEX
IF NOT EXISTS (missing processing of IndexStmt.if_not_exists). These
appear relatively harmless, but still sloppy coding practice.
Previously, gram.y itself converted BETWEEN into AND (or AND/OR) nests of
expression comparisons. This was always as bogus as could be, but fixing
it hasn't risen to the top of the to-do list. The present patch invents an
A_Expr representation for BETWEEN expressions, and does the expansion to
comparison trees in parse_expr.c which is at least a slightly saner place
to be doing semantic conversions. There should be no change in the post-
parse-analysis results.
This does nothing for the semantic issues with BETWEEN (dubious connection
to btree-opclass semantics, and multiple evaluation of possibly volatile
subexpressions) ... but it's a necessary preliminary step before we could
fix any of that. The main immediate benefit is that preserving BETWEEN as
an identifiable raw-parse-tree construct will enable better error messages.
While at it, fix the code so that multiply-referenced subexpressions are
physically duplicated before being passed through transformExpr(). This
gets rid of one of the principal reasons why transformExpr() has
historically had to allow already-processed input.
Everywhere else in the file, "context" is of type MemoryContext and
"set" is of type AllocSet. AllocSetContextCreate uses a variable of
type AllocSet, so rename it from "context" to "set".
Previously, each new array created a new memory context that started
out at 8kB. This is incredibly wasteful when there are lots of small
groups of just a few elements each.
Change initArrayResult() and friends to accept a "subcontext" argument
to indicate whether the caller wants the ArrayBuildState allocated in
a new subcontext or not. If not, it can no longer be released
separately from the rest of the memory context.
Fixes bug report by Frank van Vugt on 2013-10-19.
Tomas Vondra. Reviewed by Ali Akbar, Tom Lane, and me.
In a manual pass over the catalog declaration I found a number of
columns which the boostrap automatism didn't mark NOT NULL even though
they actually were. Add BKI_FORCE_NOT_NULL markings to them.
It's usually not critical if a system table column is falsely determined
to be nullable as the code should always catch relevant cases. But it's
good to have a extra layer in place.
Discussion: 20150215170014.GE15326@awork2.anarazel.de
Bootstrap determines whether a column is null based on simple builtin
rules. Those work surprisingly well, but nonetheless a few existing
columns aren't set correctly. Additionally there is at least one patch
sent to hackers where forcing the nullness of a column would be helpful.
The boostrap format has gained FORCE [NOT] NULL for this, which will be
emitted by genbki.pl when BKI_FORCE_(NOT_)?NULL is specified for a
column in a catalog header.
This patch doesn't change the marking of any existing columns.
Discussion: 20150215170014.GE15326@awork2.anarazel.de
This requires changing quite a few places that were depending on
sizeof(HeapTupleHeaderData), but it seems for the best.
Michael Paquier, some adjustments by me
After finding an "=" character, the pointer was advanced twice when it
should only advance once. This is harmless as long as the value after "="
has at least one character; but if it doesn't, we'd miss the terminator
character and include too much in the value.
In principle this could lead to reading off the end of memory. It does not
seem worth treating as a security issue though, because it would happen on
client side, and besides client logic that's taking conninfo strings from
untrusted sources has much worse security problems than this.
Report and patch received off-list from Thomas Fanghaenel.
Back-patch to 9.2 where the faulty code was introduced.
Commit 8001fe67a3 introduced this
requirement, but per discussion, we want to avoid requirements of
this type to make things easier on the calling code. An especially
important consideration is that this may be used in frontend code,
not just the backend.
Asif Naeem, reviewed by Michael Paquier
clang complains about this, not unreasonably, so define another struct
that's explicitly for a WordEntryPos with exactly one element.
While at it, get rid of pretty dubious use of a static variable for
more than one purpose --- if it were being treated as const maybe
I'd be okay with this, but it isn't.
This works by keeping a per-subtransaction record of the ins/upd/del
counters before the truncate, and then resetting them; this record is
useful to return to the previous state in case the truncate is rolled
back, either in a subtransaction or whole transaction. The state is
propagated upwards as subtransactions commit.
When the per-table data is sent to the stats collector, a flag indicates
to reset the live/dead counters to zero as well.
Catalog version bumped due to the change in pgstat format.
Author: Alexander Shulgin
Discussion: 1007.1207238291@sss.pgh.pa.us
Discussion: 548F7D38.2000401@BlueTreble.com
Reviewed-by: Álvaro Herrera, Jim Nasby
This gives a stronger guarantee than a mere comment against accessing these
fields as simple struct members. Since rolpassword is in fact varlena,
it's not clear why these didn't get marked from the beginning, but let's
do it now.
Michael Paquier
Replace some bogus "x[1]" declarations with "x[FLEXIBLE_ARRAY_MEMBER]".
Aside from being more self-documenting, this should help prevent bogus
warnings from static code analyzers and perhaps compiler misoptimizations.
This patch is just a down payment on eliminating the whole problem, but
it gets rid of a lot of easy-to-fix cases.
Note that the main problem with doing this is that one must no longer rely
on computing sizeof(the containing struct), since the result would be
compiler-dependent. Instead use offsetof(struct, lastfield). Autoconf
also warns against spelling that offsetof(struct, lastfield[0]).
Michael Paquier, review and additional fixes by me.
Per discussion, this could be useful for purposes such as programmatically
detecting a nonresponding stats collector. We already have the timestamp
anyway, it's just a matter of providing a SQL-accessible function to fetch
it.
Matt Kelly, reviewed by Jim Nasby
There wasn't any good reason for a single C function to implement both
these SQL functions: it saved very little code overall, and it required
significant pushups to re-determine at runtime which case applied. Redoing
it as two functions ends up with just slightly more lines of code, but it's
simpler to understand, and faster too because we need not repeat syscache
lookups on every call.
An important side benefit is that this eliminates the only case in which
different aliases of the same C function had both anyarray and anyelement
arguments at the same position, which would almost always be a mistake.
The opr_sanity regression test will now notice such mistakes since there's
no longer a valid case where it happens.
Code like
open(P, "cl /? 2>&1 |") || die "cl command not found";
does not actually catch any errors, because the exit status of the
command before the pipe is ignored. The fix is to look at $?.
This also gave the opportunity to clean up the logic of this code a bit.
The original representation uses "opcname for amname", which is good
enough; but if we replace "for" with "using", we can apply the returned
identity directly in a DROP command, as in
DROP OPERATOR CLASS opcname USING amname
This slightly simplifies code using object identities to programatically
execute commands on these kinds of objects.
Note backwards-incompatible change:
The previous representation dates back to 9.3 when object identities
were introduced by commit f8348ea3, but we don't want to change the
behavior on released branches unnecessarily and so this is not
backpatched.
Somebody apparently threw darts at the code to decide where to insert
these. They certainly didn't proceed by adding them where other similar
SETs were handled. This at least broke pg_restore, and perhaps other
use-cases too.
cfopen() and cfopen_write() failed to pass the compression level through
to zlib, so that you always got the default compression level if you got
any at all.
In passing, also fix these and related functions so that the correct errno
is reliably returned on failure; the original coding supposes that free()
cannot change errno, which is untrue on at least some platforms.
Per bug #12779 from Christoph Berg. Back-patch to 9.1 where the faulty
code was introduced.
Michael Paquier
The previous coding in EXPLAIN always labeled a ModifyTable node with the
name of the target table affected by its first child plan. When originally
written, this was necessarily the parent table of the inheritance tree,
so everything was unconfusing. But when we added NO INHERIT constraints,
it became possible for the parent table to be deleted from the plan by
constraint exclusion while still leaving child tables present. This led to
the ModifyTable plan node being labeled with the first surviving child,
which was deemed confusing. Fix it by retaining the parent table's RT
index in a new field in ModifyTable.
Etsuro Fujita, reviewed by Ashutosh Bapat and myself
After removal, the next_sibling pointer of a node was sometimes incorrectly
left to point to another node in the heap, which meant that a node was
sometimes linked twice into the heap. Surprisingly that didn't cause any
crashes in my testing, but it was clearly wrong and could easily segfault
in other scenarios.
Also always keep the prev_or_parent pointer as NULL on the root node. That
was not a correctness issue AFAICS, but let's be tidy.
Add a debugging function, to dump the contents of a pairing heap as a
string. It's #ifdef'd out, as it's not used for anything in any normal
code, but it was highly useful in debugging this. Let's keep it handy for
further reference.
The part of the comparison function that was supposed to keep heap tuples
ahead of index items was backwards. It would not lead to incorrect results,
but it is more efficient to return heap tuples first, before scanning more
index pages, when both have the same distance.
Alexander Korotkov
In investigating yesterday's crash report from Hugo Osvaldo Barrera, I only
looked back as far as commit f3aec2c7f5 where the breakage occurred
(which is why I thought the IPv4-in-IPv6 business was undocumented). But
actually the logic dates back to commit 3c9bb8886d and was simply
broken by erroneous refactoring in the later commit. A bit of archives
excavation shows that we added the whole business in response to a report
that some 2003-era Linux kernels would report IPv4 connections as having
IPv4-in-IPv6 addresses. The fact that we've had no complaints since 9.0
seems to be sufficient confirmation that no modern kernels do that, so
let's just rip it all out rather than trying to fix it.
Do this in the back branches too, thus essentially deciding that our
effective behavior since 9.0 is correct. If there are any platforms on
which the kernel reports IPv4-in-IPv6 addresses as such, yesterday's fix
would have made for a subtle and potentially security-sensitive change in
the effective meaning of IPv4 pg_hba.conf entries, which does not seem like
a good thing to do in minor releases. So let's let the post-9.0 behavior
stand, and change the documentation to match it.
In passing, I failed to resist the temptation to wordsmith the description
of pg_hba.conf IPv4 and IPv6 address entries a bit. A lot of this text
hasn't been touched since we were IPv4-only.
Avoid losing errno if readdir() fails and closedir() works. Consistently
return 4 rather than 3 if both a lost+found directory and other files are
found, rather than returning one value or the other depending on the
order of the directory listing. Update comments to match the actual
behavior.
These oversights date to commits 6f03927fce
and 17f1523932.
Marco Nenciarini
The previous coding copied garbage into a local variable, pretty much
ensuring that the intended test of an IPv6 connection address against a
promoted IPv4 address from pg_hba.conf would never match. The lack of
field complaints likely indicates that nobody realized this was supposed
to work, which is unsurprising considering that no user-facing docs suggest
it should work.
In principle this could have led to a SIGSEGV due to reading off the end of
memory, but since the source address would have pointed to somewhere in the
function's stack frame, that's quite unlikely. What led to discovery of
the bug is Hugo Osvaldo Barrera's report of a crash after an OS upgrade,
which is probably because he is now running a system in which memcpy raises
abort() upon detecting overlapping source and destination areas. (You'd
have to additionally suppose some things about the stack frame layout to
arrive at this conclusion, but it seems plausible.)
This has been broken since the code was added, in commit f3aec2c7f5,
so back-patch to all supported branches.
This reverts the removal of the call in commit (272923a0). It turns out it
wasn't superfluous after all: without it, renegotiation fails if a client
certificate was used. The rest of the changes in that commit are still OK
and not reverted.
Per investigation of bug #12769 by Arne Scheffer, although this doesn't fix
the reported bug yet.
exec_stmt_return() and exec_stmt_return_next() have fast-path code for
handling a simple variable reference (i.e. "return var") without going
through the full expression evaluation machinery. For some reason,
pl_gram.y was under the impression that this fast path only applied for
record/row variables; but in reality code for handling regular scalar
variables has been there all along. Adjusting the logic to allow that
code to be used actually results in a net savings of code in pl_gram.y
(by eliminating some redundancy), and it buys a measurable though not
very impressive amount of speedup.
Noted while fooling with my expanded-array patch, wherein this makes a much
bigger difference because it enables returning an expanded array variable
without an extra flattening step. But AFAICS this is a win regardless,
so commit it separately.
All the other certificates were created to be valid for 10000 days, because
we don't want to have to recreate them. But I missed the root CA cert, and
the pre-created certificates included in the repository expired in January.
Fix, and re-create all the certificates.
The four functions array_ref, array_set, array_get_slice, array_set_slice
have traditionally declared their array inputs and results as being of type
"ArrayType *". This is a lie, and has been since Berkeley days, because
they actually also support "fixed-length array" types such as "name" and
"point"; not to mention that the inputs could be toasted. These values
should be declared Datum instead to avoid confusion. The current coding
already risks possible misoptimization by compilers, and it'll get worse
when "expanded" array representations become a valid alternative.
However, there's a fair amount of code using array_ref and array_set with
arrays that *are* known to be ArrayType structures, and there might be more
such places in third-party code. Rather than cluttering those call sites
with PointerGetDatum/DatumGetArrayTypeP cruft, what I did was to rename the
existing functions to array_get_element/array_set_element, fix their
signatures, then reincarnate array_ref/array_set as backwards compatibility
wrappers.
array_get_slice/array_set_slice have no such constituency in the core code,
and probably not in third-party code either, so I just changed their APIs.
In commit bf7ca15875 I introduced an
assumption that an RTE referenced by a whole-row Var must have a valid eref
field. This is false for RTEs constructed by DoCopy, and there are other
places taking similar shortcuts. Perhaps we should make all those places
go through addRangeTableEntryForRelation or its siblings instead of having
ad-hoc logic, but the most reliable fix seems to be to make the new code in
ExecEvalWholeRowVar cope if there's no eref. We can reasonably assume that
there's no need to insert column aliases if no aliases were provided.
Add a regression test case covering this, and also verifying that a sane
column name is in fact available in this situation.
Although the known case only crashes in 9.4 and HEAD, it seems prudent to
back-patch the code change to 9.2, since all the ingredients for a similar
failure exist in the variant patch applied to 9.3 and 9.2.
Per report from Jean-Pierre Pelletier.
The client socket is always in non-blocking mode, and if we actually want
blocking behaviour, we emulate it by sleeping and retrying. But we have
retry loops at different layers for reads and writes, which was confusing.
To simplify, remove all the sleeping and retrying code from the lower
levels, from be_tls_read and secure_raw_read and secure_raw_write, and put
all the logic in secure_read() and secure_write().
At least in all modern versions of OpenSSL, it is enough to call
SSL_renegotiate() once, and then forget about it. Subsequent SSL_write()
and SSL_read() calls will finish the handshake.
The SSL_set_session_id_context() call is unnecessary too. We only have
one SSL context, and the SSL session was created with that to begin with.
pg_database.datfrozenxid and pg_database.datminmxid were not preserved
for the 'postgres' and 'template1' databases. This could cause missing
clog file errors on access to user tables and indexes after upgrades in
these databases.
Backpatch through 9.0
This omission leaked one PGresult per WAL streaming cycle, which possibly
would never be enough to notice in the real world, but it's still a leak.
Per Coverity. Back-patch to 9.3 where the error was introduced.
We'd leak the ident_serv data structure if the second pg_getaddrinfo_all
(the one for the local address) failed. This is not of great consequence
because a failure return here just leads directly to backend exit(), but
if this function is going to try to clean up after itself at all, it should
not have such holes in the logic. Try to fix it in a future-proof way by
having all the failure exits go through the same cleanup path, rather than
"optimizing" some of them.
Per Coverity. Back-patch to 9.2, which is as far back as this patch
applies cleanly.
We already had one go at this issue in commit d73b7f973d, but we
failed to notice that buildACLCommands also leaked several PQExpBuffers
along with a simply malloc'd string. This time let's try to make the
fix a bit more future-proof by eliminating the separate exit path.
It's still not exactly critical because pg_dump will curl up and die on
failure; but since the amount of the potential leak is now several KB,
it seems worth back-patching as far as 9.2 where the previous fix landed.
Per Coverity, which evidently is smarter than clang's static analyzer.
Back in 2003 we had a discussion about how to decide which casts to dump.
At the time pg_dump really only considered an object's containing schema
to decide what to dump (ie, dump whatever's not in pg_catalog), and so
we chose a complicated idea involving whether the underlying types were to
be dumped (cf commit a6790ce857). But users
are allowed to create casts between built-in types, and we failed to dump
such casts. Let's get rid of that heuristic, which has accreted even more
ugliness since then, in favor of just looking at the cast's OID to decide
if it's a built-in cast or not.
In passing, also fix some really ancient code that supposed that it had to
manufacture a dependency for the cast on its cast function; that's only
true when dumping from a pre-7.3 server. This just resulted in some wasted
cycles and duplicate dependency-list entries with newer servers, but we
might as well improve it.
Per gripes from a number of people, most recently Greg Sabino Mullane.
Back-patch to all supported branches.
Back in commit 400e2c9344 I rewrote GEQO's
gimme_tree function to improve its heuristic for modifying the given tour
into a legal join order. In what can only be called a fit of hubris,
I supposed that this new heuristic would *always* find a legal join order,
and ripped out the old logic that allowed gimme_tree to sometimes fail.
The folly of this is exposed by bug #12760, in which the "greedy" clumping
behavior of merge_clump() can lead it into a dead end which could only be
recovered from by un-clumping. We have no code for that and wouldn't know
exactly what to do with it if we did. Rather than try to improve the
heuristic rules still further, let's just recognize that it *is* a
heuristic and probably must always have failure cases. So, put back the
code removed in the previous commit to allow for failure (but comment it
a bit better this time).
It's possible that this code was actually fully correct at the time and
has only been broken by the introduction of LATERAL. But having seen this
example I no longer have much faith in that proposition, so back-patch to
all supported branches.
When ecpg was rewritten to the new protocol version not all variable types
were corrected. This patch rewrites the code for these types to fix that. It
also fixes the documentation to correctly tell the status of array handling.
This speeds up WAL generation and replay. The new algorithm is
significantly faster with large inputs, like full-page images or when
inserting wide rows. It is slower with tiny inputs, i.e. less than 10 bytes
or so, but the speedup with longer inputs more than make up for that. Even
small WAL records at least have 24 byte header in the front.
The output is identical to the current byte-at-a-time computation, so this
does not affect compatibility. The new algorithm is only used for the
CRC-32C variant, not the legacy version used in tsquery or the
"traditional" CRC-32 used in hstore and ltree. Those are not as performance
critical, and are usually only applied over small inputs, so it seems
better to not carry around the extra lookup tables to speed up those rare
cases.
Abhijit Menon-Sen
Fix some issues I noticed while fooling with an extension to allow an
additional kind of toast pointer. Much of this is just comment
improvement, but there are a couple of actual bugs, which might or might
not be reachable today depending on what can happen during logical
decoding. An example is that toast_flatten_tuple() failed to cover the
possibility of an indirection pointer in its input. Back-patch to 9.4
just in case that is reachable now.
In HEAD, also correct some really minor issues with recent compression
reorganization, such as dangerously underparenthesized macros.
To get CRC functionality in a client program, you now need to link with
libpgcommon instead of libpgport. The CRC code has nothing to do with
portability, so libpgcommon is a better home. (libpgcommon didn't exist
when pg_crc.c was originally moved to src/port.)
Remove the possibility to get CRC functionality by just #including
pg_crc_tables.h. I'm not aware of any extensions that actually did that and
couldn't simply link with libpgcommon.
This also moves the pg_crc.h header file from src/include/utils to
src/include/common, which will require changes to any external programs
that currently does #include "utils/pg_crc.h". That seems acceptable, as
include/common is clearly the right home for it now, and the change needed
to any such programs is trivial.
The meta data of PGLZ symbolized by PGLZ_Header is removed, to make
the compression and decompression code independent on the backend-only
varlena facility. PGLZ_Header is being used to store some meta data
related to the data being compressed like the raw length of the uncompressed
record or some varlena-related data, making it unpluggable once PGLZ is
stored in src/common as it contains some backend-only code paths with
the management of varlena structures. The APIs of PGLZ are reworked
at the same time to do only compression and decompression of buffers
without the meta-data layer, simplifying its use for a more general usage.
On-disk format is preserved as well, so there is no incompatibility with
previous major versions of PostgreSQL for TOAST entries.
Exposing compression and decompression APIs of pglz makes possible its
use by extensions and contrib modules. Especially this commit is required
for upcoming WAL compression feature so that the WAL reader facility can
decompress the WAL data by using pglz_decompress.
Michael Paquier, reviewed by me.
We reserve space for the full amount, not one less. The affected checks
deal with localized month and day names. Today's DCH_MAX_ITEM_SIZ value
would suffice for a 60-byte day name, while the longest known is the
49-byte mn_CN.utf-8 word for "Saturday." Thus, the upshot of this
change is merely to avoid misdirecting future readers of the code; users
are not expected to see errors either way.
When beginning streaming replication, the client usually issues the
IDENTIFY_SYSTEM command, which used to return the current WAL insert
position. That's not suitable for the intended purpose of that field,
however. pg_receivexlog uses it to start replication from the reported
point, but if it hasn't been flushed to disk yet, it will fail. Change
IDENTIFY_SYSTEM to report the flush position instead.
Backpatch to 9.1 and above. 9.0 doesn't report any WAL position.
actually check the returned pointer allocated, potentially NULL which
could be the result of a malloc call.
Issue noted by Coverity, fixed by Michael Paquier <michael@otacoo.com>
It was getting tedious to track and release all the different things that
form a scan key. We were leaking at least the queryCategories array, and
possibly more, on a rescan. That was visible if a GIN index was used in a
nested loop join. This also protects from leaks in extractQuery method.
No backpatching, given the lack of complaints from the field. Maybe later,
after this has received more field testing.
If an insertion or update had to wait for another transaction to finish,
because there was another insertion with conflicting key in progress,
we would pass a just-free'd item pointer to XactLockTableWait().
All calls to XactLockTableWait() and MultiXactIdWait() had similar issues.
Some passed a pointer to a buffer in the buffer cache, after already
releasing the lock. The call in EvalPlanQualFetch had already released the
pin too. All but the call in execUtils.c would merely lead to reporting a
bogus ctid, however (or an assertion failure, if enabled).
All the callers that passed HeapTuple->t_data->t_ctid were slightly bogus
anyway: if the tuple was updated (again) in the same transaction, its ctid
field would point to the next tuple in the chain, not the tuple itself.
Backpatch to 9.4, where the 'ctid' argument to XactLockTableWait was added
(in commit f88d4cfc)
On windows _isnan() (which isnan() is redirected to in port/win32.h)
is declared in float.h, not math.h.
Per buildfarm animal currawong.
Backpatch to all supported branches.
All the other new SSL information functions had dummy versions in
be-secure.c, but I missed PQsslAttributes(). Oops. Surprisingly, the linker
did not complain about the missing function on most platforms represented in
the buildfarm, even though it is exported, except for a few Windows systems.
It's perfectly fine to have blocked interrupts when
ProcessClientWriteInterrupt() is called. In fact it's commonly the
case when emitting error reports. And we deal with that correctly.
Even if that'd not be the case, it'd be a bad location for such a
assertion. Because ProcessClientWriteInterrupt() is only called when
the socket is blocked it's hard to hit.
Per Heikki and buildfarm animals nightjar and dunlin.
The remaining caller (lwlocks) doesn't need that facility, and we plan
to remove ImmedidateInterruptOK entirely. That means that interrupts
can't be serviced race-free and portably anyway, so there's little
reason for keeping the feature.
Reviewed-By: Heikki Linnakangas
Deadlock checking was performed inside signal handlers up to
now. While it's a remarkable feat to have made this work reliably,
it's quite complex to understand why that is the case. Partially it
worked due to the assumption that semaphores are signal safe - which
is not actually documented to be the case for sysv semaphores.
The reason we had to rely on performing this work inside signal
handlers is that semaphores aren't guaranteed to be interruptable by
signals on all platforms. But now that latches provide a somewhat
similar API, which actually has the guarantee of being interruptible,
we can avoid doing so.
Signalling between ProcSleep, ProcWakeup, ProcWaitForSignal and
ProcSendSignal is now done using latches. This increases the
likelihood of spurious wakeups. As spurious wakeup already were
possible and aren't likely to be frequent enough to be an actual
problem, this seems acceptable.
This change would allow for further simplification of the deadlock
checking, now that it doesn't have to run in a signal handler. But
even if I were motivated to do so right now, it would still be better
to do that separately. Such a cleanup shouldn't have to be reviewed a
the same time as the more fundamental changes in this commit.
There is one possible usability regression due to this commit. Namely
it is more likely than before that log_lock_waits messages are output
more than once.
Reviewed-By: Heikki Linnakangas
We used to handle authentication_timeout by setting
ImmediateInterruptOK to true during large parts of the authentication
phase of a new connection. While that happens to work acceptably in
practice, it's not particularly nice and has ugly corner cases.
Previous commits converted the FE/BE communication to use latches and
implemented support for interrupt handling during both
send/recv. Building on top of that work we can get rid of
ImmediateInterruptOK during authentication, by immediately treating
timeouts during authentication as a reason to die. As die interrupts
are handled immediately during client communication that provides a
sensibly quick reaction time to authentication timeout.
Additionally add a few CHECK_FOR_INTERRUPTS() to some more complex
authentication methods. More could be added, but this already should
provides a reasonable coverage.
While it this overall increases the maximum time till a timeout is
reacted to, it greatly reduces complexity and increases
reliability. That seems like a overall win. If the increase proves to
be noticeable we can deal with those cases by moving to nonblocking
network code and add interrupt checking there.
Reviewed-By: Heikki Linnakangas
This field has been unreferenced since 1998, and does not appear in lseg
values stored on disk (since sizeof(lseg) is only 32 bytes according to
pg_type). There was apparently some idea of maintaining it just in values
appearing in memory, but the bookkeeping required to make that work would
surely far outweigh the cost of recalculating the line's slope when needed.
Remove it to (a) simplify matters and (b) suppress some uninitialized-field
whining from Coverity.
Up to now it was impossible to terminate a backend that was trying to
send/recv data to/from the client when the socket's buffer was already
full/empty. While the send/recv calls itself might have gotten
interrupted by signals on some platforms, we just immediately retried.
That could lead to situations where a backend couldn't be terminated ,
after a client died without the connection being closed, because it
was blocked in send/recv.
The problem was far more likely to be hit when sending data than when
reading. That's because while reading a command from the client, and
during authentication, we processed interrupts immediately . That
primarily left COPY FROM STDIN as being problematic for recv.
Change things so that that we process 'die' events immediately when
the appropriate signal arrives. We can't sensibly react to query
cancels at that point, because we might loose sync with the client as
we could be in the middle of writing a message.
We don't interrupt writes if the write buffer isn't full, as indicated
by write() returning EWOULDBLOCK, as that would lead to fewer error
messages reaching clients.
Per discussion with Kyotaro HORIGUCHI and Heikki Linnakangas
Discussion: 20140927191243.GD5423@alap3.anarazel.de
Up to now large swathes of backend code ran inside signal handlers
while reading commands from the client, to allow for speedy reaction to
asynchronous events. Most prominently shared invalidation and NOTIFY
handling. That means that complex code like the starting/stopping of
transactions is run in signal handlers... The required code was
fragile and verbose, and is likely to contain bugs.
That approach also severely limited what could be done while
communicating with the client. As the read might be from within
openssl it wasn't safely possible to trigger an error, e.g. to cancel
a backend in idle-in-transaction state. We did that in some cases,
namely fatal errors, nonetheless.
Now that FE/BE communication in the backend employs non-blocking
sockets and latches to block, we can quite simply interrupt reads from
signal handlers by setting the latch. That allows us to signal an
interrupted read, which is supposed to be retried after returning from
within the ssl library.
As signal handlers now only need to set the latch to guarantee timely
interrupt processing, remove a fair amount of complicated & fragile
code from async.c and sinval.c.
We could now actually start to process some kinds of interrupts, like
sinval ones, more often that before, but that seems better done
separately.
This work will hopefully allow to handle cases like being blocked by
sending data, interrupting idle transactions and similar to be
implemented without too much effort. In addition to allowing getting
rid of ImmediateInterruptOK, that is.
Author: Andres Freund
Reviewed-By: Heikki Linnakangas