We call this thing a "transaction block" everywhere except in a few
functions, where it is mysteriously called a "transaction chain". In
the SQL standard, a transaction chain is something different. So rename
these functions to match the common terminology.
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Part of the intent in commit fd1a421fe was to allow SQL functions that are
declared to return VOID to contain anything, including an unrelated final
SELECT, the same as SQL-language procedures can. However, the planner's
inlining logic didn't get that memo. Fix it, and add some regression tests
covering this area, since evidently we had none.
In passing, clean up some typos in comments in create_function_3.sql,
and get rid of its none-too-safe assumption that DROP CASCADE notice
output is immutably ordered.
Per report from Prabhat Sahu.
Discussion: https://postgr.es/m/CANEvxPqxAj6nNHVcaXxpTeEFPmh24Whu+23emgjiuKrhJSct0A@mail.gmail.com
There's no functional change here, or at least I hope there isn't,
just code rearrangement. The rearrangement is motivated by
partition-wise aggregate, which doesn't need to consider the
degenerate case but wants to reuse the logic for the ordinary case.
Based loosely on a patch from Ashutosh Bapat and Jeevan Chalke, but I
whacked it around pretty heavily. The larger patch series of which
this patch is a part was also reviewed and tested by Antonin Houska,
Rajkumar Raghuwanshi, David Rowley, Dilip Kumar, Konstantin Knizhnik,
Pascal Legrand, Rafia Sabih, and me.
Discussion: http://postgr.es/m/CAFjFpRewpqCmVkwvq6qrRjmbMDpN0CZvRRzjd8UvncczA3Oz1Q@mail.gmail.com
Fix the warnings created by the compiler warning options
-Wformat-overflow=2 -Wformat-truncation=2, supported since GCC 7. This
is a more aggressive variant of the fixes in
6275f5d28a, which GCC 7 warned about by
default.
The issues are all harmless, but some dubious coding patterns are
cleaned up.
One issue that is of external interest is that BGW_MAXLEN is increased
from 64 to 96. Apparently, the old value would cause the bgw_name of
logical replication workers to be truncated in some circumstances.
But this doesn't actually add those warning options. It appears that
the warnings depend a bit on compilation and optimization options, so it
would be annoying to have to keep up with that. This is more of a
once-in-a-while cleanup.
Reviewed-by: Michael Paquier <michael@paquier.xyz>
get_number_of_groups() and make_partial_grouping_target() currently
fish information directly out of the PlannerInfo; in the former case,
the target list, and in the latter case, the HAVING qual. This works
fine if there's only one grouping relation, but if the pending patch
for partition-wise aggregate gets committed, we'll have multiple
grouping relations and must therefore use appropriately translated
versions of these values for each one. To make that simpler, pass the
values to be used as arguments.
Jeevan Chalke. The larger patch series of which this patch is a part
was also reviewed and tested by Antonin Houska, Rajkumar Raghuwanshi,
David Rowley, Dilip Kumar, Konstantin Knizhnik, Pascal Legrand, Rafia
Sabih, and me.
Discussion: http://postgr.es/m/CAM2+6=UqFnFUypOvLdm5TgC+2M=-E0Q7_LOh0VDFFzmk2BBPzQ@mail.gmail.com
Discussion: http://postgr.es/m/CAM2+6=W+L=C4yBqMrgrfTfNtbtmr4T53-hZhwbA2kvbZ9VMrrw@mail.gmail.com
The logical replication type map seems to have been misused by its only
caller -- it would try to use the remote OID as input for local type
routines, which unsurprisingly could result in bogus "cache lookup
failed for type XYZ" errors, or random other type names being picked up
if they happened to use the right OID. Fix that, changing
Oid logicalrep_typmap_getid(Oid remoteid) to
char *logicalrep_typmap_gettypname(Oid remoteid)
which is more useful. If the remote type is not part of the typmap,
this simply prints "unrecognized type" instead of choking trying to
figure out -- a pointless exercise (because the only input for that
comes from replication messages, which are not under the local node's
control) and dangerous to boot, when called from within an error context
callback.
Once that is done, it comes to light that the local OID in the typmap
entry was not being used for anything; the type/schema names are what we
need, so remove local type OID from that struct.
Once you do that, it becomes pointless to attach a callback to regular
syscache invalidation. So remove that also.
Reported-by: Dang Minh Huong
Author: Masahiko Sawada
Reviewed-by: Álvaro Herrera, Petr Jelínek, Dang Minh Huong, Atsushi Torikoshi
Discussion: https://postgr.es/m/75DB81BEEA95B445AE6D576A0A5C9E936A6BE964@BPXM05GP.gisp.nec.co.jp
Discussion: https://postgr.es/m/75DB81BEEA95B445AE6D576A0A5C9E936A6C4B0A@BPXM05GP.gisp.nec.co.jp
In b5635948ab, a couple of function header comments weren't changed, or
weren't changed correctly, to reflect the arguments being passed into
the functions. Specifically, get_number_of_groups() had the wrong
argument name in the commit and create_grouping_paths() wasn't
updated even though the arguments had been changed.
The issue with create_grouping_paths() was noticed by Ashutosh Bapat,
while I discovered the issue with get_number_of_groups() by looking to
see if there were any similar issues from that commit.
Discussion: https://postgr.es/m/CAFjFpRcbp4702jcp387PExt3fNCt62QJN8++DQGwBhsW6wRHWA@mail.gmail.com
In a top-level CALL, the values of INOUT arguments will be returned as a
result row. In PL/pgSQL, the values are assigned back to the input
arguments. In other languages, the same convention as for return a
record from a function is used. That does not require any code changes
in the PL implementations.
Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
Autovacuum's 'workitem' request queue is of limited size, so requests
can fail if they arrive more quickly than autovacuum can process them.
Emit a log message when this happens, to provide better visibility of
this.
Backpatch to 10. While this represents an API change for
AutoVacuumRequestWork, that function is not yet prepared to deal with
external modules calling it, so there doesn't seem to be any risk (other
than log spam, that is.)
Author: Masahiko Sawada
Reviewed-by: Fabrízio Mello, Ildar Musin, Álvaro Herrera
Discussion: https://postgr.es/m/CAD21AoB1HrQhp6_4rTyHN5kWEJCEsG8YzsjZNt-ctoXSn5Uisw@mail.gmail.com
A simple UNION ALL gets flattened into an appendrel of subquery
RTEs, but up until now it's been impossible for the appendrel to use
the partial paths for the subqueries, so we can implement the
appendrel as a Parallel Append but only one with non-partial paths
as children.
There are three separate obstacles to removing that limitation.
First, when planning a subquery, propagate any partial paths to the
final_rel so that they are potentially visible to outer query levels
(but not if they have initPlans attached, because that wouldn't be
safe). Second, after planning a subquery, propagate any partial paths
for the final_rel to the subquery RTE in the outer query level in the
same way we do for non-partial paths. Third, teach finalize_plan() to
account for the possibility that the fake parameter we use for rescan
signalling when the plan contains a Gather (Merge) node may be
propagated from an outer query level.
Patch by me, reviewed and tested by Amit Khandekar, Rajkumar
Raghuwanshi, and Ashutosh Bapat. Test cases based on examples by
Rajkumar Raghuwanshi.
Discussion: http://postgr.es/m/CA+Tgmoa6L9A1nNCk3aTDVZLZ4KkHDn1+tm7mFyFvP+uQPS7bAg@mail.gmail.com
The existing logic for updating pg_class.reltuples trusted the sampling
results only for the pages ANALYZE actually visited, preferring to
believe the previous tuple density estimate for all the unvisited pages.
While there's some rationale for doing that for VACUUM (first that
VACUUM is likely to visit a very nonrandom subset of pages, and second
that we know for sure that the unvisited pages did not change), there's
no such rationale for ANALYZE: by assumption, it's looked at an unbiased
random sample of the table's pages. Furthermore, in a very large table
ANALYZE will have examined only a tiny fraction of the table's pages,
meaning it cannot slew the overall density estimate very far at all.
In a table that is physically growing, this causes reltuples to increase
nearly proportionally to the change in relpages, regardless of what is
actually happening in the table. This has been observed to cause reltuples
to become so much larger than reality that it effectively shuts off
autovacuum, whose threshold for doing anything is a fraction of reltuples.
(Getting to the point where that would happen seems to require some
additional, not well understood, conditions. But it's undeniable that if
reltuples is seriously off in a large table, ANALYZE alone will not fix it
in any reasonable number of iterations, especially not if the table is
continuing to grow.)
Hence, restrict the use of vac_estimate_reltuples() to VACUUM alone,
and in ANALYZE, just extrapolate from the sample pages on the assumption
that they provide an accurate model of the whole table. If, by very bad
luck, they don't, at least another ANALYZE will fix it; in the old logic
a single bad estimate could cause problems indefinitely.
In HEAD, let's remove vac_estimate_reltuples' is_analyze argument
altogether; it was never used for anything and now it's totally pointless.
But keep it in the back branches, in case any third-party code is calling
this function.
Per bug #15005. Back-patch to all supported branches.
David Gould, reviewed by Alexander Kuzmenkov, cosmetic changes by me
Discussion: https://postgr.es/m/20180117164916.3fdcf2e9@engels
In databases with many tables, re-fetching the statistics takes some time,
so that this behavior seriously decreases the available concurrency for
multiple autovac workers. There's discussion afoot about more complete
fixes, but a simple and back-patchable amelioration is to claim the table
and release the lock before rechecking stats. If we find out there's no
longer a reason to process the table, re-taking the lock to un-claim the
table is cheap enough.
(This patch is quite old, but got lost amongst a discussion of more
aggressive fixes. It's not clear when or if such a fix will be
accepted, but in any case it'd be unlikely to get back-patched.
Let's do this now so we have some improvement for the back branches.)
In passing, make the normal un-claim step take AutovacuumScheduleLock
not AutovacuumLock, since that is what is documented to protect the
wi_tableoid field. This wasn't an actual bug in view of the fact that
readers of that field hold both locks, but it creates some concurrency
penalty against operations that need only AutovacuumLock.
Back-patch to all supported versions.
Jeff Janes
Discussion: https://postgr.es/m/26118.1520865816@sss.pgh.pa.us
Several places used similar code to convert a string to an int, so take
the function that we already had and make it globally available.
Reviewed-by: Michael Paquier <michael@paquier.xyz>
A Value node would store an integer as a long. This causes needless
portability risks, as long can be of varying sizes. Change it to use
int instead. All code using this was already careful to only store
32-bit values anyway.
Reviewed-by: Michael Paquier <michael@paquier.xyz>
CREATE TABLE / LIKE with a bigint identity column would fail on
platforms where long is 32 bits. Copying the sequence values used
makeInteger(), which would truncate the 64-bit sequence data to 32 bits.
To fix, use makeFloat() instead, like the parser. (This does not
actually make use of floats, but stores the values as strings.)
Bug: #15096
Reviewed-by: Michael Paquier <michael@paquier.xyz>
If a table containing a primary key is attach as partition to a
partitioned table which has a primary key with a different definition,
we would happily create a second one in the new partition. Oops. It
turns out that this is because an error check in DefineIndex is executed
only if you tell it that it's being run by ALTER TABLE, and the original
code here wasn't. Change it so that it does.
Added a couple of test cases for this, also. A previously working test
started to fail in a different way than before patch because the new
check is called earlier; change the PK to plain UNIQUE so that the new
behavior isn't invoked, so that the test continues to verify what we
want it to verify.
Reported by: Noriyoshi Shinoda
Discussion: https://postgr.es/m/DF4PR8401MB102060EC2615EC9227CC73F7EEDF0@DF4PR8401MB1020.NAMPRD84.PROD.OUTLOOK.COM
One of the things canonicalize_qual() does is to remove constant-NULL
subexpressions of top-level AND/OR clauses. It does that on the assumption
that what it's given is a top-level WHERE clause, so that NULL can be
treated like FALSE. Although this is documented down inside a subroutine
of canonicalize_qual(), it wasn't mentioned in the documentation of that
function itself, and some callers hadn't gotten that memo.
Notably, commit d007a9505 caused get_relation_constraints() to apply
canonicalize_qual() to CHECK constraints. That allowed constraint
exclusion to misoptimize situations in which a CHECK constraint had a
provably-NULL subclause, as seen in the regression test case added here,
in which a child table that should be scanned is not. (Although this
thinko is ancient, the test case doesn't fail before 9.2, for reasons
I've not bothered to track down in detail. There may be related cases
that do fail before that.)
More recently, commit f0e44751d added an independent bug by applying
canonicalize_qual() to index expressions, which is even sillier since
those might not even be boolean. If they are, though, I think this
could lead to making incorrect index entries for affected index
expressions in v10. I haven't attempted to prove that though.
To fix, add an "is_check" parameter to canonicalize_qual() to specify
whether it should assume WHERE or CHECK semantics, and make it perform
NULL-elimination accordingly. Adjust the callers to apply the right
semantics, or remove the call entirely in cases where it's not known
that the expression has one or the other semantics. I also removed
the call in some cases involving partition expressions, where it should
be a no-op because such expressions should be canonical already ...
and was a no-op, independently of whether it could in principle have
done something, because it was being handed the qual in implicit-AND
format which isn't what it expects. In HEAD, add an Assert to catch
that type of mistake in future.
This represents an API break for external callers of canonicalize_qual().
While that's intentional in HEAD to make such callers think about which
case applies to them, it seems like something we probably wouldn't be
thanked for in released branches. Hence, in released branches, the
extra parameter is added to a new function canonicalize_qual_ext(),
and canonicalize_qual() is a wrapper that retains its old behavior.
Patch by me with suggestions from Dean Rasheed. Back-patch to all
supported branches.
Discussion: https://postgr.es/m/24475.1520635069@sss.pgh.pa.us
Commit b08df9cab left things rather poorly documented as far as the
exact semantics of "clause_is_check" mode went. Also, that mode did
not really work correctly for predicate_refuted_by; although given the
lack of specification as to what it should do, as well as the lack
of any actual use-case, that's perhaps not surprising.
Rename "clause_is_check" to "weak" proof mode, and provide specifications
for what it should do. I defined weak refutation as meaning "truth of A
implies non-truth of B", which makes it possible to use the mode in the
part of relation_excluded_by_constraints that checks for mutually
contradictory WHERE clauses. Fix up several places that did things wrong
for that definition. (As far as I can see, these errors would only lead
to failure-to-prove, not incorrect claims of proof, making them not
serious bugs even aside from the fact that v10 contains no use of this
mode. So there seems no need for back-patching.)
In addition, teach predicate_refuted_by_recurse that it can use
predicate_implied_by_recurse after all when processing a strong NOT-clause,
so long as it asks for the correct proof strength. This is an optimization
that could have been included in commit b08df9cab, but wasn't.
Also, simplify and generalize the logic that checks for whether nullness of
the argument of IS [NOT] NULL would force overall nullness of the predicate
or clause. (This results in a change in the partition_prune test's output,
as it is now able to prune an all-nulls partition that it did not recognize
before.)
In passing, in PartConstraintImpliedByRelConstraint, remove bogus
conversion of the constraint list to explicit-AND form and then right back
again; that accomplished nothing except forcing a useless extra level of
recursion inside predicate_implied_by.
Discussion: https://postgr.es/m/5983.1520487191@sss.pgh.pa.us
Since commit 69f4b9c85f, the existing
code was no longer assessing the parallel-safety of the real tlist
for each upper rel, but rather the first of possibly several tlists
created by split_pathtarget_at_srfs(). Repair.
Even though this is clearly wrong, it's not clear that it has any
user-visible consequences at the moment, so no back-patch for now. If
we discover later that it does have user-visible consequences, we
might need to back-patch this to v10.
Patch by me, per a report from Rajkumar Raghuwanshi.
Discussion: http://postgr.es/m/CA+Tgmoaob_Strkg4Dcx=VyxnyXtrmkV=ofj=pX7gH9hSre-g0Q@mail.gmail.com
We were independently checking ReservedBackends < MaxConnections and
max_wal_senders < MaxConnections, but because walsenders aren't allowed
to use superuser-reserved connections, that's really the wrong thing.
Correct behavior is to insist on ReservedBackends + max_wal_senders being
less than MaxConnections. Fix the code and associated documentation.
This has been wrong for a long time, but since the situation probably
hardly ever arises in the field (especially pre-v10, when the default
for max_wal_senders was zero), no back-patch.
Discussion: https://postgr.es/m/28271.1520195491@sss.pgh.pa.us
Noticed while playing with changes that mess with the bootstrap
sequence; the operations patched here failed to emit anything, leading
the developer to think that the bug was in the previous operation that
did emit a message.
Commit 1804284042 established that single-batch
parallel-aware hash joins could create one large shared hash table using the
combined work_mem budget of all participants. The costing accidentally
assumed that parallel-oblivious hash joins could also do that. The
documentation for initial_cost_hashjoin() also failed to mention the new
argument. Repair.
Author: Thomas Munro
Reported-By: Antonin Houska
Reviewed-By: Antonin Houska
Discussion: https://postgr.es/m/12441.1513935950%40localhost
If a walsender exits leaving data in reorderbuffers, the next walsender
that tries to decode the same transaction would append its decoded data
in the same spill files without truncating it first, which effectively
duplicate the data. Avoid that by removing any leftover reorderbuffer
spill files when a walsender starts.
Backpatch to 9.4; this bug has been there from the very beginning of
logical decoding.
Author: Craig Ringer, revised by me
Reviewed by: Álvaro Herrera, Petr Jelínek, Masahiko Sawada
Apparently, it doesn't work to use a plain cstring as a Name datum: you
may end up having random bytes because of failing to zero the bytes
after the terminating \0, as indicated by valgrind. I introduced this
bug in 5564c11815, so backpatch this fix to REL_10_STABLE, like that
commit.
While at it, fix a slightly misleading comment, pointed out by David
Rowley.
Since edd44738bc WCO expressions of partitioned tables are
initialized with the first subplan as parent. That's not correct, as
the correct context is the ModifyTableState node. That's also what is
used for RETURNING processing, initialized nearby.
This appears not to cause any visible problems for in core code, but
is problematic for in development patch.
Discussion: https://postgr.es/m/20180303043818.tnvlo243bgy7una3@alap3.anarazel.de
This is analogous to the syntax allowed for VACUUM. This allows us to
avoid making new options reserved keywords and makes it easier to
allow arbitrary argument order. Oh, and it's consistent with the other
commands, too.
Author: Nathan Bossart
Reviewed-By: Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/D3FC73E2-9B1A-4DB4-8180-55F57D116B4E@amazon.com
The LIKE INCLUDING ALL clause to CREATE TABLE intuitively indicates
cloning of extended statistics on the source table, but it failed to do
so. Patch it up so that it does. Also include an INCLUDING STATISTICS
option to the LIKE clause, so that the behavior can be requested
individually, or excluded individually.
While at it, reorder the INCLUDING options, both in code and in docs, in
alphabetical order which makes more sense than feature-implementation
order that was previously used.
Backpatch this to Postgres 10, where extended statistics were
introduced, because this is seen as an oversight in a fresh feature
which is better to get consistent from the get-go instead of changing
only in pg11.
In pg11, comments on statistics objects are cloned too. In pg10 they
are not, because I (Álvaro) was too coward to change the parse node as
required to support it. Also, in pg10 I chose not to renumber the
parser symbols for the various INCLUDING options in LIKE, for the same
reason. Any corresponding user-visible changes (docs) are backpatched,
though.
Reported-by: Stephen Froehlich
Author: David Rowley
Reviewed-by: Álvaro Herrera, Tomas Vondra
Discussion: https://postgr.es/m/CY1PR0601MB1927315B45667A1B679D0FD5E5EF0@CY1PR0601MB1927.namprd06.prod.outlook.com
Commit 34db06ef9a adopted a lock-free
design for shm_mq.c, but it introduced a race condition that could
lose messages. When shm_mq_receive_bytes() detects that the other end
has detached, it must make sure that it has seen the final version of
mq_bytes_written, or it might miss a message sent before detaching.
Thomas Munro
Discussion: https://postgr.es/m/CAEepm%3D2myZ4qxpt1a%3DC%2BwEv3o188K13K3UvD-44FK0SdAzHy%2Bw%40mail.gmail.com
If convert_to_scalar is passed a pair of datatypes it can't cope with,
its former behavior was just to elog(ERROR). While this is OK so far as
the core code is concerned, there's extension code that would like to use
scalarltsel/scalargtsel/etc as selectivity estimators for operators that
work on non-core datatypes, and this behavior is a show-stopper for that
use-case. If we simply allow convert_to_scalar to return FALSE instead of
outright failing, then the main logic of scalarltsel/scalargtsel will work
fine for any operator that behaves like a scalar inequality comparison.
The lack of conversion capability will mean that we can't estimate to
better than histogram-bin-width precision, since the code will effectively
assume that the comparison constant falls at the middle of its bin. But
that's still a lot better than nothing. (Someday we should provide a way
for extension code to supply a custom version of convert_to_scalar, but
today is not that day.)
While poking at this issue, we noted that the existing code for handling
type bytea in convert_to_scalar is several bricks shy of a load.
It assumes without checking that if the comparison value is type bytea,
the bounds values are too; in the worst case this could lead to a crash.
It also fails to detoast the input values, so that the comparison result is
complete garbage if any input is toasted out-of-line, compressed, or even
just short-header. I'm not sure how often such cases actually occur ---
the bounds values, at least, are probably safe since they are elements of
an array and hence can't be toasted. But that doesn't make this code OK.
Back-patch to all supported branches, partly because author requested that,
but mostly because of the bytea bugs. The change in API for the exposed
routine convert_network_to_scalar() is theoretically a back-patch hazard,
but it seems pretty unlikely that any third-party code is calling that
function directly.
Tomas Vondra, with some adjustments by me
Discussion: https://postgr.es/m/b68441b6-d18f-13ab-b43b-9a72188a4e02@2ndquadrant.com
Separate out the pg_attribute logic of genbki.pl into its own function.
Drop unnecessary "defined $catalog->{data}" check. This both narrows
and shortens the data writing loop of the script. There is no functional
change (the emitted files are the same as before).
John Naylor
Discussion: https://postgr.es/m/CAJVSVGXnLH=BSo0x-aA818f=MyQqGS5nM-GDCWAMdnvQJTRC1A@mail.gmail.com
Rationalize a couple of macro names:
* In catalog/pg_init_privs.h, rename Anum_pg_init_privs_privs to
Anum_pg_init_privs_initprivs to match the column's actual name.
* In ecpg, rename ZPBITOID to BITOID to match catalog/pg_type.h.
This reduces reader confusion, and will allow us to generate these
macros automatically in future.
In catalog/pg_tablespace.h, fix the ordering of related DATA and
#define lines to agree with how it's done elsewhere. This has no
impact today, but simplifies life for the bootstrap data conversion
scripts.
John Naylor
Discussion: https://postgr.es/m/CAJVSVGXnLH=BSo0x-aA818f=MyQqGS5nM-GDCWAMdnvQJTRC1A@mail.gmail.com
Sloppy coding in this function could lead to leaking a VM buffer pin,
or to attempting to free the same pin twice. Repair. While at it,
reduce the code's tendency to free and reacquire the same page pin.
Back-patch to 9.6; before that, this routine did not concern itself
with VM pages.
Amit Kapila and Tom Lane
Discussion: https://postgr.es/m/CAA4eK1KJKwhc=isgTQHjM76CAdVswzNeAuZkh_cx-6QgGkSEgA@mail.gmail.com
The new column distinguishes normal functions, procedures, aggregates,
and window functions. This replaces the existing columns proisagg and
proiswindow, and replaces the convention that procedures are indicated
by prorettype == 0. Also change prorettype to be VOIDOID for procedures.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Instead of marking data from the ringer buffer consumed and setting the
sender's latch for every message, do it only when the amount of data we
can consume is at least 1/4 of the size of the ring buffer, or when no
data remains in the ring buffer. This is dramatically faster in my
testing; apparently, the savings from sending signals less frequently
outweighs the benefit of letting the sender know about available buffer
space sooner.
Patch by me, reviewed by Andres Freund and tested by Rafia Sabih.
Discussion: http://postgr.es/m/CA+TgmoYK7RFj6r7KLEfSGtYZCi3zqTRhAz8mcsDbUAjEmLOZ3Q@mail.gmail.com
Previously, mq_bytes_read and mq_bytes_written were protected by the
spinlock, but that turns out to cause pretty serious spinlock
contention on queries which send many tuples through a Gather or
Gather Merge node. This patches changes things so that we instead
read and write those values using 8-byte atomics. Since mq_bytes_read
can only be changed by the receiver and mq_bytes_written can only be
changed by the sender, the only purpose of the spinlock is to prevent
reads and writes of these values from being torn on platforms where
8-byte memory access is not atomic, making the conversion fairly
straightforward.
Testing shows that this produces some slowdown if we're using emulated
64-bit atomics, but since they should be available on any platform
where performance is a primary concern, that seems OK. It's faster,
sometimes a lot faster, on platforms where such atomics are available.
Patch by me, reviewed by Andres Freund, who also suggested the
design. Also tested by Rafia Sabih.
Discussion: http://postgr.es/m/CA+TgmoYuK0XXxmUNTFT9TSNiBtWnRwasBcHHRCOK9iYmDLQVPg@mail.gmail.com
Previously, it just returned the heap tuple count, which might be only an
estimate, and would be completely the wrong thing if the index is partial.
Since this function scans every index page anyway to find free pages,
it's practically free to count the surviving index tuples. Let's do that
and return an accurate count.
This is easily visible as a wrong reltuples value for a partial GiST
index following VACUUM, so back-patch to all supported branches.
Andrey Borodin, reviewed by Michail Nikolaev
Discussion: https://postgr.es/m/151956654251.6915.675951950408204404.pgcf@coridan.postgresql.org
For consistency with other code that deals in numbers of buckets, the
macro BUCKETS_PER_PARTITION should produce a value of type size_t.
Also, fix a mention of an obsolete proposed name for dshash.c that
appeared in a comment.
Author: Thomas Munro, based on an observation from Amit Kapila
Discussion: https://postgr.es/m/CAA4eK1%2BBOp5aaW3aHEkg5Bptf8Ga_BkBnmA-%3DXcAXShs0yCiYQ%40mail.gmail.com
These errors have been seen in the field in corrupted-data situations.
It seems worthwhile to report them with ERRCODE_DATA_CORRUPTED, rather
than the generic ERRCODE_INTERNAL_ERROR, for the benefit of log monitoring
and tools like amcheck. However, use errmsg_internal so that the text
strings still aren't translated; it seems unlikely to be worth
translators' time to do so.
Back-patch to 9.3, like the predecessor commit d70cf811f that introduced
these elog calls originally (replacing Asserts).
Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-Wzmn4-Pg-UGFwyuyK-wiTih9j32pwg_7T9iwqXpAUZr=Mg@mail.gmail.com
Commit 4800f16a7a added some sanity checks to ensure we don't
accidentally corrupt data, but in one of them we failed to consider the
effects of a database upgraded from 9.2 or earlier, where a tuple
exclusively locked prior to the upgrade has a slightly different bit
pattern. Fix that by using the macro that we fixed in commit
74ebba84ae for similar situations.
Reported-by: Alexandre Garcia
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/CAPYLKR6yxV4=pfW0Gwij7aPNiiPx+3ib4USVYnbuQdUtmkMaEA@mail.gmail.com
Andres suspects that this bug may have wider ranging consequences, but I
couldn't find anything.
Since 9.5, it's possible that some but not all columns of an index
support returning the indexed value for index-only scans. If the
same indexed column appears in index columns that behave both ways,
check_index_only() supposed that it'd be OK to do an index-only scan
testing that column; but that fails if we have to recheck the indexed
condition on one of the columns that doesn't support this.
In principle we could make this work by remapping the recheck expressions
to pull the value from a column that does support returning the indexed
value. But such cases are so weird and rare that, at least for now,
it doesn't seem worth the trouble. Instead, just teach check_index_only
that a value is returnable only if all the index columns containing it
are returnable, rather than any of them.
Per report from David Pereiro Lagares. Back-patch to 9.5 where the
possibility of this situation appeared.
Kyotaro Horiguchi
Discussion: https://postgr.es/m/1516210494.1798.16.camel@nlpgo.com
formrdesc's comment listed the specific catalogs it is called for,
but the list was out of date. Rather than jumping back onto that
maintenance treadmill, let's just remove the list. It tells the
reader nothing that can't be learned quickly and more reliably by
searching relcache.c for callers of formrdesc().
Oversight noted by Kyotaro Horiguchi.
Discussion: https://postgr.es/m/20180214.105314.138966434.horiguchi.kyotaro@lab.ntt.co.jp
Commit a26116c6c accidentally changed the behavior of the SQL format_type()
function while refactoring. For the reasons explained in that function's
comment, a NULL typemod argument should behave differently from a -1
argument. Since we've managed to break this, add a regression test
memorializing the intended behavior.
In passing, be consistent about the type of the "flags" parameter.
Noted by Rushabh Lathia, though I revised the patch some more.
Discussion: https://postgr.es/m/CAGPqQf3RB2q-d2Awp_-x-Ur6aOxTUwnApt-vm-iTtceZxYnePg@mail.gmail.com
Use IndexTupleSize everywhere, instead. Also, remove IndexTupleSize's
internal typecast, as that's not really needed and might mask coding
errors. Change some pointer variable datatypes in the call sites
to compensate for that and make it clearer what we're assuming.
Ildar Musin, Robert Haas, Stephen Frost
Discussion: https://postgr.es/m/0274288e-9e88-13b6-c61c-7b36928bf221@postgrespro.ru
Solaris 11.4 has built-in functions named b64_encode and b64_decode.
Rename ours to something else to avoid the conflict (fortunately,
ours are static so the impact is limited).
One could wish for less duplication of code in this area, but that
would be a larger patch and not very suitable for back-patching.
Since this is a portability fix, we want to put it into all supported
branches.
Report and initial patch by Rainer Orth, reviewed and adjusted a bit
by Michael Paquier
Discussion: https://postgr.es/m/ydd372wk28h.fsf@CeBiTec.Uni-Bielefeld.DE
The previous code considered two tables to have the partition scheme
if the underlying columns had the same collation, but what we
actually need to compare is not the collations associated with the
column but the collation used for partitioning. Fix that.
Robert Haas and Amit Langote
Discussion: http://postgr.es/m/0f95f924-0efa-4cf5-eb5f-9a3d1bc3c33d@lab.ntt.co.jp
Parallel-aware plan nodes must be prepared to run without parallelism
if it's not possible at execution time for whatever reason. Commit
ab72716778, which introduced Parallel
Append, overlooked this.
Rajkumar Raghuwanshi reported this problem, and I included his test
case in this patch. The code changes are by me.
Discussion: http://postgr.es/m/CAKcux6=WqkUudLg1GLZZ7fc5ScWC1+Y9qD=pAHeqy32WoeJQvw@mail.gmail.com
Tom Kazimiers reported that transition tables don't work correctly when
they are scanned by more than one executor node. That's because commit
18ce3a4ab allocated separate read pointers for each executor node, as it
must, but failed to make them active at the appropriate times. Repair.
Thomas Munro
Discussion: https://postgr.es/m/20180224034748.bixarv6632vbxgeb%40dewberry.localdomain
A before-update row trigger may choose to return the "new" or "old" tuple
unmodified. ExecBRUpdateTriggers failed to consider the second
possibility, and would proceed to free the "old" tuple even if it was the
one returned, leading to subsequent access to already-deallocated memory.
In debug builds this reliably leads to an "invalid memory alloc request
size" failure; in production builds it might accidentally work, but data
corruption is also possible.
This is a very old bug. There are probably a couple of reasons it hasn't
been noticed up to now. It would be more usual to return NULL if one
wanted to suppress the update action; returning "old" is significantly less
efficient since the update will occur anyway. Also, none of the standard
PLs would ever cause this because they all returned freshly-manufactured
tuples even if they were just copying "old". But commit 4b93f5799 changed
that for plpgsql, making it possible to see the bug with a plpgsql trigger.
Still, this is certainly legal behavior for a trigger function, so it's
ExecBRUpdateTriggers's fault not plpgsql's.
It seems worth creating a test case that exercises returning "old" directly
with a C-language trigger; testing this through plpgsql seems unreliable
because its behavior might change again.
Report and fix by Rushabh Lathia; regression test case by me.
Back-patch to all supported branches.
Discussion: https://postgr.es/m/CAGPqQf1P4pjiNPrMof=P_16E-DFjt457j+nH2ex3=nBTew7tXw@mail.gmail.com
Commit 3bf05e096b sometimes uses the
cheapest_partial_path variable in this function to mean the cheapest
one from the input rel and at other times the cheapest one from the
partially grouped rel, but it never resets it, so we can end up with
bad plans, leading to "ERROR: Aggref found in non-Agg plan node".
Jeevan Chalke, per a report from Andreas Joseph Krogh and a separate
off-list report from Rajkumar Raghuwanshi
Discussion: http://postgr.es/m/CAM2+6=X9kxQoL2ZqZ00E6asBt9z+rfyWbOmhXJ0+8fPAyMZ9Jg@mail.gmail.com
This makes the client programs behave as documented regardless of the
connect-time search_path and regardless of user-created objects. Today,
a malicious user with CREATE permission on a search_path schema can take
control of certain of these clients' queries and invoke arbitrary SQL
functions under the client identity, often a superuser. This is
exploitable in the default configuration, where all users have CREATE
privilege on schema "public".
This changes behavior of user-defined code stored in the database, like
pg_index.indexprs and pg_extension_config_dump(). If they reach code
bearing unqualified names, "does not exist" or "no schema has been
selected to create in" errors might appear. Users may fix such errors
by schema-qualifying affected names. After upgrading, consider watching
server logs for these errors.
The --table arguments of src/bin/scripts clients have been lax; for
example, "vacuumdb -Zt pg_am\;CHECKPOINT" performed a checkpoint. That
now fails, but for now, "vacuumdb -Zt 'pg_am(amname);CHECKPOINT'" still
performs a checkpoint.
Back-patch to 9.3 (all supported versions).
Reviewed by Tom Lane, though this fix strategy was not his first choice.
Reported by Arseniy Sharoglazov.
Security: CVE-2018-1058
Historically, pg_dump has "set search_path = foo, pg_catalog" when
dumping an object in schema "foo", and has also caused that setting
to be used while restoring the object. This is problematic because
functions and operators in schema "foo" could capture references meant
to refer to pg_catalog entries, both in the queries issued by pg_dump
and those issued during the subsequent restore run. That could
result in dump/restore misbehavior, or in privilege escalation if a
nefarious user installs trojan-horse functions or operators.
This patch changes pg_dump so that it does not change the search_path
dynamically. The emitted restore script sets the search_path to what
was used at dump time, and then leaves it alone thereafter. Created
objects are placed in the correct schema, regardless of the active
search_path, by dint of schema-qualifying their names in the CREATE
commands, as well as in subsequent ALTER and ALTER-like commands.
Since this change requires a change in the behavior of pg_restore
when processing an archive file made according to this new convention,
bump the archive file version number; old versions of pg_restore will
therefore refuse to process files made with new versions of pg_dump.
Security: CVE-2018-1058
Up until now, we've abused grouped_rel->partial_pathlist as a place to
store partial paths that have been partially aggregate, but that's
really not correct, because a partial path for a relation is supposed
to be one which produces the correct results with the addition of only
a Gather or Gather Merge node, and these paths also require a Finalize
Aggregate step. Instead, add a new partially_group_rel which can hold
either partial paths (which need to be gathered and then have
aggregation finalized) or non-partial paths (which only need to have
aggregation finalized). This allows us to reuse generate_gather_paths
for partially_grouped_rel instead of writing new code, so that this
patch actually basically no net new code while making things cleaner,
simplifying things for pending patches for partition-wise aggregate.
Robert Haas and Jeevan Chalke. The larger patch series of which this
patch is a part was also reviewed and tested by Antonin Houska,
Rajkumar Raghuwanshi, David Rowley, Dilip Kumar, Konstantin Knizhnik,
Pascal Legrand, Rafia Sabih, and me.
Discussion: http://postgr.es/m/CA+TgmobrzFYS3+U8a_BCy3-hOvh5UyJbC18rEcYehxhpw5=ETA@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoZyQEjdBNuoG9-wC5GQ5GrO4544Myo13dVptvx+uLg9uQ@mail.gmail.com
Recent Perl versions don't have the current directory in the module
include path anymore, so we need to add it here explicitly to make these
scripts continue to work.
Commit 0a459cec9 left this for later, but since time's running out,
I went ahead and took care of it. There are more data types that
somebody might someday want RANGE support for, but this is enough
to satisfy all expectations of the SQL standard, which just says that
"numeric, datetime, and interval" types should have RANGE support.
In the pgoutput plugin, skip changes for relations that are not
publishable, per is_publishable_class(). This concerns in particular
materialized views and information_schema tables. While those relations
cannot be part of a publication, per existing checks, they will be
considered by a FOR ALL TABLES publication. A subscription would not
actually apply changes for those relations, again per existing checks,
but trying to match incoming changes to local tables on the subscriber
would lead to errors if no matching local table exists. Skipping those
changes on the publisher avoids sending useless changes and eliminates
the error.
Bug: #15044
Reported-by: Chad Trabant <chad@iris.washington.edu>
Reviewed-by: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Given overlapping or partially redundant join clauses, for example
t1 JOIN t2 ON t1.a = t2.x AND t1.b = t2.x
the planner's EquivalenceClass machinery will ordinarily refactor the
clauses as "t1.a = t1.b AND t1.a = t2.x", so that join processing doesn't
see multiple references to the same EquivalenceClass in a list of join
equality clauses. However, if the join is outer, it's incorrect to derive
a restriction clause on the outer side from the join conditions, so the
clause refactoring does not happen and we end up with overlapping join
conditions. The code that attempted to deal with such cases had several
subtle bugs, which could result in "left and right pathkeys do not match in
mergejoin" or "outer pathkeys do not match mergeclauses" planner errors,
if the selected join plan type was a mergejoin. (It does not appear that
any actually incorrect plan could have been emitted.)
The core of the problem really was failure to recognize that the outer and
inner relations' pathkeys have different relationships to the mergeclause
list. A join's mergeclause list is constructed by reference to the outer
pathkeys, so it will always be ordered the same as the outer pathkeys, but
this cannot be presumed true for the inner pathkeys. If the inner sides of
the mergeclauses contain multiple references to the same EquivalenceClass
({t2.x} in the above example) then a simplistic rendering of the required
inner sort order is like "ORDER BY t2.x, t2.x", but the pathkey machinery
recognizes that the second sort column is redundant and throws it away.
The mergejoin planning code failed to account for that behavior properly.
One error was to try to generate cut-down versions of the mergeclause list
from cut-down versions of the inner pathkeys in the same way as the initial
construction of the mergeclause list from the outer pathkeys was done; this
could lead to choosing a mergeclause list that fails to match the outer
pathkeys. The other problem was that the pathkey cross-checking code in
create_mergejoin_plan treated the inner and outer pathkey lists
identically, whereas actually the expectations for them must be different.
That led to false "pathkeys do not match" failures in some cases, and in
principle could have led to failure to detect bogus plans in other cases,
though there is no indication that such bogus plans could be generated.
Reported by Alexander Kuzmenkov, who also reviewed this patch. This has
been broken for years (back to around 8.3 according to my testing), so
back-patch to all supported branches.
Discussion: https://postgr.es/m/5dad9160-4632-0e47-e120-8e2082000c01@postgrespro.ru
Similar to what commit b022923556 for a
different set of functions, pass the required bits of the PartitionKey
instead of the whole thing. This allows these functions to be used
without needing the PartitionKey to be available.
Amit Langote. The larger patch series of which this patch is a part
has been reviewed and tested by Ashutosh Bapat, David Rowley, Dilip
Kumar, Jesper Pedersen, Rajkumar Raghuwanshi, Beena Emerson, Kyotaro
Horiguchi, Álvaro Herrera, and me, but especially and in great detail
by David Rowley.
Discussion: http://postgr.es/m/098b9c71-1915-1a2a-8d52-1a7a50ce79e8@lab.ntt.co.jp
Discussion: http://postgr.es/m/1f6498e8-377f-d077-e791-5dc84dba2c00@lab.ntt.co.jp
To support parameters in CALL, move the parse analysis of the procedure
and arguments into the global transformation phase, so that the parser
hooks can be applied. And then at execution time pass the parameters
from ProcessUtility on to ExecuteCallStmt.
Add the user-callable functions sha224, sha256, sha384, sha512. We
already had these in the C code to support SCRAM, but there was no test
coverage outside of the SCRAM tests. Adding these as user-callable
functions allows writing some tests. Also, we have a user-callable md5
function but no more modern alternative, which led to wide use of md5 as
a general-purpose hash function, which leads to occasional complaints
about using md5.
Also mark the existing md5 functions as leak-proof.
Reviewed-by: Michael Paquier <michael@paquier.xyz>
It's not necessary to fully initialize the executor data structures
for partitions to which no tuples are ever routed. Consider, for
example, an INSERT statement that inserts only one row: it only cares
about the partition to which that one row is routed. The new function
ExecInitPartitionInfo performs the initialization in question only
when a particular partition is about to receive a tuple. This includes
creating, validating, and saving a pointer to the ResultRelInfo,
setting up for speculative insertions, translating WCOs and
initializing the resulting expressions, translating returning lists
and building the appropriate projection information, and setting up a
tuple conversion map.
One thing that's not deferred is locking the child partitions; that
seems desirable but would need more thought. Still, testing shows
that this makes single-row inserts significantly faster on a table
with many partitions without harming the bulk-insert case.
Amit Langote, reviewed by Etsuro Fujita, with a few changes by me
Discussion: http://postgr.es/m/8975331d-d961-cbdd-f862-fdd3d97dc2d0@lab.ntt.co.jp
Previously, Append didn't charge anything at all, and MergeAppend
charged only cpu_operator_cost, about half the value used here. This
change might make MergeAppend plans slightly more likely to be chosen
than before, since this commit increases the assumed cost for Append
-- with default values -- by 0.005 per tuple but MergeAppend by only
0.0025 per tuple. Since the comparisons required by MergeAppend are
costed separately, it's not clear why MergeAppend needs to be
otherwise more expensive than Append, so hopefully this is OK.
Prior to partition-wise join, it didn't really matter whether or not
an Append node had any cost of its own, because every plan had to use
the same number of Append or MergeAppend nodes and in the same places.
Only the relative cost of Append vs. MergeAppend made a difference.
Now, however, it is possible to avoid some of the Append nodes using a
partition-wise join, so it's worth making an effort. Pending patches
for partition-wise aggregate care too, because an Append of Aggregate
nodes will incur the Append overhead fewer times than an Aggregate
over an Append. Although in most cases this change will favor the use
of partition-wise techniques, it does the opposite when the join
cardinality is greater than the sum of the input cardinalities. Since
this situation arises in an existing regression test, I [rhaas]
adjusted it to keep the overall plan shape approximately the same.
Jeevan Chalke, per a suggestion from David Rowley. Reviewed by
Ashutosh Bapat. Some changes by me. The larger patch series of which
this patch is a part was also reviewed and tested by Antonin Houska,
Rajkumar Raghuwanshi, David Rowley, Dilip Kumar, Konstantin Knizhnik,
Pascal Legrand, Rafia Sabih, and me.
Discussion: http://postgr.es/m/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B=ZRh-rxy9qxfPA5Gw@mail.gmail.com
Previously tts_off was, for unknown reasons, of type long. For one
that's unnecessary as tuples are restricted in length, for another
long would be a bad choice of type even if that weren't the case, as
it's not reliably wider than an int. Also HeapTupleHeader->t_len is a
uint32.
This is split off from a larger patch implementing JITed tuple
deforming. Seems like an independent improvement, as tiny as it is.
Author: Andres Freund
An updating query that reads a CTE within an InitPlan or SubPlan could get
incorrect results if it updates rows that are concurrently being modified.
This is caused by CteScanNext supposing that nothing inside its recursive
ExecProcNode call could change which read pointer is selected in the CTE's
shared tuplestore. While that's normally true because of scoping
considerations, it can break down if an EPQ plan tree gets built during the
call, because EvalPlanQualStart builds execution trees for all subplans
whether they're going to be used during the recheck or not. And it seems
like a pretty shaky assumption anyway, so let's just reselect our own read
pointer here.
Per bug #14870 from Andrei Gorita. This has been broken since CTEs were
implemented, so back-patch to all supported branches.
Discussion: https://postgr.es/m/20171024155358.1471.82377@wrigleys.postgresql.org
If we restrict unique constraints on partitioned tables so that they
must always include the partition key, then our standard approach to
unique indexes already works --- each unique key is forced to exist
within a single partition, so enforcing the unique restriction in each
index individually is enough to have it enforced globally. Therefore we
can implement unique indexes on partitions by simply removing a few
restrictions (and adding others.)
Discussion: https://postgr.es/m/20171222212921.hi6hg6pem2w2t36z@alvherre.pgsql
Discussion: https://postgr.es/m/20171229230607.3iib6b62fn3uaf47@alvherre.pgsql
Reviewed-by: Simon Riggs, Jesper Pedersen, Peter Eisentraut, Jaime
Casanova, Amit Langote
In what was doubtless a typo, commit bf6c614a2 introduced a duplicate
initialization of a local variable. This made Coverity unhappy, as well
as pretty much anybody reading the code. We don't even have a real use
for the local variable, so just remove it.
Introduce a new format_type_extended, with a flags bitmask argument that
can modify the default behavior. A few compatibility and readability
wrappers remain:
format_type_be
format_type_be_qualified
format_type_with_typemod
while format_type_with_typemod_qualified, which had a single caller, is
removed.
Author: Michael Paquier, some revisions by me
Discussion: 20180213035107.GA2915@paquier.xyz
The reason for doing so is that it will allow expression evaluation to
optimize based on the underlying tupledesc. In particular it will
allow to JIT tuple deforming together with the expression itself.
For that expression initialization needs to be moved after the
relevant slots are initialized - mostly unproblematic, except in the
case of nodeWorktablescan.c.
After doing so there's no need for ExecAssignResultType() and
ExecAssignResultTypeFromTL() anymore, as all former callers have been
converted to create a slot with a fixed descriptor.
When creating a slot with a fixed descriptor, tts_values/isnull can be
allocated together with the main slot, reducing allocation overhead
and increasing cache density a bit.
Author: Andres Freund
Discussion: https://postgr.es/m/20171206093717.vqdxe5icqttpxs3p@alap3.anarazel.de
This has a performance benefit on own, although not hugely so. The
primary benefit is that it will allow for to JIT tuple deforming and
comparator invocations.
Large parts of this were previously committed (773aec7aa), but the
commit contained an omission around cross-type comparisons and was
thus reverted.
Author: Andres Freund
Discussion: https://postgr.es/m/20171129080934.amqqkke2zjtekd4t@alap3.anarazel.de
elog(FATAL) would end up calling PortalCleanup(), which would call
executor shutdown code, which could fail and crash, especially under
parallel query. This was introduced by
8561e4840c, which did not want to mark an
active portal as failed by a normal transaction abort anymore. But we
do need to do that for an elog(FATAL) exit. Introduce a variable
shmem_exit_inprogress similar to the existing proc_exit_inprogress, so
we can tell whether we are in the FATAL exit scenario.
Reported-by: Andres Freund <andres@anarazel.de>
Other header files should never #include postgres.h (nor postgres_fe.h,
nor c.h), per project policy. Also, there's no need for any backend .c
file to explicitly include elog.h or palloc.h, because postgres.h pulls
those in already.
Extracted from a larger patch by Kyotaro Horiguchi. The rest of the
removals he suggests require more study, but these are no-brainers.
Discussion: https://postgr.es/m/20180215.200447.209320006.horiguchi.kyotaro@lab.ntt.co.jp
This reverts commit 773aec7aa9.
There's an unresolved issue in the reverted commit: It only creates
one comparator function, but in for the nodeSubplan.c case we need
more (c.f. FindTupleHashEntry vs LookupTupleHashEntry calls in
nodeSubplan.c).
This isn't too difficult to fix, but it's not entirely trivial
either. The fact that the issue only causes breakage on 32bit systems
shows that the current test coverage isn't that great. To avoid
turning half the buildfarm red till those two issues are addressed,
revert.
Seems a bit silly that many (in fact all, as of today) uses of
StaticAssertExpr would need to cast it to void to avoid warnings from
pickier compilers. Let's just do the cast right in the macro, instead.
In passing, change StaticAssertExpr to StaticAssertStmt in one
place where that seems more apropos.
Discussion: https://postgr.es/m/16161.1518715186@sss.pgh.pa.us
All of these are false positives, but in each case a fair amount of
analysis is needed to see that, and it's not too surprising that not all
compilers are smart enough. (In particular, in the logtape.c case, a
compiler lacking the knowledge provided by the Assert would almost surely
complain, so that this warning will be seen in any non-assert build.)
Some of these are of long standing while others are pretty recent,
but it only seems worth fixing them in HEAD.
Jaime Casanova, tweaked a bit by me
Discussion: https://postgr.es/m/CAJGNTeMcYAMJdPAom52dppLMtF-UnEZi0dooj==75OEv1EoBZA@mail.gmail.com
Formerly, DTYPE_REC was used only for variables declared as "record";
variables of named composite types used DTYPE_ROW, which is faster for
some purposes but much less flexible. In particular, the ROW code paths
are entirely incapable of dealing with DDL-caused changes to the number
or data types of the columns of a row variable, once a particular plpgsql
function has been parsed for the first time in a session. And, since the
stored representation of a ROW isn't a tuple, there wasn't any easy way
to deal with variables of domain-over-composite types, since the domain
constraint checking code would expect the value to be checked to be a
tuple. A lesser, but still real, annoyance is that ROW format cannot
represent a true NULL composite value, only a row of per-field NULL
values, which is not exactly the same thing.
Hence, switch to using DTYPE_REC for all composite-typed variables,
whether "record", named composite type, or domain over named composite
type. DTYPE_ROW remains but is used only for its native purpose, to
represent a fixed-at-compile-time list of variables, for instance the
targets of an INTO clause.
To accomplish this without taking significant performance losses, introduce
infrastructure that allows storing composite-type variables as "expanded
objects", similar to the "expanded array" infrastructure introduced in
commit 1dc5ebc90. A composite variable's value is thereby kept (most of
the time) in the form of separate Datums, so that field accesses and
updates are not much more expensive than they were in the ROW format.
This holds the line, more or less, on performance of variables of named
composite types in field-access-intensive microbenchmarks, and makes
variables declared "record" perform much better than before in similar
tests. In addition, the logic involved with enforcing composite-domain
constraints against updates of individual fields is in the expanded
record infrastructure not plpgsql proper, so that it might be reusable
for other purposes.
In further support of this, introduce a typcache feature for assigning a
unique-within-process identifier to each distinct tuple descriptor of
interest; in particular, DDL alterations on composite types result in a new
identifier for that type. This allows very cheap detection of the need to
refresh tupdesc-dependent data. This improves on the "tupDescSeqNo" idea
I had in commit 687f096ea: that assigned identifying sequence numbers to
successive versions of individual composite types, but the numbers were not
unique across different types, nor was there support for assigning numbers
to registered record types.
In passing, allow plpgsql functions to accept as well as return type
"record". There was no good reason for the old restriction, and it
was out of step with most of the other PLs.
Tom Lane, reviewed by Pavel Stehule
Discussion: https://postgr.es/m/8962.1514399547@sss.pgh.pa.us
The modern way is to use a missing_ok argument instead of two separate
almost-identical routines, so do that.
Author: Michaël Paquier
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/20180201063212.GE6398@paquier.xyz
The previous code failed to realize that this setting effectively
disables parallelism, and would crash if it decided to attempt
parallelism anyway. Instead, treat it as a disabling condition.
Kyotaro Horiguchi, who also reported the issue. Reviewed by Michael
Paquier and Peter Geoghegan.
Discussion: http://postgr.es/m/20180209.170635.256350357.horiguchi.kyotaro@lab.ntt.co.jp
Prematurely freeing the EState used to evaluate CALL arguments led, in some
cases, to passing dangling pointers to the procedure. This was masked in
trivial cases because the argument pointers would point to Const nodes in
the original expression tree, and in some other cases because the result
value would end up in the standalone ExprContext rather than in memory
belonging to the EState --- but that wasn't exactly high quality
programming either, because the standalone ExprContext was never
explicitly freed, breaking assorted API contracts.
In addition, using a separate EState for each argument was just silly.
So let's use just one EState, and one ExprContext, and make the latter
belong to the former rather than be standalone, and clean up the EState
(and hence the ExprContext) post-call.
While at it, improve the function's commentary a bit.
Discussion: https://postgr.es/m/29173.1518282748@sss.pgh.pa.us
CALL statements cannot support sub-SELECTs in the arguments of the called
procedure, since they just use ExecEvalExpr to evaluate such arguments.
Teach transformSubLink() to reject the case, as it already does for other
contexts in which subqueries are not supported.
In passing, s/EXPR_KIND_CALL/EXPR_KIND_CALL_ARGUMENT/ to make that enum
symbol line up more closely with the phrasing of the error messages it is
associated with. And fix someone's weak grasp of English grammar in the
preceding EXPR_KIND_PARTITION_EXPRESSION addition. Also update an
incorrect comment in resolve_unique_index_expr (possibly it was correct
when written, but nowadays transformExpr definitely does reject SRFs here).
Per report from Pavel Stehule --- but this resolves only one of the bugs
he mentions.
Discussion: https://postgr.es/m/CAFj8pRDxOwPPzpA8i+AQeDQFj7bhVw-dR2==rfWZ3zMGkm568Q@mail.gmail.com
Doing so causes EXPLAIN ANALYZE to show trigger statistics multiple
times. Commit 2f17844104 seems to
be to blame for this.
Amit Langote, revieed by Amit Khandekar, Etsuro Fujita, and me.
When the previously-chosen plan was non-partial, all pa_finished
flags for partial plans are now set, and pa_next_plan has not yet
been set to INVALID_SUBPLAN_INDEX, the previous code could go into
an infinite loop.
Report by Rajkumar Raghuwanshi. Patch by Amit Khandekar and me.
Review by Kyotaro Horiguchi.
Discussion: http://postgr.es/m/CAJ3gD9cf43z78qY=U=H0HvOEN341qfRO-vLpnKPSviHeWgJQ5w@mail.gmail.com
This patch adds the ability to use "RANGE offset PRECEDING/FOLLOWING"
frame boundaries in window functions. We'd punted on that back in the
original patch to add window functions, because it was not clear how to
do it in a reasonably data-type-extensible fashion. That problem is
resolved here by adding the ability for btree operator classes to provide
an "in_range" support function that defines how to add or subtract the
RANGE offset value. Factoring it this way also allows the operator class
to avoid overflow problems near the ends of the datatype's range, if it
wishes to expend effort on that. (In the committed patch, the integer
opclasses handle that issue, but it did not seem worth the trouble to
avoid overflow failures for datetime types.)
The patch includes in_range support for the integer_ops opfamily
(int2/int4/int8) as well as the standard datetime types. Support for
other numeric types has been requested, but that seems like suitable
material for a follow-on patch.
In addition, the patch adds GROUPS mode which counts the offset in
ORDER-BY peer groups rather than rows, and it adds the frame_exclusion
options specified by SQL:2011. As far as I can see, we are now fully
up to spec on window framing options.
Existing behaviors remain unchanged, except that I changed the errcode
for a couple of existing error reports to meet the SQL spec's expectation
that negative "offset" values should be reported as SQLSTATE 22013.
Internally and in relevant parts of the documentation, we now consistently
use the terminology "offset PRECEDING/FOLLOWING" rather than "value
PRECEDING/FOLLOWING", since the term "value" is confusingly vague.
Oliver Ford, reviewed and whacked around some by me
Discussion: https://postgr.es/m/CAGMVOdu9sivPAxbNN0X+q19Sfv9edEPv=HibOJhB14TJv_RCQg@mail.gmail.com
LogicalTapeFreeze() may write out its first block when it is dirty but
not full, and then immediately read the first block back in from its
BufFile as a BLCKSZ-width block. This can only occur in rare cases
where very few tuples were written out, which is currently only
possible with parallel external tuplesorts. To avoid valgrind
complaints, tell it to treat the tail of logtape.c's buffer as
defined.
Commit 9da0cc3528 exposed this problem
but did not create it. LogicalTapeFreeze() has always tended to write
out some amount of garbage bytes, but previously never wrote less than
one block of data in total, so the problem was masked.
Per buildfarm members lousyjack and skink.
Peter Geoghegan, based on a suggestion from Tom Lane and me. Some
comment revisions by me.
Up to now, useful info for writing a new btree opclass has been buried
in the backend's nbtree/README file. Let's move it into the SGML docs,
in preparation for extending it with info about "in_range" functions
in the upcoming window RANGE patch.
To do this, I chose to create a new chapter for btree indexes in Part VII
(Internals), parallel to the chapters that exist for the newer index AMs.
This is a pretty short chapter as-is. At some point somebody might care
to flesh it out with more detail about btree internals, but that is
beyond the scope of my ambition for today.
Discussion: https://postgr.es/m/23141.1517874668@sss.pgh.pa.us
The previous code assumed that we'd always succeed in creating
child-joins for a joinrel for which partition-wise join was considered,
but that's not guaranteed, at least in the case where dummy rels
are involved.
Ashutosh Bapat, with some wordsmithing by me.
Discussion: http://postgr.es/m/CAFjFpRf8=uyMYYfeTBjWDMs1tR5t--FgOe2vKZPULxxdYQ4RNw@mail.gmail.com
Failure to advance the list pointer while reading partition expressions
from a list results in invoking an input function with inappropriate data,
possibly leading to crashes or, with carefully crafted input, disclosure
of arbitrary backend memory.
Bug discovered independently by Álvaro Herrera and David Rowley.
This patch is by Álvaro but owes something to David's proposed fix.
Back-patch to v10 where the issue was introduced.
Security: CVE-2018-1052
We don't need to set up the shared space for hash join instrumentation data
if instrumentation hasn't been requested. Let's follow the example of the
similar Sort node code and save a few cycles by skipping that when we can.
This reverts commit d59ff4ab3 and instead allows us to use the safer choice
of passing noError = false to shm_toc_lookup in ExecHashInitializeWorker,
since if we reach that call there should be a TOC entry to be found.
Thomas Munro
Discussion: https://postgr.es/m/E1ehkoZ-0005uW-43%40gemulon.postgresql.org
One or another author of commit 5bcf389ec seems to have thought that
computing an offset from a NULL pointer would yield another NULL pointer.
There may possibly be architectures where that works, but common machines
don't work like that. Per a quick code review of places calling
shm_toc_lookup and not using noError = false.
Commit 445dbd82a basically missed the point of commit d46633506,
which was that we shouldn't allow shm_toc_lookup() failure to lead
to a core dump or assertion crash, because the odds of such a
failure should never be considered negligible. It's correct that
we can't expect the PARALLEL_KEY_ERROR_QUEUE TOC entry to be there
if we have no workers. But if we have no workers, we're not going
to do anything in this function with the lookup result anyway,
so let's just skip it. That lets the code use the easy-to-prove-safe
noError=false case, rather than anything requiring effort to review.
Back-patch to v10, like the previous commit.
Discussion: https://postgr.es/m/3647.1517601675@sss.pgh.pa.us
Investigation of 2d2d06b7e2 revealed that
identity values were not applied in some further cases, including
logical replication subscribers, VALUES RTEs, and ALTER TABLE ... ADD
COLUMN. To fix all that, apply the identity column expression in
build_column_default() instead of repeating the same logic at each call
site.
For ALTER TABLE ... ADD COLUMN ... IDENTITY, the previous coding
completely ignored that existing rows for the new column should have
values filled in from the identity sequence. The coding using
build_column_default() fails for this because the sequence ownership
isn't registered until after ALTER TABLE, and we can't do it before
because we don't have the column in the catalog yet. So we specially
remember in ColumnDef the sequence name that we decided on and build a
custom NextValueExpr using that.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
To make this work, tuplesort.c and logtape.c must also support
parallelism, so this patch adds that infrastructure and then applies
it to the particular case of parallel btree index builds. Testing
to date shows that this can often be 2-3x faster than a serial
index build.
The model for deciding how many workers to use is fairly primitive
at present, but it's better than not having the feature. We can
refine it as we get more experience.
Peter Geoghegan with some help from Rushabh Lathia. While Heikki
Linnakangas is not an author of this patch, he wrote other patches
without which this feature would not have been possible, and
therefore the release notes should possibly credit him as an author
of this feature. Reviewed by Claudio Freire, Heikki Linnakangas,
Thomas Munro, Tels, Amit Kapila, me.
Discussion: http://postgr.es/m/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
Discussion: http://postgr.es/m/CAH2-Wz=AxWqDoVvGU7dq856S4r6sJAj6DBn7VMtigkB33N5eyg@mail.gmail.com
Remove partition_bound_cmp() and partition_bound_bsearch(), whose
void * argument could be, depending on the situation, of any of
three different types: PartitionBoundSpec *, PartitionRangeBound *,
Datum *.
Instead, introduce separate bound-searching functions for each
situation: partition_list_bsearch, partition_range_bsearch,
partition_range_datum_bsearch, and partition_hash_bsearch. This
requires duplicating the code for binary search, but it makes the
code much more type safe, involves fewer branches at runtime, and
at least in my opinion, is much easier to understand.
Along the way, add an option to partition_range_datum_bsearch
allowing the number of keys to be specified, so that we can search
for partitions based on a prefix of the full list of partition
keys. This is important for pending work to improve partition
pruning.
Amit Langote, per a suggestion from me.
Discussion: http://postgr.es/m/CA+TgmoaVLDLc8=YESRwD32gPhodU_ELmXyKs77gveiYp+JE4vQ@mail.gmail.com
Once this function has been called, we know that all workers have
started and attached to their error queues -- so if any of them
subsequently exit uncleanly, we'll be sure to throw an ERROR promptly.
Otherwise, users of the ParallelContext machinery must be careful not
to wait forever for a worker that has failed to start. Parallel query
manages to work without needing this for reasons explained in new
comments added by this patch, but it's a useful primitive for other
parallel operations, such as the pending patch to make creating a
btree index run in parallel.
Amit Kapila, revised by me. Additional review by Peter Geoghegan.
Discussion: http://postgr.es/m/CAA4eK1+e2MzyouF5bg=OtyhDSX+=Ao=3htN=T-r_6s3gCtKFiw@mail.gmail.com
The old code generated always generated a constraint of the form
col = ANY(ARRAY[val1, val2, ...]), but that's invalid when col is an
array type. Instead, generate col = val when there's only one value,
col = val1 OR col = val2 OR ... when there are multiple values and
col is of array type, and the old form when there are multiple values
and col is not of an array type.
As a side benefit, this makes constraint exclusion able to prune
a list partition declared to accept a single Boolean value, which
didn't work before.
Amit Langote, reviewed by Etsuro Fujita
Discussion: http://postgr.es/m/97267195-e235-89d1-a41a-c110198dfce9@lab.ntt.co.jp
pg_hba_file_rules erroneously reported this as scram-sha256. Fix that.
To avoid future errors and confusion, also adjust documentation links
and internal symbols to have a separator between "sha" and "256".
Reported-by: Christophe Courtois <christophe.courtois@dalibo.com>
Author: Michael Paquier <michael.paquier@gmail.com>
It's a common task to evaluate a qual and reset the corresponding
expression context. Currently that requires storing the result of the
qual eval, resetting the context, and then reacting on the result. As
that's awkward several places only reset the context next time through
a node. That's not great, so introduce a helper that evaluates and
resets.
It's a bit ugly that it currently uses MemoryContextReset() instead of
ResetExprContext(), but that seems easier than reordering all of
executor.h.
Author: Andres Freund
Discussion: https://postgr.es/m/20180109222544.f7loxrunqh3xjl5f@alap3.anarazel.de
There's never any value in giving a fully specified cache key to
SearchCatCacheList: you might as well call SearchCatCache instead,
since there could be only one match. So the maximum useful number of
key arguments is one less than the supported number of key columns.
We might as well remove the useless extra argument and save some few
bytes per call site, as well as a cycle or so per call.
I believe the reason it was coded like this is that originally, callers
had to write out all the dummy arguments in each call, and so it seemed
less confusing if SearchCatCache and SearchCatCacheList took the same
number of key arguments. But since commit e26c539e9, callers only write
their live arguments explicitly, making that a non-factor; and there's
surely been enough time for third-party modules to adapt to that coding
style. So this is only an ABI break not an API break for callers.
Per discussion with Oliver Ford, this might also make it less confusing
how to use SearchCatCacheList correctly.
Discussion: https://postgr.es/m/27788.1517069693@sss.pgh.pa.us
ExecPushExprSlots didn't initialize ExprEvalStep's resvalue/resnull
steps as it didn't use them. That caused wrong valgrind warnings for
an upcoming patch, so zero-intialize.
Also zero-initialize all scratch ExprEvalStep's allocated on the
stack, to avoid issues with similar future omissions of non-critial
data.
The changes in b81b5a96f4 did not fully
address the issue, because the bit-mixing of the IV into the final
hash-key didn't prevent clustering in the input-data survive in the
output data.
This didn't cause a lot of problems because of the additional growth
conditions added d4c62a6b62. But as we
want to rein those in due to explosive growth in some edges, this
needs to be fixed.
Author: Andres Freund
Discussion: https://postgr.es/m/20171127185700.1470.20362@wrigleys.postgresql.org
Backpatch: 10, where simplehash was introduced
create_plan_recurse lacked any stack depth check. This is not per
our normal coding rules, but I'd supposed it was safe because earlier
planner processing is more complex and presumably should eat more
stack. But bug #15033 from Andrew Grossman shows this isn't true,
at least not for queries having the form of a many-thousand-way
INTERSECT stack.
Further testing showed that recurse_set_operations is also capable
of being crashed in this way, since it likewise will recurse to the
bottom of a parsetree before calling any support functions that
might themselves contain any stack checks. However, its stack
consumption is only perhaps a third of create_plan_recurse's.
It's possible that this particular problem with create_plan_recurse can
only manifest in 9.6 and later, since before that we didn't build a Path
tree for set operations. But having seen this example, I now have no
faith in the proposition that create_plan_recurse doesn't need a stack
check, so back-patch to all supported branches.
Discussion: https://postgr.es/m/20180127050845.28812.58244@wrigleys.postgresql.org
Commit 09529a70b changed nodeIndexscan.c and nodeIndexonlyscan.c to
postpone initialization of the indexscan proper until the first tuple
fetch. It overlooked the question of mark/restore behavior, which means
that if some caller attempts to mark the scan before the first tuple fetch,
you get a null pointer dereference.
The only existing user of mark/restore is nodeMergejoin.c, which (somewhat
accidentally) will never attempt to set a mark before the first inner tuple
unless the inner child node is a Material node. Hence the case can't arise
normally, so it seems sufficient to document the assumption at both ends.
However, during an EvalPlanQual recheck, ExecScanFetch doesn't call
IndexNext but just returns the jammed-in test tuple. Therefore, if we're
doing a recheck in a plan tree with a mergejoin with inner indexscan,
it's possible to reach ExecIndexMarkPos with iss_ScanDesc still null,
as reported by Guo Xiang Tan in bug #15032.
Really, when there's a test tuple supplied during an EPQ recheck, touching
the index at all is the wrong thing: rather, the behavior of mark/restore
ought to amount to saving and restoring the es_epqScanDone flag. We can
avoid finding a place to actually save the flag, for the moment, because
given the assumption that no caller will set a mark before fetching a
tuple, es_epqScanDone must always be set by the time we try to mark.
So the actual behavior change required is just to not reach the index
access if a test tuple is supplied.
The set of plan node types that need to consider this issue are those
that support EPQ test tuples (i.e., call ExecScan()) and also support
mark/restore; which is to say, IndexScan, IndexOnlyScan, and perhaps
CustomScan. It's tempting to try to fix the problem in one place by
teaching ExecMarkPos() itself about EPQ; but ExecMarkPos supports some
plan types that aren't Scans, and also it seems risky to make assumptions
about what a CustomScan wants to do here. Also, the most likely future
change here is to decide that we do need to support marks placed before
the first tuple, which would require additional work in IndexScan and
IndexOnlyScan in any case. Hence, fix the EPQ issue in nodeIndexscan.c
and nodeIndexonlyscan.c, accepting the small amount of code duplicated
thereby, and leave it to CustomScan providers to fix this bug if they
have it.
Back-patch to v10 where commit 09529a70b came in. In earlier branches,
the index_markpos() call is a waste of cycles when EPQ is active, but
no more than that, so it doesn't seem appropriate to back-patch further.
Discussion: https://postgr.es/m/20180126074932.3098.97815@wrigleys.postgresql.org
We have a lot of code in which option names, which from the user's
viewpoint are logically keywords, are passed through the grammar as plain
identifiers, and then matched to string literals during command execution.
This approach avoids making words into lexer keywords unnecessarily. Some
places matched these strings using plain strcmp, some using pg_strcasecmp.
But the latter should be unnecessary since identifiers would have been
downcased on their way through the parser. Aside from any efficiency
concerns (probably not a big factor), the lack of consistency in this area
creates a hazard of subtle bugs due to different places coming to different
conclusions about whether two option names are the same or different.
Hence, standardize on using strcmp() to match any option names that are
expected to have been fed through the parser.
This does create a user-visible behavioral change, which is that while
formerly all of these would work:
alter table foo set (fillfactor = 50);
alter table foo set (FillFactor = 50);
alter table foo set ("fillfactor" = 50);
alter table foo set ("FillFactor" = 50);
now the last case will fail because that double-quoted identifier is
different from the others. However, none of our documentation says that
you can use a quoted identifier in such contexts at all, and we should
discourage doing so since it would break if we ever decide to parse such
constructs as true lexer keywords rather than poor man's substitutes.
So this shouldn't create a significant compatibility issue for users.
Daniel Gustafsson, reviewed by Michael Paquier, small changes by me
Discussion: https://postgr.es/m/29405B24-564E-476B-98C0-677A29805B84@yesql.se
This is preparatory refactoring to prepare the way for partition-wise
aggregate, which will reuse the new subroutines for child grouping
rels. It also does not seem like a bad idea on general principle,
as the function was getting pretty long.
Jeevan Chalke. The larger patch series of which this patch is a part
was reviewed and tested by Antonin Houska, Rajkumar Raghuwanshi,
Ashutosh Bapat, David Rowley, Dilip Kumar, Konstantin Knizhnik,
Pascal Legrand, and me. Some cosmetic changes by me.
Discussion: http://postgr.es/m/CAM2+6=V64_xhstVHie0Rz=KPEQnLJMZt_e314P0jaT_oJ9MR8A@mail.gmail.com
This clause was superseded by SQL-standard syntax back in 7.3.
We've kept it around for backwards-compatibility purposes ever since;
but 15 years seems like long enough for that, especially seeing that
there are undocumented weirdnesses in how it interacts with the
SQL-standard syntax for specifying the same options.
Michael Paquier, per an observation by Daniel Gustafsson;
some small cosmetic adjustments to nearby code by me.
Discussion: https://postgr.es/m/20180115022748.GB1724@paquier.xyz
The existing "connection authorized" server log messages used OpenSSL
API calls directly, even though similar abstracted API calls exist.
Change to use the latter instead.
Change the function prototype for the functions that return the TLS
version and the cipher to return const char * directly instead of
copying into a buffer. That makes them slightly easier to use.
Add bits= to the message. psql shows that, so we might as well show the
same information on the client and server.
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
get_relation_info() was too optimistic about opening indexes in
partitioned tables, which would raise errors when any queries were
planned on such tables. Fix by ignoring any indexes of the partitioned
kind.
CLUSTER (and ALTER TABLE CLUSTER ON) had a similar problem. Fix by
disallowing these commands in partitioned tables.
Fallout from 8b08f7d482.
These were introduced in 4cbb646334, but
after further analysis and testing, they should not be necessary and
probably weren't the part of that commit that fixed anything.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>