LLVM 12, to be released in a few months, made some breaking changes to
the Orc JIT interface. OrcV2 eventually will make it easier to support
features like concurrent JIT compilation, but this commit only allows
to compile against LLVM 12.
This commit is a bit bigger than desirable. That partially is because
the V2 interface is more granular than V1 interface, but also because
I chose to make some minor changes to < LLVM 12 code to keep the code
somewhat readable.
The LLVM 12 support will need to be backpatched. I plan to do so after
the patch stewed on the buildfarm for a few days.
Author: Andres Freund
Discussion: https://postgr.es/m/20201016011244.pmyvr3ee2gbzplq4@alap3.anarazel.de
Remove a CHECK_FOR_INTERRUPTS() call that could never actually handle an
interrupt. We always have a heap page buffer lock at this point.
Having a useless CHECK_FOR_INTERRUPTS() call is harmless but misleading.
It is probably possible to work around the immediate problem by moving
the CHECK_FOR_INTERRUPTS() to before the heap page buffer lock is
acquired. That isn't enough to make the function responsive to
interrupts, though. The index AM caller will still hold an exclusive
buffer lock of its own.
Specifically, this blocks DECLARE ... WITH HOLD and firing of deferred
triggers within index expressions and materialized view queries. An
attacker having permission to create non-temp objects in at least one
schema could execute arbitrary SQL functions under the identity of the
bootstrap superuser. One can work around the vulnerability by disabling
autovacuum and not manually running ANALYZE, CLUSTER, REINDEX, CREATE
INDEX, VACUUM FULL, or REFRESH MATERIALIZED VIEW. (Don't restore from
pg_dump, since it runs some of those commands.) Plain VACUUM (without
FULL) is safe, and all commands are fine when a trusted user owns the
target object. Performance may degrade quickly under this workaround,
however. Back-patch to 9.5 (all supported versions).
Reviewed by Robert Haas. Reported by Etienne Stalmans.
Security: CVE-2020-25695
Commit 257836a7 included an assertion that a version lookup routine is
not trying to look up "C" or "POSIX", but that case is reachable with
the user-facing SQL function pg_collation_actual_version(). Remove the
assertion.
The description of how LP_DEAD bit setting by index scans works
following commit 2ed5b87f was rather unclear. Clean that up a bit.
Also refer to LP_DEAD bit setting within _bt_check_unique() at the start
of the same section. This mechanism may actually be more important than
the generic kill_prior_tuple mechanism that the section focuses on, so
it at least deserves to be mentioned in passing.
* Avoid pointlessly highlighting that an index vacuum was executed by a
parallel worker; user doesn't care.
* Don't give the impression that a non-concurrent reindex of an invalid
index on a TOAST table would work, because it wouldn't.
* Add a "translator:" comment for a mysterious message.
Discussion: https://postgr.es/m/20201107034943.GA16596@alvherre.pgsql
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Move the system catalog index declarations from catalog/indexing.h to
the respective parent tables' catalog/pg_*.h files. The original
reason for having it split was that the old genbki system produced the
output in the order of the catalog files it read, so all the indexing
stuff needed to come separately. But this is no longer the case, and
keeping it together makes more sense.
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/c7cc82d6-f976-75d6-2e3e-b03d2cab26bb@2ndquadrant.com
Move the system catalog toast table declarations from
catalog/toasting.h to the respective parent tables' catalog/pg_*.h
files. The original reason for having it split was that the old
genbki system produced the output in the order of the catalog files it
read, so all the toasting stuff needed to come separately. But this
is no longer the case, and keeping it together makes more sense.
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/c7cc82d6-f976-75d6-2e3e-b03d2cab26bb@2ndquadrant.com
The list of indexes was being leaked when asked for an index that
doesn't have an index partition in the table partition. Not a common
case admittedly --and in most cases where it occurs, caller throws an
error anyway-- but worth fixing for cleanliness and in case any
third-party code is calling this function.
While at it, remove use of lfirst_oid() to obtain a value we already
have.
Author: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/20201105203606.GF22691@telsasoft.com
This should have been done in the initial commit that made
unix_socket_directories a list as of c9b0cbe. This change allows to
support correctly the case of ALTER SYSTEM, where it is possible to
specify multiple paths as a list, like the following pattern where
flattening is applied to each item:
ALTER SYSTEM SET unix_socket_directories = '/path1', '/path2';
Any parameters specified in postgresql.conf are parsed the same way, so
there is no compatibility change. pg_dump has a hardcoded list of
parameters marked with GUC_LIST_QUOTE, that gets its routine update.
These are reordered alphabetically for clarity.
Author: Ian Lawrence Barwick
Reviewed-by: Peter Eisentraunt, Tom Lane, Michael Paquier
Discussion: https://postgr.es/m/CAB8KJ=iMOtNY6_sUwV=LQVCJ2zgYHBDyNzVfvE5GN3WQ3v9kQg@mail.gmail.com
brin_form_tuple failed to consider the values may be toasted, inserting
the toast pointer into the index. This may easily result in index
corruption, as the toast data may be deleted and cleaned up by vacuum.
The cleanup however does not care about indexes, leaving invalid toast
pointers behind, which triggers errors like this:
ERROR: missing chunk number 0 for toast value 16433 in pg_toast_16426
A less severe consequence are inconsistent failures due to the index row
being too large, depending on whether brin_form_tuple operated on plain
or toasted version of the row. For example
CREATE TABLE t (val TEXT);
INSERT INTO t VALUES ('... long value ...')
CREATE INDEX idx ON t USING brin (val);
would likely succeed, as the row would likely include toast pointer.
Switching the order of INSERT and CREATE INDEX would likely fail:
ERROR: index row size 8712 exceeds maximum 8152 for index "idx"
because this happens before the row values are toasted.
The bug exists since PostgreSQL 9.5 where BRIN indexes were introduced.
So backpatch all the way back.
Author: Tomas Vondra
Reviewed-by: Alvaro Herrera
Backpatch-through: 9.5
Discussion: https://postgr.es/m/20201001184133.oq5uq75sb45pu3aw@development
Discussion: https://postgr.es/m/20201104010544.zexj52mlldagzowv%40development
Revert 59ab4ac32, as well as the followup fix 33862cb9c, in all
branches. We need to think a bit harder about what the behavior
of LOCK TABLE on views should be, and there's no time for that
before next week's releases. We'll take another crack at this
later.
Discussion: https://postgr.es/m/16703-e348f58aab3cf6cc@postgresql.org
Currently only OpenSSL requires this initialization, but in the future
other SSL implementations are likely to need it as well. Abstracting
this functionality out into a separate function makes this cleaner and
more clear, and also removes the dependency on OpenSSL headers from
fork_process.c.
OpenSSL is special in that we need to initialize this random number
generator even if we're not going to use it directly, until we drop
support for everything prior to OpenSSL 1.1.1. (And of course also if we
actually use it). All other implementations are left empty at this time,
but more are expected to be added in the future.
Author: Daniel Gustafsson <daniel@yesql.se>, Michael Paquier <michael@paquier.xyz>
Reviewed-By: Magnus Hagander <magnus@hagander.net>
Discussion: https://postgr.es/m/F6291C3C-747C-4C93-BCE0-28BB420B1FF5@yesql.se
There is no outright bug here but it is better to be consistent with the
usage at other places in the same file. In the passing, fix a wrong
assertion in pgstat_recv_replslot.
Author: Kyotaro Horiguchi
Reviewed-by: Sawada Masahiko and Amit Kapila
Discussion: https://postgr.es/m/20201104.175523.1704166915688949637.horikyota.ntt@gmail.com
wal_consistency_checking indicated an inconsistency in certain cases
involving nbtree page deletion. The underlying issue is that there was
a minor difference between the page image produced after a REDO routine
ran and the corresponding page image following original execution.
This harmless inconsistency has been around forever. We more or less
expect total consistency among even deleted nbtree pages these days,
though, so this won't do anymore.
To fix, tweak the REDO routine to match original execution.
Oversight in commit f47b5e13.
LOCK TABLE has complained about "infinite recursion" when applied
to a self-referential view, ever since we made it recurse into views
in v11. However, that breaks pg_dump's new assumption that it's
okay to lock every relation. There doesn't seem to be any good
reason to throw an error: if we just abandon the recursion, we've
still satisfied the requirement of locking every referenced relation.
Per bug #16703 from Andrew Bille (via Alexander Lakhin).
Discussion: https://postgr.es/m/16703-e348f58aab3cf6cc@postgresql.org
Logic for counting heap TIDs from posting list tuples (added by commit
0d861bbb) was faulty. It didn't count any TIDs/index tuples in the
event of no callback being set. This meant that we incorrectly counted
no index tuples in clean-up only VACUUMs, which could lead to
pg_class.reltuples being spuriously set to 0 in affected indexes.
To fix, go back to counting items from the page in cases where there is
no callback. This approach isn't very accurate, but it works well
enough in practice while avoiding the expense of accessing every index
tuple during cleanup-only VACUUMs.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
https://postgr.es/m/20201023174451.69e358f1@firost
Backpatch: 13-, where nbtree deduplication was introduced
Commit dee663f7 intended to drop any queued up fsync requests before
unlinking segment files, but missed a code path. Fix, by centralizing
the forget-and-unlink code into a single function.
Reported-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Discussion: https://postgr.es/m/20201104013205.icogbi773przyny5%40development
The SQL spec calls out nonstandard syntax for certain function calls,
for example substring() with numeric position info is supposed to be
spelled "SUBSTRING(string FROM start FOR count)". We accept many
of these things, but up to now would not print them in the same format,
instead simplifying down to "substring"(string, start, count).
That's long annoyed me because it creates an interoperability
problem: we're gratuitously injecting Postgres-specific syntax into
what might otherwise be a perfectly spec-compliant view definition.
However, the real reason for addressing it right now is to support
a planned change in the semantics of EXTRACT() a/k/a date_part().
When we switch that to returning numeric, we'll have the parser
translate EXTRACT() to some new function name (might as well be
"extract" if you ask me) and then teach ruleutils.c to reverse-list
that per SQL spec. In this way existing calls to date_part() will
continue to have the old semantics.
To implement this, invent a new CoercionForm value COERCE_SQL_SYNTAX,
and make the parser insert that rather than COERCE_EXPLICIT_CALL when
the input has SQL-spec decoration. (But if the input has the form of
a plain function call, continue to mark it COERCE_EXPLICIT_CALL, even
if it's calling one of these functions.) Then ruleutils.c recognizes
COERCE_SQL_SYNTAX as a cue to emit SQL call syntax. It can know
which decoration to emit using hard-wired knowledge about the
functions that could be called this way. (While this solution isn't
extensible without manual additions, neither is the grammar, so this
doesn't seem unmaintainable.) Notice that this solution will
reverse-list a function call with SQL decoration only if it was
entered that way; so dump-and-reload will not by itself produce any
changes in the appearance of views.
This requires adding a CoercionForm field to struct FuncCall.
(I couldn't resist the temptation to rearrange that struct's
field order a tad while I was at it.) FuncCall doesn't appear
in stored rules, so that change isn't a reason for a catversion
bump, but I did one anyway because the new enum value for
CoercionForm fields could confuse old backend code.
Possible future work:
* Perhaps CoercionForm should now be renamed to DisplayForm,
or something like that, to reflect its more general meaning.
This'd require touching a couple hundred places, so it's not
clear it's worth the code churn.
* The SQLValueFunction node type, which was invented partly for
the same goal of improving SQL-compatibility of view output,
could perhaps be replaced with regular function calls marked
with COERCE_SQL_SYNTAX. It's unclear if this would be a net
code savings, however.
Discussion: https://postgr.es/m/42b73d2d-da12-ba9f-570a-420e0cce19d9@phystech.edu
Gen_fmgrtab.pl treated aggregate functions the same as other built-in
functions, which is wasteful because there is no real need to have
entries for them in the fmgr_builtins[] table. Suppressing those
entries saves about 3KB in the compiled table on my machine; which
is not a lot but it's not nothing either, considering that that
table is pretty "hot". The only outside code change needed is
that ExecInitWindowAgg() can't be allowed to call fmgr_info_cxt()
on a plain aggregate function. But that saves a few cycles anyway.
Having done that, the aggregate_dummy() function is unreferenced
and might as well be dropped. Using "aggregate_dummy" as the prosrc
value for an aggregate is now just a documentation convention not
something that matters. There was some discussion of using NULL
instead to save a few bytes in pg_proc, but we'd have to remove
prosrc's BKI_FORCE_NOT_NULL marking which doesn't seem a great idea.
Anyway, it's possible there's client-side code that expects to
see "aggregate_dummy" there, so I'm loath to change it without a
strong reason.
Discussion: https://postgr.es/m/533989.1604263665@sss.pgh.pa.us
Commit ac22929a26 changed recoveryWakeupLatch so that it's reset to
NULL at the end of recovery. This change could cause a segmentation fault
in the buildfarm member 'elver'.
Previously the latch was reset to NULL after calling ShutdownWalRcv().
But there could be a window between ShutdownWalRcv() and the actual
exit of walreceiver. If walreceiver set the latch during that window,
the segmentation fault could happen.
To fix the issue, this commit changes walreceiver so that it sets
the latch only when the latch has not been reset to NULL yet.
Author: Fujii Masao
Discussion: https://postgr.es/m/5c1f8a85-747c-7bf9-241e-dd467d8a3586@iki.fi
hash_array_extended() needs to pass PG_GET_COLLATION() to the hash
function of the element type. Otherwise, the hash function of a
collation-aware data type such as text will error out, since the
introduction of nondeterministic collation made hash functions require
a collation, too.
The consequence of this is that before this change, hash partitioning
using an array over text in the partition key would not work.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/flat/32c1fdae-95c6-5dc6-058a-a90330a3b621%40enterprisedb.com
This commit gets rid of the dedicated latch for signaling the startup
process in favor of using its procLatch, since that comports better
with possible generic signal handlers using that latch.
Commit 1e53fe0e70 changed background processes so that they use standard
SIGHUP handler. Like that, this commit also makes the startup process use
standard SIGHUP handler to simplify the code.
Author: Fujii Masao
Reviewed-by: Bharath Rupireddy, Michael Paquier
Discussion: https://postgr.es/m/CALj2ACXPorUqePswDtOeM_s82v9RW32E1fYmOPZ5NuE+TWKj_A@mail.gmail.com
Accept that we can't get versions for such locale names for now. Users
will need to specify the newer language tag format to enable the
collation versioning feature. It's not clear that we can do automatic
conversion from the old style to the new style reliably enough for this
purpose.
Unfortunately, this means that collation versioning probably won't work
for the default collation unless you provide something like en-US at
initdb or CREATE DATABASE time (though, for reasons not yet understood,
it does seem to work on some systems). It'd be nice to find a better
solution, or document this quirk if we settle on it, but this should
unbreak the 3 failing build farm animals in the meantime.
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAEepm%3D0uEQCpfq_%2BLYFBdArCe4Ot98t1aR4eYiYTe%3DyavQygiQ%40mail.gmail.com
When considering Incremental Sort below a Gather Merge, we need to be
a bit more careful when matching pathkeys to EC members. It's not enough
to find a member whose Vars are all in the current relation's target;
volatile expressions in particular need to be contained in the target,
otherwise it's too early to use the pathkey.
Reported-by: Jaime Casanova
Author: James Coleman
Reviewed-by: Tomas Vondra
Backpatch-through: 13, where the incremental sort code was added
Discussion: https://postgr.es/m/CAJGNTeNaxpXgBVcRhJX%2B2vSbq%2BF2kJqGBcvompmpvXb7pq%2BoFA%40mail.gmail.com
If the planner erroneously puts a non-parallel-safe SubPlan into
a parallelized portion of the query tree, nodeSubplan.c will fail
in the worker processes because it finds a null in es_subplanstates,
which it's unable to cope with. It seems worth a test-and-elog to
make that an error case rather than a core dump case.
This probably should have been included in commit 16ebab688, which
was responsible for allowing nulls to appear in es_subplanstates
to begin with. So, back-patch to v10 where that came in.
Discussion: https://postgr.es/m/924226.1604422326@sss.pgh.pa.us
The intention in commit 491c029db was to require superuserness to
change the BYPASSRLS property, but the actual effect of the coding
in AlterRole() was to require superuserness to change anything at all
about a BYPASSRLS role. Other properties of a BYPASSRLS role should
be changeable under the same rules as for a normal role, though.
Fix that, and also take care of some documentation omissions related
to BYPASSRLS and REPLICATION role properties.
Tom Lane and Stephen Frost, per bug report from Wolfgang Walther.
Back-patch to all supported branches.
Discussion: https://postgr.es/m/a5548a9f-89ee-3167-129d-162b5985fcf8@technowledgy.de
The current implementation cannot handle this correctly, so just
forbid it for now.
GENERATED clauses must be attached to the column definition and cannot
be added later like DEFAULT, so if a child table has a generation
expression that the parent does not have, the child column will
necessarily be an attlocal column. So to implement ALTER TABLE ONLY /
DROP EXPRESSION, we'd need extra code to update attislocal of the
direct child tables, somewhat similar to how DROP COLUMN does it, so
that the resulting state can be properly dumped and restored.
Discussion: https://www.postgresql.org/message-id/flat/15830.1575468847%40sss.pgh.pa.us
Commit d94c36a45a introduced error handling to sslinfo to handle
OpenSSL errors gracefully. This ports this errorhandling to the
backend TLS implementation.
Author: Daniel Gustafsson <daniel@yesql.se>
fill_hba_line() thought it could get away with passing sizeof(struct
sockaddr_storage) rather than the actual addrlen previously returned
by getaddrinfo(). While that appears to work on many platforms,
it does not work on FreeBSD 11: you get back a failure, which leads
to the view showing NULL for the address and netmask columns in all
rows. The POSIX spec for getnameinfo() is pretty clearly on
FreeBSD's side here: you should pass the actual address length.
So it seems plausible that there are other platforms where this
coding also fails, and we just hadn't noticed.
Also, IMO the fact that getnameinfo() failure leads to a NULL output
is pretty bogus in itself. Our pg_getnameinfo_all() wrapper is
careful to emit "???" on failure, and we should use that in such
cases. NULL should only be emitted in rows that don't have IP
addresses.
Per bug #16695 from Peter Vandivier. Back-patch to v10 where this
code was added.
Discussion: https://postgr.es/m/16695-a665558e2f630be7@postgresql.org
make_ruledef() and make_viewdef() were coded to cope with possible
null-ness of these columns, but they've been marked BKI_FORCE_NOT_NULL
for some time. So there's not really any need to do more than what
we do for the other columns of pg_rewrite, i.e. just Assert that
we got non-null results.
(There is a school of thought that says Asserts aren't the thing
to do to check for corrupt data, but surely here is not the place
to start if we want such a policy.)
Also, remove long-dead-if-indeed-it-ever-wasn't-dead handling of
an empty actions list in make_ruledef(). That's an error case
and should be treated as such. (DO INSTEAD NOTHING is represented
by a CMD_NOTHING Query, not an empty list; cf transformRuleStmt.)
Kyotaro Horiguchi, some changes by me
Discussion: https://postgr.es/m/CAEudQApoA=tMTic6xEPYP_hsNZ8XtToVThK_0x7D_aFQYowq3w@mail.gmail.com
Traditionally, the names of fmgroids.h macros for pg_proc OIDs
have been constructed from the prosrc field. But sometimes the
same C function underlies multiple pg_proc entries, forcing us
to make an arbitrary choice of which OID to reference; the other
entries are then not namable via fmgroids.h. Moreover, we could
not have macros at all for pg_proc entries that aren't for
C-coded functions.
Instead, use the proname field, and append the proargtypes field
(replacing inter-argument spaces with underscores) if proname is
not unique. Special-casing unique entries such as F_OIDEQ removes
the need to change a lot of code. Indeed, I can only find two
places in the tree that need to be adjusted; while this changes
quite a few existing entries in fmgroids.h, few of them are
referenced from C code.
With this patch, all entries in pg_proc.dat have macros in fmgroids.h.
Discussion: https://postgr.es/m/472274.1604258384@sss.pgh.pa.us
Some places were using PG_GETARG_UINT32 where PG_GETARG_TRANSACTIONID
would be more appropriate. (Of course, they are the same internally,
so there is no externally visible effect.) To do that, export
PG_GETARG_TRANSACTIONID outside of xid.c. We also export
PG_RETURN_TRANSACTIONID for symmetry, even though there are currently
no external users.
Author: Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/d8f6bdd536df403b9b33816e9f7e0b9d@G08CNEXMBPEKD05.g08.fujitsu.local
Record the current version of dependent collations in pg_depend when
creating or rebuilding an index. When accessing the index later, warn
that the index may be corrupted if the current version doesn't match.
Thanks to Douglas Doole, Peter Eisentraut, Christoph Berg, Laurenz Albe,
Michael Paquier, Robert Haas, Tom Lane and others for very helpful
discussion.
Author: Thomas Munro <thomas.munro@gmail.com>
Author: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Peter Eisentraut <peter.eisentraut@2ndquadrant.com> (earlier versions)
Discussion: https://postgr.es/m/CAEepm%3D0uEQCpfq_%2BLYFBdArCe4Ot98t1aR4eYiYTe%3DyavQygiQ%40mail.gmail.com
Provide a place for the version of referenced database objects to be
recorded. A follow-up commit will use this to record dependencies on
collation versions for indexes, but similar ideas for other kinds of
objects have also been mooted.
Author: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
Discussion: https://postgr.es/m/CAEepm%3D0uEQCpfq_%2BLYFBdArCe4Ot98t1aR4eYiYTe%3DyavQygiQ%40mail.gmail.com
This model couldn't be extended to cover the default collation, and
didn't have any information about the affected database objects when the
version changed. Remove, in preparation for a follow-up commit that
will add a new mechanism.
Author: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
Discussion: https://postgr.es/m/CAEepm%3D0uEQCpfq_%2BLYFBdArCe4Ot98t1aR4eYiYTe%3DyavQygiQ%40mail.gmail.com
Logical replication protocol uses a single byte character to identify a
message type in logical replication protocol. The code uses string
literals for the same. Use Enum so that
1. All the string literals used can be found at a single place. This
makes it easy to add more types without the risk of conflicts.
2. It's easy to locate the code handling a given message type.
3. When used with switch statements, it is easy to identify the missing
cases using -Wswitch.
Author: Ashutosh Bapat
Reviewed-by: Kyotaro Horiguchi, Andres Freund, Peter Smith and Amit Kapila
Discussion: https://postgr.es/m/CAExHW5uPzQ7L0oAd_ENyvaiYMOPgkrAoJpE+ZY5-obdcVT6NPg@mail.gmail.com
Previously we only tagged on the required information to allow the
executor to perform run-time partition pruning for Append/MergeAppend
nodes belonging to base relations. It was thought that nested
Append/MergeAppend nodes were just about always pulled up into the
top-level Append/MergeAppend and that making the run-time pruning info for
any sub Append/MergeAppend nodes was a waste of time. However, that was
likely badly thought through.
Some examples of cases we're unable to pullup nested Append/MergeAppends
are: 1) Parallel Append nodes with a mix of parallel and non-parallel
paths into a Parallel Append. 2) When planning an ordered Append scan a
sub-partition which is unordered may require a nested MergeAppend path to
ensure sub-partitions don't mix up the order of tuples being fed into the
top-level Append.
Unfortunately, it was not just as simple as removing the lines in
createplan.c which were purposefully not building the run-time pruning
info for anything but RELOPT_BASEREL relations. The code in
add_paths_to_append_rel() was far too sloppy about which partitioned_rels
it included for the Append/MergeAppend paths. The original code there
would always assume accumulate_append_subpath() would pull each sub-Append
and sub-MergeAppend path into the top-level path. While it does not
appear that there were any actual bugs caused by having the additional
partitioned table RT indexes recorded, what it did mean is that later in
planning, when we built the run-time pruning info that we wasted effort
and built PartitionedRelPruneInfos for partitioned tables that we had no
subpaths for the executor to run-time prune.
Here we tighten that up so that partitioned_rels only ever contains the RT
index for partitioned tables which actually have subpaths in the given
Append/MergeAppend. We can now Assert that every PartitionedRelPruneInfo
has a non-empty present_parts. That should allow us to catch any weird
corner cases that have been missed.
In passing, it seems there is no longer a good reason to have the
AppendPath and MergeAppendPath's partitioned_rel fields a List of IntList.
We can simply have a List of Relids instead. This is more compact in
memory and faster to add new members to. We still know which is the root
level partition as these always have a lower relid than their children.
Previously this field was used for more things, but run-time partition
pruning now remains the only user of it and it has no need for a List of
IntLists.
Here we also get rid of the RelOptInfo partitioned_child_rels field. This
is what was previously used to (sometimes incorrectly) set the
Append/MergeAppend path's partitioned_rels field. That was the only usage
of that field, so we can happily just remove it.
I also couldn't resist changing some nearby code to make use of the newly
added for_each_from macro so we can skip the first element in the list
without checking if the current item was the first one on each
iteration.
A bug report from Andreas Kretschmer prompted all this work, however,
after some consideration, I'm not personally classing this as a bug fix.
So no backpatch. In Andreas' test case, it just wasn't that clear that
there was a nested Append since the top-level Append just had a single
sub-path which was pulled up a level, per 8edd0e794.
Author: David Rowley
Reviewed-by: Amit Langote
Discussion: https://postgr.es/m/flat/CAApHDvqSchs%2BubdybcfFaSPB%2B%2BEA7kqMaoqajtP0GtZvzOOR3g%40mail.gmail.com
Statistics associated to an index got lost after running REINDEX
CONCURRENTLY, while the non-concurrent case preserves these correctly.
The concurrent and non-concurrent operations need to be consistent for
the end-user, and missing statistics would force to wait for a new
analyze to happen, which could take some time depending on the activity
of the existing autovacuum workers. This issue is fixed by copying any
existing entries in pg_statistic associated to the old index to the new
one. Note that this copy is already done with the data of the index in
the stats collector.
Reported-by: Fabrízio de Royes Mello
Author: Michael Paquier, Fabrízio de Royes Mello
Reviewed-by: Justin Pryzby
Discussion: https://postgr.es/m/CAFcNs+qpFPmiHd1oTXvcPdvAHicJDA9qBUSujgAhUMJyUMb+SA@mail.gmail.com
Backpatch-through: 12
Certain background workers initiate parallel queries while
debug_query_string==NULL, at which point they attempted strlen(NULL) and
died to SIGSEGV. Older debug_query_string observers allow NULL, so do
likewise in these newer ones. Back-patch to v11, where commit
7de4a1bcc5 introduced the first of these.
Discussion: https://postgr.es/m/20201014022636.GA1962668@rfd.leadboat.com
Because of this, if you tried to create an operator family with the new
sortsupport function, you got an error:
ERROR: support function number 11 is invalid for access method gist
We missed this in commit 16fa9b2b30 that added the sortsupport function,
because it only added sortsupport to a built-in operator family.
Author: Andrey Borodin
Discussion: https://www.postgresql.org/message-id/3520A18A-5C38-4697-A2E3-F3BDE3496CD5%40yandex-team.ru
On the same reasoning as in commit 36b931214, forbid using custom
oid_symbol macros in pg_type as well as pg_proc, so that we always
rely on the predictable macro names generated by genbki.pl.
We do continue to grant grandfather status to the names CASHOID and
LSNOID, although those are now considered deprecated aliases for the
preferred names MONEYOID and PG_LSNOID. This is because there's
likely to be client-side code using the old names, and this bout of
neatnik-ism doesn't quite seem worth breaking client code.
There might be a case for grandfathering EVTTRIGGEROID, too, since
externally-maintained PLs may reference that symbol. But renaming
such references to EVENT_TRIGGEROID doesn't seem like a particularly
heavy lift --- we make far more significant backend API changes in
every major release. For now I didn't add that, but we could
reconsider if there's pushback.
The other names changed here seem pretty unlikely to have any outside
uses. Again, we could add alias macros if there are complaints, but
for now I didn't.
As before, no need for a catversion bump.
John Naylor
Discussion: https://postgr.es/m/CAFBsxsHpCbjfoddNGpnnnY5pHwckWfiYkMYSF74PmP1su0+ZOw@mail.gmail.com
When ComputeXidHorizons() was called before MyDatabaseOid is set,
e.g. because a dead row in a shared relation is encountered during
InitPostgres(), the horizon for normal tables was computed too
aggressively, ignoring all backends connected to a database.
During subsequent pruning in a data table the too aggressive horizon
could end up still being used, possibly leading to still needed tuples
being removed. Not good.
This is a bug in dc7420c2c9, which the test added in 94bc27b576 made
visible, if run with force_parallel_mode set to regress. In that case
the bug is reliably triggered, because "pruning_query" is run in a
parallel worker and the start of that parallel worker is likely to
encounter a dead row in pg_database.
The fix is trivial: Compute a more pessimistic data table horizon if
MyDatabaseId is not yet known.
Author: Andres Freund
Discussion: https://postgr.es/m/20201029040030.p4osrmaywhqaesd4@alap3.anarazel.de
This adds the statistics about transactions streamed to the decoding
output plugin from ReorderBuffer. Users can query the
pg_stat_replication_slots view to check these stats and call
pg_stat_reset_replication_slot to reset the stats of a particular slot.
Users can pass NULL in pg_stat_reset_replication_slot to reset stats of
all the slots.
Commit 9868167500 has added the basic infrastructure to capture the stats
of slot and this commit extends the statistics collector to track
additional information about slots.
Bump the catversion as we have added new columns in the catalog entry.
Author: Ajin Cherian and Amit Kapila
Reviewed-by: Sawada Masahiko and Dilip Kumar
Discussion: https://postgr.es/m/CAA4eK1+chpEomLzgSoky-D31qev19AmECNiEAietPQUGEFhtVA@mail.gmail.com
This fixes a bug in the edge case where, for a temp table, heap_page_prune()
can end up with a different horizon than heap_vacuum_rel(). Which can trigger
errors like "ERROR: cannot freeze committed xmax ...".
The bug was introduced due to interaction of a7212be8b9 "Set cutoff xmin more
aggressively when vacuuming a temporary table." with dc7420c2c9 "snapshot
scalability: Don't compute global horizons while building snapshots.".
The problem is caused by lazy_scan_heap() assuming that the only reason its
HeapTupleSatisfiesVacuum() call would return HEAPTUPLE_DEAD is if the tuple is
a HOT tuple, or if the tuple's inserting transaction has aborted since the
heap_page_prune() call. But after a7212be8b9 that was also possible in other
cases for temp tables, because heap_page_prune() uses a different visibility
test after dc7420c2c9.
The fix is fairly simple: Move the special case logic for temp tables from
vacuum_set_xid_limits() to the infrastructure introduced in dc7420c2c9. That
ensures that the horizon used for pruning is at least as aggressive as the one
used by lazy_scan_heap(). The concrete horizon used for temp tables is
slightly different than the logic in dc7420c2c9, but should always be as
aggressive as before (see comments).
A significant benefit to centralizing the logic procarray.c is that now the
more aggressive horizons for temp tables does not just apply to VACUUM but
also to e.g. HOT pruning and the nbtree killtuples logic.
Because isTopLevel is not needed by vacuum_set_xid_limits() anymore, I
undid the the related changes from a7212be8b9.
This commit also adds an isolation test ensuring that the more aggressive
vacuuming and pruning of temp tables keeps working.
Debugged-By: Amit Kapila <amit.kapila16@gmail.com>
Debugged-By: Tom Lane <tgl@sss.pgh.pa.us>
Debugged-By: Ashutosh Sharma <ashu.coek88@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20201014203103.72oke6hqywcyhx7s@alap3.anarazel.de
Discussion: https://postgr.es/m/20201015083735.derdzysdtqdvxshp@alap3.anarazel.de
It's unsafe to do this at parse time because addition of generated
columns to a table would not invalidate stored rules containing
UPDATEs on the table ... but there might now be dependent generated
columns that were not there when the rule was made. This also fixes
an oversight that rewriteTargetView failed to update extraUpdatedCols
when transforming an UPDATE on an updatable view. (Since the new
calculation is downstream of that, rewriteTargetView doesn't actually
need to do anything; but before, there was a demonstrable bug there.)
In v13 and HEAD, this leads to easily-visible bugs because (since
commit c6679e4fc) we won't recalculate generated columns that aren't
listed in extraUpdatedCols. In v12 this bitmap is mostly just used
for trigger-firing decisions, so you'd only notice a problem if a
trigger cared whether a generated column had been updated.
I'd complained about this back in May, but then forgot about it
until bug #16671 from Michael Paul Killian revived the issue.
Back-patch to v12 where this field was introduced. If existing
stored rules contain any extraUpdatedCols values, they'll be
ignored because the rewriter will overwrite them, so the bug will
be fixed even for existing rules. (But note that if someone were
to update to 13.1 or 12.5, store some rules with UPDATEs on tables
having generated columns, and then downgrade to a prior minor version,
they might observe issues similar to what this patch fixes. That
seems unlikely enough to not be worth going to a lot of effort to fix.)
Discussion: https://postgr.es/m/10206.1588964727@sss.pgh.pa.us
Discussion: https://postgr.es/m/16671-2fa55851859fb166@postgresql.org
We have a perfectly good convention for OID macros for built-in functions
already, so making custom symbols is just introducing unnecessary
deviation from the convention. Remove the one case that had snuck in,
and add an error check in genbki.pl to discourage future instances.
Although this touches pg_proc.dat, there's no need for a catversion
bump since the actual catalog data isn't changed.
John Naylor
Discussion: https://postgr.es/m/CAFBsxsHpCbjfoddNGpnnnY5pHwckWfiYkMYSF74PmP1su0+ZOw@mail.gmail.com
get_foreign_key_join_selectivity() looks for join clauses that equate
the two sides of the FK constraint. However, if we have a query like
"WHERE fktab.a = pktab.a and fktab.a = 1", it won't find any such join
clause, because equivclass.c replaces the given clauses with "fktab.a
= 1 and pktab.a = 1", which can be enforced at the scan level, leaving
nothing to be done for column "a" at the join level.
We can fix that expectation without much trouble, but then a new problem
arises: applying the foreign-key-based selectivity rule produces a
rowcount underestimate, because we're effectively double-counting the
selectivity of the "fktab.a = 1" clause. So we have to cancel that
selectivity out of the estimate.
To fix, refactor process_implied_equality() so that it can pass back the
new RestrictInfo to its callers in equivclass.c, allowing the generated
"fktab.a = 1" clause to be saved in the EquivalenceClass's ec_derives
list. Then it's not much trouble to dig out the relevant RestrictInfo
when we need to adjust an FK selectivity estimate. (While at it, we
can also remove the expensive use of initialize_mergeclause_eclasses()
to set up the new RestrictInfo's left_ec and right_ec pointers.
The equivclass.c code can set those basically for free.)
This seems like clearly a bug fix, but I'm hesitant to back-patch it,
first because there's some API/ABI risk for extensions and second because
we're usually loath to destabilize plan choices in stable branches.
Per report from Sigrid Ehrenreich.
Discussion: https://postgr.es/m/1019549.1603770457@sss.pgh.pa.us
Discussion: https://postgr.es/m/AM6PR02MB5287A0ADD936C1FA80973E72AB190@AM6PR02MB5287.eurprd02.prod.outlook.com
This makes use of CheckBuffer() introduced in c780a7a, adding a SQL
wrapper able to do checks for all the pages of a relation. By default,
all the fork types of a relation are checked, and it is possible to
check only a given relation fork. Note that if the relation given in
input has no physical storage or is temporary, then no errors are
generated, allowing full-database checks when coupled with a simple scan
of pg_class for example. This is not limited to clusters with data
checksums enabled, as clusters without data checksums can still apply
checks on pages using the page headers or for the case of a page full of
zeros.
This function returns a set of tuples consisting of:
- The physical file where a broken page has been detected (without the
segment number as that can be AM-dependent, which can be guessed from
the block number for heap). A relative path from PGPATH is used.
- The block number of the broken page.
By default, only superusers have an access to this function but
execution rights can be granted to other users.
The feature introduced here is still minimal, and more improvements
could be done, like:
- Addition of a start and end block number to run checks on a range
of blocks, which would apply only if one fork type is checked.
- Addition of some progress reporting.
- Throttling, with configuration parameters in function input or
potentially some cost-based GUCs.
Regression tests are added for positive cases in the main regression
test suite, and TAP tests are added for cases involving the emulation of
page corruptions.
Bump catalog version.
Author: Julien Rouhaud, Michael Paquier
Reviewed-by: Masahiko Sawada, Justin Pryzby
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com
CheckBuffer() is designed to be a concurrent-safe function able to run
sanity checks on a relation page without loading it into the shared
buffers. The operation is done using a lock on the partition involved
in the shared buffer mapping hashtable and an I/O lock for the buffer
itself, preventing the risk of false positives due to any concurrent
activity.
The primary use of this function is the detection of on-disk corruptions
for relation pages. If a page is found in shared buffers, the on-disk
page is checked if not dirty (a follow-up checkpoint would flush a valid
version of the page if dirty anyway), as it could be possible that a
page was present for a long time in shared buffers with its on-disk
version corrupted. Such a scenario could lead to a corrupted cluster if
a host is plugged off for example. If the page is not found in shared
buffers, its on-disk state is checked. PageIsVerifiedExtended() is used
to apply the same sanity checks as when a page gets loaded into shared
buffers.
This function will be used by an upcoming patch able to check the state
of on-disk relation pages using a SQL function.
Author: Julien Rouhaud, Michael Paquier
Reviewed-by: Masahiko Sawada
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com
This accompanies select_common_type() and select_common_collation().
Typmods were previously combined using hand-coded logic in several
places. The logic in select_common_typmod() isn't very exciting, but
it makes the code more compact and readable in a few locations, and in
the future we can perhaps do more complicated things if desired.
As a small enhancement, the type unification of the direct and
aggregate arguments of hypothetical-set aggregates now unifies the
typmod as well using this new function, instead of just dropping it.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://www.postgresql.org/message-id/flat/97df3af9-8b5e-fb7f-a029-3eb7e80d7af9@2ndquadrant.com
The restriction that only tables and views can be locked by LOCK TABLE
is quite arbitrary, since the underlying mechanism can lock any relation
type. Drop the restriction so that programs such as pg_dump can lock
all relations they're interested in, preventing schema changes that
could cause a dump to fail after expending much effort.
Backpatch to 9.5.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reported-by: Wells Oliver <wells.oliver@gmail.com>
Discussion: https://postgr.es/m/20201021200659.GA32358@alvherre.pgsql
Previously, ExecInitModifyTable relied on ExecInitJunkFilter,
and thence ExecCleanTypeFromTL, to build the target descriptor from
the query tlist. While we just checked (in ExecCheckPlanOutput)
that the tlist produces compatible output, this is not a great
substitute for the relation's actual tuple descriptor that's
available from the relcache. For one thing, dropped columns will
not be correctly marked attisdropped; it's a bit surprising that
we've gotten away with that this long. But the real reason for
being concerned with this is that using the table's descriptor means
that the slot will have correct attrmissing data, allowing us to
revert the klugy fix of commit ba9f18abd. (This commit undoes
that one's changes in trigger.c, but keeps the new test case.)
Thus we can solve the bogus-trigger-tuple problem with fewer cycles
rather than more.
No back-patch, since this doesn't fix any additional bug, and it
seems somewhat more likely to have unforeseen side effects than
ba9f18abd's narrow fix.
Discussion: https://postgr.es/m/16644-5da7ef98a7ac4545@postgresql.org
This is useful for checks of relation pages without having to load the
pages into the shared buffers, and two cases can make use of that: page
verification in base backups and the online, lock-safe, flavor.
Compatibility is kept with past versions using a macro that calls the
new extended routine with the set of options compatible with the
original version.
Extracted from a larger patch by the same author.
Author: Anastasia Lubennikova
Reviewed-by: Michael Paquier, Julien Rouhaud
Discussion: https://postgr.es/m/608f3476-0598-2514-2c03-e05c7d2b0cbd@postgrespro.ru
If the old row has any "missing" attributes that are supposed to
be retrieved from an associated tuple descriptor, the wrong things
happened because the trigger result is shoved directly into an
executor slot that lacks the missing-attribute data. Notably,
CHECK-constraint verification would incorrectly see those columns
as NULL, and so would RETURNING-list evaluation.
Band-aid around this by forcibly expanding the tuple before passing
it to the trigger function. (IMO it was a fundamental misdesign to
put the missing-attribute data into tuple constraints, which so
much of the system considers to be optional. But we're probably
stuck with that now, and will have to continue to apply band-aids
as we find other places with similar issues.)
Back-patch to v12. v11 would also have the issue, except that
commit 920311ab1 already applied a similar band-aid. That forced
expansion in more cases than seem really necessary, though, so
this isn't a directly equivalent fix.
Amit Langote, with some cosmetic changes by me
Discussion: https://postgr.es/m/16644-5da7ef98a7ac4545@postgresql.org
The ExplainCloseGroup arguments for incremental sort usage data
didn't match the corresponding ExplainOpenGroup. This only matters
for XML-format output, which is probably why we'd not noticed.
Daniel Gustafsson, per bug #16683 from Frits Jalvingh
Discussion: https://postgr.es/m/16683-8005033324ad34e9@postgresql.org
There is a handful of places where we called list_delete_ptr() to remove
some element from a List. In many of these places we know, or with very
little additional effort know the index of the ListCell that we need to
remove.
Here we change all of those places to instead either use one of;
list_delete_nth_cell(), foreach_delete_current() or list_delete_last().
Each of these saves from having to iterate over the list to search for the
element to remove by its pointer value.
There are some small performance gains to be had by doing this, but in the
general case, none of these lists are likely to be very large, so the
lookup was probably never that expensive anyway. However, some of the
calls are in fairly hot code paths, e.g process_equivalence(). So any
small gains there are useful.
Author: Zhijie Hou and David Rowley
Discussion: https://postgr.es/m/b3517353ec7c4f87aa560678fbb1034b@G08CNEXMBPEKD05.g08.fujitsu.local
After de8feb1f3a, some warnings remained
that were only visible when using GCC on Windows. Fix those as well.
Note that the ecpg test source files don't use the full pg_config.h,
so we can't use pg_funcptr_t there but have to do it the long way.
More precisely, correctly handle the ONLY flag indicating not to
recurse. This was implemented in 86f575948c by recursing in
trigger.c, but that's the wrong place; use ATSimpleRecursion instead,
which behaves properly. However, because legacy inheritance has never
recursed in that situation, make sure to do that only for new-style
partitioning.
I noticed this problem while testing a fix for another bug in the
vicinity.
This has been wrong all along, so backpatch to 11.
Discussion: https://postgr.es/m/20201016235925.GA29829@alvherre.pgsql
Change the attribute 'name' to 'slot_name' in pg_stat_replication_slots
view to make it clear and that way we will be consistent with the other
places like pg_stat_wal_receiver view where we display the same attribute.
In the passing, fix the typo in one of the macros in the related code.
Bump the catversion as we have modified the name in the catalog as well.
Reported-by: Noriyoshi Shinoda
Author: Noriyoshi Shinoda
Reviewed-by: Sawada Masahiko and Amit Kapila
Discussion: https://postgr.es/m/CA+fd4k5_pPAYRTDrO2PbtTOe0eHQpBvuqmCr8ic39uTNmR49Eg@mail.gmail.com
Since commit 913bbd88d, check_sql_fn_retval() can either insert type
coercion steps in-line in the Query that produces the SQL function's
results, or generate a new top-level Query to perform the coercions,
if modifying the Query's output in-place wouldn't be safe. However,
it appears that the latter case has never actually worked, because
the code tried to inject the new Query back into the query list it was
passed ... which is not the list that will be used for later processing
when we execute the SQL function "normally" (without inlining it).
So we ended up with no coercion happening at run-time, leading to
wrong results or crashes depending on the datatypes involved.
While the regression tests look like they cover this area well enough,
through a huge bit of bad luck all the test cases that exercise the
separate-Query path were checking either inline-able cases (which
accidentally didn't have the bug) or cases that are no-ops at runtime
(e.g., varchar to text), so that the failure to perform the coercion
wasn't obvious. The fact that the cases that don't work weren't
allowed at all before v13 probably contributed to not noticing the
problem sooner, too.
To fix, get rid of the separate "flat" list of Query nodes and instead
pass the real two-level list that is going to be used later. I chose
to make the same change in check_sql_fn_statements(), although that has
no actual bug, just so that we don't need that data structure at all.
This is an API change, as evidenced by the adjustments needed to
callers outside functions.c. That's a bit scary to be doing in a
released branch, but so far as I can tell from a quick search,
there are no outside callers of these functions (and they are
sufficiently specific to our semantics for SQL-language functions that
it's not apparent why any extension would need to call them). In any
case, v13 already changed the API of check_sql_fn_retval() compared to
prior branches.
Per report from pinker. Back-patch to v13 where this code came in.
Discussion: https://postgr.es/m/1603050466566-0.post@n3.nabble.com
Store the tuple conversion map to convert a tuple from a child table's
format to the root format in a new ri_ChildToRootMap field in
ResultRelInfo. It is initialized if transition tuple capture for FOR
STATEMENT triggers or INSERT tuple routing on a partitioned table is
needed. Previously, ModifyTable kept the maps in the per-subplan
ModifyTableState->mt_per_subplan_tupconv_maps array, or when tuple
routing was used, in
ResultRelInfo->ri_Partitioninfo->pi_PartitionToRootMap. The new field
replaces both of those.
Now that the child-to-root tuple conversion map is always available in
ResultRelInfo (when needed), remove the TransitionCaptureState.tcs_map
field. The callers of Exec*Trigger() functions no longer need to set or
save it, which is much less confusing and bug-prone. Also, as a future
optimization, this will allow us to delay creating the map for a given
result relation until the relation is actually processed during
execution.
Author: Amit Langote
Discussion: https://www.postgresql.org/message-id/CA%2BHiwqHtCWLdK-LO%3DNEsvOdHx%2B7yv4mE_zYK0i3BH7dXb-wxog%40mail.gmail.com
When executing DDL on a partitioned table or on a table with inheritance
children, statement-level triggers must be fired against the table given
in the original statement. The code to look that up was a bit messy and
duplicative. Commit 501ed02cf6 added a helper function,
getASTriggerResultRelInfo() (later renamed to getTargetResultRelInfo())
for it, but for some reason it was only used when firing AFTER STATEMENT
triggers and the code to fire BEFORE STATEMENT triggers duplicated the
logic.
Determine the target relation in ExecInitModifyTable(), and set it always
in ModifyTableState. Code that used to call getTargetResultRelInfo() can
now use ModifyTableState->rootResultRelInfo directly.
Discussion: https://www.postgresql.org/message-id/CA%2BHiwqFViT47Zbr_ASBejiK7iDG8%3DQ1swQ-tjM6caRPQ67pT%3Dw%40mail.gmail.com
In shm_mq_receive(), a huge payload could trigger an unjustified
"invalid memory alloc request size" error due to the way the buffer
size is increased.
Add error checks (documenting the upper limit) and avoid the error by
limiting the allocation size to MaxAllocSize.
Author: Markus Wanner <markus.wanner@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/3bb363e7-ac04-0ac4-9fe8-db1148755bfa%402ndquadrant.com
Given a query with enough joins, it was possible that the query planner,
after multiplying the row estimates with the join selectivity that the
estimated number of rows would exceed the limits of the double data type
and become infinite.
To give an indication on how extreme a case is required to hit this, the
particular example case reported required 379 joins to a table without any
statistics, which resulted in the 1.0/DEFAULT_NUM_DISTINCT being used for
the join selectivity. This eventually caused the row estimates to go
infinite and resulted in an assert failure in initial_cost_mergejoin()
where the infinite row estimated was multiplied by an outerstartsel of 0.0
resulting in NaN. The failing assert verified that NaN <= Inf, which is
false.
To get around this we use clamp_row_est() to cap row estimates at a
maximum of 1e100. This value is thought to be low enough that costs
derived from it would remain within the bounds of what the double type can
represent.
Aside from fixing the failing Assert, this also has the added benefit of
making it so add_path() will still receive proper numerical values as
costs which will allow it to make more sane choices when determining the
cheaper path in extreme cases such as the one described above.
Additionally, we also get rid of the isnan() checks in the join costing
functions. The actual case which originally triggered those checks to be
added in the first place never made it to the mailing lists. It seems
likely that the new code being added to clamp_row_est() will result in
those becoming checks redundant, so just remove them.
The fairly harmless assert failure problem does also exist in the
backbranches, however, a more minimalistic fix will be applied there.
Reported-by: Onder Kalaci
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/DM6PR21MB1211FF360183BCA901B27F04D80B0@DM6PR21MB1211.namprd21.prod.outlook.com
Unfortunately in LLVM 3.9 LLVMGetAttributeCountAtIndex(func, index)
crashes when called with an index that has 0 attributes. Since there's
no way to work around this in the C API, add a small C++ wrapper doing
so.
The only reason this didn't fail before 72559438f9 is that there
always are function attributes...
Author: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20201016001254.w2nfj7gd74jmb5in@alap3.anarazel.de
Backpatch: 11-, like 72559438f9
Previously we only copied the function attributes. That caused problems at
least on s390x: Because we didn't copy the 'zeroext' attribute for
ExecAggTransReparent()'s *IsNull parameters, expressions invoking it didn't
ensure that the upper bytes of the registers were zeroed. In the - relatively
rare - cases where not, ExecAggTransReparent() wrongly ended up in the
newValueIsNull branch due to the register not being zero. Subsequently causing
a crash.
It's quite possible that this would cause problems on other platforms, and in
other places than just ExecAggTransReparent() on s390x.
Thanks to Christoph (and the Debian project) for providing me with access to a
s390x machine, allowing me to debug this.
Reported-By: Christoph Berg
Author: Andres Freund
Discussion: https://postgr.es/m/20201015083246.kie5726xerdt3ael@alap3.anarazel.de
Backpatch: 11-, where JIT was added
Pavan Deolasee recently noted that a few of the
HeapTupleHeaderIndicatesMovedPartitions calls added by commit
5db6df0c01 are useless, since they are done after comparing t_self
with t_ctid. But because t_self can never be set to the magical values
that indicate that the tuple moved partition, this can never succeed: if
the first test fails (so we know t_self equals t_ctid), necessarily the
second test will also fail.
So these checks can be removed and no harm is done.
Discussion: https://postgr.es/m/20200929164411.GA15497@alvherre.pgsql
Most importantly, remove optimization in LogicalRepSyncTableStart that
skips the normal walrcv_startstreaming/endstreaming dance. The
optimization is not critically important for production uses anyway,
since it only fires in cases with no activity, and saves an
uninteresting amount of work even then. Critically, it obscures bugs by
hiding the interesting code path from test cases.
Also: in GetSubscriptionRelState, remove pointless relation open; access
pg_subscription_rel->srsubstate with GETSTRUCT as is typical rather than
SysCacheGetAttr; remove unused 'missing_ok' argument.
In wait_for_relation_state_change, use explicit catalog snapshot
invalidation rather than obscurely (and expensively) through
GetLatestSnapshot.
In various places: sprinkle comments more liberally and rewrite a number
of them. Other cosmetic code improvements.
No backpatch, since no bug is being fixed here.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Petr Jelínek <petr.jelinek@2ndquadrant.com>
Discussion: https://postgr.es/m/20201010190637.GA5774@alvherre.pgsql
The in-core equivalents can make use of built-in functions if the
compiler supports this option, making optimizations possible. 0ba99c8
replaced all existing calls in the code base at this time, but b0b39f7
(GSSAPI encryption) has forgotten to do the switch.
Discussion: https://postgr.es/m/20201014055303.GG3349@paquier.xyz
A number of places were using appendStringInfo() when they could have been
using appendStringInfoString() instead. While there's no functionality
change there, it's just more efficient to use appendStringInfoString()
when no formatting is required. Likewise for some
appendStringInfoString() calls which were just appending a single char.
We can just use appendStringInfoChar() for that.
Additionally, many places were using appendPQExpBuffer() when they could
have used appendPQExpBufferStr(). Change those too.
Patch by Zhijie Hou, but further searching by me found significantly more
places that deserved the same treatment.
Author: Zhijie Hou, David Rowley
Discussion: https://postgr.es/m/cb172cf4361e4c7ba7167429070979d4@G08CNEXMBPEKD05.g08.fujitsu.local
While registering for postmaster exit events, we have to handle a couple
of edge cases where the postmaster is already gone. Commit 815c2f09
missed one: EACCES must surely imply that PostmasterPid no longer
belongs to our postmaster process (or alternatively an unexpected
permissions model has been imposed on us). Like ESRCH, this should be
treated as a WL_POSTMASTER_DEATH event, rather than being raised with
ereport().
No known problems reported in the wild. Per code review from Tom Lane.
Back-patch to 13.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3624029.1602701929%40sss.pgh.pa.us
during logical decoding.
Prior to commit c55040ccd0 we have no way of knowing the invalidations
before commit. So, while decoding we use to execute all the invalidations
at each command end as we had no way of knowing which invalidations
happened before that command. Due to this, transactions involving large
amounts of DDLs use to take more time and also lead to high CPU usage. But
now we know specific invalidations at each command end so we execute only
required invalidations.
It has been observed that decoding of a transaction containing truncation
of a table with 1000 partitions would be finished in 1s whereas before
this patch it used to take 4-5 minutes.
Author: Dilip Kumar
Reviewed-by: Amit Kapila and Keisuke Kuroda
Discussion: https://postgr.es/m/CANDwggKYveEtXjXjqHA6RL3AKSHMsQyfRY6bK+NqhAWJyw8psQ@mail.gmail.com
I removed the duplicate command tags for START_REPLICATION inadvertently
in commit 07082b08cc, but the replication protocol requires them. The
fact that the replication protocol was broken was not noticed because
all our test cases use an optimized code path that exits early, failing
to verify that the behavior is correct for non-optimized cases. Put
them back.
Also document this protocol quirk.
Add a test case that shows the failure. It might still succeed even
without the patch when run on a fast enough server, but it suffices to
show the bug in enough cases that it would be noticed in buildfarm.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reported-by: Henry Hinze <henry.hinze@gmail.com>
Reviewed-by: Petr Jelínek <petr.jelinek@2ndquadrant.com>
Discussion: https://postgr.es/m/16643-eaadeb2a1a58d28c@postgresql.org
If WaitEventSetWait() reports that the postmaster has gone away, later
calls to WaitEventSetWait() should continue to report that. Otherwise
further waits that occur in the proc_exit() path after we already
noticed the postmaster's demise could block forever.
Back-patch to 13, where the kqueue support landed.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3624029.1602701929%40sss.pgh.pa.us
Maintaining 'es_result_relation_info' correctly at all times has become
cumbersome, especially with partitioning where each partition gets its
own result relation info. Having to set and reset it across arbitrary
operations has caused bugs in the past.
This changes all the places that used 'es_result_relation_info', to
receive the currently active ResultRelInfo via function parameters
instead.
Author: Amit Langote
Discussion: https://www.postgresql.org/message-id/CA%2BHiwqGEmiib8FLiHMhKB%2BCH5dRgHSLc5N5wnvc4kym%2BZYpQEQ%40mail.gmail.com
FDWs that can perform an UPDATE/DELETE remotely using the "direct modify"
set of APIs need to access the ResultRelInfo of the target table. That's
currently available in EState.es_result_relation_info, but the next
commit will remove that field.
This commit adds a new resultRelation field in ForeignScan, to store the
target relation's RT index, and the corresponding ResultRelInfo in
ForeignScanState. The FDW's PlanDirectModify callback is expected to set
'resultRelation' along with 'operation'. The core code doesn't need them
for anything, they are for the convenience of FDW's Begin- and
IterateDirectModify callbacks.
Authors: Amit Langote, Etsuro Fujita
Discussion: https://www.postgresql.org/message-id/CA%2BHiwqGEmiib8FLiHMhKB%2BCH5dRgHSLc5N5wnvc4kym%2BZYpQEQ%40mail.gmail.com
Instead of allocating all the ResultRelInfos upfront in one big array,
allocate them in ExecInitModifyTable(). es_result_relations is now an
array of ResultRelInfo pointers, rather than an array of structs, and it
is indexed by the RT index.
This simplifies things: we get rid of the separate concept of a "result
rel index", and don't need to set it in setrefs.c anymore. This also
allows follow-up optimizations (not included in this commit yet) to skip
initializing ResultRelInfos for target relations that were not needed at
runtime, and removal of the es_result_relation_info pointer.
The EState arrays of regular result rels and root result rels are merged
into one array. Similarly, the resultRelations and rootResultRelations
lists in PlannedStmt are merged into one. It's not actually clear to me
why they were kept separate in the first place, but now that the
es_result_relations array is indexed by RT index, it certainly seems
pointless.
The PlannedStmt->resultRelations list is now only needed for
ExecRelationIsTargetRelation(). One visible effect of this change is that
ExecRelationIsTargetRelation() will now return 'true' also for the
partition root, if a partitioned table is updated. That seems like a good
thing, although the function isn't used in core code, and I don't see any
reason for an FDW to call it on a partition root.
Author: Amit Langote
Discussion: https://www.postgresql.org/message-id/CA%2BHiwqGEmiib8FLiHMhKB%2BCH5dRgHSLc5N5wnvc4kym%2BZYpQEQ%40mail.gmail.com
gistRelocateBuildBuffersOnSplit did not get the memo about which
attribute count to use. This could lead to a crash if there were
included columns and buffering build was chosen. (Because there
are random page-split decisions elsewhere in GiST index build,
the crashes are not entirely deterministic.)
Back-patch to v12 where GiST gained support for included columns.
Pavel Borisov
Discussion: https://postgr.es/m/CALT9ZEECCV5m7wvxg46PC-7x-EybUmnpupBGhSFMoAAay+r6HQ@mail.gmail.com
Commit 16fa9b2b3 broke the ability to reliably test GiST buffered builds,
because it caused sorted builds to be done instead if sortsupport is
available, regardless of any attempt to override that. While a would-be
test case could try to work around that by choosing an opclass that has
no sortsupport function, coverage would be silently lost the moment
someone decides it'd be a good idea to add a sortsupport function.
Hence, rearrange the logic in gistbuild() so that if "buffering = on"
is specified in CREATE INDEX, we will use that method, sortsupport or no.
Also document the interaction between sorting and the buffering
parameter, as 16fa9b2b3 failed to do.
(Note that in fact we still lack any test coverage of buffered builds,
but this is a prerequisite to adding a non-fragile test.)
Discussion: https://postgr.es/m/3249980.1602532990@sss.pgh.pa.us
The prohibitValueChange code paths in set_config_option(), which
are executed whenever we re-read a PGC_POSTMASTER variable from
postgresql.conf, neglected to free anything before exiting. Thus
we'd leak the proposed new value of a PGC_STRING variable, as noted
by BoChen in bug #16666. For all variable types, if the check hook
creates an "extra" chunk, we'd also leak that.
These are malloc not palloc chunks, so there is no mechanism for
recovering the leaks before process exit. Fortunately, the values
are typically not very large, meaning you'd have to go through an
awful lot of SIGHUP configuration-reload cycles to make the leakage
amount to anything. Still, for a long-lived postmaster process it
could potentially be a problem.
Oversight in commit 2594cf0e8. Back-patch to all supported branches.
Discussion: https://postgr.es/m/16666-2c41a4eec61b03e1@postgresql.org
In the past, we always estimated that a ModifyTable node would emit the
same number of rows as its subpaths. Without a RETURNING clause, the
correct estimate is zero. Fix, in preparation for a proposed parallel
write patch that is sensitive to that number.
A remaining problem is that for RETURNING queries, the estimated width
is based on subpath output rather than the RETURNING tlist.
Reviewed-by: Greg Nancarrow <gregn4422@gmail.com>
Discussion: https://postgr.es/m/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV%3DqpFJrR3AcrTS3g%40mail.gmail.com
Up to now, only ECONNRESET (and EPIPE, in most but not quite all places)
received special treatment in our error handling logic. This patch
changes things so that related error codes such as ECONNABORTED are
also recognized as indicating that the connection's dead and unlikely
to come back.
We continue to think, however, that only ECONNRESET and EPIPE should be
reported as probable server crashes; the other cases indicate network
connectivity problems but prove little about the server's state. Thus,
there's no change in the error message texts that are output for such
cases. The key practical effect is that errcode_for_socket_access()
will report ERRCODE_CONNECTION_FAILURE rather than
ERRCODE_INTERNAL_ERROR for a network failure. It's expected that this
will fix buildfarm member lorikeet's failures since commit 32a9c0bdf,
as that seems to be due to not treating ECONNABORTED equivalently to
ECONNRESET.
The set of errnos treated this way now includes ECONNABORTED, EHOSTDOWN,
EHOSTUNREACH, ENETDOWN, ENETRESET, and ENETUNREACH. Several of these
were second-class citizens in terms of their handling in places like
get_errno_symbol(), so upgrade the infrastructure where necessary.
As committed, this patch assumes that all these symbols are defined
everywhere. POSIX specifies all of them except EHOSTDOWN, but that
seems to exist on all platforms of interest; we'll see what the
buildfarm says about that.
Probably this should be back-patched, but let's see what the buildfarm
thinks of it first.
Fujii Masao and Tom Lane
Discussion: https://postgr.es/m/2621622.1602184554@sss.pgh.pa.us
Multiply before dividing, not the reverse, so that cases that should
produce exact results do produce exact results. (width_bucket_float8
got this right already.) Even when the result is inexact, this avoids
making it more inexact, since only the division step introduces any
imprecision.
While at it, fix compute_bucket() to not uselessly repeat the sign
check already done by its caller, and avoid duplicating the
multiply/divide steps by adjusting variable usage.
Per complaint from Martin Visser. Although this seems like a bug fix,
I'm hesitant to risk changing width_bucket()'s results in stable
branches, so no back-patch.
Discussion: https://postgr.es/m/6FA5117D-6AED-4656-8FEF-B74AC18FAD85@brytlyt.com
While the calculation is not well-defined if the bounds arguments are
infinite, there is a perfectly sane outcome if the test operand is
infinite: it's just like any other value that's before the first bucket
or after the last one. width_bucket_float8() got this right, but
I was too hasty about the case when adding infinities to numerics
(commit a57d312a7), so that width_bucket_numeric() just rejected it.
Fix that, and sync the relevant error message strings.
No back-patch needed, since infinities-in-numeric haven't shipped yet.
Discussion: https://postgr.es/m/2465409.1602170063@sss.pgh.pa.us
This adds the statistics about transactions spilled to disk from
ReorderBuffer. Users can query the pg_stat_replication_slots view to check
these stats and call pg_stat_reset_replication_slot to reset the stats of
a particular slot. Users can pass NULL in pg_stat_reset_replication_slot
to reset stats of all the slots.
This commit extends the statistics collector to track this information
about slots.
Author: Sawada Masahiko and Amit Kapila
Reviewed-by: Amit Kapila and Dilip Kumar
Discussion: https://postgr.es/m/CA+fd4k5_pPAYRTDrO2PbtTOe0eHQpBvuqmCr8ic39uTNmR49Eg@mail.gmail.com
It appears that commit cf63c641c, which intended to prevent
misoptimization of the result-building step in makeOrderedSetArgs,
didn't go far enough: buildfarm member hornet's version of xlc
is now optimizing back to the old, broken behavior in which
list_length(directargs) is fetched only after list_concat() has
changed that value. I'm not entirely convinced whether that's
an undeniable compiler bug or whether it can be justified by a
sufficiently aggressive interpretation of C sequence points.
So let's just change the code to make it harder to misinterpret.
Back-patch to all supported versions, just in case.
Discussion: https://postgr.es/m/1830491.1601944935@sss.pgh.pa.us
The date-vs-timestamp, date-vs-timestamptz, and timestamp-vs-timestamptz
comparators all worked by promoting the first type to the second and
then doing a simple same-type comparison. This works fine, except
when the conversion result is out of range, in which case we throw an
entirely avoidable error. The sources of such failures are
(a) type date can represent dates much farther in the future than
the timestamp types can;
(b) timezone rotation might cause a just-in-range timestamp value to
become a just-out-of-range timestamptz value.
Up to now we just ignored these corner-case issues, but now we have
an actual user complaint (bug #16657 from Huss EL-Sheikh), so let's
do something about it.
It turns out that commit 52ad1e659 already built all the necessary
infrastructure to support error-free comparisons, but neglected to
actually use it in the main-line code paths. Fix that, do a little
bit of code style review, and remove the now-duplicate logic in
jsonpath_exec.c.
Back-patch to v13 where 52ad1e659 came in. We could take this back
further by back-patching said infrastructure, but given the small
number of complaints so far, I don't feel a great need to.
Discussion: https://postgr.es/m/16657-cde2f876d8cc7971@postgresql.org
In logical replication when a subscriber is missing some columns, it
currently emits an error message that says "some" columns are missing, but
it doesn't specify the missing column names. Change that to display
missing column names which makes an error to be more informative to the
user.
We have decided not to backpatch this commit as this is a minor usability
improvement and no user has reported this.
Reported-by: Bharath Rupireddy
Author: Bharath Rupireddy
Reviewed-by: Kyotaro Horiguchi and Amit Kapila
Discussion: https://postgr.es/m/CALj2ACVkW-EXH_4pmBK8tNeHRz5ksUC4WddGactuCjPiBch-cg@mail.gmail.com
This patch prevents crashes or wrong plans when partition-wise joins
are considered during GEQO planning, as a consequence of the
EquivalenceClass data structures becoming corrupt after a GEQO
context reset.
A remaining problem is that successive GEQO cycles will make multiple
copies of the required EC members, since add_child_join_rel_equivalences
has no idea that such members might exist already. For now we'll just
live with that. The lack of field complaints of crashes suggests that
this is a mighty little-used situation.
Back-patch to v12 where this code was introduced.
Discussion: https://postgr.es/m/1683100.1601860653@sss.pgh.pa.us
Since PG 12, clientcert no longer supported only on/off, so remove 1/0
as possible values, and instead support only the text strings
'verify-ca' and 'verify-full'.
Remove support for 'no-verify' since that is possible by just not
specifying clientcert.
Also, throw an error if 'verify-ca' is used and 'cert' authentication is
used, since cert authentication requires verify-full.
Also improve the docs.
THIS IS A BACKWARD INCOMPATIBLE API CHANGE.
Reported-by: Kyotaro Horiguchi
Discussion: https://postgr.es/m/20200716.093012.1627751694396009053.horikyota.ntt@gmail.com
Author: Kyotaro Horiguchi
Backpatch-through: master
This should help to identify what happened when studying the postmaster
log after-the-fact.
While here, clean up some old comments in the same function.
Discussion: https://postgr.es/m/1568983.1601845687@sss.pgh.pa.us
get_eclass_for_sort_expr() computes expr_relids and nullable_relids
early on, even though they won't be needed unless we make a new
EquivalenceClass, which we often don't. Aside from the probably-minor
inefficiency, there's a memory management problem: these bitmapsets will
be built in the caller's context, leading to dangling pointers if that
is shorter-lived than root->planner_cxt. This would be a live bug if
get_eclass_for_sort_expr() could be called with create_it = true during
GEQO join planning. So far as I can find, the core code never does
that, but it's hard to be sure that no extensions do, especially since
the comments make it clear that that's supposed to be a supported case.
Fix by not computing these values until we've switched into planner_cxt
to build the new EquivalenceClass.
generate_join_implied_equalities() uses inner_rel->relids to look up
relevant eclasses, but it ought to be using nominal_inner_relids.
This is presently harmless because a child RelOptInfo will always have
exactly the same eclass_indexes as its topmost parent; but that might
not be true forever, and anyway it makes the code confusing.
The first of these is old (introduced by me in f3b3b8d5b), so back-patch
to all supported branches. The second only dates to v13, but we might
as well back-patch it to keep the code looking similar across branches.
Discussion: https://postgr.es/m/1508010.1601832581@sss.pgh.pa.us
Unlike for functions, OUT parameters for procedures are part of the
signature. Therefore, they have to be listed in pg_proc.proargtypes
as well as mentioned in ALTER PROCEDURE and DROP PROCEDURE.
Reviewed-by: Andrew Dunstan <andrew.dunstan@2ndquadrant.com>
Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/2b8490fe-51af-e671-c504-47359dc453c5@2ndquadrant.com
The handling of those options was inconsistent, as the processing used
directly the value assigned to the option to check if it was redundant,
leading to patterns like this one to succeed (note that false is
specified first):
COPY hoge to '/path/to/file/' (header off, header on);
And the opposite would fail correctly (note that true is first here):
COPY hoge to '/path/to/file/' (header on, header off);
While on it, add some tests to check for all redundant patterns with the
options of COPY. I have gone through the code and did not notice
similar mistakes for other commands.
"header" got it wrong since b63990c, and "freeze" was wrong from the
start as of 8de72b6. No backpatch is done per the lack of complaints.
Reported-by: Rémi Lapeyre
Discussion: https://postgr.es/m/20200929072433.GA15570@paquier.xyz
Discussion: https://postgr.es/m/0B55BD07-83E4-439F-AACC-FA2D7CF50532@lenstra.fr
The BKI file's string quoting conventions were previously quite weird,
perhaps as a result of repurposing a function built to scan
single-quoted strings to scan double-quoted ones. Change to use the
same rules as we use in GUC files, allowing some simplifications in
genbki.pl and initdb.c.
While at it, completely remove the backend's scanstr() function, which
was essentially a duplicate of the string dequoting code in guc-file.l.
Instead export that one (under a less generic name than it had) and let
bootscanner.l use it. Now we can clarify that scansup.c exists only to
support the main lexer. We could alternatively have removed GUC_scanstr,
but this way seems better since the previous arrangement could mislead
a reader into thinking that scanstr() had something to do with the main
lexer's handling of string literals. Maybe it did once, but if so it
was a long time ago.
This patch does not bump catversion, since the initially-installed
catalog contents don't change. Note however that successful initdb
after applying this patch will require up-to-date postgres.bki as well
as postgres and initdb executables.
In passing, remove a bunch of very-long-obsolete #include's in
bootparse.y and bootscanner.l.
John Naylor
Discussion: https://postgr.es/m/CACPNZCtDpd18T0KATTmCggO2GdVC4ow86ypiq5ENff1VnauL8g@mail.gmail.com
This view shows the statistics about WAL activity. Currently it has only
two columns: wal_buffers_full and stats_reset. wal_buffers_full column
indicates the number of times WAL data was written to the disk because
WAL buffers got full. This information is useful when tuning wal_buffers.
stats_reset column indicates the time at which these statistics were
last reset.
pg_stat_wal view is also the basic infrastructure to expose other
various statistics about WAL activity later.
Bump PGSTAT_FILE_FORMAT_ID due to the change in pgstat format.
Bump catalog version.
Author: Masahiro Ikeda
Reviewed-by: Takayuki Tsunakawa, Kyotaro Horiguchi, Amit Kapila, Fujii Masao
Discussion: https://postgr.es/m/188bd3f2d2233cf97753b5ced02bb050@oss.nttdata.com
Providing this information can be useful for example when diagnosing
problems related to recovery conflicts or for recovery issues without
having to go through the output generated by pg_waldump to get some
information about the blocks a WAL record works on.
The block information is printed in the same format as pg_waldump. This
already existed in xlog.c for debugging purposes with -DWAL_DEBUG, so
adding the block information in the callback has required just a small
refactoring.
Author: Bertrand Drouvot
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/c31e2cba-efda-762c-f4ad-5c25e5dac3d0@amazon.com
This is not strictly necessary, as the right-links are only needed by
scans that are concurrent with page splits, and neither scans or page
splits can happen during sorted index build. But it seems like a good
idea to set them anyway, if we e.g. want to add a check to amcheck in
the future to verify that the chain of right-links is complete.
Author: Andrey Borodin
Discussion: https://www.postgresql.org/message-id/4D68C21F-9FB9-41DA-B663-FDFC8D143788%40yandex-team.ru
I (Andres) broke this in 623a9CA79bx, because I didn't think about the
way snapshots are built on standbys sufficiently. Unfortunately our
existing tests did not catch this, as they are all just querying with
psql (therefore ending up with fresh snapshots).
The fix is trivial, we just need to increment the transaction
completion counter in ExpireTreeKnownAssignedTransactionIds(), which
is the equivalent of ProcArrayEndTransaction() during recovery.
This commit also adds a new test doing some basic testing of the
correctness of snapshots built on standbys. To avoid the
aforementioned issue of one-shot psql's not exercising the snapshot
caching, the test uses a long lived psqls, similar to
013_crash_restart.pl. It'd be good to extend the test further.
Reported-By: Ian Barwick <ian.barwick@2ndquadrant.com>
Author: Andres Freund <andres@anarazel.de>
Author: Ian Barwick <ian.barwick@2ndquadrant.com>
Discussion: https://postgr.es/m/61291ffe-d611-f889-68b5-c298da9fb18f@2ndquadrant.com
The error message about columns in the primary key not including all of
the partition key was unclear; reword it.
Backpatch all the way to pg11, where it appeared.
Reported-by: Nagaraj Raj <nagaraj.sf@yahoo.com>
Discussion: https://postgr.es/m/64062533.78364.1601415362244@mail.yahoo.com
Previously, a conversion such as
to_date('-44-02-01','YYYY-MM-DD')
would result in '0045-02-01 BC', as the code attempted to interpret
the negative year as BC, but failed to apply the correction needed
for our internal handling of BC years. Fix the off-by-one problem.
Also, arrange for the combination of a negative year and an
explicit "BC" marker to cancel out and produce AD. This is how
the negative-century case works, so it seems sane to do likewise.
Continue to read "year 0000" as 1 BC. Oracle would throw an error,
but we've accepted that case for a long time so I'm hesitant to
change it in a back-patch.
Per bug #16419 from Saeed Hubaishan. Back-patch to all supported
branches.
Dar Alathar-Yemen and Tom Lane
Discussion: https://postgr.es/m/16419-d8d9db0a7553f01b@postgresql.org
Previously we threw an error. But make_date already allowed the case,
so it is inconsistent as well as unhelpful for make_timestamp not to.
Both functions continue to reject year zero.
Code and test fixes by Peter Eisentraut, doc changes by me
Discussion: https://postgr.es/m/13c0992c-f15a-a0ca-d839-91d3efd965d9@2ndquadrant.com
The SQL standard doesn't require jsonpath .datetime() method to support the
ISO 8601 format. But our to_json[b]() functions convert timestamps to text in
the ISO 8601 format in the sake of compatibility with javascript. So, we add
support of the ISO 8601 to the jsonpath .datetime() in the sake compatibility
with to_json[b]().
The standard mode of datetime parsing currently supports just template patterns
and separators in the format string. In order to implement ISO 8601, we have to
add support of the format string double quotes to the standard parsing mode.
Discussion: https://postgr.es/m/94321be0-cc96-1a81-b6df-796f437f7c66%40postgrespro.ru
Author: Nikita Glukhov, revised by me
Backpatch-through: 13
bffe1bd684 has introduced jsonpath .datetime() method, but default formats
for time and timestamp contain excess space between time and timezone. This
commit removes this excess space making behavior of .datetime() method
standard-compliant.
Discussion: https://postgr.es/m/94321be0-cc96-1a81-b6df-796f437f7c66%40postgrespro.ru
Author: Nikita Glukhov
Backpatch-through: 13
Previously the standby server didn't archive timeline history files
streamed from the primary even when archive_mode is set to "always",
while it archives the streamed WAL files. This could cause the PITR to
fail because there was no required timeline history file in the archive.
The cause of this issue was that walreceiver didn't mark those files as
ready for archiving.
This commit makes walreceiver mark those streamed timeline history
files as ready for archiving if archive_mode=always. Then the archiver
process archives the marked timeline history files.
Back-patch to all supported versions.
Reported-by: Grigory Smolkin
Author: Grigory Smolkin, Fujii Masao
Reviewed-by: David Zhang, Anastasia Lubennikova
Discussion: https://postgr.es/m/54b059d4-2b48-13a4-6f43-95a087c92367@postgrespro.ru
This addresses a couple of issues with the so-said subject:
- Report the correct parent relation with the index actually being
rebuilt or validated. Previously, the command status remained set to
the last index created for the progress of the index build and
validation, which would be incorrect when working on a table that has
more than one index.
- Use the correct phase when waiting before the drop of the old
indexes. Previously, this was reported with the same status as when
waiting before the old indexes are marked as dead.
Author: Matthias van de Meent, Michael Paquier
Discussion: https://postgr.es/m/CAEze2WhqFgcwe1_tv=sFYhLWV2AdpfukumotJ6JNcAOQs3jufg@mail.gmail.com
Backpatch-through: 12
We have a dozen or so places that need to iterate over all but the
first cell of a List. Prior to v13 this was typically written as
for_each_cell(lc, lnext(list_head(list)))
Commit 1cff1b95a changed these to
for_each_cell(lc, list, list_second_cell(list))
This patch introduces a new macro for_each_from() which expresses
the start point as a list index, allowing these to be written as
for_each_from(lc, list, 1)
This is marginally more efficient, since ForEachState.i can be
initialized directly instead of backing into it from a ListCell
address. It also seems clearer and less typo-prone.
Some of the remaining uses of for_each_cell() look like they could
profitably be changed to for_each_from(), but here I confined myself
to changing uses of list_second_cell().
Also, fix for_each_cell_setup() and for_both_cell_setup() to
const-ify their arguments; that's a simple oversight in 1cff1b95a.
Back-patch into v13, on the grounds that (1) the const-ification
is a minor bug fix, and (2) it's better for back-patching purposes
if we only have two ways to write these loops rather than three.
In HEAD, also remove list_third_cell() and list_fourth_cell(),
which were also introduced in 1cff1b95a, and are unused as of
cc99baa43. It seems unlikely that any third-party code would
have started to use them already; anyone who has can be directed
to list_nth_cell instead.
Discussion: https://postgr.es/m/CAApHDvpo1zj9KhEpU2cCRZfSM3Q6XGdhzuAS2v79PH7WJBkYVA@mail.gmail.com
transformPartitionBoundValue went out of its way to do the wrong
thing: there is no reason to complain about a non-matching COLLATE
clause in a partition boundary expression. We're coercing the
bound expression to the target column type as though by an
implicit assignment, and the rules for implicit assignment say
that collations can be implicitly converted.
What we *do* need to do, and the code is not doing, is apply
assign_expr_collations() to the bound expression. While this is
merely a definition disagreement, that is a bug that needs to be
back-patched, so I'll commit it separately.
Discussion: https://postgr.es/m/CAJV4CdrZ5mKuaEsRSbLf2URQ3h6iMtKD=hik8MaF5WwdmC9uZw@mail.gmail.com
SQL operations such as CURRENT_DATE, CURRENT_TIME, LOCALTIME, and
conversion of "now" in a datetime input string have to obtain the
transaction start timestamp ("now()") as a broken-down struct pg_tm.
This is a remarkably expensive conversion, and since now() does not
change intra-transaction, it doesn't really need to be done more than
once per transaction. Introducing a simple cache provides visible
speedups in queries that compute these values many times, for example
insertion of many rows that use a default value of CURRENT_DATE.
Peter Smith, with a bit of kibitzing by me
Discussion: https://postgr.es/m/CAHut+Pu89TWjq530V2gY5O6SWi=OEJMQ_VHMt8bdZB_9JFna5A@mail.gmail.com
When commit bd3daddaf introduced AlternativeSubPlans, I had some
ambitions towards allowing the choice of subplan to change during
execution. That has not happened, or even been thought about, in the
ensuing twelve years; so it seems like a failed experiment. So let's
rip that out and resolve the choice of subplan at the end of planning
(in setrefs.c) rather than during executor startup. This has a number
of positive benefits:
* Removal of a few hundred lines of executor code, since
AlternativeSubPlans need no longer be supported there.
* Removal of executor-startup overhead (particularly, initialization
of subplans that won't be used).
* Removal of incidental costs of having a larger plan tree, such as
tree-scanning and copying costs in the plancache; not to mention
setrefs.c's own costs of processing the discarded subplans.
* EXPLAIN no longer has to print a weird (and undocumented)
representation of an AlternativeSubPlan choice; it sees only the
subplan actually used. This should mean less confusion for users.
* Since setrefs.c knows which subexpression of a plan node it's
working on at any instant, it's possible to adjust the estimated
number of executions of the subplan based on that. For example,
we should usually estimate more executions of a qual expression
than a targetlist expression. The implementation used here is
pretty simplistic, because we don't want to expend a lot of cycles
on the issue; but it's better than ignoring the point entirely,
as the executor had to.
That last point might possibly result in shifting the choice
between hashed and non-hashed EXISTS subplans in a few cases,
but in general this patch isn't meant to change planner choices.
Since we're doing the resolution so late, it's really impossible
to change any plan choices outside the AlternativeSubPlan itself.
Patch by me; thanks to David Rowley for review.
Discussion: https://postgr.es/m/1992952.1592785225@sss.pgh.pa.us
This function leaked some memory while loading qual clauses for
an RLS policy. While ordinarily negligible, that could build up
in some repeated-reload cases, as reported by Konstantin Knizhnik.
We can improve matters by borrowing the coding long used in
RelationBuildRuleLock: build stringToNode's result directly in
the target context, and remember to explicitly pfree the
input string.
This patch by no means completely guarantees zero leaks within
this function, since we have no real guarantee that the catalog-
reading subroutines it calls don't leak anything. However,
practical tests suggest that this is enough to resolve the issue.
In any case, any remaining leaks are similar to those risked by
RelationBuildRuleLock and other relcache-loading subroutines.
If we need to fix them, we should adopt a more global approach
such as that used by the RECOVER_RELATION_BUILD_MEMORY hack.
While here, let's remove the need for an expensive PG_TRY block by
using MemoryContextSetParent to reparent an initially-short-lived
context for the RLS data.
Back-patch to all supported branches.
Discussion: https://postgr.es/m/21356c12-8917-8249-b35f-1c447231922b@postgrespro.ru
Commit 464824323e changed the logical replication protocol to allow the
streaming of in-progress transactions and used the new version of protocol
irrespective of the server version. Use the appropriate version of the
protocol based on the server version.
Reported-by: Ashutosh Sharma
Author: Dilip Kumar
Reviewed-by: Ashutosh Sharma and Amit Kapila
Discussion: https://postgr.es/m/CAE9k0P=9OpXcNrcU5Gsvd5MZ8GFpiN833vNHzX6Uc=8+h1ft1Q@mail.gmail.com
Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery. Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba. This can cause a significant improvement to
recovery performance, especially when it's otherwise CPU-bound.
Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool. Rearrange things so that data collected in CheckpointStats
includes SLRU activity.
Also remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions,
because they were redundant after the shutdown checkpoint that
immediately precedes them. (I'm not sure if they were ever needed, but
they aren't now.)
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (parts)
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Discussion: https://postgr.es/m/CA+hUKGLJ=84YT+NvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ@mail.gmail.com
The previous coding was confused about whether head_timestamp was
intended to represent the timestamp for the newest bucket in the
mapping or the oldest timestamp for the oldest bucket in the mapping.
Decide that it's intended to be the oldest one, and repair
accordingly.
To do that, we need to do two things. First, when advancing to a
new bucket, don't categorically set head_timestamp to the new
timestamp. Do this only if we're blowing out the map completely
because a lot of time has passed since we last maintained it. If
we're replacing entries one by one, advance head_timestamp by
1 minute for each; if we're filling in unused entries, don't
advance head_timestamp at all.
Second, fix the computation of how many buckets we need to advance.
The previous formula would be correct if head_timestamp were the
timestamp for the new bucket, but we're now making all the code
agree that it's the timestamp for the oldest bucket, so adjust the
formula accordingly.
This is certainly a bug fix, but I don't feel good about
back-patching it without the introspection tools added by commit
aecf5ee2bb, and perhaps also some
actual tests. Since back-patching the introspection tools might
not attract sufficient support and since there are no automated
tests of these fixes yet, I'm just committing this to master for
now.
Patch by me, reviewed by Thomas Munro, Dilip Kumar, Hamid Akhtar.
Discussion: http://postgr.es/m/CA+TgmoY=aqf0zjTD+3dUWYkgMiNDegDLFjo+6ze=Wtpik+3XqA@mail.gmail.com
Existing code used various inconsistent ways to printf struct stat's
st_size member. The type of that is off_t, which is in most cases a
signed 64-bit integer, so use the long long int format for it.
This makes it possible for code outside snapmgr.c to examine the
contents of this data structure. This commit does not add any code
which actually does so; a subsequent commit will make that change.
Patch by me, reviewed by Thomas Munro, Dilip Kumar, Hamid Akhtar.
Discussion: http://postgr.es/m/CA+TgmoY=aqf0zjTD+3dUWYkgMiNDegDLFjo+6ze=Wtpik+3XqA@mail.gmail.com
Commit fbeb9da22, which added the tsearch_readline APIs, left
t_readline() in place as a compatibility measure. But that function
has been unused and deprecated for twelve years now, so that seems
like enough time to remove it. Doing so, and merging t_readline's
code into tsearch_readline, aids in making several useful
improvements:
* The hard-wired 4K limit on line length in tsearch data files is
removed, by using a StringInfo buffer instead of a fixed-size buffer.
* We can buy back the per-line palloc/pfree added by 3ea7e9550
in the common case where encoding conversion is not required.
* We no longer need a separate pg_verify_mbstr call, as that
functionality was folded into encoding conversion some time ago.
(We could have done some of this stuff while keeping t_readline as a
separate API, but there seems little point, since there's no reason
for anyone to still be using t_readline directly.)
Discussion: https://postgr.es/m/48A4FA71-524E-41B9-953A-FD04EF36E2E7@yesql.se
Harmonize behavior by moving reponsibility for fsyncing directories down
into slru.c. In 10 and later, only the multixact directories were
missed (see commit 1b02be21), and in older branches all SLRUs were
missed.
Back-patch to all supported releases.
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CA%2BhUKGLtsTUOScnNoSMZ-2ZLv%2BwGh01J6kAo_DM8mTRq1sKdSQ%40mail.gmail.com
We failed to pass down the query string to check_new_partition_bound,
so that its attempts to provide error cursor positions were for naught;
one must have the query string to get parser_errposition to do anything.
Adjust its API to require a ParseState to be passed down.
Also, improve the logic inside check_new_partition_bound so that the
cursor points at the partition bound for the specific column causing
the issue, when one can be identified.
That part is also for naught if we can't determine the query position of
the column with the problem. Improve transformPartitionBoundValue so
that it makes sure that const-simplified partition expressions will be
properly labeled with positions. In passing, skip calling evaluate_expr
if the value is already a Const, which is surely the most common case.
Alexandra Wang, Ashwin Agrawal, Amit Langote; reviewed by Ashutosh Bapat
Discussion: https://postgr.es/m/CACiyaSopZoqssfMzgHk6fAkp01cL6vnqBdmTw2C5_KJaFR_aMg@mail.gmail.com
Discussion: https://postgr.es/m/CAJV4CdrZ5mKuaEsRSbLf2URQ3h6iMtKD=hik8MaF5WwdmC9uZw@mail.gmail.com
tsearch_readline() saves the string pointer it returns to the caller
for possible use in the associated error context callback. However,
the caller will usually pfree that string sometime before it next
calls tsearch_readline(), so that there is a window where an ereport
will try to print an already-freed string.
The built-in users of tsearch_readline() happen to all do that pfree
at the bottoms of their loops, so that the window is effectively
empty for them. However, this is not documented as a requirement,
and contrib/dict_xsyn doesn't do it like that, so it seems likely
that third-party dictionaries might have live bugs here.
The practical consequences of this seem pretty limited in any case,
since production builds wouldn't clobber the freed string immediately,
besides which you'd not expect syntax errors in dictionary files
being used in production. Still, it's clearly a bug waiting to bite
somebody.
Fix by pstrdup'ing the string to be saved for the error callback,
and then pfree'ing it next time through. It's been like this for
a long time, so back-patch to all supported branches.
Discussion: https://postgr.es/m/48A4FA71-524E-41B9-953A-FD04EF36E2E7@yesql.se
The existing message about "a column definition list is only allowed for
functions returning "record"" could be given in some cases where it was
fairly confusing; in particular, a function with multiple OUT parameters
*does* return record according to pg_proc. Break it down into a couple
more cases to deliver a more on-point complaint. Per complaint from
Bruce Momjian.
Discussion: https://postgr.es/m/798909.1600562993@sss.pgh.pa.us
This completes the project of making all our derived files be
pgindent-clean (or else explicitly excluded from indentation),
so that no surprises result when running pgindent in a built-out
development tree.
Discussion: https://postgr.es/m/79ed5348-be7a-b647-dd40-742207186a22@2ndquadrant.com
Since we're bypassing the buffer manager, we need to call
PageSetChecksumInplace() directly. As reported by Justin Pryzby.
In the passing, add RelationOpenSmgr() calls before all smgrwrite() and
smgrextend() calls. Tom added one before the first smgrextend() call in
commit c2bb287025, which seems to be enough, but let's play it safe and
do it before each one. That's how it's done in the similar code in
nbtsort.c, too.
Discussion: https://www.postgresql.org/message-id/20200920224446.GF30557@telsasoft.com
It's no longer necessary to assign explicit precedences to GENERATED,
NULL_P, PRESERVE, or STRIP_P.
Actually, we don't need to assign precedence to IDENT either; that was
really just there to govern the behavior of target_el's "a_expr IDENT"
production, which no longer ends with that terminal. However, it seems
like a good idea to continue to do so, because it provides a reference
point for a precedence level that we can assign to other unreserved
keywords that lack a natural precedence level.
Research by Peter Eisentraut and John Naylor; comment rewrite by me.
Discussion: https://postgr.es/m/38ca86db-42ab-9b48-2902-337a0d6b8311@2ndquadrant.com
Commit be0a6666 left behind a comment about the order of some tests that
didn't make sense without the expensive division, and in fact we might
as well change the order to one that fails more cheaply most of the time
as a micro-optimization. Also, remove the "+ 1" applied to max_bucket,
to drop an instruction and match the original behavior. Per review
from Tom Lane.
Discussion: https://postgr.es/m/VI1PR0701MB696044FC35013A96FECC7AC8F62D0%40VI1PR0701MB6960.eurprd07.prod.outlook.com
Since ancient times we have had support for a fill factor (maximum load
factor) to be set for a dynahash hash table, but:
1. It was an integer, whereas for in-memory hash tables interesting
load factor targets are probably somewhere near the 0.75-1.0 range.
2. It was implemented in a way that performed an expensive division
operation that regularly showed up in profiles.
3. We are not aware of anyone ever having used a non-default value.
Therefore, remove support, effectively fixing it at 1.
Author: Jakub Wartak <Jakub.Wartak@tomtom.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/VI1PR0701MB696044FC35013A96FECC7AC8F62D0%40VI1PR0701MB6960.eurprd07.prod.outlook.com
Up to now, if you tried to omit "AS" before a column label in a SELECT
list, it would only work if the column label was an IDENT, that is not
any known keyword. This is rather unfriendly considering that we have
so many keywords and are constantly growing more. In the wake of commit
1ed6b8956 it's possible to improve matters quite a bit.
We'd originally tried to make this work by having some of the existing
keyword categories be allowed without AS, but that didn't work too well,
because each category contains a few special cases that don't work
without AS. Instead, invent an entirely orthogonal keyword property
"can be bare column label", and mark all keywords that way for which
we don't get shift/reduce errors by doing so.
It turns out that of our 450 current keywords, all but 39 can be made
bare column labels, improving the situation by over 90%. This number
might move around a little depending on future grammar work, but it's
a pretty nice improvement.
Mark Dilger, based on work by myself and Robert Haas;
review by John Naylor
Discussion: https://postgr.es/m/38ca86db-42ab-9b48-2902-337a0d6b8311@2ndquadrant.com
After commits 85f6b49c2c and 3ba59ccc89, we can allow parallel inserts
which was earlier not possible as parallel group members won't conflict
for relation extension and page lock. In those commits, we forgot to
update comments at few places.
Author: Amit Kapila
Reviewed-by: Robert Haas and Dilip Kumar
Backpatch-through: 13
Discussion: https://postgr.es/m/CAFiTN-tMrQh5FFMPx5aWJ+1gi1H6JxktEhq5mDwCHgnEO5oBkA@mail.gmail.com
This feature has been a thorn in our sides for a long time, causing
many grammatical ambiguity problems. It doesn't seem worth the
pain to continue to support it, so remove it.
There are some follow-on improvements we can make in the grammar,
but this commit only removes the bare minimum number of productions,
plus assorted backend support code.
Note that pg_dump and psql continue to have full support, since
they may be used against older servers. However, pg_dump warns
about postfix operators. There is also a check in pg_upgrade.
Documentation-wise, I (tgl) largely removed the "left unary"
terminology in favor of saying "prefix operator", which is
a more standard and IMO less confusing term.
I included a catversion bump, although no initial catalog data
changes here, to mark the boundary at which oprkind = 'r'
stopped being valid in pg_operator.
Mark Dilger, based on work by myself and Robert Haas;
review by John Naylor
Discussion: https://postgr.es/m/38ca86db-42ab-9b48-2902-337a0d6b8311@2ndquadrant.com
For parallel btree scan to work for array of scan keys, it should reach
BTPARALLEL_DONE state once for every distinct combination of array keys.
This is required to ensure that the parallel workers don't try to seize
blocks at the same time for different scan keys. We missed to update this
state when we discovered that the scan keys can't be satisfied.
Author: James Hunter
Reviewed-by: Amit Kapila
Tested-by: Justin Pryzby
Backpatch-through: 10, where it was introduced
Discussion: https://postgr.es/m/4248CABC-25E3-4809-B4D0-128E1BAABC3C@amazon.com
In the particular case of GRANTED BY, this is specified in the SQL
standard. Since in PostgreSQL, CURRENT_ROLE is equivalent to
CURRENT_USER, and CURRENT_USER is already supported here, adding
CURRENT_ROLE is trivial. The other cases are PostgreSQL extensions,
but for the same reason it also makes sense there.
Reviewed-by: Vik Fearing <vik@postgresfriends.org>
Reviewed-by: Asif Rehman <asifr.rehman@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/f2feac44-b4c5-f38f-3699-2851d6a76dc9%402ndquadrant.com
This adds a new optional support function to the GiST access method:
sortsupport. If it is defined, the GiST index is built by sorting all data
to the order defined by the sortsupport's comparator function, and packing
the tuples in that order to GiST pages. This is similar to how B-tree
index build works, and is much faster than inserting the tuples one by
one. The resulting index is smaller too, because the pages are packed more
tightly, upto 'fillfactor'. The normal build method works by splitting
pages, which tends to lead to more wasted space.
The quality of the resulting index depends on how good the opclass-defined
sort order is. A good order preserves locality of the input data.
As the first user of this facility, add 'sortsupport' function to the
point_ops opclass. It sorts the points in Z-order (aka Morton Code), by
interleaving the bits of the X and Y coordinates.
Author: Andrey Borodin
Reviewed-by: Pavel Borisov, Thomas Munro
Discussion: https://www.postgresql.org/message-id/1A36620E-CAD8-4267-9067-FB31385E7C0D%40yandex-team.ru
Because the code path taken for SQL commands executed in a walsender
will update the process title, we pretty much have to update the
title for replication commands as well. Otherwise, the title shows
"idle" for the rest of a logical walsender's lifetime once it's
executed any SQL command.
Playing with this, I confirm that a walsender now typically spends
most of its life reporting
walsender postgres [local] START_REPLICATION
Considering this in isolation, it might be better to have it say
walsender postgres [local] sending replication data
However, consistency with the other cases seems to be a stronger
argument.
In passing, remove duplicative pgstat_report_activity call.
Discussion: https://postgr.es/m/880181.1600026471@sss.pgh.pa.us
Since commit fd5942c18f (2012, 9.3-era), walsender has been sending
completion tags for certain replication commands twice -- and they're
not even consistent. Apparently neither libpq nor JDBC have a problem
with it, but it's not kosher. Fix by remove the EndCommand() call in
the common code path for them all, and inserting specific calls to
EndReplicationCommand() specifically in those places where it's needed.
EndReplicationCommand() is a new simple function to send the completion
tag for replication commands. Do this instead of sending a generic
SELECT completion tag for them all, which was also pretty bogus (if
innocuous). While at it, change StartReplication() to use
EndReplicationCommand() instead of pg_puttextmessage().
In commit 2f9661311b, I failed to realize that replication commands
are not close-enough kin of regular SQL commands, so the
DROP_REPLICATION_SLOT tag I added is undeserved and a type pun. Take it
out.
Backpatch to 13, where the latter commit appeared. The duplicate tag
has been sent since 9.3, but since nothing is broken, it doesn't seem
worth fixing.
Per complaints from Tom Lane.
Discussion: https://postgr.es/m/1347966.1600195735@sss.pgh.pa.us
We decided that the policy established in commit 7634bd4f6 for
the bgwriter, checkpointer, walwriter, and walreceiver processes,
namely that they should accept SIGQUIT at all times, really ought
to apply uniformly to all postmaster children. Therefore, get
rid of the duplicative and inconsistent per-process code for
establishing that signal handler and removing SIGQUIT from BlockSig.
Instead, make InitPostmasterChild do it.
The handler set up by InitPostmasterChild is SignalHandlerForCrashExit,
which just summarily does _exit(2). In interactive backends, we
almost immediately replace that with quickdie, since we would prefer
to try to tell the client that we're dying. However, this patch is
changing the behavior of autovacuum (both launcher and workers), as
well as walsenders. Those processes formerly also used quickdie,
but AFAICS that was just mindless copy-and-paste: they don't have
any interactive client that's likely to benefit from being told this.
The stats collector continues to be an outlier, in that it thinks
SIGQUIT means normal exit. That should probably be changed for
consistency, but there's another patch set where that's being
dealt with, so I didn't do so here.
Discussion: https://postgr.es/m/644875.1599933441@sss.pgh.pa.us
Since there is only one place that actually needs the partition check
expression, namely ExecPartitionCheck, it's better to fetch it from
the relcache there. In this way we will never fetch it at all if
the query never has use for it, and we still fetch it just once when
we do need it.
The reason for taking an interest in this is that if the relcache
doesn't already have the check expression cached, fetching it
requires obtaining AccessShareLock on the partition root. That
means that operations that look like they should only touch the
partition itself will also take a lock on the root. In particular
we observed that TRUNCATE on a partition may take a lock on the
partition's root, contributing to a deadlock situation in parallel
pg_restore.
As written, this patch does have a small cost, which is that we
are microscopically reducing efficiency for the case where a partition
has an empty check expression. ExecPartitionCheck will be called,
and will go through the motions of setting up and checking an empty
qual, where before it would not have been called at all. We could
avoid that by adding a separate boolean flag to track whether there
is a partition expression to test. However, this case only arises
for a default partition with no siblings, which surely is not an
interesting case in practice. Hence adding complexity for it
does not seem like a good trade-off.
Amit Langote, per a suggestion by me
Discussion: https://postgr.es/m/VI1PR03MB31670CA1BD9625C3A8C5DD05EB230@VI1PR03MB3167.eurprd03.prod.outlook.com
If a partitioned table's column is already marked NOT NULL, there is
no need to examine its partitions, because we can rely on previous
DDL to have enforced that the child columns are NOT NULL as well.
(Unfortunately, the same cannot be said for traditional inheritance,
so for now we have to restrict the optimization to partitioned tables.)
Hence, we may skip recursing to child tables in this situation.
The reason this case is worth worrying about is that when pg_dump dumps
a partitioned table having a primary key, it will include the requisite
NOT NULL markings in the CREATE TABLE commands, and then add the
primary key as a separate step. The primary key addition generates a
SET NOT NULL as a subcommand, just to be sure. So the situation where
a SET NOT NULL is redundant does arise in the real world.
Skipping the recursion does more than just save a few cycles: it means
that a command such as "ALTER TABLE ONLY partition_parent ADD PRIMARY
KEY" will take locks only on the partition parent table, not on the
partitions. It turns out that parallel pg_restore is effectively
assuming that that's true, and has little choice but to do so because
the dependencies listed for such a TOC entry don't include the
partitions. pg_restore could thus issue this ALTER while data restores
on the partitions are still in progress. Taking unnecessary locks on
the partitions not only hurts concurrency, but can lead to actual
deadlock failures, as reported by Domagoj Smoljanovic.
(A contributing factor in the deadlock is that TRUNCATE on a child
partition wants a non-exclusive lock on the parent. This seems
likewise unnecessary, but the fix for it is more invasive so we
won't consider back-patching it. Fortunately, getting rid of one
of these two poor behaviors is enough to remove the deadlock.)
Although support for partitioned primary keys came in with v11,
this patch is dependent on the SET NOT NULL refactoring done by
commit f4a3fdfbd, so we can only patch back to v12.
Patch by me; thanks to Alvaro Herrera and Amit Langote for review.
Discussion: https://postgr.es/m/VI1PR03MB31670CA1BD9625C3A8C5DD05EB230@VI1PR03MB3167.eurprd03.prod.outlook.com
The code recorded cache invalidation events by zeroing the "localreloid"
field of affected cache entries. However, it's possible for an inval
event to occur even while we have the entry open and locked. So an
ill-timed inval could result in "cache lookup failed for relation 0"
errors, if the worker's code tried to use the cleared field. We can
fix that by creating a separate bool field to record whether the entry
needs to be revalidated. (In the back branches, cram the bool into
what had been padding space, to avoid an ABI break in the somewhat
unlikely event that any extension is looking at this struct.)
Also, rearrange the logic in logicalrep_rel_open so that it
does the right thing in cases where table_open would fail.
We should retry the lookup by name in that case, but we didn't.
The real-world impact of this is probably small. In the first place,
the error conditions are very low probability, and in the second place,
the worker would just exit and get restarted. We only noticed because
in a CLOBBER_CACHE_ALWAYS build, the failure can occur repeatedly,
preventing the worker from making progress. Nonetheless, it's clearly
a bug, and it impedes a useful type of testing; so back-patch to v10
where this code was introduced.
Discussion: https://postgr.es/m/1032727.1600096803@sss.pgh.pa.us
Previously, it was based on nBlocksAllocated to account for tapes with
open write buffers that may not have made it to the BufFile yet.
That was unnecessary, because callers do not need to get the number of
blocks while a tape has an open write buffer; and it also conflicted
with the preallocation logic added for HashAgg.
Reviewed-by: Peter Geoghegan
Discussion: https://postgr.es/m/ce5af05900fdbd0e9185747825a7423c48501964.camel@j-davis.com
Backpatch-through: 13
In commit 464824323e, for each RelationSyncEntry we maintained the list
of xids (streamed_txns) for which we have already sent the schema. This
helps us to track when to send the schema to the downstream node for
replication of streaming transactions. Before this list got initialized,
we were processing invalidation messages which access this list and led
to an assertion failure.
In passing, clean up the nearby code:
* Initialize the list of xids with NIL instead of NULL which is our usual
coding practice.
* Remove the MemoryContext switch for creating a RelationSyncEntry in dynahash.
Diagnosed-by: Amit Kapila and Tom Lane
Author: Amit Kapila
Reviewed-by: Tom Lane and Dilip Kumar
Discussion: https://postgr.es/m/904373.1600033123@sss.pgh.pa.us
This function could often be seen in profiles of vacuum and could often
be a significant bottleneck during recovery. The problem was that a qsort
was performed in order to sort an array of item pointers in reverse offset
order so that we could use that to safely move tuples up to the end of the
page without overwriting the memory of yet-to-be-moved tuples. i.e. we
used to compact the page starting at the back of the page and move towards
the front. The qsort that this required could be expensive for pages with
a large number of tuples.
In this commit, we take another approach to tuple compactification.
Now, instead of sorting the remaining item pointers array we first check
if the array is presorted and only memmove() the tuples that need to be
moved. This presorted check can be done very cheaply in the calling
functions when the array is being populated. This presorted case is very
fast.
When the item pointer array is not presorted we must copy tuples that need
to be moved into a temp buffer before copying them back into the page
again. This differs from what we used to do here as we're now copying the
tuples back into the page in reverse line pointer order. Previously we
left the existing order alone. Reordering the tuples results in an
increased likelihood of hitting the pre-sorted case the next time around.
Any newly added tuple which consumes a new line pointer will also maintain
the correct sort order of tuples in the page which will also result in the
presorted case being hit the next time. Only consuming an unused line
pointer can cause the order of tuples to go out again, but that will be
corrected next time the function is called for the page.
Benchmarks have shown that the non-presorted case is at least equally as
fast as the original qsort method even when the page just has a few
tuples. As the number of tuples becomes larger the new method maintains
its performance whereas the original qsort method became much slower when
the number of tuples on the page became large.
Author: David Rowley
Reviewed-by: Thomas Munro
Tested-by: Jakub Wartak
Discussion: https://postgr.es/m/CA+hUKGKMQFVpjr106gRhwk6R-nXv0qOcTreZuQzxgpHESAL6dw@mail.gmail.com
ALTER TABLE commands in an extension script are added to an event
trigger command list; but starting with commit b5810de3f4 they do so in
a memory context that's too short-lived, so when execution ends and time
comes to use the entries, they've already been freed.
(This would also be a problem with ALTER TABLE commands in a
multi-command query string, but these serendipitously end in
PortalContext -- which probably explains why it took so long for this to
be reported.)
Fix by using the memory context specifically set for that, instead.
Backpatch to 13, where the aforementioned commit appeared.
Reported-by: Philippe Beaudoin
Author: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Discussion: https://postgr.es/m/20200902193715.6e0269d4@firost
Introduced in 0aa8f7640.
MSVC warned about performing 32-bit bit shifting when it appeared like we
might like a 64-bit result. We did, but it just so happened that none of
the calls to this function could have caused the 32-bit shift to overflow.
Here we just cast the constant to int64 to make the compiler happy.
Discussion: https://postgr.es/m/CAApHDvofA_vsrpC13mq_hZyuye5B-ssKEaer04OouXYCO5-uXQ@mail.gmail.com
A walsender process that has executed a SQL command left the text of
that command in pg_stat_activity.query indefinitely, which is quite
confusing if it's in RUNNING state but not doing that query. An easy
and useful fix is to treat replication commands as if they were SQL
queries, and show them in pg_stat_activity according to the same rules
as for regular queries. While we're at it, it seems also sensible to
set debug_query_string, allowing error logging and debugging to see
the replication command.
While here, clean up assorted silliness in exec_replication_command:
* The SQLCmd path failed to restore CurrentMemoryContext to the caller's
value, and failed to delete the temp context created in this routine.
It's only through great good fortune that these oversights did not
result in long-term memory leaks or other problems. It seems cleaner
to code SQLCmd as a separate early-exit path, so do it like that.
* Remove useless duplicate call of SnapBuildClearExportedSnapshot().
* replication_scanner_finish() was never called.
None of those things are significant enough to merit a backpatch,
so this is for HEAD only.
Discussion: https://postgr.es/m/880181.1600026471@sss.pgh.pa.us
3c84046 is the original commit that introduced index_set_state_flags(),
where the presence of SnapshotNow made necessary the use of an in-place
update. SnapshotNow has been removed in 813fb03, so there is no actual
reasons to not make this operation transactional.
Note that while making the operation more robust, using a transactional
operation in this routine was not strictly necessary as there was no use
case for it yet. However, some future features are going to need a
transactional behavior, like support for CREATE/DROP INDEX CONCURRENTLY
with partitioned tables, where indexes in a partition tree need to have
all their pg_index.indis* flags updated in the same transaction to make
the operation stable to the end-user by keeping partition trees
consistent, even with a failure mid-flight.
REINDEX CONCURRENTLY uses already transactional updates when swapping
the old and new indexes, making this change more consistent with the
index-swapping logic.
Author: Michael Paquier
Reviewed-by: Anastasia Lubennikova
Discussion: https://postgr.es/m/20200903080440.GA8559@paquier.xyz
transformCreateStmt() adjusts the transformed statement's RangeVar
to specify the target schema explicitly, for the express reason
of making sure that auxiliary statements derived by parse
transformation operate on the right table. But the refactoring
I did in commit 502898192 got this wrong and passed the untransformed
RangeVar to expandTableLikeClause(). This could lead to assertion
failures or weird misbehavior if the wrong table was accessed.
Per report from Alexander Lakhin. Like the previous patch, back-patch
to all supported branches.
Discussion: https://postgr.es/m/05051f9d-b32b-cb35-6735-0e9f2ab86b5f@gmail.com
We use the timestamp of the global statfile if we are not able to
determine it for a particular database in case the entry for that database
doesn't exist. However, we were using it even when the statfile is
corrupt.
As there is no user reported issue and it is not clear if there is any
impact of this on actual application so decided not to backpatch.
Reported-by: Amit Kapila
Author: Amit Kapila
Reviewed-by: Sawada Masahiko, Magnus Hagander and Alvaro Herrera
Discussion: https://postgr.es/m/CAA4eK1J3oTJKyVq6v7K4d3jD+vtnruG9fHRib6UuWWsrwAR6Aw@mail.gmail.com
The bgwriter, checkpointer, walwriter, and walreceiver processes
claimed to allow SIGQUIT "at all times". In reality SIGQUIT
would get re-blocked during error recovery, because we didn't
update the actual signal mask immediately, so sigsetjmp() would
save and reinstate a mask that includes SIGQUIT.
This appears to be simply a coding oversight. There's never a
good reason to hold off SIGQUIT in these processes, because it's
going to just call _exit(2) which should be safe enough, especially
since the postmaster is going to tear down shared memory afterwards.
Hence, stick in PG_SETMASK() calls to install the modified BlockSig
mask immediately.
Also try to improve the comments around sigsetjmp blocks. Most of
them were just referencing postgres.c, which is misleading because
actually postgres.c manages the signals differently.
No back-patch, since there's no evidence that this is causing any
problems in the field.
Discussion: https://postgr.es/m/CALDaNm1d1hHPZUg3xU4XjtWBOLCrA+-2cJcLpw-cePZ=GgDVfA@mail.gmail.com
Currently, no useful trace is left in the logs when the postmaster
is forced to use SIGKILL to shut down children that failed to respond
to SIGQUIT. Some questions were raised about how often that scenario
happens in the buildfarm, so let's add a LOG-level message showing
that it happened.
Discussion: https://postgr.es/m/1850884.1599601164@sss.pgh.pa.us
Although 58c6feccf fixed the case for SIGQUIT, we were still calling
proc_exit() from signal handlers for SIGTERM and timeout failures in
ProcessStartupPacket. Fortunately, at the point where that code runs,
we haven't yet connected to shared memory in any meaningful way, so
there is nothing we need to undo in shared memory. This means it
should be safe to use _exit(1) here, ie, not run any atexit handlers
but also inform the postmaster that it's not a crash exit.
To make sure nobody breaks the "nothing to undo" expectation, add
a cross-check that no on-shmem-exit or before-shmem-exit handlers
have been registered yet when we finish using these signal handlers.
This change is simple enough that maybe it could be back-patched,
but I won't risk that right now.
Discussion: https://postgr.es/m/1850884.1599601164@sss.pgh.pa.us
Bring the signal handling for startup-packet collection into line
with the policy established in commits bedadc732 and 8e19a8264,
namely don't risk running atexit callbacks when handling SIGQUIT.
Ideally, we'd not do so for SIGTERM or timeout interrupts either,
but that change seems a bit too risky for the back branches.
For now, just improve the comments in this area to describe the risk.
Also relocate where BackendInitialize re-disables these interrupts,
to minimize the code span where they're active. This doesn't buy
a whole lot of safety, but it can't hurt.
In passing, rename startup_die() to remove confusion about whether
it is for the startup process.
Like the previous commits, back-patch to all supported branches.
Discussion: https://postgr.es/m/1850884.1599601164@sss.pgh.pa.us
Move applicable code out of RelationBuildDesc(), which nailed relations
bypass. Non-assert builds experienced no known problems. Back-patch to
v13, where commit c6b92041d3 introduced
rd_firstRelfilenodeSubid.
Kyotaro Horiguchi. Reported by Justin Pryzby.
Discussion: https://postgr.es/m/20200907023737.GA7158@telsasoft.com
Commit 8e19a8264 changed the SIGQUIT handlers of almost all server
processes not to run atexit callbacks. The archiver process was
skipped, perhaps because it's not connected to shared memory; but
it's just as true here that running atexit callbacks in a signal
handler is unsafe. So let's make it work like the rest.
In HEAD and v13, we can use the common SignalHandlerForCrashExit
handler. Before that, just tweak pgarch_exit to use _exit(2)
explicitly.
Like the previous commit, back-patch to all supported branches.
Kyotaro Horiguchi, back-patching by me
Discussion: https://postgr.es/m/1850884.1599601164@sss.pgh.pa.us
Partitioning tuple route code assumes that the partition chosen while
descending the partition hierarchy is always the correct one. This is
true except when the partition is the default partition and another
partition has been added concurrently: the partition constraint changes
and we don't recheck it. This can lead to tuples mistakenly being added
to the default partition that should have been rejected.
Fix by rechecking the default partition constraint while descending the
hierarchy.
An isolation test based on the reproduction steps described by Hao Wu
(with tweaks for extra coverage) is included.
Backpatch to 12, where this bug came in with 898e5e3290.
Reported by: Hao Wu <hawu@vmware.com>
Author: Amit Langote <amitlangote09@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/CA+HiwqFqBmcSSap4sFnCBUEL_VfOMmEKaQ3gwUhyfa4c7J_-nA@mail.gmail.com
Discussion: https://postgr.es/m/DM5PR0501MB3910E97A9EDFB4C775CF3D75A42F0@DM5PR0501MB3910.namprd05.prod.outlook.com
Historically, cancel_before_shmem_exit() just silently did nothing
if the specified callback wasn't the top-of-stack. The folly of
ignoring this case was exposed by the bugs fixed in 303640199 and
bab150045, so let's make it throw elog(ERROR) instead.
There is a decent argument to be made that PG_ENSURE_ERROR_CLEANUP
should use some separate infrastructure, so it wouldn't break if
something inside the guarded code decides to register a new
before_shmem_exit callback. However, a survey of the surviving
uses of before_shmem_exit() and PG_ENSURE_ERROR_CLEANUP doesn't
show any plausible conflicts of that sort today, so for now we'll
forgo the extra complexity. (It will almost certainly become
necessary if anyone ever wants to wrap PG_ENSURE_ERROR_CLEANUP
around arbitrary user-defined actions, though.)
No backpatch, since this is developer support not a production issue.
Bharath Rupireddy, per advice from Andres Freund, Robert Haas, and myself
Discussion: https://postgr.es/m/CALj2ACWk7j4F2v2fxxYfrroOF=AdFNPr1WsV+AGtHAFQOqm_pw@mail.gmail.com
The problem is caused by me (Andres) having ProcSleep() look at the
wrong PGPROC entry in 5788e258bb.
Unfortunately it seems hard to write a reliable test for autovacuum
cancellations. Perhaps somebody will come up with a good approach, but
it seems worth fixing the issue even without a test.
Reported-By: Jeff Janes <jeff.janes@gmail.com>
Author: Jeff Janes <jeff.janes@gmail.com>
Discussion: https://postgr.es/m/CAMkU=1wH2aUy+wDRDz+5RZALdcUnEofV1t9PzXS_gBJO9vZZ0Q@mail.gmail.com
This essentially reverts a micro-optimization I made years ago,
as part of the much larger commit d72f6c750. It's doubtful
that there was any hard evidence for it being helpful even then,
and the case is even more dubious now that modern compilers
are so much smarter about inlining memset().
The proximate reason for undoing it is to get rid of the type punning
inherent in MemSet, for fear that that may cause problems now that
we're applying additional optimization switches to numeric.c.
At the very least this'll silence some warnings from a few old
buildfarm animals.
(It's probably past time for another look at whether MemSet is still
worth anything at all, but I do not propose to tackle that question
right now.)
Discussion: https://postgr.es/m/CAJ3gD9evtA_vBo+WMYMyT-u=keHX7-r8p2w7OSRfXf42LTwCZQ@mail.gmail.com
Until now, REINDEX was not able to work with partitioned tables and
indexes, forcing users to reindex partitions one by one. This extends
REINDEX INDEX and REINDEX TABLE so as they can accept a partitioned
index and table in input, respectively, to reindex all the partitions
assigned to them with physical storage (foreign tables, partitioned
tables and indexes are then discarded).
This shares some logic with schema and database REINDEX as each
partition gets processed in its own transaction after building a list of
relations to work on. This choice has the advantage to minimize the
number of invalid indexes to one partition with REINDEX CONCURRENTLY in
the event a cancellation or failure in-flight, as the only indexes
handled at once in a single REINDEX CONCURRENTLY loop are the ones from
the partition being working on.
Isolation tests are added to emulate some cases I bumped into while
developing this feature, particularly with the concurrent drop of a
leaf partition reindexed. However, this is rather limited as LOCK would
cause REINDEX to block in the first transaction building the list of
partitions.
Per its multi-transaction nature, this new flavor cannot run in a
transaction block, similarly to REINDEX SCHEMA, SYSTEM and DATABASE.
Author: Justin Pryzby, Michael Paquier
Reviewed-by: Anastasia Lubennikova
Discussion: https://postgr.es/m/db12e897-73ff-467e-94cb-4af03705435f.adger.lj@alibaba-inc.com
Tomas Vondra observed that the IO behavior for HashAgg tends to be
worse than for Sort. Penalize HashAgg IO costs accordingly.
Also, account for the CPU effort of spilling the tuples and reading
them back.
Discussion: https://postgr.es/m/20200906212112.nzoy5ytrzjjodpfh@development
Reviewed-by: Tomas Vondra
Backpatch-through: 13
Experimentation shows that clang will auto-vectorize the critical
multiplication loop if the termination condition is written "i2 < limit"
rather than "i2 <= limit". This seems unbelievably stupid, but I've
reproduced it on both clang 9.0.1 (RHEL8) and 11.0.3 (macOS Catalina).
gcc doesn't care, so tweak the code to do it that way.
Discussion: https://postgr.es/m/CAJ3gD9evtA_vBo+WMYMyT-u=keHX7-r8p2w7OSRfXf42LTwCZQ@mail.gmail.com
Some kernels can tell us the type of a "dirent", so we can avoid a call
to stat() or lstat() in many cases. Define a new function
get_dirent_type() to contain that logic, for use by the backend and
frontend versions of walkdir(), and perhaps other callers in future.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Juan José Santamaría Flecha <juanjo.santamaria@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2BFzxupGGN4GpUdbzZN%2Btn6FQPHo8w0Q%2BAPH5Wz8RG%2Bww%40mail.gmail.com
Compile numeric.c with -ftree-vectorize where available, and adjust
the innermost loop of mul_var() so that it is amenable to being
auto-vectorized. (Mainly, that involves making it process the arrays
left-to-right not right-to-left.)
Applying -ftree-vectorize actually makes numeric.o smaller, at least
with my compiler (gcc 8.3.1 on x86_64), and it's a little faster too.
Independently of that, fixing the inner loop to be vectorizable also
makes things a bit faster. But doing both is a huge win for
multiplications with lots of digits. For me, the numeric regression
test is the same speed to within measurement noise, but numeric_big
is a full 45% faster.
We also looked into applying -funroll-loops, but that makes numeric.o
bloat quite a bit, and the additional speed improvement is very
marginal.
Amit Khandekar, reviewed and edited a little by me
Discussion: https://postgr.es/m/CAJ3gD9evtA_vBo+WMYMyT-u=keHX7-r8p2w7OSRfXf42LTwCZQ@mail.gmail.com
Replace CFLAGS_VECTOR with CFLAGS_UNROLL_LOOPS and CFLAGS_VECTORIZE,
allowing us to distinguish whether we want to apply -funroll-loops,
-ftree-vectorize, or both to a particular source file. Up to now
the only consumer of the symbol has been checksum.c which wants
both, so that there was no need to distinguish; but that's about
to change.
Amit Khandekar, reviewed and edited a little by me
Discussion: https://postgr.es/m/CAJ3gD9evtA_vBo+WMYMyT-u=keHX7-r8p2w7OSRfXf42LTwCZQ@mail.gmail.com
Letting the caller provide a StringInfo to read into is helpful when
the caller needs to merge lines or otherwise modify the data after
it's been read. Notably, now the code added by commit 8f8154a50
can use pg_get_line_append() instead of having its own copy of that
logic. A follow-on commit will also make use of this.
Also, since StringInfo buffers are a minimum of 1KB long, blindly
using pg_get_line() in a loop can eat a lot more memory than one would
expect. I discovered for instance that commit e0f05cd5b caused initdb
to consume circa 10MB to read postgres.bki, even though that's under
1MB worth of data. A less memory-hungry alternative is to re-use the
same StringInfo for all lines and pg_strdup the results.
Discussion: https://postgr.es/m/1315832.1599345736@sss.pgh.pa.us
We reported the wrong types when complaining that an aggregate's
moving-aggregate implementation is inconsistent with its regular
implementation.
This was wrong since the feature was introduced, so back-patch
to all supported branches.
Jeff Janes
Discussion: https://postgr.es/m/CAMkU=1x808LH=LPhZp9mNSP0Xd1xDqEd+XeGcvEe48dfE6xV=A@mail.gmail.com
At some point back in the last century, somebody felt that reading
all of pg_type twice was cheaper, or at least easier, than using
repalloc() to resize the Typ[] array dynamically. That seems like an
entirely wacko proposition, so rewrite the code to do it the other
way. (To add insult to injury, there were two not-quite-identical
copies of said code.)
initdb.c's readfile() function had the same disease of preferring
to do double the I/O to avoid resizing its output array. Here,
we can make things easier by using the just-invented pg_get_line()
function to handle reading individual lines without a predetermined
notion of how long they are.
On my machine, it's difficult to detect any net change in the
overall runtime of initdb from these changes; but they should
help on slower buildfarm machines (especially since a buildfarm
cycle involves a lot of initdb's these days).
My attention was drawn to these places by scan-build complaints,
but on inspection they needed a lot more work than just suppressing
dead stores :-(
This commit improves the dependency registrations by taking advantage of
the preliminary work done in 63110c62, to group together the insertion
of dependencies of the same type to pg_depend. With the current layer
of routines available, and as only dependencies of the same type can be
grouped, there are code paths still doing more than one multi-insert
when it is necessary to register dependencies of multiple types
(constraint and index creation are two cases doing that).
While on it, this refactors some of the code to use ObjectAddressSet()
when manipulating object addresses.
Author: Daniel Gustafsson, Michael Paquier
Reviewed-by: Andres Freund, Álvaro Herrera
Discussion: https://postgr.es/m/20200807061619.GA23955@paquier.xyz
This is a follow-up of the work done in e3931d01. This case is a bit
different than pg_attribute and pg_shdepend: the maximum number of items
to insert is known in advance, but there is no need to handle pinned
dependencies. Hence, the base allocation for slots is done based on the
number of items and the maximum allowed with a cap at 64kB. Slots are
initialized once used to minimize the overhead of the operation.
The insertions can be done for dependencies of the same type. More
could be done by grouping the insertion of multiple dependency types in
a single batch. This is left as future work.
Some of the multi-insert logic is also simplified for pg_shdepend, as
per the feedback discussed for this specific patch. This also moves to
indexing.h the variable capping the maximum amount of data that can be
used at once for a multi-insert, instead of having separate definitions
for pg_attribute, pg_depend and pg_shdepend.
Author: Daniel Gustafsson, Michael Paquier
Reviewed-by: Andres Freund, Álvaro Herrera
Discussion: https://postgr.es/m/20200807061619.GA23955@paquier.xyz
Commit 3f60f690f only partially fixed the broken-status-tracking
issue in LogicalRepApplyLoop: we need ping_sent to have the same
lifetime as last_recv_timestamp. The effects are much less serious
than what that commit fixed, though. AFAICS this would just lead to
extra ping requests being sent, once per second until the sender
responds. Still, it's a bug, so backpatch to v10 as before.
Discussion: https://postgr.es/m/959627.1599248476@sss.pgh.pa.us
When reading a WAL record fails to find continuation record(s) of the
proper length, report what it expects, for clarity.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/20200903212152.GA15319@alvherre.pgsql
This node already handles multiple options using a bitmask, so having a
separate boolean flag is not necessary. This simplifies the code a bit
with less arguments to give to the reindex routines, by replacing the
boolean with an equivalent bitmask value.
Reviewed-by: Julien Rouhaud
Discussion: https://postgr.es/m/20200902110326.GA14963@paquier.xyz
This patch started out with the goal of harmonizing various arbitrary
limits on password length, but after awhile a better idea emerged:
let's just get rid of those fixed limits.
recv_password_packet() has an arbitrary limit on the packet size,
which we don't really need, so just drop it. (Note that this doesn't
really affect anything for MD5 or SCRAM password verification, since
those will hash the user's password to something shorter anyway.
It does matter for auth methods that require a cleartext password.)
Likewise remove the arbitrary error condition in pg_saslprep().
The remaining limits are mostly in client-side code that prompts
for passwords. To improve those, refactor simple_prompt() so that
it allocates its own result buffer that can be made as big as
necessary. Actually, it proves best to make a separate routine
pg_get_line() that has essentially the semantics of fgets(), except
that it allocates a suitable result buffer and hence will never
return a truncated line. (pg_get_line has a lot of potential
applications to replace randomly-sized fgets buffers elsewhere,
but I'll leave that for another patch.)
I built pg_get_line() atop stringinfo.c, which requires moving
that code to src/common/; but that seems fine since it was a poor
fit for src/port/ anyway.
This patch is mostly mine, but it owes a good deal to Nathan Bossart
who pressed for a solution to the password length problem and
created a predecessor patch. Also thanks to Peter Eisentraut and
Stephen Frost for ideas and discussion.
Discussion: https://postgr.es/m/09512C4F-8CB9-4021-B455-EF4C4F0D55A0@amazon.com
Because sigsetjmp() will restore the initial state with signals blocked,
the code path in bgworker.c for reporting an error and exiting would
execute that way. Usually this is fairly harmless; but if a parallel
worker had an error message exceeding the shared-memory communication
buffer size (16K) it would lock up, because it would wait for a
resume-sending signal from its parallel leader which it would never
detect.
To fix, just unblock signals at the appropriate point.
This can be shown to fail back to 9.6. The lack of parallel query
infrastructure makes it difficult to provide a simple test case for
9.5; but I'm pretty sure the issue exists in some form there as well,
so apply the code change there too.
Vignesh C, reviewed by Bharath Rupireddy, Robert Haas, and myself
Discussion: https://postgr.es/m/CALDaNm1d1hHPZUg3xU4XjtWBOLCrA+-2cJcLpw-cePZ=GgDVfA@mail.gmail.com
A backslash at the end of a line now causes the next line to be appended
to the current one (effectively, the backslash and newline are discarded).
This allows long HBA entries to be created without legibility problems.
While we're here, get rid of the former hard-wired length limit on
pg_hba.conf lines, by using an expansible StringInfo buffer instead
of a fixed-size local variable.
Since the same code is used to read the ident map file, these changes
apply there as well.
Fabien Coelho, reviewed by Justin Pryzby and David Zhang
Discussion: https://postgr.es/m/alpine.DEB.2.21.2003251906140.15243@pseudo
To add support for streaming of in-progress transactions into the
built-in logical replication, we need to do three things:
* Extend the logical replication protocol, so identify in-progress
transactions, and allow adding additional bits of information (e.g.
XID of subtransactions).
* Modify the output plugin (pgoutput) to implement the new stream
API callbacks, by leveraging the extended replication protocol.
* Modify the replication apply worker, to properly handle streamed
in-progress transaction by spilling the data to disk and then
replaying them on commit.
We however must explicitly disable streaming replication during
replication slot creation, even if the plugin supports it. We
don't need to replicate the changes accumulated during this phase,
and moreover we don't have a replication connection open so we
don't have where to send the data anyway.
Author: Tomas Vondra, Dilip Kumar and Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Ajin Cherian
Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
This splits a string at occurrences of a delimiter. It is exactly like
string_to_array() except for producing a set of values instead of an
array of values. Thus, the relationship of these two functions is
the same as between regexp_split_to_table() and regexp_split_to_array().
Although the same results could be had from unnest(string_to_array()),
this is somewhat faster than that, and anyway it seems reasonable to
have it for symmetry with the regexp functions.
Pavel Stehule, reviewed by Peter Smith
Discussion: https://postgr.es/m/CAFj8pRD8HOpjq2TqeTBhSo_QkzjLOhXzGCpKJ4nCs7Y9SQkuPw@mail.gmail.com
In SyncRepWaitForLSN() routine called in transaction commit time,
SyncRepLock is necessary to atomically both check the shared
sync_standbys_defined flag and operate the sync replication wait-queue.
On the other hand, when the flag is false, the lock is not necessary
because the wait-queue is not touched. But due to the changes by
commit 48c9f49265, previously the lock was taken whatever the flag was.
This could cause unnecessary performance overhead in every transaction
commit time. Therefore this commit avoids that unnecessary aquisition
of SyncRepLock.
Author: Fujii Masao
Reviewed-by: Asim Praveen, Masahiko Sawada,
Discussion: https://postgr.es/m/20200406050332.nsscfqjzk2d57zyx@alap3.anarazel.de
When multiple relations are reindexed, a scan of pg_class is done first
to build the list of relations to work on. However the REINDEX logic
has never checked if a relation listed still exists when beginning the
work on it, causing for example sudden cache lookup failures.
This commit adds safeguards against dropped relations for REINDEX,
similarly to VACUUM or CLUSTER where we try to open the relation,
ignoring it if it is missing. A new option is added to the REINDEX
routines to control if a missed relation is OK to ignore or not.
An isolation test, based on REINDEX SCHEMA, is added for the concurrent
and non-concurrent cases.
Author: Michael Paquier
Reviewed-by: Anastasia Lubennikova
Discussion: https://postgr.es/m/20200813043805.GE11663@paquier.xyz
Since other sessions aren't allowed to look into a temporary table
of our own session, we do not need to worry about the global xmin
horizon when setting the vacuum XID cutoff. Indeed, if we're not
inside a transaction block, we may set oldestXmin to be the next
XID, because there cannot be any in-doubt tuples in a temp table,
nor any tuples that are dead but still visible to some snapshot of
our transaction. (VACUUM, of course, is never inside a transaction
block; but we need to test that because CLUSTER shares the same code.)
This approach allows us to always clean out a temp table completely
during VACUUM, independently of concurrent activity. Aside from
being useful in its own right, that simplifies building reproducible
test cases.
Discussion: https://postgr.es/m/3490536.1598629609@sss.pgh.pa.us
We were already raising an error for DROP INDEX CONCURRENTLY on a
partitioned table, albeit a different and confusing one:
ERROR: DROP INDEX CONCURRENTLY must be first action in transaction
Change that to throw a more comprehensible error:
ERROR: cannot drop partitioned index \"%s\" concurrently
Michael Paquier authored the test case for indexes on temporary
partitioned tables.
Backpatch to 11, where indexes on partitioned tables were added.
Reported-by: Jan Mussler <jan.mussler@zalando.de>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/16594-d2956ca909585067@postgresql.org
Commit 808e13b282 introduced a few APIs to extend the existing Buffile
interface. In SharedFileSetDeleteOnProcExit, it tries to delete the list
element while traversing the list with 'foreach' construct which makes the
behavior of list traversal unpredictable.
Author: Amit Kapila
Reviewed-by: Dilip Kumar
Tested-by: Dilip Kumar and Neha Sharma
Discussion: https://postgr.es/m/CAA4eK1JhLatVcQ2OvwA_3s0ih6Hx9+kZbq107cXVsSWWukH7vA@mail.gmail.com
Historically, we've considered the state with relpages and reltuples
both zero as indicating that we do not know the table's tuple density.
This is problematic because it's impossible to distinguish "never yet
vacuumed" from "vacuumed and seen to be empty". In particular, a user
cannot use VACUUM or ANALYZE to override the planner's normal heuristic
that an empty table should not be believed to be empty because it is
probably about to get populated. That heuristic is a good safety
measure, so I don't care to abandon it, but there should be a way to
override it if the table is indeed intended to stay empty.
Hence, represent the initial state of ignorance by setting reltuples
to -1 (relpages is still set to zero), and apply the minimum-ten-pages
heuristic only when reltuples is still -1. If the table is empty,
VACUUM or ANALYZE (but not CREATE INDEX) will override that to
reltuples = relpages = 0, and then we'll plan on that basis.
This requires a bunch of fiddly little changes, but we can get rid of
some ugly kluges that were formerly needed to maintain the old definition.
One notable point is that FDWs' GetForeignRelSize methods will see
baserel->tuples = -1 when no ANALYZE has been done on the foreign table.
That seems like a net improvement, since those methods were formerly
also in the dark about what baserel->tuples = 0 really meant. Still,
it is an API change.
I bumped catversion because code predating this change would get confused
by seeing reltuples = -1.
Discussion: https://postgr.es/m/F02298E0-6EF4-49A1-BCB6-C484794D9ACC@thebuild.com
A failure when dropping concurrently an index used in a replica identity
could leave in pg_index an index marked as !indisvalid and
indisreplident. Reindexing this index would switch back indisvalid to
true, and if the replica identity of the parent relation was switched to
use a different index, it would be possible to finish with more than one
index marked as indisreplident. If that were to happen, this could mess
up with the relation cache as an incorrect index could be used for the
replica identity.
Indexes marked as invalid are discarded as candidates for the replica
identity, as of RelationGetIndexList(), so similarly to what is done
with indisclustered, resetting indisreplident when the index is marked
as invalid keeps things consistent. REINDEX CONCURRENTLY's swapping
already resets the flag for the old index, while the new index inherits
the value of the old index to-be-dropped, so only DROP INDEX was an
issue.
Even if this is a bug, the sequence able to reproduce a problem requires
a failure while running DROP INDEX CONCURRENTLY, something unlikely
going to happen in the field, so no backpatch is done.
Author: Michael Paquier
Reviewed-by: Dmitry Dolgov
Discussion: https://postgr.es/m/20200827025721.GN2017@paquier.xyz
collectMatchBitmap() needs to re-find the index tuple it was previously
looking at, after transiently dropping lock on the index page it's on.
The tuple should still exist and be at its prior position or somewhere
to the right of that, since ginvacuum never removes tuples but
concurrent insertions could add one. However, there was a thinko in
that logic, to the effect of expecting any inserted tuples to have the
same index "attnum" as what we'd been scanning. Since there's no
physical separation of tuples with different attnums, it's not terribly
hard to devise scenarios where this fails, leading to transient "lost
saved point in index" errors. (While I've duplicated this with manual
testing, it seems impossible to make a reproducible test case with our
available testing technology.)
Fix by just continuing the scan when the attnum doesn't match.
While here, improve the error message used if we do fail, so that it
matches the wording used in btree for a similar case.
collectMatchBitmap()'s posting-tree code path was previously not
exercised at all by our regression tests. While I can't make
a regression test that exhibits the bug, I can at least improve
the code coverage here, so do that. The test case I made for this
is an extension of one added by 4b754d6c1, so it only works in
HEAD and v13; didn't seem worth trying hard to back-patch it.
Per bug #16595 from Jesse Kinkead. This has been broken since
multicolumn capability was added to GIN (commit 27cb66fdf),
so back-patch to all supported branches.
Discussion: https://postgr.es/m/16595-633118be8eef9ce2@postgresql.org
The additional information added will be an offset number for heap
operations. This information will help us in finding the exact tuple due
to which the error has occurred.
Author: Mahendra Singh Thalor and Amit Kapila
Reviewed-by: Sawada Masahiko, Justin Pryzby and Amit Kapila
Discussion: https://postgr.es/m/CAKYtNApK488TDF4bMbw+1QH8HJf9cxdNDXquhU50TK5iv_FtCQ@mail.gmail.com
Allow BufFile to support temporary files that can be used by the single
backend when the corresponding files need to be survived across the
transaction and need to be opened and closed multiple times. Such files
need to be created as a member of a SharedFileSet.
Additionally, this commit implements the interface for BufFileTruncate to
allow files to be truncated up to a particular offset and extends the
BufFileSeek API to support the SEEK_END case. This also adds an option to
provide a mode while opening the shared BufFiles instead of always opening
in read-only mode.
These enhancements in BufFile interface are required for the upcoming
patch to allow the replication apply worker, to handle streamed
in-progress transactions.
Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
Previously the codes for pg_backend_memory_contexts were in
src/backend/utils/mmgr/mcxt.c. This commit moves them to
src/backend/utils/adt/mcxtfuncs.c so that mcxt.c basically includes
only the low-level interface for memory contexts.
Author: Atsushi Torikoshi
Reviewed-by: Michael Paquier, Fujii Masao
Discussion: https://postgr.es/m/20200819135545.GC19121@paquier.xyz
pg_backend_memory_contexts view contains some internal information of
memory contexts. Since exposing them to any users by default may cause
security issue, this commit allows only superusers to read this view,
by default, like we do for pg_shmem_allocations view.
Bump catalog version.
Author: Atsushi Torikoshi
Reviewed-by: Michael Paquier, Fujii Masao
Discussion: https://postgr.es/m/1414992.1597849297@sss.pgh.pa.us
It's a bit inefficient to test if a Bitmapset is empty by counting all the
members and seeing if that number is zero. It's much better just to use
bms_is_empty(). Likewise for checking if there are at least two members,
just use bms_membership(), which does not need to do anything more after
finding two members.
Discussion: https://postgr.es/m/CAApHDvpvwm_QjbDOb5xga%2BKmX9XkN9xQavNGm3SvDbVnCYOerQ%40mail.gmail.com
Reviewed-by: Tomas Vondra
We were displaying the wrong phase information for 'info' message in the
index clean up phase because we were switching to the previous phase a bit
early. We were also not displaying context information for heap phase
unless the block number is valid which is fine for error cases but for
messages at 'info' or lower error level it appears to be inconsistent with
index phase information.
Reported-by: Sawada Masahiko
Author: Sawada Masahiko
Reviewed-by: Amit Kapila
Backpatch-through: 13, where it was introduced
Discussion: https://postgr.es/m/CA+fd4k4HcbhPnCs7paRTw1K-AHin8y4xKomB9Ru0ATw0UeTy2w@mail.gmail.com
The trouble with doing this is that an apparently-constant subquery
output column isn't really constant if it is a grouping column that
appears in only some of the grouping sets. A qual using such a
column would be subject to incorrect const-folding after push-down,
as seen in bug #16585 from Paul Sivash.
To fix, just disable qual pushdown altogether if the sub-query has
nonempty groupingSets. While we could imagine far less restrictive
solutions, there is not much point in working harder right now,
because subquery_planner() won't move HAVING clauses to WHERE within
such a subquery. If the qual stays in HAVING it's not going to be
a lot more useful than if we'd kept it at the outer level.
Having said that, this restriction could be removed if we used a
parsetree representation that distinguished such outputs from actual
constants, which is something I hope to do in future. Hence, make
the patch a minimal addition rather than integrating it more tightly
(e.g. by renumbering the existing items in subquery_is_pushdown_safe's
comment).
Back-patch to 9.5 where grouping sets were introduced.
Discussion: https://postgr.es/m/16585-9d8c340d23ade8c1@postgresql.org
Commit 1281a5c90 rearranged the logic in this area rather drastically,
and it broke the case of adding a foreign key constraint in the same
ALTER that adds the pkey or unique constraint it depends on. While
self-referential fkeys are surely a pretty niche case, this used to
work so we shouldn't break it.
To fix, reorganize the scheduling rules in ATParseTransformCmd so
that a transformed AT_AddConstraint subcommand will be delayed into
a later pass in all cases, not only when it's been spit out as a
side-effect of parsing some other command type.
Also tweak the logic so that we won't run ATParseTransformCmd twice
while doing this. It seems to work even without that, but it's
surely wasting cycles to do so.
Per bug #16589 from Jeremy Evans. Back-patch to v13 where the new
code was introduced.
Discussion: https://postgr.es/m/16589-31c8d981ca503896@postgresql.org
If a CREATE TABLE command uses both LIKE and traditional inheritance,
Vars in CHECK constraints and expression indexes that are absorbed
from a LIKE parent table tended to get mis-numbered, resulting in
wrong answers and/or bizarre error messages (though probably not any
actual crashes, thanks to validation occurring in the executor).
In v12 and up, the same could happen to Vars in GENERATED expressions,
even in cases with no LIKE clause but multiple traditional-inheritance
parents.
The cause of the problem for LIKE is that parse_utilcmd.c supposed
it could renumber such Vars correctly during transformCreateStmt(),
which it cannot since we have not yet accounted for columns added via
inheritance. Fix that by postponing processing of LIKE INCLUDING
CONSTRAINTS, DEFAULTS, GENERATED, INDEXES till after we've performed
DefineRelation().
The error with GENERATED and multiple inheritance is a simple oversight
in MergeAttributes(); it knows it has to renumber Vars in inherited
CHECK constraints, but forgot to apply the same processing to inherited
GENERATED expressions (a/k/a defaults).
Per bug #16272 from Tom Gottfried. The non-GENERATED variants of the
issue are ancient, presumably dating right back to the addition of
CREATE TABLE LIKE; hence back-patch to all supported branches.
Discussion: https://postgr.es/m/16272-6e32da020e9a9381@postgresql.org