9bdb300de modified the EXPLAIN output for Hash Aggregate to show details
from parallel workers. However, it neglected to consider that a given
parallel worker may not have assisted with the given Hash Aggregate. This
can occur when workers fail to start or during Parallel Append with
enable_partitionwise_join enabled when only a single worker is working on
a non-parallel aware sub-plan. It could also happen if a worker simply
wasn't fast enough to get any work done before other processes went and
finished all the work.
The bogus output came from the fact that ExplainOpenWorker() skipped
showing any details for non-initialized workers but show_hashagg_info()
did show details from the worker. This meant that the worker properties
that were shown were not properly attributed to the worker that they
belong to.
In passing, we also now don't show Hash Aggregate properties for the
leader process when it did not contribute any work to the Hash Aggregate.
This can occur either during Parallel Append when only a parallel worker
worked on a given sub plan or with parallel_leader_participation set to
off. This aims to make the behavior of Hash Aggregate's EXPLAIN output
more similar to Sort's.
Reported-by: Justin Pryzby
Discussion: https://postgr.es/m/20200805012105.GZ28072%40telsasoft.com
Backpatch-through: 13, where the original breakage was introduced
This seems more correct, because other before_shmem_exit calls may
expect the infrastructure that is needed to run queries and access the
database to be working, and also because this cleanup has nothing to
do with shared memory.
There are no known user-visible consequences to this, though, apart
from what was previous fixed by commit
303640199d and back-patched as commit
bcbc27251d and commit
f7013683d9, so for now, no back-patch.
Bharath Rupireddy
Discussion: http://postgr.es/m/CALj2ACWk7j4F2v2fxxYfrroOF=AdFNPr1WsV+AGtHAFQOqm_pw@mail.gmail.com
Since we no longer require AccessExclusiveLock to add a partition,
the executor may see that a partitioned table has more partitions
than the planner saw. ExecCreatePartitionPruneState's code for
matching up the partition lists in such cases was faulty, and would
misbehave if the planner had successfully pruned any partitions from
the query. (Thus, trouble would occur only if a partition addition
happens concurrently with a query that uses both static and dynamic
partition pruning.) This led to an Assert failure in debug builds,
and probably to crashes or query misbehavior in production builds.
To repair the bug, just explicitly skip zeroes in the plan's
relid_map[] list. I also made some cosmetic changes to make the code
more readable (IMO anyway). Also, convert the cross-checking Assert
to a regular test-and-elog, since it's now apparent that this logic
is more fragile than one would like.
Currently, there's no way to repeatably exercise this code, except
with manual use of a debugger to stop the backend between planning
and execution. Hence, no test case in this patch. We oughta do
something about that testability gap, but that's for another day.
Amit Langote and Tom Lane, per report from Justin Pryzby. Oversight
in commit 898e5e329; backpatch to v12 where that appeared.
Discussion: https://postgr.es/m/20200802181131.GA27754@telsasoft.com
Currently, page unlink leaves remaining items "as is", but replay of
corresponding WAL-record re-initializes page leaving it with no items.
For the sake of consistency, this commit makes primary delete all the items
during page unlink as well.
Thanks to this change, we now don't mask contents of deleted btree page for
WAL consistency checking.
Discussion: https://postgr.es/m/CAPpHfdt_OTyQpXaPJcWzV2N-LNeNJseNB-K_A66qG%3DL518VTFw%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Peter Geoghegan
It was possible for the logic used by backward scans (which must reason
about concurrent page splits/deletions in its own peculiar way) to
become confused when running on a replica. Concurrent replay of a WAL
record that describes the second phase of page deletion could cause
_bt_walk_left() to get confused. btree_xlog_unlink_page() simply failed
to adhere to the same locking protocol that we use on the primary, which
is obviously wrong once you consider these two disparate functions
together. This bug is present in all stable branches.
More concretely, the problem was that nothing stopped _bt_walk_left()
from observing inconsistencies between the deletion's target page and
its original sibling pages when running on a replica. This is true even
though the second phase of page deletion is supposed to work as a single
atomic action. Queries running on replicas raised "could not find left
sibling of block %u in index %s" can't-happen errors when they went back
to their scan's "original" page and observed that the page has not been
marked deleted (even though it really was concurrently deleted).
There is no evidence that this actually happened in the real world. The
issue came to light during unrelated feature development work. Note
that _bt_walk_left() is the only code that cares about the difference
between a half-dead page and a fully deleted page that isn't also
exclusively used by nbtree VACUUM (unless you include contrib/amcheck
code). It seems very likely that backward scans are the only thing that
could become confused by the inconsistency. Even amcheck's complex
bt_right_page_check_scankey() dance was unaffected.
To fix, teach btree_xlog_unlink_page() to lock the left sibling, target,
and right sibling pages in that order before releasing any locks (just
like _bt_unlink_halfdead_page()). This is the simplest possible
approach. There doesn't seem to be any opportunity to be more clever
about lock acquisition in the REDO routine, and it hardly seems worth
the trouble in any case.
This fix might enable contrib/amcheck verification of leaf page sibling
links with only an AccessShareLock on the relation. An amcheck patch
from Andrey Borodin was rejected back in January because it clashed with
btree_xlog_unlink_page()'s lax approach to locking pages. It now seems
likely that the real problem was with btree_xlog_unlink_page(), not the
patch.
This is a low severity, low likelihood bug, so no backpatch.
Author: Michail Nikolaev
Diagnosed-By: Michail Nikolaev
Discussion: https://postgr.es/m/CANtu0ohkR-evAWbpzJu54V8eCOtqjJyYp3PQ_SGoBTRGXWhWRw@mail.gmail.com
Add a documenting assertion that's similar to the nearby assertion added
by commit cd8c73a3. This conveys that the entire call to _bt_pagedel()
does no work if it isn't possible to get a descent stack for the initial
scanblkno page.
This is useful for monitoring purposes with log parsing. Similarly to
pg_stat_activity, the leader's PID is shown only for active parallel
workers, minimizing the log footprint for the leaders as the equivalent
shared memory field is set as long as a backend is alive.
Author: Justin Pryzby
Reviewed-by: Álvaro Herrera, Michael Paquier, Julien Rouhaud, Tom Lane
Discussion: https://postgr.es/m/20200315111831.GA21492@telsasoft.com
We have edge-case bugs when assigning values in the last few dozen pages
before the wrap limit. We may introduce similar bugs in the future. At
default BLCKSZ, this makes such bugs unreachable outside of single-user
mode. Also, when VACUUM began to consume mxacts, multiStopLimit did not
change to compensate.
pg_upgrade may fail on a cluster that was already printing "must be
vacuumed" warnings. Follow the warning's instructions to clear the
warning, then run pg_upgrade again. One can still, peacefully consume
98% of XIDs or mxacts, so DBAs need not change routine VACUUM settings.
Discussion: https://postgr.es/m/20200621083513.GA3074645@rfd.leadboat.com
This allows AM-specific knowledge to be applied during creation of
pg_amop and pg_amproc entries. Specifically, the AM knows better than
core code which entries to consider as required or optional. Giving
the latter entries the appropriate sort of dependency allows them to
be dropped without taking out the whole opclass or opfamily; which
is something we'd like to have to correct obsolescent entries in
extensions.
This callback also opens the door to performing AM-specific validity
checks during opclass creation, rather than hoping than an opclass
developer will remember to test with "amvalidate". For the most part
I've not actually added any such checks yet; that can happen in a
follow-on patch. (Note that we shouldn't remove any tests from
"amvalidate", as those are still needed to cross-check manually
constructed entries in the initdb data. So adding tests to
"amadjustmembers" will be somewhat duplicative, but it seems like
a good idea anyway.)
Patch by me, reviewed by Alexander Korotkov, Hamid Akhtar, and
Anastasia Lubennikova.
Discussion: https://postgr.es/m/4578.1565195302@sss.pgh.pa.us
When doing multiple insertions in pg_shdepend for the copy of
dependencies from a template database in CREATE DATABASE, the same
number of slots would have been created and used all the time. As the
number of items to insert is not known in advance, this makes most of
the slots created for nothing. This improves the slot handling so as
slot creation only happens when needed, minimizing the overhead of the
operation.
Author: Michael Paquier
Reviewed-by: Daniel Gustafsson
Discussion: https://postgr.es/m/20200731024148.GB3317@paquier.xyz
When reading the code it's not obvious when one should prefer dynahash
over simplehash and vice-versa, so, for programmer-friendliness, add
comments to inform that decision.
Show sample simplehash method signatures.
Author: James Coleman <jtc331@gmail.com>
Discussion: https://postgr.es/m/CAAaqYe_dOF39gAJ8rL-a3YO3Qo96MHMRQ2whFjK5ZcU6YvMQSA%40mail.gmail.com
If a base type supports typmods, its array type does too, with the
same interpretation. Hence changes in pg_type.typmodin/typmodout
must be propagated to the array type.
While here, improve AlterTypeRecurse to not recurse to domains if
there is nothing we'd need to change.
Oversight in fe30e7ebf. Back-patch to v13 where that came in.
The new hlCover() algorithm that I introduced in commit c9b0c678d
turns out to potentially take O(N^2) or worse time on long documents,
if there are many occurrences of individual query words but few or no
substrings that actually satisfy the query. (One way to hit this
behavior is with a "common_word & rare_word" type of query.) This
seems unavoidable given the original goal of checking every substring
of the document, so we have to back off that idea. Fortunately, it
seems unlikely that anyone would really want headlines spanning all of
a long document, so we can avoid the worse-than-linear behavior by
imposing a maximum length of substring that we'll consider.
For now, just hard-wire that maximum length as a multiple of max_words
times max_fragments. Perhaps at some point somebody will argue for
exposing it as a ts_headline parameter, but I'm hesitant to make such
a feature addition in a back-patched bug fix.
I also noted that the hlFirstIndex() function I'd added in that
commit was unnecessarily stupid: it really only needs to check whether
a HeadlineWordEntry's item pointer is null or not. This wouldn't make
all that much difference in typical cases with queries having just
a few terms, but a cycle shaved is a cycle earned.
In addition, add a CHECK_FOR_INTERRUPTS call in TS_execute_recurse.
This ensures that hlCover's loop is cancellable if it manages to take
a long time, and it may protect some other TS_execute callers as well.
Back-patch to 9.6 as the previous commit was. I also chose to add the
CHECK_FOR_INTERRUPTS call to 9.5. The old hlCover() algorithm seems
to avoid the O(N^2) behavior, at least on the test case I tried, but
nonetheless it's not very quick on a long document.
Per report from Stephen Frost.
Discussion: https://postgr.es/m/20200724160535.GW12375@tamriel.snowman.net
Create an optional region in the main shared memory segment that can be
used to acquire and release "fast" DSM segments, and can benefit from
huge pages allocated at cluster startup time, if configured. Fall back
to the existing mechanisms when that space is full. The size is
controlled by a new GUC min_dynamic_shared_memory, defaulting to 0.
Main region DSM segments initially contain whatever garbage the memory
held last time they were used, rather than zeroes. That change revealed
that DSA areas failed to initialize themselves correctly in memory that
wasn't zeroed first, so fix that problem.
Discussion: https://postgr.es/m/CA%2BhUKGLAE2QBv-WgGp%2BD9P_J-%3Dyne3zof9nfMaqq1h3EGHFXYQ%40mail.gmail.com
Avoid repeatedly calling lseek(SEEK_END) during recovery by caching
the size of each fork. For now, we can't use the same technique in
other processes, because we lack a shared invalidation mechanism.
Do this by generalizing the pre-existing caching used by FSM and VM
to support all forks.
Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
For pg_attribute, this allows to insert at once a full set of attributes
for a relation (roughly 15% of WAL reduction in extreme cases). For
pg_shdepend, this reduces the work done when creating new shared
dependencies from a database template. The number of slots used for the
insertion is capped at 64kB of data inserted for both, depending on the
number of items to insert and the length of the rows involved.
More can be done for other catalogs, like pg_depend. This part requires
a different approach as the number of slots to use depends also on the
number of entries discarded as pinned dependencies. This is also
related to the rework or dependency handling for ALTER TABLE and CREATE
TABLE, mainly.
Author: Daniel Gustafsson
Reviewed-by: Andres Freund, Michael Paquier
Discussion: https://postgr.es/m/20190213182737.mxn6hkdxwrzgxk35@alap3.anarazel.de
Create LatchWaitSet at backend startup time, and use it to implement
WaitLatch(). This avoids repeated epoll/kqueue setup and teardown
system calls.
Reorder SubPostmasterMain() slightly so that we restore the postmaster
pipe and Windows signal emulation before we reach InitPostmasterChild(),
to make this work in EXEC_BACKEND builds.
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com
Add a GUC that acts as a multiplier on work_mem. It gets applied when
sizing executor node hash tables that were previously size constrained
using work_mem alone.
The new GUC can be used to preferentially give hash-based nodes more
memory than the generic work_mem limit. It is intended to enable admin
tuning of the executor's memory usage. Overall system throughput and
system responsiveness can be improved by giving hash-based executor
nodes more memory (especially over sort-based alternatives, which are
often much less sensitive to being memory constrained).
The default value for hash_mem_multiplier is 1.0, which is also the
minimum valid value. This means that hash-based nodes continue to apply
work_mem in the traditional way by default.
hash_mem_multiplier is generally useful. However, it is being added now
due to concerns about hash aggregate performance stability for users
that upgrade to Postgres 13 (which added disk-based hash aggregation in
commit 1f39bce0). While the old hash aggregate behavior risked
out-of-memory errors, it is nevertheless likely that many users actually
benefited. Hash agg's previous indifference to work_mem during query
execution was not just faster; it also accidentally made aggregation
resilient to grouping estimate problems (at least in cases where this
didn't create destabilizing memory pressure).
hash_mem_multiplier can provide a certain kind of continuity with the
behavior of Postgres 12 hash aggregates in cases where the planner
incorrectly estimates that all groups (plus related allocations) will
fit in work_mem/hash_mem. This seems necessary because hash-based
aggregation is usually much slower when only a small fraction of all
groups can fit. Even when it isn't possible to totally avoid hash
aggregates that spill, giving hash aggregation more memory will reliably
improve performance (the same cannot be said for external sort
operations, which appear to be almost unaffected by memory availability
provided it's at least possible to get a single merge pass).
The PostgreSQL 13 release notes should advise users that increasing
hash_mem_multiplier can help with performance regressions associated
with hash aggregation. That can be taken care of by a later commit.
Author: Peter Geoghegan
Reviewed-By: Álvaro Herrera, Jeff Davis
Discussion: https://postgr.es/m/20200625203629.7m6yvut7eqblgmfo@alap3.anarazel.de
Discussion: https://postgr.es/m/CAH2-WzmD%2Bi1pG6rc1%2BCjc4V6EaFJ_qSuKCCHVnH%3DoruqD-zqow%40mail.gmail.com
Backpatch: 13-, where disk-based hash aggregation was introduced.
When fast promotion was supported in 9.3, non-fast promotion became
undocumented feature and it's basically not available for ordinary users.
However we decided not to remove non-fast promotion at that moment,
to leave it for a release or two for debugging purpose or as an emergency
method because fast promotion might have some issues, and then to
remove it later. Now, several versions were released since that decision
and there is no longer reason to keep supporting non-fast promotion.
Therefore this commit removes non-fast promotion.
Author: Fujii Masao
Reviewed-by: Hamid Akhtar, Kyotaro Horiguchi
Discussion: https://postgr.es/m/76066434-648f-f567-437b-54853b43398f@oss.nttdata.com
Use HyperLogLog to estimate the group cardinality in a spilled
partition. This estimate is used to choose the number of partitions if
we recurse.
The previous behavior was to use the number of tuples in a spilled
partition as the estimate for the number of groups, which lead to
overpartitioning. That could cause the number of batches to be much
higher than expected (with each batch being very small), which made it
harder to interpret EXPLAIN ANALYZE results.
Reviewed-by: Peter Geoghegan
Discussion: https://postgr.es/m/a856635f9284bc36f7a77d02f47bbb6aaf7b59b3.camel@j-davis.com
Backpatch-through: 13
There were various unnecessary differences between Hash Agg's EXPLAIN
ANALYZE output and Hash Join's. Here we modify the Hash Agg output so
that it's better aligned to Hash Join's.
The following changes have been made:
1. Start batches counter at 1 instead of 0.
2. Always display the "Batches" property, even when we didn't spill to
disk.
3. Use the text "Batches" instead of "HashAgg Batches" for text format.
4. Use the text "Memory Usage" instead of "Peak Memory Usage" for text
format.
5. Include "Batches" before "Memory Usage" in both text and non-text
formats.
In passing also modify the "Planned Partitions" property so that we show
it regardless of if the value is 0 or not for non-text EXPLAIN formats.
This was pointed out by Justin Pryzby and probably should have been part
of 40efbf870.
Reviewed-by: Justin Pryzby, Jeff Davis
Discussion: https://postgr.es/m/CAApHDvrshRnA6C0VFnu7Fb9TVvgGo80PUMm5+2DiaS1gEkPvtw@mail.gmail.com
Backpatch-through: 13, where HashAgg batching was introduced
This adds seven methods to the output plugin API, adding support for
streaming changes of large in-progress transactions.
* stream_start
* stream_stop
* stream_abort
* stream_commit
* stream_change
* stream_message
* stream_truncate
Most of this is a simple extension of the existing methods, with
the semantic difference that the transaction (or subtransaction)
is incomplete and may be aborted later (which is something the
regular API does not really need to deal with).
This also extends the 'test_decoding' plugin, implementing these
new stream methods.
The stream_start/start_stop are used to demarcate a chunk of changes
streamed for a particular toplevel transaction.
This commit simply adds these new APIs and the upcoming patch to "allow
the streaming mode in ReorderBuffer" will use these APIs.
Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
In the case of range partitioning, get_steps_using_prefix() assumes that
the passed-in prefix list contains at least one clause for each of the
partition keys earlier than one specified in the passed-in
step_lastkeyno, but the caller (ie, gen_prune_steps_from_opexps())
didn't take it into account, which led to a server crash or incorrect
results when the list contained no clauses for such partition keys, as
reported in bug #16500 and #16501 from Kobayashi Hisanori. Update the
caller to call that function only when the list created there contains
at least one clause for each of the earlier partition keys in the case
of range partitioning.
While at it, fix some other issues:
* The list to pass to get_steps_using_prefix() is allowed to contain
multiple clauses for the same partition key, as described in the
comment for that function, but that function actually assumed that the
list contained just a single clause for each of middle partition keys,
which led to an assertion failure when the list contained multiple
clauses for such partition keys. Update that function to match the
comment.
* In the case of hash partitioning, partition keys are allowed to be
NULL, in which case the list to pass to get_steps_using_prefix()
contains no clauses for NULL partition keys, but that function treats
that case as like the case of range partitioning, which led to the
assertion failure. Update the assertion test to take into account
NULL partition keys in the case of hash partitioning.
* Fix a typo in a comment in get_steps_using_prefix_recurse().
* gen_partprune_steps() failed to detect self-contradiction from
strict-qual clauses and an IS NULL clause for the same partition key
in some cases, producing incorrect partition-pruning steps, which led
to incorrect results of partition pruning, but didn't cause any
user-visible problems fortunately, as the self-contradiction is
detected later in the query planning. Update that function to detect
the self-contradiction.
Per bug #16500 and #16501 from Kobayashi Hisanori. Patch by me, initial
diagnosis for the reported issue and review by Dmitry Dolgov.
Back-patch to v11, where partition pruning was introduced.
Discussion: https://postgr.es/m/16500-d1613f2a78e1e090%40postgresql.org
Discussion: https://postgr.es/m/16501-5234a9a0394f6754%40postgresql.org
Note: This GUC was originally named enable_hashagg_disk when it appeared
in commit 1f39bce0, which added disk-based hash aggregation. It was
subsequently renamed in commit 92c58fd9.
Author: Peter Geoghegan
Reviewed-By: Jeff Davis, Álvaro Herrera
Discussion: https://postgr.es/m/9d9d1e1252a52ea1bad84ea40dbebfd54e672a0f.camel%40j-davis.com
Backpatch: 13-, where disk-based hash aggregation was introduced.
Previously we would allocate blocks to parallel workers during a parallel
sequential scan 1 block at a time. Since other workers were likely to
request a block before a worker returns for another block number to work
on, this could lead to non-sequential I/O patterns in each worker which
could cause the operating system's readahead to perform poorly or not at
all.
Here we change things so that we allocate consecutive "chunks" of blocks
to workers and have them work on those until they're done, at which time
we allocate another chunk for the worker. The size of these chunks is
based on the size of the relation.
Initial patch here was by Thomas Munro which showed some good improvements
just having a fixed chunk size of 64 blocks with a simple ramp-down near
the end of the scan. The revisions of the patch to make the chunk size
based on the relation size and the adjusted ramp-down in powers of two was
done by me, along with quite extensive benchmarking to determine the
optimal chunk sizes.
For the most part, benchmarks have shown significant performance
improvements for large parallel sequential scans on Linux, FreeBSD and
Windows using SSDs. It's less clear how this affects the performance of
cloud providers. Tests done so far are unable to obtain stable enough
performance to provide meaningful benchmark results. It is possible that
this could cause some performance regressions on more obscure filesystems,
so we may need to later provide users with some ability to get something
closer to the old behavior. For now, let's leave that until we see that
it's really required.
Author: Thomas Munro, David Rowley
Reviewed-by: Ranier Vilela, Soumyadeep Chakraborty, Robert Haas
Reviewed-by: Amit Kapila, Kirk Jamison
Discussion: https://postgr.es/m/CA+hUKGJ_EErDv41YycXcbMbCBkztA34+z1ts9VQH+ACRuvpxig@mail.gmail.com
The initial implementation of leader_pid in pg_stat_activity added by
b025f32 took the approach to strictly print what a PGPROC entry
includes. In short, if a backend has been involved in parallel query at
least once, leader_pid would remain set as long as the backend is alive.
For a parallel group leader, this means that the field would always be
set after it participated at least once in parallel query, and after
more discussions this could be confusing if using for example a
connection pooler.
This commit changes the data printed so as leader_pid becomes always
NULL for a parallel group leader, showing up a non-NULL value only for
the parallel workers, and actually as long as a parallel query is
running as workers are shut down once the query has completed.
This does not change the definition of any catalog, so no catalog bump
is needed. Per discussion with Justin Pryzby, Álvaro Herrera, Julien
Rouhaud and me.
Discussion: https://postgr.es/m/20200721035145.GB17300@paquier.xyz
Backpatch-through: 13
OpenSSL deprecated RAND_cleanup(), and OpenSSL 1.1.0 made it into a
no-op. Replace it with RAND_poll(), per an OpenSSL community
recommendation. While this has no user-visible consequences under
OpenSSL defaults, it might help under non-default settings.
Daniel Gustafsson, reviewed by David Steele and Michael Paquier.
Discussion: https://postgr.es/m/9B038FA5-23E8-40D0-B932-D515E1D8F66A@yesql.se
At least on Linux and macOS, fread() turns out to have far higher
per-call overhead than one could wish. Reading 64KB of data at a time
and then parceling it out with our own memcpy logic makes binary COPY
from a file significantly faster --- around 30% in simple testing for
cases with narrow text columns (on Linux ... even more on macOS).
In binary COPY from frontend, there's no per-call fread(), and this
patch introduces an extra layer of memcpy'ing, but it still manages
to eke out a small win. Apparently, the control-logic overhead in
CopyGetData() is enough to be worth avoiding for small fetches.
Bharath Rupireddy and Amit Langote, reviewed by Vignesh C,
cosmetic tweaks by me
Discussion: https://postgr.es/m/CALj2ACU5Bz06HWLwqSzNMN=Gupoj6Rcn_QVC+k070V4em9wu=A@mail.gmail.com
Commit 85c9d347 addressed a similar problem for Gather and Gather
Merge nodes but forgot to account for nodes above parallel nodes. This
still works for nodes above Gather node because we shut down the workers
for Gather node as soon as there are no more tuples. We can do a similar
thing for Gather Merge as well but it seems better to account for stats
during nodes shutdown after completing the execution.
Reported-by: Stéphane Lorek, Jehan-Guillaume de Rorthais
Author: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Reviewed-by: Amit Kapila
Backpatch-through: 10, where it was introduced
Discussion: https://postgr.es/m/20200718160206.584532a2@firost
It's fairly silly that ignoring NOT subexpressions is TS_execute's
default behavior. It's wrong on its face and it encourages errors
of omission. Moreover, the only two remaining callers that aren't
specifying CALC_NOT are in ts_headline calculations, and it's very
arguable that those are bugs: if you've specified "!foo" in your
query, why would you want to get a headline that includes "foo"?
Hence, rip that out and change the default behavior to be to calculate
NOT accurately. As a concession to the slim chance that there is still
somebody somewhere who needs the incorrect behavior, provide a new
SKIP_NOT flag to explicitly request that.
Back-patch into v13, mainly because it seems better to change this
at the same time as the previous commit's rejiggering of TS_execute
related APIs. Any outside callers affected by this change are
probably also affected by that one.
Discussion: https://postgr.es/m/CALT9ZEE-aLotzBg-pOp2GFTesGWVYzXA3=mZKzRDa_OKnLF7Mg@mail.gmail.com
Text search sometimes failed to find valid matches, for instance
'!crew:A'::tsquery might fail to locate 'crew:1B'::tsvector during
an index search. The root of the issue is that TS_execute's callback
functions were not changed to use ternary (yes/no/maybe) reporting
when we made the search logic itself do so. It's somewhat annoying
to break that API, but on the other hand we now see that any code
using plain boolean logic is almost certainly broken since the
addition of phrase search. There seem to be very few outside callers
of this code anyway, so we'll just break them intentionally to get
them to adapt.
This allows removal of tsginidx.c's private re-implementation of
TS_execute, since that's now entirely duplicative. It's also no
longer necessary to avoid use of CALC_NOT in tsgistidx.c, since
the underlying callbacks can now do something reasonable.
Back-patch into v13. We can't change this in stable branches,
but it seems not quite too late to fix it in v13.
Tom Lane and Pavel Borisov
Discussion: https://postgr.es/m/CALT9ZEE-aLotzBg-pOp2GFTesGWVYzXA3=mZKzRDa_OKnLF7Mg@mail.gmail.com
When wal_level=logical, write invalidations at command end into WAL so
that decoding can use this information.
This patch is required to allow the streaming of in-progress transactions
in logical decoding. The actual work to allow streaming will be committed
as a separate patch.
We still add the invalidations to the cache and write them to WAL at
commit time in RecordTransactionCommit(). This uses the existing
XLOG_INVALIDATIONS xlog record type, from the RM_STANDBY_ID resource
manager (see LogStandbyInvalidations for details).
So existing code relying on those invalidations (e.g. redo) does not need
to be changed.
The invalidations written at command end uses a new xlog record type
XLOG_XACT_INVALIDATIONS, from RM_XACT_ID resource manager. See
LogLogicalInvalidations for details.
These new xlog records are ignored by existing redo procedures, which
still rely on the invalidations written to commit records.
The invalidations are decoded and accumulated in top-transaction, and then
executed during replay. This obviates the need to decode the
invalidations as part of a commit record.
Bump XLOG_PAGE_MAGIC, since this introduces XLOG_XACT_INVALIDATIONS.
Author: Dilip Kumar, Tomas Vondra, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
Add infinities that behave the same as they do in the floating-point
data types. Aside from any intrinsic usefulness these may have,
this closes an important gap in our ability to convert floating
values to numeric and/or replace float-based APIs with numeric.
The new values are represented by bit patterns that were formerly
not used (although old code probably would take them for NaNs).
So there shouldn't be any pg_upgrade hazard.
Patch by me, reviewed by Dean Rasheed and Andrew Gierth
Discussion: https://postgr.es/m/606717.1591924582@sss.pgh.pa.us
convutils.pm used implicit conversion of undefined value to integer
zero. Some of conversion scripts are susceptible to regexp greediness.
Fix, avoiding whitespace changes in the output. Also update ICU URLs
that moved.
No need to back-patch, because the output of these scripts is also in
the source tree so we shouldn't need to rerun them on back-branches.
Author: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ7SEGLbj%3D%3DTQCcyKRA9aqj8%2B6L%3DexSq1y25TA%3DWxLziQ%40mail.gmail.com
Since commit 044c99bc5, eqjoinsel passes the passed-in collation
to any operators it invokes. However, neqjoinsel failed to pass
on whatever collation it got, so that if we invoked a
collation-dependent operator via that code path, we'd get "could not
determine which collation to use for string comparison" or the like.
Per report from Justin Pryzby. Back-patch to v12, like the previous
commit.
Discussion: https://postgr.es/m/20200721191606.GL5748@telsasoft.com
Holding just a buffer pin (with no buffer lock) on an nbtree buffer/page
provides very weak guarantees, especially compared to heapam, where it's
often safe to read a page while only holding a buffer pin. This commit
has Valgrind enforce the following rule: it is never okay to access an
nbtree buffer without holding both a pin and a lock on the buffer.
A draft version of this patch detected questionable code that was
cleaned up by commits fa7ff642 and 7154aa16. The code in question used
to access an nbtree buffer page's special/opaque area with no buffer
lock (only a buffer pin). This practice (which isn't obviously unsafe)
is hereby formally disallowed in nbtree. There doesn't seem to be any
reason to allow it, and banning it keeps things simple for Valgrind.
The new checks are implemented by adding custom nbtree client requests
(located in LockBuffer() wrapper functions); these requests are
"superimposed" on top of the generic bufmgr.c Valgrind client requests
added by commit 1e0dfd16. No custom resource management cleanup code is
needed to undo the effects of marking buffers as non-accessible under
this scheme.
Author: Peter Geoghegan
Reviewed-By: Anastasia Lubennikova, Georgios Kokolatos
Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
Rather than always insisting on an exact match of the type OID in the
data to the element type or column type we expect, complain only when
both OIDs fall within the manually-assigned range. This acknowledges
the reality that user-defined types don't have stable OIDs, while
still preserving some of the mistake-detection value of the old test.
(It's not entirely clear whether to error if one OID is manually
assigned and the other isn't. But perhaps that case could arise in
cross-version cases where a former extension type has been imported
into core, so I let it pass.)
This change allows us to remove the prohibition on binary transfer
of user-defined arrays and composites in the recently-landed support
for binary logical replication (commit 9de77b545). We can just
unconditionally drop that check, since if the client has asked for
binary transfer it must be >= v14 and must have this change.
Discussion: https://postgr.es/m/CADK3HH+R3xMn=8t3Ct+uD+qJ1KD=Hbif5NFMJ+d5DkoCzp6Vgw@mail.gmail.com
The bug fixed in commit 72eab84a5 would not have occurred if initdb
had a less surprising rule about which columns should be marked
NOT NULL by default. Let's make that rule be strictly that the
column must be fixed-width and its predecessors must be fixed-width
and NOT NULL, removing the hacky and unsafe exceptions for oidvector
and int2vector.
Since we do still want all existing oidvector and int2vector columns
to be marked NOT NULL, we have to put BKI_FORCE_NOT_NULL labels on
them. But making this less magic and more documented seems like a
good idea, even if it's a shade more verbose.
I didn't bump catversion since the initial catalog contents are
not actually changed by this patch. Note however that the
contents of postgres.bki do change, and feeding an old copy of
that to a new backend will produce wrong results.
Discussion: https://postgr.es/m/204760.1595181800@sss.pgh.pa.us
The executor checks for this error, and so does the bootstrap catalog
loader, but we never checked for it in retail catalog manipulations.
The folly of that has now been exposed, so let's add assertions
checking it. Checking in CatalogTupleInsert[WithInfo] and
CatalogTupleUpdate[WithInfo] should be enough to cover this.
Back-patch to v10; the aforesaid functions didn't exist before that,
and it didn't seem worth adapting the patch to the oldest branches.
But given the risk of JIT crashes, I think we certainly need this
as far back as v11.
Pre-v13, we have to explicitly exclude pg_subscription.subslotname
and pg_subscription_rel.srsublsn from the checks, since they are
mismarked. (Even if we change our mind about applying BKI_FORCE_NULL
in the branch tips, it doesn't seem wise to have assertions that
would fire in existing databases.)
Discussion: https://postgr.es/m/298837.1595196283@sss.pgh.pa.us
Many situations where the offset is infinity were not handled sanely.
We should generally allow the val versus base +/- offset comparison to
proceed according to the normal rules of IEEE arithmetic; however, we
must do something special for the corner cases where base +/- offset
would produce NaN due to subtracting two like-signed infinities.
That corresponds to asking which values infinitely precede +inf or
infinitely follow -inf, which should certainly be true of any finite
value or of the opposite-signed infinity. After some discussion it
seems that the best decision is to make it true of the same-signed
infinity as well, ie, just return constant TRUE if the calculation
would produce a NaN.
(We could write this with a bit less code by subtracting anyway,
and then checking for a NaN result. However, I prefer this
formulation because it'll be easier to transpose into numeric.c.)
Although this seems like clearly a bug fix with respect to finite
values, it is less obviously correct for infinite values. Between
that and the fact that the whole issue only arises for very strange
window specifications (e.g. RANGE BETWEEN 'inf' PRECEDING AND 'inf'
PRECEDING), I'll desist from back-patching.
Noted by Dean Rasheed.
Discussion: https://postgr.es/m/3393130.1594925893@sss.pgh.pa.us
Strengthen the LockBuffer() assertion that verifies BufferIsValid() by
making it verify BufferIsPinned() instead. Do the same in nearby
related functions.
There is probably not much chance that anybody will try to lock a buffer
that is not already pinned, but we might as well make sure of that.
The code has always set this column to NULL when it's not valid,
but the catalog header's description failed to reflect that,
as did the SGML docs, as did some of the code. To prevent future
coding errors of the same ilk, let's hide the field from C code
as though it were variable-length (which, in a sense, it is).
As with commit 72eab84a5, we can only fix this cleanly in HEAD
and v13; the problem extends further back but we'll need some
klugery in the released branches.
Discussion: https://postgr.es/m/367660.1595202498@sss.pgh.pa.us
Commit b9c130a1f failed to apply the publisher-to-subscriber column
mapping while checking which columns were updated. Perhaps less
significantly, it didn't exclude dropped columns either. This could
result in an incorrect updated-columns bitmap and thus wrong decisions
about whether to fire column-specific triggers on the subscriber while
applying updates. In HEAD (since commit 9de77b545), it could also
result in accesses off the end of the colstatus array, as detected by
buildfarm member skink. Fix the logic, and adjust 003_constraints.pl
so that the problem is exposed in unpatched code.
In HEAD, also add some assertions to check that we don't access off
the ends of these newly variable-sized arrays.
Back-patch to v10, as b9c130a1f was.
Discussion: https://postgr.es/m/CAH2-Wz=79hKQ4++c5A060RYbjTHgiYTHz=fw6mptCtgghH2gJA@mail.gmail.com
max_slot_wal_keep_size that was added in v13 and wal_keep_segments are
the GUC parameters to specify how much WAL files to retain for
the standby servers. While max_slot_wal_keep_size accepts the number of
bytes of WAL files, wal_keep_segments accepts the number of WAL files.
This difference of setting units between those similar parameters could
be confusing to users.
To alleviate this situation, this commit renames wal_keep_segments to
wal_keep_size, and make users specify the WAL size in it instead of
the number of WAL files.
There was also the idea to rename max_slot_wal_keep_size to
max_slot_wal_keep_segments, in the discussion. But we have been moving
away from measuring in segments, for example, checkpoint_segments was
replaced by max_wal_size. So we concluded to rename wal_keep_segments
to wal_keep_size.
Back-patch to v13 where max_slot_wal_keep_size was added.
Author: Fujii Masao
Reviewed-by: Álvaro Herrera, Kyotaro Horiguchi, David Steele
Discussion: https://postgr.es/m/574b4ea3-e0f9-b175-ead2-ebea7faea855@oss.nttdata.com
The logical decoding infrastructure needs to know which top-level
transaction the subxact belongs to, in order to decode all the
changes. Until now that might be delayed until commit, due to the
caching (GPROC_MAX_CACHED_SUBXIDS), preventing features requiring
incremental decoding.
So we also write the assignment info into WAL immediately, as part
of the next WAL record (to minimize overhead) only when wal_level=logical.
We can not remove the existing XLOG_XACT_ASSIGNMENT WAL as that is
required for avoiding overflow in the hot standby snapshot.
Bump XLOG_PAGE_MAGIC, since this introduces XLR_BLOCK_ID_TOPLEVEL_XID.
Author: Tomas Vondra, Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma and Mahendra Singh Thalor
Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
There was no easy way to find how many times generic and custom plans
have been executed for a prepared statement. This commit exposes those
numbers of times in pg_prepared_statements view.
Author: Atsushi Torikoshi, Kyotaro Horiguchi
Reviewed-by: Tatsuro Yamada, Masahiro Ikeda, Fujii Masao
Discussion: https://postgr.es/m/CACZ0uYHZ4M=NZpofH6JuPHeX=__5xcDELF8hT8_2T+R55w4RQw@mail.gmail.com
Valgrind builds with assertions enabled sometimes perform a
theoretically unsafe page access inside an assertion in
heapam_tuple_lock(). This happened when the eval-plan-qual isolation
test ran one of the permutations added by commit a2418f9e23.
Avoid complaints from Valgrind by moving the assertion ever so slightly.
This is minor cleanup for commit 1e0dfd16, which added Valgrind buffer
access instrumentation.
No backpatch, since this only happens within an assertion, and seems
very unlikely to cause any real problems even with assert-enabled
builds.
Make PinBuffer() mark buffers as defined to Valgrind unconditionally,
including when the buffer header spinlock must be acquired. Failure to
handle that case could lead to false positive reports from Valgrind.
This theoretically creates a risk that we'll mark buffers defined even
when external callers don't end up with a buffer pin. That seems
perfectly acceptable, though, since in general we make no guarantees
about buffers that are unsafe to access being reliably marked as unsafe.
Oversight in commit 1e0dfd16, which added valgrind buffer access
instrumentation.
This patch adds a "binary" option to CREATE/ALTER SUBSCRIPTION.
When that's set, the publisher will send data using the data type's
typsend function if any, rather than typoutput. This is generally
faster, if slightly less robust.
As committed, we won't try to transfer user-defined array or composite
types in binary, for fear that type OIDs won't match at the subscriber.
This might be changed later, but it seems like fit material for a
follow-on patch.
Dave Cramer, reviewed by Daniel Gustafsson, Petr Jelinek, and others;
adjusted some by me
Discussion: https://postgr.es/m/CADK3HH+R3xMn=8t3Ct+uD+qJ1KD=Hbif5NFMJ+d5DkoCzp6Vgw@mail.gmail.com
The term "hash_mem" will take on new significance when pending work to
add a new hash_mem_multiplier GUC is committed. Rename a local variable
that happens to have been called hash_mem now to avoid confusion.
Teach Valgrind memcheck to maintain the "defined-ness" of each shared
buffer based on whether the backend holds at least one pin at the point
it is accessed by access method code. Bugs like the one fixed by commit
b0229f26 can be detected using this new instrumentation.
Note that backends running with Valgrind naturally have their own
independent ideas about whether any given byte in shared memory is safe
or unsafe to access. There is no risk that concurrent access by
multiple backends to the same shared memory will confuse Valgrind's
instrumentation, because everything already works at the process level
(or at the memory mapping level, if you prefer).
Author: Álvaro Herrera, Peter Geoghegan
Reviewed-By: Anastasia Lubennikova
Discussion: https://postgr.es/m/20150723195349.GW5596@postgresql.org
Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
There is no advantage to attempting deduplication for a unique index
during CREATE INDEX, since there cannot possibly be any duplicates.
Doing so wastes cycles due to unnecessary copying. Make sure that we
avoid it consistently.
We already avoided unique index deduplication in the case where there
were some spool2 tuples to merge. That didn't account for the fact that
spool2 is removed early/unset in the common case where it has no tuples
that need to be merged (i.e. it failed to account for the "spool2 turns
out to be unnecessary" optimization in _bt_spools_heapscan()).
Oversight in commit 0d861bbb, which added nbtree deduplication
Backpatch: 13-, where nbtree deduplication was introduced.
Commit 1e53fe0e70 has unified the usage of the config-file reload flag by
using the same signal handler function for the SIGHUP signal at many places
in the code. By mistake, it used the wrong SIGNAL in apply launcher
process for the SIGHUP signal handler function.
Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar
Backpatch-through: 13, where it was introduced
Discussion: https://postgr.es/m/CALj2ACVzHCRnS20bOiEHaLtP5PVBENZQn4khdsSJQgOv_GM-LA@mail.gmail.com
This representation saves 8 bytes per tuple compared to HeapTuple, and
avoids the need to allocate, copy and free on the receiving side.
Gather can emit the returned MinimalTuple directly, but GatherMerge now
needs to make an explicit copy because it buffers multiple tuples at a
time. That should be no worse than before.
Reviewed-by: Soumyadeep Chakraborty <soumyadeep2007@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2B8T_ggoUTAE-U%3DA%2BOcPc4%3DB0nPPHcSfffuQhvXXjML6w%40mail.gmail.com
When using the following functions, users could see various types of
errors of the type "cache lookup failed for OID XXX" with elog(), that
can only be used for internal errors:
* pg_describe_object()
* pg_identify_object()
* pg_identify_object_as_address()
The set of APIs managing object addresses for all object types are made
smarter by gaining a new argument "missing_ok" that allows any caller to
control if an error is raised or not on an undefined object. The SQL
functions listed above are changed to handle the case where an object is
missing.
Regression tests are added for all object types for the cases where
these are undefined. Before this commit, these cases failed with cache
lookup errors, and now they basically return NULL (minus the name of the
object type requested).
Author: Michael Paquier
Reviewed-by: Aleksander Alekseev, Dmitry Dolgov, Daniel Gustafsson,
Álvaro Herrera, Kyotaro Horiguchi
Discussion: https://postgr.es/m/CAB7nPqSZxrSmdHK-rny7z8mi=EAFXJ5J-0RbzDw6aus=wB5azQ@mail.gmail.com
reparameterize_path_by_child() failed to reparameterize BitmapAnd
and BitmapOr paths. This matters only if such a path is chosen as
the inside of a nestloop partition-wise join, where we have to pass
in parameters from the outside of the nestloop. If that did happen,
we generated a bad plan that would likely lead to crashes at execution.
This is not entirely reparameterize_path_by_child()'s fault though;
it's the victim of an ancient decision (my ancient decision, I think)
to not bother filling in param_info in BitmapAnd/Or path nodes. That
caused the function to believe that such nodes and their children
contain no parameter references and so need not be processed.
In hindsight that decision looks pretty penny-wise and pound-foolish:
while it saves a few cycles during path node setup, we do commonly
need the information later. In particular, by reversing the decision
and requiring valid param_info data in all nodes of a bitmap path
tree, we can get rid of indxpath.c's get_bitmap_tree_required_outer()
function, which computed the data on-demand. It's not unlikely that
that nets out as a savings of cycles in many scenarios. A couple
of other things in indxpath.c can be simplified as well.
While here, get rid of some cases in reparameterize_path_by_child()
that are visibly dead or useless, given that we only care about
reparameterizing paths that can be on the inside of a parameterized
nestloop. This case reminds one of the maxim that untested code
probably does not work, so I'm unwilling to leave unreachable code
in this function. (I did leave the T_Gather case in place even
though it's not reached in the regression tests. It's not very
clear to me when the planner might prefer to put Gather below
rather than above a nestloop, but at least in principle the case
might be interesting.)
Per bug #16536, originally from Arne Roland but with a test case
by Andrew Gierth. Back-patch to v11 where this code came in.
Discussion: https://postgr.es/m/16536-2213ee0b3aad41fd@postgresql.org
Three groups of issues needed to be addressed:
load_external_function() and related functions returned PGFunction,
even though not necessarily all callers are looking for a function of
type PGFunction. Since these functions are really just wrappers
around dlsym(), change to return void * just like dlsym().
In dynahash.c, we are using strlcpy() where a function with a
signature like memcpy() is expected. This should be safe, as the new
comment there explains, but the cast needs to be augmented to avoid
the warning.
In PL/Python, methods all need to be cast to PyCFunction, per Python
API, but this now runs afoul of these warnings. (This issue also
exists in core CPython.)
To fix the second and third case, we add a new type pg_funcptr_t that
is defined specifically so that gcc accepts it as a special function
pointer that can be cast to any other function pointer without the
warning.
Also add -Wcast-function-type to the standard warning flags, subject
to configure check.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/1e97628e-6447-b4fd-e230-d109cec2d584%402ndquadrant.com
An ALTER TABLE to validate a foreign key in which another subcommand
already caused a pending table rewrite could fail due to ALTER TABLE
attempting to validate the foreign key before the actual table rewrite
takes place. This situation could result in an error such as:
ERROR: could not read block 0 in file "base/nnnnn/nnnnn": read only 0 of 8192 bytes
The failure here was due to the SPI call which validates the foreign key
trying to access an index which is yet to be rebuilt.
Similarly, we also incorrectly tried to validate CHECK constraints before
the heap had been rewritten.
The fix for both is to delay constraint validation until phase 3, after
the table has been rewritten. For CHECK constraints this means a slight
behavioral change. Previously ALTER TABLE VALIDATE CONSTRAINT on
inheritance tables would be validated from the bottom up. This was
different from the order of evaluation when a new CHECK constraint was
added. The changes made here aligns the VALIDATE CONSTRAINT evaluation
order for inheritance tables to be the same as ADD CONSTRAINT, which is
generally top-down.
Reported-by: Nazli Ugur Koyluoglu, using SQLancer
Discussion: https://postgr.es/m/CAApHDvp%3DZXv8wiRyk_0rWr00skhGkt8vXDrHJYXRMft3TjkxCA%40mail.gmail.com
Backpatch-through: 9.5 (all supported versions)
Incorrect function names were referenced. As this fixes some portions
of tableam.h, that is mentioned in the docs as something to look at when
implementing a table AM, backpatch down to 12 where this has been
introduced.
Author: Hironobu Suzuki
Discussion: https://postgr.es/m/8fe6d672-28dd-3f1d-7aed-ac2f6d599d3f@interdb.jp
Backpatch-through: 12
The qual pushdown logic assumed that all Vars in a restriction clause
must be Vars referencing subquery outputs; but since we introduced
LATERAL, it's possible for such a Var to be a lateral reference instead.
This led to an assertion failure in debug builds. In a non-debug
build, there might be no ill effects (if qual_is_pushdown_safe decided
the qual was unsafe anyway), or we could get failures later due to
construction of an invalid plan. I've not gone to much length to
characterize the possible failures, but at least segfaults in the
executor have been observed.
Given that this has been busted since 9.3 and it took this long for
anybody to notice, I judge that the case isn't worth going to great
lengths to optimize. Hence, fix by just teaching qual_is_pushdown_safe
that such quals are unsafe to push down, matching the previous behavior
when it accidentally didn't fail.
Per report from Tom Ellis. Back-patch to all supported branches.
Discussion: https://postgr.es/m/20200713175124.GQ8220@cloudinit-builder
Remove previous hack in KeepLogSeg that added a case to deal with a
(badly represented) invalid segment number. This was added for the sake
of GetWALAvailability. But it's not needed if in that function we
initialize the segment number to be retreated to the currently being
written segment, so do that instead.
Per valgrind-running buildfarm member skink, and some sparc64 animals.
Discussion: https://postgr.es/m/1724648.1594230917@sss.pgh.pa.us
The stats with this commit was available only for WALSenders, however,
users might want to see for backends doing logical decoding via SQL API.
Then, users might want to reset and access these stats across server
restart which was not possible with the current patch.
List of commits reverted:
caa3c4242c Don't call elog() while holding spinlock.
e641b2a995 Doc: Update the documentation for spilled transaction
statistics.
5883f5fe27 Fix unportable printf format introduced in commit 9290ad198.
9290ad198b Track statistics for spilling of changes from ReorderBuffer.
Additionaly, remove the release notes entry for this feature.
Backpatch-through: 13, where it was introduced
Discussion: https://postgr.es/m/CA+fd4k5_pPAYRTDrO2PbtTOe0eHQpBvuqmCr8ic39uTNmR49Eg@mail.gmail.com
This includes two changes:
- Addition of a new function pg_xact_commit_timestamp_origin() able, for
a given transaction ID, to return the commit timestamp and replication
origin of this transaction. An equivalent function existed in
pglogical.
- Addition of the replication origin to pg_last_committed_xact().
The commit timestamp manager includes already APIs able to return the
replication origin of a transaction on top of its commit timestamp, but
the code paths for replication origins were never stressed as those
functions have never looked for a replication origin, and the SQL
functions available have never included this information since their
introduction in 73c986a.
While on it, refactor a test of modules/commit_ts/ to use tstzrange() to
check that a transaction timestamp is within the wanted range, making
the test a bit easier to read.
Bump catalog version.
Author: Movead Li
Reviewed-by: Madan Kumar, Michael Paquier
Discussion: https://postgr.es/m/2020051116430836450630@highgo.ca
The raw_buf and line_buf buffers aren't used when reading binary format,
so skip allocating them. raw_buf is 64K so that seems like a worthwhile
savings. An unused line_buf only wastes 1K, but as long as we're checking
it's free to avoid allocating that too.
Bharath Rupireddy, tweaked a bit by me
Discussion: https://postgr.es/m/CALj2ACXcCKaGPY0whowqrJ4OPJvDnTssgpGCzvuFQu5z0CXb-g@mail.gmail.com
"relkind" normally refers to the char field from pg_class. However, in
the parse nodes AlterTableStmt and CreateTableAsStmt, "relkind" was used
for a field of type enum ObjectType, that could refer to other object
types than those possible for a relkind. Such fields being usually
named "objtype", switch the name in both structures to make things more
consistent. Note that this led to some confusion in functions that
also operate on a RangeTableEntry object, which also has a field named
"relkind".
This naming goes back to commit 09d4e96, where only OBJECT_TABLE and
OBJECT_INDEX were used. This got extended later to use as well
OBJECT_TYPE with e440e12, not really a relation kind.
Author: Mark Dilger
Reviewed-by: Daniel Gustafsson, Álvaro Herrera, Michael Paquier
Discussion: https://postgr.es/m/609181AE-E399-47C7-9221-856E0F96BF93@enterprisedb.com
SQL standard doesn't define numeric Inf or NaN values. It appears even more
ridiculous to support then in jsonpath assuming JSON doesn't support these
values as well. This commit forbids returning NaN from .double(), which was
previously allowed. NaN can't be result of inner-jsonpath computation over
non-NaNs. So, we can not expect NaN in the jsonpath output.
Reported-by: Tom Lane
Discussion: https://postgr.es/m/203949.1591879542%40sss.pgh.pa.us
Author: Alexander Korotkov
Reviewed-by: Tom Lane
Backpatch-through: 12
When jsonpath .double() method detects that numeric or string can't be
converted to double precision, it throws an error. This commit makes these
errors explicitly express the reason of failure.
Discussion: https://postgr.es/m/CAPpHfdtqJtiSXkP7tOXez18NxhLUH_-75bL8%3DOce4Ki%2Bbv7V6Q%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Tom Lane
Backpatch-through: 12
This message was being emitted on the grounds that only crashed
summarization could cause it, but in reality even an aborted vacuum
could do it ... which makes it way too noisy, particularly since it
shows up in regression tests and makes them die.
Reported by Tom Lane.
Discussion: https://postgr.es/m/489091.1593534251@sss.pgh.pa.us
Due to not having our signals straight about CRLF vs. LF line
termination, the output of pg_current_logfile() included a trailing
\r on Windows. To fix, force the file descriptor it uses into text
mode.
While here, move a couple of local variable declarations to make
the function's logic clearer.
In v12 and v13, also back-patch the test added by 1c4e88e2f so that
this function has some test coverage. However, the 004_logrotate.pl
test script doesn't exist before v12, and it didn't seem worth adding
to older branches just for this.
Per report from Thomas Kellerer. Back-patch to v10 where this
function was added.
Discussion: https://postgr.es/m/412ae8da-76bb-640f-039a-f3513499e53d@gmx.net
The Sort node does not put a space between the number of kilobytes and
the "kB" of memory or disk space used, but HashAgg does. Here we align
HashAgg to do the same as Sort. Sort has been displaying this
information for longer than HashAgg, so it makes sense to align HashAgg
to Sort rather than the other way around.
Reported-by: Justin Pryzby
Discussion: https://postgr.es/m/20200708163021.GW4107@telsasoft.com
Backpatch-through: 13, where the hashagg started showing these details
Since slot_keep_segs indicates the number of WAL segments not LSN,
its datatype should not be XLogRecPtr.
Back-patch to v13 where this issue was added.
Reported-by: Atsushi Torikoshi
Author: Atsushi Torikoshi, tweaked by Fujii Masao
Discussion: https://postgr.es/m/ebd0d674f3e050222238a960cac5251a@oss.nttdata.com
Commit f7f70d5e2 left one inconsistency behind: we're still creating
pg_type entries for the composite types of sequences and toast tables,
but not arrays over those composites. But there seems precious little
reason to have named composite types for toast tables, and not much more
to have them for sequences (especially given the thought that sequences
may someday not be standalone relations at all).
So, let's close that inconsistency by removing these composite types,
rather than adding arrays for them. This buys back a little bit of
the initial pg_type bloat added by the previous patch, and could be
a significant savings in a large database with many toast tables.
Aside from a small logic rearrangement in heap_create_with_catalog,
this patch mostly needs to clean up some places that were assuming that
pg_class.reltype always has a valid value. Those are really pre-existing
bugs, given that it's documented otherwise; notably, the plpgsql changes
fix code that gives "cache lookup failed for type 0" on indexes today.
But none of these seem interesting enough to back-patch.
Also, remove the pg_dump/pg_upgrade infrastructure for propagating
a toast table's pg_type OID into the new database, since we no longer
need that.
Discussion: https://postgr.es/m/761F1389-C6A8-4C15-80CE-950C961F5341@gmail.com
The previous definition of the column was almost universally disliked,
so provide this updated definition which is more useful for monitoring
purposes: a large positive value is good, while zero or a negative value
means danger. This should be operationally more convenient.
Backpatch to 13, where the new column to pg_replication_slots (and the
feature it represents) were added.
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reported-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Discussion: https://postgr.es/m/9ddfbf8c-2f67-904d-44ed-cf8bc5916228@oss.nttdata.com
Previously we checked that the ssl pointer was not null, but this puts a
requirement on there being such a pointer which may not be true in
future multi-ssl-library supporting times. This seems to have been an
oversight in 9029f4b374, but hasn't really had any effect since we only
have one library.
Author: Daniel Gustafsson
nbtree index builds cannot write out an empty page. That would mean
that there was no way to create a pivot tuple pointing to the page one
level up, since _bt_truncate() generates one based on page's firstright
tuple.
Replace the unnecessary PageIsEmpty() check with an assertion that
checks that the page has space for at least two line pointers (the
would-be high key line pointer, plus at least one valid "data item"
tuple line pointer).
The PageIsEmpty() check was added by commit 5d9f146c over 20 years ago.
It looks like it has always been unnecessary.
When we invented arrays of composite types (commit bc8036fc6),
we excluded system catalogs, basically just on the grounds of not
wanting to bloat pg_type. However, it's definitely inconsistent that
catalogs' composite types can't be put into arrays when others can.
Another problem is that the exclusion is done by checking
IsUnderPostmaster in heap_create_with_catalog, which means that
(1) If a user tries to create a table in single-user mode, it doesn't
get an array type. That's bad in itself, plus it breaks pg_upgrade.
(2) If someone drops and recreates a system view or information_schema
view (as we occasionally recommend doing), it will now have an array
type where it did not before, making for still more inconsistency.
So this is all pretty messy. Let's just get rid of the inconsistency
and decree that system-created relations should have array types if
similar user-created ones would, i.e. it only depends on the relkind.
As of HEAD, that means that the initial contents of pg_type grow from
411 rows to 605, which is a lot of growth percentage-wise, but it's
still quite a small catalog compared to others.
Wenjing Zeng, reviewed by Shawn Wang, further hacking by me
Discussion: https://postgr.es/m/761F1389-C6A8-4C15-80CE-950C961F5341@gmail.com
This introduces a new set of extended routines for procedure and
operator name lookups, with a flag bitmask argument that can modify the
result. The following options are available:
- Force schema qualification, ignoring search_path. This is similar to
the existing option for format_{operator|procedure}_qualified().
- Force NULL as result instead of a numeric OID for an undefined
object. This option is new.
This is a refactoring similar to 1185c78, that will be used for a future
patch to improve the SQL functions providing information using object
addresses for undefined objects.
Author: Michael Paquier
Reviewed-by: Aleksander Alekseev, Dmitry Dolgov, Daniel Gustafsson,
Álvaro Herrera
Discussion: https://postgr.es/m/CAB7nPqSZxrSmdHK-rny7z8mi=EAFXJ5J-0RbzDw6aus=wB5azQ@mail.gmail.com
If a type scanned is undefined, type format routines have two behaviors
depending on if FORMAT_TYPE_ALLOW_INVALID is used by the caller or not:
- Issue a cache lookup error
- Return an undefined type name "???", "???[]" or "-"
The current interface is not really helpful for callers willing to
format properly a type name, but still make sure that the type is
defined as there could be types matching the strings generated when
looking for an undefined type, even if that should not be a problem in
practice. In order to counter that, add a new flag called
FORMAT_TYPE_INVALID_AS_NULL that returns a NULL result instead of "???
or "-" which does not generate an error. This flag will be used in a
follow-up patch improving the set of SQL functions showing information
for object addresses when it comes to undefined objects.
Author: Michael Paquier
Reviewed-by: Aleksander Alekseev, Dmitry Dolgov, Daniel Gustafsson,
Álvaro Herrera
Discussion: https://postgr.es/m/CAB7nPqSZxrSmdHK-rny7z8mi=EAFXJ5J-0RbzDw6aus=wB5azQ@mail.gmail.com
The cfbot and some BF animals are complaining about the previous
read_binary_file commit because of ignoring return value of ‘fread’.
So let's make everyone happy by testing the return value even though
not strictly needed.
Reported by Justin Pryzby, and suggested patch by Tom Lane. Backpatched
to v11 same as the previous commit.
Reported-By: Justin Pryzby
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/flat/969b8d82-5bb2-5fa8-4eb1-f0e685c5d736%40joeconway.com
Backpatch-through: 11
read_binary_file(), used by SQL functions pg_read_file() and friends,
uses stat to determine file length to read, when not passed an explicit
length as an argument. This is problematic, for example, if the file
being read is a virtual file with a stat-reported length of zero.
Arrange to read until EOF, or StringInfo data string lenth limit, is
reached instead.
Original complaint and patch by me, with significant review, corrections,
advice, and code optimizations by Tom Lane. Backpatched to v11. Prior to
that only paths relative to the data and log dirs were allowed for files,
so no "zero length" files were reachable anyway.
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/flat/969b8d82-5bb2-5fa8-4eb1-f0e685c5d736%40joeconway.com
Backpatch-through: 11
After running GetForeignRelSize for a foreign table, adjust rel->tuples
to be at least as large as rel->rows. This prevents bizarre behavior
in estimate_num_groups() and perhaps other places, especially in the
scenario where rel->tuples is zero because pg_class.reltuples is
(suggesting that ANALYZE has never been run for the table). As things
stood, we'd end up estimating one group out of any GROUP BY on such a
table, whereas the default group-count estimate is more likely to result
in a sane plan.
Also, clarify in the documentation that GetForeignRelSize has the option
to override the rel->tuples value if it has a better idea of what to use
than what is in pg_class.reltuples.
Per report from Jeff Janes. Back-patch to all supported branches.
Patch by me; thanks to Etsuro Fujita for review
Discussion: https://postgr.es/m/CAMkU=1xNo9cnan+Npxgz0eK7394xmjmKg-QEm8wYG9P5-CcaqQ@mail.gmail.com
Commit ecd9e9f0b fixed the problem in the wrong place, causing unwanted
side-effects on the behavior of GetNextTempTableSpace(). Instead,
let's make SharedFileSetInit() responsible for subbing in the value
of MyDatabaseTableSpace when the default tablespace is called for.
The convention about what is in the tempTableSpaces[] array is
evidently insufficiently documented, so try to improve that.
It also looks like SharedFileSetInit() is doing the wrong thing in the
case where temp_tablespaces is empty. It was hard-wiring use of the
pg_default tablespace, but it seems like using MyDatabaseTableSpace
is more consistent with what happens for other temp files.
Back-patch the reversion of PrepareTempTablespaces()'s behavior to
9.5, as ecd9e9f0b was. The changes in SharedFileSetInit() go back
to v11 where that was introduced. (Note there is net zero code change
before v11 from these two patch sets, so nothing to release-note.)
Magnus Hagander and Tom Lane
Discussion: https://postgr.es/m/CABUevExg5YEsOvqMxrjoNvb3ApVyH+9jggWGKwTDFyFCVWczGQ@mail.gmail.com
A likely copy/paste error in 98e8b48053 from back in 2004 would
cause temp tablespace to be reset to InvalidOid if temp_tablespaces
was set to the same value as the primary tablespace in the database.
This would cause shared filesets (such as for parallel hash joins)
to ignore them, putting the temporary files in the default tablespace
instead of the configured one. The bug is in the old code, but it
appears to have been exposed only once we had shared filesets.
Reviewed-By: Daniel Gustafsson
Discussion: https://postgr.es/m/CABUevExg5YEsOvqMxrjoNvb3ApVyH+9jggWGKwTDFyFCVWczGQ@mail.gmail.com
Backpatch-through: 9.5
Do the same for the maintenance_work_mem global variable.
Oversight in commit 848ae330a4, which increased the previous defaults
for work_mem and maintenance_work_mem by 4X.
Make some of the variable names in _bt_search() consistent with
corresponding variables within _bt_getstackbuf(). This naming scheme is
clearer because the variable names always express a relationship between
the currently locked buffer/page and some other page.
src/backend/replication/README includes since 32bc08b a basic
description of the WAL receiver hooks available in walreceiver.h for a
module like libpqwalreceiver, but the README has never been updated to
reflect changes done to the hooks, so the contents of the README have
rotten with the time. This commit moves the description from the README
to walreceiver.h, where it will be hard to miss that a description
update or addition is needed depending on the modifications done to the
hooks.
Each hook now includes a description of what it does in walreceiver.h,
and the replication's README mentions walreceiver.h.
Thanks also to Amit Kapila for the discussion.
Author: Michael Paquier
Reviewed-by: Peter Eisentraut
Discussion: https://postgr.es/m/20200502024606.GA471944@paquier.xyz
This is a follow-up commit similar to 68de144, with more places in the
backend code simplified with the macros able to assign values to the
fields of ObjectAddress. The code paths changed here could be
transitioned later into using more grouping when inserting dependency
records, simplifying this future work.
Author: Daniel Gustafsson, Michael Paquier
Discussion: https://postgr.es/m/20190213182737.mxn6hkdxwrzgxk35@alap3.anarazel.de
Use separate functions to save and restore error context information as
that made code easier to understand. Also, make it clear that the index
information required for error context is sane.
Author: Andres Freund, Justin Pryzby, Amit Kapila
Backpatch-through: 13, where it was introduced
Discussion: https://postgr.es/m/CAA4eK1LWo+v1OWu=Sky27GTGSCuOmr7iaURNbc5xz6jO+SaPeA@mail.gmail.com
When creating an extension, the same type of dependency is used when
registering a dependency to a schema and required extensions. This
improves the code so as those dependencies are not recorded one-by-one,
but grouped together. Note that this has as side effect to remove
duplicate dependency entries, even if it should not happen in practice
as extensions listed as required in a control file should be listed only
once.
Extracted from a larger patch by the same author.
Author: Daniel Dustafsson
Discussion: https://postgr.es/m/20200629065535.GA183079@paquier.xyz
The "Disk Usage" and "HashAgg Batches" properties in the EXPLAIN ANALYZE
output for HashAgg were previously only shown if the number of batches
was greater than 0. Here we change this so that these properties are
always shown for EXPLAIN ANALYZE formats other than "text". The idea here
is that since the HashAgg could have spilled to disk if there had been
more data or groups to aggregate, then it's relevant that we're clear in
the EXPLAIN ANALYZE output when no spilling occurred in this particular
execution of the given plan.
For the "text" EXPLAIN format, we still hide these properties when no
spilling occurs. This EXPLAIN format is designed to be easy for humans
to read. To maintain the readability we have a higher threshold for which
properties we display for this format.
Discussion: https://postgr.es/m/CAApHDvo_dmNozQQTmN-2jGp1vT%3Ddxx7Q0vd%2BMvD1cGpv2HU%3DSg%40mail.gmail.com
Backpatch-through: 13, where the hashagg spilling code was added.
Commit 54cd4f045 added some kluges to work around an old glibc bug,
namely that %.*s could misbehave if glibc thought any characters in
the supplied string were incorrectly encoded. Now that we use our
own snprintf.c implementation, we need not worry about that bug (even
if it still exists in the wild). Revert a couple of particularly
ugly hacks, and remove or improve assorted comments.
Note that there can still be encoding-related hazards here: blindly
clipping at a fixed length risks producing wrongly-encoded output
if the clip splits a multibyte character. However, code that's
doing correct multibyte-aware clipping doesn't really need a comment
about that, while code that isn't needs an explanation why not,
rather than a red-herring comment about an obsolete bug.
Discussion: https://postgr.es/m/279428.1593373684@sss.pgh.pa.us
Since %c only passes a C "char" to printf, it's incapable of dealing
with multibyte characters. Passing just the first byte of such a
character leads to an output string that is visibly not correctly
encoded, resulting in undesirable behavior such as encoding conversion
failures while sending error messages to clients.
We've lived with this issue for a long time because it was inconvenient
to avoid in a portable fashion. However, now that we always use our own
snprintf code, it's reasonable to use the %.*s format to print just one
possibly-multibyte character in a string. (We previously avoided that
obvious-looking answer in order to work around glibc's bug #6530, cf
commits 54cd4f045 and ed437e2b2.)
Hence, run around and fix a bunch of places that used %c to report
a character found in a user-supplied string. For simplicity, I did
not touch places that were emitting non-user-facing debug messages,
or reporting catalog data that should always be ASCII. (It's also
unclear how useful this approach could be in frontend code, where
it's less certain that we know what encoding we're dealing with.)
In passing, improve a couple of poorly-written error messages in
pageinspect/heapfuncs.c.
This is a longstanding issue, but I'm hesitant to back-patch because
of the impact on translatable message strings. In any case this fix
would not work reliably before v12.
Tom Lane and Quan Zongliang
Discussion: https://postgr.es/m/a120087c-4c88-d9d4-1ec5-808d7a7f133d@gmail.com
SQL:1999 had syntax
SUBSTRING(text FROM pattern FOR escapechar)
but this was replaced in SQL:2003 by the more clear
SUBSTRING(text SIMILAR pattern ESCAPE escapechar)
but this was never implemented in PostgreSQL. This patch adds that
new syntax as an alternative in the parser, and updates documentation
and tests to indicate that this is the preferred alternative now.
Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
Reviewed-by: Vik Fearing <vik@postgresfriends.org>
Reviewed-by: Fabien COELHO <coelho@cri.ensmp.fr>
Discussion: https://www.postgresql.org/message-id/flat/a15db31c-d0f8-8ce0-9039-578a31758adb%402ndquadrant.com
The logic used to build the set of dependencies needed for a type is
rather repetitive with direct assignments for each ObjectAddress field.
This refactors the code to use the macro ObjectAddressSet() instead, to
do the same work. There are more areas of the backend code that could
use this macro, but these are left for a follow-up patch that will
partially rework the way dependencies are recorded as well. Type
dependencies are left out of the follow-up patch, so they are refactored
separately here.
Extracted from a larger patch by the same author.
Author: Daniel Gustafsson
Discussion: https://potgr.es/m/20190213182737.mxn6hkdxwrzgxk35@alap3.anarazel.de
OpenSSL's native reports about problems related to protocol version
restrictions are pretty opaque and inconsistent. When we get an
SSL error that is plausibly due to this, emit a hint message that
includes the range of SSL protocol versions we (think we) are
allowing. This should at least get the user thinking in the right
direction to resolve the problem, even if the hint isn't totally
accurate, which it might not be for assorted reasons.
Back-patch to v13 where we increased the default minimum protocol
version, thereby increasing the risk of this class of failure.
Patch by me, reviewed by Daniel Gustafsson
Discussion: https://postgr.es/m/a9408304-4381-a5af-d259-e55d349ae4ce@2ndquadrant.com
We failed to save slot to disk after invalidating it, so the state was
lost in case of server restart or crash. Fix by marking it dirty and
flushing.
Also, if the slot is known invalidated we don't need to reason about the
LSN at all -- it's known invalidated. Only test the LSN if the slot is
known not invalidated.
Author: Fujii Masao <masao.fujii@oss.nttdata.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/17a69cfe-f1c1-a416-ee25-ae15427c69eb@oss.nttdata.com
Commit 0d861bbb, which added deduplication to nbtree, had
_bt_check_unique() pass a TID to table_index_fetch_tuple_check() that
isn't safe to mutate. table_index_fetch_tuple_check()'s tid argument is
modified when the TID in question is not the latest visible tuple in a
hot chain, though this wasn't documented.
To fix, go back to using a local copy of the TID in _bt_check_unique(),
and update comments above table_index_fetch_tuple_check().
Backpatch: 13-, where B-Tree deduplication was introduced.
If restart_lsn of logical replication slot gets behind more than
max_slot_wal_keep_size from the current LSN, the logical replication slot
would be invalidated and its restart_lsn is reset to an invalid LSN.
If this logical replication slot with an invalid restart_lsn was specified as
the source slot in pg_copy_logical_replication_slot(), the function caused
the assertion failure unexpectedly.
This assertion was added because restart_lsn should not be invalid before.
But in v13, it can be invalid thanks to max_slot_wal_keep_size. So since this
assertion is no longer useful, this commit removes it.
This commit also changes the errcode in the error message that
pg_copy_logical_replication_slot() emits when the slot with an invalid
restart_lsn is specified, to more appropriate one.
Back-patch to v13 where max_slot_wal_keep_size was added and
the assertion was no longer valid.
Author: Fujii Masao
Reviewed-by: Alvaro Herrera, Kyotaro Horiguchi
Discussion: https://postgr.es/m/f91de4fb-a7ab-b90e-8132-74796e049d51@oss.nttdata.com
In pg_replication_slot, change output from normal/reserved/lost to
reserved/extended/unreserved/ lost, which better expresses the possible
states particularly near the time where segments are no longer safe but
checkpoint has not run yet.
Under the new definition, reserved means the slot is consuming WAL
that's still under the normal WAL size constraints; extended means it's
consuming WAL that's being protected by wal_keep_segments or the slot
itself, whose size is below max_slot_wal_keep_size; unreserved means the
WAL is no longer safe, but checkpoint has not yet removed those files.
Such as slot is in imminent danger, but can still continue for a little
while and may catch up to the reserved WAL space.
Also, there were some bugs in the calculations used to report the
status; fixed those.
Backpatch to 13.
Reported-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20200616.120236.1809496990963386593.horikyota.ntt@gmail.com
The current definition is dangerous. No bugs exist in our code at
present, but backpatch to 11 nonetheless where it was introduced.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
The description of InsertPgAttributeTuple() does not match its handling
of pg_attribute contents with NULL values for a long time, with 911e702
making things more inconsistent. This adjusts the description to match
the reality.
Author: Daniel Gustafsson
Discussion: https://postgr.es/m/4E4E4B33-9FDF-4D21-B77A-642D027AEAD9@yesql.se
explain_get_index_name() applied quote_identifier() to the index name.
This is fine for text output, but the non-text output formats all have
their own quoting conventions and would much rather start from the
actual index name. For example in JSON you'd get something like
"Index Name": "\"My Index\"",
which is surely not desirable, especially when the same does not
happen for table names. Hence, move the responsibility for applying
quoting out to the callers, where it can go into already-existing
special code paths for text format.
This changes the API spec for users of explain_get_index_name_hook:
before, they were supposed to apply quote_identifier() if necessary,
now they should not. Research suggests that the only publicly
available user of the hook is hypopg, and it actually forgot to
apply quoting anyway, so it's fine. (In any case, there's no
behavioral change for the output of a hook as seen in non-text
EXPLAIN formats, so this won't break any case that programs should
be relying on.)
Digging in the commit logs, it appears that quoting was included in
explain_get_index_name's duties when commit 604ffd280 invented it;
and that was fine at the time because we only had text output format.
This should have been rethought when non-text formats were invented,
but it wasn't.
This is a fairly clear bug for users of non-text EXPLAIN formats,
so back-patch to all supported branches.
Per bug #16502 from Maciek Sakrejda. Patch by me (based on
investigation by Euler Taveira); thanks to Julien Rouhaud for review.
Discussion: https://postgr.es/m/16502-57bd1c9f913ed1d1@postgresql.org
spg_mask() didn't take into account that pd_lower equal to SizeOfPageHeaderData
is still valid value. This commit fixes that. Backpatch to 11, where
spg_mask() pg_lower check was introduced.
Reported-by: Michael Paquier
Discussion: https://postgr.es/m/20200615131405.GM52676%40paquier.xyz
Backpatch-through: 11
The function has been reading global variable forceSyncCommit, mirroring
the intent of the caller that passed forceSync=forceSyncCommit. The
other caller, RecordTransactionCommitPrepared(), passed false. Since
COMMIT PREPARED can't share a transaction with any command, it certainly
doesn't share a transaction with a command that sets forceSyncCommit.
Reviewed by Michael Paquier.
Discussion: https://postgr.es/m/20200617032615.GC2916904@rfd.leadboat.com
It was possible for deduplication's single value strategy to mistakenly
believe that a very small duplicate tuple counts as one of the six large
tuples that it aims to leave behind after the page finally splits. This
could cause slightly suboptimal space utilization with very low
cardinality indexes, though only under fairly narrow conditions.
To fix, be particular about what kind of tuple counts as a
maxpostingsize-capped tuple. This avoids confusion in the event of a
small tuple that gets "wedged" between two large tuples, where all
tuples on the page are duplicates of the same value.
Discussion: https://postgr.es/m/CAH2-Wz=Y+sgSFc-O3LpiZX-POx2bC+okec2KafERHuzdVa7-rQ@mail.gmail.com
Backpatch: 13-, where deduplication was introduced (by commit 0d861bbb)
This commit fixes the following issues.
1. There is the case where the slot is dropped while trying to invalidate it.
InvalidateObsoleteReplicationSlots() did not handle this case, and
which could cause checkpoint to fail.
2. InvalidateObsoleteReplicationSlots() could emit the same log message
multiple times unnecessary. It should be logged only once.
3. When marking the slot as used, we always searched the target slot from
all the replication slots even if we already found it. This could cause
useless waste of cycles.
Back-patch to v13 where these issues were added as a part of
max_slot_wal_keep_size code.
Author: Fujii Masao
Reviewed-by: Kyotaro Horiguchi, Alvaro Herrera
Discussion: https://postgr.es/m/66c05b67-3396-042c-1b41-bfa6c3ddcf82@oss.nttdata.com
Since 1f39bce02, HashAgg nodes have had the ability to spill to disk when
memory consumption exceeds work_mem. That commit added new properties to
EXPLAIN ANALYZE to show the maximum memory usage and disk usage, however,
it didn't quite go as far as showing that information for parallel
workers. Since workers may have experienced something very different from
the main process, we should show this information per worker, as is done
in Sort.
Reviewed-by: Justin Pryzby
Reviewed-by: Jeff Davis
Discussion: https://postgr.es/m/CAApHDvpEKbfZa18mM1TD7qV6PG+w97pwCWq5tVD0dX7e11gRJw@mail.gmail.com
Backpatch-through: 13, where the hashagg spilling code was added.
This was a danger only for --disable-spinlocks in combination with
atomic operations unsupported by the current platform.
While atomics.c was careful to signal that a separate semaphore ought
to be used when spinlock emulation is active, spin.c didn't actually
implement that mechanism. That's my (Andres') fault, it seems to have
gotten lost during the development of the atomic operations support.
Fix that issue and add test for nesting atomic operations inside a
spinlock.
Author: Andres Freund
Discussion: https://postgr.es/m/20200605023302.g6v3ydozy5txifji@alap3.anarazel.de
Backpatch: 9.5-
Advancing a replication slot did not recompute the oldest xmin and LSN
values across replication slots, preventing resource removal like
segments not recycled at checkpoint time. The original commit that
introduced the slot advancing in 9c7d06d never did the update of those
oldest values, and b0afdca removed this code.
This commit adds a TAP test to check segment recycling with advancing
for physical slots, enforcing an extra segment switch before advancing
to check if the segment gets correctly recycled after a checkpoint.
Reported-by: Andres Freund
Reviewed-by: Alexey Kondratov, Kyptaro Horiguchi
Discussion: https://postgr.es/m/20200609171904.kpltxxvjzislidks@alap3.anarazel.de
Backpatch-through: 11