Split the "Authentication and Security" section into two separate
sections "Authentication" and "SSL". The latter part has gotten much
longer over time, and doesn't primarily have to do with authentication.
Also, the row_security parameter was inconsistently categorized, so
clean that up while we're here.
In each of the supplied procedural languages (PL/pgSQL, PL/Perl,
PL/Python, PL/Tcl), add language-specific commit and rollback
functions/commands to control transactions in procedures in that
language. Add similar underlying functions to SPI. Some additional
cleanup so that transaction commit or abort doesn't blow away data
structures still used by the procedure call. Add execution context
tracking to CALL and DO statements so that transaction control commands
can only be issued in top-level procedure and block calls, not function
calls or other procedure or block calls.
- SPI
Add a new function SPI_connect_ext() that is like SPI_connect() but
allows passing option flags. The only option flag right now is
SPI_OPT_NONATOMIC. A nonatomic SPI connection can execute transaction
control commands, otherwise it's not allowed. This is meant to be
passed down from CALL and DO statements which themselves know in which
context they are called. A nonatomic SPI connection uses different
memory management. A normal SPI connection allocates its memory in
TopTransactionContext. For nonatomic connections we use PortalContext
instead. As the comment in SPI_connect_ext() (previously SPI_connect())
indicates, one could potentially use PortalContext in all cases, but it
seems safest to leave the existing uses alone, because this stuff is
complicated enough already.
SPI also gets new functions SPI_start_transaction(), SPI_commit(), and
SPI_rollback(), which can be used by PLs to implement their transaction
control logic.
- portalmem.c
Some adjustments were made in the code that cleans up portals at
transaction abort. The portal code could already handle a command
*committing* a transaction and continuing (e.g., VACUUM), but it was not
quite prepared for a command *aborting* a transaction and continuing.
In AtAbort_Portals(), remove the code that marks an active portal as
failed. As the comment there already predicted, this doesn't work if
the running command wants to keep running after transaction abort. And
it's actually not necessary, because pquery.c is careful to run all
portal code in a PG_TRY block and explicitly runs MarkPortalFailed() if
there is an exception. So the code in AtAbort_Portals() is never used
anyway.
In AtAbort_Portals() and AtCleanup_Portals(), we need to be careful not
to clean up active portals too much. This mirrors similar code in
PreCommit_Portals().
- PL/Perl
Gets new functions spi_commit() and spi_rollback()
- PL/pgSQL
Gets new commands COMMIT and ROLLBACK.
Update the PL/SQL porting example in the documentation to reflect that
transactions are now possible in procedures.
- PL/Python
Gets new functions plpy.commit and plpy.rollback.
- PL/Tcl
Gets new commands commit and rollback.
Reviewed-by: Andrew Dunstan <andrew.dunstan@2ndquadrant.com>
Add support for huge pages (called large pages on Windows) to the
Windows build.
This (probably) breaks compatibility with Windows versions prior to
Windows 2003 or Windows Vista.
Authors: Takayuki Tsunakawa and Thomas Munro
Reviewed by: Magnus Hagander, Amit Kapila
Apparently, Peter's compiler has faith that the switch test values here
could never not be valid values of their enums. Mine does not, and
I tend to agree with it.
When an UPDATE causes a row to no longer match the partition
constraint, try to move it to a different partition where it does
match the partition constraint. In essence, the UPDATE is split into
a DELETE from the old partition and an INSERT into the new one. This
can lead to surprising behavior in concurrency scenarios because
EvalPlanQual rechecks won't work as they normally did; the known
problems are documented. (There is a pending patch to improve the
situation further, but it needs more review.)
Amit Khandekar, reviewed and tested by Amit Langote, David Rowley,
Rajkumar Raghuwanshi, Dilip Kumar, Amul Sul, Thomas Munro, Álvaro
Herrera, Amit Kapila, and me. A few final revisions by me.
Discussion: http://postgr.es/m/CAJ3gD9do9o2ccQ7j7+tSgiE1REY65XRiMb=yJO3u3QhyP8EEPQ@mail.gmail.com
When an index column is an expression, it makes no sense to compare its
attribute numbers.
This seems to account for remaining buildfarm fallout from 8b08f7d482.
At least, it solves the issue in my local 32bit VM -- let's see what the
rest thinks.
AclObjectKind was basically just another enumeration for object types,
and we already have a preferred one for that. It's only used in
aclcheck_error. By using ObjectType instead, we can also give some more
precise error messages, for example "index" instead of "relation".
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
There used to be a lot of different *Type and *Kind symbol groups to
address objects within different commands, most of which have been
replaced by ObjectType, starting with
b256f24264. But this conversion was never
done for the ACL commands until now.
This change ends up being just a plain replacement of the types and
symbols, without any code restructuring needed, except deleting some now
redundant code.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Reviewed-by: Stephen Frost <sfrost@snowman.net>
When CREATE INDEX is run on a partitioned table, create catalog entries
for an index on the partitioned table (which is just a placeholder since
the table proper has no data of its own), and recurse to create actual
indexes on the existing partitions; create them in future partitions
also.
As a convenience gadget, if the new index definition matches some
existing index in partitions, these are picked up and used instead of
creating new ones. Whichever way these indexes come about, they become
attached to the index on the parent table and are dropped alongside it,
and cannot be dropped on isolation unless they are detached first.
To support pg_dump'ing these indexes, add commands
CREATE INDEX ON ONLY <table>
(which creates the index on the parent partitioned table, without
recursing) and
ALTER INDEX ATTACH PARTITION
(which is used after the indexes have been created individually on each
partition, to attach them to the parent index). These reconstruct prior
database state exactly.
Reviewed-by: (in alphabetical order) Peter Eisentraut, Robert Haas, Amit
Langote, Jesper Pedersen, Simon Riggs, David Rowley
Discussion: https://postgr.es/m/20171113170646.gzweigyrgg6pwsg4@alvherre.pgsql
For no apparent reason, this function was using a 16bit-wide inhseqno
value, rather than the correct 32 bit width which is what is stored in
the pg_inherits catalog. This becomes evident if you try to create a
table with more than 65535 parents, because this error appears:
ERROR: duplicate key value violates unique constraint «pg_inherits_relid_seqno_index»
DETAIL: Key (inhrelid, inhseqno)=(329371, 0) already exists.
Needless to say, having so many parents is an uncommon situations, which
explains why this error has never been reported despite being having
been introduced with the Postgres95 1.01 sources in commit d31084e9d111:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/creatinh.c;hb=d31084e9d111#l349
Backpatch all the way back.
David Rowley noticed this while reviewing a patch of mine.
Discussion: https://postgr.es/m/CAKJS1f8Dn7swSEhOWwzZzssW7747YB=2Hi+T7uGud40dur69-g@mail.gmail.com
node->partitioned_rels is only set in UPDATE/DELETE cases, but
ExecInitModifyTable only uses its "rel" variable in INSERT cases,
so the extra logic to find the root rel is just a waste of complexity
and cycles.
Etsuro Fujita, reviewed by Amit Langote
Discussion: https://postgr.es/m/93cf9816-2f7d-0f67-8ed2-4a4e497a6ab8@lab.ntt.co.jp
Ability to advance both physical and logical replication slots using a
new user function pg_replication_slot_advance().
For logical advance that means records are consumed as fast as possible
and changes are not given to output plugin for sending. Makes 2nd phase
(after we reached SNAPBUILD_FULL_SNAPSHOT) of replication slot creation
faster, especially when there are big transactions as the reorder buffer
does not have to deal with data changes and does not have to spill to
disk.
Author: Petr Jelinek
Reviewed-by: Simon Riggs
The creates a single function JsonEncodeDateTime which will format these
data types in an efficient and consistent manner. This will be all the
more important when we come to jsonpath so we don't have to implement yet
more code doing the same thing in two more places.
This also extends the code to handle time and timetz types which were
not previously handled specially. This requires exposing the time2tm and
timetz2tm functions.
Patch from Nikita Glukhov
If a query against an inheritance tree runs concurrently with an ALTER
TABLE that's disinheriting one of the tree members, it's possible to get
a "could not find inherited attribute" error because after obtaining lock
on the removed member, make_inh_translation_list sees that its columns
have attinhcount=0 and decides they aren't the columns it's looking for.
An ideal fix, perhaps, would avoid including such a just-removed member
table in the query at all; but there seems no way to accomplish that
without adding expensive catalog rechecks or creating a likelihood of
deadlocks. Instead, let's just drop the check on attinhcount. In this
way, a query that's included a just-disinherited child will still
succeed, which is not a completely unreasonable behavior.
This problem has existed for a long time, so back-patch to all supported
branches. Also add an isolation test verifying related behaviors.
Patch by me; the new isolation test is based on Kyotaro Horiguchi's work.
Discussion: https://postgr.es/m/20170626.174612.23936762.horiguchi.kyotaro@lab.ntt.co.jp
If we flatten a subquery whose target list contains constants or
expressions, when those output columns are used in GROUPING SET columns,
the planner was capable of doing the wrong thing by merging a pulled-up
expression into the surrounding expression during const-simplification.
Then the late processing that attempts to match subexpressions to grouping
sets would fail to match those subexpressions to grouping sets, with the
effect that they'd not go to null when expected.
To fix, wrap such subquery outputs in PlaceHolderVars, ensuring that
they preserve their separate identity throughout the planner's expression
processing. This is a bit of a band-aid, because the wrapper defeats
const-simplification even in places where it would be safe to allow.
But a nicer fix would likely be too invasive to back-patch, and the
consequences of the missed optimizations probably aren't large in most
cases.
Back-patch to 9.5 where grouping sets were introduced.
Heikki Linnakangas, with small mods and better test cases by me;
additional review by Andrew Gierth
Discussion: https://postgr.es/m/7dbdcf5c-b5a6-ef89-4958-da212fe10176@iki.fi
Add the ability to label a column's default value in the catalog header,
and implement this for pg_attribute. A new function in Catalog.pm is
used to fill in a tuple with defaults. The build process will complain
loudly if a catalog entry is incomplete,
Commit 8137f2c323 labeled variable length columns for the C preprocessor.
Expose that label to genbki.pl so we can exclude those columns from schema
macros in a general fashion. Also, format schema macro entries according
to their types.
This means slightly less code maintenance, but more importantly it's a
proving ground for mechanisms intended to be used in later commits.
While at it, I (Álvaro) couldn't resist making some changes in
genbki.pl: rename some functions to actually indicate their purpose
instead of actively misleading onlookers; and don't iterate on the whole
of pg_type to find the entry for each catalog row, using a hash instead
of an array.
Author: John Naylor, some changes by Álvaro Herrera
Discussion: https://postgr.es/m/CAJVSVGVJHwD8sfDfZW9TbCHWKf=C1YDRM-rF=2JenRU_y+VcFg@mail.gmail.com
This should have been done in commit 18ce3a4ab, which added that parameter
to ExplainOneQuery, but it was overlooked. This makes it impossible for
a user of the hook to pass the queryEnv down to ExplainOnePlan.
It's too late to change this API in v10, I suppose, but fortunately
passing NULL to ExplainOnePlan will work in nearly all interesting
cases in v10. That might not be true forever, so we'd better fix it.
Tatsuro Yamada, reviewed by Thomas Munro
Discussion: https://postgr.es/m/890e8dd9-c1c7-a422-6892-874f5eaee048@lab.ntt.co.jp
It seems incorrect to assume that the list of CkptSortItems can never
contain duplicate page numbers: concurrent activity could result in some
page getting dropped from a low-numbered buffer and later loaded into a
high-numbered buffer while BufferSync is scanning the buffer pool.
If that happened, the comparator would give self-inconsistent results,
potentially confusing qsort(). Saving one comparison step is not worth
possibly getting the sort wrong.
So far as I can tell, nothing would actually go wrong given our current
implementation of qsort(). It might get a bit slower than expected
if there were a large number of duplicates of one value, but that's
surely a probability-epsilon case. Still, the comment is wrong,
and if we ever switched to another sort implementation it might be
less forgiving.
In passing, avoid casting away const-ness of the argument pointers;
I've not seen any compiler complaints from that, but it seems likely
that some compilers would not like it.
Back-patch to 9.6 where this code came in, just in case I've underestimated
the possible consequences.
Discussion: https://postgr.es/m/18437.1515607610@sss.pgh.pa.us
PL/pgSQL "pins" internally generated (unnamed) portals so that user code
cannot close them by guessing their names. This logic is also useful in
other languages and really for any code. So move that logic into SPI.
An unnamed portal obtained through SPI_cursor_open() and related
functions is now automatically pinned, and SPI_cursor_close()
automatically unpins a portal that is pinned.
In the core distribution, this affects PL/Perl and PL/Python, preventing
users from manually closing cursors created by spi_query and
plpy.cursor, respectively. (PL/Tcl does not currently offer any cursor
functionality.)
Reviewed-by: Andrew Dunstan <andrew.dunstan@2ndquadrant.com>
The previous code gave the same error message for attempting to drop
pinned and active portals, but those are separate states, so give
separate error messages.
Previously aggregate transition and combination functions were invoked
by special case code in nodeAgg.c, evaluating input and filters
separately using the expression evaluation machinery. That turns out
to not be great for performance for several reasons:
- repeated expression evaluations have some cost
- the transition functions invocations are poorly predicted, as
commonly there are multiple aggregates in a query, resulting in the
same call-stack invoking different functions.
- filter and input computation had to be done separately
- the special case code made it hard to implement JITing of the whole
transition function invocation
Address this by building one large expression that computes input,
evaluates filters, and invokes transition functions.
This leads to moderate speedups in queries bottlenecked by aggregate
computations, and enables large speedups for similar cases once JITing
is done.
There's potential for further improvement:
- It'd be nice if we could simplify the somewhat expensive
aggstate->all_pergroups lookups.
- right now there's still an advance_transition_function invocation in
nodeAgg.c, leading to some code duplication.
Author: Andres Freund
Discussion: https://postgr.es/m/20170901064131.tazjxwus3k2w3ybh@alap3.anarazel.de
As src/backend/access/transam/README says, PageGetLSN may only be called
by processes holding either exclusive lock on buffer, or a shared lock
on buffer plus buffer header lock. Therefore any place that only holds
a shared buffer lock must use BufferGetLSNAtomic instead of PageGetLSN,
which internally obtains buffer header lock prior to reading the LSN.
A few callsites failed to comply with this rule. This was detected by
running all tests under a new (not committed) assertion that verifies
PageGetLSN locking contract. All but one of the callsites that failed
the assertion are fixed by this patch. Remaining callsites were
inspected manually and determined not to need any change.
The exception (unfixed callsite) is in TestForOldSnapshot, which only
has a Page argument, making it impossible to access the corresponding
Buffer from it. Fixing that seems a much larger patch that will have to
be done separately; and that's just as well, since it was only
introduced in 9.6 and other bugs are much older.
Some of these bugs are ancient; backpatch all the way back to 9.3.
Authors: Jacob Champion, Asim Praveen, Ashwin Agrawal
Reviewed-by: Michaël Paquier
Discussion: https://postgr.es/m/CABAq_6GXgQDVu3u12mK9O5Xt5abBZWQ0V40LZCE+oUf95XyNFg@mail.gmail.com
These are compatible with Oracle and required for the datetime template
language for jsonpath in an upcoming patch.
Nikita Glukhov and Andrew Dunstan, reviewed by Pavel Stehule.
After having gotten rid of PortalGetHeapMemory(), there seems little
reason to keep one Portal access macro around that offers no actual
abstraction and isn't consistently used anyway.
Reviewed-by: Andrew Dunstan <andrew.dunstan@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Rename PortalMemory to TopPortalContext, to avoid confusion with
PortalContext and align naming with similar top-level memory contexts.
Rename PortalData's "heap" field to portalContext. The "heap" naming
seems quite antiquated and confusing. Also get rid of the
PortalGetHeapMemory() macro and access the field directly, which we do
for other portal fields, so this abstraction doesn't buy anything.
Reviewed-by: Andrew Dunstan <andrew.dunstan@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
The initial implementation of list_qsort(), from commit ab7271677,
re-used the ListCells of the input list while not touching the List
header. This meant that anybody who still had a pointer to the
original header would now be in possession of a corrupted list,
a problem that seems sure to bite us eventually.
One possible solution is to re-use the original List header as well,
giving the function the semantics of update-in-place. However, that
doesn't seem like a very good idea either given the way that the
function is used in the planner: create_path functions aren't normally
supposed to modify their input lists. It doesn't look like there would
be a problem today, but it's not hard to foresee a time when modifying
a list of Paths in-place could have side-effects on some other append
path.
On the whole, and in view of the likelihood that this function might
be used in other contexts in the future, it seems best to get rid of
the micro-optimization of re-using the input list cells. Just build
a new list.
Discussion: https://postgr.es/m/16912.1515449066@sss.pgh.pa.us
Commit ab7271677 introduced code that attempts to order the child
scans of a Parallel Append node in a way that will minimize execution
time, based on total cost and startup cost. However, it failed to
think hard about what to do when estimated costs are exactly equal;
a case that's particularly likely to occur when comparing on startup
cost. In such a case the ordering of the child paths would be left
to the whims of qsort, an algorithm that isn't even stable.
We can improve matters by applying the rule used elsewhere in the
planner: if total costs are equal, sort on startup cost, and
vice versa. When both cost estimates are exactly equal, rather
than letting qsort do something unpredictable, sort based on the
child paths' relids, which should typically result in sorting in
inheritance order. (The latter provision requires inventing a
qsort-style comparator for bitmapsets, but maybe we'll have use
for that for other reasons in future.)
This results in a few plan changes in the select_parallel test,
but those all look more reasonable than before, when the actual
underlying cost numbers are taken into account.
Discussion: https://postgr.es/m/4944.1515446989@sss.pgh.pa.us
The general assumption for postmaster child processes is that they
should just exit(1), reasonably promptly, if the postmaster disappears.
condition_variable.c neglected this consideration and could be left
waiting forever, if the counterpart process it is waiting for has
done the right thing and exited.
We had some discussion of adjusting the WaitEventSet API to make it
harder to make this type of mistake in future; but for the moment,
and for v10, let's make this narrow fix.
Discussion: https://postgr.es/m/20412.1515456143@sss.pgh.pa.us
replorigin_drop() misunderstood the API for condition variables: it
had ConditionVariablePrepareToSleep and ConditionVariableCancelSleep
inside its test-and-sleep loop, rather than outside the loop as
intended. The net effect is a narrow race-condition window wherein,
if the process using a replication slot releases it immediately after
replorigin_drop() releases the ReplicationOriginLock, replorigin_drop()
would get into the condition variable's wait list too late and then
wait indefinitely for a signal that won't come.
Because there's a different CV for each replication slot, we can't
just move the ConditionVariablePrepareToSleep call to above the
test-and-sleep loop. What we can do, in the wake of commit 13db3b936,
is drop the ConditionVariablePrepareToSleep call entirely. This fix
depends on that commit because (at least in principle) the slot matching
the target replication origin might move around, so that once in a blue
moon successive loop iterations might involve different CVs. We can now
cope with such a scenario, at the cost of an extra trip through the
retry loop.
(There are ways we could fix this bug without depending on that commit,
but they're all a lot more complicated than this way.)
While at it, upgrade the rather skimpy comments in this function.
Back-patch to v10 where this code came in.
Discussion: https://postgr.es/m/19947.1515455433@sss.pgh.pa.us
The original coding here insisted that callers manually cancel any prepared
sleep for one condition variable before starting a sleep on another one.
While that's not a huge burden today, it seems like a gotcha that will bite
us in future if the use of condition variables increases; anything we can
do to make the use of this API simpler and more robust is attractive.
Hence, allow these functions to automatically switch their attention to
a different CV when required. This is safe for the same reason it was OK
for commit aced5a92b to let a broadcast operation cancel any prepared CV
sleep: whenever we return to the other test-and-sleep loop, we will
automatically re-prepare that CV, paying at most an extra test of that
loop's exit condition.
Back-patch to v10 where condition variables were introduced. Ordinarily
we would probably not back-patch a change like this, but since it does not
invalidate any coding pattern that was legal before, it seems safe enough.
Furthermore, there's an open bug in replorigin_drop() for which the
simplest fix requires this. Even if we chose to fix that in some more
complicated way, the hazard would remain that we might back-patch some
other bug fix that requires this behavior.
Patch by me, reviewed by Thomas Munro.
Discussion: https://postgr.es/m/2437.1515368316@sss.pgh.pa.us
There are plans to extend the syntax for ANALYZE, so we need to break
the link between VacuumStmt and AnalyzeStmt. But apart from that, the
syntax above is undocumented and, if discovered by users, might give
the impression that the VERBOSE option for VACUUM differs from the
verbose option from ANALYZE, which it does not.
Nathan Bossart, reviewed by Michael Paquier and Masahiko Sawada
Discussion: http://postgr.es/m/D3FC73E2-9B1A-4DB4-8180-55F57D116B4E@amazon.com
Previously, although the initial state of a proclist_node is expected
to be next == prev == 0, proclist_delete_offset would reset nodes to
next == prev == INVALID_PGPROCNO when removing them from a list.
This is the same state that a node in a singleton list has, so that
it's impossible to distinguish not-in-a-list from in-a-list. Change
proclist_delete_offset to reset removed nodes to next == prev == 0,
making it possible to distinguish those cases, and then add Asserts
to the list add and delete functions that the supplied node isn't
or is in a list at entry. Also tighten assertions about the node
being in the particular list (not some other one) where it is possible
to check that in O(1) time.
In ConditionVariablePrepareToSleep, since we don't expect the process's
cvWaitLink to already be in a list, remove the more-or-less-useless
proclist_contains check; we'd rather have proclist_push_tail's new
assertion fire if that happens.
Improve various comments related to proclists, too.
Patch by me, reviewed by Thomas Munro. This isn't back-patchable, since
there could theoretically be inlined copies of proclist_delete_offset in
third-party modules. But it's only improving debuggability anyway.
Discussion: https://postgr.es/m/CAEepm=0NWKehYw7NDoUSf8juuKOPRnCyY3vuaSvhrEWsOTAa3w@mail.gmail.com
In commit 561885db0, as part of normalizing RemovePgTempFiles's error
handling, I removed its behavior of silently ignoring ENOENT failures
during directory opens. Thomas Munro points out that this is a bad idea at
the top level, because we don't create pgsql_tmp directories until needed.
Thus this coding could produce LOG messages in perfectly normal situations,
which isn't what I intended. Restore the suppression of ENOENT logging,
but only at top level --- it would still be unexpected for a nested temp
directory to disappear between seeing it in the parent directory and
opening it.
Discussion: https://postgr.es/m/CAEepm=2y06SehAkTnd5sU_eVqdv5P-=Srt1y5vYNQk6yVDVaPw@mail.gmail.com
25fff40798 introduced
default monitoring roles. Apply these corrections:
* Allow access to pg_stat_get_wal_senders()
by role pg_read_all_stats
* Correct comment in pg_stat_get_wal_receiver()
to show it is no longer superuser-only.
Author: Feike Steenbergen
Reviewed-by: Michael Paquier
Apply to HEAD, then later backpatch to 10
In the wake of commit aced5a92b, the semantics of these results are
a bit squishy: we can tell whether we signaled some other process(es),
but we do not know which ones were real waiters versus mere sentinels
for ConditionVariableBroadcast operations. It does not help much that
ConditionVariableBroadcast will attempt to pass on the signal to the
next real waiter, because (a) there might not be one, and (b) that will
only happen awhile later, anyway. So these results could overstate how
much effect the calls really had.
However, no existing caller of either function pays any attention to its
result value, so it seems reasonable to just define that as a required
property of a correct algorithm. To encourage correctness and save some
tiny number of cycles, change both functions to return void.
Patch by me, per an observation by Thomas Munro. No back-patch, since
if any third parties happen to be using these functions, they might not
appreciate an API break in a minor release.
Discussion: https://postgr.es/m/CAEepm=0NWKehYw7NDoUSf8juuKOPRnCyY3vuaSvhrEWsOTAa3w@mail.gmail.com
In the admittedly-very-unlikely case that AddWaitEventToSet fails,
ConditionVariablePrepareToSleep would error out after already having
set cv_sleep_target, which is probably bad, and after having already
set cv_wait_event_set, which is very bad. Transaction abort might or
might not clean up cv_sleep_target properly; but there is nothing
that would be aware that the WaitEventSet wasn't fully constructed,
so that all future condition variable sleeps would be broken.
We can easily guard against these hazards with slight restructuring.
Back-patch to v10 where condition_variable.c was introduced.
Discussion: https://postgr.es/m/CAEepm=0NWKehYw7NDoUSf8juuKOPRnCyY3vuaSvhrEWsOTAa3w@mail.gmail.com
The original implementation of ConditionVariableBroadcast was, per its
self-description, "the dumbest way possible". Thomas Munro found out
it was a bit too dumb. An awakened process may immediately re-queue
itself, if the specific condition it's waiting for is not yet satisfied.
If this happens before ConditionVariableBroadcast is able to see the wait
queue as empty, then ConditionVariableBroadcast will re-awaken the same
process, repeating the cycle. Given unlucky timing this back-and-forth
can repeat indefinitely; loops lasting thousands of seconds have been
seen in testing.
To fix, add our own process to the end of the wait queue to serve as a
sentinel, and exit the broadcast loop once our process is not there
anymore. There are various special considerations described in the
comments, the principal disadvantage being that wakers can no longer
be sure whether they awakened a real waiter or just a sentinel. But in
practice nobody pays attention to the result of ConditionVariableSignal
or ConditionVariableBroadcast anyway, so that problem seems hypothetical.
Back-patch to v10 where condition_variable.c was introduced.
Tom Lane and Thomas Munro
Discussion: https://postgr.es/m/CAEepm=0NWKehYw7NDoUSf8juuKOPRnCyY3vuaSvhrEWsOTAa3w@mail.gmail.com
At present, we always raise an ERROR if the partition constraint
is violated, but a pending patch for UPDATE tuple routing will
consider instead moving the tuple to the correct partition.
Refactor to make that simpler.
Amit Khandekar, reviewed by Amit Langote, David Rowley, and me.
Discussion: http://postgr.es/m/CAJ3gD9cue54GbEzfV-61nyGpijvjZgCcghvLsB0_nL8Nm8HzCA@mail.gmail.com