The plan generated for sorted partial aggregation with "GROUP BY constant"
included a Sort node with no sort keys, which the executor does not like.
Per report from Steve Randall. I'd add a regression test case if I could
think of a compact one, but it doesn't seem worth expending lots of cycles
on.
Report: <CABVd52UAdGXpg_rCk46egpNKYdXOzCjuJ1zG26E2xBe_8bj+Fg@mail.gmail.com>
The idea behind SPI_push was to allow transitioning back into an
"unconnected" state when a SPI-using procedure calls unrelated code that
might or might not invoke SPI. That sounds good, but in practice the only
thing it does for us is to catch cases where a called SPI-using function
forgets to call SPI_connect --- which is a highly improbable failure mode,
since it would be exposed immediately by direct testing of said function.
As against that, we've had multiple bugs induced by forgetting to call
SPI_push/SPI_pop around code that might invoke SPI-using functions; these
are much harder to catch and indeed have gone undetected for years in some
cases. And we've had to band-aid around some problems of this ilk by
introducing conditional push/pop pairs in some places, which really kind
of defeats the purpose altogether; if we can't draw bright lines between
connected and unconnected code, what's the point?
Hence, get rid of SPI_push[_conditional], SPI_pop[_conditional], and the
underlying state variable _SPI_curid. It turns out SPI_restore_connection
can go away too, which is a nice side benefit since it was never more than
a kluge. Provide no-op macros for the deleted functions so as to avoid an
API break for external modules.
A side effect of this removal is that SPI_palloc and allied functions no
longer permit being called when unconnected; they'll throw an error
instead. The apparent usefulness of the previous behavior was a mirage
as well, because it was depended on by only a few places (which I fixed in
preceding commits), and it posed a risk of allocations being unexpectedly
long-lived if someone forgot a SPI_push call.
Discussion: <20808.1478481403@sss.pgh.pa.us>
Invent a new function heap_modify_tuple_by_cols() that is functionally
equivalent to SPI_modifytuple except that it always allocates its result
by simple palloc. I chose however to make the API details a bit more
like heap_modify_tuple: pass a tupdesc rather than a Relation, and use
bool convention for the isnull array.
Use this function in place of SPI_modifytuple at all call sites where the
intended behavior is to allocate in current context. (There actually are
only two call sites left that depend on the old behavior, which makes me
wonder if we should just drop this function rather than keep it.)
This new function is easier to use than heap_modify_tuple() for purposes
of replacing a single column (or, really, any fixed number of columns).
There are a number of places where it would simplify the code to change
over, but I resisted that temptation for the moment ... everywhere except
in plpgsql's exec_assign_value(); changing that might offer some small
performance benefit, so I did it.
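For illustration, replacing a single column might look about like this
(a sketch; "tuple" and "tupdesc" assumed to be at hand):

    int         replCol = 3;        /* 1-based attribute number to replace */
    Datum       replValue = Int32GetDatum(42);
    bool        replIsnull = false;
    HeapTuple   newtuple;

    /* result is palloc'd in the current memory context */
    newtuple = heap_modify_tuple_by_cols(tuple, tupdesc,
                                         1, &replCol,
                                         &replValue, &replIsnull);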
This is on the way to removing SPI_push/SPI_pop, but it seems like
good code cleanup in its own right.
Discussion: <9633.1478552022@sss.pgh.pa.us>
There's basically no scenario where it's sensible for this to match
dropped columns, so put a test for dropped-ness into SPI_fnumber()
itself, and excise the test from the small number of callers that
were paying attention to the case. (Most weren't :-(.)
In passing, normalize tests at call sites: always reject attnum <= 0
if we're disallowing system columns. Previously there was a mixture
of "< 0" and "<= 0" tests. This makes no practical difference since
SPI_fnumber() never returns 0, but I'm feeling pedantic today.
Also, in the places that are actually live user-facing code and not
legacy cruft, distinguish "column not found" from "can't handle
system column".
Per discussion with Jim Nasby; this supersedes his original patch
that just changed the behavior at one call site.
Discussion: <b2de8258-c4c0-1cb8-7b97-e8538e5c975c@BlueTreble.com>
The intent of the test is to check whether XLOG_SEG_SIZE is in a
particular range, but in one case it mistakenly compares XLOG_BLCKSZ
instead. Repair.
Commit 88e9823026 introduced this faulty test.
Kuntal Ghosh, reviewed by Michael Paquier.
When squeezing a bucket during vacuum, it's not necessary to retain
any tuples already marked as dead, so ignore them when deciding which
tuples must be moved in order to empty a bucket page. Similarly, when
splitting a bucket, relocating dead tuples to the new bucket is a
waste of effort; instead, just ignore them.
Amit Kapila, reviewed by me. Testing help provided by Ashutosh
Sharma.
Use a macro to generate the in and out functions for pseudotypes that
reject all input and output, saving many lines of redundant code.
Parameterize the error messages to reduce translatable strings.
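In sketch form, the macro looks about like this (exact name and details
may differ from the committed code):

    #define PSEUDOTYPE_DUMMY_IO_FUNCS(typname) \
    Datum \
    typname##_in(PG_FUNCTION_ARGS) \
    { \
        ereport(ERROR, \
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE), \
                 errmsg("cannot accept a value of type %s", #typname))); \
        PG_RETURN_VOID(); \
    } \
    Datum \
    typname##_out(PG_FUNCTION_ARGS) \
    { \
        ereport(ERROR, \
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE), \
                 errmsg("cannot display a value of type %s", #typname))); \
        PG_RETURN_VOID(); \
    }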
Commit 7a2fe9bd0 improved merge append so that replacement of a tuple
takes log(N) operations, not twice log(N). Since cost_merge_append knew
about that explicitly, we should adjust it. This probably makes little
difference in practice, but the obsolete comment is confusing.
Ideally this would have been included in 9.3 with the underlying behavior
change; but I'm not going to back-patch it, since there's some small chance
of changing a plan choice that somebody's optimized for.
Thomas Munro
Discussion: <CAEepm=0WQBSvuYcMOUj4Ga4NXpu2J=ejZcE=e=eiTjTX-6_gDw@mail.gmail.com>
Generally, WAL resource managers are only supposed to examine the
top 4 bits of a WAL record's xl_info; the rest are reserved for
the WAL mechanism itself. A few places were not consistent about
doing this with respect to XLOG_CHECKPOINT and XLOG_SWITCH records.
There's no bug currently, since no additional bits ever get set in
these specific record types, but that might not be true forever.
Let's follow the generic coding rule here too.
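In code form, the generic rule is (sketch):

    /* in an rm_redo or rm_desc routine; the low bits belong to xlog */
    uint8   info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;

    if (info == XLOG_CHECKPOINT_SHUTDOWN)
    {
        /* ... handle it ... */
    }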
Michael Paquier
This is infrastructure for the complete SQL standard feature. No
support is included at this point for execution nodes or PLs. The
intent is to add that soon.
As this patch leaves things, standard syntax can create tuplestores
to contain old and/or new versions of rows affected by a statement.
References to these tuplestores are in the TriggerData structure.
C triggers can access the tuplestores directly, so they are usable,
but they cannot yet be referenced within a SQL statement.
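For instance, a C trigger could drain the new-rows tuplestore about like
this (a sketch; field names as added to TriggerData by this patch,
per-row processing elided):

    TriggerData    *trigdata = (TriggerData *) fcinfo->context;
    TupleTableSlot *slot;

    slot = MakeSingleTupleTableSlot(trigdata->tg_relation->rd_att);
    if (trigdata->tg_newtable != NULL)
    {
        while (tuplestore_gettupleslot(trigdata->tg_newtable,
                                       true, false, slot))
        {
            /* examine one "new" row via slot */
        }
    }
    ExecDropSingleTupleTableSlot(slot);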
LockBufferForCleanup() acquires a cleanup lock unconditionally, and
ConditionalLockBufferForCleanup() acquires a cleanup lock if it is
possible to do so without waiting; this patch adds a new API,
IsBufferCleanupOK(), which tests whether an exclusive lock already
held happens to be a cleanup lock. This is possible because a cleanup
lock simply means an exclusive lock plus the assurance that any other
pins on the buffer are newer than our own pin. Therefore, just as the
existing functions decide that the exclusive lock that they've just
taken is a cleanup lock if they observe the pin count to be 1, this
new function allows us to observe that the pin count is 1 on a buffer
we've already locked.
This is useful in situations where a backend definitely wishes to
modify the buffer and also wishes to perform cleanup operations if
possible. The patch to eliminate heavyweight locking by hash indexes
uses this, and it may have other applications as well.
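Expected usage is about like this (sketch):

    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

    /* ... make the intended modifications ... */

    if (IsBufferCleanupOK(buf))
    {
        /* the exclusive lock is also a cleanup lock; do cleanup work */
    }

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);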
Amit Kapila, per a suggestion from me. Some comment adjustments by me
as well.
The foreign-key-aware logic for estimation of join sizes (added in commit
100340e2d) blindly tried to apply the concept to rels that are actually
parents of inheritance trees. This is just plain wrong so far as the
referenced relation is concerned, since the inheritance scan may well
produce lots of rows that are not participating in the constraint. It's
wrong for the referencing relation too, for the same reason; although on
that end we could conceivably detect whether all members of the inheritance
tree have equivalent FK constraints pointing to the same referenced rel,
and then proceed more or less as we do now. But pending somebody writing
code to do that, we must disable this, because it's producing completely
silly estimates when there's an FK linking the heads of inheritance trees.
Per bug #14404 from Clinton Adams. Back-patch to 9.6 where the new
estimation logic came in.
Report: <20161028200412.15987.96482@wrigleys.postgresql.org>
While converting expressions in an upper-level plan node so that they
reference Vars and expressions provided by the input plan node(s),
don't convert plain Const items, even if there happens to be a matching
Const in the input. It's silly to do so because a Var is more expensive to
execute than a Const. Moreover, converting can fool ExecCheckPlanOutput's
check that an insert or update query inserts nulls into dropped columns,
leading to "query provides a value for a dropped column" errors during
INSERT or UPDATE on a table with a dropped column. We could solve this
by making that check more complicated, but I don't see the point; this fix
should save a marginal number of cycles, and it also makes for less messy
EXPLAIN output, as shown by the ensuing regression test result changes.
Per report from Pavel Hanák. I have not incorporated a test case based
on that example, as there doesn't seem to be a simple way of checking
this in isolation without making a bunch of assumptions about other
planner and SQL-function behavior.
Back-patch to 9.6. This setrefs.c behavior exists much further back,
but there is currently no reason to think that it causes problems
before 9.6.
Discussion: <83shraampf.fsf@is-it.eu>
This serves as implicit documentation and is handy if someone wants to
tweak things. Like this entire directory, the rules are not part of a
normal build.
tsquery_rewrite() tries to find matches to subsets of AND/OR conditions;
for example, in the query 'a | b | c' the substitution subquery 'a | c'
should match and lead to replacement of the first and third items.
That's fine, but the matching algorithm apparently takes about O(2^N)
for an N-clause query (I say "apparently" because the code is also both
unintelligible and uncommented). We could probably do better than that
even without any extra assumptions --- but actually, we know that the
subclauses are sorted, indeed are depending on that elsewhere in this very
same function. So we can just scan the two lists a single time to detect
matches, as though we were doing a merge join.
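Conceptually, the new scan looks like this (a sketch, not the actual
code; the array names are hypothetical):

    int     i = 0,
            j = 0,
            nmatched = 0;

    while (i < nquery && j < nsubst)
    {
        int cmp = QTNodeCompare(query[i], subst[j]);

        if (cmp < 0)
            i++;                /* query item can't be part of the match */
        else if (cmp > 0)
            j++;                /* this subst item has no counterpart */
        else
        {
            nmatched++;         /* matched; advance both sides */
            i++;
            j++;
        }
    }
    /* the substitution matches iff nmatched == nsubst */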
Also do a re-flattening call (QTNTernary()) in tsquery_rewrite_query, just
to make sure that the tree fits the expectations of the next search cycle.
I didn't try to devise a test case for this, but I'm pretty sure that the
oversight could have led to failure to match in some cases where a match
would be expected.
Improve comments, and also stick a CHECK_FOR_INTERRUPTS into
dofindsubquery, just in case it's still too slow for somebody.
Per report from Andreas Seltenreich. Back-patch to all supported branches.
Discussion: <8760oasf2y.fsf@credativ.de>
QTNTernary() contains logic to flatten, eg, '(a & b) & c' into 'a & b & c',
which is all well and good, but it tries to do that to NOT nodes as well,
so that '!!a' gets changed to '!a'. Explicitly restrict the conversion to
be done only on AND and OR nodes, and add a test case illustrating the bug.
In passing, provide some comments for the sadly naked functions in
tsquery_util.c, and simplify some baroque logic in QTNFree(), which
I think may have been leaking some items it intended to free.
Noted while investigating a complaint from Andreas Seltenreich.
Back-patch to all supported versions.
In the previous coding, if an aggregate's transition function returned an
expanded array, nodeAgg.c and nodeWindowAgg.c would always copy it and thus
force it into the flat representation. This led to ping-ponging between
flat and expanded formats, which costs a lot. For an aggregate using
array_append as transition function, I measured about a 15X slowdown
compared to the pre-9.5 code, when working on simple int[] arrays.
Of course, the old code was already O(N^2) in this usage due to copying
flat arrays all the time, but it wasn't quite this inefficient.
To fix, teach nodeAgg.c and nodeWindowAgg.c to allow expanded transition
values without copying, so long as the transition function takes care to
return the transition value already properly parented under the aggcontext.
That puts a bit of extra responsibility on the transition function, but
doing it this way allows us to not need any extra logic in the fast path
of advance_transition_function (ie, with a pass-by-value transition value,
or with a modified-in-place pass-by-reference value). We already know
that that's a hot spot so I'm loath to add any cycles at all there. Also,
while only array_append currently knows how to follow this convention,
this solution allows other transition functions to opt-in without needing
to have a whitelist in the core aggregation code.
(The reason we would need a whitelist is that currently, if you pass a
R/W expanded-object pointer to an arbitrary function, it's allowed to do
anything with it including deleting it; that breaks the core agg code's
assumption that it should free discarded values. Returning a value under
aggcontext is the transition function's signal that it knows it is an
aggregate transition function and will play nice. Possibly the API rules
for expanded objects should be refined, but that would not be a
back-patchable change.)
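In outline, an opted-in transition function does something like this
(sketch; the state-building helper is hypothetical):

    Datum
    my_transfn(PG_FUNCTION_ARGS)
    {
        MemoryContext   aggcontext;
        Datum           result;

        if (!AggCheckCallContext(fcinfo, &aggcontext))
            elog(ERROR, "my_transfn called in non-aggregate context");

        /* Build or update an expanded object whose memory context is a
         * child of aggcontext; returning its R/W pointer while parented
         * under aggcontext is the opt-in signal described above. */
        result = update_expanded_state(fcinfo, aggcontext);  /* hypothetical */

        PG_RETURN_DATUM(result);
    }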
With this fix, an aggregate using array_append is no longer O(N^2), so it's
much faster than pre-9.5 code rather than much slower. It's still a bit
slower than the bespoke infrastructure for array_agg, but the differential
seems to be only about 10%-20% rather than orders of magnitude.
Discussion: <6315.1477677885@sss.pgh.pa.us>
This bug exists as far back as 9.0, when Hot Standby was introduced,
so back-patch to all supported branches.
Report and patch by Takayuki Tsunakawa, reviewed by Michael Paquier
and Kuntal Ghosh.
If a restartpoint flushed no dirty buffers, it could fail to update
the minimum recovery point, leading to a minimum recovery point prior
to the starting REDO location. perform_base_backup() would interpret
that as meaning that no WAL files at all needed to be included in the
backup, failing an internal sanity check. To fix, have restartpoints
always update the minimum recovery point to just after the checkpoint
record itself, so that the file (or files) containing the checkpoint
record will always be included in the backup.
Code by Amit Kapila, per a design suggestion by me, with some
additional work on the code comment by me. Test case by Michael
Paquier. Report by Kyotaro Horiguchi.
The code to change the deferrability properties of a foreign-key constraint
updated all the associated triggers to match; but a moment's examination of
the code that creates those triggers in the first place shows that only
some of them should track the constraint's deferrability properties. This
leads to odd failures in subsequent exercise of the foreign key, as the
triggers are fired at the wrong times. Fix that, and add a regression test
comparing the trigger properties produced by ALTER CONSTRAINT with those
you get by creating the constraint as-intended to begin with.
Per report from James Parks. Back-patch to 9.4 where this ALTER
functionality was introduced.
Report: <CAJ3Xv+jzJ8iNNUcp4RKW8b6Qp1xVAxHwSXVpjBNygjKxcVuE9w@mail.gmail.com>
An oversight in setting the boundaries of known commit timestamps during
startup caused old commit timestamps to become inaccessible after a
server restart.
Author and reporter: Julien Rouhaud
Review, test code: Craig Ringer
INSERT ... ON CONFLICT (specifically ExecCheckHeapTupleVisible) contains
another example of this unsafe coding practice. It is much harder to get
a failure out of it than the case fixed in commit 6292c2339, because in
most scenarios any hint bits that could be set would have already been set
earlier in the command. However, Konstantin Knizhnik reported a failure
with a custom transaction manager, and it's clearly possible to get a
failure via a race condition in async-commit mode.
For lack of a reproducible example, no regression test case in this
commit.
I did some testing with Asserts added to tqual.c's functions, and can say
that running "make check-world" exposed these two bugs and no others.
The Asserts are messy enough that I've not added them to the code for now.
Report: <57EE93C8.8080504@postgrespro.ru>
Related-Discussion: <CAO3NbwOycQjt2Oqy2VW-eLTq2M5uGMyHnGm=RNga4mjqcYD7gQ@mail.gmail.com>
A transaction that conflicts against itself, for example
INSERT INTO t(pk) VALUES (1),(1) ON CONFLICT DO NOTHING;
should behave the same regardless of isolation level. It certainly
shouldn't throw a serialization error, as retrying will not help.
We got this wrong due to the ON CONFLICT logic not considering the case,
as reported by Jason Dusek.
Core of this patch is by Peter Geoghegan (based on an earlier patch by
Thomas Munro), though I didn't take his proposed code refactoring for fear
that it might have unexpected side-effects. Test cases by Thomas Munro
and myself.
Report: <CAO3NbwOycQjt2Oqy2VW-eLTq2M5uGMyHnGm=RNga4mjqcYD7gQ@mail.gmail.com>
Related-Discussion: <57EE93C8.8080504@postgrespro.ru>
Despite the argumentation I wrote in commit 7a2fe85b0, it's unsafe to do
this, because in corner cases it's possible for HeapTupleSatisfiesSelf
to try to set hint bits on the target tuple; and at least since 8.2 we
have required the buffer content lock to be held while setting hint bits.
The added regression test exercises one such corner case. Unpatched, it
causes an assertion failure in assert-enabled builds, or otherwise would
cause a hint bit change in a buffer we don't hold lock on, which given
the right race condition could result in checksum failures or other data
consistency problems. The odds of a problem in the field are probably
pretty small, but nonetheless back-patch to all supported branches.
Report: <19391.1477244876@sss.pgh.pa.us>
Now that the upper planner uses paths, and now that we have proper hooks
to inject paths into the upper planning process, it's possible for
foreign data wrappers to arrange to push aggregates to the remote side
instead of fetching all of the rows and aggregating them locally. This
figures to be a massive win for performance, so teach postgres_fdw to
do it.
Jeevan Chalke and Ashutosh Bapat. Reviewed by Ashutosh Bapat with
additional testing by Prabhat Sahu. Various mostly cosmetic changes
by me.
With track_io_timing = on, EXPLAIN (ANALYZE, BUFFERS) will emit fields
named like "I/O Read Time". The slash makes that invalid as an XML
element name, so that adding FORMAT XML would produce invalid XML.
We already have code in there to translate spaces to dashes, so let's
generalize that to convert anything that isn't a valid XML name character,
viz letters, digits, hyphens, underscores, and periods. We could just
reject slashes, which would run a bit faster. But the fact that this went
unnoticed for so long doesn't give me a warm feeling that we'd notice the
next creative violation, so let's make it a permanent fix.
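The mapping rule amounts to this (illustrative sketch, not the exact
code):

    static char
    xml_name_char(char c)
    {
        if ((c >= 'a' && c <= 'z') ||
            (c >= 'A' && c <= 'Z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '_' || c == '.')
            return c;
        return '-';             /* replace anything else, e.g. '/' */
    }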
Reported by Markus Winand, though this isn't his initial patch proposal.
Back-patch to 9.2 where track_io_timing was added. The problem is only
latent in 9.1, so I don't feel a need to fix it there.
Discussion: <E0BF6A45-68E8-45E6-918F-741FB332C6BB@winand.at>
"xlog" is not a particularly clear abbreviation for "write-ahead log",
and it sometimes confuses users into believing that the contents of the
"pg_xlog" directory are not critical data, leading to unpleasant
consequences. So, rename the directory to "pg_wal".
This patch modifies pg_upgrade and pg_basebackup to understand both
the old and new directory layouts; the former is necessary given the
purpose of the tool, while the latter merely avoids an unnecessary
backward-compatibility break.
We may wish to consider renaming other programs, switches, and
functions which still use the old "xlog" naming to also refer to
"wal". However, that's still under discussion, so let's do just this
much for now.
Discussion: <CAB7nPqTeC-8+zux8_-4ZD46V7YPwooeFxgndfsq5Rg8ibLVm1A@mail.gmail.com>
Michael Paquier
IANA got rid of the really silly "abbreviation" and replaced it with one
that's only moderately silly. But it's still pointless, so keep on not
showing it.
When a relation is truncated, it is important that the FSM is truncated as
well. Otherwise, after recovery, the FSM can return a page that has been
truncated away, leading to errors like:
ERROR: could not read block 28991 in file "base/16390/572026": read only 0 of 8192 bytes
We were using MarkBufferDirtyHint() to dirty the buffer holding the last
remaining page of the FSM, but during recovery, that might in fact not
dirty the page, and the FSM update might be lost.
To fix, use the stronger MarkBufferDirty() function. MarkBufferDirty()
requires us to do WAL-logging ourselves, to protect from a torn page, if
checksumming is enabled.
Also fix an oversight in visibilitymap_truncate: it also needs to WAL-log
when checksumming is enabled.
Analysis by Pavan Deolasee.
Discussion: <CABOikdNr5vKucqyZH9s1Mh0XebLs_jRhKv6eJfNnD2wxTn=_9A@mail.gmail.com>
CommandId is declared as uint32, and values up to 4G are indeed legal.
cidout() handles them properly by treating the value as unsigned int.
But cidin() was just using atoi(), which has platform-dependent behavior
for values outside the range of signed int, as reported by Bart Lengkeek
in bug #14379. Use strtoul() instead, as xidin() does.
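The essence of the fix, in sketch form:

    char       *s = PG_GETARG_CSTRING(0);
    CommandId   c;

    c = (CommandId) strtoul(s, NULL, 10);   /* was: c = atoi(s); */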
In passing, make some purely cosmetic changes to make xidin/xidout
look more like cidin/cidout; the former didn't have a monopoly on
best practice IMO.
Neither xidin nor cidin makes any attempt to throw error for invalid input.
I didn't change that here, and am not sure it's worth worrying about
since neither is really a user-facing type. The point is just to ensure
that indubitably-valid inputs work as expected.
It's been like this for a long time, so back-patch to all supported
branches.
Report: <20161018152550.1413.6439@wrigleys.postgresql.org>
This reverts commit 9e083fd468. That was a
few bricks shy of a load:
* Query cancel stopped working
* Buildfarm member pademelon stopped working, because the box has neither
/dev/urandom nor /dev/random.
This clearly needs some more discussion, and a quite different patch, so
revert for now.
This value might not be to everyone's taste; in particular, some
people might prefer %t to %m, and others may want %u, %d, or other
fields. However, it's a vast improvement on the old default of ''.
Christoph Berg
Have tuplesort_gettupleslot() copy the contents of its current table slot
as needed. This is based on an approach taken by tuplestore_gettupleslot().
In the future, tuplesort_gettupleslot() may also be taught to avoid copying
the tuple where caller can determine that that is safe (the
tuplestore_gettupleslot() interface already offers this option to callers).
Patch by Peter Geoghegan. Fixes bug #14344, reported by Regina Obe.
Report: <20160929035538.20224.39628@wrigleys.postgresql.org>
Backpatch-through: 9.6
This adds a new routine, pg_strong_random() for generating random bytes,
for use in both frontend and backend. At the moment, it's only used in
the backend, but the upcoming SCRAM authentication patches need strong
random numbers in libpq as well.
pg_strong_random() is based on, and replaces, the existing implementation
in pgcrypto. It can acquire strong random numbers from a number of sources,
depending on what's available:
- OpenSSL RAND_bytes(), if built with OpenSSL
- On Windows, the native cryptographic functions are used
- /dev/urandom
- /dev/random
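A backend call site is then simply (sketch; error wording illustrative):

    char    rand_bytes[16];

    if (!pg_strong_random(rand_bytes, sizeof(rand_bytes)))
        ereport(ERROR,
                (errmsg("could not generate random bytes")));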
Original patch by Magnus Hagander, with further work by Michael Paquier
and me.
Discussion: <CAB7nPqRy3krN8quR9XujMVVHYtXJ0_60nqgVc6oUk8ygyVkZsA@mail.gmail.com>
The more efficient hashtable speeds up hash-aggregations with more than
a few hundred groups significantly. Improvements of over 120% have been
measured.
Due to the different hash table, queries whose result order is not fully
determined (e.g. GROUP BY without ORDER BY) may change their result
order.
The conversion is largely straightforward, except that, due to the
static element types of simplehash.h hashes, the additional data
some users store in elements (e.g. the per-group working data for hash
aggregation) is now stored in TupleHashEntryData->additional. The
meaning of BuildTupleHashTable's entrysize (renamed to additionalsize)
has been changed to only be about the additionally stored size. That
size is only used for the initial sizing of the hash-table.
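For reference, the simplehash.h usage pattern looks like this
(hypothetical names; the element struct must carry the "status" member
that simplehash uses for its open-addressing bookkeeping):

    typedef struct MyEntry
    {
        uint32      key;        /* hash key */
        char        status;     /* required by simplehash.h */
        int         count;      /* per-element "additional" data */
    } MyEntry;

    #define SH_PREFIX        myhash
    #define SH_ELEMENT_TYPE  MyEntry
    #define SH_KEY_TYPE      uint32
    #define SH_KEY           key
    #define SH_HASH_KEY(tb, key)    ((uint32) ((key) * 2654435761U))
    #define SH_EQUAL(tb, a, b)      ((a) == (b))
    #define SH_SCOPE         static inline
    #define SH_DECLARE
    #define SH_DEFINE
    #include "lib/simplehash.h"

This generates myhash_create, myhash_insert, myhash_lookup,
myhash_delete, and iteration support, all specialized to MyEntry.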
Reviewed-By: Tomas Vondra
Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
Use the new simplehash.h to speed up tidbitmap.c uses. For bitmap scan
heavy queries speedups of over 100% have been measured. Both lossy and
exact scans benefit, but the wins are bigger for mostly exact scans.
The conversion is mostly trivial, except that tbm_lossify() now restarts
lossifying at the point it previously stopped. Otherwise the hash table
becomes unbalanced because the scan is done in hash-order, leaving the
end of the hashtable more densely filled than the beginning. That caused
performance issues with dynahash as well, but due to the open chaining
they were less pronounced than with the linear addressing from
simplehash.h.
Reviewed-By: Tomas Vondra
Discussion: <20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de>
bitshiftright() and bitshiftleft() would recursively call each other
infinitely if the user passed INT_MIN for the shift amount, due to integer
overflow in negating the shift amount. To fix, clamp to -VARBITMAXLEN.
That doesn't change the results since any shift distance larger than the
input bit string's length produces an all-zeroes result.
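The fix amounts to clamping before negating (sketch of the
bitshiftright() path):

    /* Negative shift is a left shift */
    if (shft < 0)
    {
        /* clamp before negating, else INT_MIN overflows */
        if (shft < -VARBITMAXLEN)
            shft = -VARBITMAXLEN;
        return DirectFunctionCall2(bitshiftleft,
                                   VarBitPGetDatum(arg),
                                   Int32GetDatum(-shft));
    }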
Also fix some places that seemed inadequately paranoid about input typmods
exceeding VARBITMAXLEN. While a typmod accepted by anybit_typmodin() will
certainly be much less than that, at least some of these spots are
reachable with user-chosen integer values.
Andreas Seltenreich and Tom Lane
Discussion: <87d1j2zqtz.fsf@credativ.de>
pgstat_twophase_postcommit is supposed to duplicate the math in
AtEOXact_PgStat, but it had missed out the bit about clearing
t_delta_live_tuples/t_delta_dead_tuples for a TRUNCATE.
It's harder than you might think to replicate the issue here, because
those counters would only be nonzero when a previous transaction in
the same backend had added/deleted tuples in the truncated table,
and those counts hadn't been sent to the stats collector yet.
Evident oversight in commit d42358efb. I've not added a regression
test for this; we tried to add one in d42358efb, and had to revert it
because it was too timing-sensitive for the buildfarm.
Back-patch to 9.5 where d42358efb came in.
Stas Kelvich
Discussion: <EB57BF68-C06D-4737-BDDC-4BA778F4E62B@postgrespro.ru>
It's not good for an inherited child constraint to be marked connoinherit;
that would result in the constraint not propagating to grandchild tables,
if any are created later. The code mostly prevented this from happening
but there was one case that was missed.
This is somewhat related to commit e55a946a8, which also tightened checks
on constraint merging. Hence, back-patch to 9.2 like that one. This isn't
so much because there's a concrete feature-related reason to stop there,
as to avoid having more distinct behaviors than we have to in this area.
Amit Langote
Discussion: <b28ee774-7009-313d-dd55-5bdd81242c41@lab.ntt.co.jp>
Even if Linux's mmap() is okay with a partial-hugepage request, munmap()
is not, as reported by Chris Richards. Therefore it behooves us to try
a bit harder to find out the actual hugepage size, instead of assuming
that we can skate by with a guess.
For the moment, just look into /proc/meminfo to find out the default
hugepage size, and use that. Later, on kernels that support requests
for nondefault sizes, we might try to consider other alternatives.
But that smells more like a new feature than a bug fix, especially if
we want to provide any way for the DBA to control it, so leave it for
another day.
I set this up to allow easy addition of platform-specific code for
non-Linux platforms, if needed; but right now there are no reports
suggesting that we need to work harder on other platforms.
Back-patch to 9.4 where hugepage support was introduced.
Discussion: <31056.1476303954@sss.pgh.pa.us>
Fix detaching of the mmap'd segment to have its own on_shmem_exit callback,
rather than piggybacking on the one for detaching from the SysV segment.
That was confusing, and given the distance between the two attach calls,
it was trouble waiting to happen.
Make the detaching calls idempotent by clearing AnonymousShmem to show
we've already unmapped. I spent quite a bit of time yesterday trying
to find a path that would allow the munmap()'s to be done twice, and
while I did not succeed, it seems silly that there's even a question.
Make the #ifdef logic less confusing by separating "do we want to use
anonymous shmem" from EXEC_BACKEND. Even though there's no current
scenario where those conditions are different, it is not helpful for
different places in the same file to be testing EXEC_BACKEND for what
are fundamentally different reasons.
Don't do on_exit_reset() in StartBackgroundWorker(). At best that's
useless (InitPostmasterChild would have done it already) and at worst
it could zap some callback that's unrelated to shared memory.
Improve comments, and simplify the huge_pages enablement logic slightly.
Back-patch to 9.4 where hugepage support was introduced.
Arguably this should go into 9.3 as well, but the code looks
significantly different there, and I doubt it's worth the
trouble of adapting the patch given I can't show a live bug.
Commit 0b62fd036 did a fairly sloppy job of refactoring setPath()
to support jsonb_insert() along with jsonb_set(). In its defense,
though, there was no regression test case exercising the case of
replacing an existing element in a jsonb array.
Per bug #14366 from Peng Sun. Back-patch to 9.6 where bug was introduced.
Report: <20161012065349.1412.47858@wrigleys.postgresql.org>
These functions were originally added in commit d8cedf67a to support
use of int2vector columns as catcache lookup keys. However, there are
no catcaches that use such columns. (Indeed I now think it must always
have been dead code: a catcache with such a key column would need an
underlying unique index on the column, but we've never had an int2vector
btree opclass.)
Getting rid of the int2vector-specific operator and function does not
lose any functionality, because operations on int2vectors will now fall
back to the generic anyarray support. This avoids a wart that a btree
index on an int2vector column (made using anyarray_ops) would fail to
match equality searches, because int2vectoreq wasn't a member of the
opclass. We don't really care much about that, since int2vector is not
meant as a type for users to use, but it's silly to have extra code and
less functionality.
If we ever do want a catcache to be indexed by an int2vector column,
we'd need to put back full btree and hash opclasses for int2vector,
comparable to the support for oidvector. (The anyarray code can't be
used at such a low level, because it needs to do catcache lookups.)
But we'll deal with that if/when the need arises.
Also worth noting is that removal of the hash int2vector_ops opclass will
break any user-created hash indexes on int2vector columns. While hash
anyarray_ops would serve the same purpose, it would probably not compute
the same hash values and thus wouldn't be on-disk-compatible. Given that
int2vector isn't a user-facing type and we're planning other incompatible
changes in hash indexes for v10 anyway, this doesn't seem like something
to worry about, but it's probably worth mentioning here.
Amit Langote
Discussion: <d9bb74f8-b194-7307-9ebd-90645d377e45@lab.ntt.co.jp>
Pass the buffer size as argument to LogicalTapeRewindForRead, rather than
setting it earlier with the separate LogicTapeAssignReadBufferSize call.
This way, the buffer size is set closer to where it's actually used, which
makes the code easier to understand.
This makes the calculation for how much memory to use for the buffers less
precise. We now use the same amount of memory for every tape, rounded down
to the nearest BLCKSZ boundary, instead of using one more block for some
tapes, to get the total up to the exact amount of memory available. That
should be OK; merging isn't too sensitive to the exact amount of memory used.
Reviewed by Peter Geoghegan
Discussion: <0f607c4b-df23-353e-bf56-c0389d28495f@iki.fi>
While this isn't a lot of code, it's been essentially untestable for
a very long time, because libpq doesn't support anything older than
protocol 2.0, and has not since release 6.3. There's no reason to
believe any other client-side code still uses that protocol, either.
Discussion: <2661.1475849167@sss.pgh.pa.us>
SCO OpenServer and SCO UnixWare are more or less dead platforms.
We have never had a buildfarm member testing the "sco" port, and
the last "unixware" member was last heard from in 2012, so it's
fair to doubt that the code even compiles anymore on either one.
Remove both ports. We can always undo this if someone shows up
with an interest in maintaining and testing these platforms.
Discussion: <17177.1476136994@sss.pgh.pa.us>
The transition function was told that its first argument and result were
of the window function output type, not the aggregate state type.
This'd only matter if the transition function consults get_fn_expr_argtype,
which typically only polymorphic functions would do.
Although we have several regression tests around polymorphic aggs,
none of them detected this mistake --- in fact, they still didn't
fail when I injected the same mistake into nodeAgg.c. So add some
more tests covering both plain agg and window-function-agg cases.
Per report from Sebastian Luque. Back-patch to 9.6 where the error
was introduced (by sloppy refactoring in commit 804163bc2, looks like).
Report: <87int2qkat.fsf@gmail.com>
Historically, we've allowed users to add a CHECK constraint to a child
table and then add an identical CHECK constraint to the parent. This
results in "merging" the two constraints so that the pre-existing
child constraint ends up with both conislocal = true and coninhcount > 0.
However, if you tried to do it in the other order, you got a duplicate
constraint error. This is problematic for pg_dump, which needs to issue
separate ADD CONSTRAINT commands in some cases, but has no good way to
ensure that the constraints will be added in the required order.
And it's more than a bit arbitrary, too. The goal of complaining about
duplicated ADD CONSTRAINT commands can be served if we reject the case of
adding a constraint when the existing one already has conislocal = true;
but if it has conislocal = false, let's just make the ADD CONSTRAINT set
conislocal = true. In this way, either order of adding the constraints
has the same end result.
Another problem was that the code allowed creation of a parent constraint
marked convalidated that is merged with a child constraint that is
!convalidated. In this case, an inheritance scan of the parent table could
emit some rows violating the constraint condition, which would be an
unexpected result given the marking of the parent constraint as validated.
Hence, forbid merging of constraints in this case. (Note: valid child and
not-valid parent seems fine, so continue to allow that.)
Per report from Benedikt Grundmann. Back-patch to 9.2 where we introduced
possibly-not-valid check constraints. The second bug obviously doesn't
apply before that, and I think the first doesn't either, because pg_dump
only gets into this situation when dealing with not-valid constraints.
Report: <CADbMkNPT-Jz5PRSQ4RbUASYAjocV_KHUWapR%2Bg8fNvhUAyRpxA%40mail.gmail.com>
Discussion: <22108.1475874586@sss.pgh.pa.us>
The need for this was previously obscured even on picky platforms
by the hack we used to support direct cross-module references in
the transforms contrib modules. Now that that hack is gone, the
undefined symbol is exposed, as reported by Robert Haas.
Back-patch to 9.5 where we started to use -Wl,-undefined,dynamic_lookup.
I'm a bit surprised that the older branches don't seem to contain
any gettext references in this module, but since they don't fail
at build time, they must not. (We might be able to get away with
leaving this alone in 9.5/9.6, but I think it's cleaner if the
reference gets resolved at link time.)
Report: <CA+TgmoaHJKU5kcWZcYduATYVT7Mnx+8jUnycaYYL7OtCwCigug@mail.gmail.com>
I somehow had assumed that, in the spinlock-based (in turn possibly
semaphore-based) fallback atomics implementation, 32 bit writes could be
done without a lock. As far as the write itself goes that's correct, since
postgres supports only platforms with single-copy atomicity for aligned
32bit writes. But writing without holding the spinlock breaks
read-modify-write operations like pg_atomic_compare_exchange_u32(),
since they'll potentially "miss" a concurrent write, which can't happen
in actual hardware implementations.
In 9.6+ when using the fallback atomics implementation this could lead
to buffer header locks not being properly marked as released, and
potentially some related state corruption. I don't see a related danger
in 9.5 (earliest release with the API), because pg_atomic_write_u32()
wasn't used in a concurrent manner there.
Before this change, the state variable of local buffers was manipulated
using pg_atomic_write_u32(), to avoid unnecessary synchronization
overhead. As that'd not be possible anymore, introduce and use
pg_atomic_unlocked_write_u32(), which performs the unlocked write and is
therefore correct only where no concurrent RMW operations can occur.
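For clarity, a sketch of the two fallback variants (assuming the
fallback struct's sema/value fields; not the exact source):

    void
    pg_atomic_write_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 val)
    {
        /* hold the spinlock so a concurrent compare-exchange cannot
         * slip in between its read and write and miss this store */
        SpinLockAcquire((slock_t *) &ptr->sema);
        ptr->value = val;
        SpinLockRelease((slock_t *) &ptr->sema);
    }

    void
    pg_atomic_unlocked_write_u32_impl(volatile pg_atomic_uint32 *ptr,
                                      uint32 val)
    {
        /* plain store: correct only when no concurrent RMW operation
         * can target this variable, as for local-buffer state */
        ptr->value = val;
    }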
This bug only caused issues when postgres is compiled on platforms
without atomics support (i.e. no common new platform), or when compiled
with --disable-atomics, which explains why this wasn't noticed in
testing.
Reported-By: Tom Lane
Discussion: <14947.1475690465@sss.pgh.pa.us>
Backpatch: 9.5-, where the atomic operations API was introduced.
0xc19c is not a valid UTF-8 byte sequence. It doesn't do any harm, AFAICS,
but it's surely not intentional. No backpatching though, just to be sure.
In passing, also add a file header comment to the file, like the
UCS_to_SJIS.pl script would produce. (The file was originally created with
UCS_to_SJIS.pl, but has been modified by hand since then. That's
questionable, but I'll leave fixing that for later.)
Kyotaro Horiguchi
Discussion: <20160907.155050.233844095.horiguchi.kyotaro@lab.ntt.co.jp>
LogicalTapeRewind() should not allocate a large read buffer if the tape
is completely empty. The calling code relies on that for its
calculation of how much memory to allocate for the read buffers. That
led to massive overallocation of memory if maxTapes was high but
only a few tapes were actually used.
Reported by Tomas Vondra
Discussion: <7303da46-daf7-9c68-3cc1-9f83235cf37e@2ndquadrant.com>
Windows apparently has a constant named WAIT_TIMEOUT, and some of these
other names are pretty generic, too. Insert "PG_" at the front of each
name in order to disambiguate.
Michael Paquier
WaitLatch, WaitLatchOrSocket, and WaitEventSetWait now take an
additional wait_event_info parameter; legal values are defined in
pgstat.h. This makes it possible to uniquely identify every point in
the core code where we are waiting for a latch; extensions can pass
WAIT_EXTENSION.
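For example, an extension waiting on its latch now writes (sketch):

    int rc;

    rc = WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                   10000L,              /* timeout in ms */
                   WAIT_EXTENSION);
    if (rc & WL_LATCH_SET)
        ResetLatch(MyLatch);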
Because latches were the major wait primitive not previously covered
by this patch, it is now possible to see information in
pg_stat_activity on a large number of important wait events not
previously addressed, such as ClientRead, ClientWrite, and SyncRep.
Unfortunately, many of the wait events added by this patch will fail
to appear in pg_stat_activity because they're only used in background
processes which don't currently appear in pg_stat_activity. We should
fix this either by creating a separate view for such information, or
else by deciding to include them in pg_stat_activity after all.
Michael Paquier and Robert Haas, reviewed by Alexander Korotkov and
Thomas Munro.
Before initializing iteration over a subtransaction's changes, the last
few changes were not spilled to disk. That's correct if the transaction
didn't spill to disk, but otherwise... This bug can lead to missed or
misordered subtransaction contents when they were spilled to disk.
Move spilling of the remaining in-memory changes to
ReorderBufferIterTXNInit(), where it can easily be applied to the top
transaction and, if present, subtransactions.
Since this code had too many bugs already, noticeably increase test
coverage.
Fixes: #14319
Reported-By: Huan Ruan
Discussion: <20160909012610.20024.58169@wrigleys.postgresql.org>
Backpatch: 9.4-, where logical decoding was added
Commit 88e982302 invented GUC_UNIT_XSEGS for min_wal_size and max_wal_size,
but neglected to make it display sensibly in pg_settings.unit (by adding a
case to the switch in GetConfigOptionByNum). Fix that, and adjust said
switch to throw a run-time error the next time somebody forgets.
In passing, avoid using a static buffer for the output string --- the rest
of this function pstrdup's from a local buffer, and I see no very good
reason why the units code should do it differently and less safely.
Per report from Otar Shavadze. Back-patch to 9.5 where the new unit type
was added.
Report: <CAG-jOyA=iNFhN+yB4vfvqh688B7Tr5SArbYcFUAjZi=0Exp-Lg@mail.gmail.com>
Attempting to COPY a subset of columns from a table with RLS enabled
would fail due to an invalid query being constructed (using a single
ColumnRef with the list of fields to extract in 'fields', but that's for
the different levels of indirection for a single column, not for
specifying multiple columns).
Correct by building a ColumnRef and then a ResTarget for each column
being requested and then adding those to the targetList for the select
query. Include regression tests to hopefully catch if this is broken
again in the future.
Patch-By: Adam Brightwell
Reviewed-By: Michael Paquier
Don't pre-read tuples into SortTuple slots during merge. Instead, use the
memory for larger read buffers in logtape.c. We're doing the same number
of READTUP() calls either way, but managing the pre-read SortTuple slots
is much more complicated. Also, the on-tape representation is more compact
than SortTuples, so we can fit more pre-read tuples into the same amount
of memory this way. And we have better cache-locality, when we use just a
small number of SortTuple slots.
Now that we only hold one tuple from each tape in the SortTuple slots, we
can greatly simplify the "batch memory" management. We now maintain a
small set of fixed-sized slots, to hold the tuples, and fall back to
palloc() for larger tuples. We use this method during all merge phases,
not just the final merge, and also when randomAccess is requested, and
also in the TSS_SORTEDONTAPE case. In other words, it's used whenever we
do an external sort.
Reviewed by Peter Geoghegan and Claudio Freire.
Discussion: <CAM3SWZTpaORV=yQGVCG8Q4axcZ3MvF-05xe39ZvORdU9JcD6hQ@mail.gmail.com>
Without this, an extension containing an access method is not properly
dumped/restored during pg_upgrade --- the AM ends up not being a member
of the extension after upgrading.
Another oversight in commit 473b93287, reported by Andrew Dunstan.
Report: <f7ac29f3-515c-2a44-21c5-ec925053265f@dunslane.net>
In standard Unix builds, postmaster child processes do ClosePostmasterPorts
immediately after InitPostmasterChild, that is almost immediately after
being spawned. This is important because we don't want children holding
open the postmaster's end of the postmaster death watch pipe.
However, in EXEC_BACKEND builds, SubPostmasterMain was postponing this
responsibility significantly, in order to make it slightly more convenient
to pass the right flag value to ClosePostmasterPorts. This is bad,
particularly seeing that process_shared_preload_libraries() might invoke
nearly-arbitrary code. Rearrange so that we do it as soon as we've
fetched the socket FDs via read_backend_variables().
Also move the comment explaining about randomize_va_space to before the
call of PGSharedMemoryReAttach, which is where it's relevant. The old
placement was appropriate when the reattach happened inside
CreateSharedMemoryAndSemaphores, but that was a long time ago.
Back-patch to 9.3; the patch doesn't apply cleanly before that, and
it doesn't seem worth a lot of effort given that we've had no actual
field complaints traceable to this.
Discussion: <4157.1475178360@sss.pgh.pa.us>
The list of files and directories that pg_basebackup excludes from the
backup was somewhat incomplete and unorganized. Change that by driving
the exclusions from tables. Clean up some code around it. Also
document the exclusions in more detail so that users of pg_start_backup
can make use of it as well.
The contents of these directories are now excluded from the backup:
pg_dynshmem, pg_notify, pg_serial, pg_snapshots, pg_subtrans
Also fix a bug where pg_replslot or pg_stat_tmp being a symlink would
cause a corrupt tar header to be created. Now such symlinks are
included in the backup as empty directories. Bug found by Ashutosh
Sharma <ashu.coek88@gmail.com>.
From: David Steele <david@pgmasters.net>
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Add a validity flag to DCHCacheEntry and NUMCacheEntry entries, and
do not set it true until after we've parsed the supplied format string.
This allows dealing with possible errors while parsing the format
without the baroque hack that was there before (which only covered
errors within NUMDesc_prepare, anyway). We can get rid of the PG_TRY in
NUMDesc_prepare, as well as last_NUMCacheEntry and NUM_cache_remove.
(Essentially, this reverts commit ff783fbae in favor of a less fragile
solution; the problems with that approach are well illustrated by later
hacking such as 55f927a46.)
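The resulting pattern, in sketch form (simplified signatures):

    entry = DCH_cache_getnew(fmt_str);  /* returned with valid = false */
    parse_format(entry, fmt_str);       /* may ereport partway through */
    entry->valid = true;                /* only now may lookups reuse it */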
In passing, define the size of these caches as DCH_CACHE_ENTRIES not
DCH_CACHE_FIELDS + 1 (whoever thought that was a good definition?)
and likewise for the NUM cache. Also const-ify format string parameters
where convenient, and merge duplicated cache lookup logic.
This is primarily driven by a proposed patch from Artur Zakirov,
which introduced some ereport's into format string parsing for
the datetime case. He proposed preventing the creation of invalid
cache entries by parsing the format string first into a local-variable
array, and then copying that to a cache entry. That seemed a bit
ugly to me, and anyway randomly different from the way the identical
problem had been solved for the numeric case. Let's make the two
sets of code more similar not less so.
I'm not sure whether we'll adopt the new error conditions Artur proposes,
but this patch seems like good code cleanup and future-proofing in any
case. The existing code is critically (and undocumented-ly) dependent on
no elog being thrown out of several nontrivial functions, which is trouble
waiting to happen, though it doesn't seem to be actively broken today.
Discussion: <b2a39359-3282-b402-f4a3-057aae500ee7@postgrespro.ru>
Historically, something like to_date('2009-06-40','YYYY-MM-DD') would
return '2009-07-10' because there was no prohibition on out-of-range
month or day numbers. This has been widely panned, and it also turns
out that Oracle throws an error in such cases. Since these functions
are nominally Oracle-compatibility features, let's change that.
There's no particular restriction on year (modulo the fact that the
scanner may not believe that more than 4 digits are year digits,
a matter to be addressed separately if at all). But we now check month,
day, hour, minute, second, and fractional-second fields, as well as
day-of-year and second-of-day fields if those are used.
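The checks are of this general form (sketch; function name and message
wording hypothetical):

    static void
    from_char_check_range(int value, int min, int max, const char *field)
    {
        if (value < min || value > max)
            ereport(ERROR,
                    (errcode(ERRCODE_DATETIME_FIELD_OVERFLOW),
                     errmsg("value %d for \"%s\" must be in the range %d..%d",
                            value, field, min, max)));
    }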
Currently, no checks are made on ISO-8601-style week numbers or day
numbers; it's not very clear what the appropriate rules would be there,
and they're probably so little used that it's not worth sweating over.
Artur Zakirov, reviewed by Amul Sul, further adjustments by me
Discussion: <1873520224.1784572.1465833145330.JavaMail.yahoo@mail.yahoo.com>
See-Also: <57786490.9010201@wars-nicht.de>
The previous patch broke this by returning NULL for a failed CRC check,
which pg_controldata would then try to read. Fix by returning the
result of the CRC check in a separate argument.
Michael Paquier and myself
Commit 3fe3511d05 introduced a new
case into this function, but neglected to ensure that the "ondisk"
pointer got updated after a possible reallocation as the code does
in other cases.
Stas Kelvich, per diagnosis by Konstantin Knizhnik.
This makes the parameter easier to extend, to support other password-based
authentication protocols than MD5. (SCRAM is being worked on.)
The GUC still accepts on/off as aliases for "md5" and "plain", although
we may want to remove those once we actually add support for another
password hash type.
Michael Paquier, reviewed by David Steele, with some further edits by me.
Discussion: <CAB7nPqSMXU35g=W9X74HVeQp0uvgJxvYOuA4A-A3M+0wfEBv-w@mail.gmail.com>
Pushing an upper-level restriction clause into an unflattened
subquery-in-FROM is okay when the subquery contains no SRFs in its
targetlist, or when it does but the SRFs are unreferenced by the clause
*and the clause is not volatile*. Otherwise, we're changing the number
of times the clause is evaluated, which is bad for volatile quals, and
possibly changing the result, since a volatile qual might succeed for some
SRF output rows and not others despite not referencing any of the changing
columns. (Indeed, if the clause is something like "random() > 0.5", the
user is probably expecting exactly that behavior.)
We had most of these restrictions down, but not the one about the upper
clause not being volatile. Fix that, and add a regression test to
illustrate the expected behavior.
Although this is definitely a bug, it doesn't seem like back-patch
material, since possibly some users don't realize that the broken
behavior is broken and are relying on what happens now. Also, while
the added test is quite cheap in the wake of commit a4c35ea1c, it would
be much more expensive (or else messier) in older branches.
Per report from Tom van Tilburg.
Discussion: <CAP3PPDiucxYCNev52=YPVkrQAPVF1C5PFWnrQPT7iMzO1fiKFQ@mail.gmail.com>
<sys/select.h> is required by POSIX.1-2001 to get the prototype of
select(2), but nearly no systems enforce that because older standards
let you get away with including some other headers. Recent OpenBSD
hacking has removed that frail touch of friendliness, however, which
broke some compiles; fix all the way back to 9.1 by adding the required
standard header. Only vacuumdb.c was reported to fail, but it seems easier
to fix the whole lot in one fell swoop.
Per bug #14334 by Sean Farrell.
We had thirty different GIN array opclasses sharing the same operators and
support functions. That still didn't cover all the built-in types, nor
did it cover arrays of extension-added types. What we want is a single
polymorphic opclass for "anyarray". There were two missing features needed
to make this possible:
1. We have to be able to declare the index storage type as ANYELEMENT
when the opclass is declared to index ANYARRAY. This just takes a few
more lines in index_create(). Although this currently seems of use only
for GIN, there's no reason to make index_create() restrict it to that.
2. We have to be able to identify the proper GIN compare function for
the index storage type. This patch proceeds by making the compare function
optional in GIN opclass definitions, and specifying that the default btree
comparison function for the index storage type will be looked up when the
opclass omits it. Again, that seems pretty generically useful.
Since the comparison function lookup is done in initGinState(), making
use of the second feature adds an additional cache lookup to GIN index
access setup. It seems unlikely that that would be very noticeable given
the other costs involved, but maybe at some point we should consider
making GinState data persist longer than it now does --- we could keep it
in the index relcache entry, perhaps.
Rather fortuitously, we don't seem to need to do anything to get this
change to play nice with dump/reload or pg_upgrade scenarios: the new
opclass definition is automatically selected to replace existing index
definitions, and the on-disk data remains compatible. Also, if a user has
created a custom opclass definition for a non-builtin type, this doesn't
break that, since CREATE INDEX will prefer an exact match to opcintype
over a match to ANYARRAY. However, if there's anyone out there with
handwritten DDL that explicitly specifies _bool_ops or one of the other
replaced opclass names, they'll need to adjust that.
Tom Lane, reviewed by Enrique Meneses
Discussion: <14436.1470940379@sss.pgh.pa.us>
We weren't terribly consistent about whether to call Apple's OS "OS X"
or "Mac OS X", and the former is probably confusing to people who aren't
Apple users. Now that Apple has rebranded it "macOS", follow their lead
to establish a consistent naming pattern. Also, avoid the use of the
ancient project name "Darwin", except as the port code name which does not
seem desirable to change. (In short, this patch touches documentation and
comments, but no actual code.)
I didn't touch contrib/start-scripts/osx/, either. I suspect those are
obsolete and due for a rewrite, anyway.
I dithered about whether to apply this edit to old release notes, but
those were responsible for quite a lot of the inconsistencies, so I ended
up changing them too. Anyway, Apple's being ahistorical about this,
so why shouldn't we be?
Apparent copy-and-pasteo in standby_desc_invalidations() had two
entries for msg->id == SHAREDINVALRELMAP_ID.
Aleksander Alekseev
Discussion: <20160923090814.GB1238@e733>
We must test GetLastError() even when CreateFileMapping() returns a
non-null handle. If that value were left over from some previous system
call, we might be fooled into thinking the segment already existed.
Experimentation on Windows 7 suggests that CreateFileMapping() clears
the error code on success, but it is not documented to do so, so let's
not rely on that happening in all Windows releases.
Amit Kapila
Discussion: <20811.1474390987@sss.pgh.pa.us>
Commits 470d886c3 et al intended to fix the problem that the postmaster
selected the same "random" DSM control segment ID on every start. But
using PostmasterRandom() for that destroys the intended property that the
delay between random_start_time and random_stop_time will be unpredictable.
(Said delay is probably already more predictable than we could wish, but
that doesn't mean that reducing it by a couple orders of magnitude is OK.)
Revert the previous patch and add a comment warning against misuse of
PostmasterRandom. Fix the original problem by calling srandom() early in
PostmasterMain, using a low-security seed that will later be overwritten
by PostmasterRandom.
Discussion: <20789.1474390434@sss.pgh.pa.us>
Past refactorings have removed all but one reference to SizeOfIptrData
(and that one place was in a pretty noncritical spot). Since nobody's
complained, it seems probable that there are no supported compilers
that don't think sizeof(ItemPointerData) is 6. If there are, we're
wasting MAXALIGN per heap tuple anyway, so it's rather silly to worry
about whether we can shave space in places like WAL records.
Pavan Deolasee
Discussion: <CABOikdOOawDda4hwLOT6zdA6MFfPLu3Z2YBZkX0JdayNS6JOeQ@mail.gmail.com>
ExecInitCteScan supposed that it didn't have to do anything to the extra
tuplestore read pointer it gets from tuplestore_alloc_read_pointer.
However, it needs this read pointer to be positioned at the start of the
tuplestore, while tuplestore_alloc_read_pointer is actually defined as
cloning the current position of read pointer 0. In normal situations
that accidentally works because we initialize the whole plan tree at once,
before anything gets read. But it fails in an EvalPlanQual recheck, as
illustrated in bug #14328 from Dima Pavlov. To fix, just forcibly rewind
the pointer after tuplestore_alloc_read_pointer. The cost of doing so is
negligible unless the tuplestore is already in TSS_READFILE state, which
wouldn't happen in normal cases. We could consider altering tuplestore's
API to make that case cheaper, but that would make for a more invasive
back-patch and it doesn't seem worth it.
This has been broken probably for as long as we've had CTEs, so back-patch
to all supported branches.
Discussion: <32468.1474548308@sss.pgh.pa.us>
Postpone updating the control file to "in production" status until
the point where WAL writes are allowed. Before, there could be a
significant gap between the control file update and write transactions
actually being allowed. This makes it more reliable to use the control
status to verify the end of a promotion.
From: Michael Paquier <michael.paquier@gmail.com>
pg_ctl used to determine whether a server was in standby mode by looking
for a recovery.conf file. With this change, it instead looks into
pg_control, which is potentially more accurate. There are also
occasional discussions about removing recovery.conf, so this removes one
dependency.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Otherwise, attempts to run multiple postmasters running on the same
machine may fail, because Windows sometimes returns ERROR_ACCESS_DENIED
rather than ERROR_ALREADY_EXISTS when there is an existing segment.
Hitting this bug is much more likely because of another defect not
fixed by this patch, namely that dsm_postmaster_startup() uses
random() which returns the same value every time. But that's not
a reason not to fix this.
Kyotaro Horiguchi and Amit Kapila, reviewed by Michael Paquier
Discussion: <CAA4eK1JyNdMeF-dgrpHozDecpDfsRZUtpCi+1AbtuEkfG3YooQ@mail.gmail.com>
The GiST search queue is implemented as a pairing heap rather than as a
Red-Black Tree, since 9.5 (commit e7032610). I neglected to update these comments
in that commit.
This function has no direct callers at present, but it's convenient for
manual use in a debugger, rather than having to inspect memory and do
bit-counting in your head.
In passing, get rid of useless outBitmapset() wrapper around
_outBitmapset(); let's just export the function that does the work.
Likewise for outToken().
Ashutosh Bapat, tweaked a bit by me
Discussion: <CAFjFpRdiht8e1HTVirbubr4YzaON5iZTzFJjq909y4sU8M_6eA@mail.gmail.com>