On closer investigation, commits f3ea3e3e8 et al were a few bricks
shy of a load. What we need is not so much to lock down the result
type of a FieldSelect, as to lock down the existence of the column
it's trying to extract. Otherwise, we can break it by dropping that
column. The dependency on the result type is then held indirectly
through the column, and doesn't need to be recorded explicitly.
Out of paranoia, I left in the code to record a dependency on the
result type, but it's used only if we can't identify the pg_class OID
for the column. That shouldn't ever happen right now, AFAICS, but
it seems possible that in future the input node could be marked as
being of type RECORD rather than some specific composite type.
Likewise for FieldStore.
Like the previous patch, back-patch to all supported branches.
Discussion: https://postgr.es/m/22571.1509064146@sss.pgh.pa.us
If we try to run a parallel plan in serial mode because, for example,
it's going to be scanned via a cursor, but for some reason we're
already in parallel mode (for example because an outer query is
running in parallel), we'd incorrectly try to launch workers.
Fix by adding a flag to the EState, so that we can be certain that
ExecutePlan() and ExecGather()/ExecGatherMerge() will have the same
idea about whether we are executing serially or in parallel.
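A minimal sketch of the idea, assuming a much-simplified model of the executor: the struct, its field, and both functions below are hypothetical stand-ins, not the actual EState or ExecGather() code. The point is only that the decision is made once and recorded, and the Gather-like node consults the recorded flag rather than the session-wide parallel-mode state.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for per-query executor state. */
typedef struct ExecState
{
    bool use_parallel_mode;     /* decided once, when execution starts */
} ExecState;

/* Decide once whether this particular execution may use workers. */
void
begin_execution(ExecState *estate, bool plan_is_parallel,
                bool must_run_serially)
{
    estate->use_parallel_mode = plan_is_parallel && !must_run_serially;
}

/* Number of workers a Gather-like node would try to launch. */
int
gather_workers_to_launch(const ExecState *estate, int planned_workers,
                         bool already_in_parallel_mode)
{
    /*
     * The session-wide state is deliberately ignored here: an outer query
     * running in parallel must not cause this plan to launch workers.
     */
    (void) already_in_parallel_mode;
    return estate->use_parallel_mode ? planned_workers : 0;
}
```

With this shape, a plan forced to run serially (say, under a cursor) launches no workers even when the backend is already in parallel mode.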
Report and fix by Amit Kapila with help from Kuntal Ghosh. A few
tweaks by me.
Discussion: http://postgr.es/m/CAA4eK1+_BuZrmVCeua5Eqnm4Co9DAXdM5HPAOE2J19ePbR912Q@mail.gmail.com
This is the last major omission in our domains feature: you can now
make a domain over anything that's not a pseudotype.
The major complication from an implementation standpoint is that places
that might be creating tuples of a domain type now need to be prepared
to apply domain_check(). It seems better that unprepared code fail
with an error like "<type> is not composite" than that it silently fail
to apply domain constraints. Therefore, relevant infrastructure like
get_func_result_type() and lookup_rowtype_tupdesc() has been adjusted
to treat domain-over-composite as a distinct case that unprepared code
won't recognize, rather than just transparently treating it the same
as plain composite. This isn't a 100% solution to the possibility of
overlooked domain checks, but it catches most places.
In passing, improve typcache.c's support for domains (it can now cache
the identity of a domain's base type), and rewrite the argument handling
logic in jsonfuncs.c's populate_record[set]_worker to reduce duplicative
per-call lookups.
I believe this is code-complete so far as the core and contrib code go.
The PLs need varying amounts of work, which will be tackled in followup
patches.
Discussion: https://postgr.es/m/4206.1499798337@sss.pgh.pa.us
Previously, we skipped using search_indexed_tlist_for_sortgroupref()
if the tlist expression being sought in the child plan node was merely
a Var. This is purely an optimization, based on the theory that
search_indexed_tlist_for_var() is faster, and one copy of a Var should
be as good as another. However, the GROUPING SETS patch broke the
latter assumption: grouping columns containing the "same" Var can
sometimes have different outputs, as shown in the test case added here.
So do it the hard way whenever a ressortgroupref marking exists.
(If this seems like a bottleneck, we could imagine building a tlist index
data structure for ressortgroupref values, as we do for Vars. But I'll
let that idea go until there's some evidence it's worthwhile.)
Back-patch to 9.6. The problem also exists in 9.5 where GROUPING SETS
came in, but this patch is insufficient to resolve the problem in 9.5:
there is some obscure dependency on the upper-planner-pathification
work that happened in 9.6. Given that this is such a weird corner case,
and no end users have complained about it, it doesn't seem worth the work
to develop a fix for 9.5.
Patch by me, per a report from Heikki Linnakangas. (This does not fix
Heikki's original complaint, just the follow-on one.)
Discussion: https://postgr.es/m/aefc657e-edb2-64d5-6df1-a0828f6e9104@iki.fi
json_build_object and json_build_array and the jsonb equivalents did not
correctly process explicit VARIADIC arguments. They are modified to use
the new extract_variadic_args() utility function which abstracts away
the details of the call method.
Michael Paquier, reviewed by Tom Lane and Dmitry Dolgov.
Backpatch to 9.5 for the jsonb fixes and 9.4 for the json fixes, as
that's where they originated.
This is especially useful in the case of "VARIADIC ANY" functions. The
caller can get the arguments and types regardless of whether or not an
explicit VARIADIC array argument has been used. The function also
provides an option to convert arguments of type "unknown" to "text".
Michael Paquier and me, reviewed by Tom Lane.
Backpatch to 9.4 in order to support the following json bug fix.
Although joinaliasvars lists coming out of the parser are quite simple,
those lists can contain arbitrarily complex expressions after subquery
pullup. We do not perform expression preprocessing on them, meaning that
expressions in those lists will not meet the expectations of later phases
of the planner (for example, that they do not contain SubLinks). This had
been thought pretty harmless, since we don't intentionally touch those
lists in later phases --- but Andreas Seltenreich found a case in which
adjust_appendrel_attrs() could recurse into a joinaliasvars list and then
die on its assertion that it never sees a SubLink. We considered a couple
of localized fixes to prevent that specific case from looking at the
joinaliasvars lists, but really this seems like a generic hazard for all
expression processing in the planner. Therefore, probably the best answer
is to delete the joinaliasvars lists from the parsetree at the end of
expression preprocessing, so that there are no reachable expressions that
haven't been through preprocessing.
The case Andreas found seems to be harmless in non-Assert builds, and so
far there are no field reports suggesting that there are user-visible
effects in other cases. I considered back-patching this anyway, but
it turns out that Andreas' test doesn't fail at all in 9.4-9.6, because
in those versions adjust_appendrel_attrs contains code (added in commit
842faa714 and removed again in commit 215b43cdc) to process SubLinks
rather than complain about them. Barring discovery of another path by
which unprocessed joinaliasvars lists can cause trouble, the most
prudent compromise seems to be to patch this into v10 but not further.
Patch by me, with thanks to Amit Langote for initial investigation
and review.
Discussion: https://postgr.es/m/87r2tvt9f1.fsf@ansel.ydns.eu
find_expr_references() neglected to record a dependency on the result type
of a FieldSelect node, allowing a DROP TYPE to break a view or rule that
contains such an expression. I think we'd omitted this case intentionally,
reasoning that there would always be a related dependency ensuring that the
DROP would cascade to the view. But at least with nested field selection
expressions, that's not true, as shown in bug #14867 from Mansur Galiev.
Add the dependency, and for good measure a dependency on the node's exposed
collation.
Likewise add a dependency on the result type of a FieldStore. I think here
the reasoning was that it'd only appear within an assignment to a field,
and the dependency on the field's column would be enough ... but having
seen this example, I think that's wrong for nested-composites cases.
Looking at nearby code, I notice we're not recording a dependency on the
exposed collation of CoerceViaIO, which seems inconsistent with our choices
for related node types. Maybe that's OK but I'm feeling suspicious of this
code today, so let's add that; it certainly can't hurt.
This patch does not do anything to protect already-existing views, only
views created after it's installed. But seeing that the issue has been
there a very long time and nobody noticed till now, that's probably good
enough.
Back-patch to all supported branches.
Discussion: https://postgr.es/m/20171023150118.1477.19174@wrigleys.postgresql.org
Like the similar logic for arrays and records, it's necessary to examine
the range's subtype to decide whether the range type can support hashing.
We can omit checking the subtype for btree-defined operations, though,
since range subtypes are required to have those operations. (Possibly
that simplification for btree cases led us to overlook that it does
not apply for hash cases.)
This is only an issue if the subtype lacks hash support, which is not
true of any built-in range type, but it's easy to demonstrate a problem
with a range type over, eg, money: you can get a "could not identify
a hash function" failure when the planner is misled into thinking that
hash join or aggregation would work.
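The rule can be illustrated with a toy model. This is only a sketch: the TypeInfo struct and its fields are hypothetical and bear no relation to the real typcache structures. It shows hash support for a range type being derived from its subtype instead of being assumed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for a type cache entry. */
typedef struct TypeInfo
{
    const char *name;
    bool        has_hash_opclass;   /* does the type itself have one? */
    const struct TypeInfo *range_subtype;   /* NULL unless a range type */
} TypeInfo;

/* A range type is hashable only if its subtype is hashable too. */
bool
type_supports_hashing(const TypeInfo *t)
{
    if (!t->has_hash_opclass)
        return false;
    if (t->range_subtype != NULL)
        return type_supports_hashing(t->range_subtype);
    return true;
}
```

In this model a range over a subtype without hash support reports "not hashable", so a planner consulting it would not pick a hash join or hash aggregation for that type.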
This was born broken, so back-patch to all supported branches.
The previous coding would report that an array type supports extended
hashing if its element type supports regular hashing. This bug is
only latent at the moment, since AFAICS there is not yet any code
that depends on checking presence of extended-hashing support to make
any decisions. (And in any case it wouldn't matter unless the element
type has only regular hashing, which isn't true of any core data type.)
But that doesn't make it less broken. Extend the
cache_array_element_properties infrastructure to check this properly.
setTargetTable threw an error if the proposed target RangeVar's relname
matched any visible CTE or ENR. This breaks backwards compatibility in
the CTE case, since pre-v10 we never looked for a CTE here at all, so that
CTE names did not mask regular tables. It does seem like a good idea to
throw an error for the ENR case, though, thus causing ENRs to mask tables
for this purpose; ENRs are new in v10 so we're not breaking existing code,
and we may someday want to allow them to be the targets of DML.
To fix that, replace use of getRTEForSpecialRelationTypes, which was
overkill anyway, with use of scanNameSpaceForENR.
A second problem was that the check neglected to verify null schemaname,
so that a CTE or ENR could incorrectly be thought to match a qualified
RangeVar. That happened because getRTEForSpecialRelationTypes relied
on its caller to have checked for null schemaname. Even though the one
remaining caller got it right, this is obviously bug-prone, so move
the check inside getRTEForSpecialRelationTypes.
Also, revert commit 18ce3a4ab's extremely poorly thought out decision to
add a NULL return case to parserOpenTable --- without either documenting
that or adjusting any of the callers to check for it. The current bug
seems to have arisen in part due to working around that bad idea.
In passing, remove the one-line shim functions transformCTEReference and
transformENRReference --- they don't seem to be adding any clarity or
functionality.
Per report from Hugo Mercier (via Julien Rouhaud). Back-patch to v10
where the bug was introduced.
Thomas Munro, with minor editing by me
Discussion: https://postgr.es/m/CAOBaU_YdPVH+PTtiKSSLOiiW3mVDYsnNUekK+XPbHXiP=wrFLA@mail.gmail.com
Flex generates a lot of functions that are not actually used. In order
to avoid coverage figures being ruined by that, mark up the part of the
.l files where the generated code appears by lcov exclusion markers.
That way, lcov will typically only report on coverage for the .l file,
which is under our control, but not for the .c file.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
There is no reason to insist that direct arguments must match before
we can merge transition states of two aggregate calls. They're only
used during the finalfn call, so we can treat them like the finalfn
itself. This allows, eg, merging of
select
percentile_cont(0.25) within group (order by a),
percentile_disc(0.5) within group (order by a)
from ...
This didn't matter (and could not have been tested) before we allowed
state merging of OSAs otherwise.
Discussion: https://postgr.es/m/CAB4ELO5RZhOamuT9Xsf72ozbenDLLXZKSk07FiSVsuJNZB861A@mail.gmail.com
The built-in OSAs all share the same transition function, so they can
share transition state as long as the final functions cooperate to not
do the sort step more than once. To avoid running the tuplesort object
in randomAccess mode unnecessarily, add a bit of infrastructure to
nodeAgg.c to let the aggregate functions find out whether the transition
state is actually being shared or not.
This doesn't work for the hypothetical aggregates, since those inject
a hypothetical row that isn't traceable to the shared input state.
So they remain marked aggfinalmodify = 'w'.
Discussion: https://postgr.es/m/CAB4ELO5RZhOamuT9Xsf72ozbenDLLXZKSk07FiSVsuJNZB861A@mail.gmail.com
An aggregate's input expression(s) are not supposed to be evaluated
at all for a row where its FILTER test fails ... but commit 8ed3f11bb
overlooked that requirement. Reshuffle so that aggregates having a
filter clause evaluate their arguments separately from those without.
This still gets the benefit of doing only one ExecProject in the
common case of multiple Aggrefs, none of which have filters.
While at it, arrange for filter clauses to be included in the common
ExecProject evaluation, thus perhaps buying a little bit even when
there are filters.
Back-patch to v10 where the bug was introduced.
Discussion: https://postgr.es/m/30065.1508161354@sss.pgh.pa.us
While poking around in the aggregate logic, I noticed that commit
8ed3f11bb broke the logic in nodeAgg.c that purports to detect nested
aggregates, by moving initialization of regular aggregate argument
expressions out of the code segment that checks for that.
You could argue that this check is unnecessary, but it's not much code
so I'm inclined to keep it as a backstop against parser and planner
bugs. However, there's certainly zero value in checking only some of
the subexpressions.
We can make the check complete again, and as a bonus make it a good
deal more bulletproof against future mistakes of the same ilk, by
moving it out to the outermost level of ExecInitAgg. This means we
need to check only once per Agg node not once per aggregate, which
also seems like a good thing --- if the check does find something
wrong, it's not urgent that we report it before the plan node
initialization finishes.
Since this requires remembering the original length of the aggs list,
I deleted a long-obsolete stanza that changed numaggs from 0 to 1.
That's so old it predates our decision that palloc(0) is a valid
operation, in (digs...) 2004, see commit 24a1e20f1.
In passing improve a few comments.
Back-patch to v10, just in case.
Up to now, there have been hard-wired assumptions that normal aggregates'
final functions never modify their transition states, while ordered-set
aggregates' final functions always do. This has always been a bit
limiting, and in particular it's getting in the way of improving the
built-in ordered-set aggregates to allow merging of transition states.
Therefore, let's introduce catalog and CREATE AGGREGATE infrastructure
that lets the finalfn's behavior be declared explicitly.
There are now three possibilities for the finalfn behavior: it's purely
read-only, it trashes the transition state irrecoverably, or it changes
the state in such a way that no more transfn calls are possible but the
state can still be passed to other, compatible finalfns. There are no
examples of this third case today, but we'll shortly make the built-in
OSAs act like that.
This change allows user-defined aggregates to explicitly disclaim support
for use as window functions, and/or to prevent transition state merging,
if their implementations cannot handle that. While it was previously
possible to handle the window case with a run-time error check, there was
not any way to prevent transition state merging, which in retrospect is
something commit 804163bc2 should have provided for. But better late
than never.
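As a rough illustration of how the three declared behaviors constrain callers, here is a sketch under stated assumptions: the enum and function names are hypothetical, and of the letter codes only 'w' appears in these commit messages, so 'r' and 's' are assumed. The two rules are inferred from the description above, not copied from nodeAgg.c or nodeWindowAgg.c.

```c
#include <stdbool.h>

/* Hypothetical names; 'r' and 's' are assumed codes, 'w' is from the log. */
typedef enum FinalModify
{
    FINALMODIFY_READ_ONLY = 'r',    /* finalfn leaves the state untouched */
    FINALMODIFY_SHAREABLE = 's',    /* no more transfn calls, but other
                                     * compatible finalfns may still run */
    FINALMODIFY_READ_WRITE = 'w'    /* finalfn trashes the state */
} FinalModify;

/* Two aggregates can share one transition state unless it gets trashed. */
bool
can_share_transition_state(FinalModify m)
{
    return m != FINALMODIFY_READ_WRITE;
}

/*
 * A window aggregate keeps calling the transfn after a finalfn call, so
 * only a purely read-only finalfn is safe there.
 */
bool
can_use_as_window_function(FinalModify m)
{
    return m == FINALMODIFY_READ_ONLY;
}
```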
In passing, split out pg_aggregate.c's extern function declarations into
a new header file pg_aggregate_fn.h, similarly to what we've done for
some other catalog headers, so that pg_aggregate.h itself can be safe
for frontend files to include. This lets pg_dump use the symbolic
names for relevant constants.
Discussion: https://postgr.es/m/4834.1507849699@sss.pgh.pa.us
The following are the individual improvements:
1) Avoidance of FunctionCallInfo based function calls, replaced by
more efficient functions with a native C argument interface.
2) Don't extract columns from a cache entry's tuple whenever matching
entries - instead store them as a Datum array. This also gets rid of
having to build dummy tuples for negative & list entries, and of a
hack for dealing with cstring vs. text weirdness.
3) Reorder members of catcache.h struct, so important entries are more
likely to be on one cacheline.
4) Allowing the compiler to specialize critical SearchCatCache for a
specific number of attributes lets it unroll loops and avoid
other nkeys-dependent initialization.
5) Only initializing the ScanKey when necessary, i.e. on catcache misses,
greatly reduces unnecessary CPU cache misses.
6) Split off the cache-miss case from the hash lookup, reducing stack
allocations etc. in the common case.
7) CatCTup and their corresponding heaptuple are allocated in one
piece.
This results in making cache lookups themselves roughly three times as
fast - full-system benchmarks obviously improve less than that.
I've also evaluated further techniques:
- replace open coded hash with simplehash - the list walk right now
shows up in profiles. Unfortunately it's not easy to do so safely as
an entry's memory location can change at various times, which
doesn't work well with the refcounting and cache invalidation.
- Cacheline-aligning CatCTup entries - helps some with performance,
but the win isn't big and the code for it is ugly, because the
tuples have to be freed as well.
- add more proper functions, rather than macros for
SearchSysCacheCopyN etc., but right now they don't show up in
profiles.
The reason the macro wrappers for syscache.c/h have to be changed,
rather than just catcache, is that doing otherwise would require
exposing the SysCache array to the outside. That might be a good idea
anyway, but it's for another day.
Author: Andres Freund
Reviewed-By: Robert Haas
Discussion: https://postgr.es/m/20170914061207.zxotvyopetm7lrrp@alap3.anarazel.de
If a Parallel Bitmap Heap scan's chain of leftmost descendents
includes a BitmapOr whose first child is a BitmapAnd, the prior coding
would mistakenly create a non-shared TIDBitmap and then try to perform
shared iteration.
Report by Tomas Vondra. Patch by Dilip Kumar.
Discussion: http://postgr.es/m/50e89684-8ad9-dead-8767-c9545bafd3b6@2ndquadrant.com
I (tgl) objected to the obscure implementation introduced in commit
1c497fa72. This one seems a bit less action-at-a-distance-y, at the
price of repeating a few lines of code.
Improve the comments about what the function is doing, too.
Amit Khandekar, whacked around a bit more by me
Discussion: https://postgr.es/m/CAJ3gD9egYTyHUH0nTMxm8-1m3RvdqEbaTyGC-CUNtYf7tKNDaQ@mail.gmail.com
After calling ldap_unbind_s() we probably shouldn't try to use the LDAP
connection again to call ldap_get_option(), even if it failed. The OpenLDAP
man page for ldap_unbind[_s] says "Once it is called, the connection to the
LDAP server is closed, and the ld structure is invalid." Otherwise, as a
general rule we should probably call ldap_unbind() before returning in all
paths to avoid leaking resources. It is unlikely there is any practical
leak problem since failure to authenticate currently results in the backend
exiting soon afterwards.
Author: Thomas Munro
Reviewed-By: Alvaro Herrera, Peter Eisentraut
Discussion: https://postgr.es/m/20170914141205.eup4kxzlkagtmfac%40alvherre.pgsql
The previous convention doesn't lend itself to creating ResultRelInfos
lazily, as we already do in ExecGetTriggerResultRel. This patch
doesn't make anything lazier than before, but the pending patch for
UPDATE tuple routing proposes to do so (and there might be other
opportunities as well).
Amit Khandekar with some adjustments by me.
Discussion: http://postgr.es/m/CA+TgmoYPVP9Lyf6vUFA5DwxS4c--x6LOj2y36BsJaYtp62eXPQ@mail.gmail.com
If we merge the transition calculations for two different aggregates,
it's reasonable to assume that the transition function should not care
which of those Aggref structs it gets from AggGetAggref(). It is not
reasonable to make the same assumption about an aggregate final function,
however. Commit 804163bc2 broke this, as it will pass whichever Aggref
was first associated with the transition state in both cases.
This doesn't create an observable bug so far as the core system is
concerned, because the only existing uses of AggGetAggref() are in
ordered-set aggregates that happen to not pay attention to anything
but the input properties of the Aggref; and besides that, we disabled
sharing of transition calculations for OSAs yesterday. Nonetheless,
if some third-party code were using AggGetAggref() in a normal aggregate,
they would be entitled to call this a bug. Hence, back-patch the fix
to 9.6 where the problem was introduced.
In passing, improve some of the comments about transition state sharing.
Discussion: https://postgr.es/m/CAB4ELO5RZhOamuT9Xsf72ozbenDLLXZKSk07FiSVsuJNZB861A@mail.gmail.com
This ought to work, but the built-in OSAs are not capable of coping,
because their final-functions destructively modify their transition
state (specifically, the contained tuplesort object). That was fine
when those functions were written, but commit 804163bc2 moved the
goalposts without telling orderedsetaggs.c.
We should fix the built-in OSAs to support this, but it will take
a little work, especially if we don't want to sacrifice performance
in the normal non-shared-state case. Given that it took a year after
9.6 release for anyone to notice this bug, we should not prioritize
sharable-state over nonsharable-state performance. And a proper fix
is likely to be more complicated than we'd want to back-patch, too.
Therefore, let's just put in this stop-gap patch to prevent nodeAgg.c
from choosing to use shared state for OSAs. We can revert it in HEAD
when we get a better fix.
Report from Lukas Eder, diagnosis by me, patch by David Rowley.
Back-patch to 9.6 where the problem was introduced.
Discussion: https://postgr.es/m/CAB4ELO5RZhOamuT9Xsf72ozbenDLLXZKSk07FiSVsuJNZB861A@mail.gmail.com
There's three categories of changes leading to better performance:
- Splitting the per-attribute part of SendRowDescriptionMessage into a
v2 and a v3 version allows avoiding branches for every attribute.
- Preallocating the size of the buffer to be big enough for all
attributes and then using pq_write* avoids unnecessary buffer
size checks & resizing.
- Reusing a persistently allocated StringInfo for all
SendRowDescriptionMessage() invocations avoids repeated allocations
& reallocations.
Author: Andres Freund
Discussion: https://postgr.es/m/20170914063418.sckdzgjfrsbekae4@alap3.anarazel.de
This takes advantage of the infrastructure introduced by commit
81c5e46c49 to greatly reduce the
likelihood that two different queries will end up with the same query
ID. It's still possible, of course, but whereas before it the chances
of a collision reached 25% around 50,000 queries, it will now take
more than 3 billion queries.
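The 25%-at-50,000 figure can be checked with the usual birthday-problem product. The following is a sketch, assuming query IDs behave like uniformly random values; collision_probability is a hypothetical helper, not part of pg_stat_statements.

```c
#include <stdint.h>

/*
 * Probability that at least two of n uniformly random values collide when
 * there are "space" possible values, computed exactly as
 * 1 - prod_{i=1}^{n-1} (1 - i/space).
 */
double
collision_probability(uint64_t n, double space)
{
    double      p_distinct = 1.0;

    for (uint64_t i = 1; i < n; i++)
        p_distinct *= 1.0 - (double) i / space;
    return 1.0 - p_distinct;
}
```

Calling collision_probability(50000, 4294967296.0) models the old 32-bit space and comes out at roughly one in four. The same loop is impractical for the billions of queries relevant to a 64-bit space, where a closed-form approximation would be the better tool.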
Backward incompatibility: Because the type exposed at the SQL level is
int8, users may now see negative query IDs in the pg_stat_statements
view (and also, query IDs more than 4 billion, which was the old
limit).
Patch by me, reviewed by Michael Paquier and Peter Geoghegan.
Discussion: http://postgr.es/m/CA+TgmobG_Kp4cBKFmsznUAaM1GWW6hhRNiZC0KjRMOOeYnz5Yw@mail.gmail.com
This avoids newly allocating, and then possibly growing, the
stringbuffer for every row. For wide rows this can substantially
reduce memory allocator overhead, at the price of not immediately
reducing memory usage after outputting an especially wide row.
Author: Andres Freund
Discussion: https://postgr.es/m/20170914063418.sckdzgjfrsbekae4@alap3.anarazel.de
There's three prongs to achieve greater efficiency here:
1) Allow reusing a stringbuffer across pq_beginmessage/endmessage,
with the new pq_beginmessage_reuse/endmessage_reuse. This can be
beneficial both because it avoids allocating the initial buffer,
and because it's more likely to already have a correctly sized
buffer.
2) Replacing pq_sendint() with pq_sendint$width() inline
functions. Previously, pq_sendint() incurred unnecessary and
unpredictable branches. Additionally the replacement functions
are implemented more efficiently. pq_sendint is now deprecated; a
separate commit will convert all in-tree callers.
3) Add pq_writeint$width(), pq_writestring(). These rely on sufficient
space in the StringInfo's buffer, avoiding individual space checks
& potential individual resizing. To allow this to be used for
strings, expose mbutil.c's MAX_CONVERSION_GROWTH.
Followup commits will make use of these facilities.
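The reserve-once-then-write-unchecked pattern behind prongs 1 and 3 can be sketched with a toy buffer. Everything here (Buf, buf_reserve, buf_write_u16/u32, buf_reset) is hypothetical illustration code and not the StringInfo or pqformat API.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer, loosely analogous to a StringInfo. */
typedef struct Buf
{
    char       *data;
    size_t      len;
    size_t      cap;
} Buf;

/* One space check up front; returns 0 on success, -1 if out of memory. */
int
buf_reserve(Buf *b, size_t needed)
{
    size_t      newcap;
    char       *p;

    if (b->cap - b->len >= needed)
        return 0;
    newcap = b->cap ? b->cap : 64;
    while (newcap - b->len < needed)
        newcap *= 2;
    p = realloc(b->data, newcap);
    if (p == NULL)
        return -1;
    b->data = p;
    b->cap = newcap;
    return 0;
}

/* Unchecked writes in network byte order; caller has reserved the space. */
static inline void
buf_write_u16(Buf *b, uint16_t v)
{
    unsigned char be[2] = {(unsigned char) (v >> 8), (unsigned char) v};

    memcpy(b->data + b->len, be, sizeof(be));
    b->len += sizeof(be);
}

static inline void
buf_write_u32(Buf *b, uint32_t v)
{
    unsigned char be[4] = {(unsigned char) (v >> 24), (unsigned char) (v >> 16),
                           (unsigned char) (v >> 8), (unsigned char) v};

    memcpy(b->data + b->len, be, sizeof(be));
    b->len += sizeof(be);
}

/* Start a new message while keeping the allocation for reuse. */
static inline void
buf_reset(Buf *b)
{
    b->len = 0;
}
```

A caller that knows the message size reserves once, issues a run of unchecked writes, and calls buf_reset() before the next message, so the steady state involves neither allocation nor per-field size checks.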
Author: Andres Freund
Discussion: https://postgr.es/m/20170914063418.sckdzgjfrsbekae4@alap3.anarazel.de
In a lot of the places having appendBinaryStringInfo() maintain a
trailing NUL byte wasn't actually meaningful, e.g. when appending an
integer which can contain 0 in one of its bytes.
Removing this yields some small speedup, but more importantly will be
more consistent when providing faster variants of pq_sendint etc.
Author: Andres Freund
Discussion: https://postgr.es/m/20170914063418.sckdzgjfrsbekae4@alap3.anarazel.de
resowner/README contained advice to use a PG_TRY block to restore the
old CurrentResourceOwner value anywhere that that variable is transiently
changed. That advice was only inconsistently followed, however, and
on reflection it seems like unnecessary overhead. We don't bother
with such a convention for transient CurrentMemoryContext changes,
on the grounds that any (sub)transaction abort will start out by
resetting CurrentMemoryContext to what it wants. But the same is
true of CurrentResourceOwner, so there seems no need to treat it
differently.
Hence, remove PG_TRY blocks that exist only to restore CurrentResourceOwner
before re-throwing the error. There are a couple of places that restore
it along with some other actions, and I left those alone; the restore is
probably unnecessary but no noticeable gain will result from removing it.
Discussion: https://postgr.es/m/5236.1507583529@sss.pgh.pa.us
The previous coding in ProcessInterrupts() could lead to
idle_in_transaction_session_timeout being ignored, when
statement_timeout occurred earlier.
The problem was that ProcessInterrupts() would return before
processing the transaction timeout if QueryCancelPending was set while
QueryCancelHoldoffCount != 0 - which is the case when reading new
commands from the client, and that is exactly when the idle transaction
timeout would hit.
Fix that by removing the early return. Alternatively the transaction
timeout code could have been moved up, but that early return seems
like an issue that could hit other cases too.
Author: Lukas Fittl
Bug: #14821
Discussion:
https://www.postgresql.org/message-id/20170921010956.17345.61461%40wrigleys.postgresql.org
https://www.postgresql.org/message-id/CAP53PkxQnv3OWJpyNPGJYT62uY=n1=2CF_Lpc6gVOFnc0-gazw@mail.gmail.com
Backpatch: 9.6-, where idle_in_transaction_session_timeout was introduced.
The GRANT reference page, which lists the default privileges for new
objects, failed to mention that USAGE is granted by default for data
types and domains. As a lesser sin, it also did not specify anything
about the initial privileges for sequences, FDWs, foreign servers,
or large objects. Fix that, and add a comment to acldefault() in the
probably vain hope of getting people to maintain this list in future.
Noted by Laurenz Albe, though I editorialized on the wording a bit.
Back-patch to all supported branches, since they all have this behavior.
Discussion: https://postgr.es/m/1507620895.4152.1.camel@cybertec.at
Up to now async.c has used TransactionIdIsInProgress() to detect whether
a notify message's source transaction is still running. However, that
function has a quick-exit path that reports that XIDs before RecentXmin
are no longer running. If a listening backend is doing nothing but
listening, and not running any queries, there is nothing that will advance
its value of RecentXmin. Once 2 billion transactions elapse, the
RecentXmin check causes active transactions to be reported as not running.
If they aren't committed yet according to CLOG, async.c decides they
aborted and discards their messages. The timing for that is a bit tight
but it can happen when multiple backends are sending notifies concurrently.
The net symptom therefore is that a sufficiently-long-surviving
listen-only backend starts to miss some fraction of NOTIFY traffic,
but only under heavy load.
The only function that updates RecentXmin is GetSnapshotData().
A brute-force fix would therefore be to take a snapshot before
processing incoming notify messages. But that would add cycles,
as well as contention for the ProcArrayLock. We can be smarter:
having taken the snapshot, let's use that to check for running
XIDs, and not call TransactionIdIsInProgress() at all. In this
way we reduce the number of ProcArrayLock acquisitions from one
per message to one per notify interrupt; that's the same under
light load but should be a benefit under heavy load. Light testing
says that this change is a wash performance-wise for normal loads.
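To make the "check against the snapshot" idea concrete, here is a simplified sketch. The Snapshot struct and xid_running_in_snapshot() are hypothetical, and the comparison ignores XID wraparound, which real code must handle with modulo-2^32 arithmetic.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;

/* Hypothetical, simplified snapshot: no subtransactions, no wraparound. */
typedef struct Snapshot
{
    Xid         xmin;           /* every XID below this has finished */
    Xid         xmax;           /* every XID at or above this is treated
                                 * as still running */
    const Xid  *xip;            /* XIDs in [xmin, xmax) still in progress */
    size_t      xcnt;
} Snapshot;

/* Was this XID still running when the snapshot was taken? */
bool
xid_running_in_snapshot(Xid xid, const Snapshot *snap)
{
    if (xid < snap->xmin)
        return false;
    if (xid >= snap->xmax)
        return true;
    for (size_t i = 0; i < snap->xcnt; i++)
    {
        if (snap->xip[i] == xid)
            return true;
    }
    return false;
}
```

One snapshot taken per notify interrupt can then answer the question for every queued message, which is where the saving in lock acquisitions comes from.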
I looked around for other callers of TransactionIdIsInProgress()
that might be at similar risk, and didn't find any; all of them
are inside transactions that presumably have already taken a
snapshot.
Problem report and diagnosis by Marko Tiikkaja, patch by me.
Back-patch to all supported branches, since it's been like this
since 9.0.
Discussion: https://postgr.es/m/20170926182935.14128.65278@wrigleys.postgresql.org
The previous placement of the fallback implementation in libpgcommon
was problematic, because libpgport functions need strnlen
functionality.
Move replacement into libpgport. Provide strnlen() under its posix
name, instead of pg_strnlen(). Fix stupid configure bug, executing the
test only when compiled with threading support.
Author: Andres Freund
Discussion: https://postgr.es/m/E1e1gR2-0005fB-SI@gemulon.postgresql.org
Previously nodeProjectSet only released memory once per input tuple,
rather than once per returned tuple. If the computation of an
individual returned tuple requires a lot of memory, that can lead to
problems.
Instead change things so that the expression context can be reset once
per output tuple, which requires a new memory context to store SRF
arguments in.
This is a longstanding issue, but was hard to fix before 9.6, due to
the way tSRFs were evaluated. But it's fairly easy to fix now. We
could backpatch this into 10, but given there've been few complaints,
that doesn't seem worth the risk so far.
Reported-By: Lucas Fairchild
Author: Andres Freund, per discussion with Tom Lane
Discussion: https://postgr.es/m/4514.1507318623@sss.pgh.pa.us
copy_file() reads and writes data 64KB at a time (with default BLCKSZ),
and historically has issued a pg_flush_data request after each write.
This turns out to interact really badly with macOS's new APFS file
system: a large file copy takes over 100X longer than it ought to on
APFS, as reported by Brent Dearth. While that's arguably a macOS bug,
it's not clear whether Apple will do anything about it in the near
future, and in any case experimentation suggests that issuing flushes
a bit less often can be helpful on other platforms too.
Hence, rearrange the logic in copy_file() so that flush requests are
issued once per N writes rather than every time through the loop.
I set the FLUSH_DISTANCE to 32MB on macOS (any less than that still
results in a noticeable speed degradation on APFS), but 1MB elsewhere.
In limited testing on Linux and FreeBSD, this seems slightly faster
than the previous code, and certainly no worse. It helps noticeably
on macOS even with the older HFS filesystem.
A simpler change would have been to just increase the size of the
copy buffer without changing the loop logic, but that seems likely
to trash the processor cache without really helping much.
Back-patch to 9.6 where we introduced msync() as an implementation
option for pg_flush_data(). The problem seems specific to APFS's
mmap/msync support, so I don't think we need to go further back.
Discussion: https://postgr.es/m/CADkxhTNv-j2jw2g8H57deMeAbfRgYBoLmVuXkC=YCFBXRuCOww@mail.gmail.com
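To make the "once per N writes" logic of the copy_file() change above
concrete, here is a small model of the loop. It is a sketch under
stated assumptions, not the committed code: it performs no I/O and
only counts how many flush requests would be issued. The function name
count_flush_requests and the macro COPY_BUF_SIZE are hypothetical; the
64KB write size and the 1MB (non-macOS) FLUSH_DISTANCE come from the
text.

```c
#include <assert.h>
#include <stddef.h>

#define COPY_BUF_SIZE   ((size_t) 64 * 1024)    /* one write, per the text */
#define FLUSH_DISTANCE  ((size_t) 1024 * 1024)  /* non-macOS value from the text */

/*
 * Model of the copy loop: returns the number of flush requests that would
 * be issued while copying nbytes.  No actual I/O is performed.
 */
size_t
count_flush_requests(size_t nbytes)
{
    size_t      offset = 0;         /* bytes written so far */
    size_t      flush_offset = 0;   /* start of the not-yet-flushed range */
    size_t      nflushes = 0;

    while (offset < nbytes)
    {
        size_t      chunk = nbytes - offset;

        if (chunk > COPY_BUF_SIZE)
            chunk = COPY_BUF_SIZE;

        /* request a flush only once enough unflushed data has piled up */
        if (offset - flush_offset >= FLUSH_DISTANCE)
        {
            nflushes++;             /* stands in for a pg_flush_data() call */
            flush_offset = offset;
        }

        /* a real copy would read() and write() "chunk" bytes here */
        offset += chunk;
        assert(offset >= flush_offset);
    }

    /* flush whatever remains after the last write */
    if (offset > flush_offset)
        nflushes++;

    return nflushes;
}
```

With these numbers a 4MB copy issues 4 flush requests, where flushing
after every 64KB write would have issued 64.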
If the operator is a strict btree equality operator, and X isn't volatile,
then the clause must yield true for any non-null value of X, or null if X
is null. At top level of a WHERE clause, we can ignore the distinction
between false and null results, so it's valid to simplify the clause to
"X IS NOT NULL". This is a useful improvement mainly because we'll get
a far better selectivity estimate in most cases.
Because such cases seldom arise in well-written queries, it is unappetizing
to expend a lot of planner cycles looking for them ... but it turns out
that there's a place we can shoehorn this in practically for free, because
equivclass.c already has to detect and reject candidate equivalences of the
form X = X. That doesn't catch every place that it would be valid to
simplify to X IS NOT NULL, but it catches the typical case. Working harder
doesn't seem justified.
Patch by me, reviewed by Petr Jelinek
Discussion: https://postgr.es/m/CAMjNa7cC4X9YR-vAJS-jSYCajhRDvJQnN7m2sLH1wLh-_Z2bsw@mail.gmail.com
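The equivalence claimed above for "X = X" at the top level of WHERE
can be checked with a toy model of three-valued logic. Everything in
this sketch (TriValue, NullableInt, the helper names) is hypothetical
and has nothing to do with the planner's real data structures; it only
shows that a strict equality operator applied to X and X passes the
WHERE filter exactly when X is not null.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { TV_FALSE, TV_TRUE, TV_NULL } TriValue;

typedef struct
{
    bool        isnull;
    int         value;
} NullableInt;

NullableInt
make_null(void)
{
    NullableInt n = {true, 0};

    return n;
}

NullableInt
make_int(int v)
{
    NullableInt n = {false, v};

    return n;
}

/* strict equality: a null input gives a null result */
TriValue
strict_eq(NullableInt a, NullableInt b)
{
    if (a.isnull || b.isnull)
        return TV_NULL;
    return (a.value == b.value) ? TV_TRUE : TV_FALSE;
}

/* top-level WHERE keeps a row only for TRUE; FALSE and NULL both drop it */
bool
where_accepts(TriValue v)
{
    return v == TV_TRUE;
}

bool
x_equals_x_passes(NullableInt x)
{
    return where_accepts(strict_eq(x, x));
}

bool
x_is_not_null_passes(NullableInt x)
{
    return !x.isnull;
}

/* the rewrite is valid if both forms keep or drop the same rows */
void
check_rewrite(NullableInt x)
{
    assert(x_equals_x_passes(x) == x_is_not_null_passes(x));
}
```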
The logical decoding functions do BeginInternalSubTransaction and
RollbackAndReleaseCurrentSubTransaction to clean up after themselves.
It turns out that AtEOSubXact_SPI has an unrecognized assumption that
we always need to cancel the active SPI operation in the SPI context
that surrounds the subtransaction (if there is one). That's true
when the RollbackAndReleaseCurrentSubTransaction call is coming from
the SPI-using function itself, but not when it's happening inside
some unrelated function invoked by a SPI query. In practice the
affected callers are the various PLs.
To fix, record the current subtransaction ID when we begin a SPI
operation, and clean up only if that ID is the subtransaction being
canceled.
Also, remove AtEOSubXact_SPI's assertion that it must have cleaned
up the surrounding SPI context's active tuptable. That's proven
wrong by the same test case.
Also clarify (or, if you prefer, reinterpret) the calling conventions
for _SPI_begin_call and _SPI_end_call. The memory context cleanup
in the latter means that these have always had the flavor of a matched
resource-management pair, but they weren't documented that way before.
Per report from Ben Chobot.
Back-patch to 9.4 where logical decoding came in. In principle,
the SPI changes should go all the way back, since the problem dates
back to commit 7ec1c5a86. But given the lack of field complaints
it seems few people are using internal subtransactions in this way.
So I don't feel a need to take any risks in 9.2/9.3.
Discussion: https://postgr.es/m/73FBA179-C68C-4540-9473-71E865408B15@silentmedia.com
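The SPI fix described above boils down to remembering which
subtransaction an operation started in. A minimal sketch of that idea
follows; the struct and function names are hypothetical and do not
correspond to the real SPI internals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int SubXactIdSketch;

typedef struct
{
    SubXactIdSketch exec_subid; /* subxact in which the operation began */
    bool        op_active;      /* is an operation in progress? */
} SpiContextSketch;

/* record the current subtransaction ID when an operation begins */
void
spi_begin_op(SpiContextSketch *ctx, SubXactIdSketch current_subid)
{
    assert(ctx != NULL);
    ctx->exec_subid = current_subid;
    ctx->op_active = true;
}

/* at subxact abort, cancel only an operation started in that subxact */
void
spi_at_subxact_abort(SpiContextSketch *ctx, SubXactIdSketch aborted_subid)
{
    assert(ctx != NULL);
    if (ctx->op_active && ctx->exec_subid == aborted_subid)
        ctx->op_active = false;
}

/* small scenario helper: does the operation survive the abort? */
bool
op_survives_abort_of(SubXactIdSketch began_in, SubXactIdSketch aborted)
{
    SpiContextSketch ctx = {0, false};

    spi_begin_op(&ctx, began_in);
    spi_at_subxact_abort(&ctx, aborted);
    return ctx.op_active;
}
```

An operation begun in subtransaction 2 is left alone when some
unrelated function rolls back its own subtransaction 3, but is
canceled when subtransaction 2 itself is rolled back.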
Both ExecMakeFunctionResultSet() and evaluation of simple expressions
need to be done in the per-tuple memory context, not per-query, else
we leak data until end of query. This is a consideration that was
missed while refactoring code in the ProjectSet patch (note that in
pre-v10, ExecMakeFunctionResult is called in the per-tuple context).
Per bug #14843 from Ben M. Diagnosed independently by Andres and myself.
Discussion: https://postgr.es/m/20171005230321.28561.15927@wrigleys.postgresql.org
Sloppy loop coding in set_status_by_pages() resulted in fetching one array
element more than it should from the subxids[] array. The odds of this
resulting in SIGSEGV are pretty small, but we've certainly seen that happen
with similar mistakes elsewhere. While at it, we can get rid of an extra
TransactionIdToPage() calculation per loop.
Per report from David Binderman. Back-patch to all supported branches,
since this code is quite old.
Discussion: https://postgr.es/m/HE1PR0802MB2331CBA919CBFFF0C465EB429C710@HE1PR0802MB2331.eurprd08.prod.outlook.com
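For the off-by-one described above, the safe pattern is to test the
array bound before touching the element. The following sketch groups
a sorted XID array by page in that style. It is not the code from
set_status_by_pages(); the names and the tiny page size are made up
for illustration, and unlike the real fix it does not try to save a
page calculation per loop.

```c
#include <assert.h>
#include <stddef.h>

#define XACTS_PER_PAGE_SKETCH 8 /* tiny value, for illustration only */

static unsigned
xid_to_page(unsigned xid)
{
    return xid / XACTS_PER_PAGE_SKETCH;
}

/*
 * Count the runs of consecutive XIDs that fall on the same page.
 * subxids[] is assumed to be sorted.
 */
size_t
count_page_groups(const unsigned *subxids, size_t nsubxids)
{
    size_t      i = 0;
    size_t      ngroups = 0;

    assert(subxids != NULL || nsubxids == 0);

    while (i < nsubxids)
    {
        unsigned    pageno = xid_to_page(subxids[i]);

        /* bound check comes first, so subxids[nsubxids] is never read */
        while (i < nsubxids && xid_to_page(subxids[i]) == pageno)
            i++;

        ngroups++;              /* a real caller would process the page here */
    }

    return ngroups;
}
```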
When some tuple versions in an update chain are frozen due to them being
older than freeze_min_age, the xmax/xmin trail can become broken. This
breaks HOT (and probably other things). A subsequent VACUUM can break
things in more serious ways, such as leaving orphan heap-only tuples
whose root HOT redirect items were removed. This can be seen because
index creation (or REINDEX) complains with an error like
    ERROR: XX000: failed to find parent tuple for heap-only tuple at (0,7) in table "t"
Because of relfrozenxid constraints, we cannot avoid the freezing of the
early tuples, so we must cope with the results: whenever we see an Xmin
of FrozenTransactionId, consider it a match for whatever the previous
Xmax value was.
This problem seems to have appeared in 9.3 with multixact changes,
though strictly speaking it seems unrelated.
Since 9.4 we have commit 37484ad2a "Change the way we mark tuples as
frozen", so the fix is simple: just compare the raw Xmin (still stored
in the tuple header, since freezing merely set an infomask bit) to the
Xmax. But in 9.3 we rewrite the Xmin value to FrozenTransactionId, so
the original value is lost and we have nothing to compare the Xmax with.
To cope with that case we need to compare the Xmin with FrozenXid,
assume it's a match, and hope for the best. Sadly, since you can
pg_upgrade a 9.3 instance containing half-frozen pages to newer
releases, we need to keep the old check in newer versions too, which
seems a bit brittle; I hope we can somehow get rid of that.
I didn't optimize the new function for performance. The new coding is
probably a bit slower than before, since there is a function call rather
than a straight comparison, but I'd rather have it work correctly than
be fast but wrong.
This is a followup after 20b6552242 fixed a few related problems.
Apparently, in 9.6 and up there are more ways to get into trouble, but
in 9.3 - 9.5 I cannot reproduce a problem anymore with this patch, so
there must be a separate bug.
Reported-by: Peter Geoghegan
Diagnosed-by: Peter Geoghegan, Michael Paquier, Daniel Wood,
Yi Wen Wong, Álvaro
Discussion: https://postgr.es/m/CAH2-Wznm4rCrhFAiwKPWTpEw2bXDtgROZK7jWWGucXeH3D1fmA@mail.gmail.com
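The chain-following rule described above (compare the raw Xmin with
the previous Xmax, but accept a 9.3-style frozen Xmin as a match) can
be sketched as a small predicate. The function and macro names below
are hypothetical, and the real function takes a tuple header rather
than a bare Xmin; the only fact borrowed from PostgreSQL is that
FrozenTransactionId has the value 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FROZEN_XID_SKETCH ((uint32_t) 2)    /* FrozenTransactionId's value */

/*
 * Does a tuple with the given raw xmin continue the update chain whose
 * previous member had xmax = prev_xmax?
 */
bool
update_xmax_matches_xmin(uint32_t prev_xmax, uint32_t raw_xmin)
{
    /* sketch assumption: chains are only followed from a valid xmax */
    assert(prev_xmax != 0);

    /* normal case: the raw xmin is still in the tuple header */
    if (raw_xmin == prev_xmax)
        return true;

    /* 9.3-style freezing overwrote xmin: original is lost, assume a match */
    if (raw_xmin == FROZEN_XID_SKETCH)
        return true;

    return false;
}
```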
Instead of joining two partitioned tables in their entirety we can, if
it is an equi-join on the partition keys, join the matching partitions
individually. This involves teaching the planner about "other join"
rels, which are related to regular join rels in the same way that
other member rels are related to baserels. This can use significantly
more CPU time and memory than regular join planning, because there may
now be a set of "other" rels not only for every base relation but also
for every join relation. In most practical cases, this probably
shouldn't be a problem, because (1) it's probably unusual to join many
tables each with many partitions using the partition keys for all
joins and (2) if you do have that scenario then you probably have a big
enough machine to handle the increased memory cost of planning and (3)
the resulting plan is highly likely to be better, so what you spend in
planning you'll make up on the execution side. All the same, for now,
turn this feature off by default.
Currently, we can only perform joins between two tables whose
partitioning schemes are absolutely identical. It would be nice to
cope with other scenarios, such as extra partitions on one side or the
other with no match on the other side, but that will have to wait for
a future patch.
Ashutosh Bapat, reviewed and tested by Rajkumar Raghuwanshi, Amit
Langote, Rafia Sabih, Thomas Munro, Dilip Kumar, Antonin Houska, Amit
Khandekar, and by me. A few final adjustments by me.
Discussion: http://postgr.es/m/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com
Discussion: http://postgr.es/m/CAFjFpRcitjfrULr5jfuKWRPsGUX0LQ0k8-yG0Qw2+1LBGNpMdw@mail.gmail.com
If the table attached as a partition is itself partitioned, individual
partitions might have constraints strong enough to skip scanning the
table even if the table actually attached does not. This is pretty
cheap to check, and possibly a big win if it works out.
Amit Langote, with test case changes by me.
Discussion: http://postgr.es/m/1f08b844-0078-aa8d-452e-7af3bf77d05f@lab.ntt.co.jp
Haribabu Kommi, reviewed by Dilip Kumar and Rafia Sabih. Various
cosmetic changes by me to explain why this appears to be safe but
allowing inserts in parallel mode in general wouldn't be. Also, I
removed the REFRESH MATERIALIZED VIEW case from Haribabu's patch,
since I'm not convinced that case is OK, and hacked on the
documentation somewhat.
Discussion: http://postgr.es/m/CAJrrPGdo5bak6qnPWe8Kpi8g_jfQEs-G4SYmG9y+OFaw2-dPvA@mail.gmail.com
Remove obsolete references to get_rel_oids(). Avoid listing specific
relkinds in the comments, since we seem unable to keep such things
in sync with the code, and it's not all that helpful anyhow.
Noted by Michael Paquier, though I rewrote the comments a bit more.
Discussion: https://postgr.es/m/CAB7nPqTWiN9zwKTaOrsnKiGDChqRt7C1+CiiDk4N4OMn92rs6A@mail.gmail.com
A lot of semi-internal code just prints out numeric SPI error codes,
which is not very helpful. We already have an API function to convert
the codes to a string, so let's make more use of that.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
These are two completely unrelated code paths, so it doesn't make sense
to pack them into one function.
Add attribute noreturn to ri_ReportViolation().
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Turns out we have enough functions that the binary search is quite
noticeable in profiles.
Thus have Gen_fmgrtab.pl build a new mapping from a builtin function's
oid to an index in the existing fmgr_builtins array. That keeps the
additional memory usage at a reasonable amount.
Author: Andres Freund, with input from Tom Lane
Discussion: https://postgr.es/m/20170914065128.a5sk7z4xde5uy3ei@alap3.anarazel.de
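The OID-to-index mapping mentioned above replaces a binary search with
a direct array lookup. Here is a toy version of that idea. It is a
sketch only: the table contents, all names, and the use of a 16-bit
index with a sentinel are assumptions made for illustration, and the
index is filled at first use here rather than generated at build time
as Gen_fmgrtab.pl does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    unsigned    oid;
    const char *name;
} BuiltinSketch;

/* hypothetical table, sorted by oid as a binary search would require */
static const BuiltinSketch builtins[] = {
    {3, "fn_three"},
    {7, "fn_seven"},
    {12, "fn_twelve"},
};

#define N_BUILTINS      (sizeof(builtins) / sizeof(builtins[0]))
#define MAX_BUILTIN_OID 12
#define INVALID_INDEX   UINT16_MAX

/* oid -> index into builtins[]; small entries keep the table compact */
static uint16_t oid_index[MAX_BUILTIN_OID + 1];
static int  oid_index_ready = 0;

static void
build_oid_index(void)
{
    size_t      i;

    for (i = 0; i <= MAX_BUILTIN_OID; i++)
        oid_index[i] = INVALID_INDEX;

    for (i = 0; i < N_BUILTINS; i++)
    {
        assert(builtins[i].oid <= MAX_BUILTIN_OID);
        oid_index[builtins[i].oid] = (uint16_t) i;
    }
    oid_index_ready = 1;
}

/* constant-time lookup; returns NULL if oid is not a builtin */
const BuiltinSketch *
lookup_builtin(unsigned oid)
{
    uint16_t    idx;

    if (!oid_index_ready)
        build_oid_index();
    if (oid > MAX_BUILTIN_OID)
        return NULL;

    idx = oid_index[oid];
    return (idx == INVALID_INDEX) ? NULL : &builtins[idx];
}
```

Storing a small index per OID rather than a full table entry is what
keeps the extra memory modest even when the OID range is sparse.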
Not much to say about this; does what it says on the tin.
However, formerly, if there was a column list then the ANALYZE action was
implied; now it must be specified, or you get an error. This is because
it would otherwise be a bit unclear what the user meant if some tables
have column lists and some don't.
Nathan Bossart, reviewed by Michael Paquier and Masahiko Sawada, with some
editorialization by me
Discussion: https://postgr.es/m/E061A8E3-5E3D-494D-94F0-E8A9B312BBFC@amazon.com
Commit 597a87ccc introduced a latch pointer variable to replace use
of a long-lived shared latch in the shared WalRcvData structure.
This was not well thought out, because there are now hazards of the
pointer variable changing while it's being inspected by another
process. This could obviously lead to a core dump in code like
    if (WalRcv->latch)
        SetLatch(WalRcv->latch);
and there's a more remote risk of a torn read, if we have any
platforms where reading/writing a pointer is not atomic.
An actual problem would occur only if the walreceiver process
exits (gracefully) while the startup process is trying to
signal it, but that seems well within the realm of possibility.
To fix, treat the pointer variable (not the referenced latch)
as being protected by the WalRcv->mutex spinlock. There
remains a race condition that we could apply SetLatch to a
process latch that no longer belongs to the walreceiver, but
I believe that's harmless: at worst it'd cause an extra wakeup
of the next process to use that PGPROC structure.
Back-patch to v10 where the faulty code was added.
Discussion: https://postgr.es/m/22735.1507048202@sss.pgh.pa.us
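The latch-pointer fix above follows a common pattern: copy the shared
pointer into a local variable while holding the lock, then test and
use only the local copy. A self-contained sketch of that pattern is
shown here. All type and function names are hypothetical, and a C11
atomic_flag stands in for the WalRcv->mutex spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct
{
    int         is_set;
} LatchSketch;

typedef struct
{
    atomic_flag mutex;          /* stands in for WalRcv->mutex */
    LatchSketch *latch;         /* NULL when no walreceiver is attached */
} WalRcvSketch;

static void
spin_acquire(atomic_flag *lock)
{
    while (atomic_flag_test_and_set(lock))
        ;                       /* busy-wait until the flag was clear */
}

static void
spin_release(atomic_flag *lock)
{
    atomic_flag_clear(lock);
}

void
walrcv_sketch_init(WalRcvSketch *w)
{
    atomic_flag_clear(&w->mutex);
    w->latch = NULL;
}

/* the owner publishes or withdraws its latch under the lock */
void
walrcv_set_latch(WalRcvSketch *w, LatchSketch *latch)
{
    spin_acquire(&w->mutex);
    w->latch = latch;
    spin_release(&w->mutex);
}

/* returns true if a latch was found and set */
bool
wake_walreceiver(WalRcvSketch *w)
{
    LatchSketch *latch;

    assert(w != NULL);

    /* fetch the pointer exactly once, while holding the lock */
    spin_acquire(&w->mutex);
    latch = w->latch;
    spin_release(&w->mutex);

    if (latch == NULL)
        return false;

    latch->is_set = 1;          /* stands in for SetLatch(latch) */
    return true;
}
```

Because the pointer is read only once, the NULL test and the use can
no longer see two different values.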
1. Since commit b1a9bad9e7 we had pstrdup() inside a
spinlock-protected critical section; reported by Andreas Seltenreich.
Turn those into strlcpy() to stack-allocated variables instead.
Backpatch to 9.6.
2. Since commit 9ed551e0a4 we had a pfree() uselessly inside a
spinlock-protected critical section. Tom Lane noticed in code review.
Move down. Backpatch to 9.6.
3. Since commit 64233902d2 we had GetCurrentTimestamp() (a kernel
call) inside a spinlock-protected critical section. Tom Lane noticed in
code review. Move it up. Backpatch to 9.2.
4. Since commit 1bb2558046 we did elog(PANIC) while holding spinlock.
Tom Lane noticed in code review. Release spinlock before dying.
Backpatch to 9.2.
Discussion: https://postgr.es/m/87h8vhtgj2.fsf@ansel.ydns.eu
All postgres internal usages are replaced; it's just libpq example
usages that haven't been converted. External users of libpq can't
generally rely on including postgres internal headers.
Note that this includes replacing open-coded byte swapping of 64bit
integers (using two 32 bit swaps) with a single 64bit swap.
Where it looked applicable, I have removed netinet/in.h and
arpa/inet.h usage, which previously provided the relevant
functionality. It's perfectly possible that I missed other reasons for
including those; the buildfarm will tell.
Author: Andres Freund
Discussion: https://postgr.es/m/20170927172019.gheidqy6xvlxb325@alap3.anarazel.de
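To show what replacing two 32-bit swaps with a single 64-bit swap
means, here are both forms side by side. These are plain-C sketches
with made-up names, not the macros or functions the tree actually
uses, and they swap unconditionally rather than only on little-endian
hosts.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t
swap32(uint32_t x)
{
    return ((x << 24) & 0xff000000u) |
        ((x << 8) & 0x00ff0000u) |
        ((x >> 8) & 0x0000ff00u) |
        ((x >> 24) & 0x000000ffu);
}

/* open-coded form: swap each half, then exchange the halves */
uint64_t
swap64_two_halves(uint64_t x)
{
    uint32_t    hi = (uint32_t) (x >> 32);
    uint32_t    lo = (uint32_t) x;

    return ((uint64_t) swap32(lo) << 32) | swap32(hi);
}

/* single 64-bit swap: exchange 32-bit, then 16-bit, then 8-bit units */
uint64_t
swap64_single(uint64_t x)
{
    x = ((x & 0x00000000ffffffffULL) << 32) | (x >> 32);
    x = ((x & 0x0000ffff0000ffffULL) << 16) |
        ((x >> 16) & 0x0000ffff0000ffffULL);
    x = ((x & 0x00ff00ff00ff00ffULL) << 8) |
        ((x >> 8) & 0x00ff00ff00ff00ffULL);
    return x;
}

/* both forms must produce the same result for any input */
void
check_swaps_agree(uint64_t x)
{
    assert(swap64_two_halves(x) == swap64_single(x));
}
```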
Allowing arrays with a domain type as their element type was left un-done
in the original domain patch, but not for any very good reason. This
omission leads to such surprising results as array_agg() not working on
a domain column, because the parser can't identify a suitable output type
for the polymorphic aggregate.
In order to fix this, first clean up the APIs of coerce_to_domain() and
some internal functions in parse_coerce.c so that we consistently pass
around a CoercionContext along with CoercionForm. Previously, we sometimes
passed an "isExplicit" boolean flag instead, which is strictly less
information; and coerce_to_domain() didn't even get that, but instead had
to reverse-engineer isExplicit from CoercionForm. That's contrary to the
documentation in primnodes.h that says that CoercionForm only affects
display and not semantics. I don't think this change fixes any live bugs,
but it makes things more consistent. The main reason for doing it though
is that now build_coercion_expression() receives ccontext, which it needs
in order to be able to recursively invoke coerce_to_target_type().
Next, reimplement ArrayCoerceExpr so that the node does not directly know
any details of what has to be done to the individual array elements while
performing the array coercion. Instead, the per-element processing is
represented by a sub-expression whose input is a source array element and
whose output is a target array element. This simplifies life in
parse_coerce.c, because it can build that sub-expression by a recursive
invocation of coerce_to_target_type(). The executor now handles the
per-element processing as a compiled expression instead of hard-wired code.
The main advantage of this is that we can use a single ArrayCoerceExpr to
handle as many as three successive steps per element: base type conversion,
typmod coercion, and domain constraint checking. The old code used two
stacked ArrayCoerceExprs to handle type + typmod coercion, which was pretty
inefficient, and adding yet another array deconstruction to do domain
constraint checking seemed very unappetizing.
In the case where we just need a single, very simple coercion function,
doing this straightforwardly leads to a noticeable increase in the
per-array-element runtime cost. Hence, add an additional shortcut evalfunc
in execExprInterp.c that skips unnecessary overhead for that specific form
of expression. The runtime speed of simple cases is within 1% or so of
where it was before, while cases that previously required two levels of
array processing are significantly faster.
Finally, create an implicit array type for every domain type, as we do for
base types, enums, etc. Everything except the array-coercion case seems
to just work without further effort.
Tom Lane, reviewed by Andrew Dunstan
Discussion: https://postgr.es/m/9852.1499791473@sss.pgh.pa.us
get_rel_oids used to not take any relation locks at all, but that stopped
being a good idea with commit 3c3bb9933, which inserted a syscache lookup
into the function. A concurrent DROP TABLE could now produce "cache lookup
failed", which we don't want to have happen in normal operation. The best
solution seems to be to transiently take a lock on the relation named by
the RangeVar (which also makes the result of RangeVarGetRelid a lot less
spongy). But we shouldn't hold the lock beyond this function, because we
don't want VACUUM to lock more than one table at a time. (That would not
be a big problem right now, but it will become one after the pending
feature patch to allow multiple tables to be named in VACUUM.)
In passing, adjust vacuum_rel and analyze_rel to document that we don't
trust the passed RangeVar to be accurate, and allow the RangeVar to
possibly be NULL --- which it is anyway for a whole-database VACUUM,
though we accidentally didn't crash for that case.
The passed RangeVar is in fact inaccurate when dealing with a child
partition, as of v10, and it has been wrong for a whole long time in the
case of vacuum_rel() recursing to a TOAST table. None of these things
present visible bugs up to now, because the passed RangeVar is in fact
only consulted for autovacuum logging, and in that particular context it's
always accurate because autovacuum doesn't let vacuum.c expand partitions
nor recurse to toast tables. Still, this seems like trouble waiting to
happen, so let's nail the door at least partly shut. (Further cleanup
is planned, in HEAD only, as part of the pending feature patch.)
Fix some sadly inaccurate/obsolete comments too. Back-patch to v10.
Michael Paquier and Tom Lane
Discussion: https://postgr.es/m/25023.1506107590@sss.pgh.pa.us
Avoid the coding pattern "*op->resvalue = f();", as some compilers think
that requires them to evaluate "op->resvalue" before the function call.
Unless there are lots of free registers, this can lead to a useless
register spill and reload across the call.
I changed all the cases like this in ExecInterpExpr(), but didn't bother
in the out-of-line opcode eval subroutines, since those are presumably
not as performance-critical.
Discussion: https://postgr.es/m/2508.1506630094@sss.pgh.pa.us
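The coding pattern discussed above and its replacement look like this
in miniature. The sketch uses hypothetical types (DatumSketch,
StepSketch) and a stand-in function; both variants store the same
value, and the only difference is that the second one lets the
compiler load op->resvalue after the call instead of holding it in a
register across the call. Whether that helps depends on the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t DatumSketch;

typedef struct
{
    DatumSketch *resvalue;      /* where the step stores its result */
} StepSketch;

/* stand-in for an arbitrary function call made by an expression step */
static DatumSketch
compute(DatumSketch arg)
{
    return arg * 2 + 1;
}

/* pattern to avoid: destination address may be evaluated before the call */
void
store_direct(StepSketch *op, DatumSketch arg)
{
    assert(op != NULL && op->resvalue != NULL);
    *op->resvalue = compute(arg);
}

/* preferred pattern: call first, then fetch the destination and store */
void
store_via_local(StepSketch *op, DatumSketch arg)
{
    DatumSketch d;

    assert(op != NULL && op->resvalue != NULL);
    d = compute(arg);
    *op->resvalue = d;
}
```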
Add bgw_type field to background worker structure. It is intended to be
set to the same value for all workers of the same type, so they can be
grouped in pg_stat_activity, for example.
The backend_type column in pg_stat_activity now shows bgw_type for a
background worker. The ps listing also no longer calls out that a
process is a background worker but just shows the bgw_type. That way,
being a background worker is more of an implementation detail now that
is not shown to the user. However, most log messages still refer to
'background worker "%s"'; otherwise constructing sensible and
translatable log messages would become tricky.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
At the time replacement_sort_tuples was introduced, there were still
cases where replacement selection sort noticeably outperformed using
quicksort even for the first run. However, those cases seem to have
evaporated as a result of further improvements made since that time
(and perhaps also advances in CPU technology). So remove replacement
selection and the controlling GUC entirely. This makes tuplesort.c
noticeably simpler and probably paves the way for further
optimizations someone might want to do later.
Peter Geoghegan, with review and testing by Tomas Vondra and me.
Discussion: https://postgr.es/m/CAH2-WzmmNjG_K0R9nqYwMq3zjyJJK+hCbiZYNGhAy-Zyjs64GQ@mail.gmail.com
Vacuum calls page-level HOT prune to remove dead HOT tuples before doing
liveness checks (HeapTupleSatisfiesVacuum) on the remaining tuples. But
concurrent transaction commit/abort may turn DEAD some of the HOT tuples
that survived the prune, before HeapTupleSatisfiesVacuum tests them.
This happens to activate the code that decides to freeze the tuple ...
which resuscitates it, duplicating data.
(This is especially bad if there are any unique constraints, because those
are now internally violated due to the duplicate entries, though you
won't know until you try to REINDEX or dump/restore the table.)
One possible fix would be to simply skip doing anything to the tuple,
and hope that the next HOT prune would remove it. But there is a
problem: if the tuple is older than freeze horizon, this would leave an
unfrozen XID behind, and if no HOT prune happens to clean it up before
the containing pg_clog segment is truncated away, it'd later cause an
error when the XID is looked up.
Fix the problem by having the tuple freezing routines cope with the
situation: don't freeze the tuple (and keep it dead). In the cases that
the XID is older than the freeze age, set the HEAP_XMAX_COMMITTED flag
so that there is no need to look up the XID in pg_clog later on.
An isolation test is included, authored by Michael Paquier, loosely
based on Daniel Wood's original reproducer. It only tests one
particular scenario, though, not all the possible ways for this problem
to surface; it would be good to have a more reliable way to test this more
fully, but it'd require more work.
In message https://postgr.es/m/20170911140103.5akxptyrwgpc25bw@alvherre.pgsql
I outlined another test case (more closely matching Dan Wood's) that
exposed a few more ways for the problem to occur.
Backpatch all the way back to 9.3, where this problem was introduced by
multixact juggling. In branches 9.3 and 9.4, this includes a backpatch
of commit e5ff9fefcd50 (of 9.5 era), since the original is not
correctable without matching the coding pattern in 9.5 up.
Reported-by: Daniel Wood
Diagnosed-by: Daniel Wood
Reviewed-by: Yi Wen Wong, Michaël Paquier
Discussion: https://postgr.es/m/E5711E62-8FDF-4DCA-A888-C200BF6B5742@amazon.com
float8_numeric() and float4_numeric() failed to consider the possibility
that the input is an IEEE infinity. The results depended on the
platform-specific behavior of sprintf(): on most platforms you'd get
something like
    ERROR: invalid input syntax for type numeric: "inf"
but at least on Windows it's possible for the conversion to succeed and
deliver a finite value (typically 1), due to a nonstandard output format
from sprintf and lack of syntax error checking in these functions.
Since our numeric type lacks the concept of infinity, a suitable conversion
is impossible; the best thing to do is throw an explicit error before
letting sprintf do its thing.
While at it, let's use snprintf not sprintf. Overrunning the buffer
should be impossible if sprintf does what it's supposed to, but this
is cheap insurance against a stack smash if it doesn't.
Problem reported by Taiki Kondo. Patch by me based on fix suggestion
from KaiGai Kohei. Back-patch to all supported branches.
Discussion: https://postgr.es/m/12A9442FBAE80D4E8953883E0B84E088C8C7A2@BPXM01GP.gisp.nec.co.jp
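The shape of the float-to-numeric fix above (reject infinity up
front, then format with snprintf rather than sprintf) can be sketched
as below. This is not the code from float8_numeric(): the function
name is hypothetical, an error is reported through the return value
instead of being thrown, and NaN input is not handled at all.

```c
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdio.h>

/*
 * Format val as text suitable for feeding to a numeric parser.
 * Returns 0 on success, or -1 if val is an infinity, which a type
 * without the concept of infinity cannot represent.
 */
int
float8_to_numeric_text(double val, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);

    /* check before formatting, so platform quirks in printf cannot matter */
    if (isinf(val))
        return -1;              /* real code would raise an error here */

    /* snprintf cannot overrun buf even if the output is unexpectedly long */
    snprintf(buf, buflen, "%.*g", DBL_DIG, val);
    return 0;
}
```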
This reverts commit 15bc038f9, along with the followon commits 1635e80d3
and 984c92074 that tried to clean up the problems exposed by bug #14825.
The result was incomplete because it failed to address parallel-query
requirements. With 10.0 release so close upon us, now does not seem like
the time to be adding more code to fix that. I hope we can un-revert this
code and add the missing parallel query support during the v11 cycle.
Back-patch to v10.
Discussion: https://postgr.es/m/20170922185904.1448.16585@wrigleys.postgresql.org
posix_fallocate() is not quite a drop-in replacement for fallocate(),
because it is defined to return the error code as its function result,
not in "errno". I (tgl) missed this because RHEL6's version seems
to set errno as well. That is not the case on more modern Linuxen,
though, as per buildfarm results.
Aside from fixing the return-convention confusion, remove the test
for ENOSYS; we expect that glibc will mask that for posix_fallocate,
though it does not for fallocate. Keep the test for EINTR, because
POSIX specifies that as a possible result, and buildfarm results
suggest that it can happen in practice.
Back-patch to 9.4, like the previous commit.
Thomas Munro
Discussion: https://postgr.es/m/1002664500.12301802.1471008223422.JavaMail.yahoo@mail.yahoo.com
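The return-convention point made above about posix_fallocate() is
easy to get wrong, so here is a sketch of a caller that handles it as
described: the error code is taken from the function result, not from
errno, and EINTR is retried. The wrapper name reserve_file_space is
hypothetical, and the sketch assumes a platform that provides
posix_fallocate().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Reserve "size" bytes for fd starting at offset 0.
 * Returns 0 on success, otherwise an errno-style error code.
 */
int
reserve_file_space(int fd, off_t size)
{
    int         rc;

    assert(size > 0);

    /* the error code is the function result; errno is not consulted */
    do
    {
        rc = posix_fallocate(fd, 0, size);
    } while (rc == EINTR);

    return rc;
}
```

A caller that wants to report the failure through the usual
errno-based machinery would assign the returned code to errno itself.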
The blacklist mechanism added by the preceding commit directly fixes
most of the practical cases that the same-transaction test was meant
to cover. What remains is use-cases like
    begin;
    create type e as enum('x');
    alter type e add value 'y';
    -- use 'y' somehow
    commit;
However, because the same-transaction test is heuristic, it fails on
small variants of that, such as renaming the type or changing its
owner. Rather than try to explain the behavior to users, let's
remove it and just have a rule that the newly added value can't be
used before being committed, full stop. Perhaps later it will be
worth the implementation effort and overhead to have a more accurate
test for type-was-created-in-this-transaction. We'll wait for some
field experience with v10 before deciding to do that.
Back-patch to v10.
Discussion: https://postgr.es/m/20170922185904.1448.16585@wrigleys.postgresql.org
Commit 15bc038f9 allowed ALTER TYPE ADD VALUE to be executed inside
transaction blocks, by disallowing the use of the added value later
in the same transaction, except under limited circumstances. However,
the test for "limited circumstances" was heuristic and could reject
references to enum values that were created during CREATE TYPE AS ENUM,
not just later. This breaks the use-case of restoring pg_dump scripts
in a single transaction, as reported in bug #14825 from Balazs Szilfai.
We can improve this by keeping a "blacklist" table of enum value OIDs
created by ALTER TYPE ADD VALUE during the current transaction. Any
visible-but-uncommitted value whose OID is not in the blacklist must
have been created by CREATE TYPE AS ENUM, and can be used safely
because it could not have a lifespan shorter than its parent enum type.
This change also removes the restriction that a renamed enum value
can't be used before being committed (unless it was on the blacklist).
Andrew Dunstan, with cosmetic improvements by me.
Back-patch to v10.
Discussion: https://postgr.es/m/20170922185904.1448.16585@wrigleys.postgresql.org
A FOR ALL TABLES publication naturally considers all base tables to be
candidates for replication. This includes transient heaps that are
created during a table rewrite during DDL. This causes failures on the
subscriber side because it will not have a table like pg_temp_16386 to
receive data (and if it did, it would be the wrong table).
To prevent this problem, we filter out any tables that match this
naming pattern and match an actual table from FOR ALL TABLES
publications. This is only a heuristic, meaning that user tables that
match that naming could accidentally be omitted. A more robust solution
might require an explicit marking of such tables in pg_class somehow.
Reported-by: yxq <yxq@o2.pl>
Bug: #14785
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Petr Jelinek <petr.jelinek@2ndquadrant.com>
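The naming heuristic described above for FOR ALL TABLES publications
has two parts: the relation name must look like pg_temp_<number>, and
that number must be the OID of an actual table. A sketch of such a
test follows. The function names and the callback type are
hypothetical, and the catalog lookup is replaced by a caller-supplied
function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* caller-supplied stand-in for a catalog lookup */
typedef bool (*TableExistsFn) (unsigned oid);

/*
 * Does relname look like the transient heap of a table rewrite, that is
 * "pg_temp_<oid>" where <oid> identifies an existing table?
 */
bool
is_probable_rewrite_heap(const char *relname, TableExistsFn table_exists)
{
    unsigned    oid;
    int         consumed = 0;

    assert(relname != NULL && table_exists != NULL);

    /* %n records how many characters were consumed by the match */
    if (sscanf(relname, "pg_temp_%u%n", &oid, &consumed) != 1)
        return false;

    /* reject names with trailing characters after the number */
    if (relname[consumed] != '\0')
        return false;

    return table_exists(oid);
}

/* hypothetical catalog for the example: only OID 16386 is a table */
bool
only_16386_exists(unsigned oid)
{
    return oid == 16386;
}
```

As the text says, this is only a heuristic: a user table that happens
to be named this way, with a number matching some table's OID, would
be filtered out as well.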
This was intended as infrastructure for weakening VACUUM's locking
requirements, similar to what was done for btree indexes in commit
2ed5b87f96. However, for hash indexes,
it seems that the improvements which are possible are actually
extremely marginal. Furthermore, performing the LSN cross-check will
end up skipping cleanup far more often than is necessary; we only care
about page modifications due to a VACUUM, but the LSN check will fail
if ANY modification has occurred. So, rather than pressing forward
with that "optimization", just rip the LSN field out.
Patch by me, reviewed by Ashutosh Sharma and Amit Kapila
Discussion: http://postgr.es/m/CAA4eK1JxqqcuC5Un7YLQVhOYSZBS+t=3xqZuEkt5RyquyuxpwQ@mail.gmail.com
On Linux, shared memory segments created with shm_open() are backed by
swap files created in tmpfs. If the swap file needs to be extended,
but there's no tmpfs space left, you get a very unfriendly SIGBUS trap.
To avoid this, force allocation of the full request size when we create
the segment. This adds a few cycles, but none that we wouldn't expend
later anyway, assuming the request isn't hugely bigger than the actual
need.
Make this code #ifdef __linux__, because (a) there's not currently a
reason to think the same problem exists on other platforms, and (b)
applying posix_fallocate() to an FD created by shm_open() isn't very
portable anyway.
Back-patch to 9.4 where the DSM code came in.
Thomas Munro, per a bug report from Amul Sul
Discussion: https://postgr.es/m/1002664500.12301802.1471008223422.JavaMail.yahoo@mail.yahoo.com
If construct_array() or construct_md_array() were given a dimension of
zero, they'd produce an array that contains no elements but has positive
dimension. This violates a general expectation that empty arrays should
have ndims = 0; in particular, while arrays like this print as empty,
they don't compare equal to other empty arrays.
Up to now we've expected callers to avoid making such calls and instead
be careful to call construct_empty_array() if there would be no elements.
But this has always been an easily missed case, and we've repeatedly had to
fix callers to do it right. In bug #14826, Erwin Brandstetter pointed out
yet another such oversight, in ts_lexize(); and a bit of examination of
other call sites found at least two more with similar issues. So let's
fix the problem centrally and permanently by changing these two functions
to construct a proper zero-D empty array whenever the array would be empty.
This renders a few explicit calls of construct_empty_array() redundant,
but the only such place I found that really seemed worth changing was in
ExecEvalArrayExpr().
Although this fixes some very old bugs, no back-patch: the problem is
pretty minor and the risk of changing behavior seems to outweigh the
benefit in stable branches.
Discussion: https://postgr.es/m/20170923125723.1448.39412@wrigleys.postgresql.org
Discussion: https://postgr.es/m/20570.1506198383@sss.pgh.pa.us
There is no reason to ever prevent the use of SortSupport on Windows
when ICU locales are used. We previously avoided SortSupport on Windows
with UTF-8 server encoding and a non-C locale due to restrictions in
Windows' libc functionality.
This is now considered to be a restriction in one platform's libc
collation provider, and not a more general platform restriction.
Reported-by: Peter Geoghegan <pg@bowt.ie>
Invoke vacuum(), as well as "work item" processing, in the PortalContext
that do_autovacuum() has manufactured, which will be reset before each
such invocation. This ensures cleanup of any memory leaked by these
operations. It also avoids the rather dangerous practice of calling
vacuum() in a context that vacuum() itself will destroy while it runs.
There's no known live bug there, but it's not hard to imagine introducing
one if we leave it like this.
Tom Lane, reviewed by Michael Paquier and Alvaro Herrera
Discussion: https://postgr.es/m/13849.1506114543@sss.pgh.pa.us
The file handling functions from fd.c were called with a diverse mix of
notations for the file permissions when they were opening new files.
Almost all files created by the server should have the same permissions
set. So change the API so that e.g. OpenTransientFile() automatically
uses the standard permissions set, and OpenTransientFilePerm() is a new
function that takes an explicit permissions set for the few cases where
it is needed. This also saves an unnecessary argument for call sites
that are just opening an existing file.
While we're reviewing these APIs, get rid of the FileName typedef and
use the standard const char * for the file name and mode_t for the file
mode. This makes these functions match other file handling functions
and removes an unnecessary layer of mysteriousness. We can also get rid
of a few casts that way.
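
As a rough illustration of the API shape described above (this is not the
actual fd.c code), the default-permissions variant can be a thin wrapper
around the explicit one. The names open_file(), open_file_perm() and
DEFAULT_FILE_MODE are hypothetical stand-ins chosen for this sketch.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical stand-in for the server's standard permissions set. */
#define DEFAULT_FILE_MODE (S_IRUSR | S_IWUSR)

/* Explicit-permissions variant, for the few callers that need one. */
static int
open_file_perm(const char *fileName, int fileFlags, mode_t fileMode)
{
    return open(fileName, fileFlags, fileMode);
}

/* Common variant: callers no longer pass a mode argument at all. */
static int
open_file(const char *fileName, int fileFlags)
{
    return open_file_perm(fileName, fileFlags, DEFAULT_FILE_MODE);
}
```

A caller that merely opens an existing file then writes
open_file(path, O_RDONLY) and never mentions permissions.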
Author: David Steele <david@pgmasters.net>
In two cases, we set a different umask for some piece of code and
restore it afterwards. But if the contained code errors out, the umask
is not restored. So add TRY/CATCH blocks to fix that.
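
The pattern can be sketched outside the backend with setjmp()/longjmp()
standing in for the TRY/CATCH machinery. Everything here (error_jump,
do_work(), run_with_umask()) is a hypothetical illustration, not the
patched code; in particular, the sketch reports failure with a return
value where backend code would restore the umask and then re-throw.

```c
#include <setjmp.h>
#include <sys/stat.h>
#include <sys/types.h>

static jmp_buf error_jump;      /* stand-in for the error-handling stack */

/* Hypothetical work that may "error out" by jumping to the handler. */
static void
do_work(int fail)
{
    if (fail)
        longjmp(error_jump, 1);
}

/* Returns 0 on success, -1 on failure; the umask is restored either way. */
static int
run_with_umask(mode_t temp_mask, int fail)
{
    volatile mode_t saved = umask(temp_mask);
    int         result = 0;

    if (setjmp(error_jump) == 0)
        do_work(fail);          /* the "TRY" part */
    else
        result = -1;            /* the "CATCH" part */

    umask(saved);               /* reached on both paths */
    return result;
}
```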
Commit 09cb5c0e7d added a similar
optimization to btree back in 2006, but nobody bothered to implement
the same thing for hash indexes, probably because they weren't
WAL-logged and had lots of other performance problems as well. As
with the corresponding btree case, this eliminates the problem of
potentially needing to refind our position within the page, and cuts
down on pin/unpin traffic as well.
Ashutosh Sharma, reviewed by Alexander Korotkov, Jesper Pedersen,
Amit Kapila, and me. Some final edits to comments and README by
me.
Discussion: http://postgr.es/m/CAE9k0Pm3KTx93K8_5j6VMzG4h5F+SyknxUwXrN-zqSZ9X8ZS3w@mail.gmail.com
Adjust commentary in regc_pg_locale.c to remove mention of the possibility
of not having <wctype.h> functions, since we no longer consider that.
Eliminate duplicate code in wparser_def.c by generalizing the p_iswhat
macro to take a parameter saying what to return for non-ASCII chars
in C locale. (That's not really a consequence of the
USE_WIDE_UPPER_LOWER-ectomy, but I noticed it while doing that.)
These functions are required by SUS v2, which is our minimum baseline
for Unix platforms, and are present on all interesting Windows versions
as well. Even our oldest buildfarm members have them. Thus, we were not
testing the "!USE_WIDE_UPPER_LOWER" code paths, which explains why the bug
fixed in commit e6023ee7f escaped detection. Per discussion, there seems
to be no more real-world value in maintaining this option. Hence, remove
the configure-time tests for wcstombs() and towlower(), remove the
USE_WIDE_UPPER_LOWER symbol, and remove all the !USE_WIDE_UPPER_LOWER code.
There's not actually all that much of the latter, but simplifying the #if
nests is a win in itself.
Discussion: https://postgr.es/m/20170921052928.GA188913@rfd.leadboat.com
The placement of the ifdef blocks in formatting.c was pretty bogus, so
the code failed to compile if USE_WIDE_UPPER_LOWER was not defined.
Reported-by: Peter Geoghegan <pg@bowt.ie>
Reported-by: Noah Misch <noah@leadboat.com>
Previously, the code didn't think about this case and would just try to
analyze such a column twice. That would fail at the point of inserting
the second version of the pg_statistic row, with obscure error messages
like "duplicate key value violates unique constraint" or "tuple already
updated by self", depending on context and PG version. We could allow
the case by ignoring duplicate column specifications, but it seems better
to reject it explicitly.
The bogus error messages seem like arguably a bug, so back-patch to
all supported versions.
Nathan Bossart, per a report from Michael Paquier, and whacked
around a bit by me.
Discussion: https://postgr.es/m/E061A8E3-5E3D-494D-94F0-E8A9B312BBFC@amazon.com
These variables are only ever written to in assertion-enabled builds,
and the latest Microsoft compilers complain about such variables in
non-assertion-enabled builds.
Apparently they don't worry so much about variables that are written to
but not read from, so most of our PG_USED_FOR_ASSERTS_ONLY variables
don't cause the problem.
Discussion: https://postgr.es/m/7800.1505950322@sss.pgh.pa.us
This is not used for anything yet, but it is necessary infrastructure
for partition-wise join and for partition pruning without constraint
exclusion.
Ashutosh Bapat, reviewed by Amit Langote and with quite a few changes,
mostly cosmetic, by me. Additional review and testing of this patch
series by Antonin Houska, Amit Khandekar, Rafia Sabih, Rajkumar
Raghuwanshi, Thomas Munro, and Dilip Kumar.
Discussion: http://postgr.es/m/CAFjFpRfneFG3H+F6BaiXemMrKF+FY-POpx3Ocy+RiH3yBmXSNw@mail.gmail.com
pg_newlocale_from_collation() used malloc() and strdup() directly,
which is generally not per backend coding style, and it didn't bother
to check for failure results, but would just SIGSEGV instead. Also,
if one of the numerous error checks in the middle of the function
failed, the already-allocated memory would be leaked permanently.
Admittedly, it's not a lot of memory, but it could build up if this
function were called repeatedly for a bad collation.
The first two problems are easily cured by palloc'ing in TopMemoryContext
instead of calling libc directly. We can fairly easily dodge the leakage
problem for the struct pg_locale_struct by filling in a temporary variable
and allocating permanent storage only once we reach the bottom of the
function. It's harder to get rid of the potential leakage for ICU's copy
of the collcollate string, but at least that's only allocated after most
of the error checks; so live with that aspect.
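
The "fill a temporary, allocate at the bottom" idea can be sketched in
plain C as follows. The struct locale_info and build_locale_info() are
hypothetical, and malloc() stands in for an allocation in a long-lived
memory context; the point is only that nothing permanent exists while the
error checks run.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, much-simplified stand-in for the locale struct. */
typedef struct locale_info
{
    int         provider;
    int         deterministic;
} locale_info;

/* Returns a permanent copy, or NULL if a check or the allocation fails. */
static locale_info *
build_locale_info(int provider)
{
    locale_info tmp;
    locale_info *result;

    memset(&tmp, 0, sizeof(tmp));

    /* Error checks run while only the stack copy exists: nothing to leak. */
    if (provider < 0)
        return NULL;
    tmp.provider = provider;
    tmp.deterministic = 1;

    /* Allocate permanent storage only once every check has passed. */
    result = malloc(sizeof(*result));
    if (result == NULL)
        return NULL;
    *result = tmp;
    return result;
}
```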
Back-patch to v10 where this code came in, with one or another of the
ICU patches.
Remove gratuitous differences in the process names shown in
pg_stat_activity.backend_type and the ps output.
Reviewed-by: Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com>
For performance reasons a larger segment size than the default 16MB
can be useful. A larger segment size has two main benefits: firstly,
in setups using archiving, it makes it easier to write scripts that
can keep up with higher amounts of WAL; secondly, the WAL has to be
written and synced to disk less frequently.
But at the same time, large segment sizes are disadvantageous for
smaller databases. So far the segment size had to be configured at
compile time, often making it unrealistic to choose one fitting a
particular load. Therefore change it to an initdb-time setting.
This includes a breaking change to the xlogreader.h API, which now
requires the current segment size to be configured. For that and
similar reasons a number of binaries had to be taught how to recognize
the current segment size.
Author: Beena Emerson, editorialized by Andres Freund
Reviewed-By: Andres Freund, David Steele, Kuntal Ghosh, Michael
Paquier, Peter Eisentraut, Robert Haas, Tushar Ahuja
Discussion: https://postgr.es/m/CAOG9ApEAcQ--1ieKbhFzXSQPw_YLmepaa4hNdnY5+ZULpt81Mw@mail.gmail.com
The preceding patch allowed us to remove useless GiST support functions.
This patch actually does that for all the no-op cases in the core GiST
code. This buys us whatever performance gain is to be had, and more
importantly exercises the preceding patch.
There remain no-op functions in the contrib GiST opclasses, but those
will take more work to remove.
Discussion: https://postgr.es/m/CAJEAwVELVx9gYscpE=Be6iJxvdW5unZ_LkcAaVNSeOwvdwtD=A@mail.gmail.com
There are common use-cases in which the compress and/or decompress
functions can be omitted, with the result being that we make no
data transformation when storing or retrieving index values.
Previously, you had to provide a no-op function anyway, but this
patch allows such opclass support functions to be omitted.
Furthermore, if the compress function is omitted, then the core code
knows that the stored representation is the same as the original data.
This means we can allow index-only scans without requiring a fetch
function to be provided either. Previously you had to provide a
no-op fetch function if you wanted IOS to work.
This reportedly provides a small performance benefit in such cases,
but IMO the real reason for doing it is just to reduce the amount of
useless boilerplate code that has to be written for GiST opclasses.
Andrey Borodin, reviewed by Dmitriy Sarafannikov
Discussion: https://postgr.es/m/CAJEAwVELVx9gYscpE=Be6iJxvdW5unZ_LkcAaVNSeOwvdwtD=A@mail.gmail.com
The use of strnlen rather than strlen was just paranoia. Instead of
giving up on the paranoia, just implement the safeguard
differently. And add a comment explaining why we're careful.
Author: Andres Freund
Discussion: https://postgr.es/m/E1duOkJ-0001Mc-U5@gemulon.postgresql.org
Previously multi-byte aware truncation was done on every
pgstat_report_activity() call - proving to be a bottleneck for
workloads with long query strings that execute quickly.
Instead move the truncation to the read side, which commonly is
executed far less frequently. That's possible because all server
encodings allow determining the length of a multi-byte character from
the first byte.
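
To illustrate why the first byte is enough, here is a minimal sketch of
read-side clipping for UTF-8 only. It is not pgstat_clip_activity();
utf8_char_len() and clip_length() are hypothetical names, and a real
implementation would dispatch on the server encoding.

```c
#include <stddef.h>
#include <string.h>

/* Length in bytes of a UTF-8 character, judged from its first byte. */
static size_t
utf8_char_len(unsigned char first)
{
    if (first < 0x80)
        return 1;
    if ((first & 0xE0) == 0xC0)
        return 2;
    if ((first & 0xF0) == 0xE0)
        return 3;
    if ((first & 0xF8) == 0xF0)
        return 4;
    return 1;                   /* invalid lead byte: treat as one byte */
}

/* Largest prefix length <= limit that does not split a character. */
static size_t
clip_length(const char *s, size_t limit)
{
    size_t      len = strlen(s);
    size_t      pos = 0;

    while (pos < len)
    {
        size_t      clen = utf8_char_len((unsigned char) s[pos]);

        if (pos + clen > limit || pos + clen > len)
            break;
        pos += clen;
    }
    return pos;
}
```

The writer can then store the raw bytes cheaply, and only readers pay for
walking the string to find a safe cut point.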
Rename PgBackendStatus.st_activity to st_activity_raw so existing
extension users of the field break - their code has to be adjusted to
use pgstat_clip_activity().
Author: Andres Freund
Tested-By: Kuntal Ghosh
Reviewed-By: Robert Haas, Tom Lane
Discussion: https://postgr.es/m/20170912071948.pa7igbpkkkviecpz@alap3.anarazel.de
Previously statement_timeout, in the extended protocol, affected all
messages till a Sync message. For clients that pipeline/batch query
execution that's problematic.
Instead disable timeout after each Execute message, and enable, if
necessary, the timer in start_xact_command(). As that's done only for
Execute and not Parse / Bind, pipelining the latter two could still
cause undesirable timeouts. But a survey of protocol implementations
shows that all drivers issue Sync messages when preparing, and adding
timeout rearming to both is fairly expensive for the common parse /
bind / execute sequence.
Author: Tatsuo Ishii, editorialized by Andres Freund
Reviewed-By: Takayuki Tsunakawa, Andres Freund
Discussion: https://postgr.es/m/20170222.115044.1665674502985097185.t-ishii@sraoss.co.jp
The bug was caused by not re-reading the control file during crash
recovery restarts, which led to an attempt to pfree() shared memory
contents. The fix is to re-read the control file, which seems good
anyway.
It's unclear as of this moment whether we want to keep the
refactoring introduced in the commit referenced above, or come up with
an alternative approach. But fixing the bug in the meantime seems
like a good idea regardless.
A followup commit will introduce regression test coverage for crash
restarts.
Reported-By: Tom Lane
Discussion: https://postgr.es/m/14134.1505572349@sss.pgh.pa.us
Make the btree page-flags test macros (P_ISLEAF and friends) return clean
boolean values, rather than values that might not fit in a bool. Use them
in a few places that were randomly referencing the flag bits directly.
In passing, change access/nbtree/'s only direct use of BUFFER_LOCK_SHARE to
BT_READ. (Some think we should go the other way, but as long as we have
BT_READ/BT_WRITE, let's use them consistently.)
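
A small sketch of why "clean boolean" matters. The flag names and bit
positions below are hypothetical, chosen so that one bit lies above the
low byte; to_byte_bool() mimics storing a result in a one-byte boolean.

```c
/* Hypothetical flag bits; the second one lies above the low byte. */
#define PAGE_FLAG_LEAF      (1 << 0)
#define PAGE_FLAG_EXAMPLE   (1 << 8)

/* Raw test: yields the masked bits themselves (0 or the bit's value). */
#define FLAG_RAW(flags, bit)    ((flags) & (bit))

/* Clean test: always yields exactly 0 or 1. */
#define FLAG_IS_SET(flags, bit) (((flags) & (bit)) != 0)

/* Squeeze a test result into one byte, as a char-sized bool would. */
static unsigned char
to_byte_bool(int value)
{
    return (unsigned char) value;
}
```

With flags = PAGE_FLAG_EXAMPLE, the raw test produces 256, which becomes 0
once truncated to a byte, while the clean test produces 1.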
Masahiko Sawada, reviewed by Doug Doole
Discussion: https://postgr.es/m/CAD21AoBmWPeN=WBB5Jvyz_Nt3rmW1ebUyAnk3ZbJP3RMXALJog@mail.gmail.com
By project convention, these names should include "P" when dealing with a
pointer type; that is, if the result of a GETARG macro is of type FOO *,
it should be called PG_GETARG_FOO_P not just PG_GETARG_FOO. Some newer
types such as JSONB and ranges had not followed the convention, and a
number of contrib modules hadn't gotten that memo either. Rename the
offending macros to improve consistency.
In passing, fix a few places that thought PG_DETOAST_DATUM() returns
a Datum; it does not, it returns "struct varlena *". Applying
DatumGetPointer to that happens not to cause any bad effects today,
but it's formally wrong. Also, adjust an ltree macro that was designed
without any thought for what pgindent would do with it.
This is all cosmetic and shouldn't have any impact on generated code.
Mark Dilger, some further tweaks by me
Discussion: https://postgr.es/m/EA5676F4-766F-4F38-8348-ECC7DB427C6A@gmail.com
If we failed to get a background worker slot, the code just walked
away from the logicalrep-worker slot it already had, leaving that
looking like the worker is still starting up. This led to an indefinite
hang in subscription startup, as reported by Thomas Munro. We must
release the slot on failure.
Also fix a thinko: we must capture the worker slot's generation before
releasing LogicalRepWorkerLock the first time, else testing to see if
it's changed is pretty meaningless.
BTW, the CHECK_FOR_INTERRUPTS() in WaitForReplicationWorkerAttach is a
ticking time bomb, even without considering the possibility of elog(ERROR)
in one of the other functions it calls. Really, this entire business needs
a redesign with some actual thought about error recovery. But for now
I'm just band-aiding the case observed in testing.
Back-patch to v10 where this code was added.
Discussion: https://postgr.es/m/CAEepm=2bP3TBMFBArP6o20AZaRduWjMnjCjt22hSdnA-EvrtCw@mail.gmail.com
When ALTER SUBSCRIPTION DISABLE is run in the same transaction before
DROP SUBSCRIPTION, the latter will hang because workers will still be
running, not having seen the DISABLE committed, and DROP SUBSCRIPTION
will wait until the workers have vacated the replication origin slots.
Previously, DROP SUBSCRIPTION killed the logical replication workers
immediately only if it was going to drop the replication slot, otherwise
it scheduled the worker killing for the end of the transaction, as a
result of 7e174fa793. This, however,
causes the present problem. To fix, kill the workers immediately in all
cases. This covers all cases: A subscription that doesn't have a
replication slot must be disabled. It was either disabled in the same
transaction, or it was already disabled before the current transaction,
but then there shouldn't be any workers left and this won't make a
difference.
Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Discussion: https://www.postgresql.org/message-id/flat/87mv6av84w.fsf%40ars-thinkpad
AfterTriggerEndQuery correctly notes that the query_stack could get
repalloc'd during a trigger firing, but it nonetheless passes the address
of a query_stack entry to afterTriggerInvokeEvents, so that if such a
repalloc occurs, afterTriggerInvokeEvents is already working with an
obsolete dangling pointer while it scans the rest of the events. Oops.
The only code at risk is its "delete_ok" cleanup code, so we can
prevent unsafe behavior by passing delete_ok = false instead of true.
However, that could have a significant performance penalty, because the
point of passing delete_ok = true is to not have to re-scan possibly
a large number of dead trigger events on the next time through the loop.
There's more than one way to skin that cat, though. What we can do is
delete all the "chunks" in the event list except the last one, since
we know all events in them must be dead. Deleting the chunks is work
we'd have had to do later in AfterTriggerEndQuery anyway, and it ends
up saving rescanning of just about the same events we'd have gotten
rid of with delete_ok = true.
In v10 and HEAD, we also have to be careful to mop up any per-table
after_trig_events pointers that would become dangling. This is slightly
annoying, but I don't think that normal use-cases will traverse this code
path often enough for it to be a performance problem.
It's pretty hard to hit this in practice because of the unlikelihood
of the query_stack getting resized at just the wrong time. Nonetheless,
it's definitely a live bug of ancient standing, so back-patch to all
supported branches.
Discussion: https://postgr.es/m/2891.1505419542@sss.pgh.pa.us
Commit 0f79440fb introduced a mechanism to keep AFTER STATEMENT triggers
from firing more than once per statement, which was formerly possible
if more than one FK enforcement action had to be applied to a given
table. Add a similar mechanism for BEFORE STATEMENT triggers, so that
we don't have the unexpected situation of firing BEFORE STATEMENT
triggers more often than AFTER STATEMENT.
As with the previous patch, back-patch to v10.
Discussion: https://postgr.es/m/22315.1505584992@sss.pgh.pa.us
The elements of RecordCacheArray are TupleDesc, not TupleDesc *.
Those are actually the same size, so that this error is harmless,
but it's still wrong --- and it might bite us someday, if TupleDesc
ever became a struct, say.
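
For illustration, the two size computations look like this. The
TupleDescData struct is a simplified stand-in and the function names are
hypothetical; the sketch only shows that the wrong element type happens to
give the same size as long as TupleDesc is itself a pointer.

```c
#include <stddef.h>

/* Simplified stand-in: TupleDesc is a pointer to a descriptor struct. */
typedef struct TupleDescData
{
    int         natts;
} TupleDescData;
typedef TupleDescData *TupleDesc;

/* Wrong element type: pointer to TupleDesc rather than TupleDesc. */
static size_t
array_bytes_wrong(size_t nelems)
{
    return nelems * sizeof(TupleDesc *);
}

/* Right element type: matches what the array actually stores. */
static size_t
array_bytes_right(size_t nelems)
{
    return nelems * sizeof(TupleDesc);
}
```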
Per Coverity.
The standard says that all changes of the same kind (insert, update, or
delete) caused in one table by a single SQL statement should be reported
in a single transition table; and by that, they mean to include foreign key
enforcement actions cascading from the statement's direct effects. It's
also reasonable to conclude that if the standard had wCTEs, they would say
that effects of wCTEs applying to the same table as each other or the outer
statement should be merged into one transition table. We weren't doing it
like that.
Hence, arrange to merge tuples from multiple update actions into a single
transition table as much as we can. There is a problem, which is that if
the firing of FK enforcement triggers and after-row triggers with
transition tables is interspersed, we might need to report more tuples
after some triggers have already seen the transition table. It seems like
a bad idea for the transition table to be mutable between trigger calls.
There's no good way around this without a major redesign of the FK logic,
so for now, resolve it by opening a new transition table each time this
happens.
Also, ensure that AFTER STATEMENT triggers fire just once per statement,
or once per transition table when we're forced to make more than one.
Previous versions of Postgres have allowed each FK enforcement query
to cause an additional firing of the AFTER STATEMENT triggers for the
referencing table, but that's certainly not per spec. (We're still
doing multiple firings of BEFORE STATEMENT triggers, though; is that
something worth changing?)
Also, forbid using transition tables with column-specific UPDATE triggers.
The spec requires such transition tables to show only the tuples for which
the UPDATE trigger would have fired, which means maintaining multiple
transition tables or else somehow filtering the contents at readout.
Maybe someday we'll bother to support that option, but it looks like a
lot of trouble for a marginal feature.
The transition tables are now managed by the AfterTriggers data structures,
rather than being directly the responsibility of ModifyTable nodes. This
removes a subtransaction-lifespan memory leak introduced by my previous
band-aid patch 3c4359521.
In passing, refactor the AfterTriggers data structures to reduce the
management overhead for them, by using arrays of structs rather than
several parallel arrays for per-query-level and per-subtransaction state.
I failed to resist the temptation to do some copy-editing on the SGML
docs about triggers, above and beyond merely documenting the effects
of this patch.
Back-patch to v10, because we don't want the semantics of transition
tables to change post-release.
Patch by me, with help and review from Thomas Munro.
Discussion: https://postgr.es/m/20170909064853.25630.12825@wrigleys.postgresql.org
This code is unsafe, as proven by buildfarm failures, because it tries
to access shared memory that might already be gone. It's also unnecessary,
because we're about to exit the process anyway and so the record type cache
should never be accessed again. The idea was to lay some foundations for
someday recycling workers --- which would require attaching to a different
shared tupdesc registry --- but that will require considerably more
thought. In the meantime let's save some bytes by just removing the
nonfunctional code.
Problem identification, and proposal to fix by removing functionality
from the detach function, by Thomas Munro. I went a bit further by
removing the function altogether.
Discussion: https://postgr.es/m/E1dsguX-00056N-9x@gemulon.postgresql.org
Tuples can have type RECORDOID and a typmod number that identifies a blessed
TupleDesc in a backend-private cache. To support the sharing of such tuples
through shared memory and temporary files, provide a typmod registry in
shared memory.
To achieve that, introduce per-session DSM segments, created on demand when a
backend first runs a parallel query. The per-session DSM segment has a
table-of-contents just like the per-query DSM segment, and initially the
contents are a shared record typmod registry and a DSA area to provide the
space it needs to grow.
State relating to the current session is accessed via a Session object
reached through global variable CurrentSession that may require significant
redesign further down the road as we figure out what else needs to be shared
or remodelled.
Author: Thomas Munro
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kZWqk3g5Ygn3MDV7A8dabUA@mail.gmail.com
Previously we read the control file in multiple places. But soon the
segment size will be configurable and stored in the control file, and
that needs to be available earlier than it currently is needed.
Instead of adding yet another place where it's read, refactor things
so there's a single processing of the control file during startup (in
EXEC_BACKEND that's every individual backend's startup).
Author: Andres Freund
Discussion: http://postgr.es/m/20170913092828.aozd3gvvmw67gmyc@alap3.anarazel.de
Flattening the partitioning hierarchy at this stage makes various
desirable optimizations difficult. The original use case for this
patch was partition-wise join, which wants to match up the partitions
in one partitioning hierarchy with those in another such hierarchy.
However, it now seems that it will also be useful in making partition
pruning work using the PartitionDesc rather than constraint exclusion,
because with a flattened expansion, we have no easy way to figure out
which PartitionDescs apply to which leaf tables in a multi-level
partition hierarchy.
As it turns out, we end up creating both rte->inh and !rte->inh RTEs
for each intermediate partitioned table, just as we previously did for
the root table. This seems unnecessary since the partitioned tables
have no storage and are not scanned. We might want to go back and
rejigger things so that no partitioned tables (including the parent)
need !rte->inh RTEs, but that seems to require some adjustments not
related to the core purpose of this patch.
Ashutosh Bapat, reviewed by me and by Amit Langote. Some final
adjustments by me.
Discussion: http://postgr.es/m/CAFjFpRd=1venqLL7oGU=C1dEkuvk2DJgvF+7uKbnPHaum1mvHQ@mail.gmail.com
With this change, the order of leaf partitions as returned by
RelationGetPartitionDispatchInfo should now be the same as the
order used by expand_inherited_rtentry. This will make it simpler
for future patches to match up the partition dispatch information
with the planner data structures. The new code is also, in my
opinion anyway, simpler and easier to understand.
Amit Langote, reviewed by Amit Khandekar. I also reviewed and
made a few cosmetic revisions.
Discussion: http://postgr.es/m/d98d4761-5071-1762-501e-0e15047c714b@lab.ntt.co.jp
During the development of d47cfef711 the CFI()s in ExecScan() were
moved back and forth, ending up in the wrong place. Thus queries that
largely spend their time in ExecScan(), and have neither projection
nor a qual, can't be cancelled in a timely manner.
Reported-By: Jeff Janes
Author: Andres Freund
Discussion: https://postgr.es/m/CAMkU=1weDXp8eLLPt9SO1LEUsJYYK9cScaGhLKpuN+WbYo9b5g@mail.gmail.com
Backpatch: 10, as d47cfef711
Historically, the selectivity functions have simply not distinguished
< from <=, or > from >=, arguing that the fraction of the population that
satisfies the "=" aspect can be considered to be vanishingly small, if the
comparison value isn't any of the most-common-values for the variable.
(If it is, the code path that executes the operator against each MCV will
take care of things properly.) But that isn't really true unless we're
dealing with a continuum of variable values, and in practice we seldom are.
If "x = const" would estimate a nonzero number of rows for a given const
value, then it follows that we ought to estimate different numbers of rows
for "x < const" and "x <= const", even if the const is not one of the MCVs.
Handling this more honestly makes a significant difference in edge cases,
such as the estimate for a tight range (x BETWEEN y AND z where y and z
are close together).
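
A tiny exact-count sketch of the argument, using a made-up integer column
rather than any planner code (count_lt() and count_le() are hypothetical
helpers): on discrete data the two comparisons differ by exactly the
number of rows equal to the constant, and a tight BETWEEN range is the
difference of the two.

```c
#include <stddef.h>

/* Number of values strictly less than c. */
static size_t
count_lt(const int *values, size_t n, int c)
{
    size_t      count = 0;
    size_t      i;

    for (i = 0; i < n; i++)
        if (values[i] < c)
            count++;
    return count;
}

/* Number of values less than or equal to c. */
static size_t
count_le(const int *values, size_t n, int c)
{
    size_t      count = 0;
    size_t      i;

    for (i = 0; i < n; i++)
        if (values[i] <= c)
            count++;
    return count;
}
```

For the values {1, 2, 2, 3, 3, 3, 4, 5, 6, 7} and c = 3, three rows
satisfy "x < 3" and six satisfy "x <= 3"; treating the two as the same
would estimate "x BETWEEN 3 AND 3" as empty when it actually matches
three rows.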
Hence, split scalarltsel into scalarltsel/scalarlesel, and similarly
split scalargtsel into scalargtsel/scalargesel. Adjust <= and >=
operator definitions to reference the new selectivity functions.
Improve the core ineq_histogram_selectivity() function to make a
correction for equality. (Along the way, I learned quite a bit about
exactly why that function gives good answers, which I tried to memorialize
in improved comments.)
The corresponding join selectivity functions were, and remain, just stubs.
But I chose to split them similarly, to avoid confusion and to prevent the
need for doing this exercise again if someone ever makes them less stubby.
In passing, change ineq_histogram_selectivity's clamp for extreme
probability estimates so that it varies depending on the histogram
size, instead of being hardwired at 0.0001. With the default histogram
size of 100 entries, you still get the old clamp value, but bigger
histograms should allow us to put more faith in edge values.
Tom Lane, reviewed by Aleksander Alekseev and Kuntal Ghosh
Discussion: https://postgr.es/m/12232.1499140410@sss.pgh.pa.us
The previous error message when attempting to run a general SQL command
in a physical replication WAL sender was a bit sloppy.
Reported-by: Fujii Masao <masao.fujii@gmail.com>
Commit 83aaac41c6 introduced the use of
LDAP_NO_ATTRS to avoid requesting a dummy attribute when doing search+bind
LDAP authentication. It turns out that not all LDAP implementations define
that macro, but its value is fixed by the protocol so we can define it
ourselves if it's missing.
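
The fallback amounts to a guarded definition like the one below. The value
"1.1" is the attribute selector that the LDAP protocol (RFC 4511) reserves
for "return no attributes"; the no_attrs_list array is only a hypothetical
usage example, not code from the patch.

```c
#include <stddef.h>

/* Not every LDAP header provides this macro; the protocol fixes its value. */
#ifndef LDAP_NO_ATTRS
#define LDAP_NO_ATTRS "1.1"
#endif

/* Attribute list for a search that wants entries but no attributes. */
static const char *const no_attrs_list[] = {LDAP_NO_ATTRS, NULL};
```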
Author: Thomas Munro
Reported-By: Ashutosh Sharma
Discussion: https://postgr.es/m/CAE9k0Pm6FKCfPCiAr26-L_SMGOA7dT_k0%2B3pEbB8%2B-oT39xRpw%40mail.gmail.com
Before, only filters of the form "(<ldapsearchattribute>=<user>)"
could be used to search an LDAP server. Introduce ldapsearchfilter
so that more general filters can be configured using patterns, like
"(|(uid=$username)(mail=$username))" and "(&(uid=$username)
(objectClass=posixAccount))". Also allow search filters to be included
in an LDAP URL.
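
The pattern expansion can be pictured with the following sketch, which
replaces every "$username" in a filter pattern. format_search_filter() is
a hypothetical name, the code uses malloc() rather than backend memory
management, and it does no escaping of filter metacharacters in the user
name, which real code would have to consider.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Return a malloc'd copy of pattern with each "$username" replaced by
 * user_name, or NULL if out of memory.
 */
static char *
format_search_filter(const char *pattern, const char *user_name)
{
    static const char placeholder[] = "$username";
    const size_t plen = sizeof(placeholder) - 1;
    const size_t ulen = strlen(user_name);
    size_t      outlen = 0;
    const char *p;
    char       *out;
    char       *dst;

    /* First pass: compute the size of the expanded filter. */
    for (p = pattern; *p != '\0';)
    {
        if (strncmp(p, placeholder, plen) == 0)
        {
            outlen += ulen;
            p += plen;
        }
        else
        {
            outlen++;
            p++;
        }
    }

    out = malloc(outlen + 1);
    if (out == NULL)
        return NULL;

    /* Second pass: copy, substituting the user name. */
    dst = out;
    for (p = pattern; *p != '\0';)
    {
        if (strncmp(p, placeholder, plen) == 0)
        {
            memcpy(dst, user_name, ulen);
            dst += ulen;
            p += plen;
        }
        else
            *dst++ = *p++;
    }
    *dst = '\0';
    return out;
}
```

For example, the pattern "(|(uid=$username)(mail=$username))" with user
name "alice" expands to "(|(uid=alice)(mail=alice))".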
Author: Thomas Munro
Reviewed-By: Peter Eisentraut, Mark Cave-Ayland, Magnus Hagander
Discussion: https://postgr.es/m/CAEepm=0XTkYvMci0WRubZcf_1am8=gP=7oJErpsUfRYcKF2gwg@mail.gmail.com
This allows the compiler/linker to move the static variables to a
read-only segment. Not all the signature changes are necessary, but
it seems better to apply const in a consistent manner.
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20170910232154.asgml44ji2b7lv3d@alap3.anarazel.de
AFTER triggers using transition tables crashed if they were fired due
to a foreign key ON CASCADE update. This is because ExecEndModifyTable
flushes the transition tables, on the assumption that any trigger that
could need them was already fired during ExecutorFinish. Normally
that's true, because we don't allow transition-table-using triggers
to be deferred. However, foreign key CASCADE updates force any
triggers on the referencing table to be deferred to the outer query
level, by means of the EXEC_FLAG_SKIP_TRIGGERS flag. I don't recall
all the details of why it's like that and am pretty loath to redesign
it right now. Instead, just teach ExecEndModifyTable to skip destroying
the TransitionCaptureState when that flag is set. This will allow the
transition table data to survive until end of the current subtransaction.
This isn't a terribly satisfactory solution, because (1) we might be
leaking the transition tables for much longer than really necessary,
and (2) as things stand, an AFTER STATEMENT trigger will fire once per
RI updating query, ie once per row updated or deleted in the referenced
table. I suspect that is not per SQL spec. But redesigning this is a
research project that we're certainly not going to get done for v10.
So let's go with this hackish answer for now.
In passing, tweak AfterTriggerSaveEvent to not save the transition_capture
pointer into the event record for a deferrable trigger. This is not
necessary to fix the current bug, but it avoids letting dangling pointers
to long-gone transition tables persist in the trigger event queue. That's
at least a safety feature. It might also allow merging shared trigger
states in more cases than before.
I added a regression test that demonstrates the crash on unpatched code,
and also exposes the behavior of firing the AFTER STATEMENT triggers
once per row update.
Per bug #14808 from Philippe Beaudoin. Back-patch to v10.
Discussion: https://postgr.es/m/20170909064853.25630.12825@wrigleys.postgresql.org
This code isn't used, and there's no clear reason why anybody would ever
want to use it. These traversal mechanisms don't yield a visitation order
that is semantically meaningful for any external purpose, nor are they
any faster or simpler than the left-to-right or right-to-left traversals.
(In fact, some rough testing suggests they are slower :-(.) Moreover,
these mechanisms are impossible to test in any arm's-length fashion; doing
so requires knowledge of the red-black tree's internal implementation.
Hence, let's just jettison them.
Discussion: https://postgr.es/m/17735.1505003111@sss.pgh.pa.us
The previous coding of get_qual_for_list() was careful to copy everything
it was using from the input data structure. The new version missed
making a copy of pass-by-ref datum values that it's inserting into Consts.
This is not optional, however, as revealed by buildfarm failures on
machines running -DRELCACHE_FORCE_RELEASE: we're copying from a relcache
entry that could go away before the required lifespan of our output
expression. I'm pretty sure -DCLOBBER_CACHE_ALWAYS machines won't like
this either, but none of them have reported in yet.
map_partition_varattnos() failed to set its found_whole_row output
parameter if the given expression list was NIL. This seems to be
a pre-existing bug that chanced to be exposed by commit 6f6b99d13.
It might be unreachable in v10, but I have little faith in that
proposition, so back-patch.
Per buildfarm.
In commit fccebe421, we hacked get_actual_variable_range() to scan the
index with SnapshotDirty, so that if there are many uncommitted tuples
at the end of the index range, it wouldn't laboriously scan through all
of them looking for a live value to return. However, that didn't fix it
for the case of many recently-dead tuples at the end of the index;
SnapshotDirty recognizes those as committed dead and so we're back to
the same problem.
To improve the situation, invent a "SnapshotNonVacuumable" snapshot type
and use that instead. The reason this helps is that, if the snapshot
rejects a given index entry, we know that the indexscan will mark that
index entry as killed. This means the next get_actual_variable_range()
scan will proceed past that entry without visiting the heap, making the
scan a lot faster. We may end up accepting a recently-dead tuple as
being the estimated extremal value, but that doesn't seem much worse than
the compromise we made before to accept not-yet-committed extremal values.
The cost of the scan is still proportional to the number of dead index
entries at the end of the range, so in the interval after a mass delete
but before VACUUM's cleaned up the mess, it's still possible for
get_actual_variable_range() to take a noticeable amount of time, if you've
got enough such dead entries. But the constant factor is much much better
than before, since all we need to do with each index entry is test its
"killed" bit.
We chose to back-patch commit fccebe421 at the time, but I'm hesitant to
do so here, because this form of the problem seems to affect many fewer
people. Also, even when it happens, it's less bad than the case fixed
by commit fccebe421 because we don't get the contention effects from
expensive TransactionIdIsInProgress tests.
Dmitriy Sarafannikov, reviewed by Andrey Borodin
Discussion: https://postgr.es/m/05C72CF7-B5F6-4DB9-8A09-5AC897653113@yandex.ru
It is equivalent in ANSI C to write (*funcptr) () and funcptr(). These
two styles have been applied inconsistently. After discussion, we'll
use the more verbose style for plain function pointer variables, to make
it clear that it's a variable, and the shorter style when the function
pointer is in a struct (s.func() or s->func()), because then it's clear
that it's not a plain function name, and otherwise the excessive
punctuation makes some of those invocations hard to read.
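A small sketch of the two call styles, for illustration only. The
binary_op typedef, OpTable struct and the functions below are made-up
names, not taken from the tree.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*binary_op) (int, int);

typedef struct
{
    binary_op   func;
} OpTable;

static int
add_ints(int a, int b)
{
    return a + b;
}

/* Plain function pointer variable: use the verbose (*funcptr) () style. */
static int
apply_plain(binary_op funcptr, int a, int b)
{
    assert(funcptr != NULL);
    return (*funcptr) (a, b);
}

/* Function pointer stored in a struct: use the shorter s->func() style. */
static int
apply_member(const OpTable *s, int a, int b)
{
    assert(s != NULL && s->func != NULL);
    return s->func(a, b);
}
```

Both calls compile to the same thing; the difference is purely in what
the reader can tell at a glance.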
Discussion: https://www.postgresql.org/message-id/f52c16db-14ed-757d-4b48-7ef360b1631d@2ndquadrant.com
This doesn't allow routing tuples to the foreign partitions themselves,
but it permits tuples to be routed to regular partitions despite the
presence of foreign partitions in the same inheritance hierarchy.
Etsuro Fujita, reviewed by Amit Langote and by me.
Discussion: http://postgr.es/m/bc3db4c1-1693-3b8a-559f-33ad2b50b7ad@lab.ntt.co.jp
Issuing a savepoint-related command in a Query message that contains
multiple SQL statements led to a FATAL exit with a complaint about
"unexpected state STARTED". This is a shortcoming of commit 4f896dac1,
which attempted to prevent such misbehaviors in multi-statement strings;
its quick hack of marking the individual statements as "not top-level"
does the wrong thing in this case, and isn't a very accurate description
of the situation anyway.
To fix, let's introduce into xact.c an explicit model of what happens for
multi-statement Query strings. This is an "implicit transaction block
in progress" state, which for many purposes works like the normal
TBLOCK_INPROGRESS state --- in particular, IsTransactionBlock returns true,
causing the desired result that PreventTransactionChain will throw error.
But in case of error abort it works like TBLOCK_STARTED, allowing the
transaction to be cancelled without need for an explicit ROLLBACK command.
Commit 4f896dac1 is reverted in toto, so that we go back to treating the
individual statements as "top level". We could have left it as-is, but
this allows sharpening the error message for PreventTransactionChain
calls inside functions.
Except for getting a normal error instead of a FATAL exit for savepoint
commands, this patch should result in no user-visible behavioral change
(other than that one error message rewording). There are some things
we might want to do in the line of changing the appearance or wording of
error and warning messages around this behavior, which would be much
simpler to do now that it's an explicitly modeled state. But I haven't
done them here.
Although this fixes a long-standing bug, no backpatch. The consequences
of the bug don't seem severe enough to justify the risk that this commit
itself creates some new issue.
Patch by me, but it owes something to previous investigation by
Takayuki Tsunakawa, who also reported the bug in the first place.
Also thanks to Michael Paquier for reviewing.
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F6BE40D@G01JPEXMBYT05
Index columns are referenced by ordinal number rather than name, e.g.
CREATE INDEX coord_idx ON measured (x, y, (z + t));
ALTER INDEX coord_idx ALTER COLUMN 3 SET STATISTICS 1000;
Incompatibility note for release notes:
\d+ for indexes now also displays Stats Target
Authors: Alexander Korotkov, with contribution by Adrien NAYRAT
Review: Adrien NAYRAT, Simon Riggs
Wordsmith: Simon Riggs
The NAMEDTUPLESTORE patch piggybacked on the infrastructure for
TABLEFUNC/VALUES/CTE RTEs, none of which can ever have dropped columns,
so the possibility was ignored most places. Fix that, including adding a
specification to parsenodes.h about what it's supposed to look like.
In passing, clean up assorted comments that hadn't been maintained
properly by said patch.
Per bug #14799 from Philippe Beaudoin. Back-patch to v10.
Discussion: https://postgr.es/m/20170906120005.25630.84360@wrigleys.postgresql.org
The parenthesized style has only been used in a few modules. Change
that to use the style that is predominant across the whole tree.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Reviewed-by: Ryan Murphy <ryanfmurphy@gmail.com>
Throttling for sending a base backup in walsender is broken for the case
where there is a lot of WAL traffic, because the latch used to put the
walsender to sleep is also signalled by regular WAL traffic (and each
signal causes an additional batch of data to be sent); the net effect is
that there is no or little actual throttling. This is undesirable, so
rewrite the sleep into a loop to achieve the desired effect.
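The shape of such a loop can be sketched as follows. This is a
simulation, not the walsender code: wait_fn, throttle_sleep() and
noisy_wait() are hypothetical, and the wait primitive is assumed to
report how long it actually slept.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical wait primitive: sleeps for at most timeout_ms, but may return
 * early when the latch is set by unrelated activity.  Returns the number of
 * milliseconds actually slept.
 */
typedef long (*wait_fn) (long timeout_ms);

/*
 * Sleep for the whole requested interval even if individual waits are cut
 * short, by waiting again for whatever time remains.  Returns the number
 * of wakeups that were needed.
 */
static int
throttle_sleep(long sleep_ms, wait_fn wait)
{
    long        remaining = sleep_ms;
    int         wakeups = 0;

    assert(wait != NULL);
    while (remaining > 0)
    {
        long        slept = wait(remaining);

        if (slept <= 0)
            slept = 1;          /* guard against a wait reporting no progress */
        remaining -= slept;
        wakeups++;
    }
    return wakeups;
}

/* Simulated latch that unrelated traffic sets every 10 ms. */
static long
noisy_wait(long timeout_ms)
{
    return timeout_ms < 10 ? timeout_ms : 10;
}
```

With a single wait call, the first early wakeup would end the sleep; the
loop instead keeps going until the full interval has been consumed.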
Author: Jeff Janes, small tweaks by me
Reviewed-by: Antonin Houska
Discussion: https://postgr.es/m/CAMkU=1xH6mde-yL-Eo1TKBGNd0PB1-TMxvrNvqcAkN-qr2E9mw@mail.gmail.com
Some compilers complain, not unreasonably, about left-shifting an
int32 "1" and then assigning the result to an int64. In practice
I sure hope that this data structure never gets large enough that
an overflow would actually occur; but let's cast the constant to
the right type to avoid the hazard.
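For illustration, a minimal sketch of the cast; num_buckets() is a
made-up helper, not the function touched by this commit.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of buckets for a table of 2^size_log2 entries.  Casting the
 * constant first makes the shift happen in 64-bit arithmetic; writing
 * "1 << size_log2" would shift a plain int and only then widen the result.
 */
static int64_t
num_buckets(int size_log2)
{
    assert(size_log2 >= 0 && size_log2 < 63);
    return ((int64_t) 1) << size_log2;
}
```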
In passing, fix a typo in dshash.h.
Amit Kapila, adjusted as per comment from Thomas Munro.
Discussion: https://postgr.es/m/CAA4eK1+5vfVMYtjK_NX8O3-42yM3o80qdqWnQzGquPrbq6mb+A@mail.gmail.com
Move the responsibility for creating/destroying TupleQueueReaders into
execParallel.c, to avoid duplicative coding in nodeGather.c and
nodeGatherMerge.c. Also, instead of having DestroyTupleQueueReader do
shm_mq_detach, do it in the caller (which is now only ExecParallelFinish).
This means execParallel.c does both the attaching and detaching of the
tuple-queue-reader shm_mqs, which seems less weird than the previous
arrangement.
These changes also eliminate a vestigial memory leak (of the pei->tqueue
array). It's now demonstrable that rescans of Gather or GatherMerge don't
leak memory.
Discussion: https://postgr.es/m/8670.1504192177@sss.pgh.pa.us
Add the maxrss field to the getrusage output (log_*_stats). This was
previously omitted because of portability concerns, but we feel this
might not be a concern anymore.
Based on patch by Justin Pryzby <pryzby@telsasoft.com>
Commit 0e141c0fbb introduced a mechanism
to reduce contention on ProcArrayLock by having a single process clear
XIDs in the procArray on behalf of multiple processes, reducing the
need to hand the lock around. A previous attempt to introduce a similar
mechanism for CLogControlLock in ccce90b398
crashed and burned, but the design problem which resulted in those
failures is believed to have been corrected in this version.
Amit Kapila, with some cosmetic changes by me. See the previous commit
message for additional credits.
Discussion: http://postgr.es/m/CAA4eK1KudxzgWhuywY_X=yeSAhJMT4DwCjroV5Ay60xaeB2Eew@mail.gmail.com
Do for replication origins what the previous commit did for replication
slots: restore the original behavior of replication origin drop to raise
an error rather than blocking, because users might be depending on the
original behavior. Maintain the blocking behavior when invoked
internally from logical replication subscription handling.
Discussion: https://postgr.es/m/20170830133922.tlpo3lgfejm4n2cs@alvherre.pgsql
Commit 9915de6c1c changed the default behavior of
DROP_REPLICATION_SLOT so that it would wait until any session holding
the slot active would release it, instead of raising an error. But
users are already depending on the original behavior, so revert to it by
default and add a WAIT option to invoke the new behavior.
Per complaint from Simone Gotti, in
Discussion: https://postgr.es/m/CAEvsy6Wgdf90O6pUvg2wSVXL2omH5OPC-38OD4Zzgk-FXavj3Q@mail.gmail.com
This will be useful for hash partitioning, which needs a way to seed
the hash functions to avoid problems such as a hash index on a hash
partitioned table clumping all values into a small portion of the
bucket space; it's also useful for anything that wants a 64-bit hash
value rather than a 32-bit hash value.
Just in case somebody wants a 64-bit hash value that is compatible
with the existing 32-bit hash values, make the low 32-bits of the
64-bit hash value match the 32-bit hash value when the seed is 0.
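The compatibility property can be shown with a toy hash. This sketch
uses FNV-1a purely as a stand-in; it is not the hash PostgreSQL uses,
and toy_hash32() and toy_hash64_seeded() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FNV_OFFSET 2166136261u
#define FNV_PRIME  16777619u

/* Toy 32-bit hash (FNV-1a); a stand-in, not PostgreSQL's hash function. */
static uint32_t
toy_hash32(const unsigned char *data, size_t len)
{
    uint32_t    h = FNV_OFFSET;

    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
    {
        h ^= data[i];
        h *= FNV_PRIME;
    }
    return h;
}

/*
 * Toy seeded 64-bit variant.  The low lane starts from the same state as
 * toy_hash32() when the seed is 0, so the low 32 bits then match exactly.
 */
static uint64_t
toy_hash64_seeded(const unsigned char *data, size_t len, uint64_t seed)
{
    uint32_t    lo = FNV_OFFSET ^ (uint32_t) seed;
    uint32_t    hi = 0x9e3779b9u ^ (uint32_t) (seed >> 32);

    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
    {
        lo ^= data[i];
        lo *= FNV_PRIME;
        hi ^= data[i];
        hi *= FNV_PRIME;
    }
    return ((uint64_t) hi << 32) | lo;
}
```

A caller that only needs the old 32-bit value can pass seed 0 and
truncate; a caller that wants a different distribution passes a nonzero
seed.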
Robert Haas and Amul Sul
Discussion: http://postgr.es/m/CA+Tgmoafx2yoJuhCQQOL5CocEi-w_uG4S2xT0EtgiJnPGcHW3g@mail.gmail.com
Rescanning a GatherMerge led to leaking some memory in the executor's
query-lifespan context, because most of the node's working data structures
were simply abandoned and rebuilt from scratch. In practice, this might
never amount to much, given the cost of relaunching worker processes ---
but it's still pretty messy, so let's fix it.
We can rearrange things so that the tuple arrays are simply cleared and
reused, and we don't need to rebuild the TupleTableSlots either, just
clear them. One small complication is that because we might get a
different number of workers on each iteration, we can't keep the old
convention that the leader's gm_slots[] entry is the last one; the leader
might clobber a TupleTableSlot that we need for a worker in a future
iteration. Hence, adjust the logic so that the leader has slot 0 always,
while the active workers have slots 1..n.
Back-patch to v10 to keep all the existing versions of nodeGatherMerge.c
in sync --- because of the renumbering of the slots, there would otherwise
be a very large risk that any future backpatches in this module would
introduce bugs.
Discussion: https://postgr.es/m/8670.1504192177@sss.pgh.pa.us
Previously, we expanded the inheritance hierarchy in the order in
which find_all_inheritors had locked the tables, but that turns out
to block quite a bit of useful optimization. For example, a
partition-wise join can't count on two tables with matching bounds
to get expanded in the same order.
Where possible, this change results in expanding partitioned tables in
*bound* order. Bound order isn't well-defined for a list-partitioned
table with a null-accepting partition or for a list-partitioned table
where the bounds for a single partition are interleaved with other
partitions. However, when expansion in bound order is possible, it
opens up further opportunities for optimization, such as
strength-reducing MergeAppend to Append when the expansion order
matches the desired sort order.
Patch by me, with cosmetic revisions by Ashutosh Bapat.
Discussion: http://postgr.es/m/CA+TgmoZrKj7kEzcMSum3aXV4eyvvbh9WD=c6m=002WMheDyE3A@mail.gmail.com
The logic around shm_mq_detach was a few bricks shy of a load, because
(contrary to the comments for shm_mq_attach) all it did was update the
shared shm_mq state. That left us leaking a bit of process-local
memory, but much worse, the on_dsm_detach callback for shm_mq_detach
was still armed. That means that whenever we ultimately detach from
the DSM segment, we'd run shm_mq_detach again for already-detached,
possibly long-dead queues. This accidentally fails to fail today,
because we only ever re-use a shm_mq's memory for another shm_mq, and
multiple detach attempts on the last such shm_mq are fairly harmless.
But it's gonna bite us someday, so let's clean it up.
To do that, change shm_mq_detach's API so it takes a shm_mq_handle
not the underlying shm_mq. This makes the callers simpler in most
cases anyway. Also fix a few places in parallel.c that were just
pfree'ing the handle structs rather than doing proper cleanup.
Back-patch to v10 because of the risk that the revenant shm_mq_detach
callbacks would cause a live bug sometime. Since this is an API
change, it's too late to do it in 9.6. (We could make a variant
patch that preserves API, but I'm not excited enough to do that.)
Discussion: https://postgr.es/m/8670.1504192177@sss.pgh.pa.us
Comment the fields of GatherMergeState, and organize them a bit more
sensibly. Comment GMReaderTupleBuffer more usefully too. Improve
assorted other comments that were obsolete or just not very good English.
Get rid of the use of a GMReaderTupleBuffer for the leader process;
that was confusing, since only the "done" field was used, and that
in a way redundant with need_to_scan_locally.
In gather_merge_init, avoid calling load_tuple_array for
already-known-exhausted workers. I'm not sure if there's a live bug there,
but the case is unlikely to be well tested due to timing considerations.
Remove some useless code, such as duplicating the tts_isempty test done by
TupIsNull.
Remove useless initialization of ps.qual, replacing that with an assertion
that we have no qual to check. (If we did, the code would fail to check
it.)
Avoid applying heap_copytuple to a null tuple. While that fails to crash,
it's confusing and it makes the code less legible not more so IMO.
Propagate a couple of these changes into nodeGather.c, as well.
Back-patch to v10, partly because of the possibility that the
gather_merge_init change is fixing a live bug, but mostly to keep
the branches in sync to ease future bug fixes.
Previously, the parallel executor logic did reinitialization of shared
state within the ExecReScan code for parallel-aware scan nodes. This is
problematic, because it means that the ExecReScan call has to occur
synchronously (ie, during the parent Gather node's ReScan call). That is
swimming very much against the tide so far as the ExecReScan machinery is
concerned; the fact that it works at all today depends on a lot of fragile
assumptions, such as that no plan node between Gather and a parallel-aware
scan node is parameterized. Another objection is that because ExecReScan
might be called in workers as well as the leader, hacky extra tests are
needed in some places to prevent unwanted shared-state resets.
Hence, let's separate this code into two functions, a ReInitializeDSM
call and the ReScan call proper. ReInitializeDSM is called only in
the leader and is guaranteed to run before we start new workers.
ReScan is returned to its traditional function of resetting only local
state, which means that ExecReScan's usual habits of delaying or
eliminating child rescan calls are safe again.
As with the preceding commit 7df2c1f8d, it doesn't seem to be necessary
to make these changes in 9.6, which is a good thing because the FDW and
CustomScan APIs are impacted.
Discussion: https://postgr.es/m/CAA4eK1JkByysFJNh9M349u_nNjqETuEnY_y1VUc_kJiU0bxtaQ@mail.gmail.com
The ExecReScan machinery contains various optimizations for postponing
or skipping rescans of plan subtrees; for example a HashAgg node may
conclude that it can re-use the table it built before, instead of
re-reading its input subtree. But that is wrong if the input contains
a parallel-aware table scan node, since the portion of the table scanned
by the leader process is likely to vary from one rescan to the next.
This explains the timing-dependent buildfarm failures we saw after
commit a2b70c89c.
The established mechanism for showing that a plan node's output is
potentially variable is to mark it as depending on some runtime Param.
Hence, to fix this, invent a dummy Param (one that has a PARAM_EXEC
parameter number, but carries no actual value) associated with each Gather
or GatherMerge node, mark parallel-aware nodes below that node as dependent
on that Param, and arrange for ExecReScanGather[Merge] to flag that Param
as changed whenever the Gather[Merge] node is rescanned.
This solution breaks an undocumented assumption made by the parallel
executor logic, namely that all rescans of nodes below a Gather[Merge]
will happen synchronously during the ReScan of the top node itself.
But that's fundamentally contrary to the design of the ExecReScan code,
and so was doomed to fail someday anyway (even if you want to argue
that the bug being fixed here wasn't a failure of that assumption).
A follow-on patch will address that issue. In the meantime, the worst
that's expected to happen is that given very bad timing luck, the leader
might have to do all the work during a rescan, because workers think
they have nothing to do, if they are able to start up before the eventual
ReScan of the leader's parallel-aware table scan node has reset the
shared scan state.
Although this problem exists in 9.6, there does not seem to be any way
for it to manifest there. Without GatherMerge, it seems that a plan tree
that has a rescan-short-circuiting node below Gather will always also
have one above it that will short-circuit in the same cases, preventing
the Gather from being rescanned. Hence we won't take the risk of
back-patching this change into 9.6. But v10 needs it.
Discussion: https://postgr.es/m/CAA4eK1JkByysFJNh9M349u_nNjqETuEnY_y1VUc_kJiU0bxtaQ@mail.gmail.com
Up until now, when parallel query was used, no details about the
sort method or space used by the workers were available; details
were shown only for any sorting done by the leader. Fix that.
Commit 1177ab1dab forced the test case
added by commit 1f6d515a67 to run
without parallelism; now that we have this infrastructure, allow
that again, with a little tweaking to make it pass with and without
force_parallel_mode.
Robert Haas and Tom Lane
Discussion: http://postgr.es/m/CA+Tgmoa2VBZW6S8AAXfhpHczb=Rf6RqQ2br+zJvEgwJ0uoD_tQ@mail.gmail.com
If we only need, say, 10 tuples in total, then we certainly don't need
more than 10 tuples from any single process. Pushing down the limit
lets workers exit early when possible. For Gather Merge, there is
an additional benefit: a Sort immediately below the Gather Merge can
be done as a bounded sort if there is an applicable limit.
Robert Haas and Tom Lane
Discussion: http://postgr.es/m/CA+TgmoYa3QKKrLj5rX7UvGqhH73G1Li4B-EKxrmASaca2tFu9Q@mail.gmail.com
Minor improvements for commit 1f6d515a6. We do not need the (rather
expensive) test for SRFs in the targetlist, because since v10 any
such SRFs would appear in separate ProjectSet nodes. Also, make the
code look more like the existing cases by turning it into a simple
recursion --- the argument that there might be some performance
benefit to contorting the code seems unfounded to me, especially since
any good compiler should turn the tail-recursion into iteration anyway.
Discussion: http://postgr.es/m/CADE5jYLuugnEEUsyW6Q_4mZFYTxHxaVCQmGAsF0yiY8ZDggi-w@mail.gmail.com
Commit 8c0d7bafad introduced dshash with hash
and compare functions like DynaHash's, and also variants that take a user
data pointer instead of size. Simplify the interface by merging them into
a single pair of function pointer types that take both size and a user data
pointer.
Since it is anticipated that memcmp and tag_hash behavior will be a common
requirement, provide wrapper functions dshash_memcmp and dshash_memhash that
conform to the new function types.
Author: Thomas Munro
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/20170823054644.efuzftxjpfi6wwqs%40alap3.anarazel.de
Commit 16be2fd100 added DSA_ALLOC_HUGE,
DSA_ALLOC_ZERO and DSA_ALLOC_NO_OOM which have the same numerical
values and meanings as the similarly named MCXT_... macros. In one
place we accidentally used MCXT_ALLOC_NO_OOM when DSA_ALLOC_NO_OOM is
wanted, so tidy that up.
Author: Thomas Munro
Discussion: http://postgr.es/m/CAEepm=2AimHxVkkxnMfQvbZMkXy0uKbVa0-D38c5-qwrCm4CMQ@mail.gmail.com
Backpatch: 10, where dsa was introduced.
Add general purpose chaining hash tables for DSA memory. Unlike
DynaHash in shared memory mode, these hash tables can grow as
required, and cope with being mapped into different addresses in
different backends.
There is a wide range of potential users for such a hash table, though
it's very likely the interface will need to evolve as we come to
understand the needs of different kinds of users. E.g., support for
iterators and incremental resizing is planned for later commits and
the details of the callback signatures are likely to change.
Author: Thomas Munro
Reviewed-By: John Gorman, Andres Freund, Dilip Kumar, Robert Haas
Discussion: https://postgr.es/m/CAEepm=3d8o8XdVwYT6O=bHKsKAM2pu2D6sV1S_=4d+jStVCE7w@mail.gmail.com
Discussion: https://postgr.es/m/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kZWqk3g5Ygn3MDV7A8dabUA@mail.gmail.com
Previously, tuple descriptors were stored in chains keyed by a fixed size
array of OIDs. That meant there were effectively two levels of collision
chain -- one inside and one outside the hash table. Instead, let dynahash.c
look after conflicts for us by supplying a proper hash and equal function
pair.
This is a nice cleanup on its own, but also simplifies followup
changes allowing blessed TupleDescs to be shared between backends
participating in parallel query.
Author: Thomas Munro
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm%3D34GVhOL%2BarUx56yx7OPk7%3DqpGsv3CpO54feqjAwQKm5g%40mail.gmail.com
Users can still create them themselves. Instead, document Unicode TR 35
collation options for ICU, so users can create all this themselves.
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Install language+region combinations even if they are not distinct from
the language's base locale. This gives better long-term stability of
the set of predefined locales and makes the predefined locales less
implementation-dependent and more practical for users.
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Periodically while the server is running, and at shutdown, write out a
list of blocks in shared buffers. When the server reaches consistency
-- unfortunately, we can't do it before that point without breaking
things -- reload those blocks into any still-unused shared buffers.
Mithun Cy and Robert Haas, reviewed and tested by Beena Emerson,
Amit Kapila, Jim Nasby, and Rafia Sabih.
Discussion: http://postgr.es/m/CAD__OugubOs1Vy7kgF6xTjmEqTR4CrGAv8w+ZbaY_+MZeitukw@mail.gmail.com
It appeared in a conditional that excludes AIX, Cygwin and MinGW. Give
ICU support a chance to work on those platforms. Back-patch to v10,
where ICU support was introduced.
TupleDesc's attributes were already stored in contiguous memory after the
struct. Go one step further and get rid of the array of pointers to
attributes so that they can be stored in shared memory mapped at different
addresses in each backend. This won't work for TupleDescs with constraints
and defaults, since those point to other objects, but for many purposes
only attributes are needed.
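The layout idea can be sketched with a flexible array member. The types
and the accessor macro below are simplified stand-ins with hypothetical
names, not the real TupleDesc definitions.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy attribute and descriptor; simplified stand-ins for illustration. */
typedef struct
{
    int         atttypid;
    int         attlen;
} ToyAttribute;

typedef struct
{
    int         natts;
    ToyAttribute attrs[];       /* stored inline, directly after the header */
} ToyTupleDesc;

/* Accessor that hides where the attributes live. */
#define ToyTupleDescAttr(tupdesc, i) (&(tupdesc)->attrs[(i)])

/* Allocate header and attributes as one contiguous, pointer-free chunk. */
static ToyTupleDesc *
toy_create_desc(int natts)
{
    ToyTupleDesc *desc;

    assert(natts >= 0);
    desc = calloc(1, sizeof(ToyTupleDesc) +
                  (size_t) natts * sizeof(ToyAttribute));
    if (desc != NULL)
        desc->natts = natts;
    return desc;
}
```

Because the chunk contains no internal pointers, a byte-for-byte copy of
it remains valid at whatever address it is mapped.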
Author: Thomas Munro
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kZWqk3g5Ygn3MDV7A8dabUA@mail.gmail.com
This is a mechanical change in preparation for a later commit that
will change the layout of TupleDesc. Introducing a macro to abstract
the details of where attributes are stored will allow us to change
that in a separate step and revise it in future.
Author: Thomas Munro, editorialized by Andres Freund
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kZWqk3g5Ygn3MDV7A8dabUA@mail.gmail.com
Add a new EState member es_leaf_result_relations, so that the trigger
code knows about ResultRelInfos created by tuple routing. Also make
sure ExplainPrintTriggers knows about partition-related
ResultRelInfos.
Etsuro Fujita, reviewed by Amit Langote
Discussion: http://postgr.es/m/57163e18-8e56-da83-337a-22f2c0008051@lab.ntt.co.jp
Instead, lock them in the caller using find_all_inheritors so that
they get locked in the standard order, minimizing deadlock risks.
Also in RelationGetPartitionDispatchInfo, avoid opening tables which
are not partitioned; there's no need.
Amit Langote, reviewed by Ashutosh Bapat and Amit Khandekar
Discussion: http://postgr.es/m/91b36fa1-c197-b72f-ca6e-56c593bae68c@lab.ntt.co.jp
As Andres pointed out, pg_atomic_init_u64 must be used to initialize an
atomic variable, before it can be accessed with the actual atomic ops.
Trying to use pg_atomic_write_u64 on an uninitialized variable leads to a
failure with the fallback implementation that uses a spinlock.
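A toy model of why the init call matters when the fallback is in use.
This is a sketch using C11 atomic_flag as the spinlock; the struct and
function names are hypothetical and do not mirror the real fallback
implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Toy spinlock-protected 64-bit value; names are hypothetical. */
typedef struct
{
    atomic_flag lock;           /* garbage until toy_atomic_init_u64() runs */
    uint64_t    value;
} toy_atomic_u64;

/* Must be called first: puts the embedded lock into a known, free state. */
static void
toy_atomic_init_u64(toy_atomic_u64 *p, uint64_t v)
{
    assert(p != NULL);
    atomic_flag_clear(&p->lock);
    p->value = v;
}

/* Takes the lock; on an uninitialized struct this could spin forever. */
static void
toy_atomic_write_u64(toy_atomic_u64 *p, uint64_t v)
{
    assert(p != NULL);
    while (atomic_flag_test_and_set(&p->lock))
        ;                       /* spin until the lock is free */
    p->value = v;
    atomic_flag_clear(&p->lock);
}

static uint64_t
toy_atomic_read_u64(toy_atomic_u64 *p)
{
    uint64_t    v;

    assert(p != NULL);
    while (atomic_flag_test_and_set(&p->lock))
        ;
    v = p->value;
    atomic_flag_clear(&p->lock);
    return v;
}
```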
Discussion: https://www.postgresql.org/message-id/20170816191346.d3ke5tpshhco4bnd%40alap3.anarazel.de
Previously, if we had to estimate the number of distinct values in a
VALUES column, we fell back on the default behavior used whenever we lack
statistics, which effectively is that there are Min(# of entries, 200)
distinct values. This can be very badly off with a large VALUES list,
as noted by Jeff Janes.
We could consider actually running an ANALYZE-like scan on the VALUES,
but that seems unduly expensive, and anyway it could not deliver reliable
info if the entries are not all constants. What seems like a better choice
is to assume that the values are all distinct. This will sometimes be just
as wrong as the old code, but it seems more likely to be more nearly right
in many common cases. Also, it is more consistent with what happens in
some related cases, for example WHERE x = ANY(ARRAY[1,2,3,...,n]) and
WHERE x = ANY(VALUES (1),(2),(3),...,(n)) now are estimated similarly.
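The two rules side by side, as a sketch; the function names are
invented for illustration and this is not the planner code.

```c
#include <assert.h>

/* Old default: with no statistics, assume Min(# of entries, 200). */
static double
old_values_ndistinct(double nentries)
{
    assert(nentries >= 0.0);
    return nentries < 200.0 ? nentries : 200.0;
}

/* New rule for a VALUES column: assume every entry is distinct. */
static double
new_values_ndistinct(double nentries)
{
    assert(nentries >= 0.0);
    return nentries;
}
```

For a 10000-entry VALUES list the old rule estimates 200 distinct values
and the new one estimates 10000; for short lists the two agree.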
This was discussed some time ago, but consensus was it'd be better
to slip it in at the start of a development cycle not near the end.
(It should've gone into v10, really, but I forgot about it.)
Discussion: https://postgr.es/m/CAMkU=1xHkyPa8VQgGcCNg3RMFFvVxUdOpus1gKcFuvVi0w6Acg@mail.gmail.com
Previously, if you passed a non-aligned size to shm_toc_create(), the
memory returned by shm_toc_allocate() would be similarly non-aligned.
This was exposed by commit 3cda10f41b, which allocated structs containing
a pg_atomic_uint64 field with shm_toc_allocate(). On systems with
MAXIMUM_ALIGNOF = 4, such structs still need to be 8-byte aligned, but
the memory returned by shm_toc_allocate() was only 4-byte aligned.
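How a non-aligned total size can leak into the returned addresses, as a
sketch. It assumes an allocator that carves chunks from the end of the
region, and shows rounding the total size down as one possible remedy;
the macros and chunk_offset() are hypothetical, not the shm_toc code.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ALIGNOF 8           /* assumed alignment requirement */
#define TOY_ALIGN_DOWN(len) ((len) & ~((size_t) (TOY_ALIGNOF - 1)))
#define TOY_ALIGN_UP(len) \
    (((len) + TOY_ALIGNOF - 1) & ~((size_t) (TOY_ALIGNOF - 1)))

/*
 * Offset of a chunk carved from the END of a region of total_bytes, after
 * allocated_bytes have already been handed out.  The request size is
 * rounded up, but the result is only aligned if total_bytes is too.
 */
static size_t
chunk_offset(size_t total_bytes, size_t allocated_bytes, size_t nbytes)
{
    nbytes = TOY_ALIGN_UP(nbytes);
    assert(allocated_bytes + nbytes <= total_bytes);
    return total_bytes - allocated_bytes - nbytes;
}
```

With a total size of 1001 bytes, a 64-byte request lands at offset 937;
after rounding the total down to 1000, the same request lands at 936.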
It's quite bogus that we abuse BUFFERALIGN to align the structs for
pg_atomic_uint64. It doesn't really have anything to do with buffers. But
that's a separate issue.
This ought to fix the buildfarm failures on 32-bit x86 systems.
Discussion: https://www.postgresql.org/message-id/7e0a73a5-0df9-1859-b8ae-9acf122dc38d@iki.fi
Since commit 40dae7ec53, which changed the way b-tree page splitting
works, there has been no difference in the handling of root and non-root
split WAL records. We don't need to distinguish them anymore.
If you're worried about the loss of debugging information, note that
a root split record will normally be followed by a WAL record to
create the new root page. The root page will also have the BTP_ROOT flag
set on the page itself, and there is a pointer to it from the metapage.
Author: Aleksander Alekseev
Discussion: https://www.postgresql.org/message-id/20170406122116.GA11081@e733.localdomain
Change to appendStringInfoChar() or appendStringInfoString() where those
can be used.
Author: David Rowley <david.rowley@2ndquadrant.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>
Although not confirmed and probably rare, if the newly allocated memory
is not already zero, this could possibly have caused some problems.
Also reorder the initializations slightly so they match the order of the
struct definition.
Author: Wong, Yi Wen <yiwong@amazon.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
This appears to have been an omission in the original commit
0d692a0dc9. All related information_schema views already include
foreign tables.
Reported-by: Nicolas Thauvin <nicolas.thauvin@dalibo.com>
The initial implementation of autovacuum work-items used a dynamic
shared memory area (DSA). However, it's argued that dynamic shared
memory is not portable enough, so we cannot rely on it being supported
everywhere; at the same time, autovacuum work-items are now a critical
part of the server, so it's not acceptable that they don't work in the
cases where dynamic shared memory is disabled. Therefore, let's fall
back to a simpler implementation of work-items that just uses
autovacuum's main shared memory segment for storage.
Discussion: https://postgr.es/m/CA+TgmobQVbz4K_+RSmiM9HeRKpy3vS5xnbkL95gSEnWijzprKQ@mail.gmail.com
Since we currently only have one protocol, this doesn't make much of a
difference other than the error message.
Author: Yugo Nagata <nagata@sraoss.co.jp>
The executor is capable of splitting buckets during a hash join if
too much memory is being used by a small number of buckets. However,
this only helps if a bucket's population is actually divisible; if
all the hash keys are alike, the tuples still end up in the same
new bucket. This can result in an OOM failure if there are enough
inner keys with identical hash values. The planner's cost estimates
will bias it against choosing a hash join in such situations, but not
by so much that it will never do so. To mitigate the OOM hazard,
explicitly estimate the hash bucket space needed by just the inner
side's most common value, and if that would exceed work_mem then
add disable_cost to the hash cost estimate.
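The check can be sketched roughly as below. This is a simplification
under stated assumptions: the function name, its parameters, the
per-tuple size model and the penalty value are all made up for
illustration and are not the costing code itself.

```c
#include <assert.h>

#define TOY_DISABLE_COST 1.0e10 /* large penalty; the value is an assumption */

/*
 * Penalize a hash join whose most common inner value alone would need more
 * than work_mem.  mcv_freq is the fraction of inner rows carrying that
 * value; pass 0 when no statistics are available, which skips the penalty.
 */
static double
penalize_skewed_hash(double hash_cost, double inner_rows, double mcv_freq,
                     double tuple_bytes, double work_mem_bytes)
{
    double      mcv_bucket_bytes;

    assert(mcv_freq >= 0.0 && mcv_freq <= 1.0);
    mcv_bucket_bytes = inner_rows * mcv_freq * tuple_bytes;
    if (mcv_bucket_bytes > work_mem_bytes)
        hash_cost += TOY_DISABLE_COST;
    return hash_cost;
}
```

The penalty does not forbid the plan; it only makes it lose to any
alternative that is not similarly penalized.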
This approach doesn't account for the possibility that two or more
common values would share the same hash value. On the other hand,
work_mem is normally a fairly conservative bound, so that eating
two or more times that much space is probably not going to kill us.
If we have no stats about the inner side, ignore this consideration.
There was some discussion of making a conservative assumption, but that
would effectively result in disabling hash join whenever we lack stats,
which seems like an overreaction given how seldom the problem manifests
in the field.
Per a complaint from David Hinkle. Although this could be viewed
as a bug fix, the lack of similar complaints weighs against back-
patching; indeed we waited for v11 because it seemed already rather
late in the v10 cycle to be making plan choice changes like this one.
Discussion: https://postgr.es/m/32013.1487271761@sss.pgh.pa.us
The original code (since 00e6a16d01) was assuming aborting the
transaction in autovacuum launcher was sufficient to release all
resources, but in reality the launcher runs quite a lot of code out of
any transactions. Re-introduce individual cleanup calls to make abort
more robust.
Reported-by: Robert Haas
Discussion: https://postgr.es/m/CA+TgmobQVbz4K_+RSmiM9HeRKpy3vS5xnbkL95gSEnWijzprKQ@mail.gmail.com
Instead of duplicating the logic to search for a matching
ParamPathInfo in multiple places, factor it out into a separate
function.
Pass only the relevant bits of the PartitionKey to
partition_bounds_equal instead of the whole thing, because
partition-wise join will want to call this without having a
PartitionKey available.
Adjust allow_star_schema_join and calc_nestloop_required_outer
to take relevant Relids rather than the entire Path, because
partition-wise join will want to call it with the top-parent
relids to determine whether a child join is allowable.
Ashutosh Bapat. Review and testing of the larger patch set of which
this is a part by Amit Langote, Rajkumar Raghuwanshi, Rafia Sabih,
Thomas Munro, Dilip Kumar, and me.
Discussion: http://postgr.es/m/CA+TgmobQK80vtXjAsPZWWXd7c8u13G86gmuLupN+uUJjA+i4nA@mail.gmail.com
plpgsql wants to recognize expressions that it can execute directly
via ExecEvalExpr() instead of going through the full SPI machinery.
Originally the test for this consisted of recursively groveling through
the post-planning expression tree to see if it contained only nodes that
plpgsql recognized as safe. That was a major maintenance headache, since
it required updating plpgsql every time we added any kind of expression
node. It was also kind of expensive, so over time we added various
pre-planning checks to try to short-circuit having to do that.
Robert Haas pointed out that as of the SRF-processing changes in v10,
particularly the addition of Query.hasTargetSRFs, there really isn't
any reason to make the recursive scan at all: the initial checks cover
everything we really care about. We do have to make sure that those
checks agree with what inline_function() considers, so that inlining
of a function that formerly wasn't inlined can't cause an expression
considered simple to become non-simple.
Hence, delete the recursive function exec_simple_check_node(), and tweak
those other tests to more exactly agree with inline_function(). Adjust
some comments and function naming to match.
Discussion: https://postgr.es/m/CA+TgmoZGZpwdEV2FQWaVxA_qZXsQE1DAS5Fu8fwxXDNvfndiUQ@mail.gmail.com
The API for WaitLatch and friends followed the Unix convention in which
waiting for a socket connection to complete is identical to waiting for
the socket to accept a write. While Windows provides a select(2)
emulation that agrees with that, the native WaitForMultipleObjects API
treats them as quite different --- and for some bizarre reason, it will
report a not-yet-connected socket as write-ready. libpq itself has so
far escaped dealing with this because it waits with select(), but in
libpqwalreceiver.c we want to wait using WaitLatchOrSocket. The semantics
mismatch resulted in replication connection failures on Windows, but only
for remote connections (apparently, localhost connections complete
immediately, or at least too fast for anyone to have noticed the problem
in single-machine testing).
To fix, introduce an additional WL_SOCKET_CONNECTED wait flag for
WaitLatchOrSocket, which is identical to WL_SOCKET_WRITEABLE on
non-Windows, but results in waiting for FD_CONNECT events on Windows.
Ideally, we would also distinguish the two conditions in the API for
PQconnectPoll(), but changing that API at this point seems infeasible.
Instead, cheat by checking for PQstatus() == CONNECTION_STARTED to
determine that we're still waiting for the connection to complete.
(This is a cheat mainly because CONNECTION_STARTED is documented as an
internal state rather than something callers should rely on. Perhaps
we ought to change the documentation ... but this patch doesn't.)
Per reports from Jobin Augustine and Igor Neyman. Back-patch to v10
where commit 1e8a85009 exposed this longstanding shortcoming.
Andres Freund, minor fix and some code review/beautification by me
Discussion: https://postgr.es/m/CAHBggj8g2T+ZDcACZ2FmzX9CTxkWjKBsHd6NkYB4i9Ojf6K1Fw@mail.gmail.com
Currently, child relations are always base relations, so when we
translate parent relids to child relids, we only need to translate
a single relid. However, the proposed partition-wise join feature
will create child joins, which will mean we need to translate a set
of parent relids to the corresponding child relids. This is
preliminary refactoring to make that possible.
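As a rough illustration only (this is a Python model, not the planner
code), translating a whole set of parent relids rather than one relid
might look like the sketch below. The function name and the
parent_to_child mapping are hypothetical stand-ins; relids with no
child translation are assumed to pass through unchanged.

```python
def translate_relids(relids, parent_to_child):
    """Return relids with every parent relid replaced by its child relid.

    relids: frozenset of integer relation ids (a stand-in for Relids).
    parent_to_child: hypothetical mapping from parent relid to child relid.
    """
    # Relids without an entry in the mapping are kept as they are.
    return frozenset(parent_to_child.get(r, r) for r in relids)


# Single-relid case: all that is needed while children are base relations.
print(translate_relids(frozenset({1}), {1: 4}))

# Set case: what a child join would need.
print(translate_relids(frozenset({1, 2}), {1: 4, 2: 5}))
```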
Ashutosh Bapat. Review and testing of the larger patch set of which
this is a part by Amit Langote, Rajkumar Raghuwanshi, Rafia Sabih,
Thomas Munro, Dilip Kumar, and me. Some adjustments, mostly
cosmetic, by me.
Discussion: http://postgr.es/m/CA+TgmobQK80vtXjAsPZWWXd7c8u13G86gmuLupN+uUJjA+i4nA@mail.gmail.com
Before commit d3cc37f1d8, an inheritance parent
whose only children were temp tables of other sessions would end up
as a simple scan of the parent; but with that commit, we end up with
an Append node, per a report from Ashutosh Bapat. Tweak the logic
so that we go back to the old way, and update the function header
comment for partitioning while we're at it.
Ashutosh Bapat, reviewed by Amit Langote and adjusted by me.
Discussion: http://postgr.es/m/CAFjFpReWJr1yTkHU=OqiMBmcYCMoSW3VPR39RBuQ_ovwDFBT5Q@mail.gmail.com
Stress testing by Andreas Seltenreich disclosed longstanding problems that
occur if a FATAL exit (e.g. due to receipt of SIGTERM) occurs while we are
trying to execute a ROLLBACK of an already-failed transaction. In such a
case, xact.c is in TBLOCK_ABORT state, so that AbortOutOfAnyTransaction
would skip AbortTransaction and go straight to CleanupTransaction. This
led to an assert failure in an assert-enabled build (due to the ROLLBACK's
portal still having a cleanup hook) or without assertions, to a FATAL exit
complaining about "cannot drop active portal". The latter's not
disastrous, perhaps, but it's messy enough to want to improve it.
We don't really want to run all of AbortTransaction in this code path.
The minimum required to clean up the open portal safely is to do
AtAbort_Memory and AtAbort_Portals. It seems like a good idea to
do AtAbort_Memory unconditionally, to be entirely sure that we are
starting with a safe CurrentMemoryContext. That means that if the
main loop in AbortOutOfAnyTransaction does nothing, we need an extra
step at the bottom to restore CurrentMemoryContext = TopMemoryContext,
which I chose to do by invoking AtCleanup_Memory. This'll result in
calling AtCleanup_Memory twice in many of the paths through this function,
but that seems harmless and reasonably inexpensive.
The original motivation for the assertion in AtCleanup_Portals was that
we wanted to be sure that any user-defined code executed as a consequence
of the cleanup hook runs during AbortTransaction not CleanupTransaction.
That still seems like a valid concern, and now that we've seen one case
of the assertion firing --- which means that exactly that would have
happened in a production build --- let's replace the Assert with a runtime
check. If we see the cleanup hook still set, we'll emit a WARNING and
just drop the hook unexecuted.
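A minimal Python model of that change, for illustration only: the
Portal class, the function name, and the warning text are assumptions
made for this sketch, not the backend's actual definitions. The point
is that a leftover hook is reported and discarded, never run.

```python
import warnings


class Portal:
    """Toy portal carrying an optional cleanup hook (hypothetical)."""

    def __init__(self, name, cleanup=None):
        self.name = name
        self.cleanup = cleanup


def at_cleanup_portals(portals):
    """Drop all portals; warn about, but do not execute, leftover hooks."""
    for portal in portals:
        if portal.cleanup is not None:
            # Runtime check in place of an assertion: report and discard.
            warnings.warn(f'skipping cleanup for portal "{portal.name}"')
            portal.cleanup = None
    portals.clear()
```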
This has been like this a long time, so back-patch to all supported
branches.
Discussion: https://postgr.es/m/877ey7bmun.fsf@ansel.ydns.eu
The sole useful effect of this function, to check that no catcache
entries have positive refcounts at transaction end, has really been
obsolete since we introduced ResourceOwners in PG 8.1. We reduced the
checks to assertions years ago, so that the function was a complete
no-op in production builds. There have been previous discussions about
removing it entirely, but consensus up to now was that it had some small
value as a cross-check for bugs in the ResourceOwner logic.
However, it now emerges that it's possible to trigger these assertions
if you hit an assert-enabled backend with SIGTERM during a call to
SearchCatCacheList, because that function temporarily increases the
refcounts of entries it's intending to add to a catcache list construct.
In a normal ERROR scenario, the extra refcounts are cleaned up by
SearchCatCacheList's PG_CATCH block; but in a FATAL exit we do a
transaction abort and exit without ever executing PG_CATCH handlers.
There's a case to be made that this is a generic hazard and we should
consider restructuring elog(FATAL) handling so that pending PG_CATCH
handlers do get run. That's pretty scary though: it could easily create
more problems than it solves. Preliminary stress testing by Andreas
Seltenreich suggests that there are not many live problems of this ilk,
so we rejected that idea.
There are more-localized ways to fix the problem; the most principled
one would be to use PG_ENSURE_ERROR_CLEANUP instead of plain PG_TRY.
But adding cycles to SearchCatCacheList isn't very appealing. We could
also weaken the assertions in AtEOXact_CatCache in some more or less
ad-hoc way, but that just makes its raison d'etre even less compelling.
In the end, the most reasonable solution seems to be to just remove
AtEOXact_CatCache altogether, on the grounds that it's not worth trying
to fix it. It hasn't found any bugs for us in many years.
Per report from Jeevan Chalke. Back-patch to all supported branches.
Discussion: https://postgr.es/m/CAM2+6=VEE30YtRQCZX7_sCFsEpoUkFBV1gZazL70fqLn8rcvBA@mail.gmail.com
The previous message didn't mention the name of the table or the
bounds. Put the table name in the primary error message and the
bounds in the detail message.
Amit Langote, changed slightly by me. Suggestions on the exac
phrasing from Tom Lane, David G. Johnston, and Dean Rasheed.
Discussion: http://postgr.es/m/CA+Tgmoae6bpwVa-1BMaVcwvCCeOoJ5B9Q9-RHWo-1gJxfPBZ5Q@mail.gmail.com
We must advance the oldest XID that can be safely looked up in clog
*before* truncating CLOG, and the oldest XID that can't be reused
*after* truncating CLOG. This assertion, and the accompanying
comment, are confused; remove them.
Reported by Neha Sharma.
Discussion: http://postgr.es/m/CANiYTQumC3T=UMBMd1Hor=5XWZYuCEQBioL3ug0YtNQCMMT5wQ@mail.gmail.com
find_composite_type_dependencies correctly found columns that are of
the specified type, and columns that are of arrays of that type, but
not columns that are domains or ranges over the given type, its array
type, etc. The most general way to handle this seems to be to assume
that any type that is directly dependent on the specified type can be
treated as a container type, and processed recursively (allowing us
to handle nested cases such as ranges over domains over arrays ...).
Since a type's array type already has such a dependency, we can drop
the existing special case for the array type.
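The recursive treatment of container types can be sketched as follows.
This is a simplified Python model, not the catalog-scanning code: the
two dictionaries stand in for pg_depend and pg_attribute lookups, and
all type and column names are made up for the example.

```python
def find_dependent_columns(type_name, dependents, columns_by_type, seen=None):
    """Collect columns of type_name and of any type built on top of it.

    dependents: maps a type to the types directly dependent on it
                (arrays, domains, ranges, ...), all treated as containers.
    columns_by_type: maps a type to the columns declared with that type.
    """
    if seen is None:
        seen = set()
    if type_name in seen:
        return []
    seen.add(type_name)

    found = list(columns_by_type.get(type_name, []))
    # Recurse into every directly dependent type; no special case for arrays.
    for container in dependents.get(type_name, []):
        found.extend(
            find_dependent_columns(container, dependents, columns_by_type, seen)
        )
    return found


dependents = {
    "comptype": ["comptype[]", "dom_over_comp"],
    "dom_over_comp": ["range_over_dom"],
}
columns_by_type = {
    "comptype": ["t1.a"],
    "comptype[]": ["t2.b"],
    "range_over_dom": ["t3.c"],
}
print(find_dependent_columns("comptype", dependents, columns_by_type))
# prints ['t1.a', 't2.b', 't3.c']
```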
The very similar logic in get_rels_with_domain was likewise a few
bricks shy of a load, as it supposed that a directly dependent type
could *only* be a sub-domain. This is already wrong for ranges over
domains, and it'll someday be wrong for arrays over domains.
Add test cases illustrating the problems, and back-patch to all
supported branches.
Discussion: https://postgr.es/m/15268.1502309024@sss.pgh.pa.us
Commit 1efc7e538 did a poor job of emulating existing logic for touching
Datums that might be expanded-object pointers. It didn't check for typlen
being -1 first, which meant it could crash on fixed-length pass-by-ref
values, and probably on cstring values as well. It also didn't use
DatumGetPointer before VARATT_IS_EXTERNAL_EXPANDED, which while currently
harmless is not according to documentation nor prevailing style.
I also think the lack of any explanation as to why datumSerialize makes
these particular nonobvious choices is pretty awful, so fix that.
Per report from Jarred Ward. Back-patch to 9.6 where this code came in.
Discussion: https://postgr.es/m/6F61E6D2-2F5E-4794-9479-A429BE1CEA4B@simple.com
Similar to what was fixed in commit 9915de6c1c for replication slots,
but this time it's related to replication origins: DROP SUBSCRIPTION
attempts to drop the replication origin, but that fails if the
replication worker process hasn't yet marked it unused. This causes
failures in the buildfarm:
ERROR: could not drop replication origin with OID 1, in use by PID 34069
Like the aforementioned commit, fix by having the process running DROP
SUBSCRIPTION sleep until the worker marks the replication origin
struct as free. This uses a condition variable on each replication
origin shmem state struct, so that the session trying to drop can sleep
and expect to be awakened by the process keeping the origin open.
Also fix some SGML markup in the previous commit.
Discussion: https://postgr.es/m/20170808001433.rozlseaf4m2wkw3n@alvherre.pgsql
In commit 9915de6c1c, we introduced a new wait point for replication
slots and incorrectly labelled it as wait event PG_WAIT_LOCK. That's
wrong, so invent an appropriate new wait event instead, and document it
properly.
While at it, fix numerous other problems in the vicinity:
- two different walreceiver wait events were being mixed up in a single
wait event (which wasn't documented either); split it out so that they
can be distinguished, and document the new events properly.
- ParallelBitmapPopulate was documented but didn't exist.
- ParallelBitmapScan was not documented (I think this should be called
"ParallelBitmapScanInit" instead.)
- Logical replication wait events weren't documented
- various symbols had been added in dartboard order in various places.
Put them in alphabetical order instead, as was originally intended.
Discussion: https://postgr.es/m/20170808181131.mu4fjepuh5m75cyq@alvherre.pgsql
This would lead to failures if local and remote tables have a different
column order. The tests previously didn't catch that because they only
tested the initial data copy. So add another test that exercises the
apply worker.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
The relation attribute map was not initialized for dropped columns,
leading to errors later on.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reported-by: Scott Milliken <scott@deltaex.com>
Bug: #14769
lo_put() surely should require UPDATE permission, the same as lowrite(),
but it failed to check for that, as reported by Chapman Flack. Oversight
in commit c50b7c09d; backpatch to 9.4 where that was introduced.
Tom Lane and Michael Paquier
Security: CVE-2017-7548
Commit 3eefc51053 claimed to make
pg_user_mappings enforce the qualifications user_mapping_options had
been enforcing, but its removal of a longstanding restriction left them
distinct when the current user is the subject of a mapping yet has no
server privileges. user_mapping_options emits no rows for such a
mapping, but pg_user_mappings includes full umoptions. Change
pg_user_mappings to show null for umoptions. Back-patch to 9.2, like
the above commit.
Reviewed by Tom Lane. Reported by Jeff Janes.
Security: CVE-2017-7547
Some authentication methods allowed it, others did not. On the client side,
libpq does not even try to authenticate with an empty password, which makes
using empty passwords hazardous: an administrator might think that an
account with an empty password cannot be used to log in, because psql
doesn't allow it, and not realize that a different client would in fact
allow it. To clear that confusion and to be consistent, disallow empty
passwords in all authentication methods.
All the authentication methods that used plaintext authentication over the
wire, except for BSD authentication, already checked that the password
received from the user was not empty. To avoid forgetting it in the future
again, move the check to the recv_password_packet function. That only
forbids using an empty password with plaintext authentication, however.
MD5 and SCRAM need a different fix:
* In stable branches, check that the MD5 hash stored for the user does
not correspond to an empty string. This adds some overhead to MD5
authentication, because the server needs to compute an extra MD5 hash, but
it is not noticeable in practice.
* In HEAD, modify CREATE and ALTER ROLE to clear the password if an empty
string, or a password hash that corresponds to an empty string, is
specified. The user-visible behavior is the same as in the stable branches
(the user cannot log in), but it seems better to stop the empty password from
entering the system in the first place. Secondly, it is fairly expensive to
check that a SCRAM hash doesn't correspond to an empty string, because
computing a SCRAM hash is much more expensive than an MD5 hash by design,
so better avoid doing that on every authentication.
We could clear the password on CREATE/ALTER ROLE also in stable branches,
but we would still need to check at authentication time, because even if we
prevent empty passwords from being stored in pg_authid, there might be
existing ones there already.
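To illustrate the MD5 check, here is a small Python sketch. It relies
on the stored MD5 format being the string "md5" followed by the hex MD5
digest of the password concatenated with the role name; the function
names are hypothetical and this is not the server's code.

```python
import hashlib


def md5_stored_hash(password, role):
    """Build a stored hash: "md5" + md5(password || role name)."""
    digest = hashlib.md5((password + role).encode("utf-8")).hexdigest()
    return "md5" + digest


def is_empty_md5_password(stored_hash, role):
    """Return True if stored_hash is what an empty password would produce.

    This costs one extra MD5 computation per check, which is the
    overhead the message refers to.
    """
    return stored_hash == md5_stored_hash("", role)


print(is_empty_md5_password(md5_stored_hash("", "alice"), "alice"))
print(is_empty_md5_password(md5_stored_hash("secret", "alice"), "alice"))
```

The same comparison would be far more costly for SCRAM, since SCRAM
hashing is deliberately expensive.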
Reported by Jeroen van der Ham, Ben de Graaff and Jelte Fennema.
Security: CVE-2017-7546
The callers of GetOldestSafeDecodingTransactionId() all passed the inverse
of the intended value for the argument introduced in 2bef06d516. Luckily this
appears to be inconsequential for the moment, as we wait for
concurrent in-progress transactions when assembling a
snapshot. Additionally this could only make a difference when adding a
second logical slot, because only a pre-existing slot could cause an
issue by lowering the returned xid dangerously much.
Reported-By: Antonin Houska
Discussion: https://postgr.es/m/32704.1496993134@localhost
Backport: 9.4-, where 2bef06d516 was backpatched to.
Previously, it had no effect. Now, if archive_mode=always, it will
work, and if not, you'll get a warning.
Masahiko Sawada, Michael Paquier, and Robert Haas. The patch as
submitted also changed the behavior so that we would write and remove
history files on standbys, but that seems like material for a separate
patch to me.
Discussion: http://postgr.es/m/CAD21AoC2Xw6M=ZJyejq_9d_iDkReC_=rpvQRw5QsyzKQdfYpkw@mail.gmail.com
Supporting ICU 4.2 seems useful because it ships with CentOS 6.
Versions before ICU 4.6 don't support pkg-config, so document an
installation method without using pkg-config.
In ICU 4.2, ucol_getKeywordsForLocale() sometimes returns values that
will not be accepted by uloc_toLanguageTag(). Skip loading keyword
variants in that version.
Reported-by: Victor Wagner <vitus@wagner.pp.ru>
If it works, then we won't be storing two copies of all the tuples
that were just moved. If not, VACUUM will still take care of it
eventually. Per a report from AP and analysis from Amit Kapila, it
seems that a bulk load can cause splits fast enough that VACUUM won't
deal with the problem in time to prevent bloat.
Amit Kapila; I rewrote the comment.
Discussion: http://postgr.es/m/20170704105728.mwb72jebfmok2nm2@zip.com.au
If you do ALTER COLUMN SET NOT NULL against an inheritance parent table,
it will recurse to mark all the child columns as NOT NULL as well. This
is necessary for consistency: if the column is labeled NOT NULL then
reading it should never produce nulls.
However, that didn't happen in the case where ALTER ... ADD PRIMARY KEY
marks a target column NOT NULL that wasn't before. That was questionable
from the beginning, and now Tushar Ahuja points out that it can lead to
dump/restore failures in some cases. So let's make that case recurse too.
Although this is meant to fix a bug, it's enough of a behavioral change
that I'm pretty hesitant to back-patch, especially in view of the lack
of similar field complaints. It doesn't seem to be too late to put it
into v10 though.
Michael Paquier, editorialized on slightly by me
Discussion: https://postgr.es/m/b8794d6a-38f0-9d7c-ad4b-e85adf860fc9@enterprisedb.com
We don't actually support session tickets, since we do not create an SSL
session identifier. But it seems that OpenSSL will issue a session ticket
on-demand anyway, which will then fail when used. This results in
reconnection failures when using ticket-aware client-side SSL libraries
(such as the Npgsql .NET driver), as reported by Shay Rojansky.
To fix, just tell OpenSSL not to issue tickets. At some point in the
far future, we might consider enabling tickets instead. But the security
implications of that aren't entirely clear; and besides it would have
little benefit except for very short-lived database connections, which is
Something We're Bad At anyhow. It would take a lot of other work to get
to a point where that would really be an exciting thing to do.
While at it, also tell OpenSSL not to use a session cache. This doesn't
really do anything, since a backend would never populate the cache anyway,
but it might gain some micro-efficiencies and/or reduce security
exposures.
Patch by me, per discussion with Heikki Linnakangas and Shay Rojansky.
Back-patch to all supported versions.
Discussion: https://postgr.es/m/CADT4RqBU8N-csyZuzaook-c795dt22Zcwg1aHWB6tfVdAkodZA@mail.gmail.com
ALTER USER ... SET did not support all the syntax variants of ALTER ROLE
... SET. Fix that, and to avoid further deviations of this kind, unify
many of the grammar rules for ROLE/USER/GROUP commands.
Reported-by: Pavel Golub <pavel@microolap.com>
Some of these comments wrongly implied that only an AFTER ROW trigger
will cause a 'wholerow' attribute to be present for a foreign table,
but a BEFORE ROW trigger can have the same effect. Others implied
that it would always be present for a foreign table, but that's not
true either.
Etsuro Fujita and Robert Haas
Discussion: http://postgr.es/m/10026bc7-1403-ef85-9e43-c6100c1cc0e3@lab.ntt.co.jp
Otherwise, partitioned tables with RETURNING expressions or subject
to a WITH CHECK OPTION do not work properly.
Amit Langote, reviewed by Amit Khandekar and Etsuro Fujita. A few
comment changes by me.
Discussion: http://postgr.es/m/9a39df80-871e-6212-0684-f93c83be4097@lab.ntt.co.jp
init_htab(), with #define HASH_DEBUG, prints a bunch of hashtable
parameters. It used to also print nentries, but commit 44ca4022f changed
that to "hash_get_num_entries(hctl)", which is wrong (the parameter should
be "hashp").
Rather than correct the coding, though, let's just remove that field from
the printout. The table must be empty, since we just finished building
it, so expensively calculating the number of entries is rather pointless.
Moreover hash_get_num_entries makes assumptions (about not needing locks)
which we could do without in debugging code.
Noted by Choi Doo-Won in bug #14764. Back-patch to 9.6 where the
faulty code was introduced.
Discussion: https://postgr.es/m/20170802032353.8424.12274@wrigleys.postgresql.org
This fixes a crash if the local table has a function index and the
function makes non-immutable calls.
Reported-by: Scott Milliken <scott@deltaex.com>
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Commit c0a15e07c moved the setting of OpenSSL's SSL_OP_SINGLE_DH_USE option
into a new subroutine initialize_dh(), but forgot to remove it from where
it was. SSL_CTX_set_options() is a trivial function, amounting indeed to
just "ctx->options |= op", hence there's no reason to contort the code or
break separation of concerns to avoid calling it twice. So separating the
DH setup from disabling of old protocol versions is a good change, but we
need to finish the job.
Noted while poking into the question of SSL session tickets.
This doesn't have a significant impact except that now SECURITY LABEL ON
DOMAIN rejects types that are not domains.
Reported-by: 高增琦 <pgf00a@gmail.com>
The early buildfarm returns for commit 1e165d05f are pretty awful:
not only does Windows not return a useful error, but it looks like
a lot of Unix-ish platforms don't either. Given the number of
different errnos seen so far, guess that what's really going on is
that some newlocale() implementations fail to set errno at all.
Hence, let's try zeroing errno just before newlocale() and then
if it's still zero report as though it's ENOENT. That should cover
the Windows case too.
It's clear that we'll have to drop the regression test case, unless
we want to maintain a separate expected-file for platforms without
HAVE_LOCALE_T. But I'll leave it there awhile longer to see if this
actually improves matters or not.
Discussion: https://postgr.es/m/CAKKotZS-wcDcofXDCH=sidiuajE+nqHn2CGjLLX78anyDmi3gQ@mail.gmail.com
We were just printing errno, which is certainly not gonna work on
Windows. Now, it's not entirely clear from Microsoft's documentation
whether _create_locale() adheres to standard Windows error reporting
conventions, but let's assume it does and try to map the GetLastError
result to an errno. If this turns out not to work, probably the best
thing to do will be to assume the error is always ENOENT on Windows.
This is a longstanding bug, but given the lack of previous field
complaints, I'm not excited about back-patching it.
Per report from Murtuza Zabuawala.
Discussion: https://postgr.es/m/CAKKotZS-wcDcofXDCH=sidiuajE+nqHn2CGjLLX78anyDmi3gQ@mail.gmail.com
Most of our collations code has special handling for the locale names
"C" and "POSIX", allowing those collations to be used whether or not
the system libraries think those locale names are valid, or indeed
whether said libraries even have any locale support. But we missed
handling things that way in CREATE COLLATION. This meant you couldn't
clone the C/POSIX collations, nor explicitly define a new collation
using those locale names, unless the libraries allow it. That's pretty
pointless, as well as being a violation of pg_newlocale_from_collation's
API specification.
The practical effect of this change is quite limited: it allows creating
such collations even on platforms that don't HAVE_LOCALE_T, and it allows
making "POSIX" collation objects on Windows, which before this would only
let you make "C" collation objects. Hence, even though this is a bug fix
IMO, it doesn't seem worth the trouble to back-patch.
In passing, suppress the DROP CASCADE detail messages at the end of the
collation regression test. I'm surprised we've never been bit by
message ordering issues there.
Per report from Murtuza Zabuawala.
Discussion: https://postgr.es/m/CAKKotZS-wcDcofXDCH=sidiuajE+nqHn2CGjLLX78anyDmi3gQ@mail.gmail.com
1024 bits is considered weak these days, but OpenSSL always passes 1024 as
the key length to the tmp_dh callback. All the code to handle other key
lengths is, in fact, dead.
To remedy those issues:
* Only include hard-coded 2048-bit parameters.
* Set the parameters directly with SSL_CTX_set_tmp_dh(), without the
callback
* The name of the file containing the DH parameters is now a GUC. This
replaces the old hardcoded "dh1024.pem" filename. (The files for other
key lengths, dh512.pem, dh2048.pem, etc. were never actually used.)
This is not a new problem, but it doesn't seem worth the risk and churn to
backport. If you care enough about the strength of the DH parameters on
old versions, you can create custom DH parameters, with as many bits as you
wish, and put them in the "dh1024.pem" file.
Per report by Nicolas Guini and Damian Quiroga. Reviewed by Michael Paquier.
Discussion: https://www.postgresql.org/message-id/CAMxBoUyjOOautVozN6ofzym828aNrDjuCcOTcCquxjwS-L2hGQ@mail.gmail.com
This allows us to add stack-depth checks the first time an executor
node is called, and skip that overhead on following
calls. Additionally it yields a nice speedup.
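The idea of checking only on the first call can be modelled in a few
lines of Python. This is a toy sketch under stated assumptions: the
class and method names are hypothetical, and the "stack-depth check" is
a counter rather than a real check.

```python
class PlanNode:
    """Toy executor node; all names here are hypothetical."""

    def __init__(self, rows):
        self._rows = iter(rows)
        self.depth_checks = 0
        # Start out pointing at the wrapper that does the one-time check.
        self.exec_proc_node = self._exec_first

    def _check_stack_depth(self):
        # Stand-in for a real stack-depth check; it only counts calls.
        self.depth_checks += 1

    def _exec_first(self):
        self._check_stack_depth()
        # Repoint the "function pointer" so later calls skip the check.
        self.exec_proc_node = self._exec_real
        return self.exec_proc_node()

    def _exec_real(self):
        # Return the next row, or None when the node is exhausted.
        return next(self._rows, None)


node = PlanNode([1, 2, 3])
results = [node.exec_proc_node() for _ in range(4)]
print(results, node.depth_checks)
# prints [1, 2, 3, None] 1
```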
While it'd probably have been a good idea to have that check all
along, it has become more important after the new expression
evaluation framework in b8d7f053c5 - there's no stack depth
check in common paths anymore now. We previously relied on
ExecEvalExpr() being executed somewhere.
We should move towards that model for further routines, but as this is
required for v10, it seems better to only do the necessary (which
already is quite large).
Author: Andres Freund, Tom Lane
Reported-By: Julien Rouhaud
Discussion:
https://postgr.es/m/22833.1490390175@sss.pgh.pa.us
https://postgr.es/m/b0af9eaa-130c-60d0-9e4e-7a135b1e0c76@dalibo.com
In a followup commit ExecProcNode(), and especially the large switch
it contains, will largely be replaced by a function pointer directly
to the correct node. The node functions will then get invoked by a
thin inline function wrapper. To avoid having to include miscadmin.h
in headers - CHECK_FOR_INTERRUPTS() - move the interrupt checks into
the individual executor routines.
While looking through all executor nodes, I noticed a number of
arguably missing interrupt checks; add these too.
Author: Andres Freund, Tom Lane
Reviewed-By: Tom Lane
Discussion:
https://postgr.es/m/22833.1490390175@sss.pgh.pa.us
If several sessions are concurrently locking a tuple update chain with
nonconflicting lock modes using an old snapshot, and they all ought to succeed,
it may happen that some of them fail because of restarting the loop (due
to a concurrent Xmax change) and getting an error in the subsequent pass
while trying to obtain a tuple lock that they already have in some tuple
version.
This can only happen with very high concurrency (where a row is being
both updated and FK-checked by multiple transactions concurrently), but
it's been observed in the field and can have unpleasant consequences
such as an FK check failing to see a tuple that definitely exists:
ERROR: insert or update on table "child_table" violates foreign key constraint "fk_constraint_name"
DETAIL: Key (keyid)=(123456) is not present in table "parent_table".
(where the key is observably present in the table).
Discussion: https://postgr.es/m/20170714210011.r25mrff4nxjhmf3g@alvherre.pgsql
Initial submitted versions of the functional dependencies patch ignored
row groups that were smaller than a configured size. However, that
consideration was removed in late stages of the patch just before
commit, but some comments referring to it remained. Remove them to
avoid confusion.
Author: Atsushi Torikoshi
Discussion: https://postgr.es/m/7cfb23fc-4493-9c02-5da9-e505fd0115d2@lab.ntt.co.jp
It is relatively easy to get a replication slot to appear still active
while one process is in the process of getting rid of it; when some
other process tries to "acquire" the slot, it would fail with an error
message of "replication slot XYZ is active for PID N".
The error message in itself is fine, except that when the intention is
to drop the slot, it is unhelpful: the useful behavior would be to wait
until the slot is no longer acquired, so that the drop can proceed. To
implement this, we use a condition variable so that slot acquisition can
be told to wait on that condition variable if the slot is already
acquired, and we make any change in active_pid broadcast a signal on the
condition variable. Thus, as soon as the slot is released, the drop
will proceed properly.
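The waiting scheme can be sketched with Python's threading.Condition.
This is only a model of the idea, not the slot code: the Slot class,
its methods, and the nowait parameter are assumptions made for the
example, with active_pid == 0 meaning "not acquired".

```python
import threading


class Slot:
    """Toy replication slot guarded by a condition variable."""

    def __init__(self):
        self.active_pid = 0
        self._cv = threading.Condition()

    def acquire(self, pid, nowait):
        with self._cv:
            while self.active_pid != 0:
                if nowait:
                    # Old behavior: fail immediately if someone holds it.
                    raise RuntimeError(
                        f"replication slot is active for PID {self.active_pid}"
                    )
                # New behavior for drops: sleep until active_pid changes.
                self._cv.wait()
            self.active_pid = pid
            # Every change in active_pid is broadcast to waiters.
            self._cv.notify_all()

    def release(self):
        with self._cv:
            self.active_pid = 0
            self._cv.notify_all()
```

A session that wants to drop the slot would call acquire() with
nowait=False and simply block until the current holder calls release().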
Reported by: Tom Lane
Discussion: https://postgr.es/m/11904.1499039688@sss.pgh.pa.us
Authors: Petr Jelínek, Álvaro Herrera
Trading a little too heavily on letting the code path be the same whether
we were creating shared data structures or only attaching to them,
InitPredicateLocks() inserted the "scratch" PredicateLockTargetHash entry
unconditionally. This is just wrong if we're in a postmaster child,
which would only reach this code in EXEC_BACKEND builds. Most of the
time, the hash_search(HASH_ENTER) call would simply report that the
entry already existed, causing no visible effect since the code did not
bother to check for that possibility. However, if this happened while
some other backend had transiently removed the "scratch" entry, then
that other backend's eventual RestoreScratchTarget would suffer an
assert failure; this appears to be the explanation for a recent failure
on buildfarm member culicidae. In non-assert builds, there would be
no visible consequences there either. But nonetheless this is a pretty
bad bug for EXEC_BACKEND builds, for two reasons:
1. Each new backend would perform the hash_search(HASH_ENTER) call
without holding any lock that would prevent concurrent access to the
PredicateLockTargetHash hash table. This creates a low but certainly
nonzero risk of corruption of that hash table.
2. In the event that the race condition occurred, by reinserting the
scratch entry too soon, we were defeating the entire purpose of the
scratch entry, namely to guarantee that transaction commit could move
hash table entries around with no risk of out-of-memory failure.
The odds of an actual OOM failure are quite low, but not zero, and if
it did happen it would again result in corruption of the hash table.
The user-visible symptoms of such corruption are a little hard to predict,
but would presumably amount to misbehavior of SERIALIZABLE transactions
that'd require a crash or postmaster restart to fix.
To fix, just skip the hash insertion if IsUnderPostmaster. I also
inserted a bunch of assertions that the expected things happen
depending on whether IsUnderPostmaster is true. That might be overkill,
since most comparable code in other functions isn't quite that paranoid,
but once burnt twice shy.
In passing, also move a couple of lines to places where they seemed
to make more sense.
Diagnosis of problem by Thomas Munro, patch by me. Back-patch to
all supported branches.
Discussion: https://postgr.es/m/10593.1500670709@sss.pgh.pa.us
If the user modifies a view that has CHECK OPTIONs and this gets
translated into a modification to an underlying relation which happens
to be a foreign table, the check options should be enforced. In the
normal code path, that was happening properly, but it was not working
properly for "direct" modification because the whole operation gets
pushed to the remote side in that case and we never have an option to
enforce the constraint against individual tuples. Fix by disabling
direct modification when there is a need to enforce CHECK OPTIONs.
Etsuro Fujita, reviewed by Kyotaro Horiguchi and by me.
Discussion: http://postgr.es/m/f8a48f54-6f02-9c8a-5250-9791603171ee@lab.ntt.co.jp
Various cases involving renaming of view columns are handled by having
make_viewdef pass down the view's current relation tupledesc to
get_query_def, which then takes care to use the column names from the
tupledesc for the output column names of the SELECT. For some reason
though, we'd missed teaching make_ruledef to do similarly when it is
printing an ON SELECT rule, even though this is exactly the same case.
The results from pg_get_ruledef would then be different and arguably wrong.
In particular, this breaks pre-v10 versions of pg_dump, which in some
situations would define views by means of emitting a CREATE RULE ... ON
SELECT command. Third-party tools might not be happy either.
In passing, clean up some crufty code in make_viewdef; we'd apparently
modernized the equivalent code in make_ruledef somewhere along the way,
and missed this copy.
Per report from Gilles Darold. Back-patch to all supported versions.
Discussion: https://postgr.es/m/ec05659a-40ff-4510-fc45-ca9d965d0838@dalibo.com
Add error checks in some places that were calling get_opfamily_member
or get_opfamily_proc and just assuming that the call could never fail.
Also, standardize the wording for such errors in some other places.
None of these errors are expected in normal use, hence they're just
elog not ereport. But they may be handy for diagnosing omissions in
custom opclasses.
Rushabh Lathia found the oversight in RelationBuildPartitionKey();
I found the others by grepping for all callers of these functions.
Discussion: https://postgr.es/m/CAGPqQf2R9Nk8htpv0FFi+FP776EwMyGuORpc9zYkZKC8sFQE3g@mail.gmail.com
While I couldn't find any live bugs in commit 44ca4022f, the comments
seemed pretty far from adequate; in particular it was not made plain that
"borrowing" entries from other freelists is critical for correctness.
Try to improve the commentary. A couple of very minor code style
tweaks, as well.
Discussion: https://postgr.es/m/10593.1500670709@sss.pgh.pa.us
SLRU buffer lwlocks were allocated twice by oversight in commit
fe702a7b3f, where those locks were moved to a
separate tranche. The bug has no user-visible effects other than a small
waste of shared memory.
Backpatch to 9.6 where it was introduced.
Alexander Korotkov, with minor editing by me.
Previously, UNBOUNDED meant no lower bound when used in the FROM list,
and no upper bound when used in the TO list, which was OK for
single-column range partitioning, but problematic with multiple
columns. For example, an upper bound of (10.0, UNBOUNDED) would not be
collocated with a lower bound of (10.0, UNBOUNDED), thus making it
difficult or impossible to define contiguous multi-column range
partitions in some cases.
Fix this by using MINVALUE and MAXVALUE instead of UNBOUNDED to
represent a partition column that is unbounded below or above
respectively. This syntax removes any ambiguity, and ensures that if
one partition's lower bound equals another partition's upper bound,
then the partitions are contiguous.
Also drop the constraint prohibiting finite values after an unbounded
column, and just document the fact that any values after MINVALUE or
MAXVALUE are ignored. Previously it was necessary to repeat UNBOUNDED
multiple times, which was needlessly verbose.
Note: Forces a post-PG 10 beta2 initdb.
Report by Amul Sul, original patch by Amit Langote with some
additional hacking by me.
Discussion: https://postgr.es/m/CAAJ_b947mowpLdxL3jo3YLKngRjrq9+Ej4ymduQTfYR+8=YAYQ@mail.gmail.com
Normally, a JoinExpr would have empty "quals" only if it came from CROSS
JOIN syntax. However, it's possible to get to this state by specifying
NATURAL JOIN between two tables with no common column names, and there
might be other ways too. The code previously printed no ON clause if
"quals" was empty; that's right for CROSS JOIN but syntactically invalid
if it's some type of outer join. Fix by printing ON TRUE in that case.
This got broken by commit 2ffa740be, which stopped using NATURAL JOIN
syntax in ruleutils output due to its brittleness in the face of
column renamings. Back-patch to 9.3 where that commit appeared.
Per report from Tushar Ahuja.
Discussion: https://postgr.es/m/98b283cd-6dda-5d3f-f8ac-87db8c76a3da@enterprisedb.com
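The ON TRUE rule described above can be illustrated with a minimal sketch.
This is not the ruleutils.c code: the enum, the function name
demo_deparse_join(), and the string-based "quals" argument are all
hypothetical stand-ins, with a NULL "quals" playing the role of an empty
qualification list.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical join types; the real JoinExpr has more variants. */
typedef enum DemoJoinType
{
    DEMO_JOIN_INNER,
    DEMO_JOIN_LEFT,
    DEMO_JOIN_FULL
} DemoJoinType;

/*
 * Write a textual join into buf.  An inner join with no quals is printed
 * as CROSS JOIN; any other join with no quals gets "ON TRUE" so that the
 * output stays syntactically valid.  Returns snprintf's result.
 */
int
demo_deparse_join(char *buf, size_t buflen,
                  const char *lhs, const char *rhs,
                  DemoJoinType type, const char *quals)
{
    const char *keyword;

    if (type == DEMO_JOIN_INNER)
    {
        if (quals == NULL)
            return snprintf(buf, buflen, "%s CROSS JOIN %s", lhs, rhs);
        keyword = "JOIN";
    }
    else
        keyword = (type == DEMO_JOIN_LEFT) ? "LEFT JOIN" : "FULL JOIN";

    /* Outer joins require an ON clause, so fall back to ON TRUE. */
    return snprintf(buf, buflen, "%s %s %s ON %s",
                    lhs, keyword, rhs, quals ? quals : "TRUE");
}
```

With these assumptions, a LEFT join with no quals is printed with an
explicit ON TRUE, while the inner-join case still comes out as CROSS JOIN.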
When pg_control was first designed, sizeof(ControlFileData) was small
enough that a comment seemed like plenty to document the assumption that
it'd fit into one disk sector. Now it's nearly 300 bytes, raising the
possibility that somebody would carelessly add enough stuff to create
a problem. Let's add a StaticAssertStmt() to ensure that the situation
doesn't pass unnoticed if it ever occurs.
While at it, rename PG_CONTROL_SIZE to PG_CONTROL_FILE_SIZE to make it
clearer what that symbol means, and convert the existing runtime
comparisons of sizeof(ControlFileData) vs. PG_CONTROL_FILE_SIZE to be
static asserts --- we didn't have that technology when this code was
first written.
Discussion: https://postgr.es/m/9192.1500490591@sss.pgh.pa.us
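As a rough illustration of the compile-time check described above, here is
a sketch using C11's _Static_assert in place of PostgreSQL's
StaticAssertStmt(). The struct layout, the 512-byte sector size, and the
DEMO_ names are assumptions made for this example only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ControlFileData; the real struct differs. */
typedef struct DemoControlFileData
{
    uint64_t system_identifier;
    uint32_t version;
    uint32_t state;
    char     other_fields[256];
} DemoControlFileData;

/* Assumed values, for illustration only. */
#define DEMO_SECTOR_SIZE        512
#define DEMO_CONTROL_FILE_SIZE  8192

/* Compilation fails if someone grows the struct past one sector. */
_Static_assert(sizeof(DemoControlFileData) <= DEMO_SECTOR_SIZE,
               "control data must fit in one disk sector");

/* Replaces what would otherwise be a runtime sizeof comparison. */
_Static_assert(sizeof(DemoControlFileData) <= DEMO_CONTROL_FILE_SIZE,
               "control data must fit in the control file");

/* Helper so callers can inspect the size at runtime if they wish. */
size_t
demo_control_data_size(void)
{
    return sizeof(DemoControlFileData);
}
```

The point of the design is that the failure moves from a runtime check
that might never be exercised to a build break that cannot be overlooked.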
It seemed a bit silly that each caller of make_tsvector() was laboriously
special-casing the situation where no lexemes were found, when it would
be easy and much more bullet-proof to make make_tsvector() handle that.
In an off-list followup to bug #14745, Bob Jones complained that
to_tsvector() on a 2MB jsonb value took an unreasonable amount of
time and space --- enough to draw the wrath of the OOM killer on
his machine. On my machine, his example proved to require upwards
of 18 seconds and 4GB, which seemed pretty bogus considering that
to_tsvector() on the same data treated as text took just a couple
hundred msec and 10 or so MB.
On investigation, the problem is that the implementation scans each
string element of the json(b) and converts it to tsvector separately,
then applies tsvector_concat() to join those separate tsvectors.
The unreasonable memory usage came from leaking every single one of
the transient tsvectors --- but even without that mistake, this is an
O(N^2) or worse algorithm, because tsvector_concat() has to repeatedly
process the words coming from earlier elements.
We can fix it by accumulating all the lexeme data and applying
make_tsvector() just once. As a side benefit, that also makes the
desired adjustment of lexeme positions far cheaper, because we can
just tweak the running "pos" counter between JSON elements.
In passing, try to make the explanation of that tweak more intelligible.
(I didn't think that a barely-readable comment far removed from the
actual code was helpful.) And do some minor other code beautification.
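To make the O(N^2) claim concrete, here is a deliberately simplified cost
model, not the tsvector code itself. It assumes each JSON element
contributes exactly one lexeme and that every concatenation copies the
whole accumulated result; both function names are hypothetical.

```c
#include <stddef.h>

/*
 * Copies performed when n one-lexeme vectors are merged one at a time:
 * each merge rebuilds the result, so step i copies i lexemes.
 * Total is n * (n + 1) / 2.
 */
size_t
demo_copies_repeated_concat(size_t n)
{
    size_t copies = 0;
    size_t result_len = 0;
    size_t i;

    for (i = 0; i < n; i++)
    {
        result_len += 1;        /* new result holds one more lexeme */
        copies += result_len;   /* and is rebuilt from scratch */
    }
    return copies;
}

/*
 * Copies performed when all lexemes are accumulated first and the final
 * vector is built once: each lexeme is copied a single time.
 */
size_t
demo_copies_accumulate_once(size_t n)
{
    return n;
}
```

Under this model, 1000 elements cost 500500 copies with repeated
concatenation versus 1000 when the vector is built once.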
Before, we always used a dummy value of 1, but that's not right when
the partitioned table being modified is inside of a WITH clause
rather than part of the main query.
Amit Langote, reported and reviewed by Etsuro Fujita, with a comment
change by me.
Discussion: http://postgr.es/m/ee12f648-8907-77b5-afc0-2980bcb0aa37@lab.ntt.co.jp
One, logging for CREATE INDEX was oblivious to the fact that when
an unlogged table is created, *only* operations on the init fork
should be logged.
Two, init fork buffers need to be flushed after they are written;
otherwise, a filesystem-level copy following recovery may do the
wrong thing. (There may be a better fix for this issue than the
one used here, but this is transposed from the similar logic already
present in XLogReadBufferForRedoExtended, and a broader refactoring
after beta2 seems inadvisable.)
Amit Kapila, reviewed by Ashutosh Sharma, Kyotaro Horiguchi,
and Michael Paquier
Discussion: http://postgr.es/m/CAA4eK1JpcMsEtOL_J7WODumeEfyrPi7FPYHeVdS7fyyrCrgp4w@mail.gmail.com
Given this code's general eagerness to use subexpressions' output variables
as temporary workspace, it's not exactly clear that it is safe for
FieldStore to tell a newval subexpression that it can write into the same
variable that is being supplied as a potential input. Document the chain
of assumptions needed for that to be safe.
I got confused about why this function doesn't need to recursively
search the expression tree for a CaseTestExpr node. After figuring
that out, add a comment to save the next person some time.
Add missing infrastructure for this node type, notably in ruleutils.c where
its lack could demonstrably cause EXPLAIN to fail. Add outfuncs/readfuncs
support. (outfuncs support is useful today for debugging purposes. The
readfuncs support may never be needed, since at present it would only
matter for parallel query and NextValueExpr should never appear in a
parallelizable query; but it seems like a bad idea to have a primnode type
that isn't fully supported here.) Teach planner infrastructure that
NextValueExpr is a volatile, parallel-unsafe, non-leaky expression node
with cost cpu_operator_cost. Given its limited scope of usage, there
*might* be no live bug today from the lack of that knowledge, but it's
certainly going to bite us on the rear someday. Teach pg_stat_statements
about the new node type, too.
While at it, also teach cost_qual_eval() that MinMaxExpr, SQLValueFunction,
XmlExpr, and CoerceToDomain should be charged as cpu_operator_cost.
Failing to do this for SQLValueFunction was an oversight in my commit
0bb51aa96. The others are longer-standing oversights, but no time like the
present to fix them. (In principle, CoerceToDomain could have cost much
higher than this, but it doesn't presently seem worth trying to examine the
domain's constraints here.)
Modify execExprInterp.c to execute NextValueExpr as an out-of-line
function; it seems quite unlikely to me that it's worth insisting that
it be inlined in all expression eval methods. Besides, providing the
out-of-line function doesn't stop anyone from inlining if they want to.
Adjust some places where NextValueExpr support had been inserted with the
aid of a dartboard rather than keeping it in the same order as elsewhere.
Discussion: https://postgr.es/m/23862.1499981661@sss.pgh.pa.us
The grammar will only accept something syntactically similar to a function
call in a function-in-FROM expression. However, there are various ways
to input something that ruleutils.c won't deparse that way, potentially
leading to a view or rule that fails dump/reload. Fix by inserting a
dummy CAST around anything that isn't going to deparse as a function
(which is one of the ways to get something like that in there in the
first place).
In HEAD, also make use of the infrastructure added by this to avoid
emitting unnecessary parentheses in CREATE INDEX deparsing. I did
not change that in back branches, thinking that people might find it
to be unexpected/unnecessary behavioral change.
In HEAD, also fix incorrect logic for when to add extra parens to
partition key expressions. Somebody apparently thought they could
get away with simpler logic than pg_get_indexdef_worker has, but
they were wrong --- a counterexample is PARTITION BY LIST ((a[1])).
Ignoring the prettyprint flag for partition expressions isn't exactly
a nice solution anyway.
This has been broken all along, so back-patch to all supported branches.
Discussion: https://postgr.es/m/10477.1499970459@sss.pgh.pa.us
The race condition goes like this:
1. GetNewTransactionId advances nextXid e.g. from 100 to 101
2. GetOldestActiveTransactionId reads the new nextXid, 101
3. GetOldestActiveTransactionId loops through the proc array. There are no
active XIDs there, so it returns 101 as the oldest active XID.
4. GetNewTransactionId stores XID 100 to MyPgXact->xid
So, GetOldestActiveTransactionId returned XID 101, even though 100 only
just started and is surely still running.
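The four steps above can be replayed deterministically in a small sketch.
This is a single-threaded simulation of the bad interleaving, not the
actual procarray code; the struct, its fields, and the function name are
hypothetical, and the proc array is reduced to a single slot.

```c
typedef unsigned int DemoXid;

/* Hypothetical shared state: a counter and one proc-array slot. */
typedef struct DemoShared
{
    DemoXid next_xid;   /* next XID to hand out */
    DemoXid proc_xid;   /* advertised XID, 0 means "none" */
} DemoShared;

/*
 * Replay the racy interleaving and return what the reader concludes is
 * the oldest active XID.
 */
DemoXid
demo_racy_oldest_active(DemoShared *sh)
{
    DemoXid assigned;
    DemoXid oldest;

    /* Step 1: writer takes an XID and advances the counter. */
    assigned = sh->next_xid;
    sh->next_xid++;

    /* Step 2: reader picks up the already-advanced counter. */
    oldest = sh->next_xid;

    /* Step 3: reader scans the proc array, which is still empty. */
    if (sh->proc_xid != 0 && sh->proc_xid < oldest)
        oldest = sh->proc_xid;

    /* Step 4: only now does the writer advertise its XID. */
    sh->proc_xid = assigned;

    return oldest;
}
```

Starting from next_xid = 100 and an empty slot, the function returns 101
while the slot ends up holding 100, which is exactly the off-by-one
described above.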
This would be hard to hit in practice, and even harder to spot any ill
effect if it happens. GetOldestActiveTransactionId is only used when
creating a checkpoint in a master server, and the race condition can only
happen on an online checkpoint, as there are no backends running during a
shutdown checkpoint. The oldestActiveXid value of an online checkpoint is
only used when starting up a hot standby server, to determine the starting
point where pg_subtrans is initialized from. For the race condition to
happen, there must be no other XIDs in the proc array that would hold back
the oldest-active XID value, which means that the missed XID must be a top
transaction's XID. However, pg_subtrans is not used for top XIDs, so I
believe an off-by-one error is in fact inconsequential. Nevertheless, let's
fix it, as it's clearly wrong and the fix is simple.
This has been wrong ever since hot standby was introduced, so backport to
all supported versions.
Discussion: https://www.postgresql.org/message-id/e7258662-82b6-7a45-56d4-99b337a32bf7@iki.fi
Further investigation shows that ruleutils isn't quite up to speed either
for cases where we have a domain-over-array: it needs to be prepared to
look past a CoerceToDomain at the top level of field and element
assignments, else it decompiles them incorrectly. Potentially this would
result in failure to dump/reload a rule, if it looked like the one in the
new test case. (I also added a test for EXPLAIN; that output isn't broken,
but clearly we need more test coverage here.)
Like commit b1cb32fb6, this bug is reachable in cases we already support,
so back-patch all the way.
compute_tsvector_stats() detoasted and kept in memory every tsvector value
in the sample, but that can be a lot of memory. The original bug report
described a case using over 10 gigabytes, with statistics target of 10000
(the maximum).
To fix, allocate a separate copy of just the lexemes that we keep around,
and free the detoasted tsvector values as we go. This adds some palloc/pfree
overhead, when you have a lot of distinct lexemes in the sample, but it's
better than running out of memory.
Fixes bug #14654 reported by James C. Reviewed by Tom Lane. Backport to
all supported versions.
Discussion: https://www.postgresql.org/message-id/20170514200602.1451.46797@wrigleys.postgresql.org
If the number of tuples in the heap exceeds approximately INT_MAX/2,
this loop's calculation "2*i+1" could overflow, resulting in a crash.
Fix it by using unsigned int rather than int for the relevant local
variables; that shouldn't cost anything extra on any popular hardware.
Per bug #14722 from Sergey Koposov.
Original patch by Sergey Koposov, modified by me per a suggestion
from Heikki Linnakangas to use unsigned int not int64.
Back-patch to 9.4, where tuplesort.c grew the ability to sort as many
as INT_MAX tuples in-memory (commit 263865a48).
Discussion: https://postgr.es/m/20170629161637.1478.93109@wrigleys.postgresql.org
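A minimal sketch of the arithmetic involved, assuming the usual binary
heap layout where the left child of slot i lives at 2*i+1. The function
names are hypothetical; the real loop in tuplesort.c is not reproduced
here.

```c
#include <limits.h>

/*
 * Left-child index computed in unsigned arithmetic.  Unsigned overflow
 * is well-defined in C, and on platforms where UINT_MAX == 2*INT_MAX + 1
 * the result is exact for every i up to INT_MAX.
 */
unsigned int
demo_left_child(unsigned int i)
{
    return 2 * i + 1;
}

/*
 * Report whether the same expression evaluated in signed int would
 * overflow (undefined behavior) for a non-negative index i.
 */
int
demo_signed_would_overflow(int i)
{
    return i > (INT_MAX - 1) / 2;
}
```

The first index at which the signed form breaks is INT_MAX/2 + 1, which
matches the "approximately INT_MAX/2" threshold mentioned above.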
Commit 14e8803f1 removed the locking in SyncRepWaitForLSN, but that
introduced a race condition, where SyncRepWaitForLSN might see
syncRepState already set to SYNC_REP_WAIT_COMPLETE, but the process was
not yet removed from the queue. That tripped the assertion that the
process should no longer be in the queue. Reorder the operations in
SyncRepWakeQueue to remove the process from the queue first, and update
syncRepState only after that, and add a memory barrier in between to make
sure the operations are made visible to other processes in that order.
Fixes bug #14721 reported by Const Zhang. Analysis and fix by Thomas Munro.
Backpatch down to 9.5, where the locking was removed.
Discussion: https://www.postgresql.org/message-id/20170629023623.1480.26508%40wrigleys.postgresql.org
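The reordering can be sketched with C11 atomics as an analogy. PostgreSQL
uses its own barrier primitives rather than <stdatomic.h>, and the struct,
constants, and function names below are hypothetical; the sketch only
shows the "dequeue first, barrier, then publish state" ordering.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum
{
    DEMO_WAITING = 1,
    DEMO_WAIT_COMPLETE = 2
};

/* Hypothetical per-process wait state. */
typedef struct DemoWaiter
{
    atomic_bool in_queue;
    atomic_int  state;
} DemoWaiter;

/* Waker: remove from the queue first, then publish completion. */
void
demo_wake(DemoWaiter *w)
{
    atomic_store_explicit(&w->in_queue, false, memory_order_relaxed);
    /* Make the dequeue visible before the state change. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&w->state, DEMO_WAIT_COMPLETE,
                          memory_order_relaxed);
}

/*
 * Waiter: returns true only if the wait is complete and the process is
 * no longer queued.  With the ordering in demo_wake(), seeing
 * DEMO_WAIT_COMPLETE implies in_queue is already false.
 */
bool
demo_waiter_done(DemoWaiter *w)
{
    if (atomic_load_explicit(&w->state, memory_order_relaxed)
        != DEMO_WAIT_COMPLETE)
        return false;
    atomic_thread_fence(memory_order_acquire);
    return !atomic_load_explicit(&w->in_queue, memory_order_relaxed);
}
```

Had the two stores in demo_wake() been issued in the opposite order, a
waiter could observe the completed state while still appearing queued,
which is the situation the assertion caught.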
We allow INSERT and UPDATE commands to assign to the same column more than
once, as long as the assignments are to subfields or elements rather than
the whole column. However, this failed when the target column was a domain
over array rather than plain array. Fix by teaching process_matched_tle()
to look through CoerceToDomain nodes, and add relevant test cases.
Also add a group of test cases exercising domains over array of composite.
It's doubtless accidental that CREATE DOMAIN allows this case while not
allowing straight domain over composite; but it does, so we'd better make
sure we don't break it. (I could not find any documentation mentioning
either side of that, so no doc changes.)
It's been like this for a long time, so back-patch to all supported
branches.
Discussion: https://postgr.es/m/4206.1499798337@sss.pgh.pa.us
We've heard occasional reports of backend launch failing because
pgwin32_ReserveSharedMemoryRegion() fails, indicating that something
has already used that address space in the child process. It's not
very clear what, given that we disable ASLR in Windows builds, but
suspicion falls on antivirus products. It'd be better if we didn't
have to disable ASLR, anyway. So let's try to ameliorate the problem
by retrying the process launch after such a failure, up to 100 times.
Patch by me, based on previous work by Amit Kapila and others.
This is a longstanding issue, so back-patch to all supported branches.
Discussion: https://postgr.es/m/CAA4eK1+R6hSx6t_yvwtx+NRzneVp+MRqXAdGJZChcau8Uij-8g@mail.gmail.com
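A bounded retry loop of the kind described above might look like the
following sketch. The callback type, the function names, and the choice
to count 100 total attempts are assumptions for illustration; the actual
Windows process-launch code is not shown.

```c
#include <stdbool.h>

/* Assumed cap on attempts, mirroring the "up to 100 times" above. */
#define DEMO_MAX_LAUNCH_ATTEMPTS 100

/* Hypothetical launcher callback: returns true on success. */
typedef bool (*demo_launch_fn) (void *arg);

/*
 * Try the launch repeatedly.  Returns the 1-based attempt number that
 * succeeded, or -1 if every attempt failed.
 */
int
demo_launch_with_retry(demo_launch_fn launch, void *arg)
{
    int attempt;

    for (attempt = 1; attempt <= DEMO_MAX_LAUNCH_ATTEMPTS; attempt++)
    {
        if (launch(arg))
            return attempt;
    }
    return -1;
}

/* Test double: fails until the counter pointed to by arg reaches zero. */
bool
demo_fail_n_times(void *arg)
{
    int *remaining = (int *) arg;

    if (*remaining > 0)
    {
        (*remaining)--;
        return false;
    }
    return true;
}
```

Capping the loop keeps a persistent failure from turning into an
infinite respawn cycle while still riding out transient address-space
collisions.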
Commit c46c0e5202 failed to pass the
TransitionCaptureState object to ExecARInsertTriggers() in the case
where it's using heap_multi_insert and there are indexes. Repair.
Thomas Munro, from a report by David Fetter
Discussion: https://postgr.es/m/20170708084213.GA14720%40fetter.org
On platforms lacking both locale_t and ICU, collationcmds.c failed
to make any use of its static function is_all_ascii(), thus probably
drawing a compiler warning. Oversight in my commit ddb5fdc06.
Per buildfarm member gaur.
partition_rbound_cmp() is intended to compare range partition bounds
in a way such that if all the bound values are equal but one is an
upper bound and one is a lower bound, the upper bound is treated as
smaller than the lower bound. This particular ordering is required by
RelationBuildPartitionDesc() when building the PartitionBoundInfoData,
so that it can consistently keep only the upper bounds when upper and
lower bounds coincide.
Update the function comment to make that clearer.
Also, fix a (currently unreachable) corner-case bug -- if the bound
values coincide and they contain unbounded values, fall through to the
lower-vs-upper comparison code, rather than immediately returning
0. Currently it is not possible to define coincident upper and lower
bounds containing unbounded columns, but that may change in the
future, so code defensively.
Discussion: https://postgr.es/m/CAAJ_b947mowpLdxL3jo3YLKngRjrq9+Ej4ymduQTfYR+8=YAYQ@mail.gmail.com
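The ordering rule can be sketched for the simplest case of a single
integer column. This is not partition_rbound_cmp() itself: the struct and
function names are hypothetical, and unbounded values are left out.

```c
#include <stdbool.h>

/* Hypothetical single-column range bound. */
typedef struct DemoRangeBound
{
    int  value;
    bool lower;     /* true for a lower bound, false for an upper bound */
} DemoRangeBound;

/*
 * Compare two bounds.  When the values are equal but one bound is an
 * upper bound and the other a lower bound, the upper bound sorts first.
 */
int
demo_rbound_cmp(const DemoRangeBound *a, const DemoRangeBound *b)
{
    if (a->value != b->value)
        return (a->value < b->value) ? -1 : 1;

    if (a->lower == b->lower)
        return 0;

    /* Equal values: upper bound is treated as smaller than lower. */
    return a->lower ? 1 : -1;
}
```

With this ordering, the upper bound of one partition and the coinciding
lower bound of the next always appear in a fixed order in a sorted list,
so a consumer can reliably keep just one of the pair.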
The previous logic, whilst not actually wrong, was overly complex and
involved doing two binary searches, where only one was really
necessary. This simplifies that logic and improves the comments.
One visible change is that if the new partition overlaps multiple
existing partitions, the error message now always reports the overlap
with the first existing partition (the one with the lowest
bounds). The old code would sometimes report the clash with the first
partition and sometimes with the last one.
Original patch idea from Amit Langote, substantially rewritten by me.
Discussion: https://postgr.es/m/CAAJ_b947mowpLdxL3jo3YLKngRjrq9+Ej4ymduQTfYR+8=YAYQ@mail.gmail.com
WRITE_CHAR_FIELD() didn't do any escaping, so that for example a zero
byte would cause the whole output string to be truncated. To fix, pass
the char through outToken(), so it is escaped like a string. Adjust the
reading side to handle this.
This avoids "tuple concurrently updated" errors when an ALTER or DROP
SUBSCRIPTION writes to pg_subscription_rel at the same time as a worker.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
The simple calculations done to estimate the size of the output buffers
for ucnv_fromUChars() and ucnv_toUChars() could overflow int32_t for
large strings. To avoid that, go the long way and run the function
first without an output buffer to get the correct output buffer size
requirement.
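The "ask for the size first" pattern can be shown without ICU. The sketch
below substitutes snprintf() for the ucnv_ functions, since snprintf()
with a NULL buffer likewise reports the length it would need; the
function name and the "name=value" format are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Format "name=value" into a freshly allocated buffer.  The first call
 * writes nothing and only measures; the second call fills a buffer of
 * exactly the measured size.  Returns NULL on failure; the caller frees
 * the result.
 */
char *
demo_format_pair(const char *name, int value)
{
    int   needed;
    char *buf;

    /* Pass 1: no output buffer, just learn the required length. */
    needed = snprintf(NULL, 0, "%s=%d", name, value);
    if (needed < 0)
        return NULL;

    buf = malloc((size_t) needed + 1);
    if (buf == NULL)
        return NULL;

    /* Pass 2: produce the output into the correctly sized buffer. */
    snprintf(buf, (size_t) needed + 1, "%s=%d", name, value);
    return buf;
}
```

The cost is doing the conversion work twice, but no size estimate has to
be computed by hand, so there is no multiplication that could overflow.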
The regression tests contain numerous cases where we do some activity on a
master server and then wait till the slave has ack'd flushing its copy of
that transaction. Because WAL flush on the slave is asynchronous to the
logicalrep worker process, the worker cannot send such a feedback message
during the LogicalRepApplyLoop iteration where it processes the last data
from the master. In the previous coding, the feedback message would come
out only when the loop's WaitLatchOrSocket call returned WL_TIMEOUT. That
requires one full second of delay (NAPTIME_PER_CYCLE); and to add insult
to injury, it could take more than that if the WaitLatchOrSocket was
interrupted a few times by latch-setting events.
In reality we can expect the slave's walwriter process to have flushed the
WAL data after, more or less, WalWriterDelay (typically 200ms). Hence,
if there are unacked transactions pending, make the wait delay only that
long rather than the full NAPTIME_PER_CYCLE. Also, move one of the
send_feedback() calls into the loop main line, so that we'll check for the
need to send feedback even if we were woken by a latch event and not either
socket data or timeout.
It's not clear how much this matters for production purposes, but
it's definitely helpful for testing.
Discussion: https://postgr.es/m/30864.1498861103@sss.pgh.pa.us
When waiting for a logical replication worker process to start or stop,
we have to busy-wait until we see it add or remove itself from the
LogicalRepWorker slot in shared memory. Those loops were using a
one-second delay between checks, but on any reasonably modern machine, it
doesn't take more than a couple of msec for a worker to spawn or shut down.
Reduce the loop delays to 10ms to avoid wasting quite so much time in the
related regression tests.
In principle, a better solution would be to fix things so that the waiting
process can be awakened via its latch at the right time. But that seems
considerably more invasive, which is undesirable for a post-beta fix.
Worker start/stop performance likely isn't of huge interest anyway for
production purposes, so we might not ever get around to it.
In passing, rearrange the second wait loop in logicalrep_worker_stop()
so that the lock is held at the top of the loop, thus saving one lock
acquisition/release per call, and making it look more like the other loop.
Discussion: https://postgr.es/m/30864.1498861103@sss.pgh.pa.us
The bug would previously prevent the update of any column in a table
with identity columns, rather than just the actual identity column.
Reported-by: zam6ak@gmail.com
Bug: #14718
In WAL receiver and WAL sender, some accesses to their corresponding
shared memory control structs were done without holding any kind of
lock, which could lead to inconsistent and possibly insecure results.
In walsender, fix by clarifying the locking rules and following them
correctly, as documented in the new comment in walsender_private.h;
namely that some members can be read in walsender itself without a lock,
because the only writes occur in the same process. The rest of the
struct requires spinlock for accesses, as usual.
In walreceiver, fix by always holding spinlock while accessing the
struct.
While there is potentially a problem in all branches, it is minor in
stable ones. This only became a real problem in pg10 because of quorum
commit in synchronous replication (commit 3901fd70cc), and a potential
security problem in walreceiver because a superuser() check was removed
by default monitoring roles (commit 25fff40798). Thus, no backpatch.
In passing, clean up some leftover braces which were used to create
unconditional blocks. Once upon a time these were used for
volatile-izing accesses to those shmem structs, which is no longer
required. Many other occurrences of this pattern remain.
Author: Michaël Paquier
Reported-by: Michaël Paquier
Reviewed-by: Masahiko Sawada, Kyotaro Horiguchi, Thomas Munro,
Robert Haas
Discussion: https://postgr.es/m/CAB7nPqTWYqtzD=LN_oDaf9r-hAjUEPAy0B9yRkhcsLdRN8fzrw@mail.gmail.com
When a sync worker is waiting for the associated apply worker to notice
that it's in SYNCWAIT state, wait_for_worker_state_change() would just
patiently wait for that to happen. This generally required waiting for
the 1-second timeout in LogicalRepApplyLoop to elapse. Kicking the worker
via its latch makes things significantly snappier.
While at it, fix race conditions that could potentially result in crashes:
we can *not* call logicalrep_worker_wakeup_ptr() once we've released the
LogicalRepWorkerLock, because worker->proc might've been reset to NULL
after we do that (indeed, there's no really solid reason to believe that
the LogicalRepWorker slot even belongs to the same worker anymore).
In logicalrep_worker_wakeup(), we can just move the wakeup inside the
lock scope. In process_syncing_tables_for_apply(), a bit more code
rearrangement is needed.
Also improve some nearby comments.
It's possible for WalSndWaitForWal to be asked to wait for WAL that doesn't
exist yet. That's fine, in fact it's the normal situation if we're caught
up; but when the client requests shutdown we should not keep waiting.
The previous coding could wait indefinitely if the source server was idle.
In passing, improve the rather weak comments in this area, and slightly
rearrange some related code for better readability.
Back-patch to 9.4 where this code was introduced.
Discussion: https://postgr.es/m/14154.1498781234@sss.pgh.pa.us
ICU does not support "collate" and "ctype" being different, so the
collctype catalog column is ignored. But for catalog neatness, ensure
that they are the same.
This command used to compute the collencoding entry the same way as when
a completely new collation is created. But for example when copying the
"C" collation, this would then result in a collation that has a
collencoding entry for the current database encoding rather than -1,
thus not making an exact copy. This probably has no practical impact,
but making this change keeps the catalog contents neat.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Traditionally, "pg_ctl start -w" has waited for the server to become
ready to accept connections by attempting a connection once per second.
That has the major problem that connection issues (for instance, a
kernel packet filter blocking traffic) can't be reliably told apart
from server startup issues, and the minor problem that if server startup
isn't quick, we accumulate "the database system is starting up" spam
in the server log. We've hacked around many of the possible connection
issues, but it resulted in ugly and complicated code in pg_ctl.c.
In commit c61559ec3, I changed the probe rate to every tenth of a second.
That prompted Jeff Janes to complain that the log-spam problem had become
much worse. In the ensuing discussion, Andres Freund pointed out that
we could dispense with connection attempts altogether if the postmaster
were changed to report its status in postmaster.pid, which "pg_ctl start"
already relies on being able to read. This patch implements that, teaching
postmaster.c to report a status string into the pidfile at the same
state-change points already identified as being of interest for systemd
status reporting (cf commit 7d17e683f). pg_ctl no longer needs to link
with libpq at all; all its functions now depend on reading server files.
In support of this, teach AddToDataDirLockFile() to allow addition of
postmaster.pid lines in not-necessarily-sequential order. This is needed
on Windows where the SHMEM_KEY line will never be written at all. We still
have the restriction that we don't want to truncate the pidfile; document
the reasons for that a bit better.
Also, fix the pg_ctl TAP tests so they'll notice if "start -w" mode
is broken --- before, they'd just wait out the sixty seconds until
the loop gives up, and then report success anyway. (Yes, I found that
out the hard way.)
While at it, arrange for pg_ctl to not need to #include miscadmin.h;
as a rather low-level backend header, requiring that to be compilable
client-side is pretty dubious. This requires moving the #define's
associated with the pidfile into a new header file, and moving
PG_BACKEND_VERSIONSTR someplace else. For lack of a clearly better
"someplace else", I put it into port.h, beside the declaration of
find_other_exec(), since most users of that macro are passing the value to
find_other_exec(). (initdb still depends on miscadmin.h, but at least
pg_ctl and pg_upgrade no longer do.)
In passing, fix main.c so that PG_BACKEND_VERSIONSTR actually defines the
output of "postgres -V", which remarkably it had never done before.
Discussion: https://postgr.es/m/CAMkU=1xJW8e+CTotojOMBd-yzUvD0e_JZu2xHo=MnuZ4__m7Pg@mail.gmail.com
We now disallow having triggers with both transition tables and ON
INSERT OR UPDATE (which was a PG extension to the spec anyway),
because in this case it's not at all clear how the transition tables
should work for an INSERT ... ON CONFLICT query. Separate ON INSERT
and ON UPDATE triggers with transition tables are allowed, and the
transition tables for these reflect only the inserted and only the
updated tuples respectively.
Patch by Thomas Munro
Discussion: https://postgr.es/m/CAEepm%3D11KHQ0JmETJQihSvhZB5mUZL2xrqHeXbCeLhDiqQ39%3Dw%40mail.gmail.com
Buildfarm evidence shows that TCP_KEEPALIVE_THRESHOLD doesn't exist
after all on Solaris < 11. This means we need to take positive action to
prevent the TCP_KEEPALIVE code path from being taken on that platform.
I've chosen to limit it with "&& defined(__darwin__)", since it's unclear
that anyone else would follow Apple's precedent of spelling the symbol
that way.
Also, follow a suggestion from Michael Paquier of eliminating code
duplication by defining a couple of intermediate symbols for the
socket option.
In passing, make some effort to reduce the number of translatable messages
by replacing "setsockopt(foo) failed" with "setsockopt(%s) failed", etc,
throughout the affected files. And update relevant documentation so
that it doesn't claim to provide an exhaustive list of the possible
socket option names.
Like the previous commit (f0256c774), back-patch to all supported branches.
Discussion: https://postgr.es/m/20170627163757.25161.528@wrigleys.postgresql.org
Turns out that the socket option for this is named TCP_KEEPALIVE_THRESHOLD,
at least according to the tcp(7P) man page for Solaris 11. (But since that
text refers to "SunOS", it's likely pretty ancient.) It appears that the
symbol TCP_KEEPALIVE does get defined on that platform, but it doesn't
seem to represent a valid protocol-level socket option. This leads to
bleats in the postmaster log, and no tcp_keepalives_idle functionality.
Per bug #14720 from Andrey Lizenko, as well as an earlier report from
Dhiraj Chawla that nobody had followed up on. The issue's been there
since we added the TCP_KEEPALIVE code path in commit 5acd417c8, so
back-patch to all supported branches.
Discussion: https://postgr.es/m/20170627163757.25161.528@wrigleys.postgresql.org
check_agg_arguments_walker threw an error upon seeing a SRF or window
function, but that is too aggressive: if the function is within a
sub-select then it's perfectly fine. I broke the SRF case in commit
0436f6bde by copying the logic for window functions ... but that was
broken too, and had been since commit eaccfded9.
Repair both cases in HEAD, and the window function case back to 9.3.
9.2 gets this right.
When a walreceiver dies, the startup process will notice that and send
a PMSIGNAL_START_WALRECEIVER signal to the postmaster, asking for a new
walreceiver to be launched. There's a race condition, which at least
in HEAD is very easy to hit, whereby the postmaster might see that
signal before it processes the SIGCHLD from the walreceiver process.
In that situation, sigusr1_handler() just dropped the start request
on the floor, reasoning that it must be redundant. Eventually, after
10 seconds (WALRCV_STARTUP_TIMEOUT), the startup process would make a
fresh request --- but that's a long time if the connection could have
been re-established almost immediately.
Fix it by setting a state flag inside the postmaster that we won't
clear until we do launch a walreceiver. In cases where that results
in an extra walreceiver launch, it's up to the walreceiver to realize
it's unwanted and go away --- but we have, and need, that logic anyway
for the opposite race case.
I came across this through investigating unexpected delays in the
src/test/recovery TAP tests: it manifests there in test cases where
a master server is stopped and restarted while leaving streaming
slaves active.
This logic has been broken all along, so back-patch to all supported
branches.
Discussion: https://postgr.es/m/21344.1498494720@sss.pgh.pa.us
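The sticky-flag idea can be sketched as a tiny state machine. This is not
postmaster.c: the struct, its fields, and the handler names are
hypothetical, and process management is reduced to a counter.

```c
#include <stdbool.h>

/* Hypothetical slice of postmaster state. */
typedef struct DemoPostmaster
{
    bool walreceiver_requested; /* start request pending */
    bool walreceiver_running;   /* child believed to be alive */
    int  launches;              /* number of launches performed */
} DemoPostmaster;

/* Launch a walreceiver if one is wanted and none is believed running. */
void
demo_maybe_start_walreceiver(DemoPostmaster *pm)
{
    if (pm->walreceiver_requested && !pm->walreceiver_running)
    {
        pm->walreceiver_running = true;
        pm->launches++;
        /* The request is cleared only once a launch has happened. */
        pm->walreceiver_requested = false;
    }
}

/* Start request from the startup process: remember it, then try. */
void
demo_on_start_request(DemoPostmaster *pm)
{
    pm->walreceiver_requested = true;
    demo_maybe_start_walreceiver(pm);
}

/* Child-exit notification for the old walreceiver: note it, then try. */
void
demo_on_walreceiver_exit(DemoPostmaster *pm)
{
    pm->walreceiver_running = false;
    demo_maybe_start_walreceiver(pm);
}
```

If the start request arrives before the exit notification, the request
is no longer dropped: it stays pending and is honored as soon as the
exit is processed.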
The stats collector disregards inquiry messages that bear a cutoff_time
before when it last wrote the relevant stats file. That's fine, but at
startup when it reads the "permanent" stats files, it absorbed their
timestamps as if they were the times at which the corresponding temporary
stats files had been written. In reality, of course, there's no data
out there at all. This led to disregarding inquiry messages soon after
startup if the postmaster had been shut down and restarted within less
than PGSTAT_STAT_INTERVAL; which is a pretty common scenario, both for
testing and in the field. Requesting backends would hang for 10 seconds
and then report failure to read statistics, unless they got bailed out
by some other backend coming along and making a newer request within
that interval.
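A small Python sketch of the timestamp comparison may help. It assumes the collector ignores a request whose cutoff is not later than its recorded write time, and that the remedy is to treat the write time as zero at startup; the function names and the 0.5-second interval are illustrative assumptions, not taken from the source.

```python
PGSTAT_STAT_INTERVAL = 0.5  # seconds; assumed value for illustration only


def inquiry_is_ignored(cutoff_time, last_write_time):
    """An inquiry is skipped when the file is believed to be new enough."""
    return cutoff_time <= last_write_time


def write_time_after_startup(permanent_file_timestamp, absorb_timestamp):
    """What the collector believes about the temporary file after startup."""
    if absorb_timestamp:
        # Buggy behavior: trust the permanent file's timestamp, although no
        # temporary stats file exists yet.
        return permanent_file_timestamp
    # Sketched fix: behave as if nothing has been written so far.
    return 0.0


def request_cutoff(request_time):
    """Backends accept stats no older than one interval before the request."""
    return request_time - PGSTAT_STAT_INTERVAL
```

For a shutdown at t = 100.0 and a request at t = 100.2, the cutoff is about 99.7, which is older than the absorbed timestamp, so the request is ignored even though there is no file to read.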
I came across this through investigating unexpected delays in the
src/test/recovery TAP tests: it manifests there because the autovacuum
launcher hangs for 10 seconds when it can't get statistics at startup,
thus preventing a second shutdown from occurring promptly. We might
want to do some things in the autovac code to make it less prone to
getting stuck that way, but this change is a good bug fix regardless.
In passing, also fix pgstat_read_statsfiles() to ensure that it
re-zeroes its global stats variables if they are corrupted by a
short read from the stats file. (Other reads in that function
go into temp variables, so that the issue doesn't arise.)
This has been broken since we created the separation between permanent
and temporary stats files in 8.4, so back-patch to all supported branches.
Discussion: https://postgr.es/m/16860.1498442626@sss.pgh.pa.us
Fix its header comment, which described the old behavior of the <N>
phrase distance operator; we missed updating that in commit 028350f61.
Also, reset errno before strtol() call, to defend against the possibility
that it was already ERANGE at entry. (The lack of complaints says that
it generally isn't, but this is at least a latent bug.) Very minor
stylistic improvements as well.
Victor Drobny noted the obsolete comment, I noted the errno issue.
Back-patch to 9.6 where this code was added, just in case the errno
issue is a live bug in some cases.
Discussion: https://postgr.es/m/2b5382fdff9b1f79d5eb2c99c4d2cbe2@postgrespro.ru
pg_import_system_collations() refused to create any ICU collations if
the current database's encoding didn't support ICU. This is wrongheaded:
initdb must initialize pg_collation in an encoding-independent way
since it might be used in other databases with different encodings.
The reason for the restriction seems to be that get_icu_locale_comment()
used icu_from_uchar() to convert the UChar-format display name, and that
unsurprisingly doesn't know what to do in unsupported encodings.
But by the same token that the initial catalog contents must be
encoding-independent, we can't allow non-ASCII characters in the comment
strings. So we don't really need icu_from_uchar() here: just check for
Unicode codes outside the ASCII range, and if there are none, the format
conversion is trivial. If there are some, we can simply not install the
comment. (In my testing, this affects only Norwegian Bokmål, which has
given us trouble before.)
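The ASCII-only conversion is simple enough to sketch. The following Python model treats a display name as a list of UTF-16 code units; the function names are hypothetical and this is not the backend's C code.

```python
def utf16_units(text):
    """Helper for the example: split a str into UTF-16 code units."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def uchars_to_ascii(uchars):
    """Return the name as plain ASCII, or None if any unit is non-ASCII.

    When every code unit is below 0x80 the conversion is trivial: each
    unit maps to exactly one ASCII character. Otherwise the caller is
    expected to skip the comment altogether.
    """
    if any(unit > 0x7F for unit in uchars):
        return None
    return "".join(chr(unit) for unit in uchars)
```

A name such as "German" converts unchanged, while "Norwegian Bokmål" yields None and would get no comment.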
For paranoia's sake, also check for non-ASCII characters in ICU locale
names, and skip such locales, as we do for libc locales. I don't
currently have a reason to believe that this will ever reject anything,
but then again the libc maintainers should have known better too.
With just the import changes, ICU collations can be found in pg_collation
in databases with unsupported encodings. This resulted in more or less
clean failures at runtime, but that's not how things act for unsupported
encodings with libc collations. Make it work the same as our traditional
behavior for libc collations by having collation lookup take into account
whether is_encoding_supported_by_icu().
Adjust documentation to match. Also, expand Table 23.1 to show which
encodings are supported by ICU.
catversion bump because of likely change in pg_collation/pg_description
initial contents in ICU-enabled builds.
Discussion: https://postgr.es/m/20c74bc3-d6ca-243d-1bbc-12f17fa4fe9a@gmail.com
The maxResultSize argument of uloc_getDisplayName is the number of
UChars in the output buffer, not the number of bytes. In principle
this could result in a stack smash, although at least in my Fedora 25
install there are no ICU locales with display names long enough to
overrun the buffer. But it's easily proven to be wrong by reducing
the length of displayname to around 20, whereupon a stack smash
does happen.
(This is a rather scary bug, because the same mistake could easily
have been made in other places; but in a quick code search looking
at uses of UChar I could not find any other instances.)
The comparison with the target rows on the subscriber side was done with
datumIsEqual(), which can have false negatives. For instance, it didn't
work reliably for text columns. So use the equality operator provided
by the type cache instead.
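The difference between bitwise comparison and a type's equality operator is easy to reproduce outside the backend. The sketch below uses floating-point zero as a stand-in, since its two representations are simple to construct; the text-column case mentioned above is analogous but is not modeled here, and the function names are hypothetical.

```python
import struct


def bitwise_equal(a, b):
    """Compare the raw 8-byte representations, as a byte-level check would."""
    return struct.pack("<d", a) == struct.pack("<d", b)


def operator_equal(a, b):
    """Compare using the type's own notion of equality."""
    return a == b
```

Here `0.0` and `-0.0` are equal according to the operator but differ in their bytes, so a bitwise check reports a false negative.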
Also add more user documentation about replica identity requirements.
Reported-by: Tatsuo Ishii <ishii@sraoss.co.jp>
Marco Atzeri reported that initdb would fail if "locale -a" reported
the same locale name more than once. All previous versions of Postgres
implicitly de-duplicated the results of "locale -a", but the rewrite
to move the collation import logic into C had lost that property.
It had also lost the property that locale names matching built-in
collation names were silently ignored.
The simplest way to fix this is to make initdb run the function in
if-not-exists mode, which means that there's no real use-case for
non if-not-exists mode; we might as well just drop the boolean argument
and simplify the function's definition to be "add any collations not
already known". This change also gets rid of some odd corner cases
caused by the fact that aliases were added in if-not-exists mode even
if the function argument said otherwise.
While at it, adjust the behavior so that pg_import_system_collations()
doesn't spew "collation foo already exists, skipping" messages during a
re-run; that's completely unhelpful, especially since there are often
hundreds of them. And make it return a count of the number of collations
it did add, which seems like it might be helpful.
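The simplified contract ("add any collations not already known", stay quiet about the rest, return the number added) can be sketched as follows. This is a Python toy model with a set standing in for pg_collation; the function name and arguments are hypothetical.

```python
def import_system_collations(catalog, locale_names):
    """Add names missing from `catalog`; return how many were added.

    Duplicates in the input and names already present are skipped
    silently, so a re-run reports 0 rather than emitting notices.
    """
    added = 0
    for name in locale_names:
        if name in catalog:
            continue  # already known: nothing to do, nothing to report
        catalog.add(name)
        added += 1
    return added
```

Feeding it a list with repeated names adds each name once, and calling it again on the same input adds nothing.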
Also, re-integrate the previous coding's property that it would make a
deterministic selection of which alias to use if there were conflicting
possibilities. This would only come into play if "locale -a" reports
multiple equivalent locale names, say "de_DE.utf8" and "de_DE.UTF-8",
but that hardly seems out of the question.
In passing, fix incorrect behavior in pg_import_system_collations()'s
ICU code path: it neglected CommandCounterIncrement, which would result
in failures if ICU returns duplicate names, and it would try to create
comments even if a new collation hadn't been created.
Also, reorder operations in initdb so that the 'ucs_basic' collation
is created before calling pg_import_system_collations() not after.
This prevents a failure if "locale -a" were to report a locale named
that. There's no reason to think that that ever happens in the wild,
but the old coding would have survived it, so let's be equally robust.
Discussion: https://postgr.es/m/20c74bc3-d6ca-243d-1bbc-12f17fa4fe9a@gmail.com
After sitting idle and fully replayed for a while and then encountering
a new burst of WAL activity, we interpolate between an ancient sample and the
not-yet-reached one for the new traffic. That produced a corner-case report
of lag after receiving the first new reply from the standby, which might
sometimes be a large spike.
Correct this by resetting the last_read time and handling that new case.
Author: Thomas Munro
Callers of icu_to_uchar() neglected to pfree the result string when done
with it. This results in catastrophic memory leaks in varstr_cmp(),
because of our prevailing assumption that btree comparison functions don't
leak memory. For safety, make all the call sites clean up leaks, though
I suspect that we could get away without it in formatting.c. I audited
callers of icu_from_uchar() as well, but found no places that seemed to
have a comparable issue.
Add function API specifications for icu_to_uchar() and icu_from_uchar();
the lack of any thought-through specification is perhaps not unrelated
to the existence of this bug in the first place. Fix icu_to_uchar()
to guarantee a nul-terminated result; although no existing caller appears
to care, the fact that it would have been nul-terminated except in
extreme corner cases seems ideally designed to bite someone on the rear
someday. Fix ucnv_fromUChars() destCapacity argument --- in the worst
case, that could perhaps have led to a non-nul-terminated result, too.
Fix icu_from_uchar() to have a more reasonable definition of the function
result --- no callers are actually paying attention, so this isn't a live
bug, but it's certainly sloppily designed. Const-ify icu_from_uchar()'s
input string for consistency.
That is not the end of what needs to be done to these functions, but
it's as much as I have the patience for right now.
Discussion: https://postgr.es/m/1955.1498181798@sss.pgh.pa.us
There was a logic error in a formula, reported by Atsushi Torikoshi.
Ashutosh Bapat furthermore recommended changing the notation for a variable
that was re-using a letter from a previous formula, though his proposed
patch contained a small error in attributing what the new letter is for.
Also, instead of his proposed d' I ended up using e, to avoid confusing
the reader with quotes, which are used differently in the explanatory
prose.
Bugs appeared in commit 2686ee1b7c.
Reported-by: Atsushi Torikoshi, Ashutosh Bapat
Discussion: https://postgr.es/m/CAFjFpRd03YojT4wyuDcjhCfYuygfWfnt68XGn2CKv=rcjRCtTA@mail.gmail.com
The autovacuum launcher doesn't actually do anything with its DSA other
than creating it and attaching to it, but it's been observed that after
longjmp'ing to the standard error handling block (for example after
getting SIGINT) the autovacuum launcher enters an infinite loop reporting that it
cannot attach to its DSA anymore (which is correct, because it's already
attached to it.) Fix by only attempting to attach if not already
attached.
I introduced this bug together with BRIN autosummarization in
7526e10224.
Reported-by: Yugo Nagata.
Author: Thomas Munro. I added the comment to go with it.
Discussion: https://postgr.es/m/20170621211538.0c9eae73.nagata@sraoss.co.jp
I misplaced the IF NOT EXISTS clause in commit 7b504eb282, before the
word STATISTICS. Put it where it belongs.
Patch written independently by Amit Langote and myself. I adopted his
submitted test case with a slight edit also.
Reported-by: Bruno Wolff III
Discussion: https://postgr.es/m/20170621004237.GB8337@wolff.to
When promoting a standby just after an XLOG_SWITCH record was replayed,
and the next segment(s) are already locally available (via walsender,
restore_command + trigger/recovery target), that segment could
accidentally be recycled onto the past of the new timeline. Later the
checkpointer would create a .ready file for it, assuming there was an
error during creation, and it would get archived. That causes trouble
if another standby is later brought up from a basebackup from before
the timeline creation: because XLogFileReadAnyTLI just tries all
possible timelines, it would try to read that segment, which doesn't
have valid contents. Thus replay would fail.
The problem, if it has already occurred, can be fixed by removing the
segment and/or having restore_command filter it out.
The reason for the creation of such "phantom" segments was that after
an XLOG_SWITCH record the EndOfLog variable points to the beginning of
the next segment, and RemoveXlogFile() used XLByteToPrevSeg().
Normally RemoveXlogFile() doing so is harmless, because the last
segment will still exist preventing InstallXLogFileSegment() from
causing harm, but just after promotion there's no previous segment on
the new timeline.
Fix that by using XLByteToSeg() instead of XLByteToPrevSeg().
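The off-by-one at a segment boundary can be shown with plain integer arithmetic. The sketch below is a Python rendering of what the two macros compute as I understand them (division of the position, or of the position minus one, by the segment size); the 16 MB segment size is the usual default and is an assumption here.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default segment size


def xl_byte_to_seg(ptr, seg_size=WAL_SEGMENT_SIZE):
    """Segment number containing byte position `ptr`."""
    return ptr // seg_size


def xl_byte_to_prev_seg(ptr, seg_size=WAL_SEGMENT_SIZE):
    """Segment number containing the byte just before `ptr`."""
    return (ptr - 1) // seg_size
```

For a position in the middle of a segment both give the same answer. For a position exactly at the start of segment 5, which is where EndOfLog points after an XLOG_SWITCH, the first gives 5 and the second gives 4.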
Author: Andres Freund
Reported-By: Greg Burek
Discussion: https://postgr.es/m/20170619073026.zcwpe6mydsaz5ygd@alap3.anarazel.de
Backpatch: 9.2-, bug older than all supported versions
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4d wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop left of before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
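The "about half the cases" effect follows from tab-stop arithmetic. The Python sketch below assumes 1-based columns, tab stops at columns 1, 1+w, 1+2w and so on, and at least one blank between code and comment; those details and the function names are assumptions for illustration, not a description of indent's actual code.

```python
def next_tab_stop(column, tab_width):
    """Smallest tab-stop column that is >= `column` (1-based columns)."""
    return ((column - 1 + tab_width - 1) // tab_width) * tab_width + 1


def comment_column(code_end, tab_width, default=33):
    """Column where a trailing comment starts, given the last code column."""
    first_free = code_end + 2  # assumed: keep one blank after the code
    if first_free <= default:
        return default         # code is short enough: use the default column
    return next_tab_stop(first_free, tab_width)
```

With code ending at column 33, 8-space tab stops put the comment at column 41, while 4-space stops put it at column 37. With code ending at column 37, both give column 41.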
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
The new indent version includes numerous fixes thanks to Piotr Stefaniak.
The main changes visible in this commit are:
* Nicer formatting of function-pointer declarations.
* No longer unexpectedly removes spaces in expressions using casts,
sizeof, or offsetof.
* No longer wants to add a space in "struct structname *varname", as
well as some similar cases for const- or volatile-qualified pointers.
* Declarations using PG_USED_FOR_ASSERTS_ONLY are formatted more nicely.
* Fixes bug where comments following declarations were sometimes placed
with no space separating them from the code.
* Fixes some odd decisions for comments following case labels.
* Fixes some cases where comments following code were indented to less
than the expected column 33.
On the less good side, it now tends to put more whitespace around typedef
names that are not listed in typedefs.list. This might encourage us to
put more effort into typedef name collection; it's not really a bug in
indent itself.
There are more changes coming after this round, having to do with comment
indentation and alignment of lines appearing within parentheses. I wanted
to limit the size of the diffs to something that could be reviewed without
one's eyes completely glazing over, so it seemed better to split up the
changes as much as practical.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
load_libraries(), which processes the various xxx_preload_libraries GUCs,
was parsing them using SplitIdentifierString() which isn't really
appropriate for values that could be path names: it downcases unquoted
text, and it doesn't allow embedded whitespace unless quoted.
Use SplitDirectoriesString() instead. That also allows us to simplify
load_libraries() a bit, since canonicalize_path() is now done for it.
While this definitely seems like a bug fix, it has the potential to
break configuration settings that accidentally worked before because
of the downcasing behavior. Also, there's an easy workaround for the
bug, namely to double-quote troublesome text. Hence, no back-patch.
QL Zhuo, tweaked a bit by me
Discussion: https://postgr.es/m/CAB-oJtxHVDc3H+Km3CjB9mY1VDzuyaVH_ZYSz7iXcRqCtb93Ew@mail.gmail.com
Viewing a table with \d in psql also shows the publications the table is
in. If a publication is concurrently dropped, this shows an error,
because the view pg_publication_tables internally uses
pg_get_publication_tables(), which uses a catalog snapshot. This can be
particularly annoying if a for-all-tables publication is concurrently
dropped.
To avoid that, write the query in psql differently. Expose the function
pg_relation_is_publishable() to SQL and write the query using that.
That still has a risk of being affected by concurrent catalog changes,
but in this case it would be a table drop that causes problems, and then
the psql \d command wouldn't be interesting anymore anyway.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
David Rowley found that the "use the smallest per-column selectivity"
heuristic applied in some cases by get_foreign_key_join_selectivity()
was badly off if the FK columns are independent, producing estimates
much worse than we got before that code was added in 9.6.
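A quick numeric sketch shows how far the heuristic can drift. It assumes the usual independence model, in which per-column selectivities multiply; the function names and the sample values are made up for illustration.

```python
import math


def smallest_selectivity(selectivities):
    """The heuristic: use only the most selective column."""
    return min(selectivities)


def independent_selectivity(selectivities):
    """Independence assumption: selectivities multiply."""
    return math.prod(selectivities)
```

For two columns with selectivity 0.01 each, the heuristic yields 0.01 while the product is 0.0001, a hundredfold overestimate of the join size.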
One case where that heuristic was used was for LEFT and FULL outer joins
with the referenced rel on the outside of the join. But we should not
really need to special-case those here. eqjoinsel() never has had such a
special case; the correction is applied by calc_joinrel_size_estimate()
instead. Let's just estimate such cases like inner joins and rely on that
later adjustment. (I think there was something of a thinko here, in that
the comments seem to be thinking about the selectivity as defined for
semi/anti joins; but that shouldn't apply to left/full joins.) Add a
regression test exercising such a case to show that this is sane in
at least some cases.
The other case where we used that heuristic was for SEMI/ANTI outer joins,
either if the referenced rel was on the outside, or if it was on the inside
but was part of a join within the RHS. In either case, the FK doesn't give
us a lot of traction towards estimating the selectivity. To ensure that
we don't have regressions from what happened before 9.6, let's punt by
ignoring the FK in such cases and applying the traditional selectivity
calculation. (We might be able to improve on that later, but for now
I just want to be sure it's not worse than 9.5.)
Report and patch by David Rowley, simplified a bit by me. Back-patch
to 9.6 where this code was added.
Discussion: https://postgr.es/m/CAKJS1f8NO8oCDcxrteohG6O72uU1saEVT9qX=R8pENr5QWerXw@mail.gmail.com
When, during logical decoding, a transaction gets too big, its
contents get spilled to disk. Not just the top-level transaction gets
spilled, but *also* all of its subtransactions, even if they're not
that large themselves. Unfortunately we didn't clean up
such small spilled subtransactions from disk.
Fix that, by keeping better track of whether a transaction has been
spilled to disk.
Author: Andres Freund
Reported-By: Dmitriy Sarafannikov, Fabrízio de Royes Mello
Discussion: https://postgr.es/m/1457621358.355011041@f382.i.mail.ru
Discussion: https://postgr.es/m/CAFcNs+qNMhNYii4nxpO6gqsndiyxNDYV0S=JNq0v_sEE+9PHXg@mail.gmail.com
Backpatch: 9.4-, where logical decoding was introduced
Windows uses a separate code path for libc locales. The code previously
ended up there also if an ICU collation should be used, leading to a
crash.
Reported-by: Ashutosh Sharma <ashu.coek88@gmail.com>
When a new base type is created using the old-style procedure of first
creating the input/output functions with "opaque" in place of the base
type, the "opaque" argument/return type is changed to the final base type,
on CREATE TYPE. However, we did not create a pg_depend record when doing
that, so the functions were left not depending on the type.
Fixes bug #14706, reported by Karen Huddleston.
Discussion: https://www.postgresql.org/message-id/20170614232259.1424.82774@wrigleys.postgresql.org
The _equalTableFunc() omission of coltypmods has semantic significance,
but I did not track down resulting user-visible bugs, if any. The other
changes are cosmetic only, affecting order. catversion bump due to
readfuncs.c field order change.
We had three occurrences of essentially the same coding pattern
wherein we tried to retrieve a query result from a libpq connection
without blocking. In the case where PQconsumeInput failed (typically
indicating a lost connection), all three loops simply gave up and
returned, forgetting to clear any previously-collected PGresult
object. Since those are malloc'd not palloc'd, the oversight results
in a process-lifespan memory leak.
One instance, in libpqwalreceiver, is of little significance because
the walreceiver process would just quit anyway if its connection fails.
But we might as well fix it.
The other two instances, in postgres_fdw, are somewhat more worrisome
because at least in principle the scenario could be repeated, allowing
the amount of memory leaked to build up to something worth worrying
about. Moreover, in these cases the loops contain CHECK_FOR_INTERRUPTS
calls, as well as other calls that could potentially elog(ERROR),
providing another way to exit without having cleared the PGresult.
Here we need to add PG_TRY logic similar to what exists in quite a
few other places in postgres_fdw.
Coverity noted the libpqwalreceiver bug; I found the other two cases
by checking all calls of PQconsumeInput.
Back-patch to all supported versions as appropriate (9.2 lacks
postgres_fdw, so this is really quite unexciting for that branch).
Discussion: https://postgr.es/m/22620.1497486981@sss.pgh.pa.us
Previously we required every exported transaction to have an xid
assigned. That was used to check that the exporting transaction is
still running, which in turn is needed to guarantee that the
necessary rows haven't been removed in between exporting and importing
the snapshot.
The exported xid caused unnecessary problems with logical decoding,
because slot creation has to wait for all concurrent xids to finish,
which in turn serializes concurrent slot creation. It also
prevented snapshots from being exported on hot-standby replicas.
Instead export the virtual transactionid, which avoids the unnecessary
serialization and the inability to export snapshots on standbys. This
changes the file name of the exported snapshot, but since we never
documented what that one means, that seems ok.
Author: Petr Jelinek, slightly editorialized by me
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/f598b4b8-8cd7-0d54-0939-adda763d8c34@2ndquadrant.com
In a CHECK clause, a null result means true, whereas in a WHERE clause
it means false. predtest.c provided different functions depending on
which set of semantics applied to the predicate being proved, but had
no option to control what a null meant in the clauses provided as
axioms. Add one.
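The two sets of semantics differ only in how a null result is treated. A minimal Python sketch, using None for SQL NULL, with hypothetical function names:

```python
def sql_gt(a, b):
    """Three-valued 'a > b': the result is NULL if either input is NULL."""
    if a is None or b is None:
        return None
    return a > b


def passes_check(result):
    """CHECK semantics: only an explicit false rejects the row."""
    return result is not False


def passes_where(result):
    """WHERE semantics: only an explicit true keeps the row."""
    return result is True
```

A row with a NULL column therefore satisfies `CHECK (x > 0)` but is filtered out by `WHERE x > 0`, which is why a prover needs to know which meaning applies to its axioms as well as to its goal.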
Use that in the partitioning code when figuring out whether the
validation scan on a new partition can be skipped. Rip out the
old logic that attempted (not very successfully) to compensate
for the absence of the necessary support in predtest.c.
Ashutosh Bapat and Robert Haas, reviewed by Amit Langote and
incorporating feedback from Tom Lane.
Discussion: http://postgr.es/m/CAFjFpReT_kq_uwU_B8aWDxR7jNGE=P0iELycdq5oupi=xSQTOw@mail.gmail.com
The optimized code in 728bd991c3 contains a few invalid locking
sequences. To wit, the original code would try to acquire an lwlock
that it already holds. Avoid this by moving lock acquisitions to
higher-level code, and install appropriate assertions in low-level code
that the correct mode is held.
Authors: Michael Paquier, Álvaro Herrera
Reported-By: chuanting wang
Bug: #14680
Discussion: https://postgr.es/m/20170531033228.1487.10124@wrigleys.postgresql.org
expression_returns_set() used to short-circuit its recursion upon
seeing certain node types, such as DistinctExpr, that it knew the
executor did not support set-valued arguments for. That was never
inherent, though, just a reflection of laziness in execQual.c.
With the new implementation of SRFs there is no reason to think
that any scalar-valued expression node could not have a set-valued
subexpression, except for AggRefs and WindowFuncs where we know there
is a parser check rejecting it. And indeed, the shortcut causes
unexpected failures for cases such as a SRF underneath DistinctExpr,
because the planner stops looking for SRFs too soon.
Discussion: https://postgr.es/m/5259.1497044025@sss.pgh.pa.us
In commits 2f5c9d9c9 and ab0289651 we invented an abstraction layer
to insulate catalog manipulations from direct heap update calls.
But evidently some patches that hadn't landed in-tree at that point
didn't get the memo completely. Fix a couple of direct calls to
simple_heap_delete to use CatalogTupleDelete instead; these appear
to have been added in commits 7c4f52409 and 7b504eb28. This change is
purely cosmetic ATM, but there's no point in having an abstraction layer
if we allow random code to break it.
Masahiko Sawada and Tom Lane
Discussion: https://postgr.es/m/CAD21AoDOPRSVcwbnCN3Y1n_68ATyTspsU6=ygtHz_uY0VcdZ8A@mail.gmail.com
When we reimplemented SRFs in commit 69f4b9c85, our initial choice was
to allow the behavior to vary from historical practice in cases where a
SRF call appeared within a conditional-execution construct (currently,
only CASE or COALESCE). But that was controversial to begin with, and
subsequent discussion has resulted in a consensus that it's better to
throw an error instead of executing the query differently from before,
so long as we can provide a reasonably clear error message and a way to
rewrite the query.
Hence, add a parser mechanism to allow detection of such cases during
parse analysis. The mechanism just requires storing, in the ParseState,
a pointer to the set-returning FuncExpr or OpExpr most recently emitted
by parse analysis. Then the parsing functions for CASE and COALESCE can
detect the presence of a SRF in their arguments by noting whether this
pointer changes while analyzing their arguments. Furthermore, if it does,
it provides a suitable error cursor location for the complaint. (This
means that if there's more than one SRF in the arguments, the error will
point at the last one to be analyzed not the first. While connoisseurs of
parsing behavior might find that odd, it's unlikely the average user would
ever notice.)
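The pointer-comparison trick can be modeled compactly. This Python sketch represents expressions as dicts and is only a toy: the node layout, `last_srf`, and the exception are hypothetical stand-ins for the ParseState field and the parser's error reporting.

```python
class ParseState:
    def __init__(self):
        self.last_srf = None  # most recently analyzed set-returning call


class SRFNotAllowed(Exception):
    def __init__(self, location):
        super().__init__(f"set-returning function not allowed at {location}")
        self.location = location


def transform_expr(pstate, node):
    kind = node["kind"]
    if kind == "const":
        return node
    if kind == "func":
        for arg in node.get("args", []):
            transform_expr(pstate, arg)
        if node.get("returns_set"):
            pstate.last_srf = node   # remember the latest SRF seen
        return node
    if kind == "case":
        saved = pstate.last_srf
        for arg in node["args"]:
            transform_expr(pstate, arg)
        if pstate.last_srf is not saved:
            # The pointer moved, so some argument contained an SRF; it
            # also tells us where to point the error cursor.
            raise SRFNotAllowed(pstate.last_srf["location"])
        return node
    raise ValueError(f"unknown node kind: {kind}")


def srf_error_location(node):
    """Return the error location, or None if the expression is accepted."""
    try:
        transform_expr(ParseState(), node)
    except SRFNotAllowed as err:
        return err.location
    return None
```

When a CASE contains SRFs at locations 10 and 25, the reported location is 25, the last one analyzed, matching the behavior described above.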
While at it, we can also provide more specific error messages than before
about some pre-existing restrictions, such as no-SRFs-within-aggregates.
Also, reject at parse time cases where a NULLIF or IS DISTINCT FROM
construct would need to return a set. We've never supported that, but the
restriction is depended on in more subtle ways now, so it seems wise to
detect it at the start.
Also, provide some documentation about how to rewrite a SRF-within-CASE
query using a custom wrapper SRF.
It turns out that the information_schema.user_mapping_options view
contained an instance of exactly the behavior we're now forbidding; but
rewriting it makes it more clear and safer too.
initdb forced because of user_mapping_options change.
Patch by me, with error message suggestions from Alvaro Herrera and
Andres Freund, pursuant to a complaint from Regina Obe.
Discussion: https://postgr.es/m/000001d2d5de$d8d66170$8a832450$@pcorp.us
Table partitioning, introduced in commit f0e44751d7, added a new
relkind - RELKIND_PARTITIONED_TABLE. Update relation_is_updatable() to
handle it. Specifically, partitioned tables and simple views built on
top of them are updatable.
This affects the SQL-callable functions pg_relation_is_updatable() and
pg_column_is_updatable(), and the views information_schema.views and
information_schema.columns.
Dean Rasheed, reviewed by Ashutosh Bapat.
Discussion: https://postgr.es/m/CAEZATCXnbiFkMXgF4Ez1pmM2c-tS1z33bSq7OGbw7QQhHov%2B6Q%40mail.gmail.com
Previously, you could write _null_ in a BKI DATA line for a column that's
supposed to be NOT NULL and initdb would let it pass, probably breaking
subsequent accesses to the row. No doubt the original coding overlooked
this simple sanity check because in the beginning we didn't have any way
to mark catalog columns NOT NULL at initdb time.
ExecInitModifyTable() thought there was a plan per partition, but no,
there's only one. The problem had escaped detection so far because there
would only be visible misbehavior if there were a SubPlan (not an InitPlan)
in the quals being duplicated for each partition. However, valgrind
detected a bogus memory access in test cases added by commit 4f7a95be2,
and investigation of that led to discovery of the bug. The additional
test case added here crashes without the patch.
Patch by Amit Langote, test case by me.
Discussion: https://postgr.es/m/10974.1497227727@sss.pgh.pa.us
During pg_upgrade's restore run, all relfilenode choices should be
overridden by commands in the dump script. If we ever find ourselves
choosing a relfilenode in the ordinary way, someone blew it. Likewise for
pg_type OIDs. Since pg_upgrade might well succeed anyway, if there happens
not to be a conflict during the regression test run, we need assertions
here to keep us on the straight and narrow.
We might someday be able to remove the assertion in GetNewRelFileNode,
if pg_upgrade is rewritten to remove its assumption that old and new
relfilenodes always match. But it's hard to see how to get rid of the
pg_type OID constraint, since those OIDs are embedded in user tables
in some cases.
Back-patch as far as 9.5, because of the risk of back-patches breaking
something here even if it works in HEAD. I'd prefer to go back further,
but 9.4 fails both assertions due to get_rel_infos()'s use of a temporary
table. We can't use the later-branch solution of a CTE for compatibility
reasons (cf commit 5d16332e9), and it doesn't seem worth inventing some
other way to do the query. (I did check, by dint of changing the Asserts
to elog(WARNING), that there are no other cases of unwanted OID assignments
during 9.4's regression test run.)
Discussion: https://postgr.es/m/19785.1497215827@sss.pgh.pa.us
It's not necessary for it to do that, since OWNED BY requires only ordinary
catalog updates and doesn't affect future sequence values. And pg_upgrade
needs to use OWNED BY without having it change the sequence's relfilenode.
Commit 3d79013b9 broke this by making all forms of ALTER SEQUENCE change
the relfilenode; that seems to be the explanation for the hard-to-reproduce
buildfarm failures we've been seeing since then.
Discussion: https://postgr.es/m/19785.1497215827@sss.pgh.pa.us
The original code only added ICU_CFLAGS to the backend build. But it is
also needed for building external modules that include pg_locale.h. So
add it to the global CPPFLAGS. (This is only relevant if ICU is not in
a compiler default path, so it apparently hasn't bitten many.)
When a table sync worker is in waiting state and the subscription table
entry is removed because of a concurrent subscription refresh, the
worker could be left orphaned. To avoid that, explicitly stop the
worker when the pg_subscription_rel entry is removed.
Reported-by: Masahiko Sawada <sawada.mshk@gmail.com>
generateSerialExtraStmts() was sloppy about handling the case where
SEQUENCE NAME is given with a not-schema-qualified name. It was generating
a CreateSeqStmt with an unqualified sequence name, and an AlterSeqStmt
whose "owned_by" DefElem contained a T_String Value with a null string
pointer in the schema-name position. The generated nextval() argument was
also underqualified. This accidentally failed to fail at runtime, but only
so long as the current default creation namespace at runtime is the right
namespace. That's bogus; the parse-time transformation is supposed to be
inserting the right schema name in all cases, so as to avoid any possible
skew in that selection. I'm not sure this could fail in pg_dump's usage,
but it's still wrong; we have had real bugs in this area before adopting
the policy that parse_utilcmd.c should generate only fully-qualified
auxiliary commands. A slightly lesser problem, which is what led me to
notice this in the first place, is that pprint() dumped core on the
AlterSeqStmt because of the bogus T_String.
Noted while poking into the open problem with ALTER SEQUENCE breaking
pg_upgrade.
The new partitioned table capability added a new relkind, namely
RELKIND_PARTITIONED_TABLE. Update fireRIRrules() to apply RLS
policies on RELKIND_PARTITIONED_TABLE as it does RELKIND_RELATION.
In addition, add RLS regression test coverage for partitioned tables.
Issue raised by Fakhroutdinov Evgenievich and patch by Mike Palmiotto.
Regression test editorializing by me.
Discussion: https://postgr.es/m/flat/20170601065959.1486.69906@wrigleys.postgresql.org
When a table is removed from a subscription before the tablesync worker
could start, this would previously result in an error when reading
pg_subscription_rel. Now we just ignore this.
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Previously the exit handling was only able to exit from within the
main loop, and not from within the backend code it calls. Fix that by
using the standard die() SIGTERM handler, and adding the necessary
CHECK_FOR_INTERRUPTS() call.
This requires adding yet another process-type-specific branch to
ProcessInterrupts(), which hints that we probably should generalize
that handling. But that's work for another day.
Author: Petr Jelinek
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/fe072153-babd-3b5d-8052-73527a6eb657@2ndquadrant.com
Since 7c4f52409a (merged in v10), a shutdown master is reported as
FATAL: unexpected result after CommandComplete: server closed the connection unexpectedly
by walsender. It used to be
LOG: replication terminated by primary server
FATAL: could not send end-of-streaming message to primary: no COPY in progress
While the old message clearly is not perfect, it's definitely better
than what's reported now.
The change comes from the attempt to handle finished COPYs without
erroring out, needed for the new logical replication, which wasn't
needed before.
There are probably better ways to handle this, but for now just
explicitly check for a closed connection.
Author: Petr Jelinek
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/f7c7dd08-855c-e4ed-41f4-d064a6c0665a@2ndquadrant.com
Backpatch: -
Most of the improvements were in the new SCRAM code:
* In SCRAM protocol violation messages, use errdetail to provide the
details.
* If pg_backend_random() fails, throw an ERROR rather than just LOG. We
shouldn't continue authentication if we can't generate a random nonce.
* Use ereport() rather than elog() for the "invalid SCRAM verifier"
messages. They shouldn't happen, if everything works, but it's not
inconceivable that someone would have invalid scram verifiers in
pg_authid, e.g. if a broken client application was used to generate the
verifier.
But this change applied to old code:
* Use ERROR rather than COMMERROR for protocol violation errors. There's
no reason to not tell the client what they did wrong. The client might be
confused already, so that it cannot read and display the error correctly,
but let's at least try. In the "invalid password packet size" case, we
used to actually continue with authentication anyway, but that is now a
hard error.
Patch by Michael Paquier and me. Thanks to Daniel Varrazzo for spotting
the typo in one of the messages that spurred the discussion and these
larger changes.
Discussion: https://www.postgresql.org/message-id/CA%2Bmi_8aZYLhuyQi1Jo0hO19opNZ2OEATEOM5fKApH7P6zTOZGg%40mail.gmail.com
A logical replication worker should not insert new rows into
pg_subscription_rel, only update existing rows, so that there are no
races if a concurrent refresh removes rows. Adjust the API to be able
to choose that behavior.
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reported-by: tushar <tushar.ahuja@enterprisedb.com>
Since tuple-routing implicitly checks the partitioning constraints
at least for the levels of the partitioning hierarchy it traverses,
there's normally no need to revalidate the partitioning constraint
after performing tuple routing. However, if there's a BEFORE trigger
on the target partition, it could modify the tuple, causing the
partitioning constraint to be violated. Catch that case.
Also, instead of checking the root table's partition constraint after
tuple-routing, check it beforehand. Otherwise, the rules for when
the partitioning constraint gets checked get too complicated, because
you sometimes have to check part of the constraint but not all of it.
This effectively reverts commit 39162b2030
in favor of a different approach altogether.
Report by me. Initial debugging by Jeevan Ladhe. Patch by Amit
Langote, reviewed by me.
Discussion: http://postgr.es/m/CA+Tgmoa9DTgeVOqopieV8d1QRpddmP65aCdxyjdYDoEO5pS5KA@mail.gmail.com
The logical replication apply worker uses the subscription name as
application name, except for table sync. This was incorrectly set to
use the replication slot name, which might be different, in one case.
Also add a comment explaining why the other case is different.
The larger part of this patch replaces usages of MyProc->procLatch
with MyLatch. The latter works even early during backend startup,
where MyProc->procLatch doesn't yet. While the affected code
shouldn't run in cases where it's not initialized, it might get copied
into places where it might. Using MyLatch is simpler and a bit faster
to boot, so there's little point in sticking with the previous coding.
While doing so I noticed some weaknesses around newly introduced uses
of latches that could lead to missed events, and an omitted
CHECK_FOR_INTERRUPTS() call in worker_spi.
As all the actual bugs are in v10 code, there doesn't seem to be
sufficient reason to backpatch this.
Author: Andres Freund
Discussion:
https://postgr.es/m/20170606195321.sjmenrfgl2nu6j63@alap3.anarazel.de
https://postgr.es/m/20170606210405.sim3yl6vpudhmufo@alap3.anarazel.de
Backpatch: -
Make apply busy wait check the catalog instead of shmem state to ensure
that the next transaction will see the expected table synchronization
state. Also make the handover always go through the same set of steps to
make the overall process easier to understand and debug.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Tested-by: Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
Tested-by: Erik Rijkers <er@xs4all.nl>
Consistent with what we do for indexes, we shouldn't try to record
dependencies on collation OID 0 or the default collation OID (which
is pinned). Also, the fact that indcollation and partcollation can
contain zero OIDs when the data type is not collatable should be
documented.
Amit Langote, per a complaint from me.
Discussion: http://postgr.es/m/CA+Tgmoba5mtPgM3NKfG06vv8na5gGbVOj0h4zvivXQwLw8wXXQ@mail.gmail.com
This allows cancelling commands run over replication connections. While
it might have had some use before v10, it has become important now that
normal SQL commands are allowed in database-connected walsender
connections.
Author: Petr Jelinek
Reviewed-By: Andres Freund, Michael Paquier
Discussion: https://postgr.es/m/7966f454-7cd7-2b0c-8b70-cdca9d5a8c97@2ndquadrant.com
Because walsender and normal backends share the same main loop it's
problematic to have two different flag variables, set in signal
handlers, indicating a pending configuration reload. Only certain
walsender commands reach code paths checking for the
variable (START_[LOGICAL_]REPLICATION, CREATE_REPLICATION_SLOT
... LOGICAL, notably not base backups).
This is a bug present since the introduction of walsender, but has
gotten worse in releases since then which allow walsender to do more.
A later patch, not slated for v10, will similarly unify SIGHUP
handling in other types of processes as well.
Author: Petr Jelinek, Andres Freund
Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/20170423235941.qosiuoyqprq4nu7v@alap3.anarazel.de
Backpatch: 9.2-, bug is present since 9.0
When the checkpointer writes the shutdown checkpoint, it checks
afterwards whether any WAL has been written since it started and
throws a PANIC if so. At that point, only walsenders are still
active, so one might think this could not happen, but walsenders can
also generate WAL, for instance in BASE_BACKUP and logical decoding
related commands (e.g. via hint bits). So they can trigger this panic
if such a command is run while the shutdown checkpoint is being
written.
To fix this, divide the walsender shutdown into two phases. First,
checkpointer, itself triggered by postmaster, sends a
PROCSIG_WALSND_INIT_STOPPING signal to all walsenders. If the backend
is idle or runs an SQL query, this causes the backend to shut down; if
logical replication is in progress, all existing WAL records are
processed, followed by a shutdown. Otherwise this causes the walsender
to switch to the "stopping" state. In this state, the walsender will
reject any further replication commands. The checkpointer begins the
shutdown checkpoint once all walsenders are confirmed as
stopping. When the shutdown checkpoint finishes, the postmaster sends
us SIGUSR2. This instructs walsender to send any outstanding WAL,
including the shutdown checkpoint record, wait for it to be replicated
to the standby, and then exit.
Author: Andres Freund, based on an earlier patch by Michael Paquier
Reported-By: Fujii Masao, Andres Freund
Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/20170602002912.tqlwn4gymzlxpvs2@alap3.anarazel.de
Backpatch: 9.4, where logical decoding was introduced
The non-participation in procsignal was a problem for both changes in
master, e.g. parallelism not working for normal statements run in
walsender backends, and older branches, e.g. recovery conflicts and
catchup interrupts not working for logical decoding walsenders.
This commit thus replaces the previous WalSndXLogSendHandler with
procsignal_sigusr1_handler. In branches since db0f6cad48 that can
lead to additional SetLatch calls, but that only rarely seems to make
a difference.
Author: Andres Freund
Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/20170421014030.fdzvvvbrz4nckrow@alap3.anarazel.de
Backpatch: 9.4, earlier commits don't seem to benefit sufficiently
This reverts commit 086221cf6b, which
was made to master only.
The approach implemented in the above commit has some issues. While
those could easily be fixed incrementally, doing so would make
backpatching considerably harder, so instead first revert this patch.
Discussion: https://postgr.es/m/20170602002912.tqlwn4gymzlxpvs2@alap3.anarazel.de
There was a grammar ambiguity between SET PUBLICATION name REFRESH and
SET PUBLICATION SKIP REFRESH, because SKIP is not a reserved word. To
resolve that, fold the refresh choice into the WITH options. Refreshing
is the default now.
Reported-by: tushar <tushar.ahuja@enterprisedb.com>
Otherwise code that uses this will abort with an assertion failure,
because postmaster_alive_fds are not initialized.
Reported-by: tushar <tushar.ahuja@enterprisedb.com>
Declare the toc_nentry field as uint32 not Size. Since shm_toc_lookup()
reads the field without any lock, it has to be atomically readable, and
we do not assume that for fields wider than 32 bits. Performance would
be impossibly bad for entry counts approaching 2^32 anyway, so there is
no need to try to preserve maximum width here.
This is probably an academic issue, because even if reading int64 isn't
atomic, the high order half would never change in practice. Still, it's
a coding rule violation, so let's fix it.
Adjust some other not-terribly-well-chosen data types too, and copy-edit
some comments. Make shm_toc_attach's Asserts consistent with
shm_toc_create's.
None of this looks to be a live bug, so no need for back-patch.
Discussion: https://postgr.es/m/16984.1496679541@sss.pgh.pa.us
Given the possibility of race conditions and so on, it seems entirely
unsafe to just assume that shm_toc_lookup() always finds the key it's
looking for --- but that was exactly what all but one call site were
doing. To fix, add a "bool noError" argument, similarly to what we
have in many other functions, and throw an error on an unexpected
lookup failure. Remove now-redundant Asserts that a rather random
subset of call sites had.
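The noError convention described above can be sketched in a few lines.
This is a Python model rather than the C in shm_toc.c: toc_lookup,
TocLookupError, the dict-based table and the error text are all
hypothetical names chosen for illustration, and the real function's
signature is not reproduced here.

```python
class TocLookupError(Exception):
    """Raised when a key is missing and the caller did not allow that."""


def toc_lookup(toc, key, no_error=False):
    """Return the entry for key, or None if it is absent and no_error is set.

    With no_error left False, a missing key raises immediately instead of
    handing the caller a None that it might dereference much later.
    (Hypothetical model; the table is just a dict here.)
    """
    entry = toc.get(key)
    if entry is None and not no_error:
        raise TocLookupError(f"could not find key {key} in table of contents")
    return entry
```

Call sites that can tolerate a missing key pass no_error=True and test
the result themselves; everything else gets a clean error for free.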
I doubt this will throw any light on buildfarm member lorikeet's
recent failures, because if an unnoticed lookup failure were involved,
you'd kind of expect a null-pointer-dereference crash rather than the
observed symptom. But you never know ... and this is better coding
practice even if it never catches anything.
Discussion: https://postgr.es/m/9697.1496675981@sss.pgh.pa.us
get_partition_parent felt that it could simply Assert that systable_getnext
found a tuple. This is unlike any other caller of that function, and it's
unsafe IMO --- in fact, the reason I noticed it was that the Assert failed.
(OK, I was working with known-inconsistent catalog contents, but I wasn't
expecting the DB to fall over quite that violently. The behavior in a
non-assert-enabled build wouldn't be very nice, either.) Fix it to do what
other callers do, namely an actual runtime-test-and-elog.
Also, standardize the wording of elog messages that are complaining about
unexpected failure of systable_getnext. 90% of them say "could not find
tuple for <object>", so make the remainder do likewise. Many of the
holdouts were using the phrasing "cache lookup failed", which is outright
misleading since no catcache search is involved.
I'd always assumed that backend/optimizer/geqo/'s remarkably poor
showing on code coverage metrics was because we weren't exercising
it much in the regression tests. But it turns out that a good chunk
of the problem is that there's a bunch of code that is physically
unreachable (because the calls to it are #ifdef'd out in geqo_main.c)
but is being built anyway. Making the called code have #if guards
similar to the calling code saves a couple of kilobytes of executable
size and should make the coverage numbers more reflective of reality.
It's arguable that we should just delete all the unused recombination
mechanisms altogether, but I didn't feel a need to go that far today.
If we allow this, whatever outer command has the table open will not know
about the new index and may fail to update it as needed, as shown in a
report from Laurenz Albe. We already had such a prohibition in place for
ALTER TABLE, but the CREATE INDEX syntax missed the check.
Fixing it requires an API change for DefineIndex(), which conceivably
would break third-party extensions if we were to back-patch it. Given
how long this problem has existed without being noticed, fixing it in
the back branches doesn't seem worth that risk.
Discussion: https://postgr.es/m/A737B7A37273E048B164557ADEF4A58B53A4DC9A@ntex2010i.host.magwien.gv.at
These estimators returned 1 minus the corresponding equality/match
estimate, which is incorrect: we need to subtract off the fraction
of nulls in the column, since those are neither equal nor not equal
to the comparison value. The error only becomes obvious if the
nullfrac is large, but it could be very bad in a mostly-nulls
column, as reported in bug #14676 from Marko Tiikkaja.
To fix the <> case, refactor eqsel() and neqsel() to call a common
support routine, which can be made to account for nullfrac correctly.
The pattern-match cases were already factored that way, and it was
simply an oversight that patternsel() wasn't subtracting off nullfrac.
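As a rough illustration of the arithmetic (a Python sketch, not the
planner code; the function names and the clamping helper are assumptions
made for this example), the old and corrected <> estimates differ only by
the null fraction:

```python
def clamp_probability(value):
    """Keep a selectivity estimate within [0, 1]."""
    return max(0.0, min(1.0, value))


def neq_selectivity_old(eq_sel):
    """Old behavior: everything that is not equal counts as 'not equal'."""
    return clamp_probability(1.0 - eq_sel)


def neq_selectivity_fixed(eq_sel, nullfrac):
    """Corrected behavior: NULLs are neither equal nor not equal, so the
    null fraction is subtracted as well."""
    return clamp_probability(1.0 - eq_sel - nullfrac)
```

For a column that is 95% NULL with an equality estimate of 0.01, the old
formula claims that nearly every row passes the <> test, while the
corrected one yields roughly 0.04.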
neqjoinsel() has a similar problem, but since we're elsewhere discussing
changing its behavior entirely, I left it alone for now.
This is a very longstanding bug, but I'm hesitant to back-patch a fix for
it. Given the lack of prior complaints, such cases must not come up often,
so it's probably not worth the risk of destabilizing plans in stable
branches.
Discussion: https://postgr.es/m/20170529153847.4275.95416@wrigleys.postgresql.org
When costing a nestloop with stop-at-first-inner-match semantics, and a
non-indexscan inner path, final_cost_nestloop() wants to charge the full
scan cost of the inner rel at least once, with additional scans charged
at inner_rescan_run_cost which might be less. However the logic for
doing this effectively assumed that outer_matched_rows is at least 1.
If it's zero, which is not unlikely for a small outer rel, we ended up
charging inner_run_cost plus N times inner_rescan_run_cost, as much as
double the correct charge for an outer rel with only one row that
we're betting won't be matched. (Unless the inner rel is materialized,
in which case it has very small inner_rescan_run_cost and the cost
is not so far off what it should have been.)
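The overcharge can be seen in a deliberately simplified model of the
zero-matched-rows case. This is a Python sketch under stated
assumptions, not the logic in final_cost_nestloop(): both function names
are hypothetical, and only the inner-scan component of the cost is
modeled.

```python
def inner_scan_charge_overcounted(outer_rows, inner_run_cost,
                                  inner_rescan_run_cost):
    """Model of the faulty charge: one full scan plus a rescan for every
    outer row, which counts one inner scan too many."""
    return inner_run_cost + outer_rows * inner_rescan_run_cost


def inner_scan_charge_intended(outer_rows, inner_run_cost,
                               inner_rescan_run_cost):
    """Model of the intended charge: the first inner scan at full cost,
    each remaining scan at the (possibly cheaper) rescan cost."""
    if outer_rows <= 0:
        return 0.0
    return inner_run_cost + (outer_rows - 1) * inner_rescan_run_cost
```

With one outer row and a rescan cost equal to the run cost, the first
function returns twice what the second does; with a very small rescan
cost, as for a materialized inner rel, the two are nearly equal.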
The upshot of this was that the planner had a tendency to select plans
that failed to make effective use of the stop-at-first-inner-match
semantics, and that might have Materialize nodes in them even when the
predicted number of executions of the Materialize subplan was only 1.
This was not so obvious before commit 9c7f5229a, because the case only
arose in connection with semi/anti joins where there's not freedom to
reverse the join order. But with the addition of unique-inner joins,
it could result in some fairly bad planning choices, as reported by
Teodor Sigaev. Indeed, some of the test cases added by that commit
have plans that look dubious on closer inspection, and are changed
by this patch.
Fix the logic to ensure that we don't charge for too many inner scans.
I chose to adjust it so that the full-freight scan cost is associated
with an unmatched outer row if possible, not a matched one, since that
seems like a better model of what would happen at runtime.
This is a longstanding bug, but given the lesser impact in back branches,
and the lack of field complaints, I won't risk a back-patch.
Discussion: https://postgr.es/m/CAKJS1f-LzkUsFxdJ_-Luy38orQ+AdEXM5o+vANR+-pHAWPSecg@mail.gmail.com
We didn't accept any invalidation messages until the whole sync process
had finished (because it flattens all the remote transactions into a
single one). So the sync worker didn't learn about subscription
changes/drop until it had finished. This could lead to "orphaned" sync
workers.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reported-by: Masahiko Sawada <sawada.mshk@gmail.com>
This avoids "orphaned" sync workers.
This was caused by a thinko in wait_for_sync_status_change.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reported-by: Masahiko Sawada <sawada.mshk@gmail.com>
Previously this was not allowed, as copy.c didn't set the
CURSOR_OPT_PARALLEL_OK flag when planning the query. Set it.
While the lack of parallel query for COPY isn't strictly speaking a
bug, it does prevent parallelism from being used in a facility
commonly used to run long running queries. Thus backpatch to 9.6.
Author: Andres Freund
Discussion: https://postgr.es/m/20170531231958.ihanapplorptykzm@alap3.anarazel.de
Backpatch: 9.6, where parallelism was introduced.
When trying to access a replication slot that is supposed to already
exist, we don't need to check the naming rules again. If the slot
does not exist, we will then get a "does not exist" error message, which
is generally more useful from the perspective of an end user.
The logical replication worker processes now use the normal die()
handler for SIGTERM and CHECK_FOR_INTERRUPTS() instead of custom code.
One problem before was that the apply worker would not exit promptly
when a subscription was dropped, which could lead to deadlocks.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reported-by: Masahiko Sawada <sawada.mshk@gmail.com>
Move the walrcv_disconnect() calls into the before_shmem_exit handler.
This makes sure the call is always made even during exit by signal, it
saves some duplicate code, and it makes the logic more similar to
walreceiver.c.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Previously the changes to the "data" part of the sequence, i.e. the
one containing the current value, were not transactional, whereas the
definition, including minimum and maximum value, was. That leads to
odd behaviour if a schema change is rolled back, with the potential
that out-of-bound sequence values can be returned.
To avoid the issue, create a new relfilenode fork whenever ALTER
SEQUENCE is executed, similar to how TRUNCATE ... RESTART IDENTITY
is already handled.
This commit also makes ALTER SEQUENCE RESTART transactional, as it
seems to be too confusing to have some forms of ALTER SEQUENCE behave
transactionally, some forms not. This way setval() and nextval() are
not transactional, but DDL is, which seems to make sense.
This commit also rolls back parts of the changes made in 3d092fe540
and f8dc1985f as they're now not needed anymore.
Author: Andres Freund
Discussion: https://postgr.es/m/20170522154227.nvafbsm62sjpbxvd@alap3.anarazel.de
Backpatch: Bug is in master/v10 only
Remove some gratuitous message differences by making the AM name
previously embedded in each message be a %s instead. While at it, get
rid of terminology that's unclear and unnecessary in one message.
Discussion: https://postgr.es/m/20170523001557.bq2hbq7hxyvyw62q@alvherre.pgsql
We could have limped along without this for v10, which was my intention
when I annotated the bug in commit 76a3df6e5. But consensus is that it's
better to fix it now and take the cost of a post-beta1 initdb (which is
needed because these node types are stored in pg_class.relpartbound).
Since we're forcing initdb anyway, take the opportunity to make the node
type identification strings match the node struct names, instead of being
randomly different from them.
Discussion: https://postgr.es/m/E1dFBEX-0004wt-8t@gemulon.postgresql.org
Per our message style guidelines, error messages incorporating the
results of format_type_be() and its siblings should not add quotes
around those results, because those functions already add quotes
at need. Fix a few places that hadn't gotten that memo.
json_populate_record throws an error if asked to convert a JSON scalar
or array into a composite type. jsonb_populate_record was returning
a record full of NULL fields instead. It seems better to make it
throw an error for this case as well.
Nikita Glukhov
Discussion: https://postgr.es/m/fbd1d566-bba0-a3de-d6d0-d3b1d7c24ff2@postgrespro.ru
The macro gave the wrong answers for a JsObject with is_json == 0:
it would return 1 if jsonb_cont == NULL, or if that wasn't NULL,
it would return 1 for any non-zero size.
We could fix that, but the only use of this macro at present is in the
JsObjectIsEmpty() macro, so it seems simpler and clearer to get rid of
JsObjectSize() and put corrected logic into JsObjectIsEmpty().
Thinko in commit cf35346e8, so no need for back-patch.
Nikita Glukhov
Discussion: https://postgr.es/m/fbd1d566-bba0-a3de-d6d0-d3b1d7c24ff2@postgrespro.ru
The NumericOnly grammar production accepted ICONST, + ICONST, - ICONST,
FCONST, and - FCONST, but for some reason not + FCONST. This led to
strange inconsistencies like
regression=# set random_page_cost = +4;
SET
regression=# set random_page_cost = 4000000000;
SET
regression=# set random_page_cost = +4000000000;
ERROR: syntax error at or near "4000000000"
(because 4000000000 is too large to be an ICONST). While there's
no actual functional reason to need to write a "+", if we allow
it for integers it seems like we should allow it for numerics too.
It's been like that forever, so back-patch to all supported branches.
Discussion: https://postgr.es/m/30908.1496006184@sss.pgh.pa.us
Avoid trashing the input PartitionBoundSpec; while that might be safe for
current callers, it's certainly trouble waiting to happen. In the same
vein, make sure that all of the result data structure is freshly palloc'd,
rather than some of it being pointers into the input data structures
(which we don't know the lifespans of).
Simplify the logic for tacking on IS NULL or IS NOT NULL conditions some
more; commit 85c2b9a15 left a lot on the table there. And rearrange the
construction of the nodes into (what seems to me) a more logical order.
In passing, make sure that get_qual_for_range() also returns a freshly
palloc'd structure, since there's no value in having that guarantee for
only one kind of partitioning. And improve some comments there.
Jeevan Ladhe, with further tweaking by me
Discussion: https://postgr.es/m/CAOgcT0MAcYoMs93W80iTUf_dP36=1mZQzeUk+nnwY_-qWDrCfw@mail.gmail.com
Fix failure to check that we got a plain Const from const-simplification of
a coercion request. This is the cause of bug #14666 from Tian Bing: there
is an int4 to money cast, but it's only stable not immutable (because of
dependence on lc_monetary), resulting in a FuncExpr that the code was
miserably unequipped to deal with, or indeed even to notice that it was
failing to deal with. Add test cases around this coercion behavior.
In view of the above, sprinkle the code liberally with castNode() macros,
in hope of catching the next such bug a bit sooner. Also, change some
functions that were randomly declared to take Node* to take more specific
pointer types. And change some struct fields that were declared Node*
but could be given more specific types, allowing removal of assorted
explicit casts.
Place PARTITION_MAX_KEYS check a bit closer to the code it's protecting.
Likewise check only-one-key-for-list-partitioning restriction in a less
random place.
Avoid not-per-project-style usages like !strcmp(...).
Fix assorted failures to avoid scribbling on the input of parse
transformation. I'm not sure how necessary this is, but it's entirely
silly for these functions to be expending cycles to avoid that and not
getting it right.
Add guards against partitioning on system columns.
Put backend/nodes/ support code into an order that matches handling
of these node types elsewhere.
Annotate the fact that somebody added location fields to PartitionBoundSpec
and PartitionRangeDatum but forgot to handle them in
outfuncs.c/readfuncs.c. This is fairly harmless for production purposes
(since readfuncs.c would just substitute -1 anyway) but it's still bogus.
It's not worth forcing a post-beta1 initdb just to fix this, but if we
have another reason to force initdb before 10.0, we should go back and
clean this up.
Contrariwise, somebody added location fields to PartitionElem and
PartitionSpec but forgot to teach exprLocation() about them.
Consolidate duplicative code in transformPartitionBound().
Improve a couple of error messages.
Improve assorted commentary.
Re-pgindent the files touched by this patch; this affects a few comment
blocks that must have been added quite recently.
Report: https://postgr.es/m/20170524024550.29935.14396@wrigleys.postgresql.org
Commit 9aa3c782c added code to allow CREATE TABLE/CREATE TYPE to not fail
when the desired type name conflicts with an autogenerated array type, by
dint of renaming the array type out of the way. But I (tgl) overlooked
that the same case arises in ALTER TABLE/TYPE RENAME. Fix that too.
Back-patch to all supported branches.
Report and patch by Vik Fearing, modified a bit by me
Discussion: https://postgr.es/m/0f4ade49-4f0b-a9a3-c120-7589f01d1eb8@2ndquadrant.com
Logical replication supports replicating between tables with different
column order. But this failed for the initial table sync because of a
logic error in how the column list for the internal COPY command was
composed. Fix that and also add a test.
Also fix a minor omission in the column name mapping cache. When
creating the mapping list, it would not skip locally dropped columns.
So if a remote column had the same name as a locally dropped
column (...pg.dropped...), then the expected error would not occur.
Reduce some redundant messages to DEBUG1. Be clearer about the
distinction between apply workers and table synchronization workers.
Add subscription and table name where possible.
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
We need not consider the case where both nulltest1 and nulltest2 are
NULL; the partition either accepts nulls or it does not.
Jeevan Ladhe. I added an assertion.
This patch replaces isspace() calls with scanner_isspace() in functions
that are likely to be presented with non-ASCII input. isspace() has
the small advantage that it will correctly recognize no-break space
in single-byte encodings (such as LATIN1); but it cannot work successfully
for any multibyte character, and depending on platform it might return
false positive results for some fragments of multibyte characters. That's
disastrous for functions that are trying to discard whitespace between
valid strings, as noted in bug #14662 from Justin Muise. Even treating
no-break space as whitespace is pretty questionable for the usages touched
here, because the core scanner would think it is an identifier character.
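The failure mode can be modeled on raw bytes. The following Python
sketch rests on two assumptions: that the core scanner's whitespace set
is space, tab, newline, carriage return and form feed, and that a
single-byte-locale isspace() accepts byte 0xA0 as no-break space. The
function names are hypothetical and nothing here is the backend's C
code.

```python
def scanner_style_isspace(byte):
    """ASCII-only test (assumed set: space, tab, newline, CR, form feed)."""
    return byte in b" \t\n\r\f"


def latin1_style_isspace(byte):
    """Model of a locale-dependent isspace() that also accepts 0xA0."""
    return scanner_style_isspace(byte) or byte == 0xA0


def split_tokens(data, isspace):
    """Split a byte string into tokens, discarding whitespace bytes."""
    tokens, current = [], bytearray()
    for byte in data:
        if isspace(byte):
            if current:
                tokens.append(bytes(current))
                current = bytearray()
        else:
            current.append(byte)
    if current:
        tokens.append(bytes(current))
    return tokens
```

In UTF-8 the letter "à" is encoded as the bytes 0xC3 0xA0, so the
locale-style test throws away the second byte of the character and
leaves an invalid fragment, whereas the ASCII-only test keeps the
identifier intact.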
Affected functions are parse_ident(), parseNameAndArgTypes (underlying
regprocedurein() and siblings), SplitIdentifierString (used for parsing
GUCs and options that are qualified names or lists of names), and
SplitDirectoriesString (used for parsing GUCs that are lists of
directories).
All the functions adjusted here are parsing SQL identifiers and similar
constructs, so it's reasonable to insist that their definition of
whitespace match the core scanner. So we can hope that this won't cause
many backwards-compatibility problems. I've left alone isspace() calls
in places that aren't really expecting any non-ASCII input characters,
such as float8in().
Back-patch to all supported branches.
Discussion: https://postgr.es/m/10129.1495302480@sss.pgh.pa.us
The cash_div_intX functions applied rint() to the result of the division.
That's not merely useless (because the result is already an integer) but
it causes precision loss for values larger than 2^52 or so, because of
the forced conversion to float8.
On the other hand, the cash_mul_fltX functions neglected to apply rint() to
their multiplication results, thus possibly causing off-by-one outputs.
Per C standard, arithmetic between any integral value and a float value is
performed in float format. Thus, cash_mul_flt4 and cash_div_flt4 produced
answers good to only about six digits, even when the float value is exact.
We can improve matters noticeably by widening the float inputs to double.
(It's tempting to consider using "long double" arithmetic if available,
but that's probably too much of a stretch for a back-patched fix.)
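Both precision problems are easy to reproduce outside the backend. The
Python sketch below is only a model: div_via_double and to_float4 are
hypothetical helpers, Python's float stands in for float8, and a
struct round trip stands in for float4. Operands are kept positive so
that Python's floor division agrees with C's truncating division.

```python
import struct


def div_via_double(amount, divisor):
    """Integer division whose result is then pushed through a double,
    mimicking rint() applied to an already-integral quotient."""
    quotient = amount // divisor          # exact integer arithmetic
    return int(round(float(quotient)))    # lossy once the value is large


def to_float4(value):
    """Round a Python float to single precision and back."""
    return struct.unpack("f", struct.pack("f", value))[0]
```

Dividing 2**53 + 1 by 1 through div_via_double() no longer returns the
original value, because that integer has no exact double
representation. Likewise to_float4(16777217.0) comes back as a
different number, which is why single-precision operands limit results
to roughly six or seven significant digits.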
Also, document that cash_div_intX operators truncate rather than round.
Per bug #14663 from Richard Pistole. Back-patch to all supported branches.
Discussion: https://postgr.es/m/22403.1495223615@sss.pgh.pa.us
Using flex's -i switch to achieve case-insensitivity is not a very safe
practice, because the scanner's behavior may then depend on the locale
that flex was invoked in. In the particular example at hand, that's
not academic: the possible matches for "FIRST" will be different in a
Turkish locale than elsewhere. Do it the hard way instead, as our
other scanners do.
Also, drop use of -b -CF -p, because this scanner is only used when
parsing the contents of a GUC variable. That's not done often, and
the amount of text to be parsed can be expected to be trivial, so
prioritizing scanner speed over code size seems like quite the wrong
tradeoff. Using flex's default optimization options reduces the
size of syncrep_gram.o by more than 50%.
The case-insensitivity problem is new in HEAD (cf commit 3901fd70c).
The poor choice of optimization flags exists also in 9.6, but it doesn't
seem important enough to back-patch.
Discussion: https://postgr.es/m/24403.1495225931@sss.pgh.pa.us
Otherwise, set_plan_refs() can get applied to the same list
multiple times through different references, leading to chaos.
Amit Langote, Dilip Kumar, and Robert Haas, reviewed by Ashutosh
Bapat. Original report by Sveinn Sveinsson.
Discussion: http://postgr.es/m/20170517141151.1435.79890@wrigleys.postgresql.org
Since commit e7b3349a8a, MergeAttributes
destructively modifies the input List, to which the caller's
CreateStmt still points. One may wonder whether this was already a
bug, but commit f0e44751d7 made things
noticeably worse by adding additional destructive modifications so
that the caller's List might, in the case of creating a partitioned
table, no longer even be structurally valid. Restore the status quo
ante by assigning the return value of MergeAttributes back to
stmt->tableElts in the caller.
In most of the places where DefineRelation is called, it doesn't
matter what stmt->tableElts points to here or whether it's valid or
not, because the caller doesn't use the statement for anything after
DefineRelation returns anyway. However, ProcessUtilitySlow passes it
to EventTriggerCollectSimpleCommand, and that function tries to invoke
copyObject on it. If any of the CreateStmt's substructure is invalid
at that point, undefined behavior will result.
One might wonder whether this whole area needs further revision -
perhaps DefineRelation() ought not to be destructively modifying the
caller-provided CreateStmt at all. However, that would be a behavior
change for any event triggers using C code to inspect the CreateStmt,
so for now, just fix the crash.
Report by Amit Langote, who provided a somewhat different patch for it.
Discussion: http://postgr.es/m/bf6a39a7-100a-74bd-1156-3c16a1429d88@lab.ntt.co.jp
This seemed like a good idea originally because there's no way to mark
a range partition as accepting NULL, but that now seems more like a
current limitation than something we want to lock down for all time.
For example, there's a proposal to add the notion of a default
partition which accepts all rows not otherwise routed, which directly
conflicts with the idea that a range-partitioned table should never
allow nulls anywhere. So let's change this while we still can, by
putting the NOT NULL test into the partition constraint instead of
changing the column properties.
Amit Langote and Robert Haas, reviewed by Amit Kapila
Discussion: http://postgr.es/m/8e2dd63d-c6fb-bb74-3c2b-ed6d63629c9d@lab.ntt.co.jp
When creating a subscription with slot_name = NONE, we failed to check
that create_slot = false and enabled = false were also set. This
created an invalid subscription and could later lead to a crash if a
NULL slot name was accessed. Add more checks around that for
robustness.
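The rule being enforced can be sketched as follows. This is a hedged Python illustration rather than the server's C code: check_subscription_options is a hypothetical name, None stands in for slot_name = NONE, and the error texts are made up for the example.

```python
from typing import Optional


def check_subscription_options(slot_name: Optional[str],
                               create_slot: bool,
                               enabled: bool) -> None:
    """Raise ValueError for option combinations that cannot work.

    slot_name=None models slot_name = NONE: with no slot, the subscription
    can neither create a slot nor be enabled.
    """
    if slot_name is None:
        if create_slot:
            raise ValueError("slot_name = NONE requires create_slot = false")
        if enabled:
            raise ValueError("slot_name = NONE requires enabled = false")
```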
Reported-by: tushar <tushar.ahuja@enterprisedb.com>
Add some tests for parsing different option combinations. Fix some of
the resulting error messages for recent changes in option naming.
Author: Masahiko Sawada <sawada.mshk@gmail.com>
We used to only check for a supported relkind on the subscriber during
replication, which is needed to ensure that the setup is valid and we
don't crash. But it's also useful to tell the user immediately when
CREATE or ALTER SUBSCRIPTION is executed that the relation being added
to the subscription is not of a supported relkind.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reported-by: tushar <tushar.ahuja@enterprisedb.com>
Reformat various places in which pgindent will make a mess, and
fix a few small violations of coding style that I happened to notice
while perusing the diffs from a pgindent dry run.
There is one actual bug fix here: the need-to-enlarge-the-buffer code
path in icu_convert_case was obviously broken. Perhaps it's unreachable
in our usage? Or maybe this is just sadly undertested.
Commit 827d6f977 contained the same misunderstanding of hash_create's API
as commit 090010f2e. As in 5d00b764c, remove the unnecessary layer of
memory context. (This bug is less significant than the other one, since
the extra context would be under a relatively short-lived context, but
it's still a bug.)
Since commit 4e37b3e15, buildfarm member frogmouth has been failing
occasionally with symptoms indicating that some expected stats data is
getting dropped. The likely reason that commit changed the behavior is
that more data is getting shoved at the collector in a short
span of time. In current sources, the stats test's first session sends
about 9KB of data while exiting, which is probably the same as what was
sent just before wait_for_stats() in the previous test design. But now,
the test's second session is starting up concurrently, and it sends another
2KB (presumably reflecting its initial catalog accesses). Since frogmouth
is running on Windows XP, which reputedly has a default socket receive
buffer size of only 8KB, it is not very surprising if this has put us over
the threshold where the receive buffer can overflow and drop messages.
The same mechanism could very easily explain the intermittent stats test
failures we've been seeing for years, since background processes such
as the bgwriter will sometimes send data concurrently with all this, and
could thus cause occasional buffer overflows.
Hence, insert some code into pgstat_init() to increase the stats socket's
receive buffer size to 100KB if it's less than that. (On failure, emit a
LOG message, but keep going.) Modern systems seem to have default sizes
in the range of 100KB-250KB, but older platforms don't. I couldn't find
any platforms that wouldn't accept 100KB, so in theory this won't cause
any portability problems.
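The buffer adjustment can be sketched with the standard socket options. This is a Python illustration under assumptions, not the pgstat_init() code: ensure_rcvbuf and MIN_RCVBUF are hypothetical names, and the UDP socket is created only for the example.

```python
import logging
import socket

MIN_RCVBUF = 100 * 1024  # the 100KB floor mentioned in the text


def ensure_rcvbuf(sock: socket.socket, minimum: int = MIN_RCVBUF) -> int:
    """Raise SO_RCVBUF to at least `minimum`; return the size then in effect."""
    current = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    if current < minimum:
        try:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, minimum)
        except OSError as exc:
            # Mirror the "log and keep going" policy: failure is not fatal.
            logging.warning("could not set receive buffer to %d: %s",
                            minimum, exc)
        current = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return current


with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as stats_sock:
    effective = ensure_rcvbuf(stats_sock)
```

The value read back is whatever the operating system reports; some kernels round or clamp the request, so the sketch returns the effective size rather than assuming the request was honored exactly.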
If this is successful at reducing the buildfarm failure rate in HEAD,
we should back-patch it, because it's certain that similar buffer overflows
happen in the field on platforms with small buffer sizes. Going forward,
there might be an argument for trying to increase the buffer size even
more, but let's take a baby step first.
Discussion: https://postgr.es/m/22173.1494788088@sss.pgh.pa.us
The CommentStmt made by RebuildConstraintComment() has to pstrdup the
relation name, else it will contain a dangling pointer after that
relcache entry is flushed. (I'm less sure that pstrdup'ing conname
is necessary, but let's be safe.) Failure to do this leads to weird
errors or crashes, as reported by Marko Elezovic.
Bug introduced by commit e42375fc8, so back-patch to 9.5 as that was.
Fix by David Rowley, regression test by Michael Paquier
Discussion: https://postgr.es/m/DB6PR03MB30775D58E732D4EB0C13725B9AE00@DB6PR03MB3077.eurprd03.prod.outlook.com
In 1753b1b027, the pg_sequence system
catalog was introduced. This made sequence metadata changes
transactional, while the actual sequence values are still behaving
nontransactionally. This requires some refinement in how ALTER
SEQUENCE, which operates on both, locks the sequence and the catalog.
The main problems were:
- Concurrent ALTER SEQUENCE causes "tuple concurrently updated" error,
caused by updates to pg_sequence catalog.
- Sequence WAL writes and catalog updates are not protected by same
lock, which could lead to inconsistent recovery order.
- nextval() disregarding uncommitted ALTER SEQUENCE changes.
To fix, nextval() and friends now lock the sequence using
RowExclusiveLock instead of AccessShareLock. ALTER SEQUENCE locks the
sequence using ShareRowExclusiveLock. This means that nextval() and
ALTER SEQUENCE block each other, and ALTER SEQUENCE on the same sequence
blocks itself. (This was already the case previously for the OWNER TO,
RENAME, and SET SCHEMA variants.) Also, rearrange some code so that the
entire AlterSequence is protected by the lock on the sequence.
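The blocking behavior follows from PostgreSQL's table-level lock conflict rules. The Python sketch below encodes only the subset of those rules involving the three modes named above; the blocks helper is hypothetical and is not how the lock manager is implemented.

```python
# Subset of the table-level lock conflict matrix, restricted to the three
# modes named in the text. Each mode maps to the modes it conflicts with.
CONFLICTS = {
    "AccessShareLock": set(),
    "RowExclusiveLock": {"ShareRowExclusiveLock"},
    "ShareRowExclusiveLock": {"RowExclusiveLock", "ShareRowExclusiveLock"},
}


def blocks(held: str, requested: str) -> bool:
    """Return True if a request for `requested` must wait behind `held`."""
    return requested in CONFLICTS[held]
```

In this subset, two nextval() callers holding RowExclusiveLock do not block each other, while either side of a nextval()/ALTER SEQUENCE pair waits for the other. An AccessShareLock would not have conflicted with ShareRowExclusiveLock, which is why the nextval() lock level had to be raised.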
As an exception, use reduced locking for ALTER SEQUENCE ... RESTART.
Since that is basically a setval(), it does not require the full locking
of other ALTER SEQUENCE actions. So check whether we are only running a
RESTART and run with less locking if so.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Reported-by: Jason Petersen <jason@citusdata.com>
Reported-by: Andres Freund <andres@anarazel.de>
Code review for commit 090010f2e.
Fix cases where an elog(ERROR) partway through a function would leave the
persistent data structures in a corrupt state. pgstat_report_stat got this
wrong by invalidating PgStat_TableEntry structs before removing hashtable
entries pointing to them, and get_tabstat_entry got it wrong by ignoring
the possibility of palloc failure after it had already created a hashtable
entry.
Also, avoid leaking a memory context per transaction, which the previous
code did through misunderstanding hash_create's API. We do not need to
create a context to hold the hash table; hash_create will do that.
(The leak wasn't that large, amounting to only a memory context header
per iteration, but it's still surprising that nobody noticed it yet.)
Per a report from Tom Lane, newer versions of gcc apparently think
that partexprs_item_saved can be used uninitialized. Try to convince
them otherwise.
Remove default cases from assorted switches over ObjectClass and some
related enum types, so that we'll get compiler warnings when someone
adds a new enum value without accounting for it in all these places.
In passing, re-order some switch cases as needed to match the declaration
of enum ObjectClass. OK, that's just neatnik-ism, but I dislike code
that looks like it was assembled with the help of a dartboard.
Discussion: https://postgr.es/m/20170512221010.nglatgt5azzdxjlj@alvherre.pgsql
ALTER COLUMN TYPE on a column used by a statistics object fails since
commit 928c4de30, because the relevant switch in ATExecAlterColumnType
is unprepared for columns to have dependencies from OCLASS_STATISTIC_EXT
objects.
Although the existing types of extended statistics don't actually need us
to do any work for a column type change, it seems completely indefensible
that that assumption is hidden behind the failure of an unrelated module
to contain any code for the case. Hence, create and call an API function
in statscmds.c where the assumption can be explained, and where we could
add code to deal with the problem when it inevitably becomes real.
Also, the reason this wasn't handled before, neither for extended stats
nor for the last half-dozen new OCLASS kinds :-(, is that the default:
in that switch suppresses compiler warnings, allowing people to miss the
need to consider it when adding an OCLASS. We don't really need a default
because surely getObjectClass should only return valid values of the enum;
so remove it, and add the missed OCLASS entries where they should be.
Discussion: https://postgr.es/m/20170512221010.nglatgt5azzdxjlj@alvherre.pgsql
Consistently refer to such an entry as a "statistics object", not just
"statistics" or "extended statistics". Previously we had a mishmash of
terms, accompanied by utter confusion as to whether the term was
singular or plural. That's not only grating (at least to the ear of
a native English speaker) but could be outright misleading, eg in error
messages that seemed to be referring to multiple objects where only one
could be meant.
This commit fixes the code and a lot of comments (though I may have
missed a few). I also renamed two new SQL functions,
pg_get_statisticsextdef -> pg_get_statisticsobjdef
pg_statistic_ext_is_visible -> pg_statistics_obj_is_visible
to conform better with this terminology.
I have not touched the SGML docs other than fixing those function
names; the docs certainly need work but it seems like a separable task.
Discussion: https://postgr.es/m/22676.1494557205@sss.pgh.pa.us
Before 955a684e04, logical decoding snapshot maintenance needed to
cope with transactions it might not have seen in their entirety. For
such transactions we had to assume they modified the catalog (could have
happened before we were watching), and thus a new snapshot had to be
built, and distributed to concurrently running transactions.
That's problematic because building a new snapshot isn't that cheap,
especially as the array of committed transactions needs to be
sorted. When creating a slot on a server with a lot of transactions,
this could make logical slot creation infeasibly expensive.
After 955a684e04 there's no need to deal with transactions that
aren't guaranteed to be fully observable. That allows us to avoid
building snapshots for transactions that haven't modified the catalog,
even before reaching consistency.
While this isn't necessarily a bugfix, slot creation being impossible
in some production workloads is severe enough to warrant
backpatching.
Author: Andres Freund, based on a quite different patch from Petr Jelinek
Analyzed-By: Petr Jelinek
Reviewed-By: Petr Jelinek
Discussion: https://postgr.es/m/f37e975c-908f-858e-707f-058d3b1eb214@2ndquadrant.com
Backpatch: 9.4-, where logical decoding has been introduced
The snapshot assembly during the creation of logical slots relied on
waiting for transactions in xl_running_xacts to end, by checking for
their commit/abort records. Unfortunately, despite locking, it is
possible to see an xl_running_xacts record listing transactions as
running that have already WAL-logged a commit/abort record, as the
locking just prevents the ProcArray from being adjusted, and the commit
record has to be logged first.
That led to either delayed or hanging snapshot creation, because
snapbuild.c would wait "forever" to see commit/abort records for some
transactions. That hang resolved only if an xl_running_xacts record
without any running transactions happened to be logged, which is far
from certain on a busy server.
It's impractical to prevent that via more heavyweight locking; the
likelihood of deadlocks and significantly increased contention would
be too big.
Instead change the initial snapshot creation to be solely based on
tracking the oldest running transaction via
xl_running_xacts->oldestRunningXid - that actually ends up
significantly simplifying the code. That has two disadvantages:
1) Because we cannot fully "trust" the contents of xl_running_xacts,
we cannot use it to build the initial snapshot. Instead we have to
wait twice for all running transactions to finish.
2) Previously a slot, unless the race occurred, could be created once
all transactions perceived as running had ended, based on commit/abort
records; now we have to wait for the next xl_running_xacts record.
To address that, trigger logging of a new xl_running_xacts record from
within snapbuild.c exactly when necessary.
Unfortunately snapbuild.c's SnapBuild is stored on disk, one of the
stupider ideas of a certain Mr Freund, so we can't change it in a
minor release. As this is going to be backpatched, we have to hack
around a bit to keep on-disk compatibility. A later commit will
rejigger that on master.
Author: Andres Freund, based on a quite different patch from Petr Jelinek
Analyzed-By: Petr Jelinek
Reviewed-By: Petr Jelinek
Discussion: https://postgr.es/m/f37e975c-908f-858e-707f-058d3b1eb214@2ndquadrant.com
Backpatch: 9.4-, where logical decoding has been introduced
The mess cleaned up in commit da0759600 is clear evidence that it's a
bug hazard to expect the caller of get_attstatsslot()/free_attstatsslot()
to provide the correct type OID for the array elements in the slot.
Moreover, we weren't even getting any performance benefit from that,
since get_attstatsslot() was extracting the real type OID from the array
anyway. So we ought to get rid of that requirement; indeed, it would
make more sense for get_attstatsslot() to pass back the type OID it found,
in case the caller isn't sure what to expect, which is likely in binary-
compatible-operator cases.
Another problem with the current implementation is that if the stats array
element type is pass-by-reference, we incur a palloc/memcpy/pfree cycle
for each element. That seemed acceptable when the code was written because
we were targeting O(10) array sizes --- but these days, stats arrays are
almost always bigger than that, sometimes much bigger. We can save a
significant number of cycles by doing one palloc/memcpy/pfree of the whole
array. Indeed, in the now-probably-common case where the array is toasted,
that happens anyway so this method is basically free. (Note: although the
catcache code will inline any out-of-line toasted values, it doesn't
decompress them. At the other end of the size range, it doesn't expand
short-header datums either. In either case, DatumGetArrayTypeP would have
to make a copy. We do end up using an extra array copy step if the element
type is pass-by-value and the array length is neither small enough for a
short header nor large enough to have suffered compression. But that
seems like a very acceptable price for winning in pass-by-ref cases.)
Hence, redesign to take these insights into account. While at it,
convert to an API in which we fill a struct rather than passing a bunch
of pointers to individual output arguments. That will make it less
painful if we ever want further expansion of what get_attstatsslot can
pass back.
It's certainly arguable that this is new development and not something to
push post-feature-freeze. However, I view it as primarily bug-proofing
and therefore something that's better to have sooner not later. Since
we aren't quite at beta phase yet, let's put it in.
Discussion: https://postgr.es/m/16364.1494520862@sss.pgh.pa.us
Tab-completing DROP STATISTICS would only work if you started writing
the schema name containing the statistics object, because the visibility
clause was missing. To add it, we need to add SQL-callable support for
testing visibility of a statistics object, like all other object types
already have.
Discussion: https://postgr.es/m/22676.1494557205@sss.pgh.pa.us
We have now grown enough registerable syscache-invalidation callback
functions that the original assumption that there would be few of them
is causing performance problems. In particular, let's fix things so that
CallSyscacheCallbacks doesn't have to search the whole array to find
which callback(s) to invoke for a given cache ID. Preserve the original
behavior that callbacks are called in order of registration, just in
case there's someplace that depends on that (which I doubt).
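One way to get a direct lookup while keeping registration order is to chain the callbacks registered for each cache ID through the registration array. The Python sketch below assumes that design for illustration only; CallbackRegistry and its fields are hypothetical names, not the inval.c data structures.

```python
from typing import Callable, List


class CallbackRegistry:
    """Callbacks kept in registration order, with a per-cache-ID chain."""

    def __init__(self, n_caches: int) -> None:
        self.funcs: List[Callable[[int], None]] = []
        self.next_same_id: List[int] = []  # -1 terminates a chain
        self.first = [-1] * n_caches       # first entry for each cache ID
        self.last = [-1] * n_caches        # last entry for each cache ID

    def register(self, cache_id: int, func: Callable[[int], None]) -> None:
        idx = len(self.funcs)
        self.funcs.append(func)
        self.next_same_id.append(-1)
        if self.first[cache_id] == -1:
            self.first[cache_id] = idx
        else:
            # Append at the tail so calls happen in registration order.
            self.next_same_id[self.last[cache_id]] = idx
        self.last[cache_id] = idx

    def call(self, cache_id: int) -> None:
        """Invoke only the callbacks registered for cache_id."""
        i = self.first[cache_id]
        while i != -1:
            self.funcs[i](cache_id)
            i = self.next_same_id[i]
```

Invoking the callbacks for one cache ID then touches only the entries registered for that ID instead of scanning the whole array.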
In support of this, export the number of syscaches from syscache.h.
People could have found that out anyway from the enum, but adding a
#define makes that much safer.
This provides a useful additional speedup in Mathieu Fenniak's
logical-decoding test case, although we're reaching the point of
diminishing returns there. I think any further improvement will have
to come from reducing the number of cache invalidations that are
triggered in the first place. Still, we can hope that this change
gives some incremental benefit for all invalidation scenarios.
Back-patch to 9.4 where logical decoding was introduced.
Discussion: https://postgr.es/m/CAHoiPjzea6N0zuCi=+f9v_j94nfsy6y8SU7-=bp4=7qw6_i=Rg@mail.gmail.com
A test case provided by Mathieu Fenniak shows that hash_seq_search'ing
this hashtable can consume a very significant amount of overhead during
logical decoding, which triggers frequent cache invalidation. Testing
suggests that the actual population of the hashtable is often no more
than a few dozen entries, so we can cut the overhead just by dropping
the initial number of buckets down from 1024 --- I chose to cut it to 64.
(In situations where we do have a significant number of entries, we
shouldn't get any real penalty from doing this, as the dynahash.c code
will resize the hashtable automatically.)
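The cost argument can be seen with a toy table. This Python sketch is not dynahash.c: ToyHashTable is a hypothetical fixed-size chained hash table with no resizing, used only to show that a sequential scan visits every bucket regardless of how many entries exist.

```python
class ToyHashTable:
    """Fixed-size chained hash table; a stand-in for illustration only."""

    def __init__(self, nbuckets: int) -> None:
        self.buckets = [[] for _ in range(nbuckets)]

    def insert(self, key: int) -> None:
        self.buckets[hash(key) % len(self.buckets)].append(key)

    def seq_scan(self):
        """Return (entries, buckets_visited) for a full sequential scan."""
        visited = 0
        found = []
        for bucket in self.buckets:
            visited += 1
            found.extend(bucket)
        return found, visited


small = ToyHashTable(64)
large = ToyHashTable(1024)
for key in range(30):  # "a few dozen entries"
    small.insert(key)
    large.insert(key)

small_entries, small_visited = small.seq_scan()
large_entries, large_visited = large.seq_scan()
```

Both scans return the same 30 entries, but the larger table walks 1024 buckets to find them where the smaller one walks 64.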
This gives a further factor-of-two savings in Mathieu's test case.
That may be overly optimistic for real-world benefit, as real cases
may have larger average table populations, but it's hard to see it
turning into a net negative for any workload.
Back-patch to 9.4 where relfilenodemap.c was introduced.
Discussion: https://postgr.es/m/CAHoiPjzea6N0zuCi=+f9v_j94nfsy6y8SU7-=bp4=7qw6_i=Rg@mail.gmail.com
This was missed in 7b504eb282.
Remove the "default:" clause in the switch, to avoid this problem in the
future. Other switches involving the same enum should probably be
changed in the same way, but are not touched by this patch.
Discussion: https://postgr.es/m/20170512204800.iqt2uwyx3c32j45r@alvherre.pgsql
A test case provided by Mathieu Fenniak shows that the initial search for
the target catcache in CatalogCacheIdInvalidate consumes a very significant
amount of overhead in cases where cache invalidation is triggered but has
little useful work to do. There is no good reason for that search to exist
at all, as the index array maintained by syscache.c allows direct lookup of
the catcache from its ID. We just need a frontend function in syscache.c,
matching the division of labor for most other cache-accessing operations.
While there's more that can be done in this area, this patch alone reduces
the runtime of Mathieu's example by 2X. We can hope that it offers some
useful benefit in other cases too, although usually cache invalidation
overhead is not such a striking fraction of the total runtime.
Back-patch to 9.4 where logical decoding was introduced. It might be
worth going further back, but presently the only case we know of where
cache invalidation is really a significant burden is in logical decoding.
Also, older branches have fewer catcaches, reducing the possible benefit.
(Note: although this nominally changes catcache's API, we have always
documented CatalogCacheIdInvalidate as a private function, so I would
have little sympathy for an external module calling it directly. So
backpatching should be fine.)
Discussion: https://postgr.es/m/CAHoiPjzea6N0zuCi=+f9v_j94nfsy6y8SU7-=bp4=7qw6_i=Rg@mail.gmail.com
A stats object ought to have a dependency on each individual column
it reads, not the entire table. Doing this honestly lets us get rid
of the hard-wired logic in RemoveStatisticsExt, which seems to have
been misguidedly modeled on RemoveStatistics; and it will be far easier
to extend to multiple tables later.
Also, add overlooked dependency on owner, and make the dependency on
schema be NORMAL like every other such dependency.
There remains some unfinished work here, which is to allow statistics
objects to be extension members. That takes more effort than just
adding the dependency call, though, so I left it out for now.
initdb forced because this changes the set of pg_depend records that
should exist for a statistics object.
Discussion: https://postgr.es/m/22676.1494557205@sss.pgh.pa.us
Previously, we had the WITH clause in the middle of the command, where
you'd specify both generic options and statistic types. Few
people liked this, so this commit changes it to remove the WITH keyword
from that clause and makes it accept statistic types only. (We
currently don't have any generic options, but if we invent some in the
future, we will gain a new WITH clause, probably at the end of the
command.)
Also, the column list is now specified without parens, which makes the
whole command look more similar to a SELECT command. This change will
let us expand the command to support expressions (not just column
names) as well as multiple tables and their join conditions.
Tom added lots of code comments and fixed some parts of the CREATE
STATISTICS reference page, too; more changes in this area are
forthcoming. He also fixed a potential problem in the alter_generic
regression test, reducing verbosity on a cascaded drop to avoid
dependency on message ordering, as we do in other tests.
Tom also closed a security bug: we documented that table ownership was
required in order to create a statistics object on it, but didn't
actually implement it.
Implement tab-completion for statistics objects. This can stand some
more improvement.
Authors: Alvaro Herrera, with lots of cleanup by Tom Lane
Discussion: https://postgr.es/m/20170420212426.ltvgyhnefvhixm6i@alvherre.pgsql
For CREATE/ALTER PUBLICATION/SUBSCRIPTION, use similar option style as
other statements that use a WITH clause for options.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Lag tracking is called for each commit, but we introduce
a pacing delay to ensure we don't swamp the lag tracker.
Author: Petr Jelinek, with minor pacing delay code from me
Increase from the historical value of 32 to 64. We are up to 31 callers
of CacheRegisterSyscacheCallback() in HEAD, so if they were all to be
exercised in one process that would leave only one slot for add-on modules.
It's probably not possible for that to happen, but still we clearly need
more daylight here. (At some point it might be worth making the array
dynamically resizable; but since we've never heard a complaint of "out of
syscache_callback_list slots" happening in the field, I doubt it's worth
it yet.)
Back-patch as far as 9.4, which is where we increased the companion limit
MAX_RELCACHE_CALLBACKS (cf commit f01d1ae3a). It's not as urgent in
released branches, which have only a couple dozen call sites in core, but
it still seems that somebody might hit the limit before these branches die.
Discussion: https://postgr.es/m/12184.1494450131@sss.pgh.pa.us
Per discussion, "location" is a rather vague term that could refer to
multiple concepts. "LSN" is an unambiguous term for WAL locations and
should be preferred. Some function names, view column names, and function
output argument names used "lsn" already, but others used "location",
as well as yet other terms such as "wal_position". Since we've already
renamed a lot of things in this area from "xlog" to "wal" for v10,
we may as well incur a bit more compatibility pain and make these names
all consistent.
David Rowley, minor additional docs hacking by me
Discussion: https://postgr.es/m/CAKJS1f8O0njDKe8ePFQ-LK5-EjwThsDws6ohJ-+c6nWK+oUxtg@mail.gmail.com
This reverts commits fa2fa99552 and 42f50cb8fa.
While the functionality that was intended to be provided by these
commits is desired, the patch didn't actually solve as many of the
problematic situations as we hoped, and it created a bunch of its own
problems. Since we're going to require more extensive changes soon for
other reasons and users have been working around these problems for a
long time already, there is no point in spending effort in fixing this
halfway measure.
Per complaint from Tom Lane.
Discussion: https://postgr.es/m/21407.1484606922@sss.pgh.pa.us
(Commit fa2fa99552 had already been reverted in branches 9.5 as
f858524ee4 and 9.6 as e9e44a0953, so this touches master only.
Commit 42f50cb8fa was not present in the older branches.)
Previously, the memory used by the logical replication apply worker for
processing messages would never be freed, so that could end up using a
lot of memory. To improve that, change the existing ApplyContext memory
context to ApplyMessageContext and reset that after every
message (similar to MessageContext used elsewhere). For consistency of
naming, rename the ApplyCacheContext to ApplyContext.
Author: Stas Kelvich <s.kelvich@postgrespro.ru>
It turned out this approach had problems, because a DROP command should
not have any options other than CASCADE and RESTRICT. Instead, always
attempt to drop the slot if there is one configured, but also add an
ALTER SUBSCRIPTION action to set the slot to NONE.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/29431.1493730652@sss.pgh.pa.us
Values in a STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM slot are float8,
not of the type of the column the statistics are for.
This bug is at least partly the fault of sloppy specification comments
for get_attstatsslot()/free_attstatsslot(): the type OID they want is that
of the stavalues entries, not of the underlying column. (I double-checked
other callers and they seem to get this right.) Adjust the comments to be
more correct.
Per buildfarm.
Security: CVE-2017-7484
Previously it would allow an invalid connection string to be set.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reported-by: tushar <tushar.ahuja@enterprisedb.com>
This new arrangement ensures that statistics are reported right after
commit of transactions. The previous arrangement didn't get this quite
right and could lead to assertion failures.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reported-by: Erik Rijkers <er@xs4all.nl>
Both views replace the umoptions field with NULL when the user does not
meet qualifications to see it. They used different qualifications, and
pg_user_mappings documented qualifications did not match its implemented
qualifications. Make its documentation and implementation match those
of user_mapping_options. One might argue for stronger qualifications,
but these have long, documented tenure. pg_user_mappings has always
exhibited this problem, so back-patch to 9.2 (all supported versions).
Michael Paquier and Feike Steenbergen. Reviewed by Jeff Janes.
Reported by Andrew Wheelwright.
Security: CVE-2017-7486
Some selectivity estimation functions run user-supplied operators over
data obtained from pg_statistic without security checks, which allows
those operators to leak pg_statistic data without having privileges on
the underlying tables. Fix by checking that one of the following is
satisfied: (1) the user has table or column privileges on the table
underlying the pg_statistic data, or (2) the function implementing the
user-supplied operator is leak-proof. If neither is satisfied, planning
will proceed as if there are no statistics available.
At least one of these is satisfied in most cases in practice. The only
situations that are negatively impacted are user-defined or
not-leak-proof operators on a security-barrier view.
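The decision rule reads as a simple disjunction. The Python sketch below only restates that rule; usable_stats and its parameters are hypothetical names, and the real check lives in the planner's selectivity code.

```python
from typing import Optional


def usable_stats(stats: dict,
                 has_table_or_column_privilege: bool,
                 operator_is_leakproof: bool) -> Optional[dict]:
    """Return the statistics only when exposing them to the operator is safe."""
    if has_table_or_column_privilege or operator_is_leakproof:
        return stats
    # Neither condition holds: plan as if no statistics were available.
    return None
```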
Reported-by: Robert Haas <robertmhaas@gmail.com>
Author: Peter Eisentraut <peter_e@gmx.net>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Security: CVE-2017-7484
Storing passwords in plaintext hasn't been a good idea for a very long
time, if ever. Now seems like a good time to finally forbid it, since we're
messing with this in PostgreSQL 10 anyway.
Remove the CREATE/ALTER USER UNENCRYPTED PASSWORD 'foo' syntax, since
storing passwords unencrypted is no longer supported. ENCRYPTED PASSWORD
'foo' is still accepted, but ENCRYPTED is now just a noise word; it does
the same as just PASSWORD 'foo'.
Likewise, remove the --unencrypted option from createuser, but accept
--encrypted as a no-op for backward compatibility. AFAICS, --encrypted was
a no-op even before this patch, because createuser encrypted the password
before sending it to the server even if --encrypted was not specified. It
added the ENCRYPTED keyword to the SQL command, but since the password was
already in encrypted form, it didn't make any difference. The documentation
was not clear on whether that was intended or not, but it's moot now.
Also, while password_encryption='on' is still accepted as an alias for
'md5', it is now marked as hidden, so that it is not listed as an accepted
value in error hints, for example. That's not directly related to removing
'plain', but it seems better this way.
Reviewed by Michael Paquier
Discussion: https://www.postgresql.org/message-id/16e9b768-fd78-0b12-cfc1-7b6b7f238fde@iki.fi
In the backend, this is just to silence coverity warnings, but in the
frontend, it's a genuine leak, even if extremely rare.
Spotted by Coverity, patch by Michael Paquier.
When we add the SELECT-privilege based policies to the RLS with check
options (such as for an UPDATE statement, or when we have INSERT ...
RETURNING), we need to be sure to use the 'USING' case if the policy is
actually an 'ALL' policy (which could have both a USING clause and an
independent WITH CHECK clause).
This could result in policies acting differently when built using ALL
(when the ALL had both USING and WITH CHECK clauses) and when building
the policies independently as SELECT and UPDATE policies.
Fix this by adding an explicit boolean to add_with_check_options() to
indicate when the USING policy should be used, even if the policy has
both USING and WITH CHECK policies on it.
Reported by: Rod Taylor
Back-patch to 9.5 where RLS was introduced.
Since 6ef2eba3f5 ("Skip checkpoints, archiving on idle systems."),
GetLastImportantRecPtr() is used to avoid performing superfluous
checkpoints, xlog switches, running-xact records when the system is
idle. Unfortunately the check concerning running-xact records had an
off-by-one error, leading to such records being potentially skipped
when only a single record has been inserted since the last
running-xact record.
An alternative approach would have been to change
GetLastImportantRecPtr()'s definition to point to the end of records,
but that would make the checkpoint code more complicated.
Author: Andres Freund
Discussion: https://postgr.es/m/20170505012447.wsrympaxnfis6ojt@alap3.anarazel.de
Backpatch: no, code only present in master
When the checkpointer writes the shutdown checkpoint, it checks
afterwards whether any WAL has been written since it started and throws
a PANIC if so. At that point, only walsenders are still active, so one
might think this could not happen, but walsenders can also generate WAL,
for instance in BASE_BACKUP and certain variants of
CREATE_REPLICATION_SLOT. So they can trigger this panic if such a
command is run while the shutdown checkpoint is being written.
To fix this, divide the walsender shutdown into two phases. First, the
postmaster sends a SIGUSR2 signal to all walsenders. The walsenders
then put themselves into the "stopping" state. In this state, they
reject any new commands. (For simplicity, we reject all new commands,
so that in the future we do not have to track meticulously which
commands might generate WAL.) The checkpointer waits for all walsenders
to reach this state before proceeding with the shutdown checkpoint.
After the shutdown checkpoint is done, the postmaster sends
SIGINT (previously unused) to the walsenders. This triggers the
existing shutdown behavior of sending out the shutdown checkpoint record
and then terminating.
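
The two-phase shutdown described above can be modeled as a small state
machine. This is only a sketch: the class, method and state names are
invented for illustration and do not correspond to the actual walsender
code, and signals are replaced by plain method calls.

```python
from enum import Enum, auto


class WalSndState(Enum):
    RUNNING = auto()
    STOPPING = auto()   # entered on the first signal; new commands are refused
    EXITED = auto()     # entered on the second signal


class WalSender:
    def __init__(self):
        self.state = WalSndState.RUNNING

    def handle_command(self, command):
        # Outside the running state every new command is rejected, so no
        # command can generate WAL while the checkpoint is being written.
        if self.state is not WalSndState.RUNNING:
            raise RuntimeError("cannot execute new commands while shutting down")
        return f"executed {command}"

    def on_first_signal(self):    # stands in for SIGUSR2
        self.state = WalSndState.STOPPING

    def on_second_signal(self):   # stands in for the final shutdown signal
        self.state = WalSndState.EXITED


def shutdown(walsenders, write_checkpoint):
    # Phase 1: ask every walsender to stop accepting commands.
    for w in walsenders:
        w.on_first_signal()
    # The checkpoint is written only once all of them are stopping.
    if not all(w.state is WalSndState.STOPPING for w in walsenders):
        raise RuntimeError("a walsender is still running")
    write_checkpoint()
    # Phase 2: let the walsenders finish up and terminate.
    for w in walsenders:
        w.on_second_signal()
```

Rejecting everything in the stopping state, rather than only the
WAL-generating commands, is what keeps the model (and, per the text, the
real code) free of a per-command list.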
Author: Michael Paquier <michael.paquier@gmail.com>
Reported-by: Fujii Masao <masao.fujii@gmail.com>
* Remove is_scram_verifier() function. It was unused.
* Fix sanitize_char() function, used in error messages on protocol
violations, to print bytes >= 0x7F correctly.
* Change spelling of scram_MockSalt() function to be more consistent with
the surroundings.
* Change a few more references to "server proof" to "server signature" that
I missed in commit d981074c24.
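
As an illustration of the sanitize_char() item above, here is a sketch of
the intended behavior in Python. The exact output format (quoted character
versus hex) is an assumption made for this example, not taken from the
patch.

```python
def sanitize_char(byte):
    """Return a printable description of one byte for an error message."""
    # Printable, non-space ASCII is shown as a quoted character.
    if 0x21 <= byte <= 0x7E:
        return f"'{chr(byte)}'"
    # Everything else, including bytes >= 0x7F, is shown in hex.
    return f"0x{byte:02x}"
```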
Instead, send the same FATAL message as with other password-based
authentication mechanisms. This gives a more user-friendly message:
psql: FATAL: password authentication failed for user "test"
instead of:
psql: error received from server in SASL exchange: invalid-proof
Even before this patch, the server sent that FATAL message, after the
SCRAM-specific "e=invalid-proof" message. But libpq would stop at the
SCRAM error message, and not process the ErrorResponse that would come
after that. We could've taught libpq to check for an ErrorResponse after
failed authentication, but it's simpler to modify the server to send only
the ErrorResponse. The SCRAM specification allows for aborting the
authentication at any point, using an application-defined error mechanism,
like PostgreSQL's ErrorResponse. Using the e=invalid-proof message is
optional.
Reported by Jeff Janes.
Discussion: https://www.postgresql.org/message-id/CAMkU%3D1w3jQ53M1OeNfN8Cxd9O%2BA_9VONJivTbYoYRRdRsLT6vA@mail.gmail.com
GiST's getNextNearest() function attempts to pfree the previously-returned
tuple if any (that is, scan->xs_hitup in HEAD, or scan->xs_itup in older
branches). However, if we are rescanning a plan node after ending a
previous scan early, those tuple pointers could be pointing to garbage,
because they would be pointing into the scan's pageDataCxt or queueCxt
which has been reset. In a debug build this reliably results in a crash,
although I think it might sometimes accidentally fail to fail in
production builds.
To fix, clear the pointer field anyplace we reset a context it might
be pointing into. This may be overkill --- I think probably only the
queueCxt case is involved in this bug, so that resetting in gistrescan()
would be sufficient --- but dangling pointers are generally bad news,
so let's avoid them.
Another plausible answer might be to just not bother with the pfree in
getNextNearest(). The reconstructed tuples would go away anyway in the
context resets, and I'm far from convinced that freeing them a bit earlier
really saves anything meaningful. I'll stick with the original logic in
this patch, but if we find more problems in the same area we should
consider that approach.
Per bug #14641 from Denis Smirnov. Back-patch to 9.5 where this
logic was introduced.
Discussion: https://postgr.es/m/20170504072034.24366.57688@wrigleys.postgresql.org
It only produced <row> elements but no wrapping <table> element.
By contrast, cursor_to_xmlschema produced a schema that is now correct
but did not previously match the XML data produced by cursor_to_xml.
In passing, also fix a minor misunderstanding about moving cursors in
the tests related to this.
Reported-by: filip@jirsak.org
Based-on-patch-by: Thomas Munro <thomas.munro@enterprisedb.com>
create_singleton_array() was not really as useful as we perhaps thought
when we added it. It had never accreted more than one call site, and is
only saving a dozen lines of code at that one, which is considerably less
bulk than the function itself. Moreover, because of its insistence on
using the caller's fn_extra cache space, it's arguably a coding hazard.
text_to_array_internal() does not currently use fn_extra in any other way,
but if it did it would be subtly broken, since the conflicting fn_extra
uses could be needed within a single query, in the seldom-tested case that
the field separator varies during the query. The same objection seems
likely to apply to any other potential caller.
The replacement code is a bit uglier, because it hardwires knowledge of
the storage parameters of type TEXT, but it's not like we haven't got
dozens or hundreds of other places that do the same. Uglier seems like
a good tradeoff for smaller, faster, and safer.
Per discussion with Neha Khatri.
Discussion: https://postgr.es/m/CAFO0U+_fS5SRhzq6uPG+4fbERhoA9N2+nPrtvaC9mmeWivxbsA@mail.gmail.com
Due to a missing CommandCounterIncrement() call, parsing of a non-utility
command in an extension script would not see the effects of the immediately
preceding DDL command, unless that command's execution ends with
CommandCounterIncrement() internally ... which some do but many don't.
Report by Philippe Beaudoin, diagnosis by Julien Rouhaud.
Rather remarkably, this bug has evaded detection since extensions were
invented, so back-patch to all supported branches.
Discussion: https://postgr.es/m/2cf7941e-4e41-7714-3de8-37b1a8f74dff@free.fr
ALTER SEQUENCE can do nontransactional changes to the sequence (RESTART
clause) and transactional updates to the pg_sequence catalog (most other
clauses). When just calling RESTART, the code would still needlessly do
a catalog update without any changes. This would entangle that
operation in the concurrency issues of a catalog update (causing either
locking or concurrency errors, depending on how that issue is to be
resolved).
Fix by keeping track during options parsing whether a catalog update is
needed, and skip it if not.
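
A sketch of the idea, with made-up names: option parsing reports whether
any catalog-backed option was seen, and the caller updates the catalog only
in that case. Treating every option other than RESTART as a catalog change
is a simplification for this example.

```python
def parse_sequence_options(options):
    """Split ALTER SEQUENCE options into a restart value and catalog changes.

    `options` is a dict such as {"RESTART": 1, "INCREMENT": 5}.
    Returns (restart, catalog_changes, need_catalog_update).
    """
    restart = None
    catalog_changes = {}
    need_catalog_update = False
    for name, value in options.items():
        if name == "RESTART":
            # Nontransactional: touches only the sequence's own state.
            restart = value
        else:
            # Assumed here to live in the pg_sequence catalog row.
            catalog_changes[name] = value
            need_catalog_update = True
    return restart, catalog_changes, need_catalog_update
```

With only RESTART given, need_catalog_update stays false and the caller
can skip the catalog update entirely.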
Reported-by: Jason Petersen <jason@citusdata.com>
This goes together with the changes made to enable replication on the
sending side by default (wal_level, max_wal_senders etc) by making the
receiving standby node also enable it by default.
Huong Dangminh
If the inner relation can be proven unique, that is it can have no more
than one matching row for any row of the outer query, then we might as
well implement the semijoin as a plain inner join, allowing substantially
more freedom to the planner. This is a form of outer join strength
reduction, but it can't be implemented in reduce_outer_joins() because
we don't have enough info about the individual relations at that stage.
Instead do it much like remove_useless_joins(): once we've built base
relations, we can make another pass over the SpecialJoinInfo list and
get rid of any entries representing reducible semijoins.
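
The equivalence this relies on can be seen in a toy example. The functions
below are illustrative only and have nothing to do with the planner's data
structures: a semijoin returns each outer row at most once, and a plain
inner join gives the same rows exactly when no outer row has more than one
match on the inner side.

```python
def semijoin(outer, inner, key):
    """Return each outer row once if at least one inner row matches."""
    inner_keys = {key(r) for r in inner}
    return [o for o in outer if key(o) in inner_keys]


def inner_join_outer_side(outer, inner, key):
    """Plain inner join, keeping only the outer row of each matched pair."""
    return [o for o in outer for i in inner if key(o) == key(i)]
```

With unique inner keys both functions return the same list; with a
duplicated inner key the inner join repeats the outer row, which is why the
uniqueness proof is required before making the substitution.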
This is essentially a followon to the inner-unique patch (commit 9c7f5229a)
and makes use of the proof machinery that that patch created. We need only
minor refactoring of innerrel_is_unique's API to support this usage.
Per performance complaint from Teodor Sigaev.
Discussion: https://postgr.es/m/f994fc98-389f-4a46-d1bc-c42e05cb43ed@sigaev.ru
The inner-unique patch (commit 9c7f5229a) supposed that if we're
considering a JOIN_UNIQUE_INNER join path, we can always set inner_unique
for the join, because the inner path produced by create_unique_path should
be unique relative to the outer relation. However, that's true only if
we're considering joining to the whole outer relation --- otherwise we may
be applying only some of the join quals, and so the inner path might be
non-unique from the perspective of this join. Adjust the test to only
believe that we can set inner_unique if we have the whole semijoin LHS on
the outer side.
There is more that can be done in this area, but this commit is only
intended to provide the minimal fix needed to get correct plans.
Per report from Teodor Sigaev. Thanks to David Rowley for preliminary
investigation.
Discussion: https://postgr.es/m/f994fc98-389f-4a46-d1bc-c42e05cb43ed@sigaev.ru
After the logical replication launcher was told to wake up at
commit (for example, by a CREATE SUBSCRIPTION command), the flag to wake
up was not reset, so it would be woken up at every following commit as
well. So fix that by resetting the flag.
Also, we don't need to wake up anything if the transaction was rolled
back. Just reset the flag in that case.
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reported-by: Fujii Masao <masao.fujii@gmail.com>
Even though no actual tuples are ever inserted into a partitioned
table (the actual tuples are in the partitions, not the partitioned
table itself), we still need to have a ResultRelInfo for the
partitioned table, or per-statement triggers won't get fired.
Amit Langote, per a report from Rajkumar Raghuwanshi. Reviewed by me.
Discussion: http://postgr.es/m/CAKcux6%3DwYospCRY2J4XEFuVy0L41S%3Dfic7rmkbsU-GXhhSbmBg%40mail.gmail.com
Thinko in commit de4389712: this warning message references the wrong
"LogicalRepWorker *" variable. This would often result in a core dump,
but if it didn't, the message would show the wrong subscription OID.
In passing, adjust the message text to format a subscription OID
similarly to how that's done elsewhere in the function; and fix
grammatical issues in some nearby messages.
Per Coverity testing.
Currently, trying to validate a NO INHERIT constraint on the parent will
search for the constraint in child tables (where it is not supposed to
exist), wrongly causing a "constraint does not exist" error.
Amit Langote, per a report from Hans Buschmann.
Discussion: http://postgr.es/m/20170421184012.24362.19@wrigleys.postgresql.org
Before restarting a tablesync worker for the same relation, wait
wal_retrieve_retry_interval (currently 5s by default). This avoids
restarting failing workers in a tight loop.
We keep the last start times in a hash table last_start_times that is
separate from the table_states list, because that list is cleared out on
syscache invalidation, which happens whenever a table finishes syncing.
The hash table is kept until all tables have finished syncing.
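
A sketch of the throttling logic, using a plain dict in place of the hash
table. The function and parameter names are invented for the example; only
last_start_times and the retry interval come from the text.

```python
def relations_to_start(pending_relids, last_start_times, now, retry_interval):
    """Return the relations whose sync worker may be (re)started at `now`.

    last_start_times maps relation id -> time of the last worker start and
    is updated in place for every relation that is started.
    """
    started = []
    for relid in pending_relids:
        last = last_start_times.get(relid)
        # Start immediately the first time, afterwards at most once per
        # retry interval.
        if last is None or now - last >= retry_interval:
            last_start_times[relid] = now
            started.append(relid)
    return started
```

Because the dict lives outside the per-invalidation state, a failing
worker's last start time survives the list being rebuilt.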
A future project might be to unify these two and keep everything in one
data structure, but for now this is a less invasive change to accomplish
the original purpose.
For the test suite, set wal_retrieve_retry_interval to its minimum
value, to not increase the test suite run time.
Reviewed-by: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reported-by: Masahiko Sawada <sawada.mshk@gmail.com>
* Move computation of SaltedPassword to a separate function from
scram_ClientOrServerKey(). This saves a lot of cycles in libpq, by
computing SaltedPassword only once per authentication. (Computing
SaltedPassword is expensive by design.)
* Split scram_ClientOrServerKey() into two functions. Improves
readability, by making the calling code less verbose.
* Rename "server proof" to "server signature", to better match the
nomenclature used in RFC 5802.
* Rename SCRAM_SALT_LEN to SCRAM_DEFAULT_SALT_LEN, to make it more clear
that the salt can be of any length, and the constant only specifies how
long a salt we use when we generate a new verifier. Also rename
SCRAM_ITERATIONS_DEFAULT to SCRAM_DEFAULT_ITERATIONS, for consistency.
These things caught my eye while working on other upcoming changes.
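
For reference, the key derivation from RFC 5802 that the first two items
refer to can be sketched as follows. This uses Python's standard library
rather than the PostgreSQL functions, the function names are chosen for the
example, and password normalization (SASLprep) is left out.

```python
import hashlib
import hmac


def salted_password(password, salt, iterations):
    # Hi() from RFC 5802 is PBKDF2 with HMAC as the PRF; this is the
    # deliberately expensive step, so it should run once per authentication.
    return hashlib.pbkdf2_hmac("sha256", password, salt, iterations)


def client_key(salted):
    # Cheap: a single HMAC over a constant string.
    return hmac.new(salted, b"Client Key", hashlib.sha256).digest()


def server_key(salted):
    # Cheap: a single HMAC over a constant string.
    return hmac.new(salted, b"Server Key", hashlib.sha256).digest()
```

A caller computes salted_password() once and passes the result to both
key functions, instead of repeating the PBKDF2 step for each key.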
Declarative partitioning duplicated the TypedTableElement productions,
evidently to remove the need to specify WITH OPTIONS when creating
partitions. Instead, simply make WITH OPTIONS optional in the
TypedTableElement production and remove all of the duplicate
PartitionElement-related productions. This change simplifies the
syntax and makes WITH OPTIONS optional when adding defaults, constraints
or storage parameters to columns when creating either typed tables or
partitions.
Also update pg_dump to no longer include WITH OPTIONS, since it's not
necessary, and update the documentation to reflect that WITH OPTIONS is
now optional.
Earlier commits (56e19d938d and 2bef06d516) make it cheaper to
create a logical slot if not exporting the initial snapshot. If
NOEXPORT_SNAPSHOT is specified, we can skip the overhead, not just
when creating a slot via SQL (which can't export snapshots). As
NOEXPORT_SNAPSHOT has only recently been introduced, this shouldn't be
backpatched.
Logical decoding stores historical snapshots on disk, so that logical
decoding can restart without having to reconstruct a snapshot from
scratch (for which the resources are not guaranteed to be present
anymore). These serialized snapshots were also used when creating a
new slot via the walsender interface, which can export a "full"
snapshot (i.e. one that can read all tables, not just catalog ones).
The problem is that the serialized snapshots are only useful for
catalogs and not for normal user tables. Thus the use of such a
serialized snapshot could result in an inconsistent snapshot being
exported, which could lead to queries returning wrong data. This
would only happen if logical slots are created while another logical
slot already exists.
Author: Petr Jelinek
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/f37e975c-908f-858e-707f-058d3b1eb214@2ndquadrant.com
Backport: 9.4, where logical decoding was introduced.
Previously the logical replication launcher stored the last timestamp
when it started the worker, in the local variable "last_start_time",
in order to check whether wal_retrieve_retry_interval had elapsed since
the last worker startup. If it had elapsed, the launcher looks at
pg_subscription and starts a new worker if necessary. This is for
limiting worker startup to once per wal_retrieve_retry_interval.
The bug was that the variable "last_start_time" was defined and
always initialized with 0 at the beginning of the launcher's main loop.
So even if it's set to the last timestamp in a later phase of the loop,
it's always reset to 0. Therefore the launcher could not check
correctly whether wal_retrieve_retry_interval had elapsed since
the last startup.
This patch moves the variable "last_start_time" outside the main loop
so that it will not be reset.
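
The effect of the misplaced initialization can be reproduced in a few
lines. This is a toy model, not the launcher code: the function name and
the use of None instead of 0 for "never started" are choices made for the
example.

```python
def count_starts(ticks, retry_interval, reset_each_iteration):
    """Count worker starts over a list of increasing timestamps."""
    starts = 0
    last_start_time = None            # fixed version: set once, before the loop
    for now in ticks:
        if reset_each_iteration:
            last_start_time = None    # buggy version: forgotten on every pass
        if last_start_time is None or now - last_start_time >= retry_interval:
            starts += 1
            last_start_time = now
    return starts
```

For ticks 0 through 5 and an interval of 5, the fixed variant starts a
worker twice, while the buggy variant starts one on all six iterations.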
Reviewed-by: Petr Jelinek
Discussion: http://postgr.es/m/CAHGQGwGJrPO++XM4mFENAwpy1eGXKsGdguYv43GUgLgU-x8nTQ@mail.gmail.com
Commit fa31b6f4e supposed that we didn't have to worry about that
anymore, but it seems that RHEL5 is like that, and that's still
a supported platform. Put back the prior coding under an #ifdef,
adding an explicit fcntl() to retain the desired CLOEXEC property.
Discussion: https://postgr.es/m/12307.1493325329@sss.pgh.pa.us
The logical decoding machinery already preserved all the required
catalog tuples, which is sufficient in the course of normal logical
decoding, but did not guarantee that non-catalog tuples were preserved
during computation of the initial snapshot when creating a slot over
the replication protocol.
This could cause a corrupted initial snapshot to be exported. The
time window for issues is usually not terribly large, but on a busy
server it's perfectly possible to hit it. Ongoing decoding is not
affected by this bug.
To avoid increased overhead for the SQL API, only retain additional
tuples when a logical slot is being created over the replication
protocol. To do so this commit changes the signature of
CreateInitDecodingContext(), but it seems unlikely that it's being
used in an extension, so that's probably ok.
In a drive-by fix, fix handling of
ReplicationSlotsComputeRequiredXmin's already_locked argument, which
should only apply to ProcArrayLock, not ReplicationSlotControlLock.
Reported-By: Erik Rijkers
Analyzed-By: Petr Jelinek
Author: Petr Jelinek, heavily editorialized by Andres Freund
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/9a897b86-46e1-9915-ee4c-da02e4ff6a95@2ndquadrant.com
Backport: 9.4, where logical decoding was introduced.
Although the postmaster doesn't currently create a self-pipe or any
latches, there's discussion of it doing so in future. It's also
conceivable that a shared_preload_libraries extension would try to
create such a thing in the postmaster process today. In that case
the self-pipe FDs would be inherited by forked child processes.
latch.c was entirely unprepared for such a case and could suffer an
assertion failure, or worse try to use the inherited pipe if somebody
called WaitLatch without having called InitializeLatchSupport in that
process. Make it keep track of whether InitializeLatchSupport has been
called in the *current* process, and do the right thing if state has
been inherited from a parent.
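
One way to track "initialized in the current process" is to remember the
creating process's PID next to the pipe. The sketch below assumes that
approach; the class and its injectable getpid parameter are made up for
illustration and are not latch.c's interface.

```python
import os


class LatchSupport:
    """Self-pipe state that remembers which process created it."""

    def __init__(self, getpid=os.getpid):
        self._getpid = getpid
        self.owner_pid = None
        self.pipe = None          # (read_fd, write_fd) once initialized

    def initialize(self):
        if self.pipe is not None and self.owner_pid != self._getpid():
            # Inherited from a parent process: close the parent's descriptors
            # and build a fresh pipe for this process.
            for fd in self.pipe:
                os.close(fd)
            self.pipe = None
        if self.pipe is None:
            self.pipe = os.pipe()
            self.owner_pid = self._getpid()

    def ready(self):
        # Usable only if initialize() ran in the current process.
        return self.pipe is not None and self.owner_pid == self._getpid()
```

A wait function would check ready() first, so that a forked child that
never called initialize() fails cleanly instead of using the parent's pipe.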
Apply FD_CLOEXEC to file descriptors created in latch.c (the self-pipe,
as well as epoll event sets). This ensures that child processes spawned
in backends, the archiver, etc cannot accidentally or intentionally mess
with these FDs. It also ensures that we end up with the right state
for the self-pipe in EXEC_BACKEND processes, which otherwise wouldn't
know to close the postmaster's self-pipe FDs.
Back-patch to 9.6, mainly to keep latch.c looking similar in all branches
it exists in.
Discussion: https://postgr.es/m/8322.1493240739@sss.pgh.pa.us
Previously, maybe_start_bgworker() would launch at most one bgworker
process per call, on the grounds that the postmaster might otherwise
neglect its other duties for too long. However, that seems overly
conservative, especially since bad effects only become obvious when
many hundreds of bgworkers need to be launched at once. On the other
side of the coin is that the existing logic could result in substantial
delay of bgworker launches, because ServerLoop isn't guaranteed to
iterate immediately after a signal arrives. (My attempt to fix that
by using pselect(2) encountered too many portability question marks,
and in any case could not help on platforms without pselect().)
One could also question the wisdom of using an O(N^2) processing
method if the system is intended to support so many bgworkers.
As a compromise, allow that function to launch up to 100 bgworkers
per call (and in consequence, rename it to maybe_start_bgworkers).
This will allow any normal parallel-query request for workers
to be satisfied immediately during sigusr1_handler, avoiding the
question of whether ServerLoop will be able to launch more promptly.
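
The compromise amounts to a bounded loop. In this sketch the constant
name, the list-based queue and the start_worker callback are all
assumptions made for the example; only the limit of 100 comes from the
text.

```python
MAX_BGWORKERS_TO_LAUNCH = 100   # cap taken from the text; the name is illustrative


def maybe_start_bgworkers(pending, start_worker, limit=MAX_BGWORKERS_TO_LAUNCH):
    """Start up to `limit` pending workers; return True if some remain."""
    launched = 0
    while pending and launched < limit:
        start_worker(pending.pop(0))
        launched += 1
    # The caller should arrange to be called again if work is left over.
    return bool(pending)
```

A request for a handful of parallel workers is satisfied in one call,
while a burst of several hundred still yields control between batches.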
There is talk of rewriting the postmaster to use a WaitEventSet to
avoid the signal-response-delay problem, but I'd argue that this change
should be kept even after that happens (if it ever does).
Backpatch to 9.6 where parallel query was added. The issue exists
before that, but previous uses of bgworkers typically aren't as
sensitive to how quickly they get launched.
Discussion: https://postgr.es/m/4707.1493221358@sss.pgh.pa.us
Our general rule for pg_get_X(oid) functions is to simply return NULL
when passed an invalid or inappropriate OID. Teach pg_get_partkeydef to
do this also, making it easier for users to use this function when
querying against tables with both partitions and non-partitions (such as
pg_class).
As a concrete example, this makes pg_dump's life a little easier.
Author: Amit Langote
The publisher relation can be chosen incorrectly if there is more than
one relation with the same name in different schemas.
Author: Euler Taveira <euler@timbira.com.br>
The code was originally written with the assumption that the launcher is
the only process starting the worker. However that hasn't been true since
commit 7c4f52409 which failed to modify the worker management code
adequately.
This patch adds an in_use field to the LogicalRepWorker struct to
indicate whether the worker slot is being used and uses proper locking
everywhere this flag is set or read.
However if the parent process dies while the new worker is starting and
the new worker fails to attach to shared memory, this flag would never
get cleared. We solve this rare corner case by adding a sort of garbage
collector for in_use slots. This uses another field in the
LogicalRepWorker struct named launch_time that contains the time when
the worker was started. If any request to start a new worker does not
find a free slot, we'll check for workers that were supposed to start but
took too long to actually do so, and reuse their slot.
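
A sketch of the slot search with the garbage-collection fallback. Only
in_use and launch_time are taken from the text; the attached field, the
function name and the timeout parameter are assumptions, and the locking
the patch adds is omitted entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerSlot:
    in_use: bool = False
    attached: bool = False      # True once the worker attached to shared memory
    launch_time: float = 0.0


def find_free_slot(slots, now, start_timeout) -> Optional[int]:
    """Return the index of a usable slot, reclaiming stale ones if needed."""
    for i, slot in enumerate(slots):
        if not slot.in_use:
            return i
    # No free slot: reclaim one whose worker never showed up in time.
    for i, slot in enumerate(slots):
        if slot.in_use and not slot.attached and now - slot.launch_time > start_timeout:
            slots[i] = WorkerSlot()
            return i
    return None
```

Reclaiming only on demand keeps the common path cheap: the stale-slot scan
runs only when a start request would otherwise fail.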
In passing also fix possible race conditions when stopping a worker that
hasn't finished starting yet.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reported-by: Fujii Masao <masao.fujii@gmail.com>
There is no need to forbid ALTER TABLE ONLY on partitioned tables,
when no partitions exist yet. This can be handy for users who are
building up their partitioned table independently and will create actual
partitions later.
In addition, this is how pg_dump likes to operate in certain instances.
Author: Amit Langote, with some error message word-smithing by me
Otherwise one would have to wait up to DEFAULT_NAPTIME_PER_CYCLE until
the subscription worker is considered for starting.
There is a small race condition: If one enables a subscription right
after disabling it, the launcher might not have registered the stopping
when receiving the wakeup signal for the re-enabling. The start will
then not happen right away but after the full cycle time.
Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
In quorum-based synchronous replication, all the standbys listed in
synchronous_standby_names have an equal chance of being chosen
as synchronous standbys. So they should have the same priority.
However, previously, quorum standbys whose names appear earlier
in the list were given higher priority values, even though the differences
in those priority values didn't affect the selection of synchronous standbys.
Users could see those "meaningless" priority values in pg_stat_replication
and this was confusing.
This commit gives all the quorum synchronous standbys the same
highest priority, i.e., 1, in order to remove such confusion.
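
In terms of what a user sees, the change can be summarized by the
following sketch. The function is hypothetical and only models the
reported values, not how the server computes them.

```python
def standby_priorities(standby_names, quorum):
    """Map each listed standby to the priority reported for it."""
    if quorum:
        # Order in the list is irrelevant for quorum commit.
        return {name: 1 for name in standby_names}
    # Priority-based: earlier names get higher priority (smaller number).
    return {name: i for i, name in enumerate(standby_names, start=1)}
```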
Author: Fujii Masao
Reviewed-by: Masahiko Sawada, Kyotaro Horiguchi
Discussion: http://postgr.es/m/CAHGQGwEKOw=SmPLxJzkBsH6wwDBgOnVz46QjHbtsiZ-d-2RGUg@mail.gmail.com
This reverts commit 81069a9efc.
Buildfarm results suggest that some platforms have versions of pselect(2)
that are not merely non-atomic, but flat out non-functional. Revert the
use-pselect patch to confirm this diagnosis (and exclude the no-SA_RESTART
patch as the source of trouble). If it's so, we should probably look into
blacklisting specific platforms that have broken pselect.
Discussion: https://postgr.es/m/9696.1493072081@sss.pgh.pa.us
Traditionally we've unblocked signals, called select(2), and then blocked
signals again. The code expects that the select() will be cancelled with
EINTR if an interrupt occurs; but there's a race condition, which is that
an already-pending signal will be delivered as soon as we unblock, and then
when we reach select() there will be nothing preventing it from waiting.
This can result in a long delay before we perform any action that
ServerLoop was supposed to have taken in response to the signal. As with
the somewhat-similar symptoms fixed by commit 893902085, the main practical
problem is slow launching of parallel workers. The window for trouble is
usually pretty short, corresponding to one iteration of ServerLoop; but
it's not negligible.
To fix, use pselect(2) in place of select(2) where available, as that's
designed to solve exactly this problem. Where not available, we continue
to use the old way, and are no worse off than before.
pselect(2) has been required by POSIX since about 2001, so most modern
platforms should have it. A bigger portability issue is that some
implementations are said to be non-atomic, ie pselect() isn't really
any different from unblock/select/reblock. Still, we're no worse off
than before on such a platform.
There is talk of rewriting the postmaster to use a WaitEventSet and
not do signal response work in signal handlers, at which point this
could be reverted, since we'd be using a self-pipe to solve the race
condition. But that's not happening before v11 at the earliest.
Back-patch to 9.6. The problem exists much further back, but the
worst symptom arises only in connection with parallel query, so it
does not seem worth taking any portability risks in older branches.
Discussion: https://postgr.es/m/9205.1492833041@sss.pgh.pa.us
The postmaster keeps signals blocked everywhere except while waiting
for something to happen in ServerLoop(). The code expects that the
select(2) will be cancelled with EINTR if an interrupt occurs; without
that, followup actions that should be performed by ServerLoop() itself
will be delayed. However, some platforms interpret the SA_RESTART
signal flag as meaning that they should restart rather than cancel
the select(2). Worse yet, some of them restart it with the original
timeout delay, meaning that a steady stream of signal interrupts can
prevent ServerLoop() from iterating at all if there are no incoming
connection requests.
Observable symptoms of this, on an affected platform such as HPUX 10,
include extremely slow parallel query startup (possibly as much as
30 seconds) and failure to update timestamps on the postmaster's sockets
and lockfiles when no new connections arrive for a long time.
We can fix this by running the postmaster's signal handlers without
SA_RESTART. That would be quite a scary change if the range of code
where signals are accepted weren't so tiny, but as it is, it seems
safe enough. (Note that postmaster children do, and must, reset all
the handlers before unblocking signals; so this change should not
affect any child process.)
There is talk of rewriting the postmaster to use a WaitEventSet and
not do signal response work in signal handlers, at which point it might
be appropriate to revert this patch. But that's not happening before
v11 at the earliest.
Back-patch to 9.6. The problem exists much further back, but the
worst symptom arises only in connection with parallel query, so it
does not seem worth taking any portability risks in older branches.
Discussion: https://postgr.es/m/9205.1492833041@sss.pgh.pa.us
This corner case didn't behave nicely at all: the postmaster would
(partially) update its state as though the process had started
successfully, and be quite confused thereafter. Fix it to act
like the worker had crashed, instead.
In passing, refactor so that do_start_bgworker contains all the
state-change logic for bgworker launch, rather than just some of it.
Back-patch as far as 9.4. 9.3 contains similar logic, but it's just
enough different that I don't feel comfortable applying the patch
without more study; and the use of bgworkers in 9.3 was so small
that it doesn't seem worth the extra work.
transam/parallel.c is still entirely unprepared for the possibility
of bgworker startup failure, but that seems like material for a
separate patch.
Discussion: https://postgr.es/m/4905.1492813727@sss.pgh.pa.us
Fix machine-dependent sorting of column numbers. (Odd behavior
would only materialize for column numbers above 255, but that's
certainly legal.)
Fix poor choice of SQLSTATE for some errors, and improve error message
wording. (Notably, "is not a scalar type" is a totally misleading way
to explain "does not have a default btree opclass".)
Avoid taking AccessExclusiveLock on the associated relation during DROP
STATISTICS. That's neither necessary nor desirable, and it could easily
have put us into situations where DROP fails (compare commit 68ea2b7f9).
Adjust/improve comments.
David Rowley and Tom Lane
Discussion: https://postgr.es/m/CAKJS1f-GmCfPvBbAEaM5xoVOaYdVgVN1gicALSoYQ77z-+vLbw@mail.gmail.com
poll.h is mandated by Single Unix Spec v2, the usual baseline for
postgres on unix. None of the unixoid buildfarm animals has
sys/poll.h but not poll.h. Therefore there's not much point to test
for sys/poll.h's existence and include it optionally.
Author: Andres Freund, per suggestion from Tom Lane
Discussion: https://postgr.es/m/20505.1492723662@sss.pgh.pa.us
This seems to be largely cosmetic, avoiding valgrind bleats and the
like. The uninitialized padding influences the CRC of the on-disk
entry, but because it's also used when verifying the CRC, that doesn't
cause spurious failures. Backpatch nonetheless.
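
Why the padding is harmless for verification can be shown with a toy
on-disk entry. The layout below (2-byte id, 6 padding bytes, 8-byte LSN) is
invented for the example, and zlib's CRC-32 stands in for whatever checksum
the real code uses.

```python
import struct
import zlib


def write_entry(origin_id, lsn, padding):
    """Serialize an entry as payload bytes followed by their CRC."""
    # `padding` stands in for the bytes the compiler leaves between fields.
    payload = struct.pack("<H6sQ", origin_id, padding, lsn)
    return payload + struct.pack("<I", zlib.crc32(payload))


def verify_entry(blob):
    """Recompute the CRC over the stored payload, padding included."""
    payload, (crc,) = blob[:-4], struct.unpack("<I", blob[-4:])
    return zlib.crc32(payload) == crc
```

Two entries with the same fields but different padding produce different
bytes on disk, yet each verifies, because the check reads back the same
padding that was written.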
It's a bit unfortunate that contrib/test_decoding/sql/replorigin.sql
doesn't exercise the checkpoint path, but checkpoints are fairly
expensive on weaker machines, and we'd have to stop/start for that to
be meaningful.
Author: Andres Freund
Discussion: https://postgr.es/m/20170422183123.w2jgiuxtts7qrqaq@alap3.anarazel.de
Backpatch: 9.5, where replication origins were introduced
As reported by buildfarm animal skink / valgrind, some of the
variables weren't always initialized. To avoid further mishaps use
memset to ensure the entire entry is initialized.
Author: Petr Jelinek
Reported-By: Andres Freund
Discussion: https://postgr.es/m/20170422183123.w2jgiuxtts7qrqaq@alap3.anarazel.de
Backpatch: none, code new in master
poll(2) is required by Single Unix Spec v2, the usual baseline for
postgres (leaving windows aside). There haven't been any buildfarm
animals without poll(2) for a long while, leaving the select(2)
implementation largely untested.
On windows, including mingw, poll() is not available, but we have a
special case implementation for windows anyway.
Author: Andres Freund
Discussion: https://postgr.es/m/20170420003611.7r2sdvehesdyiz2i@alap3.anarazel.de
Bug was masked by error in running 004_timeline_switch.pl that was
fixed recently in 7d68f2281a.
Detective work by Alvaro Herrera and Tom Lane
Author: Thomas Munro
ProcessTwoPhaseBuffer (formerly StandbyRecoverPreparedTransactions)
mixed up the parent and child XIDs when calling SubTransSetParent to
record the transactions' relationship in pg_subtrans.
Remarkably, analysis by Simon Riggs suggests that this doesn't lead to
visible problems (at least, not in non-Assert builds). That might
explain why we'd not noticed it before. Nonetheless, it's surely wrong.
This code was born broken, so back-patch to all supported branches.
Discussion: https://postgr.es/m/20110.1492905318@sss.pgh.pa.us
The POSIX standard does not say that the success return value for
fcntl(F_SETFD) and fcntl(F_SETFL) is zero; it says only that it's not -1.
We had several calls that were making the stronger assumption. Adjust
them to test specifically for -1 for strict spec compliance.
The standard further leaves open the possibility that the O_NONBLOCK
flag bit is not the only active one in F_SETFL's argument. Formally,
therefore, one ought to get the current flags with F_GETFL and store
them back with only the O_NONBLOCK bit changed when trying to change
the nonblock state. In port/noblock.c, we were doing the full pushup
in pg_set_block but not in pg_set_noblock, which is just weird. Make
both of them do it properly, since they have little business making
any assumptions about the socket they're handed. The other places
where we're issuing F_SETFL are working with FDs we just got from
pipe(2), so it's reasonable to assume the FDs' properties are all
default, so I didn't bother adding F_GETFL steps there.
Also, while pg_set_block deserves some points for trying to do things
right, somebody had decided that it'd be even better to cast fcntl's
third argument to "long". Which is completely loony, because POSIX
clearly says the third argument for an F_SETFL call is "int".
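As an illustration of the rules above, here is a minimal sketch, not the
actual port/noblock.c code. The helper names demo_set_nonblocking() and
demo_is_nonblocking() are made up for this example. It fetches the
current flags with F_GETFL, changes only the O_NONBLOCK bit, passes an
int as the third argument, and treats only -1 as failure.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the real pg_set_noblock/pg_set_block): turn
 * O_NONBLOCK on or off while preserving every other file status flag.
 * Returns 0 on success, -1 on failure.
 */
int
demo_set_nonblocking(int fd, int on)
{
	int			flags;

	/* Fetch the current flags; only -1 signals failure. */
	flags = fcntl(fd, F_GETFL);
	if (flags == -1)
		return -1;

	/* Change just the one bit we care about. */
	if (on)
		flags |= O_NONBLOCK;
	else
		flags &= ~O_NONBLOCK;

	/* The third argument is an int; again, test only for -1. */
	if (fcntl(fd, F_SETFL, flags) == -1)
		return -1;
	return 0;
}

/* Report 1 if O_NONBLOCK is set, 0 if it is clear, -1 on error. */
int
demo_is_nonblocking(int fd)
{
	int			flags = fcntl(fd, F_GETFL);

	if (flags == -1)
		return -1;
	return (flags & O_NONBLOCK) != 0;
}
```

Using one routine for both directions keeps the "set" and "clear" paths
from drifting apart, which is the asymmetry complained about above.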
Given the lack of field complaints, these missteps apparently are not
of significance on any common platforms. But they're still wrong,
so back-patch to all supported branches.
Discussion: https://postgr.es/m/30882.1492800880@sss.pgh.pa.us
It doesn't make any immediate difference to PostgreSQL, but might as well
follow the standard, since one exists. (I looked at RFC 5803 earlier, but
didn't fully understand it back then.)
The new format uses Base64 instead of hex to encode StoredKey and
ServerKey, which makes the verifiers slightly smaller. Using the same
encoding for the salt and the keys also means that you only need one
encoder/decoder instead of two. Although we have code in the backend to
do both, we are talking about teaching libpq how to create SCRAM verifiers
for PQencodePassword(), and libpq doesn't currently have any code for hex
encoding.
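To put a rough number on "slightly smaller", the sketch below computes
the encoded length of one key under each encoding. It assumes 32-byte
keys (the SHA-256 digest length) and padded Base64; the function names
are invented for this example and are not backend or libpq functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hex needs two characters per input byte. */
size_t
demo_hex_encoded_len(size_t nbytes)
{
	return nbytes * 2;
}

/* Padded Base64 needs four characters per (possibly partial) 3-byte group. */
size_t
demo_base64_encoded_len(size_t nbytes)
{
	return ((nbytes + 2) / 3) * 4;
}
```

Under those assumptions a 32-byte key takes 64 characters in hex and 44
in Base64, so StoredKey and ServerKey together shrink by 40 characters.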
Bump catversion, because this renders any existing SCRAM verifiers in
pg_authid invalid.
Discussion: https://www.postgresql.org/message-id/351ba574-85ea-d9b8-9689-8c928dd0955d@iki.fi
Give a more specific error message than "xyz is not a table".
Also document in CREATE PUBLICATION which kinds of relations are not
supported.
based on patch by Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
Commit 7c4f524 allowed walsender to execute normal SQL commands
to support the table sync feature in logical replication. Previously,
log_statement caused such SQL commands to be logged, and
log_replication_commands caused them to be logged, too.
That is, such SQL commands were unexpectedly logged twice
when both settings were enabled.
This commit forces log_replication_commands to log only replication
commands, to prevent normal SQL commands from being logged twice.
Author: Masahiko Sawada
Reviewed-by: Kyotaro Horiguchi
Reported-by: Fujii Masao
Discussion: http://postgr.es/m/CAHGQGwFDWh_Qr-q_GEMpD+qH=vYPMdVqw=ZOSY3kX_Pna9R9SA@mail.gmail.com
It's not safe to raise an error while holding a spinlock. But previously
the logical replication worker for table sync called a function that
reads the system catalog, and so may raise an error, while it was
holding a spinlock. That could leave the spinlock never released and
the server stuck indefinitely.
Author: Petr Jelinek
Reviewed-by: Kyotaro Horiguchi and Fujii Masao
Reported-by: Fujii Masao
Discussion: http://postgr.es/m/CAHGQGwFDWh_Qr-q_GEMpD+qH=vYPMdVqw=ZOSY3kX_Pna9R9SA@mail.gmail.com
is_parallel_safe() supposed that the only relevant property of a SubPlan
was the parallel safety of the referenced subplan tree. This is wrong:
the testexpr or args subtrees might contain parallel-unsafe stuff, as
demonstrated by the test case added here. However, just recursing into the
subtrees fails in a different way: we'll typically find PARAM_EXEC Params
representing the subplan's output columns in the testexpr. The previous
coding supposed that any Param must be treated as parallel-restricted, so
that a naive attempt at fixing this disabled parallel pushdown of SubPlans
altogether. We must instead determine, for any visited Param, whether it
is one that would be computed by a surrounding SubPlan node; if so, it's
safe to push down along with the SubPlan node.
We might later be able to extend this logic to cope with Params used for
correlated subplans and other cases; but that's a task for v11 or beyond.
Tom Lane and Amit Kapila
Discussion: https://postgr.es/m/7064.1492022469@sss.pgh.pa.us
Since it appears that v10 is going to move the goalposts by some amount
in terms of where you can and can't invoke set-returning functions,
arrange for the executor's "set-valued function called in context that
cannot accept a set" errors to include a syntax position if possible,
pointing to the specific SRF that can't be called where it's located.
The main bit of infrastructure needed for this is to make the query source
text accessible in the executor; but it turns out that commit 4c728f382
already did that. We just need a new function executor_errposition()
modeled on parser_errposition(), and we're ready to rock.
While experimenting with this, I noted that the error position wasn't
properly reported if it occurred in a plpgsql FOR-over-query loop,
which turned out to be because SPI_cursor_open_internal wasn't providing
an error context callback during PortalStart. Fix that.
There's a whole lot more that could be done with this infrastructure
now that it's there, but this is not the right time in the development
cycle for that sort of work. Hence, resist the temptation to plaster
executor_errposition() calls everywhere ... for the moment.
Discussion: https://postgr.es/m/5263.1492471571@sss.pgh.pa.us
* Be sure to reset the launcher's pid (LogicalRepCtx->launcher_pid) to 0
even when the launcher emits an error.
* Declare ApplyLauncherWakeup() as a static function because it's called
only in launcher.c.
* Previously IsBackendPid() was used to check whether the launcher's pid
was valid. IsBackendPid() was necessary because of a bug where the
launcher's pid was not reset to 0. Now that it's fixed, IsBackendPid()
is not necessary and this patch removes it.
Author: Masahiko Sawada
Reviewed-by: Kyotaro Horiguchi
Reported-by: Fujii Masao
Discussion: http://postgr.es/m/CAHGQGwFDWh_Qr-q_GEMpD+qH=vYPMdVqw=ZOSY3kX_Pna9R9SA@mail.gmail.com
Per discussion, plain "scram" is confusing because we actually implement
SCRAM-SHA-256 rather than the original SCRAM that uses SHA-1 as the hash
algorithm. If we add support for SCRAM-SHA-512 or some other mechanism in
the SCRAM family in the future, that would become even more confusing.
Most of the internal files and functions still use just "scram" as a
shorthand for SCRAM-SHA-256, but I did change PASSWORD_TYPE_SCRAM to
PASSWORD_TYPE_SCRAM_SHA_256, as that could potentially be used by 3rd
party extensions that hook into the password-check hook.
Michael Paquier did this in an earlier version of the SCRAM patch set
already, but I didn't include that in the version that was committed.
Discussion: https://www.postgresql.org/message-id/fde71ff1-5858-90c8-99a9-1c2427e7bafb@iki.fi
Doing so allows various crash possibilities. Fix by avoiding
having PrescanPreparedTransactions() increment
ShmemVariableCache->nextXid when it has no 2PC files.
Bug found by Jeff Janes, diagnosis and patch by Pavan Deolasee,
then patch re-designed for clarity and full accuracy by
Michael Paquier.
Reported-by: Jeff Janes
Author: Pavan Deolasee, Michael Paquier
Discussion: https://postgr.es/m/CAMkU=1zMLnH_i1-PVQ-biZzvNx7VcuatriquEnh7HNk6K8Ss3Q@mail.gmail.com
CopyFrom() needs a range table for formatting certain errors for
constraint violations.
This changes the mechanism of how the range table is passed to the
CopyFrom() executor state. We used to generate the range table and one
entry for the relation manually inside DoCopy(). Now we use
addRangeTableEntryForRelation() to set up the range table and relation
entry for the ParseState, which is then passed down by BeginCopyFrom().
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reported-by: Euler Taveira <euler@timbira.com.br>
We were accepting creation of extended statistics only for regular
tables, but they can usefully be created for foreign tables, partitioned
tables, and materialized views, too. Allow those cases.
While at it, make sure all the rejected cases throw a consistent error
message, and add regression tests for the whole thing.
Author: David Rowley, Álvaro Herrera
Discussion: https://postgr.es/m/CAKJS1f-BmGo410bh5RSPZUvOO0LhmHL2NYmdrC_Jm8pk_FfyCA@mail.gmail.com
We were applying the use_physical_tlist optimization to all relation
scan plans, even those implemented by custom scan providers. However,
that's a bad idea for a couple of reasons. The custom provider might
be unable to provide columns that it hadn't expected to be asked for
(for example, the custom scan might depend on an index-only scan).
Even more to the point, there's no good reason to suppose that this
"optimization" is a win for a custom scan; whatever the custom provider
is doing is likely not based on simply returning physical heap tuples.
(As a counterexample, if the custom scan is an interface to a column store,
demanding all columns would be a huge loss.) If it is a win, the custom
provider could make that decision for itself and insert a suitable
pathtarget into the path, anyway.
Per discussion with Dmitry Ivanov. Back-patch to 9.5 where custom scan
support was introduced. The argument that the custom provider can adjust
the behavior by changing the pathtarget only applies to 9.6+, but on
balance it seems more likely that use_physical_tlist will hurt custom
scans than help them.
Discussion: https://postgr.es/m/e29ddd30-8ef9-4da5-a50b-2bb7b8c7198d@postgrespro.ru
Either because of a previous ALTER TABLE .. SET STATISTICS 0 or because
of being invoked with a partial column list, ANALYZE could fail to
acquire sufficient data to build extended statistics. Previously, this
would draw an ERROR and fail to collect any statistics at all (extended
and regular). Change things so that we raise a WARNING instead, and
remove a hint that was wrong in half the cases.
Reported by: David Rowley
Discussion: https://postgr.es/m/CAKJS1f9Kk0NF6Fg7TA=JUXsjpS9kX6NVu27pb5QDCpOYAvb-Og@mail.gmail.com
Coverity complained because bgw.bgw_extra wasn't being filled in by
ApplyLauncherRegister(). The most future-proof fix is to memset the
whole BackgroundWorker struct to zeroes. While at it, let's apply the
same coding rule to other places that set up BackgroundWorker structs;
four out of five had the same or related issues.
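The coding rule can be sketched as follows. DemoWorker is a made-up
stand-in for illustration, not the real BackgroundWorker struct, and
demo_setup_worker() is a hypothetical helper. The point is only that
zeroing the whole struct first leaves no field undefined, including
ones added later.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in; the real struct has different fields. */
typedef struct DemoWorker
{
	char		name[64];
	int			flags;
	char		extra[128];		/* plays the role of bgw_extra */
} DemoWorker;

void
demo_setup_worker(DemoWorker *w, const char *name, int flags)
{
	/* Zero everything first, so fields we never mention are still defined. */
	memset(w, 0, sizeof(*w));

	/* Then fill in only the fields this caller cares about. */
	snprintf(w->name, sizeof(w->name), "%s", name);
	w->flags = flags;
}
```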
addRangeTableEntryForENR had a check for pstate != NULL, which Coverity
pointed out was rather useless since it'd already dereferenced pstate
before that. More to the point, we'd established policy in commit
bc93ac12c that we'd require non-NULL pstate for all addRangeTableEntryFor*
functions; this test was evidently copied-and-pasted from some older
version of one of those functions. Make it look more like the others.
In passing, make an elog message look more like the rest of the code,
too.
Michael Paquier
In standard non-Windows builds, there's no particular reason to care what
address the kernel chooses to map the shared memory segment at. However,
when building with EXEC_BACKEND, there's a risk that the chosen address
won't be available in all child processes. Linux with ASLR enabled (which
it is by default) seems particularly at risk because it puts shmem segments
into the same area where it maps shared libraries. We can work around
that by specifying a mapping address that's outside the range where
shared libraries could get mapped. On x86_64 Linux, 0x7e0000000000
seems to work well.
This is only meant for testing/debugging purposes, so it doesn't seem
necessary to go as far as providing a GUC (or any user-visible
documentation, though we might change that later). Instead, it's just
controlled by setting an environment variable PG_SHMEM_ADDR to the
desired attach address.
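A minimal sketch of reading such a setting, assuming a 64-bit platform.
The helper names are invented for this example and this is not the
backend's shared memory code. It parses the variable with base 0 so that
a hex value like the one suggested above is accepted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: convert a textual address to a pointer.
 * Returns NULL when no address was requested.
 */
void *
demo_parse_shmem_addr(const char *text)
{
	if (text == NULL || *text == '\0')
		return NULL;

	/* Base 0 accepts decimal, octal ("0...") and hex ("0x...") input. */
	return (void *) (uintptr_t) strtoull(text, NULL, 0);
}

/* Hypothetical wrapper: look the address up in the environment. */
void *
demo_requested_shmem_addr(void)
{
	return demo_parse_shmem_addr(getenv("PG_SHMEM_ADDR"));
}
```

A NULL result would mean "let the kernel choose", matching the default
behavior when the variable is unset.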
Back-patch to all supported branches, since the point here is to
remove intermittent buildfarm failures on EXEC_BACKEND animals.
Owners of affected animals will need to add a suitable setting of
PG_SHMEM_ADDR to their build_env configuration.
Discussion: https://postgr.es/m/7036.1492231361@sss.pgh.pa.us
We'd already recognized that we can't pass function pointers across process
boundaries for functions in loadable modules, since a shared library could
get loaded at different addresses in different processes. But actually the
practice doesn't work for functions in the core backend either, if we're
using EXEC_BACKEND. This is the cause of recent failures on buildfarm
member culicidae. Switch to passing a string function name in all cases.
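The idea of passing a name instead of a pointer can be sketched like
this. All names here (demo_entry_table, demo_lookup_entry and the two
entry points) are hypothetical and not the backend's actual lookup
code. Each process resolves the string against its own table, so the
address it ends up with is valid in that process.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*demo_entry_fn) (void);

/* Call counters, so the two entry points are observably distinct. */
int			demo_apply_calls = 0;
int			demo_sync_calls = 0;

void
demo_apply_main(void)
{
	demo_apply_calls++;
}

void
demo_sync_main(void)
{
	demo_sync_calls++;
}

/* Per-process table mapping names to this process's own addresses. */
static const struct
{
	const char *name;
	demo_entry_fn fn;
}			demo_entry_table[] = {
	{"demo_apply_main", demo_apply_main},
	{"demo_sync_main", demo_sync_main},
};

/* Resolve a function name received from another process; NULL if unknown. */
demo_entry_fn
demo_lookup_entry(const char *name)
{
	size_t		i;

	for (i = 0; i < sizeof(demo_entry_table) / sizeof(demo_entry_table[0]); i++)
	{
		if (strcmp(demo_entry_table[i].name, name) == 0)
			return demo_entry_table[i].fn;
	}
	return NULL;
}
```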
Something like this needs to be back-patched into 9.6, but let's see
if the buildfarm likes it first.
Petr Jelinek, with a bunch of basically-cosmetic adjustments by me
Discussion: https://postgr.es/m/548f9c1d-eafa-e3fa-9da8-f0cc2f654e60@2ndquadrant.com
AFAICT, the only actual benefit of closing a bootstrap transaction
is to reclaim transient memory. We can do that a lot more cheaply
by just doing a MemoryContextReset on a suitable context. This
gets the runtime of the "bootstrap" phase of initdb down to the
point where, at least by eyeball, it's quite negligible compared
to the rest of the phases. Per discussion with Andres Freund.
Discussion: https://postgr.es/m/9244.1492106743@sss.pgh.pa.us
Standardize on testing a hash index page's type by doing
    (opaque->hasho_flag & LH_PAGE_TYPE) == LH_xxx_PAGE
Various places were taking shortcuts like
    opaque->hasho_flag & LH_BUCKET_PAGE
which, while not actually wrong, is still bad practice because
it encourages use of
    opaque->hasho_flag & LH_UNUSED_PAGE
which *is* wrong (LH_UNUSED_PAGE == 0, so the above is constant false).
hash_xlog.c's hash_mask() contained such an incorrect test.
This also ensures that we mask out the additional flag bits that
hasho_flag has accreted since 9.6. pgstattuple's pgstat_hash_page(),
for one, was failing to do that and was thus actively broken.
Also fix assorted comments that hadn't been updated to reflect the
extended usage of hasho_flag, and fix some macros that were testing
just "(hasho_flag & bit)" to use the less dangerous, project-approved
form "((hasho_flag & bit) != 0)".
Coverity found the bug in hash_mask(); I noted the one in
pgstat_hash_page() through code reading.
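A small self-contained sketch of why the masked comparison matters. The
DEMO_ constants are stand-ins chosen for illustration, not the real
hash AM header definitions; the only value taken from the text above is
that the "unused" page type is 0.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in flag values; only "unused == 0" comes from the commit text. */
#define DEMO_UNUSED_PAGE	0
#define DEMO_OVERFLOW_PAGE	(1 << 0)
#define DEMO_BUCKET_PAGE	(1 << 1)
#define DEMO_BITMAP_PAGE	(1 << 2)
#define DEMO_META_PAGE		(1 << 3)
#define DEMO_EXTRA_FLAG		(1 << 4)	/* a status bit, not a page type */

#define DEMO_PAGE_TYPE \
	(DEMO_OVERFLOW_PAGE | DEMO_BUCKET_PAGE | DEMO_BITMAP_PAGE | DEMO_META_PAGE)

/* Broken shortcut: ANDing with 0 can never be true. */
int
demo_is_unused_wrong(uint16_t flag)
{
	return (flag & DEMO_UNUSED_PAGE) != 0;
}

/* Mask out the type bits, then compare against the wanted type. */
int
demo_page_type_is(uint16_t flag, uint16_t type)
{
	return (flag & DEMO_PAGE_TYPE) == type;
}
```

The masked form gives the right answer for the zero-valued type and
also ignores any extra status bits sharing the same flag word.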
regexport.c thought it could just ignore LACON arcs, but the correct
behavior is to treat them as satisfiable while consuming zero input
(rather reminiscently of commit 9f1e642d5). Otherwise, the emitted
simplified-NFA representation may contain no paths leading from initial
to final state, which unsurprisingly confuses pg_trgm, as seen in
bug #14623 from Jeff Janes.
Since regexport's output representation has no concept of an arc that
consumes zero input, recurse internally to find the next normal arc(s)
after any LACON transitions. We'd be forced into changing that
representation if a LACON could be the last arc reaching the final
state, but fortunately the regex library never builds NFAs with such
a configuration, so there always is a next normal arc.
Back-patch to 9.3 where this logic was introduced.
Discussion: https://postgr.es/m/20170413180503.25948.94871@wrigleys.postgresql.org
This contains some protocol changes to SASL authentication (which is new
in v10):
* For future-proofing, in the AuthenticationSASL message that begins SASL
authentication, provide a list of SASL mechanisms that the server
supports, for the client to choose from. Currently, it's always just
SCRAM-SHA-256.
* Add a separate authentication message type for the final server->client
SASL message, which the client doesn't need to respond to. This makes
it unambiguous whether the client is supposed to send a response or not.
The SASL mechanism should know that anyway, but better to be explicit.
Also, in the server, support clients that don't send an Initial Client
response in the first SASLInitialResponse message. The server is supposed
to first send an empty request in that case, to which the client will
respond with the data that usually comes in the Initial Client Response.
libpq uses the Initial Client Response field and doesn't need this, and I
would assume any other sensible implementation to use Initial Client
Response, too, but let's follow the SASL spec.
Improve the documentation on SASL authentication in protocol. Add a
section describing the SASL message flow, and some details on our
SCRAM-SHA-256 implementation.
Document the different kinds of PasswordMessages that the frontend sends
in different phases of SASL authentication, as well as GSS/SSPI
authentication as separate message formats. Even though they're all 'p'
messages, and the exact format depends on the context, describing them as
separate message formats makes the documentation more clear.
Reviewed by Michael Paquier and Álvaro Hernández Tortosa.
Discussion: https://www.postgresql.org/message-id/CAB7nPqS-aFg0iM3AQOJwKDv_0WkAedRjs1W2X8EixSz+sKBXCQ@mail.gmail.com
Formerly, the bootstrap backend looked up the OIDs corresponding to
names in regproc catalog entries using brute-force searches of pg_proc.
It was somewhat remarkable that that worked at all, since it was used
while populating other pretty-fundamental catalogs like pg_operator.
And it was also quite slow, and getting slower as pg_proc gets bigger.
This patch moves the lookup work into genbki.pl, so that the values in
postgres.bki for regproc columns are always numeric OIDs, an option
that regprocin() already supported. Perl isn't the world's speediest
language, so this about doubles the time needed to run genbki.pl (from
0.3 to 0.6 sec on my machine). But we only do that at most once per
build. The time needed to run initdb drops significantly --- on my
machine, initdb --no-sync goes from 1.8 to 1.3 seconds. So this is
a small net win even for just one initdb per build, and it becomes
quite a nice win for test sequences requiring many initdb runs.
Strip out the now-dead code for brute-force catalog searching in
regprocin. We'd also cargo-culted similar logic into regoperin
and some (not all) of the other reg*in functions. That is all
dead code too since we currently have no need to load such values
during bootstrap. I removed it all, reasoning that if we ever
need such functionality it'd be much better to do it in a similar
way to this patch.
There might be some simplifications possible in the backend now that
regprocin doesn't require doing catalog reads so early in bootstrap.
I've not looked into that, though.
Andreas Karlsson, with some small adjustments by me
Discussion: https://postgr.es/m/30896.1492006367@sss.pgh.pa.us
Free each SASL message after sending it. It's not a lot of wasted memory,
and it's short-lived, but the authentication code in general tries to
pfree() stuff, so let's follow the example.
Adding the pfree() revealed a little bug in build_server_first_message().
It attempts to keep a copy of the sent message, but it was missing a
pstrdup(), so the pointer started to dangle after adding the pfree()
into CheckSCRAMAuth().
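The ownership problem can be sketched in plain C, with malloc/free
standing in for palloc/pfree. DemoState, demo_copy_string() and
demo_build_first_message() are hypothetical names and the message text
is a placeholder; this is not the backend's SCRAM code. The state keeps
its own copy, so the caller is free to release the buffer it was handed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct DemoState
{
	char	   *server_first_message;	/* private copy owned by the state */
} DemoState;

/* Hypothetical stand-in for pstrdup(): return a fresh heap copy or NULL. */
char *
demo_copy_string(const char *s)
{
	char	   *copy = malloc(strlen(s) + 1);

	if (copy != NULL)
		strcpy(copy, s);
	return copy;
}

/* Build a message for the caller and remember a separate copy of it. */
char *
demo_build_first_message(DemoState *state)
{
	char	   *msg = demo_copy_string("r=nonce,s=salt,i=4096");

	if (msg == NULL)
		return NULL;

	/*
	 * Storing "msg" itself here would leave a dangling pointer as soon as
	 * the caller frees the returned buffer, so keep an independent copy.
	 */
	state->server_first_message = demo_copy_string(msg);
	if (state->server_first_message == NULL)
	{
		free(msg);
		return NULL;
	}
	return msg;
}
```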
Reword comments and debug messages slightly, while we're at it.
Reviewed by Michael Paquier.
Discussion: https://www.postgresql.org/message-id/6490b975-5ee1-6280-ac1d-af975b19fb9a@iki.fi
Commit 5e6d8d2bb allowed parallel workers to execute parallel-safe
subplans, but it transmitted the query's entire list of subplans to
the worker(s). Since execMain.c blindly does ExecInitNode and later
ExecEndNode on every list element, this resulted in parallel-unsafe plan
nodes nonetheless getting started up and shut down in parallel workers.
That seems mostly harmless as far as core plan node types go (but
maybe not so much for Gather?). But it resulted in postgres_fdw
opening and then closing extra remote connections, and it's likely
that other non-parallel-safe FDWs or custom scan providers would have
worse reactions.
To fix, just make ExecSerializePlan replace parallel-unsafe subplans
with NULLs in the cut-down plan tree that it transmits to workers.
This relies on ExecInitNode and ExecEndNode to do nothing on NULL
input, but they do anyway. If anything else is touching the dropped
subplans in a parallel worker, that would be a bug to be fixed.
(This thus provides a strong guarantee that we won't try to do
something with a parallel-unsafe subplan in a worker.)
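A toy model of the approach, using made-up types rather than the real
executor structures. DemoPlan, demo_cut_down_subplans() and
demo_init_node() are hypothetical. Unsafe entries become NULL in the
list handed to a worker, and the init routine does nothing on NULL
input, so the unsafe node is never started there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a plan node. */
typedef struct DemoPlan
{
	bool		parallel_safe;
	int			init_count;		/* how many times the node was started */
} DemoPlan;

/* Build the list sent to a worker: unsafe entries are replaced by NULL. */
void
demo_cut_down_subplans(DemoPlan **all, DemoPlan **for_worker, size_t n)
{
	size_t		i;

	for (i = 0; i < n; i++)
	{
		if (all[i] != NULL && all[i]->parallel_safe)
			for_worker[i] = all[i];
		else
			for_worker[i] = NULL;
	}
}

/* Startup routine that does nothing on NULL input. */
void
demo_init_node(DemoPlan *plan)
{
	if (plan == NULL)
		return;
	plan->init_count++;
}
```

Replacing entries rather than deleting them keeps every remaining
subplan at its original list position.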
This is, I think, the last fix directly occasioned by Andreas Seltenreich's
bug report of a few days ago.
Tom Lane and Amit Kapila
Discussion: https://postgr.es/m/87tw5x4vcu.fsf@credativ.de
We'd managed to avoid doing this so far, but it seems pretty obvious
that it would be forced on us some day, and this is much the cleanest
way of approaching the open problem that parallel-unsafe subplans are
being transmitted to parallel workers. Anyway there's no space cost
due to alignment considerations, and the time cost is pretty minimal
since we're just copying the flag from the corresponding Path node.
(At least in most cases ... some of the klugier spots in createplan.c
have to work a bit harder.)
In principle we could perhaps get rid of SubPlan.parallel_safe,
but I thought it better to keep that in case there are reasons to
consider a SubPlan unsafe even when its child plan is parallel-safe.
This patch doesn't actually do anything with the new flags, but
I thought I'd commit it separately anyway.
Note: although this touches outfuncs/readfuncs, there's no need for
a catversion bump because Plan trees aren't stored on disk.
Discussion: https://postgr.es/m/87tw5x4vcu.fsf@credativ.de