Commit Graph

18972 Commits

Author SHA1 Message Date
Tom Lane 5bbee34d9f Avoid producing over-length specific_name outputs in information_schema.
information_schema output columns that are declared as being type
sql_identifier are supposed to conform to the implementation's rules
for valid identifiers, in particular the identifier length limit.
Several places potentially violated this limit by concatenating a
function's name and OID.  (The OID is added to ensure name uniqueness
within a schema, since the spec doesn't expect function name overloading.)

Simply truncating the concatenation result to fit in "name" won't do,
since losing part of the OID might wind up giving non-unique results.
Instead, let's truncate the function name as necessary.

The most practical way to do that is to do it in a C function; the
information_schema.sql script doesn't have easy access to the value
of NAMEDATALEN, nor does it have an easy way to truncate on the basis
of resulting byte-length rather than number of characters.

(There are still a couple of places that cast concatenation results to
sql_identifier, but as far as I can see they are guaranteed not to produce
over-length strings, at least with the normal value of NAMEDATALEN.)

Discussion: https://postgr.es/m/23817.1545283477@sss.pgh.pa.us
2018-12-20 16:21:59 -05:00
Alvaro Herrera 7b14bcc06c Fix lock level used for partition when detaching it
For probably bogus reasons, we acquire only AccessShareLock on the
partition when we try to detach it from its parent partitioned table.
This can cause ugly things to happen if another transaction is doing
any sort of DDL to the partition concurrently.

Upgrade that lock to ShareUpdateExclusiveLock, which per discussion
seems to be the minimum needed.

Reported by Robert Haas.

Discussion: https://postgr.es/m/CA+TgmoYruJQ+2qnFLtF1xQtr71pdwgfxy3Ziy-TxV28M6pEmyA@mail.gmail.com
2018-12-20 16:42:13 -03:00
Alvaro Herrera 0c2377152f DETACH PARTITION: hold locks on indexes until end of transaction
When a partition is detached from its parent, we acquire locks on all
attached indexes to also detach them ... but we release those locks
immediately.  This is a violation of the policy of keeping locks on user
objects to the end of the transaction.  Bug introduced in 8b08f7d482.

It's unclear that there are any ill effects possible, but it's clearly
wrong nonetheless.  It's likely that bad behavior *is* possible, but
mostly because the relation that the index is for is only locked with
AccessShareLock, which is an older bug that shall be fixed separately.

While touching that line of code, close the index opened with
index_open() using index_close() instead of relation_close().
No difference in practice, but let's be consistent.

Unearthed by Robert Haas.

Discussion: https://postgr.es/m/CA+TgmoYruJQ+2qnFLtF1xQtr71pdwgfxy3Ziy-TxV28M6pEmyA@mail.gmail.com
2018-12-20 10:58:22 -03:00
Greg Stark 1075dfdaf3 Fix ADD IF NOT EXISTS used in conjunction with ALTER TABLE ONLY
The flag for IF NOT EXISTS was only being passed down in the normal
recursing case. It's been this way since originally added in 9.6 in
commit 2cd40adb85 so backpatch back to 9.6.
2018-12-19 19:38:31 -05:00
Tom Lane 2ece7c07dc Add text-vs-name cross-type operators, and unify name_ops with text_ops.
Now that name comparison has effectively the same behavior as text
comparison, we might as well merge the name_ops opfamily into text_ops,
allowing cross-type comparisons to be processed without forcing a
datatype coercion first.  We need do little more than add cross-type
operators to make the opfamily complete, and fix one or two places
in the planner that assumed text_ops was a single-datatype opfamily.

I chose to unify hash name_ops into hash text_ops as well, since the
types have compatible hashing semantics.  This allows marking the
new cross-type equality operators as oprcanhash.

(Note: this doesn't remove the name_ops opclasses, so there's no
breakage of index definitions.  Those opclasses are just reparented
into the text_ops opfamily.)

Discussion: https://postgr.es/m/15938.1544377821@sss.pgh.pa.us
2018-12-19 17:46:25 -05:00
Tom Lane 586b98fdf1 Make type "name" collation-aware.
The "name" comparison operators now all support collations, making them
functionally equivalent to "text" comparisons, except for the different
physical representation of the datatype.  They do, in fact, mostly share
the varstr_cmp and varstr_sortsupport infrastructure, which has been
slightly enlarged to handle the case.

To avoid changes in the default behavior of the datatype, set name's
typcollation to C_COLLATION_OID not DEFAULT_COLLATION_OID, so that
by default comparisons to a name value will continue to use strcmp
semantics.  (This would have been the case for system catalog columns
anyway, because of commit 6b0faf723, but doing this makes it true for
user-created name columns as well.  In particular, this avoids
locale-dependent changes in our regression test results.)

In consequence, tweak a couple of places that made assumptions about
collatable base types always having typcollation DEFAULT_COLLATION_OID.
I have not, however, attempted to relax the restriction that user-
defined collatable types must have that.  Hence, "name" doesn't
behave quite like a user-defined type; it acts more like a domain
with COLLATE "C".  (Conceivably, if we ever get rid of the need for
catalog name columns to be fixed-length, "name" could actually become
such a domain over text.  But that'd be a pretty massive undertaking,
and I'm not volunteering.)

Discussion: https://postgr.es/m/15938.1544377821@sss.pgh.pa.us
2018-12-19 17:46:25 -05:00
Alvaro Herrera 68f6f2b739 Remove function names from error messages
They are not necessary, and having them there gives useless work for
translators.
2018-12-19 14:53:27 -03:00
Tom Lane c6e394c1a2 Small improvements for allocation logic in ginHeapTupleFastCollect().
Avoid repetitive calls to repalloc() when the required size of the
collector array grows more than 2x in one call.  Also ensure that the
array size is a power of 2 (since palloc will probably consume a power
of 2 anyway) and doesn't start out very small (which'd likely just lead
to extra repallocs).

David Rowley, tweaked a bit by me

Discussion: https://postgr.es/m/CAKJS1f8vn-iSBE8PKeVHrnhvyjRNYCxguPFFY08QLYmjWG9hPQ@mail.gmail.com
2018-12-19 11:41:36 -05:00
Peter Geoghegan 61a4480a68 Remove obsolete nbtree duplicate entries comment.
Remove a comment from the Berkeley days claiming that nbtree must
disambiguate duplicate keys within _bt_moveright().  There is no special
care taken around duplicates within _bt_moveright(), at least since
commit 9e85183bfc removed inscrutable _bt_moveright() code to handle
pages full of duplicates.
2018-12-18 21:40:38 -08:00
Peter Geoghegan 60f3cc9553 Correct obsolete nbtree recovery comments.
Commit 40dae7ec53, which made the handling of interrupted nbtree page
splits more robust, removed an nbtree-specific end-of-recovery cleanup
step.  This meant that it was no longer possible to complete an
interrupted page split during recovery.  However, a reference to
recovery as a reason for using a NULL stack while inserting into a
parent page was missed.  Remove the reference.

Remove a similar obsolete reference to recovery that was introduced much
more recently, as part of the btree fastpath optimization enhancement
that made it into Postgres 11 (commit 2b272734, and follow-up commits).

Backpatch: 11-, where the fastpath optimization was introduced.
2018-12-18 16:59:50 -08:00
Tom Lane 6b0faf7236 Make collation-aware system catalog columns use "C" collation.
Up to now we allowed text columns in system catalogs to use collation
"default", but that isn't really safe because it might mean something
different in template0 than it means in a database cloned from template0.
In particular, this could mean that cloned pg_statistic entries for such
columns weren't entirely valid, possibly leading to bogus planner
estimates, though (probably) not any outright failures.

In the wake of commit 5e0928005, a better solution is available: if we
label such columns with "C" collation, then their pg_statistic entries
will also use that collation and hence will be valid independently of
the database collation.

This also provides a cleaner solution for indexes on such columns than
the hack added by commit 0b28ea79c: the indexes will naturally inherit
"C" collation and don't have to be forced to use text_pattern_ops.

Also, with the planned improvement of type "name" to be collation-aware,
this policy will apply cleanly to both text and name columns.

Because of the pg_statistic angle, we should also apply this policy
to the tables in information_schema.  This patch does that by adjusting
information_schema's textual domain types to specify "C" collation.
That has the user-visible effect that order-sensitive comparisons to
textual information_schema view columns will now use "C" collation
by default.  The SQL standard says that the collation of those view
columns is implementation-defined, so I think this is legal per spec.
At some point this might allow for translation of such comparisons
into indexable conditions on the underlying "name" columns, although
additional work will be needed before that can happen.

Discussion: https://postgr.es/m/19346.1544895309@sss.pgh.pa.us
2018-12-18 12:48:15 -05:00
Tom Lane d364e88155 Fix ancient thinko in mergejoin cost estimation.
"rescanratio" was computed as 1 + rescanned-tuples / total-inner-tuples,
which is sensible if it's to be multiplied by total-inner-tuples or a cost
value corresponding to scanning all the inner tuples.  But in reality it
was (mostly) multiplied by inner_rows or a related cost, numbers that take
into account the possibility of stopping short of scanning the whole inner
relation thanks to a limited key range in the outer relation.  This'd
still make sense if we could expect that stopping short would result in a
proportional decrease in the number of tuples that have to be rescanned.
It does not, however.  The argument that establishes the validity of our
estimate for that number is independent of whether we scan all of the inner
relation or stop short, and experimentation also shows that stopping short
doesn't reduce the number of rescanned tuples.  So the correct calculation
is 1 + rescanned-tuples / inner_rows, and we should be sure to multiply
that by inner_rows or a corresponding cost value.

Most of the time this doesn't make much difference, but if we have
both a high rescan rate (due to lots of duplicate values) and an outer
key range much smaller than the inner key range, then the error can
be significant, leading to a large underestimate of the cost associated
with rescanning.

Per report from Vijaykumar Jain.  This thinko appears to go all the way
back to the introduction of the rescan estimation logic in commit
70fba7043, so back-patch to all supported branches.

Discussion: https://postgr.es/m/CAE7uO5hMb_TZYJcZmLAgO6iD68AkEK6qCe7i=vZUkCpoKns+EQ@mail.gmail.com
2018-12-18 11:19:38 -05:00
Michael Paquier f94cec6447 Include partitioned indexes to system view pg_indexes
pg_tables already includes partitioned tables, so for consistency
pg_indexes should show partitioned indexes.

Author: Suraj Kharage
Reviewed-by: Amit Langote, Michael Paquier
Discussion: https://postgr.es/m/CAF1DzPVrYo4XNTEnc=PqVp6aLJc7LFYpYR4rX=_5pV=wJ2KdZg@mail.gmail.com
2018-12-18 16:37:51 +09:00
Alvaro Herrera ca4103025d Fix tablespace handling for partitioned tables
When partitioned tables were introduced, we failed to realize that by
copying the tablespace handling for other relation kinds with no
physical storage we were causing the secondary effect that their
partitions would not automatically inherit the tablespace setting.  This
is surprising and unhelpful, so change it to adopt the behavior
introduced in pg11 (commit 33e6c34c32) for partitioned indexes: the
parent relation remembers the tablespace specification, which is then
used for any new partitions that don't declare one.

Because this commit changes behavior of the TABLESPACE clause for
partitioned tables (it's no longer a no-op), it is not backpatched.

Author: David Rowley, Álvaro Herrera
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CAKJS1f9SxVzqDrGD1teosFd6jBMM0UEaa14_8mRvcWE19Tu0hA@mail.gmail.com
2018-12-17 15:37:40 -03:00
Amit Kapila 3abb11e55b Remove extra semicolons.
Reported-by: David Rowley
Author: David Rowley
Reviewed-by: Amit Kapila
Backpatch-through: 10
Discussion: https://postgr.es/m/CAKJS1f8EneeYyzzvdjahVZ6gbAHFkHbSFB5m_C0Y6TUJs9Dgdg@mail.gmail.com
2018-12-17 14:32:25 +05:30
Michael Paquier 67915fb8e5 Fix use-after-free bug when renaming constraints
This is an oversight from recent commit b13fd344.  While on it, tweak
the previous test with a better name for the renamed primary key.

Detected by buildfarm member prion which forces relation cache release
with -DRELCACHE_FORCE_RELEASE.  Back-patch down to 9.4 as the previous
commit.
2018-12-17 12:43:00 +09:00
Michael Paquier b13fd344c5 Make constraint rename issue relcache invalidation on target relation
When a constraint gets renamed, it may have associated with it a target
relation (for example domain constraints don't have one).  Not
invalidating the target relation cache when issuing the renaming can
result in issues with subsequent commands that refer to the old
constraint name using the relation cache, causing various failures.  One
pattern spotted was using CREATE TABLE LIKE after a constraint
renaming.

Reported-by: Stuart <sfbarbee@gmail.com>
Author: Amit Langote
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/2047094.V130LYfLq4@station53.ousa.org
2018-12-17 10:34:44 +09:00
Tom Lane a73d083195 Modernize our code for looking up descriptive strings for Unix signals.
At least as far back as the 2008 spec, POSIX has defined strsignal(3)
for looking up descriptive strings for signal numbers.  We hadn't gotten
the word though, and were still using the crufty old sys_siglist array,
which is in no standard even though most Unixen provide it.

Aside from not being formally standards-compliant, this was just plain
ugly because it involved #ifdef's at every place using the code.

To eliminate the #ifdef's, create a portability function pg_strsignal,
which wraps strsignal(3) if available and otherwise falls back to
sys_siglist[] if available.  The set of Unixen with neither API is
probably empty these days, but on any platform with neither, you'll
just get "unrecognized signal".  All extant callers print the numeric
signal number too, so no need to work harder than that.

Along the way, upgrade pg_basebackup's child-error-exit reporting
to match the rest of the system.

Discussion: https://postgr.es/m/25758.1544983503@sss.pgh.pa.us
2018-12-16 19:38:57 -05:00
Tom Lane ade2d61ed0 Improve detection of child-process SIGPIPE failures.
Commit ffa4cbd62 added logic to detect SIGPIPE failure of a COPY child
process, but it only worked correctly if the SIGPIPE occurred in the
immediate child process.  Depending on the shell in use and the
complexity of the shell command string, we might instead get back
an exit code of 128 + SIGPIPE, representing a shell error exit
reporting SIGPIPE in the child process.

We could just hack up ClosePipeToProgram() to add the extra case,
but it seems like this is a fairly general issue deserving a more
general and better-documented solution.  I chose to add a couple
of functions in src/common/wait_error.c, which is a natural place
to know about wait-result encodings, that will test for either a
specific child-process signal type or any child-process signal failure.
Then, adjust other places that were doing ad-hoc tests of this type
to use the common functions.

In RestoreArchivedFile, this fixes a race condition affecting whether
the process will report an error or just silently proc_exit(1): before,
that depended on whether the intermediate shell got SIGTERM'd itself
or reported a child process failing on SIGTERM.

Like the previous patch, back-patch to v10; we could go further
but there seems no real need to.

Per report from Erik Rijkers.

Discussion: https://postgr.es/m/f3683f87ab1701bea5d86a7742b22432@xs4all.nl
2018-12-16 14:32:14 -05:00
Tom Lane 5e09280057 Make pg_statistic and related code account more honestly for collations.
When we first put in collations support, we basically punted on teaching
pg_statistic, ANALYZE, and the planner selectivity functions about that.
They've just used DEFAULT_COLLATION_OID independently of the actual
collation of the data.  It's time to improve that, so:

* Add columns to pg_statistic that record the specific collation associated
with each statistics slot.

* Teach ANALYZE to use the column's actual collation when comparing values
for statistical purposes, and record this in the appropriate slot.  (Note
that type-specific typanalyze functions are now expected to fill
stats->stacoll with the appropriate collation, too.)

* Teach assorted selectivity functions to use the actual collation of
the stats they are looking at, instead of just assuming it's
DEFAULT_COLLATION_OID.

This should give noticeably better results in selectivity estimates for
columns with nondefault collations, at least for query clauses that use
that same collation (which would be the default behavior in most cases).
It's still true that comparisons with explicit COLLATE clauses different
from the stored data's collation won't be well-estimated, but that's no
worse than before.  Also, this patch does make the first step towards
doing better with that, which is that it's now theoretically possible to
collect stats for a collation other than the column's own collation.

Patch by me; thanks to Peter Eisentraut for review.

Discussion: https://postgr.es/m/14706.1544630227@sss.pgh.pa.us
2018-12-14 12:52:49 -05:00
Michael Paquier 8fb569e978 Introduce new extended routines for FDW and foreign server lookups
The cache lookup routines for foreign-data wrappers and foreign servers
are extended with an extra argument to handle a set of flags.  The only
value which can be used now is to indicate if a missing object should
result in an error or not, and are designed to be extensible on need.
Those new routines are added into the existing set of user-visible
FDW APIs and documented in consequence.  They will be used for future
patches to improve the SQL interface for object addresses.

Author: Michael Paquier
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/CAB7nPqSZxrSmdHK-rny7z8mi=EAFXJ5J-0RbzDw6aus=wB5azQ@mail.gmail.com
2018-12-14 08:59:35 +09:00
Andres Freund 09568ec3d3 Create a separate oid range for oids assigned by genbki.pl.
The changes I made in 578b229718 assigned oids below
FirstBootstrapObjectId to objects in include/catalog/*.dat files that
did not have an oid assigned, starting at the max oid explicitly
assigned.  Tom criticized that for mainly two reasons:
1) It's not clear which values are manually and which explicitly
   assigned.
2) The space below FirstBootstrapObjectId gets pretty crowded, and
   some PostgreSQL forks have used oids >= 9000 for their own objects,
   to avoid conflicting.

Thus create a new range for objects not assigned explicit oids, but
assigned by genbki.pl. For now 1-9999 is for explicitly assigned oids,
FirstGenbkiObjectId (10000) to FirstBootstrapObjectId (1200) -1 is for
genbki.pl assigned oids, and < FirstNormalObjectId (16384) is for oids
assigned during bootstrap.  It's possible that we'll have to adjust
these boundaries, but there's some headroom for now.

Add a note suggesting that oids in forks should be assigned in the
9000-9999 range.

Catversion bump for obvious reasons.

Per complaint from Tom Lane.

Author: Andres Freund
Discussion: https://postgr.es/m/16845.1544393682@sss.pgh.pa.us
2018-12-13 14:50:57 -08:00
Tom Lane 84d514887f Fix bogus logic for skipping unnecessary partcollation dependencies.
The idea here is to not call recordDependencyOn for the default collation,
since we know that's pinned.  But what the code actually did was to record
the partition key's dependency on the opclass twice, instead.

Evidently introduced by sloppy coding in commit 2186b608b.  Back-patch
to v10 where that came in.
2018-12-13 15:11:09 -05:00
Tom Lane 04fe805a17 Drop no-op CoerceToDomain nodes from expressions at planning time.
If a domain has no constraints, then CoerceToDomain doesn't really do
anything and can be simplified to a RelabelType.  This not only
eliminates cycles at execution, but allows the planner to optimize better
(for instance, match the coerced expression to an index on the underlying
column).  However, we do have to support invalidating the plan later if
a constraint gets added to the domain.  That's comparable to the case of
a change to a SQL function that had been inlined into a plan, so all the
necessary logic already exists for plans depending on functions.  We
need only duplicate or share that logic for domains.

ALTER DOMAIN ADD/DROP CONSTRAINT need to be taught to send out sinval
messages for the domain's pg_type entry, since those operations don't
update that row.  (ALTER DOMAIN SET/DROP NOT NULL do update that row,
so no code change is needed for them.)

Testing this revealed what's really a pre-existing bug in plpgsql:
it caches the SQL-expression-tree expansion of type coercions and
had no provision for invalidating entries in that cache.  Up to now
that was only a problem if such an expression had inlined a SQL
function that got changed, which is unlikely though not impossible.
But failing to track changes of domain constraints breaks an existing
regression test case and would likely cause practical problems too.

We could fix that locally in plpgsql, but what seems like a better
idea is to build some generic infrastructure in plancache.c to store
standalone expressions and track invalidation events for them.
(It's tempting to wonder whether plpgsql's "simple expression" stuff
could use this code with lower overhead than its current use of the
heavyweight plancache APIs.  But I've left that idea for later.)

Other stuff fixed in passing:

* Allow estimate_expression_value() to drop CoerceToDomain
unconditionally, effectively assuming that the coercion will succeed.
This will improve planner selectivity estimates for cases involving
estimatable expressions that are coerced to domains.  We could have
done this independently of everything else here, but there wasn't
previously any need for eval_const_expressions_mutator to know about
CoerceToDomain at all.

* Use a dlist for plancache.c's list of cached plans, rather than a
manually threaded singly-linked list.  That eliminates a potential
performance problem in DropCachedPlan.

* Fix a couple of inconsistencies in typecmds.c about whether
operations on domains drop RowExclusiveLock on pg_type.  Our common
practice is that DDL operations do drop catalog locks, so standardize
on that choice.

Discussion: https://postgr.es/m/19958.1544122124@sss.pgh.pa.us
2018-12-13 13:24:43 -05:00
Alexander Korotkov 52ac6cd2d0 Prevent GIN deleted pages from being reclaimed too early
When GIN vacuum deletes a posting tree page, it assumes that no concurrent
searchers can access it, thanks to ginStepRight() locking two pages at once.
However, since 9.4 searches can skip parts of posting trees descending from the
root.  That leads to the risk that page is deleted and reclaimed before
concurrent search can access it.

This commit prevents the risk of above by waiting for every transaction, which
might wait to reference this page, to finish.  Due to binary compatibility
we can't change GinPageOpaqueData to store corresponding transaction id.
Instead we reuse page header pd_prune_xid field, which is unused in index pages.

Discussion: https://postgr.es/m/31a702a.14dd.166c1366ac1.Coremail.chjischj%40163.com
Author: Andrey Borodin, Alexander Korotkov
Reviewed-by: Alexander Korotkov
Backpatch-through: 9.4
2018-12-13 06:55:34 +03:00
Alexander Korotkov c6ade7a8cd Prevent deadlock in ginRedoDeletePage()
On standby ginRedoDeletePage() can work concurrently with read-only queries.
Those queries can traverse posting tree in two ways.
1) Using rightlinks by ginStepRight(), which locks the next page before
   unlocking its left sibling.
2) Using downlinks by ginFindLeafPage(), which locks at most one page at time.

Original lock order was: page, parent, left sibling.  That lock order can
deadlock with ginStepRight().  In order to prevent deadlock this commit changes
lock order to: left sibling, page, parent.  Note, that position of parent in
locking order seems insignificant, because we only lock one page at time while
traversing downlinks.

Reported-by: Chen Huajun
Diagnosed-by: Chen Huajun, Peter Geoghegan, Andrey Borodin
Discussion: https://postgr.es/m/31a702a.14dd.166c1366ac1.Coremail.chjischj%40163.com
Author: Alexander Korotkov
Backpatch-through: 9.4
2018-12-13 06:55:34 +03:00
Alexander Korotkov fd83c83d09 Fix deadlock in GIN vacuum introduced by 218f51584d
Before 218f51584d if posting tree page is about to be deleted, then the whole
posting tree is locked by LockBufferForCleanup() on root preventing all the
concurrent inserts.  218f51584d reduced locking to the subtree containing
page to be deleted.  However, due to concurrent parent split, inserter doesn't
always holds pins on all the pages constituting path from root to the target
leaf page.  That could cause a deadlock between GIN vacuum process and GIN
inserter.  And we didn't find non-invasive way to fix this.

This commit reverts VACUUM behavior to lock the whole posting tree before
delete any page.  However, we keep another useful change by 218f51584d5: the
tree is locked only if there are pages to be deleted.

Reported-by: Chen Huajun
Diagnosed-by: Chen Huajun, Andrey Borodin, Peter Geoghegan
Discussion: https://postgr.es/m/31a702a.14dd.166c1366ac1.Coremail.chjischj%40163.com
Author: Alexander Korotkov, based on ideas from Andrey Borodin and Peter Geoghegan
Reviewed-by: Andrey Borodin
Backpatch-through: 10
2018-12-13 06:55:34 +03:00
Tom Lane 77d4d88afb Repair bogus EPQ plans generated for postgres_fdw foreign joins.
postgres_fdw's postgresGetForeignPlan() assumes without checking that the
outer_plan it's given for a join relation must have a NestLoop, MergeJoin,
or HashJoin node at the top.  That's been wrong at least since commit
4bbf6edfb (which could cause insertion of a Sort node on top) and it seems
like a pretty unsafe thing to Just Assume even without that.

Through blind good fortune, this doesn't seem to have any worse
consequences today than strange EXPLAIN output, but it's clearly trouble
waiting to happen.

To fix, test the node type explicitly before touching Join-specific
fields, and avoid jamming the new tlist into a node type that can't
do projection.  Export a new support function from createplan.c
to avoid building low-level knowledge about the latter into FDWs.

Back-patch to 9.6 where the faulty coding was added.  Note that the
associated regression test cases don't show any changes before v11,
apparently because the tests back-patched with 4bbf6edfb don't actually
exercise the problem case before then (there's no top-level Sort
in those plans).

Discussion: https://postgr.es/m/8946.1544644803@sss.pgh.pa.us
2018-12-12 16:08:30 -05:00
Tom Lane 0f7ec8d9c3 Repair bogus handling of multi-assignment Params in upper plan levels.
Our support for multiple-set-clauses in UPDATE assumes that the Params
referencing a MULTIEXPR_SUBLINK SubPlan will appear before that SubPlan
in the targetlist of the plan node that calculates the updated row.
(Yeah, it's a hack...)  In some PG branches it's possible that a Result
node gets inserted between the primary calculation of the update tlist
and the ModifyTable node.  setrefs.c did the wrong thing in this case
and left the upper-level Params as Params, causing a crash at runtime.
What it should do is replace them with "outer" Vars referencing the child
plan node's output.  That's a result of careless ordering of operations
in fix_upper_expr_mutator, so we can fix it just by reordering the code.

Fix fix_join_expr_mutator similarly for consistency, even though join
nodes could never appear in such a context.  (In general, it seems
likely to be a bit cheaper to use Vars than Params in such situations
anyway, so this patch might offer a tiny performance improvement.)

The hazard extends back to 9.5 where the MULTIEXPR_SUBLINK stuff
was introduced, so back-patch that far.  However, this may be a live
bug only in 9.6.x and 10.x, as the other branches don't seem to want
to calculate the final tlist below the Result node.  (That plan shape
change between branches might be a mini-bug in itself, but I'm not
really interested in digging into the reasons for that right now.
Still, add a regression test memorializing what we expect there,
so we'll notice if it changes again.)

Per bug report from Eduards Bezverhijs.

Discussion: https://postgr.es/m/b6cd572a-3e44-8785-75e9-c512a5a17a73@tieto.com
2018-12-12 13:49:41 -05:00
Michael Paquier cc53123bcc Tweak pg_partition_tree for undefined relations and unsupported relkinds
This fixes a crash which happened when calling the function directly
with a relation OID referring to a non-existing object, and changes the
behavior so as NULL is returned for unsupported relkinds instead of
generating an error.  This puts the new function in line with many other
system functions, and eases actions like full scans of pg_class.

Author: Michael Paquier
Reviewed-by: Amit Langote, Stephen Frost
Discussion: https://postgr.es/m/20181207010406.GO2407@paquier.xyz
2018-12-12 09:49:39 +09:00
Tom Lane 001bb9f3ed Add stack depth checks to key recursive functions in backend/nodes/*.c.
Although copyfuncs.c has a check_stack_depth call in its recursion,
equalfuncs.c, outfuncs.c, and readfuncs.c lacked one.  This seems
unwise.

Likewise fix planstate_tree_walker(), in branches where that exists.

Discussion: https://postgr.es/m/30253.1544286631@sss.pgh.pa.us
2018-12-10 11:12:43 -05:00
Tom Lane b7a29695f7 Make TupleDescInitBuiltinEntry throw error for unsupported types.
Previously, it would just pass back a partially-uninitialized tupdesc,
which doesn't seem like a safe or useful behavior.

Backpatch to v10 where this code came in.

Discussion: https://postgr.es/m/30830.1544384975@sss.pgh.pa.us
2018-12-10 10:38:48 -05:00
Stephen Frost 96c702c1ed Remove dead code in toast_fetch_datum_slice
In toast_fetch_datum_slice(), we Assert() that what is passed in isn't
compressed, but we then later had a check to see what the length of if
what was passed in is compressed.  That later check is rather confusing
since toast_fetch_datum_slice() is only ever called with non-compressed
datums and the Assert() earlier makes it clear that one shouldn't be
passing in compressed datums.

Add a comment to make it clear that toast_fetch_datum_slice() is just
for non-compressed datums, and remove the dead code.
2018-12-10 09:31:38 -05:00
Michael Paquier 6d8727f95e Ensure cleanup of orphan archive status files
When a WAL segment is recycled, its ".ready" and ".done" status files
get also automatically removed, however this is not done in a durable
manner.  Hence, in a subsequent crash, it could be possible that a
".ready" status file is still around with its corresponding segment
already gone.

If the backend reaches such a state, the archive command would most
likely complain about a segment non-existing and would keep retrying,
causing WAL segments to bloat pg_wal/, potentially making Postgres crash
hard when running out of space.

As status files are removed after each individual segment, using
durable_unlink() does not completely close the window either, as a crash
could happen between the moment the WAL segment is recycled and the
moment its status files are removed.  This has also some performance
impact with the additional fsync() calls needed to make the removal in a
durable manner.  Doing the cleanup at recovery is not cost-free either
as this makes crash recovery potentially take longer than necessary.

So, instead, as per an idea of Stephen Frost, make the archiver aware of
orphan status files and remove them on-the-fly if the corresponding
segment goes missing.  Removal failures follow a model close to what
happens for WAL segments, where multiple attempts are done before giving
up temporarily, and where a successful orphan removal makes the archiver
move immediately to the next WAL segment thought as ready to be
archived.

Author: Michael Paquier
Reviewed-by: Nathan Bossart, Andres Freund, Stephen Frost, Kyotaro
Horiguchi
Discussion: https://postgr.es/m/20180928032827.GF1500@paquier.xyz
2018-12-10 15:00:59 +09:00
Michael Paquier 7fee252f6f Add timestamp of last received message from standby to pg_stat_replication
The timestamp generated by the standby at message transmission has been
included in the protocol since its introduction for both the status
update message and hot standby feedback message, but it has never
appeared in pg_stat_replication.  Seeing this timestamp does not matter
much with a cluster which has a lot of activity, but on a mostly-idle
cluster, this makes monitoring able to react faster than the configured
timeouts.

Author: MyungKyu LIM
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/1657809367.407321.1533027417725.JavaMail.jboss@ep2ml404
2018-12-09 16:35:06 +09:00
Tom Lane 5deadfef28 Fix misapplication of pgstat_count_truncate to wrong relation.
The stanza of ExecuteTruncate[Guts] that truncates a target table's toast
relation re-used the loop local variable "rel" to reference the toast rel.
This was safe enough when written, but commit d42358efb added code below
that that supposed "rel" still pointed to the parent table.  Therefore,
the stats counter update was applied to the wrong relcache entry (the
toast rel not the user rel); and if we were unlucky and that relcache
entry had been flushed during reindex_relation, very bad things could
ensue.

(I'm surprised that CLOBBER_CACHE_ALWAYS testing hasn't found this.
I'm even more surprised that the problem wasn't detected during the
development of d42358efb; it must not have been tested in any case
with a toast table, as the incorrect stats counts are very obvious.)

To fix, replace use of "rel" in that code branch with a more local
variable.  Adjust test cases added by d42358efb so that some of them
use tables with toast tables.

Per bug #15540 from Pan Bian.  Back-patch to 9.5 where d42358efb came in.

Discussion: https://postgr.es/m/15540-01078812338195c0@postgresql.org
2018-12-07 12:11:59 -05:00
Tom Lane 9286ef8e91 Clean up sloppy coding in publicationcmds.c's OpenTableList().
Remove dead code (which would be incorrect if it weren't dead),
per report from Pan Bian.  Add a CHECK_FOR_INTERRUPTS in the
inner loop over child relations, because there's little point
in having one in the outer loop if there's not one here too.
Minor stylistic adjustments and comment improvements.

Seems to be aboriginal to this code (cf commit 665d1fad9).
Back-patch to v10 where that came in, not because any of this
is significant, but just to keep the branches looking similar.

Discussion: https://postgr.es/m/15539-06d00ef6b1e2e1bb@postgresql.org
2018-12-07 11:02:39 -05:00
Michael Paquier 730422afcd Fix some errhint and errdetail strings missing a period
As per the error message style guide of the documentation, those should
be full sentences.

Author: Daniel Gustafsson
Reviewed-by: Michael Paquier, Álvaro Herrera
Discussion: https://1E8D49B4-16BC-4420-B4ED-58501D9E076B@yesql.se
2018-12-07 07:47:42 +09:00
Tom Lane d2b0b60e71 Improve our response to invalid format strings, and detect more cases.
Places that are testing for *printf failure ought to include the format
string in their error reports, since bad-format-string is one of the
more likely causes of such failure.  This both makes it easier to find
and repair the mistake, and provides at least some useful info to the
user who stumbles across such a problem.

Also, tighten snprintf.c to report EINVAL for an invalid flag or
final character in a format %-spec (including the case where the
%-spec is missing a final character altogether).  This seems like
better project policy, and it also allows removing an instruction
or two from the hot code path.

Back-patch the error reporting change in pvsnprintf, since it should be
harmless and may be helpful; but not the snprintf.c change.

Per discussion of bug #15511 from Ertuğrul Kahveci, which reported an
invalid translated format string.  These changes don't fix that error,
but they should improve matters next time we make such a mistake.

Discussion: https://postgr.es/m/15511-1d8b6a0bc874112f@postgresql.org
2018-12-06 15:08:44 -05:00
Stephen Frost a243c55326 Cleanup comments in xlog compression
Skipping over the "hole" in full page images in the XLOG code was
described as being a form of compression, but this got a bit confusing
since we now have PGLZ-based compression happening, so adjust the
wording to discuss "removing" the "hole" and keeping the talk about
compression to where we're talking about using PGLZ-based compression of
the full page images.

Reviewed-By: Kyotaro Horiguchi
Discussion: https://postgr.es/m/20181127234341.GM3415@tamriel.snowman.net
2018-12-06 11:05:39 -05:00
Alvaro Herrera 71a05b2232 Don't mark partitioned indexes invalid unnecessarily
When an indexes is created on a partitioned table using ONLY (don't
recurse to partitions), it gets marked invalid until index partitions
are attached for each table partition.  But there's no reason to do this
if there are no partitions ... and moreover, there's no way to get the
index to become valid afterwards, because all partitions that get
created/attached get their own index partition already attached to the
parent index, so there's no chance to do ALTER INDEX ... ATTACH PARTITION
that would make the parent index valid.

Fix by not marking the index as invalid to begin with.

This is very similar to 9139aa1942, but the pg_dump aspect does not
appear to be relevant until we add FKs that can point to PKs on
partitioned tables.  (I tried to cause the pg_upgrade test to break by
leaving some of these bogus tables around, but wasn't able to.)

Making this change means that an index that was supposed to be invalid
in the insert_conflict regression test is no longer invalid; reorder the
DDL so that the test continues to verify the behavior we want it to.

Author: Álvaro Herrera
Reviewed-by: Amit Langote
Discussion: https://postgr.es/m/20181203225019.2vvdef2ybnkxt364@alvherre.pgsql
2018-12-05 13:31:51 -03:00
Stephen Frost f502fc88b3 Fix typo
Backends don't typically exist uncleanly, but they can certainly exit
uncleanly, and it's exiting uncleanly that's being discussed here.
2018-12-04 11:04:54 -05:00
Michael Paquier ee2b37ae04 Add some missing schema qualifications
This does not improve the security and reliability of the touched areas,
but it makes the style more consistent.

Author: Michael Paquier
Reviewed-by- Noah Misch
Discussion: https://postgr.es/m/20180309075538.GD9376@paquier.xyz
2018-12-03 14:21:52 +09:00
Alvaro Herrera 9dc1225855 Silence compiler warning
My original coding was questionable anyway.

Reported-by: Sergei Kornilov
Discussion: https://postgr.es/m/9645101543575886@myt6-27270b78ac4f.qloud-c.yandex.net
2018-11-30 10:20:49 -03:00
Amit Kapila dcfdf56e89 Fix typo. 2018-11-30 11:50:43 +05:30
Michael Paquier 5c99513975 Fix various checksum check problems for pg_verify_checksums and base backups
Three issues are fixed in this patch:
- Base backups forgot to ignore files specific to EXEC_BACKEND, leading
to spurious warnings when checksums are enabled, per analysis from me.
- pg_verify_checksums forgot about files specific to EXEC_BACKEND,
leading to failures of the tool on any such build, particularly Windows.
This error was originally found by newly-introduced TAP tests in various
buildfarm members using EXEC_BACKEND.
- pg_verify_checksums forgot to count for temporary files and temporary
paths, which could be valid relation files, without checksums, per
report from Andres Freund.  More tests are added to cover this case.

A new test case which emulates corruption for a file in a different
tablespace is added, coming from from Michael Banck, while I have coded
the main code and refactored the test code.

Author: Michael Banck, Michael Paquier
Reviewed-by: Stephen Frost, David Steele
Discussion: https://postgr.es/m/20181021134206.GA14282@paquier.xyz
2018-11-30 10:34:45 +09:00
Alvaro Herrera 88bdbd3f74 Add log_statement_sample_rate parameter
This allows to set a lower log_min_duration_statement value without
incurring excessive log traffic (which reduces performance).  This can
be useful to analyze workloads with lots of short queries.

Author: Adrien Nayrat
Reviewed-by: David Rowley, Vik Fearing
Discussion: https://postgr.es/m/c30ee535-ee1e-db9f-fa97-146b9f62caed@anayrat.info
2018-11-29 18:42:53 -03:00
Thomas Munro 2ac180c286 Fix minor typo in dsa.c.
Author: Takeshi Ideriha
Discussion: https://postgr.es/m/4E72940DA2BF16479384A86D54D0988A6F3BF22D%40G01JPEXMBKW04
2018-11-29 14:14:26 +13:00
Michael Paquier 4c703369af Fix handling of synchronous replication for stopping WAL senders
This fixes an oversight from c6c3334 which forgot that if a subset of
WAL senders are stopping and in a sync state, other WAL senders could
still be waiting for a WAL position to be synced while committing a
transaction.  However the subset of stopping senders would not release
waiters, potentially breaking synchronous replication guarantees.  This
commit makes sure that even WAL senders stopping are able to release
waiters and are tracked properly.

On 9.4, this can also trigger an assertion failure when setting for
example max_wal_senders to 1 where a WAL sender is not able to find
itself as in synchronous state when the instance stops.

Reported-by: Paul Guo
Author: Paul Guo, Michael Paquier
Discussion: https://postgr.es/m/CAEET0ZEv8VFqT3C-cQm6byOB4r4VYWcef1J21dOX-gcVhCSpmA@mail.gmail.com
Backpatch-through: 9.4
2018-11-29 09:12:19 +09:00
Peter Geoghegan 1a990b207b Have BufFileSize() ereport() on FileSize() failure.
Move the responsibility for checking for and reporting a failure from
the only current BufFileSize() caller, logtape.c, to BufFileSize()
itself.  Code within buffile.c is generally responsible for interfacing
with fd.c to report irrecoverable failures.  This seems like a
convention that's worth sticking to.

Reorganizing things this way makes it easy to make the error message
raised in the event of BufFileSize() failure descriptive of the
underlying problem.  We're now clear on the distinction between
temporary file name and BufFile name, and can show errno, confident that
its value actually relates to the error being reported.  In passing, an
existing, similar buffile.c ereport() + errcode_for_file_access() site
is changed to follow the same conventions.

The API of the function BufFileSize() is changed by this commit, despite
already being in a stable release (Postgres 11).  This seems acceptable,
since the BufFileSize() ABI was changed by commit aa55183042, which
hasn't made it into a point release yet.  Besides, it's difficult to
imagine a third party BufFileSize() caller not just raising an error
anyway, since BufFile state should be considered corrupt when
BufFileSize() fails.

Per complaint from Tom Lane.

Discussion: https://postgr.es/m/26974.1540826748@sss.pgh.pa.us
Backpatch: 11-, where shared BufFiles were introduced.
2018-11-28 14:42:54 -08:00
Peter Eisentraut f2cbffc7a6 Only allow one recovery target setting
The previous recovery.conf regime accepted multiple recovery_target*
settings and used the last one.  This does not translate well to the
general GUC system.  Specifically, under EXEC_BACKEND, the settings
are written out not in any particular order, so the order in which
they were originally set is not available to new processes.

Rather than redesign the GUC system, it was decided to abandon the old
behavior and only allow one recovery target setting.  A second setting
will cause an error.  However, it is allowed to set the same parameter
multiple times or unset a parameter and set a different one.

Discussion: https://www.postgresql.org/message-id/flat/27802171543235530%40iva2-6ec8f0a6115e.qloud-c.yandex.net#701a59c837ad0bf8c244344aaf3ef5a4
2018-11-28 13:55:54 +01:00
Thomas Munro 0f9cdd7dca Don't set PAM_RHOST for Unix sockets.
Since commit 2f1d2b7a we have set PAM_RHOST to "[local]" for Unix
sockets.  This caused Linux PAM's libaudit integration to make DNS
requests for that name.  It's not exactly clear what value PAM_RHOST
should have in that case, but it seems clear that we shouldn't set it
to an unresolvable name, so don't do that.

Back-patch to 9.6.  Bug #15520.

Author: Thomas Munro
Reviewed-by: Peter Eisentraut
Reported-by: Albert Schabhuetl
Discussion: https://postgr.es/m/15520-4c266f986998e1c5%40postgresql.org
2018-11-28 14:12:30 +13:00
Tomas Vondra f69c959df0 Do not decode TOAST data for table rewrites
During table rewrites (VACUUM FULL and CLUSTER), the main heap is logged
using XLOG / FPI records, and thus (correctly) ignored in decoding.
But the associated TOAST table is WAL-logged as plain INSERT records,
and so was logically decoded and passed to reorder buffer.

That has severe consequences with TOAST tables of non-trivial size.
Firstly, reorder buffer has to keep all those changes, possibly spilling
them to a file, incurring I/O costs and disk space.

Secondly, ReoderBufferCommit() was stashing all those TOAST chunks into
a hash table, which got discarded only after processing the row from the
main heap.  But as the main heap is not decoded for rewrites, this never
happened, so all the TOAST data accumulated in memory, resulting either
in excessive memory consumption or OOM.

The fix is simple, as commit e9edc1ba already introduced infrastructure
(namely HEAP_INSERT_NO_LOGICAL flag) to skip logical decoding of TOAST
tables, but it only applied it to system tables.  So simply use it for
all TOAST data in raw_heap_insert().

That would however solve only the memory consumption issue - the TOAST
changes would still be decoded and added to the reorder buffer, and
spilled to disk (although without TOAST tuple data, so much smaller).
But we can solve that by tweaking DecodeInsert() to just ignore such
INSERT records altogether, using XLH_INSERT_CONTAINS_NEW_TUPLE flag,
instead of skipping them later in ReorderBufferCommit().

Review: Masahiko Sawada
Discussion: https://www.postgresql.org/message-id/flat/1a17c643-e9af-3dba-486b-fbe31bc1823a%402ndquadrant.com
Backpatch: 9.4-, where logical decoding was introduced
2018-11-28 01:43:08 +01:00
Thomas Munro d67dae036b Don't count zero-filled buffers as 'read' in EXPLAIN.
If you extend a relation, it should count as a block written, not
read (we write a zero-filled block).  If you ask for a zero-filled
buffer, it shouldn't be counted as read or written.

Later we might consider counting zero-filled buffers with a separate
counter, if they become more common due to future work.

Author: Thomas Munro
Reviewed-by: Haribabu Kommi, Kyotaro Horiguchi, David Rowley
Discussion: https://postgr.es/m/CAEepm%3D3JytB3KPpvSwXzkY%2Bdwc5zC8P8Lk7Nedkoci81_0E9rA%40mail.gmail.com
2018-11-28 11:58:10 +13:00
Andres Freund b238527664 Fix jit compilation bug on wide tables.
The function generated to perform JIT compiled tuple deforming failed
when HeapTupleHeader's t_hoff was bigger than a signed int8. I'd
failed to realize that LLVM's getelementptr would treat an int8 index
argument as signed, rather than unsigned.  That means that a hoff
larger than 127 would result in a negative offset being applied.  Fix
that by widening the index to 32bit.

Add a testcase with a wide table. Don't drop it, as it seems useful to
verify other tools deal properly with wide tables.

Thanks to Justin Pryzby for both reporting a bug and then reducing it
to a reproducible testcase!

Reported-By: Justin Pryzby
Author: Andres Freund
Discussion: https://postgr.es/m/20181115223959.GB10913@telsasoft.com
Backpatch: 11, just as jit compilation was
2018-11-27 10:07:03 -08:00
Peter Eisentraut 2dedf4d9a8 Integrate recovery.conf into postgresql.conf
recovery.conf settings are now set in postgresql.conf (or other GUC
sources).  Currently, all the affected settings are PGC_POSTMASTER;
this could be refined in the future case by case.

Recovery is now initiated by a file recovery.signal.  Standby mode is
initiated by a file standby.signal.  The standby_mode setting is
gone.  If a recovery.conf file is found, an error is issued.

The trigger_file setting has been renamed to promote_trigger_file as
part of the move.

The documentation chapter "Recovery Configuration" has been integrated
into "Server Configuration".

pg_basebackup -R now appends settings to postgresql.auto.conf and
creates a standby.signal file.

Author: Fujii Masao <masao.fujii@gmail.com>
Author: Simon Riggs <simon@2ndquadrant.com>
Author: Abhijit Menon-Sen <ams@2ndquadrant.com>
Author: Sergei Kornilov <sk@zsrv.org>
Discussion: https://www.postgresql.org/message-id/flat/607741529606767@web3g.yandex.ru/
2018-11-25 16:33:40 +01:00
Thomas Munro ab69ea9fee Fix assertion failure for SSL connections.
Commit cfdf4dc4 added an assertion that every WaitLatch() or similar
handles postmaster death.  One place did not, but was missed in
review and testing due to the need for an SSL connection.  Fix, by
asking for WL_EXIT_ON_PM_DEATH.

Reported-by: Christoph Berg
Discussion: https://postgr.es/m/20181124143845.GA15039%40msg.df7cb.de
2018-11-25 18:34:58 +13:00
Tom Lane cbdb8b4c01 Fix float-to-integer coercions to handle edge cases correctly.
ftoi4 and its sibling coercion functions did their overflow checks in
a way that looked superficially plausible, but actually depended on an
assumption that the MIN and MAX comparison constants can be represented
exactly in the float4 or float8 domain.  That fails in ftoi4, ftoi8,
and dtoi8, resulting in a possibility that values near the MAX limit will
be wrongly converted (to negative values) when they need to be rejected.

Also, because we compared before rounding off the fractional part,
the other three functions threw errors for values that really ought
to get rounded to the min or max integer value.

Fix by doing rint() first (requiring an assumption that it handles
NaN and Inf correctly; but dtoi8 and ftoi8 were assuming that already),
and by comparing to values that should coerce to float exactly, namely
INTxx_MIN and -INTxx_MIN.  Also remove some random cosmetic discrepancies
between these six functions.

Per bug #15519 from Victor Petrovykh.  This should get back-patched,
but first let's see what the buildfarm thinks of it --- I'm not too
sure about portability of some of the regression test cases.

Patch by me; thanks to Andrew Gierth for analysis and discussion.

Discussion: https://postgr.es/m/15519-4fc785b483201ff1@postgresql.org
2018-11-23 20:57:11 -05:00
Tom Lane a314c34079 Clamp semijoin selectivity to be not more than inner-join selectivity.
We should never estimate the output of a semijoin to be more rows than
we estimate for an inner join with the same input rels and join condition;
it's obviously impossible for that to happen.  However, given the
relatively poor quality of our semijoin selectivity estimates ---
particularly, but not only, in cases where we punt and return a default
estimate --- we did often deliver such estimates.  To improve matters,
calculate both estimates inside eqjoinsel() and take the smaller one.

The bulk of this patch is just mechanical refactoring to avoid repetitive
information lookup when we call both eqjoinsel_semi and eqjoinsel_inner.
The actual new behavior is just

	selec = Min(selec, inner_rel->rows * selec_inner);

which looks a bit odd but is correct because of our different definitions
for inner and semi join selectivity.

There is one ensuing plan change in the regression tests, but it looks
reasonable enough (and checking the actual row counts shows that the
estimate moved closer to reality, not further away).

Per bug #15160 from Alexey Ermakov.  Although this is arguably a bug fix,
I won't risk destabilizing plan choices in stable branches by
back-patching.

Tom Lane, reviewed by Melanie Plageman

Discussion: https://postgr.es/m/152395805004.19366.3107109716821067806@wrigleys.postgresql.org
2018-11-23 12:48:49 -05:00
Alvaro Herrera 3be5fe2b10 Silence compiler warnings
Commit cfdf4dc4fc left a few unnecessary assignments, one of which
caused compiler warnings, as reported by Erik Rijkers.  Remove them all.

Discussion: https://postgr.es/m/df0dcca2025b3d90d946ecc508ca9678@xs4all.nl
2018-11-23 13:01:05 -03:00
Alvaro Herrera de38ce1b83 Don't allow partitioned indexes in pg_global tablespace
Missing in dfa6081419.

Author: David Rowley
Discussion: https://postgr.es/m/CAKJS1f-M3NMTCpv=vDfkoqHbMPFf=3-Z1ud=+1DHH00tC+zLaQ@mail.gmail.com
2018-11-23 08:48:20 -03:00
Thomas Munro cfdf4dc4fc Add WL_EXIT_ON_PM_DEATH pseudo-event.
Users of the WaitEventSet and WaitLatch() APIs can now choose between
asking for WL_POSTMASTER_DEATH and then handling it explicitly, or asking
for WL_EXIT_ON_PM_DEATH to trigger immediate exit on postmaster death.
This reduces code duplication, since almost all callers want the latter.

Repair all code that was previously ignoring postmaster death completely,
or requesting the event but ignoring it, or requesting the event but then
doing an unconditional PostmasterIsAlive() call every time through its
event loop (which is an expensive syscall on platforms for which we don't
have USE_POSTMASTER_DEATH_SIGNAL support).

Assert that callers of WaitLatchXXX() under the postmaster remember to
ask for either WL_POSTMASTER_DEATH or WL_EXIT_ON_PM_DEATH, to prevent
future bugs.

The only process that doesn't handle postmaster death is syslogger.  It
waits until all backends holding the write end of the syslog pipe
(including the postmaster) have closed it by exiting, to be sure to
capture any parting messages.  By using the WaitEventSet API directly
it avoids the new assertion, and as a by-product it may be slightly
more efficient on platforms that have epoll().

Author: Thomas Munro
Reviewed-by: Kyotaro Horiguchi, Heikki Linnakangas, Tom Lane
Discussion: https://postgr.es/m/CAEepm%3D1TCviRykkUb69ppWLr_V697rzd1j3eZsRMmbXvETfqbQ%40mail.gmail.com,
            https://postgr.es/m/CAEepm=2LqHzizbe7muD7-2yHUbTOoF7Q+qkSD5Q41kuhttRTwA@mail.gmail.com
2018-11-23 20:46:34 +13:00
Tom Lane eba2ce1712 Fix another crash in json{b}_populate_recordset and json{b}_to_recordset.
populate_recordset_worker() failed to consider the possibility that the
supplied JSON data contains no rows, so that update_cached_tupdesc never
got called.  This led to a null-pointer dereference since commit 9a5e8ed28;
before that it led to a bogus "set-valued function called in context that
cannot accept a set" error.  Fix by forcing the update to happen.

Per bug #15514.  Back-patch to v11 as 9a5e8ed28 was.  (If we were excited
about the bogus error, we could perhaps go back further, but it'd take more
work to figure out how to fix it in older branches.  Given the lack of
field complaints about that aspect, I'm not excited.)

Discussion: https://postgr.es/m/15514-59d5b4c4065b178b@postgresql.org
2018-11-22 15:14:01 -05:00
Michael Paquier 25c026c284 Fix typo in description of ExecFindPartition
Author: Amit Langote
Discussion: https://postgr.es/m/CA+HiwqHg0=UL+Dhh3gpiwYNA=ufk9Lb7GQ2c=5rs=ZmVTP7xAw@mail.gmail.com
2018-11-22 13:23:54 +09:00
Alvaro Herrera 03e10b962f Fix typo in commit 6f7d02aa60
Per pink buildfarm.
2018-11-21 15:35:40 -03:00
Alvaro Herrera ee07e38c14 Fix PartitionDispatchData vertical whitespace
Per David Rowley
Discussion: https://postgr.es/m/CAKJS1f-MstvBWdkOzACsOHyBgj2oXcBM8kfv+NhVe-Ux-wq9Sg@mail.gmail.com
2018-11-21 15:04:25 -03:00
Alvaro Herrera 6f7d02aa60 instr_time.h: add INSTR_TIME_SET_CURRENT_LAZY
Sets the timestamp to current if not already set.  Will acquire more
callers momentarily.

Author: Fabien Coelho
Discussion: https://postgr.es/m/alpine.DEB.2.21.1808111104320.1705@lancre
2018-11-21 15:04:25 -03:00
Andres Freund 578b229718 Remove WITH OIDS support, change oid catalog column visibility.
Previously tables declared WITH OIDS, including a significant fraction
of the catalog tables, stored the oid column not as a normal column,
but as part of the tuple header.

This special column was not shown by default, which was somewhat odd,
as it's often (consider e.g. pg_class.oid) one of the more important
parts of a row.  Neither pg_dump nor COPY included the contents of the
oid column by default.

The fact that the oid column was not an ordinary column necessitated a
significant amount of special case code to support oid columns. That
already was painful for the existing, but upcoming work aiming to make
table storage pluggable, would have required expanding and duplicating
that "specialness" significantly.

WITH OIDS has been deprecated since 2005 (commit ff02d0a05280e0).
Remove it.

Removing includes:
- CREATE TABLE and ALTER TABLE syntax for declaring the table to be
  WITH OIDS has been removed (WITH (oids[ = true]) will error out)
- pg_dump does not support dumping tables declared WITH OIDS and will
  issue a warning when dumping one (and ignore the oid column).
- restoring an pg_dump archive with pg_restore will warn when
  restoring a table with oid contents (and ignore the oid column)
- COPY will refuse to load binary dump that includes oids.
- pg_upgrade will error out when encountering tables declared WITH
  OIDS, they have to be altered to remove the oid column first.
- Functionality to access the oid of the last inserted row (like
  plpgsql's RESULT_OID, spi's SPI_lastoid, ...) has been removed.

The syntax for declaring a table WITHOUT OIDS (or WITH (oids = false)
for CREATE TABLE) is still supported. While that requires a bit of
support code, it seems unnecessary to break applications / dumps that
do not use oids, and are explicit about not using them.

The biggest user of WITH OID columns was postgres' catalog. This
commit changes all 'magic' oid columns to be columns that are normally
declared and stored. To reduce unnecessary query breakage all the
newly added columns are still named 'oid', even if a table's column
naming scheme would indicate 'reloid' or such.  This obviously
requires adapting a lot code, mostly replacing oid access via
HeapTupleGetOid() with access to the underlying Form_pg_*->oid column.

The bootstrap process now assigns oids for all oid columns in
genbki.pl that do not have an explicit value (starting at the largest
oid previously used), only oids assigned later by oids will be above
FirstBootstrapObjectId. As the oid column now is a normal column the
special bootstrap syntax for oids has been removed.

Oids are not automatically assigned during insertion anymore, all
backend code explicitly assigns oids with GetNewOidWithIndex(). For
the rare case that insertions into the catalog via SQL are called for
the new pg_nextoid() function can be used (which only works on catalog
tables).

The fact that oid columns on system tables are now normal columns
means that they will be included in the set of columns expanded
by * (i.e. SELECT * FROM pg_class will now include the table's oid,
previously it did not). It'd not technically be hard to hide oid
column by default, but that'd mean confusing behavior would either
have to be carried forward forever, or it'd cause breakage down the
line.

While it's not unlikely that further adjustments are needed, the
scope/invasiveness of the patch makes it worthwhile to get merge this
now. It's painful to maintain externally, too complicated to commit
after the code code freeze, and a dependency of a number of other
patches.

Catversion bump, for obvious reasons.

Author: Andres Freund, with contributions by John Naylor
Discussion: https://postgr.es/m/20180930034810.ywp2c7awz7opzcfr@alap3.anarazel.de
2018-11-20 16:00:17 -08:00
Peter Eisentraut ea8bc349bd Make detection of SSL_CTX_set_min_proto_version more portable
As already explained in configure.in, using the OpenSSL version number
to detect presence of functions doesn't work, because LibreSSL reports
incompatible version numbers.  Fortunately, the functions we need here
are actually macros, so we can just test for them directly.
2018-11-20 22:59:36 +01:00
Peter Eisentraut e73e67c719 Add settings to control SSL/TLS protocol version
For example:

    ssl_min_protocol_version = 'TLSv1.1'
    ssl_max_protocol_version = 'TLSv1.2'

Reviewed-by: Steve Singer <steve@ssinger.info>
Discussion: https://www.postgresql.org/message-id/flat/1822da87-b862-041a-9fc2-d0310c3da173@2ndquadrant.com
2018-11-20 22:12:10 +01:00
Peter Eisentraut 2d9140ed26 Make WAL description output more consistent
The output for record types XLOG_DBASE_CREATE and XLOG_DBASE_DROP used
the order dbid/tablespaceid, whereas elsewhere the order is
tablespaceid/dbid[/relfilenodeid].  Flip the order for those two types
to make it consistent.

Author: Jean-Christophe Arnu <jcarnu@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAHZmTm18Ln62KW-G8NYvO1wbBL3QU1E76Zep=DuHmg-zS2XFAg@mail.gmail.com/
2018-11-20 13:30:01 +01:00
Peter Eisentraut a568cadaff Refine some guc.c help texts
These settings apply to communication with the sending server, which
is not necessarily a primary.

Author: Sergei Kornilov <sk@zsrv.org>
2018-11-20 06:38:34 +01:00
Tom Lane cb09903fe6 Add needed #include.
Per POSIX, WIFSIGNALED and related macros are provided by <sys/wait.h>.
Apparently on Linux they're also pulled in by some other inclusions,
but BSD-ish systems are pickier.  Fixes portability issue in ffa4cbd62.

Per buildfarm.
2018-11-19 17:28:04 -05:00
Tom Lane ffa4cbd623 Handle EPIPE more sanely when we close a pipe reading from a program.
Previously, any program launched by COPY TO/FROM PROGRAM inherited the
server's setting of SIGPIPE handling, i.e. SIG_IGN.  Hence, if we were
doing COPY FROM PROGRAM and closed the pipe early, the child process
would see EPIPE on its output file and typically would treat that as
a fatal error, in turn causing the COPY to report error.  Similarly,
one could get a failure report from a query that didn't read all of
the output from a contrib/file_fdw foreign table that uses file_fdw's
PROGRAM option.

To fix, ensure that child programs inherit SIG_DFL not SIG_IGN processing
of SIGPIPE.  This seems like an all-around better situation since if
the called program wants some non-default treatment of SIGPIPE, it would
expect to have to set that up for itself.  Then in COPY, if it's COPY
FROM PROGRAM and we stop reading short of detecting EOF, treat a SIGPIPE
exit from the called program as a non-error condition.  This still allows
us to report an error for any case where the called program gets SIGPIPE
on some other file descriptor.

As coded, we won't report a SIGPIPE if we stop reading as a result of
seeing an in-band EOF marker (e.g. COPY BINARY EOF marker).  It's
somewhat debatable whether we should complain if the called program
continues to transmit data after an EOF marker.  However, it seems like
we should avoid throwing error in any questionable cases, especially in a
back-patched fix, and anyway it would take additional code to make such
an error get reported consistently.

Back-patch to v10.  We could go further back, since COPY FROM PROGRAM
has been around awhile, but AFAICS the only way to reach this situation
using core or contrib is via file_fdw, which has only supported PROGRAM
sources since v10.  The COPY statement per se has no feature whereby
it'd stop reading without having hit EOF or an error already.  Therefore,
I don't see any upside to back-patching further that'd outweigh the
risk of complaints about behavioral change.

Per bug #15449 from Eric Cyr.

Patch by me, review by Etsuro Fujita and Kyotaro Horiguchi

Discussion: https://postgr.es/m/15449-1cf737dd5929450e@postgresql.org
2018-11-19 17:02:39 -05:00
Robert Haas 7ee5f88e65 Reduce unnecessary list construction in RelationBuildPartitionDesc.
The 'partoids' list which was constructed by the previous version
of this code was necessarily identical to 'inhoids'.  There's no
point to duplicating the list, so avoid that.  Instead, construct
the array representation directly from the original 'inhoids' list.

Also, use an array rather than a list for 'boundspecs'.  We know
exactly how many items we need to store, so there's really no
reason to use a list.  Using an array instead reduces the number
of memory allocations we perform.

Patch by me, reviewed by Michael Paquier and Amit Langote, the
latter of whom also helped with rebasing.
2018-11-19 12:10:41 -05:00
Alvaro Herrera 5c9a5513a3 Disallow COPY FREEZE on partitioned tables
This didn't actually work: COPY would fail to flush the right files, and
instead would try to flush a non-existing file, causing the whole
transaction to fail.

Cope by raising an error as soon as the command is sent instead, to
avoid a nasty later surprise.  Of course, it would be much better to
make it work, but we don't have a patch for that yet, and we don't know
if we'll want to backpatch one when we do.

Reported-by: Tomas Vondra
Author: David Rowley
Reviewed-by: Amit Langote, Steve Singer, Tomas Vondra
2018-11-19 11:16:28 -03:00
Thomas Munro 9ccdd7f66e PANIC on fsync() failure.
On some operating systems, it doesn't make sense to retry fsync(),
because dirty data cached by the kernel may have been dropped on
write-back failure.  In that case the only remaining copy of the
data is in the WAL.  A subsequent fsync() could appear to succeed,
but not have flushed the data.  That means that a future checkpoint
could apparently complete successfully but have lost data.

Therefore, violently prevent any future checkpoint attempts by
panicking on the first fsync() failure.  Note that we already
did the same for WAL data; this change extends that behavior to
non-temporary data files.

Provide a GUC data_sync_retry to control this new behavior, for
users of operating systems that don't eject dirty data, and possibly
forensic/testing uses.  If it is set to on and the write-back error
was transient, a later checkpoint might genuinely succeed (on a
system that does not throw away buffers on failure); if the error is
permanent, later checkpoints will continue to fail.  The GUC defaults
to off, meaning that we panic.

Back-patch to all supported releases.

There is still a narrow window for error-loss on some operating
systems: if the file is closed and later reopened and a write-back
error occurs in the intervening time, but the inode has the bad
luck to be evicted due to memory pressure before we reopen, we could
miss the error.  A later patch will address that with a scheme
for keeping files with dirty data open at all times, but we judge
that to be too complicated to back-patch.

Author: Craig Ringer, with some adjustments by Thomas Munro
Reported-by: Craig Ringer
Reviewed-by: Robert Haas, Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
2018-11-19 17:41:26 +13:00
Thomas Munro 1556cb2fc5 Don't forget about failed fsync() requests.
If fsync() fails, md.c must keep the request in its bitmap, so that
future attempts will try again.

Back-patch to all supported releases.

Author: Thomas Munro
Reviewed-by: Amit Kapila
Reported-by: Andrew Gierth
Discussion: https://postgr.es/m/87y3i1ia4w.fsf%40news-spur.riddles.org.uk
2018-11-19 17:41:26 +13:00
Michael Paquier 285bd0ac4a Remove unnecessary memcpy when reading WAL record fitting on page
When reading a WAL record, its contents are copied into an intermediate
buffer.  However, doing so is not necessary if the record fits fully
into the current page, saving one memcpy for each such record.  The
allocation handling of the intermediate buffer is also now done only
when a record crosses a page boundary, shaving some extra cycles when
reading a WAL record.

Author: Andrey Lepikhov
Reviewed-by: Kyotaro Horiguchi, Heikki Linnakangas
Discussion: https://postgr.es/m/c2ea54dd-a1d3-80eb-ddbf-7e6f258e615e@postgrespro.ru
2018-11-19 10:25:48 +09:00
Tom Lane 37afc079ab Avoid defining SIGTTIN/SIGTTOU on Windows.
Setting them to SIG_IGN seems unlikely to have any beneficial effect
on that platform, and given the signal numbering collision with SIGABRT,
it could easily have bad effects.

Given the lack of field complaints that can be traced to this, I don't
presently feel a need to back-patch.

Discussion: https://postgr.es/m/5627.1542477392@sss.pgh.pa.us
2018-11-17 16:31:16 -05:00
Tom Lane 125f551c8b Leave SIGTTIN/SIGTTOU signal handling alone in postmaster child processes.
For reasons lost in the mists of time, most postmaster child processes
reset SIGTTIN/SIGTTOU signal handling to SIG_DFL, with the major exception
that backend sessions do not.  It seems like a pretty bad idea for any
postmaster children to do that: if stderr is connected to the terminal,
and the user has put the postmaster in background, any log output would
result in the child process freezing up.  Hence, switch them all to
doing what backends do, ie, nothing.  This allows them to inherit the
postmaster's SIG_IGN setting.  On the other hand, manually-launched
processes such as standalone backends will have default processing,
which seems fine.

In passing, also remove useless resets of SIGCONT and SIGWINCH signal
processing.  Perhaps the postmaster once changed those to something
besides SIG_DFL, but it doesn't now, so these are just wasted (and
confusing) syscalls.

Basically, this propagates the changes made in commit 8e2998d8a from
backends to other postmaster children.  Probably the only reason these
calls now exist elsewhere is that I missed changing pgstat.c along with
postgres.c at the time.

Given the lack of field complaints that can be traced to this, I don't
presently feel a need to back-patch.

Discussion: https://postgr.es/m/5627.1542477392@sss.pgh.pa.us
2018-11-17 16:31:16 -05:00
Andres Freund 73616126b4 Fix some spurious new compiler warnings in MSVC.
Per buildfarm animal bowerbird.

Discussion: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2018-11-17%2002%3A30%3A20
2018-11-17 11:41:14 -08:00
Andres Freund 4da597edf1 Make TupleTableSlots extensible, finish split of existing slot type.
This commit completes the work prepared in 1a0586de36, splitting the
old TupleTableSlot implementation (which could store buffer, heap,
minimal and virtual slots) into four different slot types.  As
described in the aforementioned commit, this is done with the goal of
making tuple table slots extensible, to allow for pluggable table
access methods.

To achieve runtime extensibility for TupleTableSlots, operations on
slots that can differ between types of slots are performed using the
TupleTableSlotOps struct provided at slot creation time.  That
includes information from the size of TupleTableSlot struct to be
allocated, initialization, deforming etc.  See the struct's definition
for more detailed information about callbacks TupleTableSlotOps.

I decided to rename TTSOpsBufferTuple to TTSOpsBufferHeapTuple and
ExecCopySlotTuple to ExecCopySlotHeapTuple, as that seems more
consistent with other naming introduced in recent patches.

There's plenty optimization potential in the slot implementation, but
according to benchmarking the state after this commit has similar
performance characteristics to before this set of changes, which seems
sufficient.

There's a few changes in execReplication.c that currently need to poke
through the slot abstraction, that'll be repaired once the pluggable
storage patchset provides the necessary infrastructure.

Author: Andres Freund and  Ashutosh Bapat, with changes by Amit Khandekar
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
2018-11-16 16:35:15 -08:00
Alvaro Herrera 0201d79a55 Avoid re-typedef'ing PartitionTupleRouting
Apparently, gcc on macOS (?) doesn't like it.

Per buildfarm.
2018-11-16 16:55:53 -03:00
Andres Freund a7aa608e0f Inline hot path of slot_getsomeattrs().
This yields a minor speedup, which roughly balances the loss from the
upcoming introduction of callbacks to do some operations on slots.

Author: Andres Freund
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
2018-11-16 10:29:01 -08:00
Alvaro Herrera 3f2393edef Redesign initialization of partition routing structures
This speeds up write operations (INSERT, UPDATE, DELETE, COPY, as well
as the future MERGE) on partitioned tables.

This changes the setup for tuple routing so that it does far less work
during the initial setup and pushes more work out to when partitions
receive tuples.  PartitionDispatchData structs for sub-partitioned
tables are only created when a tuple gets routed through it.  The
possibly large arrays in the PartitionTupleRouting struct have largely
been removed.  The partitions[] array remains but now never contains any
NULL gaps.  Previously the NULLs had to be skipped during
ExecCleanupTupleRouting(), which could add a large overhead to the
cleanup when the number of partitions was large.  The partitions[] array
is allocated small to start with and only enlarged when we route tuples
to enough partitions that it runs out of space. This allows us to keep
simple single-row partition INSERTs running quickly.  Redesign

The arrays in PartitionTupleRouting which stored the tuple translation maps
have now been removed.  These have been moved out into a
PartitionRoutingInfo struct which is an additional field in ResultRelInfo.

The find_all_inheritors() call still remains by far the slowest part of
ExecSetupPartitionTupleRouting(). This commit just removes the other slow
parts.

In passing also rename the tuple translation maps from being ParentToChild
and ChildToParent to being RootToPartition and PartitionToRoot. The old
names mislead you into thinking that a partition of some sub-partitioned
table would translate to the rowtype of the sub-partitioned table rather
than the root partitioned table.

Authors: David Rowley and Amit Langote, heavily revised by Álvaro Herrera
Testing help from Jesper Pedersen and Kato Sho.
Discussion: https://postgr.es/m/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz=GVBwvGh4a6xA@mail.gmail.com
2018-11-16 15:01:05 -03:00
Andres Freund a387a3dff9 Fix slot type assumptions for nodeGather[Merge].
The assumption made in 1a0586de36 was wrong, as evidenced by
buildfarm failure on locust, which runs with
force_parallel_mode=regress.  The tuples accessed in either nodes are
in the outer slot, and we can't trivially rely on the slot type being
known because the leader might execute the subsidiary node directly,
or via the tuple queue on a worker. In the latter case the tuple will
always be a heaptuple slot, but in the former, it'll be whatever the
subsidiary node returns.
2018-11-15 23:16:41 -08:00
Andres Freund 7ef04e4d2c Don't generate tuple deforming functions for virtual slots.
Virtual tuple table slots never need tuple deforming. Therefore, if we
know at expression compilation time, that a certain slot will always
be virtual, there's no need to create a tuple deforming routine for
it.

Author: Andres Freund
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
2018-11-15 22:00:30 -08:00
Andres Freund 15d8f83128 Verify that expected slot types match returned slot types.
This is important so JIT compilation knows what kind of tuple slot the
deforming routine can expect. There's also optimization potential for
expression initialization without JIT compilation. It e.g. seems
plausible to elide EEOP_*_FETCHSOME ops entirely when dealing with
virtual slots.

Author: Andres Freund
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
2018-11-15 22:00:30 -08:00
Andres Freund 675af5c01e Compute information about EEOP_*_FETCHSOME at expression init time.
Previously this information was computed when JIT compiling an
expression.  But the information is useful for assertions in the
non-JIT case too (for assertions), therefore it makes sense to move
it.

This will, in a followup commit, allow to treat different slot types
differently. E.g. for virtual slots there's no need to generate a JIT
function to deform the slot.

Author: Andres Freund
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
2018-11-15 22:00:30 -08:00
Andres Freund 1a0586de36 Introduce notion of different types of slots (without implementing them).
Upcoming work intends to allow pluggable ways to introduce new ways of
storing table data. Accessing those table access methods from the
executor requires TupleTableSlots to be carry tuples in the native
format of such storage methods; otherwise there'll be a significant
conversion overhead.

Different access methods will require different data to store tuples
efficiently (just like virtual, minimal, heap already require fields
in TupleTableSlot). To allow that without requiring additional pointer
indirections, we want to have different structs (embedding
TupleTableSlot) for different types of slots.  Thus different types of
slots are needed, which requires adapting creators of slots.

The slot that most efficiently can represent a type of tuple in an
executor node will often depend on the type of slot a child node
uses. Therefore we need to track the type of slot is returned by
nodes, so parent slots can create slots based on that.

Relatedly, JIT compilation of tuple deforming needs to know which type
of slot a certain expression refers to, so it can create an
appropriate deforming function for the type of tuple in the slot.

But not all nodes will only return one type of slot, e.g. an append
node will potentially return different types of slots for each of its
subplans.

Therefore add function that allows to query the type of a node's
result slot, and whether it'll always be the same type (whether it's
fixed). This can be queried using ExecGetResultSlotOps().

The scan, result, inner, outer type of slots are automatically
inferred from ExecInitScanTupleSlot(), ExecInitResultSlot(),
left/right subtrees respectively. If that's not correct for a node,
that can be overwritten using new fields in PlanState.

This commit does not introduce the actually abstracted implementation
of different kind of TupleTableSlots, that will be left for a followup
commit.  The different types of slots introduced will, for now, still
use the same backing implementation.

While this already partially invalidates the big comment in
tuptable.h, it seems to make more sense to update it later, when the
different TupleTableSlot implementations actually exist.

Author: Ashutosh Bapat and Andres Freund, with changes by Amit Khandekar
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
2018-11-15 22:00:30 -08:00
Andres Freund 763f2edd92 Rejigger materializing and fetching a HeapTuple from a slot.
Previously materializing a slot always returned a HeapTuple. As
current work aims to reduce the reliance on HeapTuples (so other
storage systems can work efficiently), that needs to change. Thus
split the tasks of materializing a slot (i.e. making it independent
from the underlying storage / other memory contexts) from fetching a
HeapTuple from the slot.  For brevity, allow to fetch a HeapTuple from
a slot and materializing the slot at the same time, controlled by a
parameter.

For now some callers of ExecFetchSlotHeapTuple, with materialize =
true, expect that changes to the heap tuple will be reflected in the
underlying slot.  Those places will be adapted in due course, so while
not pretty, that's OK for now.

Also rename ExecFetchSlotTuple to ExecFetchSlotHeapTupleDatum and
ExecFetchSlotTupleDatum to ExecFetchSlotHeapTupleDatum, as it's likely
that future storage methods will need similar methods. There already
is ExecFetchSlotMinimalTuple, so the new names make the naming scheme
more coherent.

Author: Ashutosh Bapat and Andres Freund, with changes by Amit Khandekar
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
2018-11-15 14:31:12 -08:00
Peter Eisentraut 86a4819f69 Update executor documentation for run-time partition pruning
With run-time partition pruning, there is no longer necessarily an
executor node for each corresponding plan node.

Author: David Rowley <david.rowley@2ndquadrant.com>
2018-11-15 23:02:21 +01:00
Andres Freund c058fc2a2b Rationalize expression context reset in ExecModifyTable().
The current pattern of reseting expressions both in
ExecProcessReturning() and ExecOnConflictUpdate() makes it harder than
necessary to reason about memory lifetimes.  It also requires
materializing slots unnecessarily, although this patch doesn't take
advantage of the fact that that's not necessary anymore.

Instead reset the expression context once for each input tuple.

Author: Ashutosh Bapat
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
2018-11-15 11:40:50 -08:00
Tom Lane 34c9e455d0 Improve performance of partition pruning remapping a little.
ExecFindInitialMatchingSubPlans has to update the PartitionPruneState's
subplan mapping data to account for the removal of subplans it prunes.
However, that's only necessary if run-time pruning will also occur,
so we can skip it when that won't happen, which should result in not
needing to do the remapping in many cases.  (We now need it only when
some partitions are potentially startup-time prunable and others are
potentially run-time prunable, which seems like an unusual case.)

Also make some marginal performance improvements in the remapping
itself.  These will mainly win if most partitions got pruned by
the startup-time pruning, which is perhaps a debatable assumption
in this context.

Also fix some bogus comments, and rearrange code to marginally
reduce space consumption in the executor's query-lifespan context.

David Rowley, reviewed by Yoshikazu Imai

Discussion: https://postgr.es/m/CAKJS1f9+m6-di-zyy4B4AGn0y1B9F8UKDRigtBbNviXOkuyOpw@mail.gmail.com
2018-11-15 13:34:16 -05:00
Alvaro Herrera 74514bd4a5 geo_ops.c: Clarify comments and function arguments
These functions were not crystal clear about what their respective APIs
are.  Make an effort to improve that.

Emre's patch was correct AFAICT, but I (Álvaro) felt the need to improve
a few comments a bit more.  Any resulting errors are my own.

Per complaint from Coverity, Ning Yu, and Tom Lane.

Author: Emre Hasegeli, Álvaro Herrera
Reviewed-by: Tomas Vondra, Álvaro Herrera
Discussion: https://postgr.es/m/26769.1533090136@sss.pgh.pa.us
2018-11-15 12:08:03 -03:00
Thomas Munro ab8984f52d Further adjustment to random() seed initialization.
Per complaint from Tom Lane, don't chomp the timestamp at 32 bits, so we
can shift in some of its higher bits.

Discussion: https://postgr.es/m/14712.1542253115%40sss.pgh.pa.us
2018-11-15 17:38:55 +13:00
Thomas Munro 5b0ce3ec33 Increase the number of possible random seeds per time period.
Commit 197e4af9 refactored the initialization of the libc random()
seed, but reduced the number of possible seeds values that could be
chosen in a given time period.  This negation of the effects of
commit 98c50656c was unintentional.  Replace with code that
shifts the fast moving timestamp bits left, similar to the original
algorithm (though not the previous float-tolerating coding, which
is no longer necessary).

Author: Thomas Munro
Reported-by: Noah Misch
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/20181112083358.GA1073796%40rfd.leadboat.com
2018-11-15 16:25:30 +13:00
Thomas Munro aa55183042 Use 64 bit type for BufFileSize().
BufFileSize() can't use off_t, because it's only 32 bits wide on
some systems.  BufFile objects can have many 1GB segments so the
total size can exceed 2^31.  The only known client of the function
is parallel CREATE INDEX, which was reported to fail when building
large indexes on Windows.

Though this is technically an ABI break on platforms with a 32 bit
off_t and we might normally avoid back-patching it, the function is
brand new and thus unlikely to have been discovered by extension
authors yet, and it's fairly thoroughly broken on those platforms
anyway, so just fix it.

Defect in 9da0cc35.  Bug #15460.  Back-patch to 11, where this
function landed.

Author: Thomas Munro
Reported-by: Paul van der Linden, Pavel Oskin
Reviewed-by: Peter Geoghegan
Discussion: https://postgr.es/m/15460-b6db80de822fa0ad%40postgresql.org
Discussion: https://postgr.es/m/CAHDGBJP_GsESbTt4P3FZA8kMUKuYxjg57XHF7NRBoKnR%3DCAR-g%40mail.gmail.com
2018-11-15 13:13:57 +13:00
Tom Lane 600b04d6b5 Add a timezone-specific variant of date_trunc().
date_trunc(field, timestamptz, zone_name) performs truncation using
the named time zone as reference, rather than working in the session
time zone as is the default behavior.  It's equivalent to

date_trunc(field, timestamptz at time zone zone_name) at time zone zone_name

but it's faster, easier to type, and arguably easier to understand.

Vik Fearing and Tom Lane

Discussion: https://postgr.es/m/6249ffc4-2b22-4c1b-4e7d-7af84fedd7c6@2ndquadrant.com
2018-11-14 15:41:07 -05:00