/*-------------------------------------------------------------------------
 *
 * execUtils.c
 *	  miscellaneous executor utility routines
 *
 * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 *
 * IDENTIFICATION
 *	  src/backend/executor/execUtils.c
 *
 *-------------------------------------------------------------------------
 */

/*
 * INTERFACE ROUTINES
 *		CreateExecutorState		Create/delete executor working state
 *		FreeExecutorState
 *		CreateExprContext
 *		CreateStandaloneExprContext
 *		FreeExprContext
 *		ReScanExprContext
 *
 *		ExecAssignExprContext	Common code for plan node init routines.
 *		etc
 *
 *		ExecOpenScanRelation	Common code for scan node init routines.
 *
 *		ExecInitRangeTable		Set up executor's range-table-related data.
 *
 *		ExecGetRangeTableRelation	Fetch Relation for a rangetable entry.
 *
 *		executor_errposition	Report syntactic position of an error.
 *
 *		RegisterExprContextCallback		Register function shutdown callback
 *		UnregisterExprContextCallback	Deregister function shutdown callback
 *
 *		GetAttributeByName		Runtime extraction of columns from tuples.
 *		GetAttributeByNum
 *
 *	 NOTES
 *		This file has traditionally been the place to stick misc.
 *		executor support stuff that doesn't really go anyplace else.
 */
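/*
 * Typical caller lifecycle for the state-management routines above, sketched
 * for illustration only (snapshot setup and expression initialization are
 * omitted).  Because all working data lives in the per-query context owned
 * by the EState, a single FreeExecutorState() call releases everything:
 *
 *		EState	   *estate = CreateExecutorState();
 *		ExprContext *econtext = CreateExprContext(estate);
 *		... evaluate expressions in econtext ...
 *		FreeExecutorState(estate);		(also shuts down econtext)
 */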

#include "postgres.h"

#include "access/parallel.h"
#include "access/relscan.h"
#include "access/table.h"
#include "access/tableam.h"
#include "access/transam.h"
#include "executor/executor.h"
#include "executor/execPartition.h"
#include "jit/jit.h"
#include "mb/pg_wchar.h"
#include "miscadmin.h"
#include "nodes/nodeFuncs.h"
#include "parser/parsetree.h"
#include "partitioning/partdesc.h"
#include "storage/lmgr.h"
#include "utils/builtins.h"
#include "utils/memutils.h"
#include "utils/rel.h"
#include "utils/typcache.h"


static bool tlist_matches_tupdesc(PlanState *ps, List *tlist, int varno,
								  TupleDesc tupdesc);
static void ShutdownExprContext(ExprContext *econtext, bool isCommit);


/* ----------------------------------------------------------------
 *				 Executor state and memory management functions
 * ----------------------------------------------------------------
 */

/* ----------------
 *		CreateExecutorState
 *
 *		Create and initialize an EState node, which is the root of
 *		working storage for an entire Executor invocation.
 *
 * Principally, this creates the per-query memory context that will be
 * used to hold all working data that lives till the end of the query.
 * Note that the per-query context will become a child of the caller's
 * CurrentMemoryContext.
 * ----------------
 */
EState *
CreateExecutorState(void)
{
	EState	   *estate;
	MemoryContext qcontext;
	MemoryContext oldcontext;

	/*
	 * Create the per-query context for this Executor run.
	 */
	qcontext = AllocSetContextCreate(CurrentMemoryContext,
									 "ExecutorState",
									 ALLOCSET_DEFAULT_SIZES);

	/*
	 * Make the EState node within the per-query context.  This way, we don't
	 * need a separate pfree() operation for it at shutdown.
	 */
	oldcontext = MemoryContextSwitchTo(qcontext);

	estate = makeNode(EState);

	/*
	 * Initialize all fields of the Executor State structure
	 */
	estate->es_direction = ForwardScanDirection;
|
2013-07-23 16:58:32 +02:00
|
|
|
estate->es_snapshot = InvalidSnapshot; /* caller must initialize this */
|
2004-09-11 20:28:34 +02:00
|
|
|
estate->es_crosscheck_snapshot = InvalidSnapshot; /* no crosscheck */
|
2002-12-15 17:17:59 +01:00
|
|
|
estate->es_range_table = NIL;
|
2018-10-04 21:48:17 +02:00
|
|
|
estate->es_range_table_size = 0;
|
2018-10-04 20:03:37 +02:00
|
|
|
estate->es_relations = NULL;
|
2018-10-08 16:41:34 +02:00
|
|
|
estate->es_rowmarks = NULL;
|
Re-implement EvalPlanQual processing to improve its performance and eliminate
a lot of strange behaviors that occurred in join cases. We now identify the
"current" row for every joined relation in UPDATE, DELETE, and SELECT FOR
UPDATE/SHARE queries. If an EvalPlanQual recheck is necessary, we jam the
appropriate row into each scan node in the rechecking plan, forcing it to emit
only that one row. The former behavior could rescan the whole of each joined
relation for each recheck, which was terrible for performance, and what's much
worse could result in duplicated output tuples.
Also, the original implementation of EvalPlanQual could not re-use the recheck
execution tree --- it had to go through a full executor init and shutdown for
every row to be tested. To avoid this overhead, I've associated a special
runtime Param with each LockRows or ModifyTable plan node, and arranged to
make every scan node below such a node depend on that Param. Thus, by
signaling a change in that Param, the EPQ machinery can just rescan the
already-built test plan.
This patch also adds a prohibition on set-returning functions in the
targetlist of SELECT FOR UPDATE/SHARE. This is needed to avoid the
duplicate-output-tuple problem. It seems fairly reasonable since the
other restrictions on SELECT FOR UPDATE are meant to ensure that there
is a unique correspondence between source tuples and result tuples,
which an output SRF destroys as much as anything else does.
2009-10-26 03:26:45 +01:00
|
|
|
estate->es_plannedstmt = NULL;
|
2002-12-15 17:17:59 +01:00
|
|
|
|
2009-10-12 20:10:51 +02:00
|
|
|
estate->es_junkFilter = NULL;
|
|
|
|
|
2007-11-30 22:22:54 +01:00
|
|
|
estate->es_output_cid = (CommandId) 0;
|
|
|
|
|
2002-12-15 17:17:59 +01:00
|
|
|
estate->es_result_relations = NULL;
|
	estate->es_opened_result_relations = NIL;

	estate->es_tuple_routing_result_relations = NIL;

	estate->es_trig_target_relations = NIL;

	estate->es_param_list_info = NULL;
	estate->es_param_exec_vals = NULL;

	estate->es_queryEnv = NULL;

	estate->es_query_cxt = qcontext;

	estate->es_tupleTable = NIL;

	estate->es_processed = 0;

	estate->es_top_eflags = 0;
	estate->es_instrument = 0;
	estate->es_finished = false;

	estate->es_exprcontexts = NIL;

	estate->es_subplanstates = NIL;

	estate->es_auxmodifytables = NIL;

	estate->es_per_tuple_exprcontext = NULL;

	estate->es_sourceText = NULL;

	estate->es_use_parallel_mode = false;

	estate->es_jit_flags = 0;
	estate->es_jit = NULL;

	/*
	 * Return the executor state structure
	 */
	MemoryContextSwitchTo(oldcontext);

	return estate;
}

/* ----------------
 *		FreeExecutorState
 *
 *		Release an EState along with all remaining working storage.
 *
 * Note: this is not responsible for releasing non-memory resources, such as
 * open relations or buffer pins.  But it will shut down any still-active
 * ExprContexts within the EState and deallocate associated JITed expressions.
 * That is sufficient cleanup for situations where the EState has only been
 * used for expression evaluation, and not to run a complete Plan.
 *
 * This can be called in any memory context ... so long as it's not one
 * of the ones to be freed.
 * ----------------
 */
void
FreeExecutorState(EState *estate)
{
	/*
	 * Shut down and free any remaining ExprContexts.  We do this explicitly
	 * to ensure that any remaining shutdown callbacks get called (since they
	 * might need to release resources that aren't simply memory within the
	 * per-query memory context).
	 */
	while (estate->es_exprcontexts)
	{
		/*
		 * XXX: seems there ought to be a faster way to implement this than
		 * repeated list_delete(), no?
		 */
		FreeExprContext((ExprContext *) linitial(estate->es_exprcontexts),
						true);
		/* FreeExprContext removed the list link for us */
	}

	/* release JIT context, if allocated */
	if (estate->es_jit)
	{
		jit_release_context(estate->es_jit);
		estate->es_jit = NULL;
	}
	/* release partition directory, if allocated */
	if (estate->es_partition_directory)
	{
		DestroyPartitionDirectory(estate->es_partition_directory);
		estate->es_partition_directory = NULL;
	}

	/*
	 * Free the per-query memory context, thereby releasing all working
	 * memory, including the EState node itself.
	 */
	MemoryContextDelete(estate->es_query_cxt);
}

/*
 * Internal implementation for CreateExprContext() and CreateWorkExprContext()
 * that allows control over the AllocSet parameters.
 */
static ExprContext *
CreateExprContextInternal(EState *estate, Size minContextSize,
						  Size initBlockSize, Size maxBlockSize)
{
	ExprContext *econtext;
	MemoryContext oldcontext;

	/* Create the ExprContext node within the per-query memory context */
	oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);

	econtext = makeNode(ExprContext);

	/* Initialize fields of ExprContext */
	econtext->ecxt_scantuple = NULL;
	econtext->ecxt_innertuple = NULL;
	econtext->ecxt_outertuple = NULL;

	econtext->ecxt_per_query_memory = estate->es_query_cxt;

	/*
	 * Create working memory for expression evaluation in this context.
	 */
	econtext->ecxt_per_tuple_memory =
		AllocSetContextCreate(estate->es_query_cxt,
							  "ExprContext",
							  minContextSize,
							  initBlockSize,
							  maxBlockSize);

	econtext->ecxt_param_exec_vals = estate->es_param_exec_vals;
	econtext->ecxt_param_list_info = estate->es_param_list_info;

	econtext->ecxt_aggvalues = NULL;
	econtext->ecxt_aggnulls = NULL;

	econtext->caseValue_datum = (Datum) 0;
	econtext->caseValue_isNull = true;

	econtext->domainValue_datum = (Datum) 0;
	econtext->domainValue_isNull = true;

	econtext->ecxt_estate = estate;

	econtext->ecxt_callbacks = NULL;

	/*
	 * Link the ExprContext into the EState to ensure it is shut down when the
	 * EState is freed.  Because we use lcons(), shutdowns will occur in
	 * reverse order of creation, which may not be essential but can't hurt.
	 */
	estate->es_exprcontexts = lcons(econtext, estate->es_exprcontexts);

	MemoryContextSwitchTo(oldcontext);

	return econtext;
}

/* ----------------
 *		CreateExprContext
 *
 *		Create a context for expression evaluation within an EState.
 *
 * An executor run may require multiple ExprContexts (we usually make one
 * for each Plan node, and a separate one for per-output-tuple processing
 * such as constraint checking).  Each ExprContext has its own "per-tuple"
 * memory context.
 *
 * Note we make no assumption about the caller's memory context.
 * ----------------
 */
ExprContext *
CreateExprContext(EState *estate)
{
	return CreateExprContextInternal(estate, ALLOCSET_DEFAULT_SIZES);
}

/* ----------------
 *		CreateWorkExprContext
 *
 * Like CreateExprContext, but specifies the AllocSet sizes to be reasonable
 * in proportion to work_mem.  If the maximum block allocation size is too
 * large, it's easy to skip right past work_mem with a single allocation.
 * ----------------
 */
ExprContext *
CreateWorkExprContext(EState *estate)
{
	Size		minContextSize = ALLOCSET_DEFAULT_MINSIZE;
	Size		initBlockSize = ALLOCSET_DEFAULT_INITSIZE;
	Size		maxBlockSize = ALLOCSET_DEFAULT_MAXSIZE;

	/* choose the maxBlockSize to be no larger than 1/16 of work_mem */
	while (16 * maxBlockSize > work_mem * 1024L)
		maxBlockSize >>= 1;

	if (maxBlockSize < ALLOCSET_DEFAULT_INITSIZE)
		maxBlockSize = ALLOCSET_DEFAULT_INITSIZE;

	return CreateExprContextInternal(estate, minContextSize,
									 initBlockSize, maxBlockSize);
}
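The block-size capping above can be exercised in isolation.  Below is a minimal stand-alone sketch of the same rule: halve the maximum block size until sixteen blocks fit within work_mem, then clamp at the initial block size.  The `EXAMPLE_INITSIZE`/`EXAMPLE_MAXSIZE` constants are assumptions standing in for `ALLOCSET_DEFAULT_INITSIZE` (8kB) and `ALLOCSET_DEFAULT_MAXSIZE` (8MB); `work_mem_kb` mirrors the work_mem GUC, which is measured in kilobytes.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed stand-ins for ALLOCSET_DEFAULT_INITSIZE / _MAXSIZE (8kB / 8MB). */
#define EXAMPLE_INITSIZE ((size_t) 8 * 1024)
#define EXAMPLE_MAXSIZE  ((size_t) 8 * 1024 * 1024)

/*
 * Mirror of the capping loop in CreateWorkExprContext(): shrink the max
 * block size so that no single block exceeds 1/16 of work_mem, but never
 * go below the initial block size.
 */
static size_t
cap_block_size(long work_mem_kb)
{
	size_t		maxBlockSize = EXAMPLE_MAXSIZE;

	while (16 * maxBlockSize > (size_t) work_mem_kb * 1024)
		maxBlockSize >>= 1;

	if (maxBlockSize < EXAMPLE_INITSIZE)
		maxBlockSize = EXAMPLE_INITSIZE;

	return maxBlockSize;
}
```

With a 4MB work_mem (the stock default) this yields a 256kB cap; with the 64kB minimum it bottoms out at the 8kB initial size.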

/* ----------------
 *		CreateStandaloneExprContext
 *
 *		Create a context for standalone expression evaluation.
 *
 * An ExprContext made this way can be used for evaluation of expressions
 * that contain no Params, subplans, or Var references (it might work to
 * put tuple references into the scantuple field, but it seems unwise).
 *
 * The ExprContext struct is allocated in the caller's current memory
 * context, which also becomes its "per query" context.
 *
 * It is caller's responsibility to free the ExprContext when done,
 * or at least ensure that any shutdown callbacks have been called
 * (ReScanExprContext() is suitable).  Otherwise, non-memory resources
 * might be leaked.
 * ----------------
 */
ExprContext *
CreateStandaloneExprContext(void)
{
	ExprContext *econtext;

	/* Create the ExprContext node within the caller's memory context */
	econtext = makeNode(ExprContext);

	/* Initialize fields of ExprContext */
	econtext->ecxt_scantuple = NULL;
	econtext->ecxt_innertuple = NULL;
	econtext->ecxt_outertuple = NULL;

	econtext->ecxt_per_query_memory = CurrentMemoryContext;

	/*
	 * Create working memory for expression evaluation in this context.
	 */
	econtext->ecxt_per_tuple_memory =
		AllocSetContextCreate(CurrentMemoryContext,
							  "ExprContext",
							  ALLOCSET_DEFAULT_SIZES);

	econtext->ecxt_param_exec_vals = NULL;
	econtext->ecxt_param_list_info = NULL;

	econtext->ecxt_aggvalues = NULL;
	econtext->ecxt_aggnulls = NULL;

	econtext->caseValue_datum = (Datum) 0;
	econtext->caseValue_isNull = true;

	econtext->domainValue_datum = (Datum) 0;
	econtext->domainValue_isNull = true;

	econtext->ecxt_estate = NULL;

	econtext->ecxt_callbacks = NULL;

	return econtext;
}

/* ----------------
 *		FreeExprContext
 *
 *		Free an expression context, including calling any remaining
 *		shutdown callbacks.
 *
 * Since we free the temporary context used for expression evaluation,
 * any previously computed pass-by-reference expression result will go away!
 *
 * If isCommit is false, we are being called in error cleanup, and should
 * not call callbacks but only release memory.  (It might be better to call
 * the callbacks and pass the isCommit flag to them, but that would require
 * more invasive code changes than currently seems justified.)
 *
 * Note we make no assumption about the caller's memory context.
 * ----------------
 */
void
FreeExprContext(ExprContext *econtext, bool isCommit)
{
	EState	   *estate;

	/* Call any registered callbacks */
	ShutdownExprContext(econtext, isCommit);
	/* And clean up the memory used */
	MemoryContextDelete(econtext->ecxt_per_tuple_memory);
	/* Unlink self from owning EState, if any */
	estate = econtext->ecxt_estate;
	if (estate)
		estate->es_exprcontexts = list_delete_ptr(estate->es_exprcontexts,
												  econtext);
	/* And delete the ExprContext node */
	pfree(econtext);
}

/*
 * ReScanExprContext
 *
 *		Reset an expression context in preparation for a rescan of its
 *		plan node.  This requires calling any registered shutdown callbacks,
 *		since any partially complete set-returning-functions must be canceled.
 *
 * Note we make no assumption about the caller's memory context.
 */
void
ReScanExprContext(ExprContext *econtext)
{
	/* Call any registered callbacks */
	ShutdownExprContext(econtext, true);
	/* And clean up the memory used */
	MemoryContextReset(econtext->ecxt_per_tuple_memory);
}

/*
 * Build a per-output-tuple ExprContext for an EState.
 *
 * This is normally invoked via GetPerTupleExprContext() macro,
 * not directly.
 */
ExprContext *
MakePerTupleExprContext(EState *estate)
{
	if (estate->es_per_tuple_exprcontext == NULL)
		estate->es_per_tuple_exprcontext = CreateExprContext(estate);

	return estate->es_per_tuple_exprcontext;
}


/* ----------------------------------------------------------------
 *				 miscellaneous node-init support functions
 *
 * Note: all of these are expected to be called with CurrentMemoryContext
 * equal to the per-query memory context.
 * ----------------------------------------------------------------
 */

/* ----------------
 *		ExecAssignExprContext
 *
 *		This initializes the ps_ExprContext field.  It is only necessary
 *		to do this for nodes which use ExecQual or ExecProject
 *		because those routines require an econtext.  Other nodes that
 *		don't have to evaluate expressions don't need to do this.
 * ----------------
 */
void
ExecAssignExprContext(EState *estate, PlanState *planstate)
{
	planstate->ps_ExprContext = CreateExprContext(estate);
}

/* ----------------
 *		ExecGetResultType
 * ----------------
 */
TupleDesc
ExecGetResultType(PlanState *planstate)
{
	return planstate->ps_ResultTupleDesc;
}

/*
 * ExecGetResultSlotOps - information about node's type of result slot
 */
const TupleTableSlotOps *
ExecGetResultSlotOps(PlanState *planstate, bool *isfixed)
{
	if (planstate->resultopsset && planstate->resultops)
	{
		if (isfixed)
			*isfixed = planstate->resultopsfixed;
		return planstate->resultops;
	}

	if (isfixed)
	{
		if (planstate->resultopsset)
			*isfixed = planstate->resultopsfixed;
		else if (planstate->ps_ResultTupleSlot)
			*isfixed = TTS_FIXED(planstate->ps_ResultTupleSlot);
		else
			*isfixed = false;
	}

	if (!planstate->ps_ResultTupleSlot)
		return &TTSOpsVirtual;

	return planstate->ps_ResultTupleSlot->tts_ops;
}


/* ----------------
 *		ExecAssignProjectionInfo
 *
 * forms the projection information from the node's targetlist
 *
 * Notes for inputDesc are same as for ExecBuildProjectionInfo: supply it
 * for a relation-scan node, can pass NULL for upper-level nodes
 * ----------------
 */
void
ExecAssignProjectionInfo(PlanState *planstate,
						 TupleDesc inputDesc)
{
	planstate->ps_ProjInfo =
	ExecBuildProjectionInfo(planstate->plan->targetlist,
							planstate->ps_ExprContext,
							planstate->ps_ResultTupleSlot,
							planstate,
							inputDesc);
}
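The shift from a recursive tree walk to a flat, opcode-dispatched program (the core idea of the "Faster expression evaluation" commit message above) can be sketched with a toy evaluator. `ExprOpCode`, `ExprOp`, and `toy_eval` are invented names for this illustration, not PostgreSQL's actual ExprState machinery:

```c
#include <assert.h>

/* Toy opcode set: a flat "program" instead of a recursive expression tree. */
typedef enum { OP_CONST, OP_ADD, OP_MUL, OP_DONE } ExprOpCode;

typedef struct
{
	ExprOpCode	opcode;
	int			value;			/* used by OP_CONST only */
} ExprOp;

/*
 * Evaluate a linear sequence of steps against a tiny accumulator stack.
 * Each step is dispatched by opcode; there is no recursion, so C-stack
 * depth stays constant and the steps lie sequentially in memory.
 */
static int
toy_eval(const ExprOp *steps)
{
	int			stack[16];
	int			sp = 0;

	for (;; steps++)
	{
		switch (steps->opcode)
		{
			case OP_CONST:
				stack[sp++] = steps->value;
				break;
			case OP_ADD:
				sp--;
				stack[sp - 1] += stack[sp];
				break;
			case OP_MUL:
				sp--;
				stack[sp - 1] *= stack[sp];
				break;
			case OP_DONE:
				return stack[0];
		}
	}
}
```

A program for `(2 + 3) * 4` is then just the six steps `CONST 2, CONST 3, ADD, CONST 4, MUL, DONE`, all laid out contiguously, which is what makes the memory accesses predictable.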
/* ----------------
 *		ExecConditionalAssignProjectionInfo
 *
 * as ExecAssignProjectionInfo, but store NULL rather than building projection
 * info if no projection is required
 * ----------------
 */
void
ExecConditionalAssignProjectionInfo(PlanState *planstate, TupleDesc inputDesc,
Remove arbitrary 64K-or-so limit on rangetable size.
Up to now the size of a query's rangetable has been limited by the
constants INNER_VAR et al, which mustn't be equal to any real
rangetable index. 65000 doubtless seemed like enough for anybody,
and it still is orders of magnitude larger than the number of joins
we can realistically handle. However, we need a rangetable entry
for each child partition that is (or might be) processed by a query.
Queries with a few thousand partitions are getting more realistic,
so that the day when that limit becomes a problem is in sight,
even if it's not here yet. Hence, let's raise the limit.
Rather than just increase the values of INNER_VAR et al, this patch
adopts the approach of making them small negative values, so that
rangetables could theoretically become as long as INT_MAX.
The bulk of the patch is concerned with changing Var.varno and some
related variables from "Index" (unsigned int) to plain "int". This
is basically cosmetic, with little actual effect other than to help
debuggers print their values nicely. As such, I've only bothered
with changing places that could actually see INNER_VAR et al, which
the parser and most of the planner don't. We do have to be careful
in places that are performing less/greater comparisons on varnos,
but there are very few such places, other than the IS_SPECIAL_VARNO
macro itself.
A notable side effect of this patch is that while it used to be
possible to add INNER_VAR et al to a Bitmapset, that will now
draw an error. I don't see any likelihood that it wouldn't be a
bug to include these fake varnos in a bitmapset of real varnos,
so I think this is all to the good.
Although this touches outfuncs/readfuncs, I don't think a catversion
bump is required, since stored rules would never contain Vars
with these fake varnos.
Andrey Lepikhov and Tom Lane, after a suggestion by Peter Eisentraut
Discussion: https://postgr.es/m/43c7f2f5-1e27-27aa-8c65-c91859d15190@postgrespro.ru
2021-09-15 20:11:21 +02:00
									int varno)
{
	if (tlist_matches_tupdesc(planstate,
							  planstate->plan->targetlist,
							  varno,
							  inputDesc))
Introduce notion of different types of slots (without implementing them).
Upcoming work intends to allow pluggable ways to introduce new ways of
storing table data. Accessing those table access methods from the
executor requires TupleTableSlots to carry tuples in the native
format of such storage methods; otherwise there'll be a significant
conversion overhead.
Different access methods will require different data to store tuples
efficiently (just like virtual, minimal, heap already require fields
in TupleTableSlot). To allow that without requiring additional pointer
indirections, we want to have different structs (embedding
TupleTableSlot) for different types of slots. Thus different types of
slots are needed, which requires adapting creators of slots.
The slot that can most efficiently represent a type of tuple in an
executor node will often depend on the type of slot a child node
uses. Therefore we need to track the type of slot returned by
nodes, so parent nodes can create slots based on that.
Relatedly, JIT compilation of tuple deforming needs to know which type
of slot a certain expression refers to, so it can create an
appropriate deforming function for the type of tuple in the slot.
But not all nodes will only return one type of slot, e.g. an append
node will potentially return different types of slots for each of its
subplans.
Therefore add a function that allows querying the type of a node's
result slot, and whether it'll always be the same type (whether it's
fixed). This can be queried using ExecGetResultSlotOps().
The scan, result, inner, outer types of slots are automatically
inferred from ExecInitScanTupleSlot(), ExecInitResultSlot(), and the
left/right subtrees respectively. If that's not correct for a node,
that can be overwritten using new fields in PlanState.
This commit does not introduce the actually abstracted implementation
of different kinds of TupleTableSlots; that will be left for a followup
commit. The different types of slots introduced will, for now, still
use the same backing implementation.
While this already partially invalidates the big comment in
tuptable.h, it seems to make more sense to update it later, when the
different TupleTableSlot implementations actually exist.
Author: Ashutosh Bapat and Andres Freund, with changes by Amit Khandekar
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
2018-11-16 07:00:30 +01:00
	{
		planstate->ps_ProjInfo = NULL;
		planstate->resultopsset = planstate->scanopsset;
		planstate->resultopsfixed = planstate->scanopsfixed;
		planstate->resultops = planstate->scanops;
	}
	else
Don't require return slots for nodes without projection.
In a lot of nodes the return slot is not required. That can either be
because the node doesn't do any projection (say an Append node), or
because the node does perform projections but the projection is
optimized away because the projection would yield an identical row.
Slots aren't that small, especially for wide rows, so it's worthwhile
to avoid creating them. It's not possible to just skip creating the
slot - it's currently used to determine the tuple descriptor returned
by ExecGetResultType(). So separate the determination of the result
type from the slot creation. The work previously done internally by
ExecInitResultTupleSlotTL() can now also be done separately with
ExecInitResultTypeTL() and ExecInitResultSlot(). That way nodes that
aren't guaranteed to need a result slot can use
ExecInitResultTypeTL() to determine the result type of the node, and
when ExecAssignScanProjectionInfo() (via
ExecConditionalAssignProjectionInfo()) determines that a result slot
is needed, it is created with ExecInitResultSlot().
Besides the advantage of avoiding the creation of slots that then go
unused, this is necessary preparation for later patches around tuple
table slot abstraction. In particular separating the return descriptor
and slot is a prerequisite to allow JITing of tuple deforming with
knowledge of the underlying tuple format, and to avoid unnecessarily
creating JITed tuple deforming for virtual slots.
This commit removes a redundant argument from
ExecInitResultTupleSlotTL(). While this commit touches a lot of the
relevant lines anyway, it'd normally still not be worthwhile to cause
breakage, except that the aforementioned later commits will touch *all*
ExecInitResultTupleSlotTL() callers anyway (but this one fits worse
thematically).
Author: Andres Freund
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
2018-11-10 02:19:39 +01:00
	{
		if (!planstate->ps_ResultTupleSlot)
		{
			ExecInitResultSlot(planstate, &TTSOpsVirtual);
			planstate->resultops = &TTSOpsVirtual;
			planstate->resultopsfixed = true;
			planstate->resultopsset = true;
		}
		ExecAssignProjectionInfo(planstate, inputDesc);
	}
}

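The "different structs (embedding TupleTableSlot)" design from the slot-types commit message above is the usual C idiom of placing the base struct as the first member, so a pointer to the base can address any variant without extra indirection. A minimal sketch with invented names (`BaseSlot`, `HeapishSlot`, `slot_kind`), not the real TupleTableSlot layout:

```c
#include <assert.h>
#include <stddef.h>

/* Base "slot": a common header embedded at offset 0 of each variant. */
typedef struct BaseSlot
{
	int			kind;			/* discriminator set by the slot's creator */
} BaseSlot;

/* A variant adds its own storage after the embedded base struct. */
typedef struct HeapishSlot
{
	BaseSlot	base;			/* must be the first member */
	int			payload;		/* variant-specific data */
} HeapishSlot;

/*
 * Because the base sits at offset 0, code that only knows about BaseSlot
 * can work with any variant through a BaseSlot pointer.
 */
static int
slot_kind(const BaseSlot *slot)
{
	return slot->kind;
}
```

The key invariant is that `offsetof(HeapishSlot, base)` is zero, which is what lets generic code and variant-specific code share one pointer value.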
static bool
tlist_matches_tupdesc(PlanState *ps, List *tlist, int varno, TupleDesc tupdesc)
{
	int			numattrs = tupdesc->natts;
	int			attrno;
	ListCell   *tlist_item = list_head(tlist);

	/* Check the tlist attributes */
	for (attrno = 1; attrno <= numattrs; attrno++)
	{
		Form_pg_attribute att_tup = TupleDescAttr(tupdesc, attrno - 1);
		Var		   *var;

		if (tlist_item == NULL)
			return false;		/* tlist too short */
		var = (Var *) ((TargetEntry *) lfirst(tlist_item))->expr;
		if (!var || !IsA(var, Var))
			return false;		/* tlist item not a Var */
		/* if these Asserts fail, planner messed up */
		Assert(var->varno == varno);
		Assert(var->varlevelsup == 0);
		if (var->varattno != attrno)
			return false;		/* out of order */
		if (att_tup->attisdropped)
			return false;		/* table contains dropped columns */
		if (att_tup->atthasmissing)
			return false;		/* table contains cols with missing values */

		/*
		 * Note: usually the Var's type should match the tupdesc exactly, but
		 * in situations involving unions of columns that have different
		 * typmods, the Var may have come from above the union and hence have
		 * typmod -1.  This is a legitimate situation since the Var still
		 * describes the column, just not as exactly as the tupdesc does.  We
		 * could change the planner to prevent it, but it'd then insert
		 * projection steps just to convert from specific typmod to typmod -1,
		 * which is pretty silly.
		 */
		if (var->vartype != att_tup->atttypid ||
			(var->vartypmod != att_tup->atttypmod &&
			 var->vartypmod != -1))
			return false;		/* type mismatch */

Represent Lists as expansible arrays, not chains of cons-cells.
Originally, Postgres Lists were a more or less exact reimplementation of
Lisp lists, which consist of chains of separately-allocated cons cells,
each having a value and a next-cell link. We'd hacked that once before
(commit d0b4399d8) to add a separate List header, but the data was still
in cons cells. That makes some operations -- notably list_nth() -- O(N),
and it's bulky because of the next-cell pointers and per-cell palloc
overhead, and it's very cache-unfriendly if the cons cells end up
scattered around rather than being adjacent.
In this rewrite, we still have List headers, but the data is in a
resizable array of values, with no next-cell links. Now we need at
most two palloc's per List, and often only one, since we can allocate
some values in the same palloc call as the List header. (Of course,
extending an existing List may require repalloc's to enlarge the array.
But this involves just O(log N) allocations not O(N).)
Of course this is not without downsides. The key difficulty is that
addition or deletion of a list entry may now cause other entries to
move, which it did not before.
For example, that breaks foreach() and sister macros, which historically
used a pointer to the current cons-cell as loop state. We can repair
those macros transparently by making their actual loop state be an
integer list index; the exposed "ListCell *" pointer is no longer state
carried across loop iterations, but is just a derived value. (In
practice, modern compilers can optimize things back to having just one
loop state value, at least for simple cases with inline loop bodies.)
In principle, this is a semantics change for cases where the loop body
inserts or deletes list entries ahead of the current loop index; but
I found no such cases in the Postgres code.
The change is not at all transparent for code that doesn't use foreach()
but chases lists "by hand" using lnext(). The largest share of such
code in the backend is in loops that were maintaining "prev" and "next"
variables in addition to the current-cell pointer, in order to delete
list cells efficiently using list_delete_cell(). However, we no longer
need a previous-cell pointer to delete a list cell efficiently. Keeping
a next-cell pointer doesn't work, as explained above, but we can improve
matters by changing such code to use a regular foreach() loop and then
using the new macro foreach_delete_current() to delete the current cell.
(This macro knows how to update the associated foreach loop's state so
that no cells will be missed in the traversal.)
There remains a nontrivial risk of code assuming that a ListCell *
pointer will remain good over an operation that could now move the list
contents. To help catch such errors, list.c can be compiled with a new
define symbol DEBUG_LIST_MEMORY_USAGE that forcibly moves list contents
whenever that could possibly happen. This makes list operations
significantly more expensive so it's not normally turned on (though it
is on by default if USE_VALGRIND is on).
There are two notable API differences from the previous code:
* lnext() now requires the List's header pointer in addition to the
current cell's address.
* list_delete_cell() no longer requires a previous-cell argument.
These changes are somewhat unfortunate, but on the other hand code using
either function needs inspection to see if it is assuming anything
it shouldn't, so it's not all bad.
Programmers should be aware of these significant performance changes:
* list_nth() and related functions are now O(1); so there's no
major access-speed difference between a list and an array.
* Inserting or deleting a list element now takes time proportional to
the distance to the end of the list, due to moving the array elements.
(However, it typically *doesn't* require palloc or pfree, so except in
long lists it's probably still faster than before.) Notably, lcons()
used to be about the same cost as lappend(), but that's no longer true
if the list is long. Code that uses lcons() and list_delete_first()
to maintain a stack might usefully be rewritten to push and pop at the
end of the list rather than the beginning.
* There are now list_insert_nth...() and list_delete_nth...() functions
that add or remove a list cell identified by index. These have the
data-movement penalty explained above, but there's no search penalty.
* list_concat() and variants now copy the second list's data into
storage belonging to the first list, so there is no longer any
sharing of cells between the input lists. The second argument is
now declared "const List *" to reflect that it isn't changed.
This patch just does the minimum needed to get the new implementation
in place and fix bugs exposed by the regression tests. As suggested
by the foregoing, there's a fair amount of followup work remaining to
do.
Also, the ENABLE_LIST_COMPAT macros are finally removed in this
commit. Code using those should have been gone a dozen years ago.
Patch by me; thanks to David Rowley, Jesper Pedersen, and others
for review.
Discussion: https://postgr.es/m/11587.1550975080@sss.pgh.pa.us
2019-07-15 19:41:58 +02:00
		tlist_item = lnext(tlist, tlist_item);
	}

	if (tlist_item)
		return false;			/* tlist too long */

	return true;
}
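The `lnext(tlist, tlist_item)` call above reflects the array-based List rewrite described in the commit message: foreach-style loop state is really an integer index, so deleting the current element (which shifts later elements down) can be compensated for by stepping the index back. A miniature sketch of that design; `DemoList` and the `demo_*` names are invented for illustration, not PostgreSQL's List API:

```c
#include <assert.h>

#define MAX_DEMO_LEN 32

/* Miniature array-backed list, in the spirit of the new List layout. */
typedef struct
{
	int			length;
	int			elements[MAX_DEMO_LEN];
} DemoList;

/* Delete element n by shifting the tail down one position. */
static void
demo_delete_nth(DemoList *lst, int n)
{
	for (int j = n; j < lst->length - 1; j++)
		lst->elements[j] = lst->elements[j + 1];
	lst->length--;
}

/*
 * Loop state is an integer index, not a cell pointer; deleting the current
 * element steps the index back so the shifted-down successor isn't skipped.
 */
#define demo_foreach(i, lst)	for ((i) = 0; (i) < (lst)->length; (i)++)
#define demo_delete_current(lst, i) \
	do { demo_delete_nth((lst), (i)); (i)--; } while (0)

/* Remove all even values in place; returns how many elements survived. */
static int
demo_filter_odd(DemoList *lst)
{
	int			i;

	demo_foreach(i, lst)
	{
		if (lst->elements[i] % 2 == 0)
			demo_delete_current(lst, i);
	}
	return lst->length;
}
```

This is the same shape as the real `foreach()` plus `foreach_delete_current()`: the exposed cell pointer becomes a derived value, and only the index is carried across iterations.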
/* ----------------
 *		ExecFreeExprContext
 *
 * A plan node's ExprContext should be freed explicitly during executor
 * shutdown because there may be shutdown callbacks to call.  (Other resources
 * made by the above routines, such as projection info, don't need to be freed
 * explicitly because they're just memory in the per-query memory context.)
 *
 * However ... there is no particular need to do it during ExecEndNode,
 * because FreeExecutorState will free any remaining ExprContexts within
 * the EState.  Letting FreeExecutorState do it allows the ExprContexts to
 * be freed in reverse order of creation, rather than order of creation as
 * will happen if we delete them here, which saves O(N^2) work in the list
 * cleanup inside FreeExprContext.
 * ----------------
 */
void
ExecFreeExprContext(PlanState *planstate)
{
	/*
	 * Per above discussion, don't actually delete the ExprContext. We do
	 * unlink it from the plan node, though.
	 */
	planstate->ps_ExprContext = NULL;
}

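The reverse-order point in the comment above (freeing ExprContexts last-created-first saves O(N^2) list cleanup) can be illustrated with an array-backed toy: deleting from the end of a sequence moves nothing, while deleting from the front shifts every survivor. `CtxRegistry` and the `registry_*` names are invented for this sketch and only model the asymptotic behavior, not the executor's actual bookkeeping:

```c
#include <assert.h>

#define MAX_CTX 8

/*
 * Toy registry mirroring a list of contexts appended at creation time.
 * "moves" counts element shifts performed during deletions, standing in
 * for the cleanup work being compared.
 */
typedef struct
{
	int			nctx;
	int			ctx_ids[MAX_CTX];
	int			moves;
} CtxRegistry;

/* Delete entry n; shifting the tail down is where the cost lives. */
static void
registry_delete_at(CtxRegistry *reg, int n)
{
	for (int j = n; j < reg->nctx - 1; j++)
	{
		reg->ctx_ids[j] = reg->ctx_ids[j + 1];
		reg->moves++;
	}
	reg->nctx--;
}

/* Free everything last-created-first: no shifts at all, O(N) total. */
static int
registry_free_reverse(CtxRegistry *reg)
{
	while (reg->nctx > 0)
		registry_delete_at(reg, reg->nctx - 1);
	return reg->moves;
}

/* Free in creation order: every delete shifts the remainder, O(N^2) total. */
static int
registry_free_forward(CtxRegistry *reg)
{
	while (reg->nctx > 0)
		registry_delete_at(reg, 0);
	return reg->moves;
}
```

For four contexts, forward-order cleanup performs 3 + 2 + 1 shifts while reverse-order cleanup performs none, which is the shape of the savings the comment describes.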
/* ----------------------------------------------------------------
 *				  Scan node support
 * ----------------------------------------------------------------
 */

/* ----------------
 *		ExecAssignScanType
 * ----------------
 */
void
ExecAssignScanType(ScanState *scanstate, TupleDesc tupDesc)
{
	TupleTableSlot *slot = scanstate->ss_ScanTupleSlot;

	ExecSetSlotDescriptor(slot, tupDesc);
}

/* ----------------
 *		ExecCreateScanSlotFromOuterPlan
 * ----------------
 */
void
ExecCreateScanSlotFromOuterPlan(EState *estate,
								ScanState *scanstate,
								const TupleTableSlotOps *tts_ops)
{
	PlanState  *outerPlan;
	TupleDesc	tupDesc;

	outerPlan = outerPlanState(scanstate);
	tupDesc = ExecGetResultType(outerPlan);
Introduce notion of different types of slots (without implementing them).
Upcoming work intends to allow pluggable ways to introduce new ways of
storing table data. Accessing those table access methods from the
executor requires TupleTableSlots to be carry tuples in the native
format of such storage methods; otherwise there'll be a significant
conversion overhead.
Different access methods will require different data to store tuples
efficiently (just like virtual, minimal, heap already require fields
in TupleTableSlot). To allow that without requiring additional pointer
indirections, we want to have different structs (embedding
TupleTableSlot) for different types of slots. Thus different types of
slots are needed, which requires adapting creators of slots.
The slot that most efficiently can represent a type of tuple in an
executor node will often depend on the type of slot a child node
uses. Therefore we need to track the type of slot is returned by
nodes, so parent slots can create slots based on that.
Relatedly, JIT compilation of tuple deforming needs to know which type
of slot a certain expression refers to, so it can create an
appropriate deforming function for the type of tuple in the slot.
But not all nodes will only return one type of slot, e.g. an append
node will potentially return different types of slots for each of its
subplans.
Therefore add function that allows to query the type of a node's
result slot, and whether it'll always be the same type (whether it's
fixed). This can be queried using ExecGetResultSlotOps().
The scan, result, inner, outer type of slots are automatically
inferred from ExecInitScanTupleSlot(), ExecInitResultSlot(),
left/right subtrees respectively. If that's not correct for a node,
that can be overwritten using new fields in PlanState.
This commit does not introduce the actually abstracted implementation
of different kind of TupleTableSlots, that will be left for a followup
commit. The different types of slots introduced will, for now, still
use the same backing implementation.
While this already partially invalidates the big comment in
tuptable.h, it seems to make more sense to update it later, when the
different TupleTableSlot implementations actually exist.
Author: Ashutosh Bapat and Andres Freund, with changes by Amit Khandekar
Discussion: https://postgr.es/m/20181105210039.hh4vvi4vwoq5ba2q@alap3.anarazel.de
2018-11-16 07:00:30 +01:00
|
|
|
ExecInitScanTupleSlot(estate, scanstate, tupDesc, tts_ops);
|
1996-07-09 08:22:35 +02:00
|
|
|
}

/* ----------------------------------------------------------------
 *		ExecRelationIsTargetRelation
 *
 *		Detect whether a relation (identified by rangetable index)
 *		is one of the target relations of the query.
 *
 *		Note: This is currently no longer used in core.  We keep it around
 *		because FDWs may wish to use it to determine if their foreign table
 *		is a target relation.
 * ----------------------------------------------------------------
 */
bool
ExecRelationIsTargetRelation(EState *estate, Index scanrelid)
{
	return list_member_int(estate->es_plannedstmt->resultRelations, scanrelid);
}

/* ----------------------------------------------------------------
 *		ExecOpenScanRelation
 *
 *		Open the heap relation to be scanned by a base-level scan plan node.
 *		This should be called during the node's ExecInit routine.
 * ----------------------------------------------------------------
 */
Relation
ExecOpenScanRelation(EState *estate, Index scanrelid, int eflags)
{
	Relation	rel;

	/* Open the relation. */
	rel = ExecGetRangeTableRelation(estate, scanrelid);

	/*
	 * Complain if we're attempting a scan of an unscannable relation, except
	 * when the query won't actually be run.  This is a slightly klugy place
	 * to do this, perhaps, but there is no better place.
	 */
	if ((eflags & (EXEC_FLAG_EXPLAIN_ONLY | EXEC_FLAG_WITH_NO_DATA)) == 0 &&
		!RelationIsScannable(rel))
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("materialized view \"%s\" has not been populated",
						RelationGetRelationName(rel)),
				 errhint("Use the REFRESH MATERIALIZED VIEW command.")));

	return rel;
}

/*
 * ExecInitRangeTable
 *		Set up executor's range-table-related data
 *
 * In addition to the range table proper, initialize arrays that are
 * indexed by rangetable index.
 */
void
ExecInitRangeTable(EState *estate, List *rangeTable)
{
	/* Remember the range table List as-is */
	estate->es_range_table = rangeTable;

	/* Set size of associated arrays */
	estate->es_range_table_size = list_length(rangeTable);

	/*
	 * Allocate an array to store an open Relation corresponding to each
	 * rangetable entry, and initialize entries to NULL.  Relations are
	 * opened and stored here as needed.
	 */
	estate->es_relations = (Relation *)
		palloc0(estate->es_range_table_size * sizeof(Relation));

	/*
	 * es_result_relations and es_rowmarks are also parallel to
	 * es_range_table, but are allocated only if needed.
	 */
	estate->es_result_relations = NULL;
	estate->es_rowmarks = NULL;
}

/*
 * ExecGetRangeTableRelation
 *		Open the Relation for a range table entry, if not already done
 *
 * The Relations will be closed again in ExecEndPlan().
 */
Relation
ExecGetRangeTableRelation(EState *estate, Index rti)
{
	Relation	rel;

	Assert(rti > 0 && rti <= estate->es_range_table_size);

	rel = estate->es_relations[rti - 1];
	if (rel == NULL)
	{
		/* First time through, so open the relation */
		RangeTblEntry *rte = exec_rt_fetch(rti, estate);

		Assert(rte->rtekind == RTE_RELATION);

		if (!IsParallelWorker())
		{
			/*
			 * In a normal query, we should already have the appropriate lock,
			 * but verify that through an Assert.  Since there's already an
			 * Assert inside table_open that insists on holding some lock, it
			 * seems sufficient to check this only when rellockmode is higher
			 * than the minimum.
			 */
			rel = table_open(rte->relid, NoLock);
			Assert(rte->rellockmode == AccessShareLock ||
				   CheckRelationLockedByMe(rel, rte->rellockmode, false));
		}
		else
		{
			/*
			 * If we are a parallel worker, we need to obtain our own local
			 * lock on the relation.  This ensures sane behavior in case the
			 * parent process exits before we do.
			 */
			rel = table_open(rte->relid, rte->rellockmode);
		}

		estate->es_relations[rti - 1] = rel;
	}

	return rel;
}

/*
 * ExecInitResultRelation
 *		Open relation given by the passed-in RT index and fill its
 *		ResultRelInfo node
 *
 * Here, we also save the ResultRelInfo in estate->es_result_relations array
 * such that it can be accessed later using the RT index.
 */
void
ExecInitResultRelation(EState *estate, ResultRelInfo *resultRelInfo,
					   Index rti)
{
	Relation	resultRelationDesc;

	resultRelationDesc = ExecGetRangeTableRelation(estate, rti);
	InitResultRelInfo(resultRelInfo,
					  resultRelationDesc,
					  rti,
					  NULL,
					  estate->es_instrument);

	if (estate->es_result_relations == NULL)
		estate->es_result_relations = (ResultRelInfo **)
			palloc0(estate->es_range_table_size * sizeof(ResultRelInfo *));
	estate->es_result_relations[rti - 1] = resultRelInfo;

	/*
	 * Saving in the list allows us to avoid needlessly traversing the whole
	 * array when only a few of its entries are possibly non-NULL.
	 */
	estate->es_opened_result_relations =
		lappend(estate->es_opened_result_relations, resultRelInfo);
}

/*
 * UpdateChangedParamSet
 *		Add changed parameters to a plan node's chgParam set
 */
void
UpdateChangedParamSet(PlanState *node, Bitmapset *newchg)
{
	Bitmapset  *parmset;

	/*
	 * The plan node only depends on params listed in its allParam set. Don't
	 * include anything else into its chgParam set.
	 */
	parmset = bms_intersect(node->plan->allParam, newchg);

	/*
	 * Keep node->chgParam == NULL if there aren't actually any members; this
	 * allows the simplest possible tests in executor node files.
	 */
	if (!bms_is_empty(parmset))
		node->chgParam = bms_join(node->chgParam, parmset);
	else
		bms_free(parmset);
}

/*
 * executor_errposition
 *		Report an execution-time cursor position, if possible.
 *
 * This is expected to be used within an ereport() call.  The return value
 * is a dummy (always 0, in fact).
 *
 * The locations stored in parsetrees are byte offsets into the source string.
 * We have to convert them to 1-based character indexes for reporting to
 * clients.  (We do things this way to avoid unnecessary overhead in the
 * normal non-error case: computing character indexes would be much more
 * expensive than storing token offsets.)
 */
int
executor_errposition(EState *estate, int location)
{
	int			pos;

	/* No-op if location was not provided */
	if (location < 0)
		return 0;
	/* Can't do anything if source text is not available */
	if (estate == NULL || estate->es_sourceText == NULL)
		return 0;
	/* Convert offset to character number */
	pos = pg_mbstrlen_with_len(estate->es_sourceText, location) + 1;
	/* And pass it to the ereport mechanism */
	return errposition(pos);
}

/*
 * Register a shutdown callback in an ExprContext.
 *
 * Shutdown callbacks will be called (in reverse order of registration)
 * when the ExprContext is deleted or rescanned.  This provides a hook
 * for functions called in the context to do any cleanup needed --- it's
 * particularly useful for functions returning sets.  Note that the
 * callback will *not* be called in the event that execution is aborted
 * by an error.
 */
void
RegisterExprContextCallback(ExprContext *econtext,
							ExprContextCallbackFunction function,
							Datum arg)
{
	ExprContext_CB *ecxt_callback;

	/* Save the info in appropriate memory context */
	ecxt_callback = (ExprContext_CB *)
		MemoryContextAlloc(econtext->ecxt_per_query_memory,
						   sizeof(ExprContext_CB));

	ecxt_callback->function = function;
	ecxt_callback->arg = arg;

	/* link to front of list for appropriate execution order */
	ecxt_callback->next = econtext->ecxt_callbacks;
	econtext->ecxt_callbacks = ecxt_callback;
}

/*
 * Deregister a shutdown callback in an ExprContext.
 *
 * Any list entries matching the function and arg will be removed.
 * This can be used if it's no longer necessary to call the callback.
 */
void
UnregisterExprContextCallback(ExprContext *econtext,
							  ExprContextCallbackFunction function,
							  Datum arg)
{
	ExprContext_CB **prev_callback;
	ExprContext_CB *ecxt_callback;

	prev_callback = &econtext->ecxt_callbacks;

	while ((ecxt_callback = *prev_callback) != NULL)
	{
		if (ecxt_callback->function == function && ecxt_callback->arg == arg)
		{
			*prev_callback = ecxt_callback->next;
			pfree(ecxt_callback);
		}
		else
			prev_callback = &ecxt_callback->next;
	}
}

/*
 * Call all the shutdown callbacks registered in an ExprContext.
 *
 * The callback list is emptied (important in case this is only a rescan
 * reset, and not deletion of the ExprContext).
 *
 * If isCommit is false, just clean the callback list but don't call 'em.
 * (See comment for FreeExprContext.)
 */
static void
ShutdownExprContext(ExprContext *econtext, bool isCommit)
{
	ExprContext_CB *ecxt_callback;
	MemoryContext oldcontext;

	/* Fast path in normal case where there's nothing to do. */
	if (econtext->ecxt_callbacks == NULL)
		return;

	/*
	 * Call the callbacks in econtext's per-tuple context.  This ensures that
	 * any memory they might leak will get cleaned up.
	 */
	oldcontext = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);

	/*
	 * Call each callback function in reverse registration order.
	 */
	while ((ecxt_callback = econtext->ecxt_callbacks) != NULL)
	{
		econtext->ecxt_callbacks = ecxt_callback->next;
		if (isCommit)
			ecxt_callback->function(ecxt_callback->arg);
		pfree(ecxt_callback);
	}

	MemoryContextSwitchTo(oldcontext);
}
|
2017-03-21 14:48:04 +01:00
|
|
|
|
Faster expression evaluation and targetlist projection.
This replaces the old, recursive tree-walk based evaluation, with
non-recursive, opcode dispatch based, expression evaluation.
Projection is now implemented as part of expression evaluation.
This both leads to significant performance improvements, and makes
future just-in-time compilation of expressions easier.
The speed gains primarily come from:
- non-recursive implementation reduces stack usage / overhead
- simple sub-expressions are implemented with a single jump, without
function calls
- sharing some state between different sub-expressions
- reduced amount of indirect/hard to predict memory accesses by laying
out operation metadata sequentially; including the avoidance of
nearly all of the previously used linked lists
- more code has been moved to expression initialization, avoiding
constant re-checks at evaluation time
Future just-in-time compilation (JIT) has become easier, as
demonstrated by released patches intended to be merged in a later
release, for primarily two reasons: Firstly, due to a stricter split
between expression initialization and evaluation, less code has to be
handled by the JIT. Secondly, due to the non-recursive nature of the
generated "instructions", less performance-critical code-paths can
easily be shared between interpreted and compiled evaluation.
The new framework allows for significant future optimizations. E.g.:
- basic infrastructure for to later reduce the per executor-startup
overhead of expression evaluation, by caching state in prepared
statements. That'd be helpful in OLTPish scenarios where
initialization overhead is measurable.
- optimizing the generated "code". A number of proposals for potential
work has already been made.
- optimizing the interpreter. Similarly a number of proposals have
been made here too.
The move of logic into the expression initialization step leads to some
backward-incompatible changes:
- Function permission checks are now done during expression
initialization, whereas previously they were done during
execution. In edge cases this can lead to errors being raised that
previously wouldn't have been, e.g. a NULL array being coerced to a
different array type previously didn't perform checks.
- The set of domain constraints to be checked, is now evaluated once
during expression initialization, previously it was re-built
every time a domain check was evaluated. For normal queries this
doesn't change much, but e.g. for plpgsql functions, which caches
ExprStates, the old set could stick around longer. The behavior
around might still change.
Author: Andres Freund, with significant changes by Tom Lane,
changes by Heikki Linnakangas
Reviewed-By: Tom Lane, Heikki Linnakangas
Discussion: https://postgr.es/m/20161206034955.bh33paeralxbtluv@alap3.anarazel.de
2017-03-14 23:45:36 +01:00
|
|
|
/*
|
|
|
|
* GetAttributeByName
|
|
|
|
* GetAttributeByNum
|
|
|
|
*
|
|
|
|
* These functions return the value of the requested attribute
|
|
|
|
* out of the given tuple Datum.
|
|
|
|
* C functions which take a tuple as an argument are expected
|
|
|
|
* to use these. Ex: overpaid(EMP) might call GetAttributeByNum().
|
|
|
|
* Note: these are actually rather slow because they do a typcache
|
|
|
|
* lookup on each call.
|
|
|
|
*/
|
|
|
|
Datum
|
|
|
|
GetAttributeByName(HeapTupleHeader tuple, const char *attname, bool *isNull)
|
|
|
|
{
|
|
|
|
AttrNumber attrno;
|
|
|
|
Datum result;
|
|
|
|
Oid tupType;
|
|
|
|
int32 tupTypmod;
|
|
|
|
TupleDesc tupDesc;
|
|
|
|
HeapTupleData tmptup;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if (attname == NULL)
|
|
|
|
elog(ERROR, "invalid attribute name");
|
|
|
|
|
|
|
|
if (isNull == NULL)
|
|
|
|
elog(ERROR, "a NULL isNull pointer was passed");
|
|
|
|
|
|
|
|
if (tuple == NULL)
|
|
|
|
{
|
|
|
|
/* Kinda bogus but compatible with old behavior... */
|
|
|
|
*isNull = true;
|
|
|
|
return (Datum) 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
tupType = HeapTupleHeaderGetTypeId(tuple);
|
|
|
|
tupTypmod = HeapTupleHeaderGetTypMod(tuple);
|
|
|
|
tupDesc = lookup_rowtype_tupdesc(tupType, tupTypmod);
|
|
|
|
|
|
|
|
attrno = InvalidAttrNumber;
|
|
|
|
for (i = 0; i < tupDesc->natts; i++)
|
|
|
|
{
|
2017-08-20 20:19:07 +02:00
|
|
|
Form_pg_attribute att = TupleDescAttr(tupDesc, i);
|
|
|
|
|
|
|
|
if (namestrcmp(&(att->attname), attname) == 0)
|
Faster expression evaluation and targetlist projection.
This replaces the old, recursive tree-walk based evaluation, with
non-recursive, opcode dispatch based, expression evaluation.
Projection is now implemented as part of expression evaluation.
This both leads to significant performance improvements, and makes
future just-in-time compilation of expressions easier.
The speed gains primarily come from:
- non-recursive implementation reduces stack usage / overhead
- simple sub-expressions are implemented with a single jump, without
function calls
- sharing some state between different sub-expressions
- reduced amount of indirect/hard to predict memory accesses by laying
out operation metadata sequentially; including the avoidance of
nearly all of the previously used linked lists
- more code has been moved to expression initialization, avoiding
constant re-checks at evaluation time
Future just-in-time compilation (JIT) has become easier, as
demonstrated by released patches intended to be merged in a later
release, for primarily two reasons: Firstly, due to a stricter split
between expression initialization and evaluation, less code has to be
handled by the JIT. Secondly, due to the non-recursive nature of the
generated "instructions", less performance-critical code-paths can
easily be shared between interpreted and compiled evaluation.
The new framework allows for significant future optimizations. E.g.:
- basic infrastructure for to later reduce the per executor-startup
overhead of expression evaluation, by caching state in prepared
statements. That'd be helpful in OLTPish scenarios where
initialization overhead is measurable.
- optimizing the generated "code". A number of proposals for potential
work has already been made.
- optimizing the interpreter. Similarly a number of proposals have
been made here too.
The move of logic into the expression initialization step leads to some
backward-incompatible changes:
- Function permission checks are now done during expression
initialization, whereas previously they were done during
execution. In edge cases this can lead to errors being raised that
previously wouldn't have been, e.g. a NULL array being coerced to a
different array type previously didn't perform checks.
- The set of domain constraints to be checked, is now evaluated once
during expression initialization, previously it was re-built
every time a domain check was evaluated. For normal queries this
doesn't change much, but e.g. for plpgsql functions, which caches
ExprStates, the old set could stick around longer. The behavior
around might still change.
Author: Andres Freund, with significant changes by Tom Lane,
changes by Heikki Linnakangas
Reviewed-By: Tom Lane, Heikki Linnakangas
Discussion: https://postgr.es/m/20161206034955.bh33paeralxbtluv@alap3.anarazel.de
2017-03-14 23:45:36 +01:00
|
|
|
{
|
2017-08-20 20:19:07 +02:00
|
|
|
attrno = att->attnum;
|
Faster expression evaluation and targetlist projection.
This replaces the old, recursive tree-walk based evaluation, with
non-recursive, opcode dispatch based, expression evaluation.
Projection is now implemented as part of expression evaluation.
This both leads to significant performance improvements, and makes
future just-in-time compilation of expressions easier.
The speed gains primarily come from:
- non-recursive implementation reduces stack usage / overhead
- simple sub-expressions are implemented with a single jump, without
function calls
- sharing some state between different sub-expressions
- reduced amount of indirect/hard to predict memory accesses by laying
out operation metadata sequentially; including the avoidance of
nearly all of the previously used linked lists
- more code has been moved to expression initialization, avoiding
constant re-checks at evaluation time
Future just-in-time compilation (JIT) has become easier, as
demonstrated by released patches intended to be merged in a later
release, for primarily two reasons: Firstly, due to a stricter split
between expression initialization and evaluation, less code has to be
handled by the JIT. Secondly, due to the non-recursive nature of the
generated "instructions", less performance-critical code-paths can
easily be shared between interpreted and compiled evaluation.
The new framework allows for significant future optimizations. E.g.:
- basic infrastructure to later reduce the per-executor-startup
overhead of expression evaluation, by caching state in prepared
statements. That'd be helpful in OLTPish scenarios where
initialization overhead is measurable.
- optimizing the generated "code". A number of proposals for potential
  work have already been made.
- optimizing the interpreter. Similarly a number of proposals have
been made here too.
The move of logic into the expression initialization step leads to some
backward-incompatible changes:
- Function permission checks are now done during expression
initialization, whereas previously they were done during
execution. In edge cases this can lead to errors being raised that
previously wouldn't have been, e.g. a NULL array being coerced to a
different array type previously didn't perform checks.
- The set of domain constraints to be checked, is now evaluated once
during expression initialization, previously it was re-built
every time a domain check was evaluated. For normal queries this
doesn't change much, but e.g. for plpgsql functions, which cache
ExprStates, the old set could stick around longer. The behavior
here might still change.
Author: Andres Freund, with significant changes by Tom Lane,
changes by Heikki Linnakangas
Reviewed-By: Tom Lane, Heikki Linnakangas
Discussion: https://postgr.es/m/20161206034955.bh33paeralxbtluv@alap3.anarazel.de
2017-03-14 23:45:36 +01:00
            break;
        }
    }

    if (attrno == InvalidAttrNumber)
        elog(ERROR, "attribute \"%s\" does not exist", attname);

    /*
     * heap_getattr needs a HeapTuple not a bare HeapTupleHeader.  We set all
     * the fields in the struct just in case user tries to inspect system
     * columns.
     */
    tmptup.t_len = HeapTupleHeaderGetDatumLength(tuple);
    ItemPointerSetInvalid(&(tmptup.t_self));
    tmptup.t_tableOid = InvalidOid;
    tmptup.t_data = tuple;

    result = heap_getattr(&tmptup,
                          attrno,
                          tupDesc,
                          isNull);

    ReleaseTupleDesc(tupDesc);

    return result;
}

Datum
GetAttributeByNum(HeapTupleHeader tuple,
                  AttrNumber attrno,
                  bool *isNull)
{
    Datum       result;
    Oid         tupType;
    int32       tupTypmod;
    TupleDesc   tupDesc;
    HeapTupleData tmptup;

    if (!AttributeNumberIsValid(attrno))
        elog(ERROR, "invalid attribute number %d", attrno);

    if (isNull == NULL)
        elog(ERROR, "a NULL isNull pointer was passed");

    if (tuple == NULL)
    {
        /* Kinda bogus but compatible with old behavior... */
        *isNull = true;
        return (Datum) 0;
    }

    tupType = HeapTupleHeaderGetTypeId(tuple);
    tupTypmod = HeapTupleHeaderGetTypMod(tuple);
    tupDesc = lookup_rowtype_tupdesc(tupType, tupTypmod);

    /*
     * heap_getattr needs a HeapTuple not a bare HeapTupleHeader.  We set all
     * the fields in the struct just in case user tries to inspect system
     * columns.
     */
    tmptup.t_len = HeapTupleHeaderGetDatumLength(tuple);
    ItemPointerSetInvalid(&(tmptup.t_self));
    tmptup.t_tableOid = InvalidOid;
    tmptup.t_data = tuple;

    result = heap_getattr(&tmptup,
                          attrno,
                          tupDesc,
                          isNull);

    ReleaseTupleDesc(tupDesc);

    return result;
}

/*
 * Number of items in a tlist (including any resjunk items!)
 */
int
ExecTargetListLength(List *targetlist)
{
    /* This used to be more complex, but fjoins are dead */
    return list_length(targetlist);
}

/*
 * Number of items in a tlist, not including any resjunk items
 */
int
ExecCleanTargetListLength(List *targetlist)
{
    int         len = 0;
    ListCell   *tl;

    foreach(tl, targetlist)
    {
Improve castNode notation by introducing list-extraction-specific variants.
This extends the castNode() notation introduced by commit 5bcab1114 to
provide, in one step, extraction of a list cell's pointer and coercion to
a concrete node type. For example, "lfirst_node(Foo, lc)" is the same
as "castNode(Foo, lfirst(lc))". Almost half of the uses of castNode
that have appeared so far include a list extraction call, so this is
pretty widely useful, and it saves a few more keystrokes compared to the
old way.
As with the previous patch, back-patch the addition of these macros to
pg_list.h, so that the notation will be available when back-patching.
Patch by me, after an idea of Andrew Gierth's.
Discussion: https://postgr.es/m/14197.1491841216@sss.pgh.pa.us
2017-04-10 19:51:29 +02:00
        TargetEntry *curTle = lfirst_node(TargetEntry, tl);
        if (!curTle->resjunk)
            len++;
    }
    return len;
}
/*
 * Return a relInfo's tuple slot for a trigger's OLD tuples.
 */
TupleTableSlot *
ExecGetTriggerOldSlot(EState *estate, ResultRelInfo *relInfo)
{
    if (relInfo->ri_TrigOldSlot == NULL)
    {
        Relation    rel = relInfo->ri_RelationDesc;
        MemoryContext oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);

        relInfo->ri_TrigOldSlot =
            ExecInitExtraTupleSlot(estate,
                                   RelationGetDescr(rel),
tableam: Add and use scan APIs.
To allow table accesses to not be directly dependent on heap, several
new abstractions are needed. Specifically:
1) Heap scans need to be generalized into table scans. Do this by
introducing TableScanDesc, which will be the "base class" for
individual AMs. This contains the AM independent fields from
HeapScanDesc.
The previous heap_{beginscan,rescan,endscan} et al. have been
replaced with a table_ version.
There's no direct replacement for heap_getnext(), as that returned
a HeapTuple, which is undesirable for other AMs. Instead there's
table_scan_getnextslot(). But note that heap_getnext() lives on,
it's still used widely to access catalog tables.
This is achieved by new scan_begin, scan_end, scan_rescan,
scan_getnextslot callbacks.
2) The portion of parallel scans that's shared between backends need
to be able to do so without the user doing per-AM work. To achieve
that new parallelscan_{estimate, initialize, reinitialize}
callbacks are introduced, which operate on a new
ParallelTableScanDesc, which again can be subclassed by AMs.
As it is likely that several AMs are going to be block oriented,
block oriented callbacks that can be shared between such AMs are
provided and used by heap. table_block_parallelscan_{estimate,
initialize, reinitialize} as callbacks, and
table_block_parallelscan_{nextpage, init} for use in AMs. These
operate on a ParallelBlockTableScanDesc.
3) Index scans need to be able to access tables to return a tuple, and
there needs to be state across individual accesses to the heap to
store state like buffers. That's now handled by introducing a
sort-of-scan IndexFetchTable, which again is intended to be
subclassed by individual AMs (for heap IndexFetchHeap).
The relevant callbacks for an AM are index_fetch_{end, begin,
reset} to create the necessary state, and index_fetch_tuple to
retrieve an indexed tuple. Note that index_fetch_tuple
implementations need to be smarter than just blindly fetching the
tuples for AMs that have optimizations similar to heap's HOT - the
currently alive tuple in the update chain needs to be fetched if
appropriate.
Similar to table_scan_getnextslot(), it's undesirable to continue
to return HeapTuples. Thus index_fetch_heap (might want to rename
that later) now accepts a slot as an argument. Core code doesn't
have a lot of call sites performing index scans without going
through the systable_* API (in contrast to loads of heap_getnext
calls and working directly with HeapTuples).
Index scans now store the result of a search in
IndexScanDesc->xs_heaptid, rather than xs_ctup->t_self. As the
target is not generally a HeapTuple anymore that seems cleaner.
To be able to sensibly adapt code to use the above, two further
callbacks have been introduced:
a) slot_callbacks returns a TupleTableSlotOps* suitable for creating
slots capable of holding a tuple of the AMs
type. table_slot_callbacks() and table_slot_create() are based
upon that, but have additional logic to deal with views, foreign
tables, etc.
While this change could have been done separately, nearly all the
call sites that needed to be adapted for the rest of this commit
would also have needed to be adapted for
table_slot_callbacks(), making separation not worthwhile.
b) tuple_satisfies_snapshot checks whether the tuple in a slot is
currently visible according to a snapshot. That's required as a few
places now don't have a buffer + HeapTuple around, but a
slot (which in heap's case internally has that information).
Additionally a few infrastructure changes were needed:
I) SysScanDesc, as used by systable_{beginscan, getnext} et al. now
internally uses a slot to keep track of tuples. While
systable_getnext() still returns HeapTuples, and will so for the
foreseeable future, the index API (see 1) above) now only deals with
slots.
The remainder, and largest part, of this commit is then adjusting all
scans in postgres to use the new APIs.
Author: Andres Freund, Haribabu Kommi, Alvaro Herrera
Discussion:
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
https://postgr.es/m/20160812231527.GA690404@alvherre.pgsql
2019-03-11 20:46:41 +01:00
                                   table_slot_callbacks(rel));

        MemoryContextSwitchTo(oldcontext);
    }

    return relInfo->ri_TrigOldSlot;
}

/*
 * Return a relInfo's tuple slot for a trigger's NEW tuples.
 */
TupleTableSlot *
ExecGetTriggerNewSlot(EState *estate, ResultRelInfo *relInfo)
{
    if (relInfo->ri_TrigNewSlot == NULL)
    {
        Relation    rel = relInfo->ri_RelationDesc;
        MemoryContext oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);

        relInfo->ri_TrigNewSlot =
            ExecInitExtraTupleSlot(estate,
                                   RelationGetDescr(rel),
                                   table_slot_callbacks(rel));

        MemoryContextSwitchTo(oldcontext);
    }

    return relInfo->ri_TrigNewSlot;
}

/*
 * Return a relInfo's tuple slot for processing returning tuples.
 */
TupleTableSlot *
ExecGetReturningSlot(EState *estate, ResultRelInfo *relInfo)
{
    if (relInfo->ri_ReturningSlot == NULL)
    {
        Relation    rel = relInfo->ri_RelationDesc;
        MemoryContext oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);

        relInfo->ri_ReturningSlot =
            ExecInitExtraTupleSlot(estate,
                                   RelationGetDescr(rel),
                                   table_slot_callbacks(rel));

        MemoryContextSwitchTo(oldcontext);
    }

    return relInfo->ri_ReturningSlot;
}

Fix permission checks on constraint violation errors on partitions.
If a cross-partition UPDATE violates a constraint on the target partition,
and the columns in the new partition are in different physical order than
in the parent, the error message can reveal columns that the user does not
have SELECT permission on. A similar bug was fixed earlier in commit
804b6b6db4.
The cause of the bug is that the callers of the
ExecBuildSlotValueDescription() function got confused when constructing
the list of modified columns. If the tuple was routed from a parent, we
converted the tuple to the parent's format, but the list of modified
columns was grabbed directly from the child's RTE entry.
ExecUpdateLockMode() had a similar issue. That led to confusion about which
columns are key columns, leading to wrong tuple lock being taken on tables
referenced by foreign keys, when a row is updated with INSERT ON CONFLICT
UPDATE. A new isolation test is added for that corner case.
With this patch, the ri_RangeTableIndex field is no longer set for
partitions that don't have an entry in the range table. Previously, it was
set to the RTE entry of the parent relation, but that was confusing.
NOTE: This modifies the ResultRelInfo struct, replacing the
ri_PartitionRoot field with ri_RootResultRelInfo. That's a bit risky to
backpatch, because it breaks any extensions accessing the field. The
change that ri_RangeTableIndex is not set for partitions could potentially
break extensions, too. The ResultRelInfos are visible to FDWs at least,
and this patch required small changes to postgres_fdw. Nevertheless, this
seems like the least bad option. I don't think these fields are widely used
in extensions; I don't think there are FDWs out there that use the FDW
"direct update" API, other than postgres_fdw. If there is, you will get a
compilation error, so hopefully it is caught quickly.
Backpatch to 11, where support for both cross-partition UPDATEs, and unique
indexes on partitioned tables, were added.
Reviewed-by: Amit Langote
Security: CVE-2021-3393
2021-02-08 10:01:51 +01:00
Postpone some stuff out of ExecInitModifyTable.
Arrange to do some things on-demand, rather than immediately during
executor startup, because there's a fair chance of never having to do
them at all:
* Don't open result relations' indexes until needed.
* Don't initialize partition tuple routing, nor the child-to-root
tuple conversion map, until needed.
This wins in UPDATEs on partitioned tables when only some of the
partitions will actually receive updates; with larger partition
counts the savings is quite noticeable. Also, we can remove some
sketchy heuristics in ExecInitModifyTable about whether to set up
tuple routing.
Also, remove execPartition.c's private hash table tracking which
partitions were already opened by the ModifyTable node. Instead
use the hash added to ModifyTable itself by commit 86dc90056.
To allow lazy computation of the conversion maps, we now set
ri_RootResultRelInfo in all child ResultRelInfos. We formerly set it
only in some, not terribly well-defined, cases. This has user-visible
side effects in that now more error messages refer to the root
relation instead of some partition (and provide error data in the
root's column order, too). It looks to me like this is a strict
improvement in consistency, so I don't have a problem with the
output changes visible in this commit.
Extracted from a larger patch, which seemed to me to be too messy
to push in one commit.
Amit Langote, reviewed at different times by Heikki Linnakangas and
myself
Discussion: https://postgr.es/m/CA+HiwqG7ZruBmmih3wPsBZ4s0H2EhywrnXEduckY5Hr3fWzPWA@mail.gmail.com
2021-04-06 21:56:55 +02:00
/*
 * Return the map needed to convert given child result relation's tuples to
 * the rowtype of the query's main target ("root") relation.  Note that a
 * NULL result is valid and means that no conversion is needed.
 */
TupleConversionMap *
ExecGetChildToRootMap(ResultRelInfo *resultRelInfo)
{
    /* If we didn't already do so, compute the map for this child. */
    if (!resultRelInfo->ri_ChildToRootMapValid)
    {
        ResultRelInfo *rootRelInfo = resultRelInfo->ri_RootResultRelInfo;

        if (rootRelInfo)
            resultRelInfo->ri_ChildToRootMap =
                convert_tuples_by_name(RelationGetDescr(resultRelInfo->ri_RelationDesc),
                                       RelationGetDescr(rootRelInfo->ri_RelationDesc));
        else                    /* this isn't a child result rel */
            resultRelInfo->ri_ChildToRootMap = NULL;

        resultRelInfo->ri_ChildToRootMapValid = true;
    }

    return resultRelInfo->ri_ChildToRootMap;
}

/* Return a bitmap representing columns being inserted */
Bitmapset *
ExecGetInsertedCols(ResultRelInfo *relinfo, EState *estate)
{
    /*
     * The columns are stored in the range table entry.  If this ResultRelInfo
     * represents a partition routing target, and doesn't have an entry of its
     * own in the range table, fetch the parent's RTE and map the columns to
     * the order they are in the partition.
     */
    if (relinfo->ri_RangeTableIndex != 0)
    {
        RangeTblEntry *rte = exec_rt_fetch(relinfo->ri_RangeTableIndex, estate);

        return rte->insertedCols;
    }
    else if (relinfo->ri_RootResultRelInfo)
Fix permission checks on constraint violation errors on partitions.
If a cross-partition UPDATE violates a constraint on the target partition,
and the columns in the new partition are in different physical order than
in the parent, the error message can reveal columns that the user does not
have SELECT permission on. A similar bug was fixed earlier in commit
804b6b6db4.
The cause of the bug is that the callers of the
ExecBuildSlotValueDescription() function got confused when constructing
the list of modified columns. If the tuple was routed from a parent, we
converted the tuple to the parent's format, but the list of modified
columns was grabbed directly from the child's RTE entry.
ExecUpdateLockMode() had a similar issue. That lead to confusion on which
columns are key columns, leading to wrong tuple lock being taken on tables
referenced by foreign keys, when a row is updated with INSERT ON CONFLICT
UPDATE. A new isolation test is added for that corner case.
With this patch, the ri_RangeTableIndex field is no longer set for
partitions that don't have an entry in the range table. Previously, it was
set to the RTE entry of the parent relation, but that was confusing.
NOTE: This modifies the ResultRelInfo struct, replacing the
ri_PartitionRoot field with ri_RootResultRelInfo. That's a bit risky to
backpatch, because it breaks any extensions accessing the field. The
change that ri_RangeTableIndex is not set for partitions could potentially
break extensions, too. The ResultRelInfos are visible to FDWs at least,
and this patch required small changes to postgres_fdw. Nevertheless, this
seem like the least bad option. I don't think these fields widely used in
extensions; I don't think there are FDWs out there that uses the FDW
"direct update" API, other than postgres_fdw. If there is, you will get a
compilation error, so hopefully it is caught quickly.
Backpatch to 11, where support for both cross-partition UPDATEs, and unique
indexes on partitioned tables, were added.
Reviewed-by: Amit Langote
Security: CVE-2021-3393
2021-02-08 10:01:51 +01:00
|
|
|
{
|
|
|
|
ResultRelInfo *rootRelInfo = relinfo->ri_RootResultRelInfo;
|
|
|
|
RangeTblEntry *rte = exec_rt_fetch(rootRelInfo->ri_RangeTableIndex, estate);
|
|
|
|
|
|
|
|
if (relinfo->ri_RootToPartitionMap != NULL)
|
|
|
|
return execute_attr_map_cols(relinfo->ri_RootToPartitionMap->attrMap,
|
|
|
|
rte->insertedCols);
|
|
|
|
else
|
|
|
|
return rte->insertedCols;
|
|
|
|
}
|
2021-02-15 08:28:08 +01:00
|
|
|
else
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* The relation isn't in the range table and it isn't a partition
|
|
|
|
* routing target. This ResultRelInfo must've been created only for
|
|
|
|
* firing triggers and the relation is not being inserted into. (See
|
|
|
|
* ExecGetTriggerResultRel.)
|
|
|
|
*/
|
|
|
|
return NULL;
|
|
|
|
}
|
Fix permission checks on constraint violation errors on partitions.
If a cross-partition UPDATE violates a constraint on the target partition,
and the columns in the new partition are in different physical order than
in the parent, the error message can reveal columns that the user does not
have SELECT permission on. A similar bug was fixed earlier in commit
804b6b6db4.
The cause of the bug is that the callers of the
ExecBuildSlotValueDescription() function got confused when constructing
the list of modified columns. If the tuple was routed from a parent, we
converted the tuple to the parent's format, but the list of modified
columns was grabbed directly from the child's RTE entry.
ExecUpdateLockMode() had a similar issue. That lead to confusion on which
columns are key columns, leading to wrong tuple lock being taken on tables
referenced by foreign keys, when a row is updated with INSERT ON CONFLICT
UPDATE. A new isolation test is added for that corner case.
With this patch, the ri_RangeTableIndex field is no longer set for
partitions that don't have an entry in the range table. Previously, it was
set to the RTE entry of the parent relation, but that was confusing.
NOTE: This modifies the ResultRelInfo struct, replacing the
ri_PartitionRoot field with ri_RootResultRelInfo. That's a bit risky to
backpatch, because it breaks any extensions accessing the field. The
change that ri_RangeTableIndex is not set for partitions could potentially
break extensions, too. The ResultRelInfos are visible to FDWs at least,
and this patch required small changes to postgres_fdw. Nevertheless, this
seem like the least bad option. I don't think these fields widely used in
extensions; I don't think there are FDWs out there that uses the FDW
"direct update" API, other than postgres_fdw. If there is, you will get a
compilation error, so hopefully it is caught quickly.
Backpatch to 11, where support for both cross-partition UPDATEs, and unique
indexes on partitioned tables, were added.
Reviewed-by: Amit Langote
Security: CVE-2021-3393
2021-02-08 10:01:51 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Return a bitmap representing columns being updated */
Bitmapset *
ExecGetUpdatedCols(ResultRelInfo *relinfo, EState *estate)
{
	/* see ExecGetInsertedCols() */
	if (relinfo->ri_RangeTableIndex != 0)
	{
		RangeTblEntry *rte = exec_rt_fetch(relinfo->ri_RangeTableIndex, estate);

		return rte->updatedCols;
	}
	else if (relinfo->ri_RootResultRelInfo)
	{
		ResultRelInfo *rootRelInfo = relinfo->ri_RootResultRelInfo;
		RangeTblEntry *rte = exec_rt_fetch(rootRelInfo->ri_RangeTableIndex, estate);

		if (relinfo->ri_RootToPartitionMap != NULL)
			return execute_attr_map_cols(relinfo->ri_RootToPartitionMap->attrMap,
										 rte->updatedCols);
		else
			return rte->updatedCols;
	}
	else
		return NULL;
}

/* Return a bitmap representing generated columns being updated */
Bitmapset *
ExecGetExtraUpdatedCols(ResultRelInfo *relinfo, EState *estate)
{
	/* see ExecGetInsertedCols() */
	if (relinfo->ri_RangeTableIndex != 0)
	{
		RangeTblEntry *rte = exec_rt_fetch(relinfo->ri_RangeTableIndex, estate);

		return rte->extraUpdatedCols;
	}
	else if (relinfo->ri_RootResultRelInfo)
	{
		ResultRelInfo *rootRelInfo = relinfo->ri_RootResultRelInfo;
		RangeTblEntry *rte = exec_rt_fetch(rootRelInfo->ri_RangeTableIndex, estate);

		if (relinfo->ri_RootToPartitionMap != NULL)
			return execute_attr_map_cols(relinfo->ri_RootToPartitionMap->attrMap,
										 rte->extraUpdatedCols);
		else
			return rte->extraUpdatedCols;
	}
	else
		return NULL;
}

/* Return columns being updated, including generated columns */
Bitmapset *
ExecGetAllUpdatedCols(ResultRelInfo *relinfo, EState *estate)
{
	return bms_union(ExecGetUpdatedCols(relinfo, estate),
					 ExecGetExtraUpdatedCols(relinfo, estate));
}